When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
Use the new sk_dst_confirm() helper to propagate the
indication from received packets to sock_confirm_neigh().
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effee8 ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bedf3 ("ipv4: Cache output routes in fib_info nexthops.")
Tested-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new transport flag to allow sockets to confirm neighbour.
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
The flag is propagated from transport to every packet.
It is reset when cached dst is reset.
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effee8 ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bedf3 ("ipv4: Cache output routes in fib_info nexthops.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new skbuff flag to allow protocols to confirm neighbour.
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
Add sock_confirm_neigh() helper to confirm the neighbour and
use it for IPv4, IPv6 and VRF before dst_neigh_output.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new sock flag to allow sockets to confirm neighbour.
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
As not all call paths lock the socket use full word for
the flag.
Add sk_dst_confirm as replacement for dst_confirm when
called for received packets.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Popov reported that an application may trigger a BUG_ON in
sctp_wait_for_sndbuf if the socket tx buffer is full, a thread is
waiting on it to queue more data and meanwhile another thread peels off
the association being used by the first thread.
This patch replaces the BUG_ON call with a proper error handling. It
will return -EPIPE to the original sendmsg call, similarly to what would
have been done if the association wasn't found in the first place.
Acked-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
net/ipv6/seg6_iptunnel.c:58:5: warning:
symbol 'nla_put_srh' was not declared. Should it be static?
net/ipv6/seg6_iptunnel.c:238:5: warning:
symbol 'seg6_input' was not declared. Should it be static?
net/ipv6/seg6_iptunnel.c:254:5: warning:
symbol 'seg6_output' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry reported that UDP sockets being destroyed would trigger the
WARN_ON(atomic_read(&sk->sk_rmem_alloc)); in inet_sock_destruct()
It turns out we do not properly destroy skb(s) that have wrong UDP
checksum.
Thanks again to syzkaller team.
Fixes : 7c13f97ffd ("udp: do fwd memory scheduling on dequeue")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers to use the new DSA API with platform data. Most of the
code in net/dsa/dsa2.c does not rely so much on device_nodes and can get
the same information from platform_data instead.
We purposely do not support distributed configurations with platform
data, so drivers should be providing a pointer to a 'struct
dsa_chip_data' structure if they wish to communicate per-port layout.
Multiple CPUs port could potentially be supported and dsa_chip_data is
extended to receive up to one reference to an upstream network device
per port described by a dsa_chip_data structure.
dsa_dev_to_net_device() increments the network device's reference count,
so we intentionally call dev_put() to be consistent with the DT-enabled
path, until we have a generic notifier based solution.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for using this function in net/dsa/dsa2.c, rename the function
to make its scope DSA specific, and export it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somehow these files were never present or lost, but the code
is there and they seem somewhat useful, so add them back.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Writing once per jiffy is enough to limit the bridge's false sharing.
After this change the bridge doesn't show up in the local load HitM stats.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fdb's used and updated fields are written to on every packet forward and
packet receive respectively. Thus if we are receiving packets from a
particular fdb, they'll cause false-sharing with everyone who has looked
it up (even if it didn't match, since mac/vid share cache line!). The
"used" field is even worse since it is updated on every packet forward
to that fdb, thus the standard config where X ports use a single gateway
results in 100% fdb false-sharing. Note that this patch does not prevent
the last scenario, but it makes it better for other bridge participants
which are not using that fdb (and are only doing lookups over it).
The point is with this move we make sure that only communicating parties
get the false-sharing, in a later patch we'll show how to avoid that too.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the fdb garbage collector to a workqueue which fires at least 10
milliseconds apart and cleans chain by chain allowing for other tasks
to run in the meantime. When having thousands of fdbs the system is much
more responsive. Most importantly remove the need to check if the
matched entry has expired in __br_fdb_get that causes false-sharing and
is completely unnecessary if we cleanup entries, at worst we'll get 10ms
of traffic for that entry before it gets deleted.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move around net_bridge so the vlan fields are in the beginning since
they're checked on every packet even if vlan filtering is disabled.
For the port move flags & vlan group to the beginning, so they're in the
same cache line with the port's state (both flags and state are checked
on each packet).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Splicing from TCP socket is vulnerable when a packet with URG flag is
received and stored into receive queue.
__tcp_splice_read() returns 0, and sk_wait_data() immediately
returns since there is the problematic skb in queue.
This is a nice way to burn cpu (aka infinite loop) and trigger
soft lockups.
Again, this gem was found by syzkaller tool.
Fixes: 9c55e01c0c ("[TCP]: Splice receive support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
A slave device will now notify the switch fabric once its port is
bridged or unbridged, instead of calling directly its switch operations.
This code allows propagating cross-chip bridging events in the fabric.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a notifier block per DSA switch, registered against a notifier head
in the switch fabric they belong to.
This infrastructure will allow to propagate fabric-wide events such as
port bridging, VLAN configuration, etc. If a DSA switch driver cares
about cross-chip configuration, such events can be caught.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The scope of the functions inside net/dsa/slave.c must be the slave
net_device pointer. Change to state setter helper accordingly to
simplify callers.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an error is returned during the bridging of a port in a
NETDEV_CHANGEUPPER event, net/core/dev.c rolls back the operation.
Be consistent and unassign dp->bridge_dev when this happens.
In the meantime, add comments to document this behavior.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify the code handling the slave netdevice notifier call by
providing a dsa_slave_changeupper helper for NETDEV_CHANGEUPPER, and so
on (only this event is supported at the moment.)
Return NOTIFY_DONE when we did not care about an event, and NOTIFY_OK
when we were concerned but no error occurred, as the API suggests.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the netdevice notifier block register code in slave.c and provide
helpers for dsa.c to register and unregister it.
At the same time, check for errors since (un)register_netdevice_notifier
may fail.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes use of is_vlan_dev() function instead of flag
comparison which is exactly done by is_vlan_dev() helper function.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jon Maxwell <jmaxwell37@gmail.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to check if asoc->peer.prsctp_capable is set before
processing fwd tsn chunk, if not, it will return an ERROR to the
peer, just as rfc3758 section 3.3.1 demands.
Reported-by: Julian Cordes <julian.cordes@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When for instance a mobile Linux device roams from one access point to
another with both APs sharing the same broadcast domain and a
multicast snooping switch in between:
1) (c) <~~~> (AP1) <--[SSW]--> (AP2)
2) (AP1) <--[SSW]--> (AP2) <~~~> (c)
Then currently IPv6 multicast packets will get lost for (c) until an
MLD Querier sends its next query message. The packet loss occurs
because upon roaming the Linux host so far stayed silent regarding
MLD and the snooping switch will therefore be unaware of the
multicast topology change for a while.
This patch fixes this by always resending MLD reports when an interface
change happens, for instance from NO-CARRIER to CARRIER state.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 18bfb924f0 ("net: introduce default neigh_construct/destroy
ndo calls for L2 upper devices") we added these ndos to stacked devices
such as team and bond, so that calls will be propagated to mlxsw.
However, previous commit removed the reliance on these ndos and no new
users of these ndos have appeared since above mentioned commit. We can
therefore safely remove this dead code.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* fix FILS AEAD cipher usage to use the correct AAD vectors
and to use synchronous algorithms
* fix using mesh HT operation data from userspace
* fix adding mesh vendor elements to beacons & plink frames
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYmCBmAAoJEGt7eEactAAd5BAP/2oPaRDgJv0ByqoFPh0pzKqx
RwoXOW9xqtp+wWFA8hPTe2niVtNpexwo4ZQ2I2hkjeomFfbw0gwklBFQQ0Vbq5b9
6UtClEBHp/xW5vdvooBwMAcUBJQMM25wIFt2jwz9xRIUxjiOisZBIp7avLTtoQKC
+hsNJOWOmyeJYLXdeJVaJM953dANCKdzL590JX3f6tbr8LPpszrg8TmVLJWklTYQ
Cm2latv0GezxL/d+KcSWbNoX+X+d5D0gVZXHmp5UFWX6yT0FMkNmSURmkHEfuiuD
z11befXgvXAr3l7cxE/TEtrNCh57pwDoPtJmBqJ9G68aURK8iVb4XB/ZEB8hEvHi
EchMXompYU/xPiGVbkb/wOFXlBY+xc85uoEwkSL1CZs4eX6r6JawrHG7RUcTKFsv
V2zAQU0pDO29OcprHbjD+rnjrG2qtZ/pDKO7X5+eIgHvEzwaqZY3yd1YmJK52d67
J4slSS/jislTg+rbhFi8NrCONuRlp5rixjmHINUWCsilojrKeDh9thMYrVmXWZjT
qjoOojMmiGH7ekhvSVDciRxoLgP9aIShuIvbub9uOPQAPXsVf3KHquSiY9JOpJI8
PpY3hPWQS6j2r5Q2pZu/LM345r0rcj5At1BzCzGqcfKxRUH7rbFDQQ1D3Moehzho
Gqrkv2/p4FAAGFG+4bJ6
=ZzHl
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2017-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A few simple fixes:
* fix FILS AEAD cipher usage to use the correct AAD vectors
and to use synchronous algorithms
* fix using mesh HT operation data from userspace
* fix adding mesh vendor elements to beacons & plink frames
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry reported use-after-free in ip6_datagram_recv_specific_ctl()
A similar bug was fixed in commit 8ce48623f0 ("ipv6: tcp: restore
IP6CB for pktoptions skbs"), but I missed another spot.
tcp_v6_syn_recv_sock() can indeed set np->pktoptions from ireq->pktopts
Fixes: 971f10eca1 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A previous change to fix checks for NL80211_MESHCONF_HT_OPMODE
missed setting the flag when replacing FILL_IN_MESH_PARAM_IF_SET
with checking codes. This results in dropping the received HT
operation value when called by nl80211_update_mesh_config(). Fix
this by setting the flag properly.
Fixes: 9757235f45 ("nl80211: correct checks for NL80211_MESHCONF_HT_OPMODE value")
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
[rewrite commit message to use Fixes: line]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The function ieee80211_ie_split_vendor doesn't return 0 on errors. Instead
it returns any offset < ielen when WLAN_EID_VENDOR_SPECIFIC is found. The
return value in mesh_add_vendor_ies must therefore be checked against
ifmsh->ie_len and not 0. Otherwise all ifmsh->ie starting with
WLAN_EID_VENDOR_SPECIFIC will be rejected.
Fixes: 082ebb0c25 ("mac80211: fix mesh beacon format")
Signed-off-by: Thorsten Horstmann <thorsten@defutech.de>
Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fit.fraunhofer.de>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
[sven@narfation.org: Add commit message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The skcipher could have been of the async variant which may return from
skcipher_encrypt() with -EINPROGRESS after having queued the request.
The FILS AEAD implementation here does not have code for dealing with
that possibility, so allocate a sync cipher explicitly to avoid
potential issues with hardware accelerators.
This is based on the patch sent out by Ard.
Fixes: 39404feee6 ("mac80211: FILS AEAD protection for station mode association frames")
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Incorrect num_elem parameter value (1 vs. 5) was used in the
aes_siv_encrypt() call. This resulted in only the first one of the five
AAD vectors to SIV getting included in calculation. This does not
protect all the contents correctly and would not interoperate with a
standard compliant implementation.
Fix this by using the correct number. A matching fix is needed in the AP
side (hostapd) to get FILS authentication working properly.
Fixes: 39404feee6 ("mac80211: FILS AEAD protection for station mode association frames")
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Andrey Konovalov reported out of bound accesses in ip6gre_err()
If GRE flags contains GRE_KEY, the following expression
*(((__be32 *)p) + (grehlen / 4) - 1)
accesses data ~40 bytes after the expected point, since
grehlen includes the size of IPv6 headers.
Let's use a "struct gre_base_hdr *greh" pointer to make this
code more readable.
p[1] becomes greh->protocol.
grhlen is the GRE header length.
Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All __napi_complete() callers have been converted to
use the more standard napi_complete_done(),
we can now remove this NAPI method for good.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_print_replace_route_err logs an error if a route replace fails with
IPv6 addresses in the full format. e.g,:
IPv6: IPV6: multipath route replace failed (check consistency of installed routes): 2001:0db8:0200:0000:0000:0000:0000:0000 nexthop 2001:0db8:0001:0000:0000:0000:0000:0016 ifi 0
Change the message to dump the addresses in the compressed format.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an entire multipath route is deleted using prefix and len (without any
nexthops), send a single RTM_DELROUTE notification with the full route
using RTA_MULTIPATH. This is done by generating the skb before the route
delete when all of the sibling routes are still present but sending it
after the route has been removed from the FIB. The skip_notify flag
is used to tell the lower fib code not to send notifications for the
individual nexthop routes.
If a route is deleted using RTA_MULTIPATH for any nexthops or a single
nexthop entry is deleted, then the nexthops are deleted one at a time with
notifications sent as each hop is deleted. This is necessary given that
IPv6 allows individual hops within a route to be deleted.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change ip6_route_multipath_add to send one notifciation with the full
route encoded with RTA_MULTIPATH instead of a series of individual routes.
This is done by adding a skip_notify flag to the nl_info struct. The
flag is used to skip sending of the notification in the fib code that
actually inserts the route. Once the full route has been added, a
notification is generated with all nexthops.
ip6_route_multipath_add handles 3 use cases: new routes, route replace,
and route append. The multipath notification generated needs to be
consistent with the order of the nexthops and it should be consistent
with the order in a FIB dump which means the route with the first nexthop
needs to be used as the route reference. For the first 2 cases (new and
replace), a reference to the route used to send the notification is
obtained by saving the first route added. For the append case, the last
route added is used to loop back to its first sibling route which is
the first nexthop in the multipath route.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 returns multipath routes as a series of individual routes making
their display and handling by userspace different and more complicated
than IPv4, putting the burden on the user to see that a route is part of
a multipath route and internally creating a multipath route if desired
(e.g., libnl does this as of commit 29b71371e764). This patch addresses
this difference, allowing multipath routes to be returned using the
RTA_MULTIPATH attribute.
The end result is that IPv6 multipath routes can be treated and displayed
in a format similar to IPv4:
$ ip -6 ro ls vrf red
2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium
2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium
2001:db8:200::/120 metric 1024
nexthop via 2001:db8:1::2 dev eth1 weight 1
nexthop via 2001:db8:2::2 dev eth2 weight 1
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 allows multipath routes to be deleted using just the prefix and
length. For example:
$ ip ro ls vrf red
unreachable default metric 8192
1.1.1.0/24
nexthop via 10.100.1.254 dev eth1 weight 1
nexthop via 10.11.200.2 dev eth11.200 weight 1
10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3
$ ip ro del 1.1.1.0/24 vrf red
$ ip ro ls vrf red
unreachable default metric 8192
10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3
The same notation does not work with IPv6 because of how multipath routes
are implemented for IPv6. For IPv6 only the first nexthop of a multipath
route is deleted if the request contains only a prefix and length. This
leads to unnecessary complexity in userspace dealing with IPv6 multipath
routes.
This patch allows all nexthops to be deleted without specifying each one
in the delete request. Internally, this is done by walking the sibling
list of the route matching the specifications given (prefix, length,
metric, protocol, etc).
$ ip -6 ro ls vrf red
2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium
2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium
2001:db8:200::/120 via 2001:db8:1::2 dev eth1 metric 1024 pref medium
2001:db8:200::/120 via 2001:db8:2::2 dev eth2 metric 1024 pref medium
...
$ ip -6 ro del vrf red 2001:db8:200::/120
$ ip -6 ro ls vrf red
2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium
2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium
...
Because IPv6 allows individual nexthops to be deleted without deleting
the entire route, the ip6_route_multipath_del and non-multipath code
path (ip6_route_del) have to be discriminated so that all nexthops are
only deleted for the latter case. This is done by making the existing
fc_type in fib6_config a u16 and then adding a new u16 field with
fc_delete_all_nh as the first bit.
Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller found another out of bound access in ip_options_compile(),
or more exactly in cipso_v4_validate()
Fixes: 20e2a86485 ("cipso: handle CIPSO options correctly when NetLabel is disabled")
Fixes: 446fda4f26 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Konovalov got crashes in __ip_options_echo() when a NULL skb->dst
is accessed.
ipv4_pktinfo_prepare() should not drop the dst if (evil) IP options
are present.
We could refine the test to the presence of ts_needtime or srr,
but IP options are not often used, so let's be conservative.
Thanks to syzkaller team for finding this bug.
Fixes: d826eb14ec ("ipv4: PKTINFO doesnt need dst reference")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My recent change missed fact that UFO would perform a complete
UDP checksum before segmenting in frags.
In this case skb->ip_summed is set to CHECKSUM_NONE.
We need to add this valid case to skb_needs_check()
Fixes: b2504a5dbe ("net: reduce skb_warn_bad_offload() noise")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We added generic support for busy polling in NAPI layer in linux-4.5
No network driver uses ndo_busy_poll() anymore, we can get rid
of the pointer in struct net_device_ops, and its use in sk_busy_loop()
Saves NETIF_F_BUSY_POLL features bit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, they are:
1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
sk_buff so we only access one single cacheline in the conntrack
hotpath. Patchset from Florian Westphal.
2) Don't leak pointer to internal structures when exporting x_tables
ruleset back to userspace, from Willem DeBruijn. This includes new
helper functions to copy data to userspace such as xt_data_to_user()
as well as conversions of our ip_tables, ip6_tables and arp_tables
clients to use it. Not surprinsingly, ebtables requires an ad-hoc
update. There is also a new field in x_tables extensions to indicate
the amount of bytes that we copy to userspace.
3) Add nf_log_all_netns sysctl: This new knob allows you to enable
logging via nf_log infrastructure for all existing netnamespaces.
Given the effort to provide pernet syslog has been discontinued,
let's provide a way to restore logging using netfilter kernel logging
facilities in trusted environments. Patch from Michal Kubecek.
4) Validate SCTP checksum from conntrack helper, from Davide Caratti.
5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
a copy&paste from the original helper, from Florian Westphal.
6) Reset netfilter state when duplicating packets, also from Florian.
7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
nft_meta, from Liping Zhang.
8) Add missing code to deal with loopback packets from nft_meta when
used by the netdev family, also from Liping.
9) Several cleanups on nf_tables, one to remove unnecessary check from
the netlink control plane path to add table, set and stateful objects
and code consolidation when unregister chain hooks, from Gao Feng.
10) Fix harmless reference counter underflow in IPVS that, however,
results in problems with the introduction of the new refcount_t
type, from David Windsor.
11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
from Davide Caratti.
12) Missing documentation on nf_tables uapi header, from Liping Zhang.
13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver that offloads flower rules needs to know with which priority
user inserted the rules. So add this information into offload struct.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Josef Bacik diagnosed following problem :
I was seeing random disconnects while testing NBD over loopback.
This turned out to be because NBD sets pfmemalloc on it's socket,
however the receiving side is a user space application so does not
have pfmemalloc set on its socket. This means that
sk_filter_trim_cap will simply drop this packet, under the
assumption that the other side will simply retransmit. Well we do
retransmit, and then the packet is just dropped again for the same
reason.
It seems the better way to address this problem is to clear pfmemalloc
in the TCP transmit path. pfmemalloc strict control really makes sense
on the receive path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 stack does not set the protocol for local routes, so those routes show
up with proto "none":
$ ip -6 ro ls table local
local ::1 dev lo proto none metric 0 pref medium
local 2100:3:: dev lo proto none metric 0 pref medium
local 2100:3::4 dev lo proto none metric 0 pref medium
local fe80:: dev lo proto none metric 0 pref medium
...
Set rt6i_protocol to RTPROT_KERNEL for consistency with IPv4. Now routes
show up with proto "kernel":
$ ip -6 ro ls table local
local ::1 dev lo proto kernel metric 0 pref medium
local 2100:3:: dev lo proto kernel metric 0 pref medium
local 2100:3::4 dev lo proto kernel metric 0 pref medium
local fe80:: dev lo proto kernel metric 0 pref medium
...
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- ingress hook:
- if port is a tunnel port, use tunnel info in
attached dst_metadata to map it to a local vlan
- egress hook:
- if port is a tunnel port, use tunnel info attached to
vlan to set dst_metadata on the skb
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to attach per vlan tunnel info dst
metadata. This enables bridge driver to map vlan to tunnel_info
at ingress and egress. It uses the kernel dst_metadata infrastructure.
The initial use case is vlan to vni bridging, but the api is generic
to extend to any tunnel_info in the future:
- Uapi to configure/unconfigure/dump per vlan tunnel data
- netlink functions to configure vlan and tunnel_info mapping
- Introduces bridge port flag BR_LWT_VLAN to enable attach/detach
dst_metadata to bridged packets on ports. off by default.
- changes to existing code is mainly refactor some existing vlan
handling netlink code + hooks for new vlan tunnel code
- I have kept the vlan tunnel code isolated in separate files.
- most of the netlink vlan tunnel code is handling of vlan-tunid
ranges (follows the vlan range handling code). To conserve space
vlan-tunid by default are always dumped in ranges if applicable.
Use case:
example use for this is a vxlan bridging gateway or vtep
which maps vlans to vn-segments (or vnis).
iproute2 example (patched and pruned iproute2 output to just show
relevant fdb entries):
example shows same host mac learnt on two vni's and
vlan 100 maps to vni 1000, vlan 101 maps to vni 1001
before (netdev per vni):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan1001 vlan 101 master bridge
00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan1000 vlan 100 master bridge
00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self
after this patch with collect metdata in bridged mode (single netdev):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan0 vlan 101 master bridge
00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan0 vlan 100 master bridge
00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This module is responsible for the ife encapsulation protocol
encode/decode logics. That module can:
- ife_encode: encode skb and reserve space for the ife meta header
- ife_decode: decode skb and extract the meta header size
- ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
header space.
- ife_tlv_meta_decode - decodes one tlv entry from the packet
- ife_tlv_meta_next - advance to the next tlv
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the function ife_tlv_meta_encode is not used by any other module,
unexport it and make it static for the act_ife module.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small cleanup factorizing code doing the TCP_MAXSEG clamping.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the latest version of the IPv6 Segment Routing IETF draft [1] the
cleanup flag is removed and the flags field length is shrunk from 16 bits
to 8 bits. As a consequence, the input of the HMAC computation is modified
in a non-backward compatible way by covering the whole octet of flags
instead of only the cleanup bit. As such, if an implementation compatible
with the latest draft computes the HMAC of an SRH who has other flags set
to 1, then the HMAC result would differ from the current implementation.
This patch carries those modifications to prevent conflict with other
implementations of IPv6 SR.
[1] https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-05
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debugging issues caused by pfmemalloc is often tedious.
Add a new SNMP counter to more easily diagnose these problems.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Acked-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nothing in devinet.c relies on fib_lookup.h; remove it from the includes
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When __alloc_skb() allocates an skb from fast clone cache,
setting pfmemalloc on the clone is not needed.
Clone will be properly initialized later at skb_clone() time,
including pfmemalloc field, as it is included in the
headers_start/headers_end section which is fully copied.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This ioctl opens a file to which a socket is bound and
returns a file descriptor. The caller has to have CAP_NET_ADMIN
in the socket network namespace.
Currently it is impossible to get a path and a mount point
for a socket file. socket_diag reports address, device ID and inode
number for unix sockets. An address can contain a relative path or
a file may be moved somewhere. And these properties say nothing about
a mount namespace and a mount point of a socket file.
With the introduced ioctl, we can get a path by reading
/proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X.
In CRIU we are going to use this ioctl to dump and restore unix socket.
Here is an example how it can be used:
$ strace -e socket,bind,ioctl ./test /tmp/test_sock
socket(AF_UNIX, SOCK_STREAM, 0) = 3
bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0
ioctl(3, SIOCUNIXFILE, 0) = 4
^Z
$ ss -a | grep test_sock
u_str LISTEN 0 1 test_sock 17798 * 0
$ ls -l /proc/760/fd/{3,4}
lrwx------ 1 root root 64 Feb 1 09:41 3 -> 'socket:[17798]'
l--------- 1 root root 64 Feb 1 09:41 4 -> /tmp/test_sock
$ cat /proc/760/fdinfo/4
pos: 0
flags: 012000000
mnt_id: 40
$ cat /proc/self/mountinfo | grep "^40\s"
40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYk5bGAAoJECebzXlCjuG+vGcP/j2Sw74vbJX4ifkFQvUomOp/
83IHinaax5RDIqryf903iabvKX1SuXpVuxgpxtaSCz+oS9fV5wv/28NdIrMwfICJ
DC3I2xQ/osLUGHw1Td8BZ+Fv+P0Th+OGKwWCydp3Xejg/X+XUKrLs/Ex1/rHwTzJ
y/BIAV7BU94HlDFxif7sKbv6mi/gpETlvJK4AAbSbvpd4f5JuBXPQBtJBxUPtwi1
zM0pQzMoBgY5XzAZBWmQEYCJjTje5wqiueDHtteh8/X7FoLWtAcHsOmleglleKQn
927eRCTrHt9ioToyZJM4GkH8eW/nOMpaKKvjXCTdoKsuEZyAnlPk0VZUAikfDvql
dr9FznI21Geq/c9IeU3+1xSbe1I9eDO0L9qWp8prUikIZwI4kdEBN5z/0oMMcU4q
IXw2lb46w8JD11GIIAGZIhZb63FdO75Ck4Z2GX6UUFqt246s26Go9yJIZEDfu5sL
8FLmwgOYNMhFrSPAR9JmnBors5gNT9owNwieUB8IFvgMv1ajz2CWG2yvNO9Sq/SK
a/HJJ7A1YvX0uSzsKsvO/j5S4cY73l2kWKX4NRqMFXIYzMzNGHvIvIB238tXAzZe
Z5YaMycsjuRKe9VkP2lZQtVzl9qfnvkd5o6Tg3RkMZXkVOHMB/j2yratWB2XTl3h
I2xAGjQFJ/Pn66pJWNe0
=yQ4e
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Three more miscellaneous nfsd bugfixes"
* tag 'nfsd-4.10-2' of git://linux-nfs.org/~bfields/linux:
svcrpc: fix oops in absence of krb5 module
nfsd: special case truncates some more
NFSD: Fix a null reference case in find_or_create_lock_stateid()
Commit 69b34fb996 ("netfilter: xt_LOG: add net namespace support for
xt_LOG") disabled logging packets using the LOG target from non-init
namespaces. The motivation was to prevent containers from flooding
kernel log of the host. The plan was to keep it that way until syslog
namespace implementation allows containers to log in a safe way.
However, the work on syslog namespace seems to have hit a dead end
somewhere in 2013 and there are users who want to use xt_LOG in all
network namespaces. This patch allows to do so by setting
/proc/sys/net/netfilter/nf_log_all_netns
to a nonzero value. This sysctl is only accessible from init_net so that
one cannot switch the behaviour from inside a container.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, the ip_vs_dest cache frees ip_vs_dest objects when their
reference count becomes < 0. Aside from not being semantically sound,
this is problematic for the new type refcount_t, which will be introduced
shortly in a separate patch. refcount_t is the new kernel type for
holding reference counts, and provides overflow protection and a
constrained interface relative to atomic_t (the type currently being
used for kernel reference counts).
Per Julian Anastasov: "The problem is that dest_trash currently holds
deleted dests (unlinked from RCU lists) with refcnt=0." Changing
dest_trash to hold dest with refcnt=1 will allow us to free ip_vs_dest
structs when their refcnt=0, in ip_vs_dest_put_and_free().
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After this change conntrack operations (lookup, creation, matching from
ruleset) only access one instead of two sk_buff cache lines.
This works for normal conntracks because those are allocated from a slab
that guarantees hw cacheline or 8byte alignment (whatever is larger)
so the 3 bits needed for ctinfo won't overlap with nf_conn addresses.
Template allocation now does manual address alignment (see previous change)
on arches that don't have sufficent kmalloc min alignment.
Some spots intentionally use skb->_nfct instead of skb_nfct() helpers,
this is to avoid undoing the skb_nfct() use when we remove untracked
conntrack object in the future.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The next change will merge skb->nfct pointer and skb->nfctinfo
status bits into single skb->_nfct (unsigned long) area.
For this to work nf_conn addresses must always be aligned at least on
an 8 byte boundary since we will need the lower 3bits to store nfctinfo.
Conntrack templates are allocated via kmalloc.
kbuild test robot reported
BUILD_BUG_ON failed: NFCT_INFOMASK >= ARCH_KMALLOC_MINALIGN
on v1 of this patchset, so not all platforms meet this requirement.
Do manual alignment if needed, the alignment offset is stored in the
nf_conn entry protocol area. This works because templates are not
handed off to L4 protocol trackers.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff.
This avoids changing code in followup patch that merges skb->nfct and
skb->nfctinfo into skb->_nfct.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Followup patch renames skb->nfct and changes its type so add a helper to
avoid intrusive rename change later.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Next patch makes direct skb->nfct access illegal, reduce noise
in next patch by using accessors we already have.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should also toss nf_bridge_info, if any -- packet is leaving via
ip_local_out, also, this skb isn't bridged -- it is a locally generated
copy. Also this avoids the need to touch this later when skb->nfct is
replaced with 'unsigned long _nfct' in followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It is never accessed for reading and the only places that write to it
are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo).
The conntrack core specifically checks for attached skb->nfct after
->error() invocation and returns early in this case.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If something fails in nf_tables_table_enable(), it unregisters the
chains. But the rollback code is the same as nf_tables_table_disable()
almostly, except there is one counter check. Now create one wrapper
function to eliminate the duplicated codes.
Signed-off-by: Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking fixes from David Miller:
1) Fix handling of interrupt status in stmmac driver. Just because we
have masked the event from generating interrupts, doesn't mean the
bit won't still be set in the interrupt status register. From Alexey
Brodkin.
2) Fix DMA API debugging splats in gianfar driver, from Arseny Solokha.
3) Fix off-by-one error in __ip6_append_data(), from Vlad Yasevich.
4) cls_flow does not match on icmpv6 codes properly, from Simon Horman.
5) Initial MAC address can be set incorrectly in some scenerios, from
Ivan Vecera.
6) Packet header pointer arithmetic fix in ip6_tnl_parse_tlv_end_lim(),
from Dan Carpenter.
7) Fix divide by zero in __tcp_select_window(), from Eric Dumazet.
8) Fix crash in iwlwifi when unregistering thermal zone, from Jens
Axboe.
9) Check for DMA mapping errors in starfire driver, from Alexey
Khoroshilov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
tcp: fix 0 divide in __tcp_select_window()
ipv6: pointer math error in ip6_tnl_parse_tlv_enc_lim()
net: fix ndo_features_check/ndo_fix_features comment ordering
net/sched: matchall: Fix configuration race
be2net: fix initial MAC setting
ipv6: fix flow labels when the traffic class is non-0
net: thunderx: avoid dereferencing xcv when NULL
net/sched: cls_flower: Correct matching on ICMPv6 code
ipv6: Paritially checksum full MTU frames
net/mlx4_core: Avoid command timeouts during VF driver device shutdown
gianfar: synchronize DMA API usage by free_skb_rx_queue w/ gfar_new_page
net: ethtool: add support for 2500BaseT and 5000BaseT link modes
can: bcm: fix hrtimer/tasklet termination in bcm op removal
net: adaptec: starfire: add checks for dma mapping errors
net: phy: micrel: KSZ8795 do not set SUPPORTED_[Asym_]Pause
can: Fix kernel panic at security_sock_rcv_skb
net: macb: Fix 64 bit addressing support for GEM
stmmac: Discard masked flags in interrupt status register
net/mlx5e: Check ets capability before ets query FW command
net/mlx5e: Fix update of hash function/key via ethtool
...
The ASSERT_RTNL is not necessary in the init function, as it does not
touch any rtnl protected structures, as opposed to the mirred action which
does have to hold a net device.
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix error path of in sample init, by releasing the tc hash in case of
failure in psample_group creation.
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syszkaller fuzzer was able to trigger a divide by zero, when
TCP window scaling is not enabled.
SO_RCVBUF can be used not only to increase sk_rcvbuf, also
to decrease it below current receive buffers utilization.
If mss is negative or 0, just return a zero TCP window.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Casting is a high precedence operation but "off" and "i" are in terms of
bytes so we need to have some parenthesis here.
Fixes: fbfa743a9d ("ipv6: fix ip6_tnl_parse_tlv_enc_lim()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 does not set the NLM_F_APPEND flag in notifications to signal that
a NEWROUTE is an append versus a new route or a replaced one. Add the
flag if the request has it.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry reported warnings occurring in __skb_gso_segment() [1]
All SKB_GSO_DODGY producers can allow user space to feed
packets that trigger the current check.
We could prevent them from doing so, rejecting packets, but
this might add regressions to existing programs.
It turns out our SKB_GSO_DODGY handlers properly set up checksum
information that is needed anyway when packets needs to be segmented.
By checking again skb_needs_check() after skb_mac_gso_segment(),
we should remove these pesky warnings, at a very minor cost.
With help from Willem de Bruijn
[1]
WARNING: CPU: 1 PID: 6768 at net/core/dev.c:2439 skb_warn_bad_offload+0x2af/0x390 net/core/dev.c:2434
lo: caps=(0x000000a2803b7c69, 0x0000000000000000) len=138 data_len=0 gso_size=15883 gso_type=4 ip_summed=0
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 6768 Comm: syz-executor1 Not tainted 4.9.0 #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
ffff8801c063ecd8 ffffffff82346bdf ffffffff00000001 1ffff100380c7d2e
ffffed00380c7d26 0000000041b58ab3 ffffffff84b37e38 ffffffff823468f1
ffffffff84820740 ffffffff84f289c0 dffffc0000000000 ffff8801c063ee20
Call Trace:
[<ffffffff82346bdf>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff82346bdf>] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
[<ffffffff81827e34>] panic+0x1fb/0x412 kernel/panic.c:179
[<ffffffff8141f704>] __warn+0x1c4/0x1e0 kernel/panic.c:542
[<ffffffff8141f7e5>] warn_slowpath_fmt+0xc5/0x100 kernel/panic.c:565
[<ffffffff8356cbaf>] skb_warn_bad_offload+0x2af/0x390 net/core/dev.c:2434
[<ffffffff83585cd2>] __skb_gso_segment+0x482/0x780 net/core/dev.c:2706
[<ffffffff83586f19>] skb_gso_segment include/linux/netdevice.h:3985 [inline]
[<ffffffff83586f19>] validate_xmit_skb+0x5c9/0xc20 net/core/dev.c:2969
[<ffffffff835892bb>] __dev_queue_xmit+0xe6b/0x1e70 net/core/dev.c:3383
[<ffffffff8358a2d7>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3424
[<ffffffff83ad161d>] packet_snd net/packet/af_packet.c:2930 [inline]
[<ffffffff83ad161d>] packet_sendmsg+0x32ed/0x4d30 net/packet/af_packet.c:2955
[<ffffffff834f0aaa>] sock_sendmsg_nosec net/socket.c:621 [inline]
[<ffffffff834f0aaa>] sock_sendmsg+0xca/0x110 net/socket.c:631
[<ffffffff834f329a>] ___sys_sendmsg+0x8fa/0x9f0 net/socket.c:1954
[<ffffffff834f5e58>] __sys_sendmsg+0x138/0x300 net/socket.c:1988
[<ffffffff834f604d>] SYSC_sendmsg net/socket.c:1999 [inline]
[<ffffffff834f604d>] SyS_sendmsg+0x2d/0x50 net/socket.c:1995
[<ffffffff84371941>] entry_SYSCALL_64_fastpath+0x1f/0xc2
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current version, the matchall internal state is split into two
structs: cls_matchall_head and cls_matchall_filter. This makes little
sense, as matchall instance supports only one filter, and there is no
situation where one exists and the other does not. In addition, that led
to some races when filter was deleted while packet was processed.
Unify that two structs into one, thus simplifying the process of matchall
creation and deletion. As a result, the new, delete and get callbacks have
a dummy implementation where all the work is done in destroy and change
callbacks, as was done in cls_cgroup.
Fixes: bf3994d2ed ("net/sched: introduce Match-all classifier")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow a master interface to be specified as one of the parameters when
creating a new interface via rtnl_newlink. Previously this would
require invoking interface creation, waiting for it to complete, and
then separately binding that new interface to a master.
In particular, this is used when creating a macvlan child interface for
VRRP in a VRF configuration, allowing the interface creator to specify
directly what master interface should be inherited by the child,
without having to deal with asynchronous complications and potential
race conditions.
Signed-off-by: Theuns Verwoerd <theuns.verwoerd@alliedtelesis.co.nz>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2017-02-01
1) Some typo fixes, from Alexander Alemayhu.
2) Don't acquire state lock in get_mtu functions.
The only rece against a dead state does not matter.
From Florian Westphal.
3) Remove xfrm4_state_fini, it is unused for more than
10 years. From Florian Westphal.
4) Various rcu usage improvements. From Florian Westphal.
5) Properly handle crypto arrors in ah4/ah6.
From Gilad Ben-Yossef.
6) Try to avoid skb linearization in esp4 and esp6.
7) The esp trailer is now set up in different places,
add a helper for this.
8) With the upcomming usage of gro_cells in IPsec,
a gro merged skb can have a secpath. Drop it
before freeing or reusing the skb.
9) Add a xfrm dummy network device for napi. With
this we can use gro_cells from within xfrm,
it allows IPsec GRO without impact on the generic
networking code.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Olga Kornievskaia says: "I ran into this oops in the nfsd (below)
(4.10-rc3 kernel). To trigger this I had a client (unsuccessfully) try
to mount the server with krb5 where the server doesn't have the
rpcsec_gss_krb5 module built."
The problem is that rsci.cred is copied from a svc_cred structure that
gss_proxy didn't properly initialize. Fix that.
[120408.542387] general protection fault: 0000 [#1] SMP
...
[120408.565724] CPU: 0 PID: 3601 Comm: nfsd Not tainted 4.10.0-rc3+ #16
[120408.567037] Hardware name: VMware, Inc. VMware Virtual =
Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[120408.569225] task: ffff8800776f95c0 task.stack: ffffc90003d58000
[120408.570483] RIP: 0010:gss_mech_put+0xb/0x20 [auth_rpcgss]
...
[120408.584946] ? rsc_free+0x55/0x90 [auth_rpcgss]
[120408.585901] gss_proxy_save_rsc+0xb2/0x2a0 [auth_rpcgss]
[120408.587017] svcauth_gss_proxy_init+0x3cc/0x520 [auth_rpcgss]
[120408.588257] ? __enqueue_entity+0x6c/0x70
[120408.589101] svcauth_gss_accept+0x391/0xb90 [auth_rpcgss]
[120408.590212] ? try_to_wake_up+0x4a/0x360
[120408.591036] ? wake_up_process+0x15/0x20
[120408.592093] ? svc_xprt_do_enqueue+0x12e/0x2d0 [sunrpc]
[120408.593177] svc_authenticate+0xe1/0x100 [sunrpc]
[120408.594168] svc_process_common+0x203/0x710 [sunrpc]
[120408.595220] svc_process+0x105/0x1c0 [sunrpc]
[120408.596278] nfsd+0xe9/0x160 [nfsd]
[120408.597060] kthread+0x101/0x140
[120408.597734] ? nfsd_destroy+0x60/0x60 [nfsd]
[120408.598626] ? kthread_park+0x90/0x90
[120408.599448] ret_from_fork+0x22/0x30
Fixes: 1d658336b0 "SUNRPC: Add RPC based upcall mechanism for RPCGSS auth"
Cc: stable@vger.kernel.org
Cc: Simo Sorce <simo@redhat.com>
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When matching on the ICMPv6 code ICMPV6_CODE rather than
ICMPV4_CODE attributes should be used.
This corrects what appears to be a typo.
Sample usage:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 protocol ipv6 parent ffff: flower \
indev eth0 ip_proto icmpv6 type 128 code 0 action drop
Without this change the code parameter above is effectively ignored.
Fixes: 7b684884fb ("net/sched: cls_flower: Support matching on ICMP type and code")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEES2FAuYbJvAGobdVQPTuqJaypJWoFAliPEGcTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRA9O6olrKklakWHB/971KHqyYL5Rhbo8rOl+XKIpTLSFCJi
tgaN1mOUaBl2UIlevzW/9kNP7DkT+0axaEdFwsfZCm3opLwCaYI9TtxbDLmMx8VW
p7/djLDDcE9hgH1J6PnHL6w+Mlnr6NcdmVlU7lm8YyiCcALQp6EdYXVh1T1/xD0+
ytoqO2A8pgvR1VfcD7Q4M3qE0L9JQRrFhDenStn8fql6fpb2RqFT2iogQF6hta1Y
YNZx0pK24gpCDTWqR06BBuHHDLpS6nGmnMl5Qvft1FBmSsvKpp4qAMN2IVyYA402
SdEHJHlo9SmqNZmQg2QQg5Hn0KJ4iY7PxLJGc2yPvZowa9ewBRRW2xYG
=CiTr
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-4.10-20170130' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2017-01-30
this is a pull request of one patch.
The patch is by Oliver Hartkopp and fixes the hrtimer/tasklet termination in
bcm op removal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Say we got really unlucky and these failed on the last iteration, then
it could lead to a use after free bug.
Fixes: cd6851f303 ("smc: remote memory buffers (RMBs)")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add necessary plumbing at the slave network device level to have switch
drivers implement ndo_setup_tc() and most particularly the cls_matchall
classifier. We add support for two switch operations:
port_add_mirror and port_del_mirror() which configure, on a per-port
basis the mirror parameters requested from the cls_matchall classifier.
Code is largely borrowed from the Mellanox Spectrum switch driver.
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 will mark data that is smaller that mtu - headersize as
CHECKSUM_PARTIAL, but if the data will completely fill the mtu,
the packet checksum will be computed in software instead.
Extend the conditional to include the data that fills the mtu
as well.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing about lwt state requires a device reference, so remove the
input argument.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets arriving in a VRF currently are delivered to UDP sockets that
aren't bound to any interface. TCP defaults to not delivering packets
arriving in a VRF to unbound sockets. IP route lookup and socket
transmit both assume that unbound means using the default table and
UDP applications that haven't been changed to be aware of VRFs may not
function correctly in this case since they may not be able to handle
overlapping IP address ranges, or be able to send packets back to the
original sender if required.
So add a sysctl, udp_l3mdev_accept, to control this behaviour with it
being analgous to the existing tcp_l3mdev_accept, namely to allow a
process to have a VRF-global listen socket. Have this default to off
as this is the behaviour that users will expect, given that there is
no explicit mechanism to set unmodified VRF-unaware application into a
default VRF.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for adding support for CFP/TCAMP in the bcm_sf2 driver add the
plumbing to call into driver specific {get,set}_rxnfc operations.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing a bcm tx operation either a hrtimer or a tasklet might run.
As the hrtimer triggers its associated tasklet and vice versa we need to
take care to mutually terminate both handlers.
Reported-by: Michael Josenhans <michael.josenhans@web.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Michael Josenhans <michael.josenhans@web.de>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch adds a dummy network device so that we can
use gro_cells for IPsec GRO. With this, we handle IPsec
GRO with no impact on the generic networking code.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
With a followup patch, a gro merged skb can have a secpath.
So drop it before freeing or reusing the skb.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch adds devm_alloc_etherdev_mqs function and devm_alloc_etherdev
macro. These can be used for simpler netdev allocation without having to
care about calling free_netdev.
Thanks to this change drivers, their error paths and removal paths may
get simpler by a bit.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
- fix double call of dev_queue_xmit(), caused by the recent introduction
of net_xmit_eval(), by Sven Eckelmann
- Fix includes for IS_ERR/ERR_PTR, by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAliMd+IWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeofArD/0RArnwnJLDZbL79oBzuYk3vFWy
/8BhZN+Bw/D6laMpE6yOM3zzDM0H9kvwgVl5ZFx5thmDgBj/cKbB2+2SjiPrQQk1
I4hDFatzo+jEtGIGDv59UsAzxbBPgFjK0ZX61tfEYIUFo3DnbJj/pcxI7+QQvkOK
vc+pyaPcYGViopyLVoy9XTt3ZhpRLdk7rlGE7vl8tydmJXrOGG+GQRbd1Emn1U63
OH/HW94hufD5dK5LsxNj1HhViEs4DPJ55+K9UeKY8caoAfn7tzXlOEdsh9uqXU1p
Zqnk2Q2n2D7/Yvs+umuj+7I1IXzykSfGAJjMv843mHuAloyyMDeDGzMxXFXnZ4Hq
DM4DgAeVcGYjYwLrPaeoLkVPMF/2e0z0S7Hfdzhp5UKk9j+846eyTsdISUEyb73h
vl+qPFc6eJ+PLTb8wMvdnuDuJWTqHDlJ2Id1GadJ86si2ZtYYrmNZp8Sn+YQg0zc
9tVI1jATxv3XU22jCnpwnrXMdDMQBX1crtY4N0MtOz3NqZdexpQR4p5LsC+PB2zJ
2JjAJExGpe4bGWorcRnHTTP5mVRdezB8vE8lKs4iIf40wdGGV3xJxq3xBBGH9S21
IpLcROfshDQfMhntpd42bw2AEDIkz4Vp/GvO8r2YUkQfKPFmX9duilQvi0VRDKNk
FfO6GqDuy6WIGcljgQ==
=iLbv
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20170128' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are two fixes for batman-adv for net-next:
- fix double call of dev_queue_xmit(), caused by the recent introduction
of net_xmit_eval(), by Sven Eckelmann
- Fix includes for IS_ERR/ERR_PTR, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the retransmission stats are not incremented if the
retransmit fails locally. But we always increment the other packet
counters that track total packet/bytes sent. Awkwardly while we
don't count these failed retransmits in RETRANSSEGS, we do count
them in FAILEDRETRANS.
If the qdisc is dropping many packets this could under-estimate
TCP retransmission rate substantially from both SNMP or per-socket
TCP_INFO stats. This patch changes this by always incrementing
retransmission stats on retransmission attempts and failures.
Another motivation is to properly track retransmists in
SCM_TIMESTAMPING_OPT_STATS. Since SCM_TSTAMP_SCHED collection is
triggered in tcp_transmit_skb(), If tp->total_retrans is incremented
after the function, we'll always mis-count by the amount of the
latest retransmission.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two stats in SCM_TIMESTAMPING_OPT_STATS:
TCP_NLA_DATA_SEGS_OUT: total data packets sent including retransmission
TCP_NLA_TOTAL_RETRANS: total data packets retransmitted
The names are picked to be consistent with corresponding fields in
TCP_INFO. This allows applications that are using the timestamping
API to measure latency stats to also retrive retransmission rate
of application write.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
do_execute_actions() implements a worthwhile optimization: in case
an output action is the last action in an action list, skb_clone()
can be avoided by outputing the current skb. However, the
implementation is more complicated than necessary. This patch
simplify this logic.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon reception of the NETDEV_CHANGEUPPER, a leaving port is already
unbridged, so reflect this by assigning the port's bridge_dev pointer to
NULL before calling the port_bridge_leave DSA driver operation.
Now that the bridge_dev pointer is exposed to the drivers, reflecting
the current state of the DSA switch fabric is necessary for the drivers
to adjust their port based VLANs correctly.
Pass the bridge device pointer to the port_bridge_leave operation so
that drivers have all information to re-program their chips properly,
and do not need to cache it anymore.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the bridge_dev pointer from dsa_slave_priv to dsa_port so that DSA
drivers can access this information and remove the need to cache it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Store a pointer to the dsa_port structure in the dsa_slave_priv
structure, instead of the switch/port index.
This will allow to store more information such as the bridge device,
needed in DSA drivers for multi-chip configuration.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the physical switch instance and port index a DSA port belongs to to
the dsa_port structure.
That can be used later to retrieve information about a physical port
when configuring a switch fabric, or lighten up struct dsa_slave_priv.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dsa_switch structure contains the number of ports. Use it where the
structure is valid instead of the DSA_MAX_PORTS value.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the ports[DSA_MAX_PORTS] array of the dsa_switch structure for a
zero-length array, allocated at the same time as the dsa_switch
structure itself. A dsa_switch_alloc() helper is provided for that.
This commit brings no functional change yet since we pass DSA_MAX_PORTS
as the number of ports for the moment. Future patches can update the DSA
drivers separately to support dynamic number of ports.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Yanmin reported crashes [1] and provided a patch adding a
synchronize_rcu() call in can_rx_unregister()
The main problem seems that the sockets themselves are not RCU
protected.
If CAN uses RCU for delivery, then sockets should be freed only after
one RCU grace period.
Recent kernels could use sock_set_flag(sk, SOCK_RCU_FREE), but let's
ease stable backports with the following fix instead.
[1]
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81495e25>] selinux_socket_sock_rcv_skb+0x65/0x2a0
Call Trace:
<IRQ>
[<ffffffff81485d8c>] security_sock_rcv_skb+0x4c/0x60
[<ffffffff81d55771>] sk_filter+0x41/0x210
[<ffffffff81d12913>] sock_queue_rcv_skb+0x53/0x3a0
[<ffffffff81f0a2b3>] raw_rcv+0x2a3/0x3c0
[<ffffffff81f06eab>] can_rcv_filter+0x12b/0x370
[<ffffffff81f07af9>] can_receive+0xd9/0x120
[<ffffffff81f07beb>] can_rcv+0xab/0x100
[<ffffffff81d362ac>] __netif_receive_skb_core+0xd8c/0x11f0
[<ffffffff81d36734>] __netif_receive_skb+0x24/0xb0
[<ffffffff81d37f67>] process_backlog+0x127/0x280
[<ffffffff81d36f7b>] net_rx_action+0x33b/0x4f0
[<ffffffff810c88d4>] __do_softirq+0x184/0x440
[<ffffffff81f9e86c>] do_softirq_own_stack+0x1c/0x30
<EOI>
[<ffffffff810c76fb>] do_softirq.part.18+0x3b/0x40
[<ffffffff810c8bed>] do_softirq+0x1d/0x20
[<ffffffff81d30085>] netif_rx_ni+0xe5/0x110
[<ffffffff8199cc87>] slcan_receive_buf+0x507/0x520
[<ffffffff8167ef7c>] flush_to_ldisc+0x21c/0x230
[<ffffffff810e3baf>] process_one_work+0x24f/0x670
[<ffffffff810e44ed>] worker_thread+0x9d/0x6f0
[<ffffffff810e4450>] ? rescuer_thread+0x480/0x480
[<ffffffff810ebafc>] kthread+0x12c/0x150
[<ffffffff81f9ccef>] ret_from_fork+0x3f/0x70
Reported-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable patches:
- NFSv4.1: Fix a deadlock in layoutget
- NFSv4 must not bump sequence ids on NFS4ERR_MOVED errors
- NFSv4 Fix a regression with OPEN EXCLUSIVE4 mode
- Fix a memory leak when removing the SUNRPC module
Bugfixes:
- Fix a reference leak in _pnfs_return_layout
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYjM6UAAoJEGcL54qWCgDy+WoP/2NdwqzOa3xxn8piMhLh/TLJ
lxWfVmUZJ8UNVSVl45NDslt2V63P0gpITUiaVo27rPfjKV8pv8XTU7HkxCuXzj0m
uo11jEErsYxbXELNDL2y1fmsPTygYvg4MAV3NK0inKv/ciu6TTMddoga+QQXGJPz
QqQ7znTHyjDSTqZM1D5+f8EhbY7AIAzJbkgVr90WPPjsT6YtdlTQpMb1O9tlDtfj
oYidIOxRphqLOYTdoD8NW4x19dlgTNIYyCgOFomCbpUTFKKK5RFjVy+bpml3NxV1
oUldw8lhiUE8x/5J/CfbCMt1bPZ3yJeFWNghINOIF7cgGnvKEjw0uO2o8Lk6fMzi
yF2aNoZcNQnyoHHCjsqK8t5dB51abwxlzeElrFB1u73B56CcXB91xEP0GsX1vFt1
BIGCnfsOvtYPZxp/OX+B3AweAJ2DRuIvdze9cS2IeHBW9c5XdVdDmMK7CFeL9dc3
YEsWmaWNWtDt41erxCvQ0ezkr3lPHXOcjwS440LIiVpr38/rzjp81+Tf+i2D1GzB
XTsy9IHzaC2zLb0Vz5vOR3S10k4OiFtXT6ikoZLwgpTfI2gt146e0YHx3g90AiFh
QYBRNLJrjSYVedjv2GGFLG1WWsHbhbP6dpmT3BwNmhZTzjPoqZ76f5QFIQUya4jt
4Y2ksyhx7pOeOJy++6SN
=S5qH
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Stable patches:
- NFSv4.1: Fix a deadlock in layoutget
- NFSv4 must not bump sequence ids on NFS4ERR_MOVED errors
- NFSv4 Fix a regression with OPEN EXCLUSIVE4 mode
- Fix a memory leak when removing the SUNRPC module
Bugfixes:
- Fix a reference leak in _pnfs_return_layout"
* tag 'nfs-for-4.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS: Fix a reference leak in _pnfs_return_layout
nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED"
SUNRPC: cleanup ida information when removing sunrpc module
NFSv4.0: always send mode in SETATTR after EXCLUSIVE4
nfs: Don't increment lock sequence ID after NFS4ERR_MOVED
NFSv4.1: Fix a deadlock in layoutget
IS_ERR/ERR_PTR are not defined in linux/device.h but in linux/err.h. The
files using these macros therefore have to include the correct one.
Reported-by: Linus Luessing <linus.luessing@web.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The net_xmit_eval has side effects because it is not making sure that e
isn't evaluated twice.
#define net_xmit_eval(e) ((e) == NET_XMIT_CN ? 0 : (e))
The code requested by David Miller [1]
return net_xmit_eval(dev_queue_xmit(skb));
will get transformed into
return ((dev_queue_xmit(skb)) == NET_XMIT_CN ? 0 : (dev_queue_xmit(skb)))
dev_queue_xmit will therefore be tried again (with an already consumed skb)
whenever the return code is not NET_XMIT_CN.
[1] https://lkml.kernel.org/r/20170125.225624.965229145391320056.davem@davemloft.net
Fixes: c33705188c ("batman-adv: Treat NET_XMIT_CN as transmit successfully")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Pull networking fixes from David Miller:
1) GTP fixes from Andreas Schultz (missing genl module alias, clear IP
DF on transmit).
2) Netfilter needs to reflect the fwmark when sending resets, from Pau
Espin Pedrol.
3) nftable dump OOPS fix from Liping Zhang.
4) Fix erroneous setting of VIRTIO_NET_HDR_F_DATA_VALID on transmit,
from Rolf Neugebauer.
5) Fix build error of ipt_CLUSTERIP when procfs is disabled, from Arnd
Bergmann.
6) Fix regression in handling of NETIF_F_SG in harmonize_features(),
from Eric Dumazet.
7) Fix RTNL deadlock wrt. lwtunnel module loading, from David Ahern.
8) tcp_fastopen_create_child() needs to setup tp->max_window, from
Alexey Kodanev.
9) Missing kmemdup() failure check in ipv6 segment routing code, from
Eric Dumazet.
10) Don't execute unix_bind() under the bindlock, otherwise we deadlock
with splice. From WANG Cong.
11) ip6_tnl_parse_tlv_enc_lim() potentially reallocates the skb buffer,
therefore callers must reload cached header pointers into that skb.
Fix from Eric Dumazet.
12) Fix various bugs in legacy IRQ fallback handling in alx driver, from
Tobias Regnery.
13) Do not allow lwtunnel drivers to be unloaded while they are
referenced by active instances, from Robert Shearman.
14) Fix truncated PHY LED trigger names, from Geert Uytterhoeven.
15) Fix a few regressions from virtio_net XDP support, from John
Fastabend and Jakub Kicinski.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (102 commits)
ISDN: eicon: silence misleading array-bounds warning
net: phy: micrel: add support for KSZ8795
gtp: fix cross netns recv on gtp socket
gtp: clear DF bit on GTP packet tx
gtp: add genl family modules alias
tcp: don't annotate mark on control socket from tcp_v6_send_response()
ravb: unmap descriptors when freeing rings
virtio_net: reject XDP programs using header adjustment
virtio_net: use dev_kfree_skb for small buffer XDP receive
r8152: check rx after napi is enabled
r8152: re-schedule napi for tx
r8152: avoid start_xmit to schedule napi when napi is disabled
r8152: avoid start_xmit to call napi_schedule during autosuspend
net: dsa: Bring back device detaching in dsa_slave_suspend()
net: phy: leds: Fix truncated LED trigger names
net: phy: leds: Break dependency of phy.h on phy_led_triggers.h
net: phy: leds: Clear phy_num_led_triggers on failure to avoid crash
net-next: ethernet: mediatek: change the compatible string
Documentation: devicetree: change the mediatek ethernet compatible string
bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status().
...
Slava Shwartsman reported a warning in skb_try_coalesce(), when we
detect skb->truesize is completely wrong.
In his case, issue came from IPv6 reassembly coping with malicious
datagrams, that forced various pskb_may_pull() to reallocate a bigger
skb->head than the one allocated by NIC driver before entering GRO
layer.
Current code does not change skb->truesize, leaving this burden to
callers if they care enough.
Blindly changing skb->truesize in pskb_expand_head() is not
easy, as some producers might track skb->truesize, for example
in xmit path for back pressure feedback (sk->sk_wmem_alloc)
We can detect the cases where it should be safe to change
skb->truesize :
1) skb is not attached to a socket.
2) If it is attached to a socket, destructor is sock_edemux()
My audit gave only two callers doing their own skb->truesize
manipulation.
I had to remove skb parameter in sock_edemux macro when
CONFIG_INET is not set to avoid a compile error.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike ipv4, this control socket is shared by all cpus so we cannot use
it as scratchpad area to annotate the mark that we pass to ip6_xmit().
Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket
family caches the flowi6 structure in the sctp_transport structure, so
we cannot use to carry the mark unless we later on reset it back, which
I discarded since it looks ugly to me.
Fixes: bf99b4ded5 ("tcp: fix mark propagation with fwmark_reflect enabled")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The address generation mode for IPv6 link-local can only be configured
by netlink messages. This patch adds the ability to change the address
generation mode via sysctl.
v1 -> v2
Removed the rtnl lock and switch to use RCU lock to iterate through
the netdev list.
v2 -> v3
Removed the addrgenmode variable from the idev structure and use the
systcl storage for the flag.
Simplifed the logic for sysctl handling by removing the supported
for all operation.
Added support for more types of tunnel interfaces for link-local
address generation.
Based the patches from net-next.
v3 -> v4
Removed unnecessary whitespace changes.
Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
lkp-robot reported a BUG:
[ 10.151226] BUG: unable to handle kernel NULL pointer dereference at 00000198
[ 10.152525] IP: rt6_fill_node+0x164/0x4b8
[ 10.153307] *pdpt = 0000000012ee5001 *pde = 0000000000000000
[ 10.153309]
[ 10.154492] Oops: 0000 [#1]
[ 10.154987] CPU: 0 PID: 909 Comm: netifd Not tainted 4.10.0-rc4-00722-g41e8c70ee162-dirty #10
[ 10.156482] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 10.158254] task: d0deb000 task.stack: d0e0c000
[ 10.159059] EIP: rt6_fill_node+0x164/0x4b8
[ 10.159780] EFLAGS: 00010296 CPU: 0
[ 10.160404] EAX: 00000000 EBX: d10c2358 ECX: c1f7c6cc EDX: c1f6ff44
[ 10.161469] ESI: 00000000 EDI: c2059900 EBP: d0e0dc4c ESP: d0e0dbe4
[ 10.162534] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[ 10.163482] CR0: 80050033 CR2: 00000198 CR3: 10d94660 CR4: 000006b0
[ 10.164535] Call Trace:
[ 10.164993] ? paravirt_sched_clock+0x9/0xd
[ 10.165727] ? sched_clock+0x9/0xc
[ 10.166329] ? sched_clock_cpu+0x19/0xe9
[ 10.166991] ? lock_release+0x13e/0x36c
[ 10.167652] rt6_dump_route+0x4c/0x56
[ 10.168276] fib6_dump_node+0x1d/0x3d
[ 10.168913] fib6_walk_continue+0xab/0x167
[ 10.169611] fib6_walk+0x2a/0x40
[ 10.170182] inet6_dump_fib+0xfb/0x1e0
[ 10.170855] netlink_dump+0xcd/0x21f
This happens when the loopback device is set down and a ipv6 fib route
dump is requested.
ip6_null_entry is the root of all ipv6 fib tables making it integrated
into the table and hence passed to the ipv6 route dump code. The
null_entry route uses the loopback device for dst.dev but may not have
rt6i_idev set because of the order in which initializations are done --
ip6_route_net_init is run before addrconf_init has initialized the
loopback device. Fixing the initialization order is a much bigger problem
with no obvious solution thus far.
The BUG is triggered when the loopback is set down and the netif_running
check added by a1a22c1206 fails. The fill_node descends to checking
rt->rt6i_idev for ignore_routes_with_linkdown and since rt6i_idev is
NULL it faults.
The null_entry route should not be processed in a dump request. Catch
and ignore. This check is done in rt6_dump_route as it is the highest
place in the callchain with knowledge of both the route and the network
namespace.
Fixes: a1a22c1206("net: ipv6: Keep nexthop of multipath route on admin down")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The
allocated skb is not passed through the routing engine (like it is for
IPv4) and has not since the beginning of git time.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the assignment of ports in _dsa_register_switch() closer to where
it is checked, no functional change. Re-order declarations to be
preserve the inverted christmas tree style.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it clear that these functions take a device_node structure pointer
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for allowing platform data, and therefore no valid
device_node pointer, make most DSA functions takes a pointer to a
dsa_port structure whenever possible. While at it, introduce a
dsa_port_is_valid() helper function which checks whether port->dn is
NULL or not at the moment.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for allowing dsa_register_switch() to be supplied with
device/platform data, pass down a struct device pointer instead of a
struct device_node.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- ignore self-generated loop detect MAC addresses in translation table,
by Simon Wunderlich
- install uapi batman_adv.h header, by Sven Eckelmann
- bump copyright years, by Sven Eckelmann
- Remove an unused variable in translation table code, by Sven Eckelmann
- Handle NET_XMIT_CN like NET_XMIT_SUCCESS (revised according to Davids
suggestion), and a follow up code clean up, by Gao Feng (2 patches)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAliKJcgWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoYzFEACGNOZcW2bFFpuE5CZWZpTW1n24
YAY/3C+THc89oUyn7ZbWpZ05HcG+JZXOWCWHaRTnd4UeA+sbki53Ioo52ZGgo79C
7LsyXMMH3F2NrXVEfCK/MUGZcqBAxMd4dcDPsYz5q5osxydjG3hEJLikqOluop3P
JKK9FTEeP2JeoWz9eEf7Io2te6EwcIjo3TRa9f53uyyhPnk5eS2NeSle5axZqS7c
l+NSZVlrrG7oUB0UdzABt5NWOvjDc+Lqp1dkoJo17PHOgXialIfjZOKUlKtjVx6E
06ow2HZaEYCtdlb0awjyPxtIpwiy0szTzJa/h4XtzeQHgbOKIqJWAEb80X7imHVE
aljP7A1uuGm0bcVQ+pq21PX8yLu4RCDPDE5Khu9atkiQP5+sVEdGiJ8Soaw4PmoD
yhDmXshrPGR8u5txN8gaHWG4MHt19645s8dHqHQ7tf5h+mf2QXQ2v/jJQsCV2UfY
vLd/JOD4ZVYWDCspcDdEwGc8KB9r6P31wXuSVjYfkqTFXocdzDr87V7C69CS0U+b
Lzel8oa/eVa2ppR+OhpELxhL2ahO7p1jZI2ix4NHftx5MAV4WJ3RfR2ev2Sf9Ukc
aGjzjuml3JVo2i+11lqiAtOaBK9wDNv1CaC+D2NmC6wpWzWQryNryw3fm2SoH3Xm
S+UoBDZpik+SIEBv3Q==
=cd1Y
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20170126' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- ignore self-generated loop detect MAC addresses in translation table,
by Simon Wunderlich
- install uapi batman_adv.h header, by Sven Eckelmann
- bump copyright years, by Sven Eckelmann
- Remove an unused variable in translation table code, by Sven Eckelmann
- Handle NET_XMIT_CN like NET_XMIT_SUCCESS (revised according to Davids
suggestion), and a follow up code clean up, by Gao Feng (2 patches)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains a large batch with Netfilter fixes for
your net tree, they are:
1) Two patches to solve conntrack garbage collector cpu hogging, one to
remove GC_MAX_EVICTS and another to look at the ratio (scanned entries
vs. evicted entries) to make a decision on whether to reduce or not
the scanning interval. From Florian Westphal.
2) Two patches to fix incorrect set element counting if NLM_F_EXCL is
is not set. Moreover, don't decrenent set->nelems from abort patch
if -ENFILE which leaks a spare slot in the set. This includes a
patch to deconstify the set walk callback to update set->ndeact.
3) Two fixes for the fwmark_reflect sysctl feature: Propagate mark to
reply packets both from nf_reject and local stack, from Pau Espin Pedrol.
4) Fix incorrect handling of loopback traffic in rpfilter and nf_tables
fib expression, from Liping Zhang.
5) Fix oops on stateful objects netlink dump, when no filter is specified.
Also from Liping Zhang.
6) Fix a build error if proc is not available in ipt_CLUSTERIP, related
to fix that was applied in the previous batch for net. From Arnd Bergmann.
7) Fix lack of string validation in table, chain, set and stateful
object names in nf_tables, from Liping Zhang. Moreover, restrict
maximum log prefix length to 127 bytes, otherwise explicitly bail
out.
8) Two patches to fix spelling and typos in nf_tables uapi header file
and Kconfig, patches from Alexander Alemayhu and William Breathitt Gray.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
cleanup patch to make use of ieee80211_ac_from_tid() to retrieve ac from
ieee802_1d_to_ac[]
Signed-off-by: Amadeusz Sławiński <amadeusz.slawinski@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The tc could return NET_XMIT_CN as one congestion notification, but
it does not mean the packet is lost. Other modules like ipvlan,
macvlan, and others treat NET_XMIT_CN as success too.
So batman-adv should handle NET_XMIT_CN also as NET_XMIT_SUCCESS.
Signed-off-by: Gao Feng <gfree.wind@gmail.com>
[sven@narfation.org: Moved NET_XMIT_CN handling to batadv_send_skb_packet]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It could decrease one condition check to collect some statements in the
first condition block.
Signed-off-by: Gao Feng <gfree.wind@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The bridge loop avoidance (BLA) feature of batman-adv sends packets to
probe for Mesh/LAN packet loops. Those packets are not sent by real
clients and should therefore not be added to the translation table (TT).
Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
The only caller of this new function is inside of an #ifdef checking
for CONFIG_BRIDGE_IGMP_SNOOPING, so we should move the implementation
there too, in order to avoid this harmless warning:
net/bridge/br_forward.c:177:13: error: 'maybe_deliver_addr' defined but not used [-Werror=unused-function]
Fixes: 6db6f0eae6 ("bridge: multicast to unicast")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- fix reference count handling on fragmentation error, by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAliI0D8WHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoU2WD/9EdJuxSK0++Ka/sdXNEmy931dl
ojXukizbSIg5+vEs0GCP+E5btvCQkiz1e43AAljHjE4FTv4nu9BeHs7g4msQzZt1
7Pgy7A5wOz3UP/GN5QxEStGXYNmMeeKuaLklwxx+I619pMBX83bwlcfi8Xf40BWP
twmeyC11TCXbyyR7sH8nbDUiuZdP4soO3yr2WzTndHYui5UKTPlQ/5/VJLFtnIYw
ZDGYqAy7cg7wHO4khd2HaWuetIW4QIk59rqJX9pmwhDADuKL4XE9O9K3KawFH873
lYv226vc6d7Nc6mBfD0zDPXKImOsvi3padtVgXr2AYI8bBlh7eDp9NUFLITly8Jo
GIr8ABp5u6ZbN8A+16C5yHj5BZArQTe5EjZCS9yLLZQ0a9dXp4ZlEQuHSN9BJo05
CWYbpoj5Hunooh2EwOYfwWQa7kbDlDh/q7+qIqJcqhd+KT9cCHvPH3bsmi8gZMrh
fxyMouSX5LQmwnAgs/r4KUQ/5eCcf81o+SVTi/yvEf1Y9pfIqFna2IU/oREpjwY5
csFrY8Czc/O2E20+dzrCJAKK+Sa1U+WnUGlguqIQFIue/mMcOg6vvRcj8K2uz/7x
jsnfjA0ZsK2mRVpaWN/36RVcNyMQhrM35sVv7mnvwCWAViGBaPOzOzc+bv4fdzq0
8KOz8mx3qqU3ZIRKkw==
=xS4v
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20170125' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- fix reference count handling on fragmentation error, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 448b4482c6 ("net: dsa: Add lockdep class to tx queues to avoid
lockdep splat") removed the netif_device_detach() call done in
dsa_slave_suspend() which is necessary, and paired with a corresponding
netif_device_attach(), bring it back.
Fixes: 448b4482c6 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without TFO, any subsequent connect() call after a successful one returns
-1 EISCONN. The last API update ensured that __inet_stream_connect() can
return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
indicate that the connection is now in progress. Unfortunately since this
function is used both for connect() and sendmsg(), it has the undesired
side effect of making connect() now return -1 EINPROGRESS as well after
a successful call, while at the same time poll() returns POLLOUT. This
can confuse some applications which happen to call connect() and to
check for -1 EISCONN to ensure the connection is usable, and for which
EINPROGRESS indicates a need to poll, causing a loop.
This problem was encountered in haproxy where a call to connect() is
precisely used in certain cases to confirm a connection's readiness.
While arguably haproxy's behaviour should be improved here, it seems
important to aim at a more robust behaviour when the goal of the new
API is to make it easier to implement TFO in existing applications.
This patch simply ensures that we preserve the same semantics as in
the non-TFO case on the connect() syscall when using TFO, while still
returning -1 EINPROGRESS on sendmsg(). For this we simply tell
__inet_stream_connect() whether we're doing a regular connect() or in
fact connecting for a sendmsg() call.
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences are not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created w/o changing other socket calls sequence:
s = socket()
create a new socket
setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
newly introduced sockopt
If set, new functionality described below will be used.
Return ENOTSUPP if TFO is not supported or not enabled in the
kernel.
connect()
With cookie present, return 0 immediately.
With no cookie, initiate 3WHS with TFO cookie-request option and
return -1 with errno = EINPROGRESS.
write()/sendmsg()
With cookie present, send out SYN with data and return the number of
bytes buffered.
With no cookie, and 3WHS not yet completed, return -1 with errno =
EINPROGRESS.
No MSG_FASTOPEN flag is needed.
read()
Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
write() is not called yet.
Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
established but no msg is received yet.
Return number of bytes read if socket is established and there is
msg received.
The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove __sk_dst_reset() in the failure handling because __sk_dst_reset()
will eventually get called when sk is released. No need to handle it in
the protocol specific connect call.
This is also to make the code path consistent with ipv4.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the cookie check logic in tcp_send_syn_data() into a function.
This function will be called else where in later changes.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit().
Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient.
Fixes: 3c7151275c ("tcp: add memory barriers to write space paths")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_add_backlog() can use skb_condense() helper to get better
gains and less SKB_TRUESIZE() magic. This only happens when socket
backlog has to be used.
Some attacks involve specially crafted out of order tiny TCP packets,
clogging the ofo queue of (many) sockets.
Then later, expensive collapse happens, trying to copy all these skbs
into single ones.
This unfortunately does not work if each skb has no neighbor in TCP
sequence order.
By using skb_condense() if the skb could not be coalesced to a prior
one, we defeat these kind of threats, potentially saving 4K per skb
(or more, since this is one page fragment).
A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes
in skb->head, meaning the copy done by skb_condense() is limited to
about 200 bytes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We shuffled some code around and added some new case statements here and
now "res" isn't initialized on all paths.
Fixes: 01fd12bb18 ("tipc: make replicast a user selectable option")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce optional 128-bit action cookie.
Like all other cookie schemes in the networking world (eg in protocols
like http or existing kernel fib protocol field, etc) the idea is to save
user state that when retrieved serves as a correlator. The kernel
_should not_ intepret it. The user can store whatever they wish in the
128 bits.
Sample exercise(showing variable length use of cookie)
.. create an accept action with cookie a1b2c3d4
sudo $TC actions add action ok index 1 cookie a1b2c3d4
.. dump all gact actions..
sudo $TC -s actions ls action gact
action order 0: gact action pass
random type none pass val 0
index 1 ref 1 bind 0 installed 5 sec used 5 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
cookie a1b2c3d4
.. bind the accept action to a filter..
sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1
... send some traffic..
$ ping 127.0.0.1 -c 3
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sctp gso puts segments into skb's frag_list, then processes these
segments in skb_segment. But skb_segment handles them only when gs is
enabled, as it's in the same branch with skb's frags.
Although almost all the NICs support sg other than some old ones, but
since commit 1e16aa3ddf ("net: gso: use feature flag argument in all
protocol gso handlers"), features &= skb->dev->hw_enc_features, and
xfrm_output_gso call skb_segment with features = 0, which means sctp
gso would call skb_segment with sg = 0, and skb_segment would not work
as expected.
This patch is to fix it by setting features param with NETIF_F_SG when
calling skb_segment so that it can go the right branch to process the
skb's frag_list.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_addr_id2transport is a function for sockopt to look up assoc by
address. As the address is from userspace, it can be a v4-mapped v6
address. But in sctp protocol stack, it always handles a v4-mapped
v6 address as a v4 address. So it's necessary to convert it to a v4
address before looking up assoc by address.
This patch is to fix it by calling sctp_verify_addr in which it can do
this conversion before calling sctp_endpoint_lookup_assoc, just like
what sctp_sendmsg and __sctp_connect do for the address from users.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: lwtstate_free+0x18/0x40
[..]
task: ffff88003e372380 task.stack: ffffc900001fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300
[..]
Call Trace:
<IRQ>
free_fib_info_rcu+0x195/0x1a0
? rt_fibinfo_free+0x50/0x50
rcu_process_callbacks+0x2d3/0x850
? rcu_process_callbacks+0x296/0x850
__do_softirq+0xe4/0x4cb
irq_exit+0xb0/0xc0
smp_apic_timer_interrupt+0x3d/0x50
apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8
The problem is after the module for the encap can be unloaded the
corresponding ops is removed and is thus NULL here.
Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so grab the module
reference for the ops on creating lwtunnel state and of course release
the reference when freeing the state.
Fixes: 1104d9ba44 ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tipc_server_stop(), we iterate over the connections with limiting
factor as server's idr_in_use. We ignore the fact that this variable
is decremented in tipc_close_conn(), leading to premature exit.
In this commit, we iterate until the we have no connections left.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tipc_conn_sendmsg(), we first queue the request to the outqueue
followed by the connection state check. If the connection is not
connected, we should not queue this message.
In this commit, we reject the messages if the connection state is
not CF_CONNECTED.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 333f796235 ("tipc: fix a race condition leading to
subscriber refcnt bug") reveals a soft lockup while acquiring
nametbl_lock.
Before commit 333f796235, we call tipc_conn_shutdown() from
tipc_close_conn() in the context of tipc_topsrv_stop(). In that
context, we are allowed to grab the nametbl_lock.
Commit 333f796235, moved tipc_conn_release (renamed from
tipc_conn_shutdown) to the connection refcount cleanup. This allows
either tipc_nametbl_withdraw() or tipc_topsrv_stop() to the cleanup.
Since tipc_exit_net() first calls tipc_topsrv_stop() and then
tipc_nametble_withdraw() increases the chances for the later to
perform the connection cleanup.
The soft lockup occurs in the call chain of tipc_nametbl_withdraw(),
when it performs the tipc_conn_kref_release() as it tries to grab
nametbl_lock again while holding it already.
tipc_nametbl_withdraw() grabs nametbl_lock
tipc_nametbl_remove_publ()
tipc_subscrp_report_overlap()
tipc_subscrp_send_event()
tipc_conn_sendmsg()
<< if (con->flags != CF_CONNECTED) we do conn_put(),
triggering the cleanup as refcount=0. >>
tipc_conn_kref_release
tipc_sock_release
tipc_conn_release
tipc_subscrb_delete
tipc_subscrp_delete
tipc_nametbl_unsubscribe << Soft Lockup >>
The previous changes in this series fixes the race conditions fixed
by commit 333f796235. Hence we can now revert the commit.
Fixes: 333f796235 ("tipc: fix a race condition leading to subscriber refcnt bug")
Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, the generic server framework maintains the connection
id's per subscriber in server's conn_idr. At tipc_close_conn, we
remove the connection id from the server list, but the connection is
valid until we call the refcount cleanup. Hence we have a window
where the server allocates the same connection to an new subscriber
leading to inconsistent reference count. We have another refcount
warning we grab the refcount in tipc_conn_lookup() for connections
with flag with CF_CONNECTED not set. This usually occurs at shutdown
when the we stop the topology server and withdraw TIPC_CFG_SRV
publication thereby triggering a withdraw message to subscribers.
In this commit, we:
1. remove the connection from the server list at recount cleanup.
2. grab the refcount for a connection only if CF_CONNECTED is set.
Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, the subscribers keep track of the subscriptions using
reference count at subscriber level. At subscription cancel or
subscriber delete, we delete the subscription only if the timer
was pending for the subscription. This approach is incorrect as:
1. del_timer() is not SMP safe, if on CPU0 the check for pending
timer returns true but CPU1 might schedule the timer callback
thereby deleting the subscription. Thus when CPU0 is scheduled,
it deletes an invalid subscription.
2. We export tipc_subscrp_report_overlap(), which accesses the
subscription pointer multiple times. Meanwhile the subscription
timer can expire thereby freeing the subscription and we might
continue to access the subscription pointer leading to memory
violations.
In this commit, we introduce subscription refcount to avoid deleting
an invalid subscription.
Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>