"RFC 5961, 4.2. Mitigation" describes a mechanism to request
client to confirm with RST the restart of TCP connection
before resending its SYN. As result, IPVS can see SYNs for
existing connection in CLOSE state. Add check to allow
rescheduling in this state.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Jiri Bohac is reporting for a problem where the attempt
to reschedule existing connection to another real server
needs proper redirect for the conntrack used by the IPVS
connection. For example, when IPVS connection is created
to NAT-ed real server we alter the reply direction of
conntrack. If we later decide to select different real
server we can not alter again the conntrack. And if we
expire the old connection, the new connection is left
without conntrack.
So, the only way to redirect both the IPVS connection and
the Netfilter's conntrack is to drop the SYN packet that
hits existing connection, to wait for the next jiffie
to expire the old connection and its conntrack and to rely
on client's retransmission to create new connection as
usually.
Jiri Bohac provided a fix that drops all SYNs on rescheduling,
I extended his patch to do such drops only for connections
that use conntrack. Here is the original report from Jiri Bohac:
Since commit dc7b3eb900 ("ipvs: Fix reuse connection if real server
is dead"), new connections to dead servers are redistributed
immediately to new servers. The old connection is expired using
ip_vs_conn_expire_now() which sets the connection timer to expire
immediately.
However, before the timer callback, ip_vs_conn_expire(), is run
to clean the connection's conntrack entry, the new redistributed
connection may already be established and its conntrack removed
instead.
Fix this by dropping the first packet of the new connection
instead, like we do when the destination server is not available.
The timer will have deleted the old conntrack entry long before
the first packet of the new connection is retransmitted.
Fixes: dc7b3eb900 ("ipvs: Fix reuse connection if real server is dead")
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
ip_vs_fill_iph_skb_off() may not find an IP header, and gcc has
determined that ip_vs_sip_fill_param() then incorrectly accesses
the protocol fields:
net/netfilter/ipvs/ip_vs_pe_sip.c: In function 'ip_vs_sip_fill_param':
net/netfilter/ipvs/ip_vs_pe_sip.c:76:5: error: 'iph.protocol' may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (iph.protocol != IPPROTO_UDP)
^
net/netfilter/ipvs/ip_vs_pe_sip.c:81:10: error: 'iph.len' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dataoff = iph.len + sizeof(struct udphdr);
^
This adds a check for the ip_vs_fill_iph_skb_off() return code
before looking at the ip header data returned from it.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: b0e010c527 ("ipvs: replace ip_vs_fill_ip4hdr with ip_vs_fill_iph_skb_off")
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
nfacct_filter_alloc doesn't validate the NFACCT_FILTER_MASK and
NFACCT_FILTER_VALUE parameters which can trigger a NULL pointer
dereference. CAP_NET_ADMIN is required to trigger the bug.
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When the send skbuff reaches the end, nlmsg_put and friends returns
-EMSGSIZE but it is silently thrown away in ndo_fdb_dump. It is called
within a for_each_netdev loop and the first fdb entry of a following
netdev could fit in the remaining skbuff. This breaks the mechanism
of cb->args[0] and idx to keep track of the entries that are already
dumped, which results missing entries in bridge fdb show command.
Signed-off-by: Minoura Makoto <minoura@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to update the skb->csum after pulling the skb, otherwise
an unnecessary checksum (re)computation can ocure for IGMP/MLD packets
in the bridge code. Additionally this fixes the following splats for
network devices / bridge ports with support for and enabled RX checksum
offloading:
[...]
[ 43.986968] eth0: hw csum failure
[ 43.990344] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0 #2
[ 43.996193] Hardware name: BCM2709
[ 43.999647] [<800204e0>] (unwind_backtrace) from [<8001cf14>] (show_stack+0x10/0x14)
[ 44.007432] [<8001cf14>] (show_stack) from [<801ab614>] (dump_stack+0x80/0x90)
[ 44.014695] [<801ab614>] (dump_stack) from [<802e4548>] (__skb_checksum_complete+0x6c/0xac)
[ 44.023090] [<802e4548>] (__skb_checksum_complete) from [<803a055c>] (ipv6_mc_validate_checksum+0x104/0x178)
[ 44.032959] [<803a055c>] (ipv6_mc_validate_checksum) from [<802e111c>] (skb_checksum_trimmed+0x130/0x188)
[ 44.042565] [<802e111c>] (skb_checksum_trimmed) from [<803a06e8>] (ipv6_mc_check_mld+0x118/0x338)
[ 44.051501] [<803a06e8>] (ipv6_mc_check_mld) from [<803b2c98>] (br_multicast_rcv+0x5dc/0xd00)
[ 44.060077] [<803b2c98>] (br_multicast_rcv) from [<803aa510>] (br_handle_frame_finish+0xac/0x51c)
[...]
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Álvaro Fernández Rojas <noltari@gmail.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
* stop critical protocol session on disconnect to avoid
it getting stuck
* wext: fix two RTNL message ordering issues
* fix an uninitialized value (found by KASAN)
* fix an out-of-bounds access (also found by KASAN)
* clear connection keys when freeing them in all cases
(IBSS, all other places already did so)
* fix expected throughput unit to get consistent values
* set default TX aggregation timeout to 0 in minstrel
to avoid (really just hide) issues and perform better
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJWzCthAAoJEGt7eEactAAdLjEP/38ujypJkVlwZzm9L8MbSVPk
qPMmRBshVX89xOXtwJLJHOGCRY6/pfe1LKThodOp+hj+NH7OLLuAGFDScjr7eEjG
bAQrXZ8dDIo54uri4EF4jBfkqUsePs+d4VpLgdJ2Gdh2UgIdDlUcggiEyOSicc27
ZP3gRQzad8tP0O+MuqiAjwJE4C3AUzzj+8TXJJXjrT3RvDpvZfA3MClfiQVhRM+4
3qtFhqFFGfIjTSHCoGr+0RYhLJwfjalIyoENR/kYizjHXU4YtZcu7ihdixKmG3Qb
SUIlNl60WSyoh2+PXx3wp0gfTc0BrOVIpB+TQPouRu5D+tLRiPwlieVrlfJyIUJ5
ccU1RWoTZn+3OVSjNF+9IGbpL43cjaJpZUMU/mQES7+z+uEwlzdcnwDScG7sMMC+
KIyKZ4v+2sBPKYYKAzGxHb2lfTMyp9sCs25U30yqLX3RSNtuxJbDwuKrwbKa1CMx
WEjImI+iSApjTch0THAxV8/sO/ZneuRZP76PXmSDm7Mzf9qob8fnpCjFsHTypDpx
eWy8AHPhIhb3ni6Cm3TfsGLjGE4KGwD/lEwmAles2Cj48N4b2UwRjMrJ5GRxuV00
9kh2MvFEmSYBU4dw6LQmemjxNNIa792+JiS4AKJXBqkPK1H1YFvq7i8Hl9a1tQ0P
SDgbH5x6AYvmTRcD97Ph
=01n9
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Another small set of fixes:
* stop critical protocol session on disconnect to avoid
it getting stuck
* wext: fix two RTNL message ordering issues
* fix an uninitialized value (found by KASAN)
* fix an out-of-bounds access (also found by KASAN)
* clear connection keys when freeing them in all cases
(IBSS, all other places already did so)
* fix expected throughput unit to get consistent values
* set default TX aggregation timeout to 0 in minstrel
to avoid (really just hide) issues and perform better
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The fix in 35e2d1152b ("tunnels: Allow IPv6 UDP checksums to be correctly
controlled.") changed behavior for bpf_set_tunnel_key() when in use with
IPv6 and thus uncovered a bug that TUNNEL_CSUM needed to be set but wasn't.
As a result, the stack dropped ingress vxlan IPv6 packets, that have been
sent via eBPF through collect meta data mode due to checksum now being zero.
Since after LCO, we enable IPv4 checksum by default, so make that analogous
and only provide a flag BPF_F_ZERO_CSUM_TX for the user to turn it off in
IPv4 case.
Fixes: 35e2d1152b ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.")
Fixes: c6c3345407 ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL
skbs down. This can easily happen with UDP+IPv4 sockets with the first
MSG_MORE write smaller than the MTU, second write is a sendfile.
Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path,
were we calculate the checksum manually during copying.
Commit d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked
sockets") started to exposes this bug.
Fixes: d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")
Reported-by: Jiri Benc <jbenc@redhat.com>
Cc: Jiri Benc <jbenc@redhat.com>
Reported-by: Wakko Warner <wakko@animx.eu.org>
Cc: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPCB may contain data from previous layers (in the observed case the
qdisc layer). In the observed scenario, the data was misinterpreted as
ip header options, which later caused the ihl to be set to an invalid
value (<5). This resulted in an infinite loop in the mips implementation
of ip_fast_csum.
This patch clears IPCB(skb)->opt before dst_link_failure can be called for
various types of tunnels. This change only applies to encapsulated ipv4
packets.
The code introduced in 11c21a30 which clears all of IPCB has been removed
to be consistent with these changes, and instead the opt field is cleared
unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to
SIT, GRE, and IPIP tunnels.
The relevant vti, l2tp, and pptp functions already contain similar code for
clearing the IPCB.
Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it's converted into msecs, thus HZ=1000 intact.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: David S. Miller <davem@davemloft.net>
When user-space has started a critical protocol session and a disconnect
event occurs, the rdev::crit_prot_nlportid remains set. This caused a
subsequent NL80211_CMD_CRIT_PROTO_START to fail (-EBUSY). Fix this by
clearing the rdev attribute and call .crit_proto_stop() callback upon
disconnect event.
Reviewed-by: Hante Meuleman <meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com>
Signed-off-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The connection keys are zeroed out in all other cases except this
one. Let's fix the last one as well.
Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com>
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The value 5000 was put here with the addition of the timeout field to
ieee80211_start_tx_ba_session. It was originally added in mac80211 to
save resources for drivers like iwlwifi, which only supports a limited
number of concurrent aggregation sessions.
Since iwlwifi does not use minstrel_ht and other drivers don't need
this, 0 is a better default - especially since there have been
recent reports of aggregation setup related issues reproduced with
ath9k. This should improve stability without causing any adverse
effects.
Cc: stable@vger.kernel.org
Acked-by: Avery Pennarun <apenwarr@gmail.com>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Johan Hedberg says:
====================
pull request: bluetooth 2016-02-20
Here's an important patch for 4.5 which fixes potential invalid pointer
access when processing completed Bluetooth HCI commands.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Vyukov noted recently that the sctp_port_hashtable had an error in
its size computation, observing that the current method never guaranteed
that the hashsize (measured in number of entries) would be a power of two,
which the input hash function for that table requires. The root cause of
the problem is that two values need to be computed (one, the allocation
order of the storage requries, as passed to __get_free_pages, and two the
number of entries for the hash table). Both need to be ^2, but for
different reasons, and the existing code is simply computing one order
value, and using it as the basis for both, which is wrong (i.e. it assumes
that ((1<<order)*PAGE_SIZE)/sizeof(bucket) is still ^2 when its not).
To fix this, we change the logic slightly. We start by computing a goal
allocation order (which is limited by the maximum size hash table we want
to support. Then we attempt to allocate that size table, decreasing the
order until a successful allocation is made. Then, with the resultant
successful order we compute the number of buckets that hash table supports,
which we then round down to the nearest power of two, giving us the number
of entries the table actually supports.
I've tested this locally here, using non-debug and spinlock-debug kernels,
and the number of entries in the hashtable consistently work out to be
powers of two in all cases.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
CC: Dmitry Vyukov <dvyukov@google.com>
CC: Vladislav Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 44d2713774 ("Bluetooth: Compress the size of struct
hci_ctrl") we squashed down the size of the structure by using a union
with the assumption that all users would use the flag to determine
whether we had a req_complete or a req_complete_skb.
Unfortunately we had a case in hci_req_cmd_complete() where we weren't
looking at the flag. This can result in a situation where we might be
storing a hci_req_complete_skb_t in a hci_req_complete_t variable, or
vice versa.
During some testing I found at least one case where the function
hci_req_sync_complete() was called improperly because the kernel thought
that it didn't require an SKB. Looking through the stack in kgdb I
found that it was called by hci_event_packet() and that
hci_event_packet() had both of its locals "req_complete" and
"req_complete_skb" pointing to the same place: both to
hci_req_sync_complete().
Let's make sure we always check the flag.
For more details on debugging done, see <http://crbug.com/588288>.
Fixes: 44d2713774 ("Bluetooth: Compress the size of struct hci_ctrl")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The unix_stream_read_generic function tries to use a continue statement
to restart the receive loop after waiting for a message. This may not
work as intended as the caller might use a recvmsg call to peek at
control messages without specifying a message buffer. If this was the
case, the continue will cause the function to return without an error
and without the credential information if the function had to wait for a
message while it had returned with the credentials otherwise. Change to
using goto to restart the loop without checking the condition first in
this case so that credentials are returned either way.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The value passed by unix_diag_get_exact to unix_lookup_by_ino has type
__u32, but unix_lookup_by_ino's argument ino has type int, which is not
a problem yet.
However, when ino is compared with sock_i_ino return value of type
unsigned long, ino is sign extended to signed long, and this results
to incorrect comparison on 64-bit architectures for inode numbers
greater than INT_MAX.
This bug was found by strace test suite.
Fixes: 5d3cae8bc3 ("unix_diag: Dumping exact socket core")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the commit 35e2d1152b ("tunnels: Allow IPv6 UDP checksums to be
correctly controlled.") changed the default xmit checksum setting
for lwt vxlan/geneve ipv6 tunnels, so that now the checksum is not
set into external UDP header.
This commit changes the rx checksum setting for both lwt vxlan/geneve
devices created by openvswitch accordingly, so that lwt over ipv6
tunnel pairs are again able to communicate with default values.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
access - it was triggered when removing an object from a list
after it was already free'd due to bad reference counting.
This misbehaviour existed for both the gw_node and the
orig_node_vlan object and has been fixed by Sven Eckelmann.
The last patch fixes our interface feasibility check and prevents
it from looping indefinitely when two net_device objects
reference each other via iflink index (i.e. veth pair), by
Andrew Lunn
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWwy9MAAoJENpFlCjNi1MRu+cQAMF7zYIlWggCWUEqaEIId4qi
daGly3DNrXWH8FJta+NR399xXlTXNoyS8uZYt9YUZrXlwXDbDNBHJ8Jcmc5sslZr
/06os5KUvNLlasup1GLBbynKUCTBSjVyo3DBwVCsiCL/iHdN1tFZXg3c2ET2U7Tk
4bFZO4h1JkpwCYeCjCJQSbPrjnAXYdxV01iarDJZgAoc290f92Ob4mTuTrQr1Vzi
EH2fyrd3m/qmRb2YBj7jNdVeD2Q4XyMKY4vHHeBq9JR/kwINwUg38+gnvcwK+KeD
VJilETdC686ZJKBSVvECQZqeXr4HaCINFNuepZLsg128IQ2ZjqPYC1yTq/qc1VF+
YGCunXhbbNgxeE1+QgXssgRAiUBx43m2osQegpHkyCzK8GPbv1gRIbTYaHgmqa4A
Pn6WHdV2kcfGRFNOASST5+eWPe9ol8RvAbXT0yD2+6CNz6aijfbGITUTlB24X+f+
6CsS3uCBsH8FI3hqjgULTqcyF14QbnLZ9AnkxYwt8B0Ge62LEfdwuXAh96mIhfyf
cGi7DBnHAYapOPzz0jSBhW5myiPOioNLJfzokZrBq+RuRBYUIUvzGmz8QJQpy/GO
EdnMOE/uWeZeSCsAGTE00pjqfHliK/vu4Wh0puGFWTQGCBbKWP5p4eNqRh4p+2GC
Sg1f1EKDt2PRw9aZHNHm
=CvI3
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Two of the fixes included in this patchset prevent wrong memory
access - it was triggered when removing an object from a list
after it was already free'd due to bad reference counting.
This misbehaviour existed for both the gw_node and the
orig_node_vlan object and has been fixed by Sven Eckelmann.
The last patch fixes our interface feasibility check and prevents
it from looping indefinitely when two net_device objects
reference each other via iflink index (i.e. veth pair), by
Andrew Lunn
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cfrfml_receive() function might return positive value EPROTO
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The atalk_sendmsg() function might return wrong value ENETUNREACH
instead of -ENETUNREACH.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My implementation around IFF_NO_QUEUE driver flag assumed that leaving
tx_queue_len untouched (specifically: not setting it to zero) by drivers
would make it possible to assign a regular qdisc to them without having
to worry about setting tx_queue_len to a useful value. This was only
partially true: I overlooked that some drivers don't call ether_setup()
and therefore not initialize tx_queue_len to the default value of 1000.
Consequently, removing the workarounds in place for that case in qdisc
implementations which cared about it (namely, pfifo, bfifo, gred, htb,
plug and sfb) leads to problems with these specific interface types and
qdiscs.
Luckily, there's already a sanitization point for drivers setting
tx_queue_len to zero, which can be reused to assign the fallback value
most qdisc implementations used, which is 1.
Fixes: 348e3435cb ("net: sched: drop all special handling of tx_queue_len == 0")
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre
as it modifies the skb on xmit.
Also, clean up whitespace in ipgre_tap_setup when we're already touching it.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya reported following lockdep splat:
kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0: (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1: (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2: (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.
We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.
Fix this issue by checking and removing route cache when we get route.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
actions could change the etherproto in particular with ethernet
tunnelled data. Typically such actions, after peeling the outer header,
will ask for the packet to be reclassified. We then need to restart
the classification with the new proto header.
Example setup used to catch this:
sudo tc qdisc add dev $ETH ingress
sudo $TC filter add dev $ETH parent ffff: pref 1 protocol 802.1Q \
u32 match u32 0 0 flowid 1:1 \
action vlan pop reclassify
Fixes: 3b3ae88026 ("net: sched: consolidate tc_classify{,_compat}")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With commit 0071f56e46 ("dsa: Register netdev before phy"), we are now trying
to free a network device that has been previously registered, and in case of
errors, this will make us hit the BUG_ON(dev->reg_state != NETREG_UNREGISTERED)
condition.
Fix this by adding a missing unregister_netdev() before free_netdev().
Fixes: 0071f56e46 ("dsa: Register netdev before phy")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A previous commit (33f72e6) added notification via netlink for tunnels
when created/modified/deleted. If the notification returned an error,
this error was returned from the tunnel function. If there were no
listeners, the error code ESRCH was returned, even though having no
listeners is not an error. Other calls to this and other similar
notification functions either ignore the error code, or filter ESRCH.
This patch checks for ESRCH and does not flag this as an error.
Reviewed-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
If tcp_v4_inbound_md5_hash() returns an error, we must release
the refcount on the request socket, not on the listener.
The bug was added for IPv4 only.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some cases where rtt_us derives from deltas of jiffies,
instead of using usec timestamps.
Since we want to track minimal rtt, better to assume a delta of 0 jiffie
might be in fact be very close to 1 jiffie.
It is kind of sad jiffies_to_usecs(1) calls a function instead of simply
using a constant.
Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The phy has not been initialized, disconnecting it in the error
path results in a NULL pointer exception. Drop the phy_disconnect
from the error path.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 5266698661 ("tipc: let broadcast packet reception
use new link receive function") we introduced a new per-node
broadcast reception link instance. This link is created at the
moment the node itself is created. Unfortunately, the allocation
is done after the node instance has already been added to the node
lookup hash table. This creates a potential race condition, where
arriving broadcast packets are able to find and access the node
before it has been fully initialized, and before the above mentioned
link has been created. The result is occasional crashes in the function
tipc_bcast_rcv(), which is trying to access the not-yet existing link.
We fix this by deferring the addition of the node instance until after
it has been fully initialized in the function tipc_node_create().
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent change to the mdb code confused the compiler to the point
where it did not realize that the port-group returned from
br_mdb_add_group() is always valid when the function returns a nonzero
return value, so we get a spurious warning:
net/bridge/br_mdb.c: In function 'br_mdb_add':
net/bridge/br_mdb.c:542:4: error: 'pg' may be used uninitialized in this function [-Werror=maybe-uninitialized]
__br_mdb_notify(dev, entry, RTM_NEWMDB, pg);
Slightly rearranging the code in br_mdb_add_group() makes the problem
go away, as gcc is clever enough to see that both functions check
for 'ret != 0'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9e8430f8d6 ("bridge: mdb: Passing the port-group pointer to br_mdb module")
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects the unaligned accesses seen on GRE TEB tunnels when
generating hash keys. Specifically what this patch does is make it so that
we force the use of skb_copy_bits when the GRE inner headers will be
unaligned due to NET_IP_ALIGNED being a non-zero value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contain a rather large batch for your net that
includes accumulated bugfixes, they are:
1) Run conntrack cleanup from workqueue process context to avoid hitting
soft lockup via watchdog for large tables. This is required by the
IPv6 masquerading extension. From Florian Westphal.
2) Use original skbuff from nfnetlink batch when calling netlink_ack()
on error since this needs to access the skb->sk pointer.
3) Incremental fix on top of recent Sasha Levin's lock fix for conntrack
resizing.
4) Fix several problems in nfnetlink batch message header sanitization
and error handling, from Phil Turnbull.
5) Select NF_DUP_IPV6 based on CONFIG_IPV6, from Arnd Bergmann.
6) Fix wrong signess in return values on nf_tables counter expression,
from Anton Protopopov.
Due to the NetDev 1.1 organization burden, I had no chance to pass up
this to you any sooner in this release cycle, sorry about that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The unix_dgram_sendmsg routine use the following test
if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
to determine if sk and other are in an n:1 association (either
established via connect or by using sendto to send messages to an
unrelated socket identified by address). This isn't correct as the
specified address could have been bound to the sending socket itself or
because this socket could have been connected to itself by the time of
the unix_peer_get but disconnected before the unix_state_lock(other). In
both cases, the if-block would be entered despite other == sk which
might either block the sender unintentionally or lead to trying to unlock
the same spin lock twice for a non-blocking send. Add a other != sk
check to guard against this.
Fixes: 7d267278a9 ("unix: avoid use-after-free in ep_remove_wait_queue")
Reported-By: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Tested-by: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The present unix_stream_read_generic contains various code sequences of
the form
err = -EDISASTER;
if (<test>)
goto out;
This has the unfortunate side effect of possibly causing the error code
to bleed through to the final
out:
return copied ? : err;
and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at
http://pad.lv/1540731
Change it such that err is only set if an error condition was detected.
Fixes: 3822b5c2fc ("af_unix: Revert 'lock_interruptible' in stream receive code")
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
batman-adv checks in different situation if a new device is already on top
of a different batman-adv device. This is done by getting the iflink of a
device and all its parent. It assumes that this iflink is always a parent
device in an acyclic graph. But this assumption is broken by devices like
veth which are actually a pair of two devices linked to each other. The
recursive check would therefore get veth0 when calling dev_get_iflink on
veth1. And it gets veth0 when calling dev_get_iflink with veth1.
Creating a veth pair and loading batman-adv freezes parts of the system
ip link add veth0 type veth peer name veth1
modprobe batman-adv
An RCU stall will be detected on the system which cannot be fixed.
INFO: rcu_sched self-detected stall on CPU
1: (5264 ticks this GP) idle=3e9/140000000000001/0
softirq=144683/144686 fqs=5249
(t=5250 jiffies g=46 c=45 q=43)
Task dump for CPU 1:
insmod R running task 0 247 245 0x00000008
ffffffff8151f140 ffffffff8107888e ffff88000fd141c0 ffffffff8151f140
0000000000000000 ffffffff81552df0 ffffffff8107b420 0000000000000001
ffff88000e3fa700 ffffffff81540b00 ffffffff8107d667 0000000000000001
Call Trace:
<IRQ> [<ffffffff8107888e>] ? rcu_dump_cpu_stacks+0x7e/0xd0
[<ffffffff8107b420>] ? rcu_check_callbacks+0x3f0/0x6b0
[<ffffffff8107d667>] ? hrtimer_run_queues+0x47/0x180
[<ffffffff8107cf9d>] ? update_process_times+0x2d/0x50
[<ffffffff810873fb>] ? tick_handle_periodic+0x1b/0x60
[<ffffffff810290ae>] ? smp_trace_apic_timer_interrupt+0x5e/0x90
[<ffffffff813bbae2>] ? apic_timer_interrupt+0x82/0x90
<EOI> [<ffffffff812c3fd7>] ? __dev_get_by_index+0x37/0x40
[<ffffffffa0031f3e>] ? batadv_hard_if_event+0xee/0x3a0 [batman_adv]
[<ffffffff812c5801>] ? register_netdevice_notifier+0x81/0x1a0
[...]
This can be avoided by checking if two devices are each others parent and
stopping the check in this situation.
Fixes: b7eddd0b39 ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[sven@narfation.org: rewritten description, extracted fix]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_orig_node_vlan reference counter in batadv_tt_global_size_mod
can only be reduced when the list entry was actually removed. Otherwise the
reference counter may reach zero when batadv_tt_global_size_mod is called
from two different contexts for the same orig_node_vlan but only one
context is actually removing the entry from the list.
The release function for this orig_node_vlan is not called inside the
vlan_list_lock spinlock protected region because the function
batadv_tt_global_size_mod still holds a orig_node_vlan reference for the
object pointer on the stack. Thus the actual release function (when
required) will be called only at the end of the function.
Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
The batadv_gw_node reference counter in batadv_gw_node_update can only be
reduced when the list entry was actually removed. Otherwise the reference
counter may reach zero when batadv_gw_node_update is called from two
different contexts for the same gw_node but only one context is actually
removing the entry from the list.
The release function for this gw_node is not called inside the list_lock
spinlock protected region because the function batadv_gw_node_update still
holds a gw_node reference for the object pointer on the stack. Thus the
actual release function (when required) will be called only at the end of
the function.
Fixes: bd3524c14b ("batman-adv: remove obsolete deleted attribute for gateway node")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.
Callers are responsible for the freeing.
Many thanks to Dmitry for the report and diagnostic.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix BPF handling of branch offset adjustmnets on backjumps, from
Daniel Borkmann.
2) Make sure selinux knows about SOCK_DESTROY netlink messages, from
Lorenzo Colitti.
3) Fix openvswitch tunnel mtu regression, from David Wragg.
4) Fix ICMP handling of TCP sockets in syn_recv state, from Eric
Dumazet.
5) Fix SCTP user hmacid byte ordering bug, from Xin Long.
6) Fix recursive locking in ipv6 addrconf, from Subash Abhinov
Kasiviswanathan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
bpf: fix branch offset adjustment on backjumps after patching ctx expansion
vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
geneve: Relax MTU constraints
vxlan: Relax MTU constraints
flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
of: of_mdio: Add marvell, 88e1145 to whitelist of PHY compatibilities.
selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tables
sctp: translate network order to host order when users get a hmacid
enic: increment devcmd2 result ring in case of timeout
tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs
net:Add sysctl_max_skb_frags
tcp: do not drop syn_recv on all icmp reports
ipv6: fix a lockdep splat
unix: correctly track in-flight fds in sending process user_struct
update be2net maintainers' email addresses
dwc_eth_qos: Reset hardware before PHY start
ipv6: addrconf: Fix recursive spin lock call
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could
transmit vxlan packets of any size, constrained only by the ability to
send out the resulting packets. 4.3 introduced netdevs corresponding
to tunnel vports. These netdevs have an MTU, which limits the size of
a packet that can be successfully encapsulated. The default MTU
values are low (1500 or less), which is awkwardly small in the context
of physical networks supporting jumbo frames, and leads to a
conspicuous change in behaviour for userspace.
Instead, set the MTU on openvswitch-created netdevs to be the relevant
maximum (i.e. the maximum IP packet size minus any relevant overhead),
effectively restoring the behaviour prior to 4.3.
Signed-off-by: David Wragg <david@weave.works>
Signed-off-by: David S. Miller <davem@davemloft.net>