-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAlnohyMTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRAe/sog7Dgc9jRFCACmSBz2H1yW5VfjrN3ZgoEoUo8P5qCZ
7mXvTstTASF20T7kFqNS2K94CMVjBYRUIHvihp0LmPmPAmpmQRNedAssWuZpMflw
xhaH8T8YWbyzHU2MBZP9WtTjrTilPEPcjvFSgYw6wpHW7VC3S/ffEN8Mj3ymR6bW
KRw7gemXpvYUxPrGGCzyvFy4G+KptyPcAD6I8ceDUtdkwK4TXGYqEAoSqxtAIUHX
dgRLzheOv8sDh+dM+ZNpH6UUkzK0gVHQ40lgXydZazEwPPq9zdj2Ec/6qpR5fkUH
dnCIhzLNiCJOMGSfzkf0Esef1zBpU8oJmILbgf97CdayaW9cA1EUleGW
=HYTx
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-4.14-20171019' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2017-10-19
this is a pull request of 11 patches for the upcoming 4.14 release.
There are 6 patches by ZHU Yi for the flexcan driver, that work around
the CAN error handling state transition problems found in various
incarnations of the flexcan IP core.
The patch by Colin Ian King fixes a potential NULL pointer deref in the
CAN broad cast manager (bcm). One patch by me replaces a direct deref of a RCU
protected pointer by rcu_access_pointer. My second patch adds missing
OOM error handling in af_can. A patch by Stefan Mätje for the esd_usb2
driver fixes the dlc in received RTR frames. And the last patch is by
Wolfgang Grandegger, it fixes a busy loop in the gs_usb driver in case
it runs out of TX contexts.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Without the patch, when hvs_open_connection() hasn't completely established
a connection (e.g. it has changed sk->sk_state to SS_CONNECTED, but hasn't
inserted the sock into the connected queue), vsock_stream_connect() may see
the sk_state change and return the connection to the userspace, and next
when the userspace closes the connection quickly, hvs_release() may not see
the connection in the connected queue; finally hvs_open_connection()
inserts the connection into the queue, but we won't be able to purge the
connection for ever.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Cc: Marcelo Cerri <marcelo.cerri@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The length of GVI (GetVersionInfo) response packet should be 40 instead
of 36. This issue was found from /sys/kernel/debug/ncsi/eth0/stats.
# ethtool --ncsi eth0 swstats
:
RESPONSE OK TIMEOUT ERROR
=======================================
GVI 0 0 2
With this applied, no error reported on GVI response packets:
# ethtool --ncsi eth0 swstats
:
RESPONSE OK TIMEOUT ERROR
=======================================
GVI 2 0 0
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NCSI channel has been configured to provide service if its link
monitor timer is enabled, regardless of its state (inactive or active).
So the timeout event on the link monitor indicates the out-of-service
on that channel, for which a failover is needed.
This sets NCSI_DEV_RESHUFFLE flag to enforce failover on link monitor
timeout, regardless the channel's original state (inactive or active).
Also, the link is put into "down" state to give the failing channel
lowest priority when selecting for the active channel. The state of
failing channel should be set to active in order for deinitialization
and failover to be done.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there are no NCSI channels probed, HWA (Hardware Arbitration)
mode is enabled. It's not correct because HWA depends on the fact:
NCSI channels exist and all of them support HWA mode. This disables
HWA when no channels are probed.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ncsi_channel_monitor() misses stopping the channel monitor in several
places that it should, causing a WARN_ON_ONCE() to trigger when the
monitor is re-started later, eg:
[ 459.040000] WARNING: CPU: 0 PID: 1093 at net/ncsi/ncsi-manage.c:269 ncsi_start_channel_monitor+0x7c/0x90
[ 459.040000] CPU: 0 PID: 1093 Comm: kworker/0:3 Not tainted 4.10.17-gaca2fdd #140
[ 459.040000] Hardware name: ASpeed SoC
[ 459.040000] Workqueue: events ncsi_dev_work
[ 459.040000] [<80010094>] (unwind_backtrace) from [<8000d950>] (show_stack+0x20/0x24)
[ 459.040000] [<8000d950>] (show_stack) from [<801dbf70>] (dump_stack+0x20/0x28)
[ 459.040000] [<801dbf70>] (dump_stack) from [<80018d7c>] (__warn+0xe0/0x108)
[ 459.040000] [<80018d7c>] (__warn) from [<80018e70>] (warn_slowpath_null+0x30/0x38)
[ 459.040000] [<80018e70>] (warn_slowpath_null) from [<803f6a08>] (ncsi_start_channel_monitor+0x7c/0x90)
[ 459.040000] [<803f6a08>] (ncsi_start_channel_monitor) from [<803f7664>] (ncsi_configure_channel+0xdc/0x5fc)
[ 459.040000] [<803f7664>] (ncsi_configure_channel) from [<803f8160>] (ncsi_dev_work+0xac/0x474)
[ 459.040000] [<803f8160>] (ncsi_dev_work) from [<8002d244>] (process_one_work+0x1e0/0x450)
[ 459.040000] [<8002d244>] (process_one_work) from [<8002d510>] (worker_thread+0x5c/0x570)
[ 459.040000] [<8002d510>] (worker_thread) from [<80033614>] (kthread+0x124/0x164)
[ 459.040000] [<80033614>] (kthread) from [<8000a5e8>] (ret_from_fork+0x14/0x2c)
This also updates the monitor instead of just returning if
ncsi_xmit_cmd() fails to send the get-link-status command so that the
monitor properly times out.
Fixes: e6f44ed6d0 "net/ncsi: Package and channel management"
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct the value of the HNCDSC AEN packet.
Fixes: 7a82ecf4cf "net/ncsi: NCSI AEN packet handler"
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller got crashes in packet_getsockopt() processing
PACKET_ROLLOVER_STATS command while another thread was managing
to change po->rollover
Using RCU will fix this bug. We might later add proper RCU annotations
for sparse sake.
In v2: I replaced kfree(rollover) in fanout_add() to kfree_rcu()
variant, as spotted by John.
Fixes: a9b6391814 ("packet: rollover statistics")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: John Sperbeck <jsperbeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb->mark field is a union with reserved_tailroom which is used
in the TCP code paths from stream memory allocation. Allowing SK_SKB
programs to set this field creates a conflict with future code
optimizations, such as "gifting" the skb to the egress path instead
of creating a new skb and doing a memcpy.
Because we do not have a released version of SK_SKB yet lets just
remove it for now. A more appropriate scratch pad to use at the
socket layer is dev_scratch, but lets add that in future kernels
when needed.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
SK_SKB BPF programs are run from the socket/tcp context but early in
the stack before much of the TCP metadata is needed in tcp_skb_cb. So
we can use some unused fields to place BPF metadata needed for SK_SKB
programs when implementing the redirect function.
This allows us to drop the preempt disable logic. It does however
require an API change so sk_redirect_map() has been updated to
additionally provide ctx_ptr to skb. Note, we do however continue to
disable/enable preemption around actual BPF program running to account
for map updates.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sctp processes icmp redirect packet in sctp_icmp_redirect where
it calls sctp_transport_dst_check in which tp->dst can be released.
The problem is before calling sctp_transport_dst_check, it doesn't
check sock_owned_by_user, which means tp->dst could be freed while
a process is accessing it with owning the socket.
An use-after-free issue could be triggered by this.
This patch is to fix it by checking sock_owned_by_user before calling
sctp_transport_dst_check in sctp_icmp_redirect, so that it would not
release tp->dst if users still hold sock lock.
Besides, the same issue fixed in commit 45caeaa5ac ("dccp/tcp: fix
routing redirect race") on sctp also needs this check.
Fixes: 55be7a9c60 ("ipv4: Add redirect support to all protocol icmp error handlers")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The transport may need to flush transport connect and receive tasks
that are running on rpciod. In order to do so safely, we need to
ensure that the caller of cancel_work_sync() etc is not itself
running on rpciod.
Do so by running the destroy task from the system workqueue.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Now when peeling off an association to the sock in another netns, all
transports in this assoc are not to be rehashed and keep use the old
key in hashtable.
As a transport uses sk->net as the hash key to insert into hashtable,
it would miss removing these transports from hashtable due to the new
netns when closing the sock and all transports are being freeed, then
later an use-after-free issue could be caused when looking up an asoc
and dereferencing those transports.
This is a very old issue since very beginning, ChunYu found it with
syzkaller fuzz testing with this series:
socket$inet6_sctp()
bind$inet6()
sendto$inet6()
unshare(0x40000000)
getsockopt$inet_sctp6_SCTP_GET_ASSOC_ID_LIST()
getsockopt$inet_sctp6_SCTP_SOCKOPT_PEELOFF()
This patch is to block this call when peeling one assoc off from one
netns to another one, so that the netns of all transport would not
go out-sync with the key in hashtable.
Note that this patch didn't fix it by rehashing transports, as it's
difficult to handle the situation when the tuple is already in use
in the new netns. Besides, no one would like to peel off one assoc
to another netns, considering ipaddrs, ifaces, etc. are usually
different.
Reported-by: ChunYu Wang <chunwang@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the missing check and error handling for out-of-memory
situations, when kzalloc cannot allocate memory.
Fixes: cb5635a367 ("can: complete initial namespace support")
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
"proto_tab" is a RCU protected array, when directly accessing the array,
sparse throws these warnings:
CHECK /srv/work/frogger/socketcan/linux/net/can/af_can.c
net/can/af_can.c:115:14: error: incompatible types in comparison expression (different address spaces)
net/can/af_can.c:795:17: error: incompatible types in comparison expression (different address spaces)
net/can/af_can.c:816:9: error: incompatible types in comparison expression (different address spaces)
This patch fixes the problem by using rcu_access_pointer() and
annotating "proto_tab" array as __rcu.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The assignment of net via call sock_net will dereference sk. This
is performed before a sanity null check on sk, so there could be
a potential null dereference on the sock_net call if sk is null.
Fix this by assigning net after the sk null check. Also replace
the sk == NULL with the more usual !sk idiom.
Detected by CoverityScan CID#1431862 ("Dereference before null check")
Fixes: 384317ef41 ("can: network namespace support for CAN_BCM protocol")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
It seems that it's possible to toggle NETLINK_F_EXT_ACK
through setsockopt() while another thread/CPU is building
a message inside netlink_ack(), which could then trigger
the WARN_ON()s I added since if it goes from being turned
off to being turned on between allocating and filling the
message, the skb could end up being too small.
Avoid this whole situation by storing the value of this
flag in a separate variable and using that throughout the
function instead.
Fixes: 2d4bc93368 ("netlink: extended ACK reporting")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate KEY_FLAG_INSTANTIATED, KEY_FLAG_NEGATIVE and the rejection
error into one field such that:
(1) The instantiation state can be modified/read atomically.
(2) The error can be accessed atomically with the state.
(3) The error isn't stored unioned with the payload pointers.
This deals with the problem that the state is spread over three different
objects (two bits and a separate variable) and reading or updating them
atomically isn't practical, given that not only can uninstantiated keys
change into instantiated or rejected keys, but rejected keys can also turn
into instantiated keys - and someone accessing the key might not be using
any locking.
The main side effect of this problem is that what was held in the payload
may change, depending on the state. For instance, you might observe the
key to be in the rejected state. You then read the cached error, but if
the key semaphore wasn't locked, the key might've become instantiated
between the two reads - and you might now have something in hand that isn't
actually an error code.
The state is now KEY_IS_UNINSTANTIATED, KEY_IS_POSITIVE or a negative error
code if the key is negatively instantiated. The key_is_instantiated()
function is replaced with key_is_positive() to avoid confusion as negative
keys are also 'instantiated'.
Additionally, barriering is included:
(1) Order payload-set before state-set during instantiation.
(2) Order state-read before payload-read when using the key.
Further separate barriering is necessary if RCU is being used to access the
payload content after reading the payload pointers.
Fixes: 146aa8b145 ("KEYS: Merge the type-specific data with the payload data")
Cc: stable@vger.kernel.org # v4.4+
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Ben reported that when the user rate mask is rejected for not
matching any basic rate, the driver had already been configured.
This is clearly an oversight in my original change, fix this by
doing the validation before calling the driver.
Reported-by: Ben Greear <greearb@candelatech.com>
Fixes: e8e4f5280d ("mac80211: reject/clear user rate mask if not usable")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we try to connect while already connected/connecting, but
this fails, we set ssid_len=0 but leave current_bss hanging,
leading to errors.
Check all of this better, first of all ensuring that we can't
try to connect to a different SSID while connected/ing; ensure
that prev_bssid is set for re-association attempts even in the
case of the driver supporting the connect() method, and don't
reset ssid_len in the failure cases.
While at it, also reset ssid_len while disconnecting unless we
were connected and expect a disconnected event, and warn on a
successful connection without ssid_len being set.
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Otherwise we risk leaking information via timing side channel.
Fixes: fdf7cb4185 ("mac80211: accept key reinstall without changing anything")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlnkt7YACgkQa3t4Rpy0
AB3UZg/8CsZr3SeDM1CxLTmDxJ734qq2kcoBo8nJgJqSwIWE5goGXCJSIGQw0o9O
X8IF/xKj+Vgqfmxs73dnZM15klkVc9HBOdMdnjayWZZa5h6wKFFVNdjkMsrFxdV8
iugcCaUYCg1/Sc/qq4hBWLe5k3mQa5fB1clr7qykYEfIuY2ORpEJ+P2Uqjb5/RKx
ugLO1qdCVMk7kqB/Fc8XKrPdiCgmfaMdz9lTqm1yXDThQe0rNxmWR339YpleqEiA
lWIOZ53OlHZ3LTz8YDPcNA7mf5KtK1qQc60kZLygDcy5swvKaSIUaLfYOIjsPLIi
5tJ4tztEPpi5RTMPHkOLCcdGCT2bQ9InCmgSmEwymMbn36vaLuyH+9pfkqgbunUq
M1C07VNG9HUi+Qz6mhtbY6ZyG0s1M09rAouCOGo4T9x7YHA73BetvcxYcxhdTm6t
vOTDd/xiWM1Dsgc2tu41A+jB3cShi5so/OLE1KcLe4r1cso6+FdR+p5VW6XyRV73
Kbdh39azsdlyK6L+BW6cWCm/DUstourVmOz7mPebE0NW7N8cEhTtvclu0dz1gdKL
nBMha+lOBmy2+acU5Z58f6/jG8RCMcyiD+0z9yrZmOmO4OgVyxHXP2GrsQ1d/Qr5
vhBLT1LFxcvJpIlKZy3dqv1lZQ10suwcsFzJG3NsQbczV7Vk9aM=
=Iwyf
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2017-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just a single fix, for a WoWLAN-related part of CVE-2017-13080.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When changing dev tx_queue_len via netlink or net-sysfs,
a NETDEV_CHANGE_TX_QUEUE_LEN event notification will be
called.
But dev_ioctl missed this event notification, which could
cause no userspace notification would be sent.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7091d8c '(net/sched: cls_flower: Add offload support using egress
Hardware device') made sure (when fl_hw_replace_filter is called) to put
the egress_dev mark on persisent structure instance. Hence, following calls
into the HW driver for stats and deletion will note it and act accordingly.
With commit de4784ca03 this property is lost and hence when called,
the HW driver failes to operate (stats, delete) on the offloaded flow.
Fix it by setting the egress_dev flag whenever the ingress device is
different from the hw device since this is exactly the condition under
which we're calling into the HW driver through the egress port net-device.
Fixes: de4784ca03 ('net: sched: get rid of struct tc_to_netdev')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.
We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().
Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.
Fixes: 96442e4242 ("tuntap: choose the txq based on rxq")
Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IFLA_IFALIAS is defined as NLA_STRING. It means that the minimal length of
the attribute is 1 ("\0"). However, to remove an alias, the attribute
length must be 0 (see dev_set_alias()).
Let's define the type to NLA_BINARY to allow 0-length string, so that the
alias can be removed.
Example:
$ ip l s dummy0 alias foo
$ ip l l dev dummy0
5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:20:30:4f:a7:f3 brd ff:ff:ff:ff:ff:ff
alias foo
Before the patch:
$ ip l s dummy0 alias ""
RTNETLINK answers: Numerical result out of range
After the patch:
$ ip l s dummy0 alias ""
$ ip l l dev dummy0
5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ae:20:30:4f:a7:f3 brd ff:ff:ff:ff:ff:ff
CC: Oliver Hartkopp <oliver@hartkopp.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 96ca4a2cc1 ("net: remove ifalias on empty given alias")
Reported-by: Julien FLoret <julien.floret@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NETDEV_CHANGE_TX_QUEUE_LEN event process in rtnetlink_event would
send a notification for userspace and tx_queue_len's setting in
do_setlink would trigger NETDEV_CHANGE_TX_QUEUE_LEN.
So it shouldn't set DO_SETLINK_NOTIFY status for this change to
send a notification any more.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check 'status & DO_SETLINK_NOTIFY' in do_setlink doesn't really
work after status & DO_SETLINK_MODIFIED, as:
DO_SETLINK_MODIFIED 0x1
DO_SETLINK_NOTIFY 0x3
Considering that notifications are suppposed to be sent only when
status have the flag DO_SETLINK_NOTIFY, the right check would be:
(status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY
This would avoid lots of duplicated notifications when setting some
properties of a link.
Fixes: ba9989069f ("rtnl/do_setlink(): notify when a netdev is modified")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
libteam needs this event notification in userspace when dev's master
dev has been changed. After this, the redundant notifications issue
would be fixed in the later patch 'rtnetlink: check DO_SETLINK_NOTIFY
correctly in do_setlink'.
Fixes: b6b36eb23a ("rtnetlink: Do not generate notifications for NETDEV_CHANGEUPPER event")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As I said in patch 'rtnetlink: bring NETDEV_CHANGEMTU event process back
in rtnetlink_event', removing NETDEV_POST_TYPE_CHANGE event was not the
right fix for the redundant notifications issue.
So bring this event process back to rtnetlink_event and the old redundant
notifications issue would be fixed in the later patch 'rtnetlink: check
DO_SETLINK_NOTIFY correctly in do_setlink'.
Fixes: aef091ae58 ("rtnetlink: Do not generate notifications for POST_TYPE_CHANGE event")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same fix for changing mtu in the patch 'rtnetlink: bring
NETDEV_CHANGEMTU event process back in rtnetlink_event' is
needed for changing tx_queue_len.
Note that the redundant notifications issue for tx_queue_len
will be fixed in the later patch 'rtnetlink: do not send
notification for tx_queue_len in do_setlink'.
Fixes: 27b3b551d8 ("rtnetlink: Do not generate notifications for NETDEV_CHANGE_TX_QUEUE_LEN event")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 085e1a65f0 ("rtnetlink: Do not generate notifications for MTU
events") tried to fix the redundant notifications issue when ip link
set mtu by removing NETDEV_CHANGEMTU event process in rtnetlink_event.
But it also resulted in no notification generated when dev's mtu is
changed via other methods, like:
'ifconfig eth1 mtu 1400' or 'echo 1400 > /sys/class/net/eth1/mtu'
It would cause users not to be notified by this change.
This patch is to fix it by bringing NETDEV_CHANGEMTU event back into
rtnetlink_event, and the redundant notifications issue will be fixed
in the later patch 'rtnetlink: check DO_SETLINK_NOTIFY correctly in
do_setlink'.
Fixes: 085e1a65f0 ("rtnetlink: Do not generate notifications for MTU events")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We remove the request from the receive list before we call
xprt_wait_on_pinned_rqst(), and so we need to use list_del_init().
Otherwise, we will see list corruption when xprt_complete_rqst()
is called.
Reported-by: Emre Celebi <emre@primarydata.com>
Fixes: ce7c252a8c ("SUNRPC: Add a separate spinlock to protect...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When a key is reinstalled we can reset the replay counters
etc. which can lead to nonce reuse and/or replay detection
being impossible, breaking security properties, as described
in the "KRACK attacks".
In particular, CVE-2017-13080 applies to GTK rekeying that
happened in firmware while the host is in D3, with the second
part of the attack being done after the host wakes up. In
this case, the wpa_supplicant mitigation isn't sufficient
since wpa_supplicant doesn't know the GTK material.
In case this happens, simply silently accept the new key
coming from userspace but don't take any action on it since
it's the same key; this keeps the PN replay counters intact.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When pppol2tp_session_ioctl() is called by pppol2tp_tunnel_ioctl(),
the session may be unconnected. That is, it was created by
pppol2tp_session_create() and hasn't been connected with
pppol2tp_connect(). In this case, ps->sock is NULL, so we need to check
for this case in order to avoid dereferencing a NULL pointer.
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel config help for policy routing was still pointing at
an ancient document from 2000 that refers to Linux 2.1. Update it
to point to something that is at least occasionally updated.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we drop any new VLAN ids if there are more than the current
(or last used) channel can support. Most importantly this is a problem
if no channel has been selected yet, resulting in a segfault.
Secondly this does not necessarily reflect the capabilities of any other
channels. Instead only drop a new VLAN id if we are already tracking the
maximum allowed by the NCSI specification. Per-channel limits are
already handled by ncsi_add_filter(), but add a message to set_one_vid()
to make it obvious that the channel can not support any more VLAN ids.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we cannot find a suitable inner_mode value, we will leak
the currently allocated 'xdst'.
The fix is to make sure it is linked into the chain before
erroring out.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If for some reason, the newly allocated child need to be freed,
we will call cgroup_put() (via sk_free_unlock_clone()) while the
corresponding cgroup_get() was not yet done, and we will free memory
too soon.
Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit fbb1fb4ad4.
This was not the proper fix, lets cleanly revert it, so that
following patch can be carried to stable versions.
sock_cgroup_ptr() callers do not expect a NULL return value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_clone_lock() might run while TCP/DCCP listener already vanished.
In order to prevent use after free, it is better to defer cgroup_sk_alloc()
to the point we know both parent and child exist, and from process context.
Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of calling mem_cgroup_sk_alloc() from BH context,
it is better to call it from inet_csk_accept() in process context.
Not only this removes code in mem_cgroup_sk_alloc(), but it also
fixes a bug since listener might have been dismantled and css_get()
might cause a use-after-free.
Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix object leak on IPSEC offload failure, from Steffen Klassert.
2) Fix range checks in ipset address range addition operations, from
Jozsef Kadlecsik.
3) Fix pernet ops unregistration order in ipset, from Florian Westphal.
4) Add missing netlink attribute policy for nl80211 packet pattern
attrs, from Peng Xu.
5) Fix PPP device destruction race, from Guillaume Nault.
6) Write marks get lost when BPF verifier processes R1=R2 register
assignments, causing incorrect liveness information and less state
pruning. Fix from Alexei Starovoitov.
7) Fix blockhole routes so that they are marked dead and therefore not
cached in sockets, otherwise IPSEC stops working. From Steffen
Klassert.
8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits)
cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
net: thunderx: mark expected switch fall-throughs in nicvf_main()
udp: fix bcast packet reception
netlink: do not set cb_running if dump's start() errs
ipv4: Fix traffic triggered IPsec connections.
ipv6: Fix traffic triggered IPsec connections.
ixgbe: incorrect XDP ring accounting in ethtool tx_frame param
net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Revert commit 1a8b6d76dc ("net:add one common config...")
ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register
ixgbe: Return error when getting PHY address if PHY access is not supported
netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook
tipc: Unclone message at secondary destination lookup
tipc: correct initialization of skb list
gso: fix payload length when gso_size is zero
mlxsw: spectrum_router: Avoid expensive lookup during route removal
bpf: fix liveness marking
doc: Fix typo "8023.ad" in bonding documentation
ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real
...
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim
Fedorenko.
2) Fix splat with mark restoration in xt_socket with non-full-sock,
patch from Subash Abhinov Kasiviswanathan.
3) ipset bogusly bails out when adding IPv4 range containing more than
2^31 addresses, from Jozsef Kadlecsik.
4) Incorrect pernet unregistration order in ipset, from Florian Westphal.
5) Races between dump and swap in ipset results in BUG_ON splats, from
Ross Lagerwall.
6) Fix chain renames in nf_tables, from JingPiao Chen.
7) Fix race in pernet codepath with ebtables table registration, from
Artem Savkov.
8) Memory leak in error path in set name allocation in nf_tables, patch
from Arvind Yadav.
9) Don't dump chain counters if they are not available, this fixes a
crash when listing the ruleset.
10) Fix out of bound memory read in strlcpy() in x_tables compat code,
from Eric Dumazet.
11) Make sure we only process TCP packets in SYNPROXY hooks, patch from
Lin Zhang.
12) Cannot load rules incrementally anymore after xt_bpf with pinned
objects, added in revision 1. From Shmulik Ladkani.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit bc044e8db7 ("udp: perform source validation for
mcast early demux") does not take into account that broadcast packets
lands in the same code path and they need different checks for the
source address - notably, zero source address are valid for bcast
and invalid for mcast.
As a result, 2nd and later broadcast packets with 0 source address
landing to the same socket are dropped. This breaks dhcp servers.
Since we don't have stringent performance requirements for ingress
broadcast traffic, fix it by disabling UDP early demux such traffic.
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Fixes: bc044e8db7 ("udp: perform source validation for mcast early demux")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that multiple places can call netlink_dump(), which means
it's still possible to dereference partially initialized values in
dump() that were the result of a faulty returned start().
This fixes the issue by calling start() _before_ setting cb_running to
true, so that there's no chance at all of hitting the dump() function
through any indirect paths.
It also moves the call to start() to be when the mutex is held. This has
the nice side effect of serializing invocations to start(), which is
likely desirable anyway. It also prevents any possible other races that
might come out of this logic.
In testing this with several different pieces of tricky code to trigger
these issues, this commit fixes all avenues that I'm aware of.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlnbJz4ACgkQa3t4Rpy0
AB2R+Q//UgCNRjosPLsEgLNR9zBP/Kys7cxy2ZtazBhAqYF7bil2QTh9o+Q0PW1d
d9B/Dwo1lQhYe2D4qh6YoNimakdN0SfGViqLoXl4s28vC6ZQLFWfHgKP845VXQbC
6ihGsOG9TC2Xe5MIKXHf4VUPLCEQHBv7yWyRFOjVd+IJ3dfz2STi3tQTfApv6O2/
LXpERzgb9m3gj0DeGpU50dN7wpO+uUNX87cKLrByBwzS9qHQECcMB/d4eRsirljF
EOtmMBWg/KnBfT3jwjmjLBEFLDDrPEa1aQn1C4WdhowK6Fg65XeIeO1czLqm0wRL
NnWXeS7h1fywQ3+e8HJ3qDkAlBGvO3+uMORVQf5HNgETtQ8BpDvfDLJEU31D4UA9
vdPIy6L01fL2MMQw3H0j9YQHPIdKTKZdHhI7aX2Pd+UoihQwuooS+g/Pyrf18qrc
8FmVxo4Uflmm9/pqZ7YiNVOFTptwz81XHJBaTMfrjgTHdS2N6EyjCc2ucSwjXbXU
ma7nNlYgMloOXOncN5JraFEhtQCkQvtw9mPWcIdpmi97+sj7VT4kP+5KOeVD9vjl
VSyji5WMAn6bBwwHSnon3yGFJUXmW1NYO0H786iHs7QqmWwD4BjpP6GAfjwPVPbm
kCmfcVb1YWkSEKgmdImn1SUExvkjxdhIwY++Wt5rksbxa9JMczQ=
=WEgb
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2017-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
pull-request: mac80211 2017-10-09
The QCA folks found another netlink problem - we were missing validation
of some attributes. It's not super problematic since one can only read a
few bytes beyond the message (and that memory must exist), but here's the
fix for it.
I thought perhaps we can make nla_parse_nested() require a policy, but
given the two-stage validation/parsing in regular netlink that won't work.
Please pull and let me know if there's any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2017-10-09
1) Fix some error paths of the IPsec offloading API.
2) Fix a NULL pointer dereference when IPsec is used
with vti. From Alexey Kodanev.
3) Don't call xfrm_policy_cache_flush under xfrm_state_lock,
it triggers several locking warnings. From Artem Savkov.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent patch removed the dst_free() on the allocated
dst_entry in ipv4_blackhole_route(). The dst_free() marked the
dst_entry as dead and added it to the gc list. I.e. it was setup
for a one time usage. As a result we may now have a blackhole
route cached at a socket on some IPsec scenarios. This makes the
connection unusable.
Fix this by marking the dst_entry directly at allocation time
as 'dead', so it is used only once.
Fixes: b838d5e1c5 ("ipv4: mark DST_NOGC and remove the operation of dst_free()")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent patch removed the dst_free() on the allocated
dst_entry in ipv6_blackhole_route(). The dst_free() marked
the dst_entry as dead and added it to the gc list. I.e. it
was setup for a one time usage. As a result we may now have
a blackhole route cached at a socket on some IPsec scenarios.
This makes the connection unusable.
Fix this by marking the dst_entry directly at allocation time
as 'dead', so it is used only once.
Fixes: 587fea7411 ("ipv6: mark DST_NOGC and remove the operation of dst_free()")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2c16d60332 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.
However this breaks subsequent iptables calls:
# iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
# iptables -A INPUT -s 5.6.7.8 -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.
That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.
However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when 2nd invocation
occurs, userspace passes a bogus fd number, which leads to
'bpf_mt_check_v1' to fail.
One suggested solution [1] was to hack iptables userspace, to perform a
"entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd per every 'xt_bpf_info_v1' entry seen.
However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.
It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.
Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
[2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
Reported-by: Rafael Buchbinder <rafi@rbk.ms>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In function {ipv4,ipv6}_synproxy_hook we expect a normal tcp packet, but
the real server maybe reply an icmp error packet related to the exist
tcp conntrack, so we will access wrong tcp data.
Fix it by checking for the protocol field and only process tcp traffic.
Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When a bundling message is received, the function tipc_link_input()
calls function tipc_msg_extract() to unbundle all inner messages of
the bundling message before adding them to input queue.
The function tipc_msg_extract() just clones all inner skb for all
inner messagges from the bundling skb. This means that the skb
headroom of an inner message overlaps with the data part of the
preceding message in the bundle.
If the message in question is a name addressed message, it may be
subject to a secondary destination lookup, and eventually be sent out
on one of the interfaces again. But, since what is perceived as headroom
by the device driver in reality is the last bytes of the preceding
message in the bundle, the latter will be overwritten by the MAC
addresses of the L2 header. If the preceding message has not yet been
consumed by the user, it will evenually be delivered with corrupted
contents.
This commit fixes this by uncloning all messages passing through the
function tipc_msg_lookup_dest(), hence ensuring that the headroom
is always valid when the message is passed on.
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We change the initialization of the skb transmit buffer queues
in the functions tipc_bcast_xmit() and tipc_rcast_xmit() to also
initialize their spinlocks. This is needed because we may, during
error conditions, need to call skb_queue_purge() on those queues
further down the stack.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When gso_size reset to zero for the tail segment in skb_segment(), later
in ipv6_gso_segment(), __skb_udp_tunnel_segment() and gre_gso_segment()
we will get incorrect results (payload length, pcsum) for that segment.
inet_gso_segment() already has a check for gso_size before calculating
payload.
The issue was found with LTP vxlan & gre tests over ixgbe NIC.
Fixes: 07b26c9454 ("gso: Support partial splitting at the frag_list pointer")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 35e015e1f5 ("ipv6: fix net.ipv6.conf.all interface DAD handlers")
was intended to affect accept_dad flag handling in such a way that
DAD operation and mode on a given interface would be selected
according to the maximum value of conf/{all,interface}/accept_dad.
However, addrconf_dad_begin() checks for particular cases in which we
need to skip DAD, and this check was modified in the wrong way.
Namely, it was modified so that, if the accept_dad flag is 0 for the
given interface *or* for all interfaces, DAD would be skipped.
We have instead to skip DAD if accept_dad is 0 for the given interface
*and* for all interfaces.
Fixes: 35e015e1f5 ("ipv6: fix net.ipv6.conf.all interface DAD handlers")
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reported-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller reports an out of bound read in strlcpy(), triggered
by xt_copy_counters_from_user()
Fix this by using memcpy(), then forcing a zero byte at the last position
of the destination, as Florian did for the non COMPAT code.
Fixes: d7591f0c41 ("netfilter: x_tables: introduce and use xt_copy_counters_from_user")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chain counters are only enabled on demand since 9f08ea8481, skip them
when dumping them via netlink.
Fixes: 9f08ea8481 ("netfilter: nf_tables: keep chain counters away from hot path")
Reported-by: Johny Mattsson <johny.mattsson+kernel@gmail.com>
Tested-by: Johny Mattsson <johny.mattsson+kernel@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking fixes from David Miller:
1) Check iwlwifi 9000 reorder buffer out-of-space condition properly,
from Sara Sharon.
2) Fix RCU splat in qualcomm rmnet driver, from Subash Abhinov
Kasiviswanathan.
3) Fix session and tunnel release races in l2tp, from Guillaume Nault
and Sabrina Dubroca.
4) Fix endian bug in sctp_diag_dump(), from Dan Carpenter.
5) Several mlx5 driver fixes from the Mellanox folks (max flow counters
cap check, invalid memory access in IPoIB support, etc.)
6) tun_get_user() should bail if skb->len is zero, from Alexander
Potapenko.
7) Fix RCU lookups in inetpeer, from Eric Dumazet.
8) Fix locking in packet_do_bund().
9) Handle cb->start() error properly in netlink dump code, from Jason
A. Donenfeld.
10) Handle multicast properly in UDP socket early demux code. From Paolo
Abeni.
11) Several erspan bug fixes in ip_gre, from Xin Long.
12) Fix use-after-free in socket filter code, in order to handle the
fact that listener lock is no longer taken during the three-way TCP
handshake. From Eric Dumazet.
13) Fix infoleak in RTM_GETSTATS, from Nikolay Aleksandrov.
14) Fix tail call generation in x86-64 BPF JIT, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
net: 8021q: skip packets if the vlan is down
bpf: fix bpf_tail_call() x64 JIT
net: stmmac: dwmac-rk: Add RK3128 GMAC support
rndis_host: support Novatel Verizon USB730L
net: rtnetlink: fix info leak in RTM_GETSTATS call
socket, bpf: fix possible use after free
mlxsw: spectrum_router: Track RIF of IPIP next hops
mlxsw: spectrum_router: Move VRF refcounting
net: hns3: Fix an error handling path in 'hclge_rss_init_hw()'
net: mvpp2: Fix clock resource by adding an optional bus clock
r8152: add Linksys USB3GIGV1 id
l2tp: fix l2tp_eth module loading
ip_gre: erspan device should keep dst
ip_gre: set tunnel hlen properly in erspan_tunnel_init
ip_gre: check packet length and mtu correctly in erspan_xmit
ip_gre: get key from session_id correctly in erspan_rcv
tipc: use only positive error codes in messages
ppp: fix __percpu annotation
udp: perform source validation for mcast early demux
IPv4: early demux can return an error code
...
If the vlan is down, free the packet instead of proceeding with other
processing, or counting it as received. If vlan interfaces are used
as slaves for bonding, with arp monitoring for connectivity, if the rx
counter is seen to be incrementing, then the bond device will not
observe that the interface is down.
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Vishakha Narvekar <Vishakha.Narvekar@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define a policy for packet pattern attributes in order to fix a
potential read over the end of the buffer during nla_get_u32()
of the NL80211_PKTPAT_OFFSET attribute.
Note that the data there can always be read due to SKB allocation
(with alignment and struct skb_shared_info at the end), but the
data might be uninitialized. This could be used to leak some data
from uninitialized vmalloc() memory, but most drivers don't allow
an offset (so you'd just get -EINVAL if the data is non-zero) or
just allow it with a fixed value - 100 or 128 bytes, so anything
above that would get -EINVAL. With brcmfmac the limit is 1500 so
(at least) one byte could be obtained.
Cc: stable@kernel.org
Signed-off-by: Peng Xu <pxu@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[rewrite description based on SKB allocation knowledge]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When RTM_GETSTATS was added the fields of its header struct were not all
initialized when returning the result thus leaking 4 bytes of information
to user-space per rtnl_fill_statsinfo call, so initialize them now. Thanks
to Alexander Potapenko for the detailed report and bisection.
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: 10c9ead9f3 ("rtnetlink: add new RTM_GETSTATS message to dump link stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Free memory region, if nf_tables_set_alloc_name is not successful.
Fixes: 387454901b ("netfilter: nf_tables: Allow set names of up to 255 chars")
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Starting from linux-4.4, 3WHS no longer takes the listener lock.
Since this time, we might hit a use-after-free in sk_filter_charge(),
if the filter we got in the memcpy() of the listener content
just happened to be replaced by a thread changing listener BPF filter.
To fix this, we need to make sure the filter refcount is not already
zero before incrementing it again.
Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The l2tp_eth module crashes if its netlink callbacks are run when the
pernet data aren't initialised.
We should normally register_pernet_device() before the genl callbacks.
However, the pernet data only maintain a list of l2tpeth interfaces,
and this list is never used. So let's just drop pernet handling
instead.
Fixes: d9e31d17ce ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch 'ip_gre: ipgre_tap device should keep dst' fixed
the issue ipgre_tap dev mtu couldn't be updated in tx path.
The same fix is needed for erspan as well.
Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to __gre_tunnel_init, tunnel->hlen should be set as the
headers' length between inner packet and outer iphdr.
It would be used especially to calculate a proper mtu when updating
mtu in tnl_update_pmtu. Now without setting it, a bigger mtu value
than expected would be updated, which hurts performance a lot.
This patch is to fix it by setting tunnel->hlen with:
tunnel->tun_hlen + tunnel->encap_hlen + sizeof(struct erspanhdr)
Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a ARPHRD_ETHER device, skb->len in erspan_xmit is the length
of the whole ether packet. So before checking if a packet size
exceeds the mtu, skb->len should subtract dev->hard_header_len.
Otherwise, all packets with max size according to mtu would be
trimmed to be truncated packet.
Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
erspan only uses the first 10 bits of session_id as the key to look
up the tunnel. But in erspan_rcv, it missed 'session_id & ID_MASK'
when getting the key from session_id.
If any other flag is also set in session_id in a packet, it would
fail to find the tunnel due to incorrect key in erspan_rcv.
This patch is to add 'session_id & ID_MASK' there and also remove
the unnecessary variable session_id.
Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock is being initialized and then being almost immediately updated
hence the initialized value is not being used and is redundant. Remove
the initialization. Cleans up clang warning:
warning: Value stored to 'sock' during its initialization is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In commit e3a77561e7 ("tipc: split up function tipc_msg_eval()"),
we have updated the function tipc_msg_lookup_dest() to set the error
codes to negative values at destination lookup failures. Thus when
the function sets the error code to -TIPC_ERR_NO_NAME, its inserted
into the 4 bit error field of the message header as 0xf instead of
TIPC_ERR_NO_NAME (1). The value 0xf is an unknown error code.
In this commit, we set only positive error code.
Fixes: e3a77561e7 ("tipc: split up function tipc_msg_eval()")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP early demux can leverate the rx dst cache even for
multicast unconnected sockets.
In such scenario the ipv4 source address is validated only on
the first packet in the given flow. After that, when we fetch
the dst entry from the socket rx cache, we stop enforcing
the rp_filter and we even start accepting any kind of martian
addresses.
Disabling the dst cache for unconnected multicast socket will
cause large performace regression, nearly reducing by half the
max ingress tput.
Instead we factor out a route helper to completely validate an
skb source address for multicast packets and we call it from
the UDP early demux for mcast packets landing on unconnected
sockets, after successful fetching the related cached dst entry.
This still gives a measurable, but limited performance
regression:
rp_filter = 0 rp_filter = 1
edmux disabled: 1182 Kpps 1127 Kpps
edmux before: 2238 Kpps 2238 Kpps
edmux after: 2037 Kpps 2019 Kpps
The above figures are on top of current net tree.
Applying the net-next commit 6e617de84e ("net: avoid a full
fib lookup when rp_filter is disabled.") the delta with
rp_filter == 0 will decrease even more.
Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently no error is emitted, but this infrastructure will
used by the next patch to allow source address validation
for mcast sockets.
Since early demux can do a route lookup and an ipv4 route
lookup can return an error code this is consistent with the
current ipv4 route infrastructure.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when updating mtu in tx path, it doesn't consider ARPHRD_ETHER tunnel
device, like ip6gre_tap tunnel, for which it should also subtract ether
header to get the correct mtu.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch 'ip_gre: ipgre_tap device should keep dst' fixed
a issue that ipgre_tap mtu couldn't be updated in tx path.
The same fix is needed for ip6gre_tap as well.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without keeping dst, the tunnel will not update any mtu/pmtu info,
since it does not have a dst on the skb.
Reproducer:
client(ipgre_tap1 - eth1) <-----> (eth1 - ipgre_tap1)server
After reducing eth1's mtu on client, then perforamnce became 0.
This patch is to netif_keep_dst in gre_tap_init, as ipgre does.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers that use the start method for netlink dumping rely on dumpit not
being called if start fails. For example, ila_xlat.c allocates memory
and assigns it to cb->args[0] in its start() function. It might fail to
do that and return -ENOMEM instead. However, even when returning an
error, dumpit will be called, which, in the example above, quickly
dereferences the memory in cb->args[0], which will OOPS the kernel. This
is but one example of how this goes wrong.
Since start() has always been a function with an int return type, it
therefore makes sense to use it properly, rather than ignoring it. This
patch thus returns early and does not call dumpit() when start() fails.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible for ebt_in_hook to be triggered before ebt_table is assigned
resulting in a NULL-pointer dereference. Make sure hooks are
registered as the last step.
Fixes: aee12a0a37 ("ebtables: remove nf_hook_register usage")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix a race between ip_set_dump_start() and ip_set_swap().
The race is as follows:
* Without holding the ref lock, ip_set_swap() checks ref_netlink of the
set and it is 0.
* ip_set_dump_start() takes a reference on the set.
* ip_set_swap() does the swap (even though it now has a non-zero
reference count).
* ip_set_dump_start() gets the set from ip_set_list again which is now a
different set since it has been swapped.
* ip_set_dump_start() calls __ip_set_put_netlink() and hits a BUG_ON due
to the reference count being 0.
Fix this race by extending the critical region in which the ref lock is
held to include checking the ref counts.
The race can be reproduced with the following script:
while :; do
ipset destroy hash_ip1
ipset destroy hash_ip2
ipset create hash_ip1 hash:ip family inet hashsize 1024 \
maxelem 500000
ipset create hash_ip2 hash:ip family inet hashsize 300000 \
maxelem 500000
ipset create hash_ip3 hash:ip family inet hashsize 1024 \
maxelem 500000
ipset save &
ipset swap hash_ip3 hash_ip2
ipset destroy hash_ip3
wait
done
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commit dbbccdc4ce.
It turns out that the "legacy" users aren't so legacy at all, and that
turning off the legacy ioctl will break the current Qt bluetooth stack
for bluetooth LE devices that were released just a couple of months ago.
So it's simply not true that this was a legacy interface that hasn't
been needed and is only limited to old legacy BT devices. Because I
actually read Kconfig help messages, and actively try to turn off
features that I don't need, I turned the option off.
Then I spent _way_ too much time debugging BLE issues until I realized
that it wasn't the Qt and subsurface development that had broken one of
my dive computer BLE downloads, but simply my broken kernel config.
Maybe in a decade it will be true that this is a legacy interface. And
maybe with a better help-text and correct dependencies, this kind of
legacy removal might be acceptable. But as things are right now both
the commit message and the Kconfig help text were misleading, and the
Kconfig option had the wrong dependenencies.
There's no reason to keep that broken Kconfig option in the tree.
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sk->sk_prot and sk->sk_prot_creator can differ when the app uses
IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
Which is why sk_prot_creator is there to make sure that sk_prot_free()
does the kmem_cache_free() on the right kmem_cache slab.
Now, if such a socket gets transformed back to a listening socket (using
connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
sk_clone_lock() when a new connection comes in. But sk_prot_creator will
still point to the IPv6 kmem_cache (as everything got copied in
sk_clone_lock()). When freeing, we will thus put this
memory back into the IPv6 kmem_cache although it was allocated in the
IPv4 cache. I have seen memory corruption happening because of this.
With slub-debugging and MEMCG_KMEM enabled this gives the warning
"cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"
A C-program to trigger this:
void main(void)
{
int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
int new_fd, newest_fd, client_fd;
struct sockaddr_in6 bind_addr;
struct sockaddr_in bind_addr4, client_addr1, client_addr2;
struct sockaddr unsp;
int val;
memset(&bind_addr, 0, sizeof(bind_addr));
bind_addr.sin6_family = AF_INET6;
bind_addr.sin6_port = ntohs(42424);
memset(&client_addr1, 0, sizeof(client_addr1));
client_addr1.sin_family = AF_INET;
client_addr1.sin_port = ntohs(42424);
client_addr1.sin_addr.s_addr = inet_addr("127.0.0.1");
memset(&client_addr2, 0, sizeof(client_addr2));
client_addr2.sin_family = AF_INET;
client_addr2.sin_port = ntohs(42421);
client_addr2.sin_addr.s_addr = inet_addr("127.0.0.1");
memset(&unsp, 0, sizeof(unsp));
unsp.sa_family = AF_UNSPEC;
bind(fd, (struct sockaddr *)&bind_addr, sizeof(bind_addr));
listen(fd, 5);
client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
connect(client_fd, (struct sockaddr *)&client_addr1, sizeof(client_addr1));
new_fd = accept(fd, NULL, NULL);
close(fd);
val = AF_INET;
setsockopt(new_fd, SOL_IPV6, IPV6_ADDRFORM, &val, sizeof(val));
connect(new_fd, &unsp, sizeof(unsp));
memset(&bind_addr4, 0, sizeof(bind_addr4));
bind_addr4.sin_family = AF_INET;
bind_addr4.sin_port = ntohs(42421);
bind(new_fd, (struct sockaddr *)&bind_addr4, sizeof(bind_addr4));
listen(new_fd, 5);
client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
connect(client_fd, (struct sockaddr *)&client_addr2, sizeof(client_addr2));
newest_fd = accept(new_fd, NULL, NULL);
close(new_fd);
close(client_fd);
close(new_fd);
}
As far as I can see, this bug has been there since the beginning of the
git-days.
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packet socket option po->has_vnet_hdr can be updated concurrently with
other operations if no ring is attached.
Do not test the option twice in packet_snd, as the value may change in
between calls. A race on setsockopt disable may cause a packet > mtu
to be sent without having GSO options set.
Fixes: bfd5f4a3d6 ("packet: Add GSO/csum offload support.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once a socket has po->fanout set, it remains a member of the group
until it is destroyed. The prot_hook must be constant and identical
across sockets in the group.
If fanout_add races with packet_do_bind between the test of po->fanout
and taking the lock, the bind call may make type or dev inconsistent
with that of the fanout group.
Hold po->bind_lock when testing po->fanout to avoid this race.
I had to introduce artificial delay (local_bh_enable) to actually
observe the race.
Fixes: dc99f60069 ("packet: Add fanout support.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We cannot be registering the network device first, then setting its
carrier off and finally connecting it to a PHY, doing that leaves a
window during which the carrier is at best inconsistent, and at worse
the device is not usable without a down/up sequence since the network
device is visible to user space with possibly no PHY device attached.
Re-order steps so that they make logical sense. This fixes some devices
where the port was not usable after e.g: an unbind then bind of the
driver.
Fixes: 0071f56e46 ("dsa: Register netdev before phy")
Fixes: 91da11f870 ("net: Distributed Switch Architecture protocol support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My prior fix was not complete, as we were dereferencing a pointer
three times per node, not twice as I initially thought.
Fixes: 4cc5b44b29 ("inetpeer: fix RCU lookup()")
Fixes: b145425f26 ("inetpeer: remove AVL implementation in favor of RB tree")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I might be wrong but it doesn't look like xfrm_state_lock is required
for xfrm_policy_cache_flush and calling it under this lock triggers both
"sleeping function called from invalid context" and "possible circular
locking dependency detected" warnings on flush.
Fixes: ec30d78c14 xfrm: add xdst pcpu cache
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The sctp_for_each_transport() function takes an pointer to int. The
cb->args[] array holds longs so it's only using the high 32 bits. It
works on little endian system but will break on big endian 64 bit
machines.
Fixes: d25adbeb0c ("sctp: fix an use-after-free issue in sctp_sock_dump")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removing the ipset module leaves a small window where one cpu performs
module removal while another runs a command like 'ipset flush'.
ipset uses net_generic(), unregistering the pernet ops frees this
storage area.
Fix it by first removing the user-visible api handlers and the pernet
ops last.
Fixes: 1785e8f473 ("netfiler: ipset: Add net namespace for ipset")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Wrong comparison prevented the hash types to add a range with more than
2^31 addresses but reported as a success.
Fixes Netfilter's bugzilla id #1005, reported by Oleg Serditov and
Oliver Ford.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we try to delete the same tunnel twice, the first delete operation
does a lookup (l2tp_tunnel_get), finds the tunnel, calls
l2tp_tunnel_delete, which queues it for deletion by
l2tp_tunnel_del_work.
The second delete operation also finds the tunnel and calls
l2tp_tunnel_delete. If the workqueue has already fired and started
running l2tp_tunnel_del_work, then l2tp_tunnel_delete will queue the
same tunnel a second time, and try to free the socket again.
Add a dead flag to prevent firing the workqueue twice. Then we can
remove the check of queue_work's result that was meant to prevent that
race but doesn't.
Reproducer:
ip l2tp add tunnel tunnel_id 3000 peer_tunnel_id 4000 local 192.168.0.2 remote 192.168.0.1 encap udp udp_sport 5000 udp_dport 6000
ip l2tp add session name l2tp1 tunnel_id 3000 session_id 1000 peer_session_id 2000
ip link set l2tp1 up
ip l2tp del tunnel tunnel_id 3000
ip l2tp del tunnel tunnel_id 3000
Fixes: f8ccac0e44 ("l2tp: put tunnel socket release on a workqueue")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
When running LTP IPsec tests, KASan might report:
BUG: KASAN: use-after-free in vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
Read of size 4 at addr ffff880dc6ad1980 by task swapper/0/0
...
Call Trace:
<IRQ>
dump_stack+0x63/0x89
print_address_description+0x7c/0x290
kasan_report+0x28d/0x370
? vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
__asan_report_load4_noabort+0x19/0x20
vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
? vti_init_net+0x190/0x190 [ip_vti]
? save_stack_trace+0x1b/0x20
? save_stack+0x46/0xd0
dev_hard_start_xmit+0x147/0x510
? icmp_echo.part.24+0x1f0/0x210
__dev_queue_xmit+0x1394/0x1c60
...
Freed by task 0:
save_stack_trace+0x1b/0x20
save_stack+0x46/0xd0
kasan_slab_free+0x70/0xc0
kmem_cache_free+0x81/0x1e0
kfree_skbmem+0xb1/0xe0
kfree_skb+0x75/0x170
kfree_skb_list+0x3e/0x60
__dev_queue_xmit+0x1298/0x1c60
dev_queue_xmit+0x10/0x20
neigh_resolve_output+0x3a8/0x740
ip_finish_output2+0x5c0/0xe70
ip_finish_output+0x4ba/0x680
ip_output+0x1c1/0x3a0
xfrm_output_resume+0xc65/0x13d0
xfrm_output+0x1e4/0x380
xfrm4_output_finish+0x5c/0x70
Can be fixed if we get skb->len before dst_output().
Fixes: b9959fd3b0 ("vti: switch to new ip tunnel code")
Fixes: 22e1b23daf ("vti6: Support inter address family tunneling.")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPVS tunnel mode works as simple tunnel (see RFC 3168) copying ECN field
to outer header. That's result in packet drops on egress tunnels in case
the egress tunnel operates as ECN-capable with Full-functionality option
(like ip_tunnel and ip6_tunnel kernel modules), according to RFC 3168
section 9.1.1 recommendation.
This patch implements ECN full-functionality option into ipvs xmit code.
Cc: netdev@vger.kernel.org
Cc: lvs-devel@vger.kernel.org
Signed-off-by: Vadim Fedorenko <vfedorenko@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are several ways to remove L2TP sessions:
* deleting a session explicitly using the netlink interface (with
L2TP_CMD_SESSION_DELETE),
* deleting the session's parent tunnel (either by closing the
tunnel's file descriptor or using the netlink interface),
* closing the PPPOL2TP file descriptor of a PPP pseudo-wire.
In some cases, when these methods are used concurrently on the same
session, the session can be removed twice, leading to use-after-free
bugs.
This patch adds a 'dead' flag, used by l2tp_session_delete() and
l2tp_tunnel_closeall() to prevent them from stepping on each other's
toes.
The session deletion path used when closing a PPPOL2TP file descriptor
doesn't need to be adapted. It already has to ensure that a session
remains valid for the lifetime of its PPPOL2TP file descriptor.
So it takes an extra reference on the session in the ->session_close()
callback (pppol2tp_session_close()), which is eventually dropped
in the ->sk_destruct() callback of the PPPOL2TP socket
(pppol2tp_session_destruct()).
Still, __l2tp_session_unhash() and l2tp_session_queue_purge() can be
called twice and even concurrently for a given session, but thanks to
proper locking and re-initialisation of list fields, this is not an
issue.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
If l2tp_tunnel_delete() or l2tp_tunnel_closeall() deletes a session
right after pppol2tp_release() orphaned its socket, then the 'sock'
variable of the pppol2tp_session_close() callback is NULL. Yet the
session is still used by pppol2tp_release().
Therefore we need to take an extra reference in any case, to prevent
l2tp_tunnel_delete() or l2tp_tunnel_closeall() from freeing the session.
Since the pppol2tp_session_close() callback is only set if the session
is associated to a PPPOL2TP socket and that both l2tp_tunnel_delete()
and l2tp_tunnel_closeall() hold the PPPOL2TP socket before calling
pppol2tp_session_close(), we're sure that pppol2tp_session_close() and
pppol2tp_session_destruct() are paired and called in the right order.
So the reference taken by the former will be released by the later.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>