Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) Incorrect uapi header comment in bitwise, from Jeremy Sowden.
2) Fetch flow statistics if flow is still active.
3) Restrict flow matching on hardware based on input device.
4) Add nf_flow_offload_work_alloc() helper function.
5) Remove the last client of the FLOW_OFFLOAD_DYING flag, use teardown
instead.
6) Use atomic bitwise operation to operate with flow flags.
7) Add nf_flowtable_hw_offload() helper function to check for the
NF_FLOWTABLE_HW_OFFLOAD flag.
8) Add NF_FLOW_HW_REFRESH to retry hardware offload from the flowtable
software datapath.
9) Remove indirect calls in xt_hashlimit, from Florian Westphal.
10) Add nf_flow_offload_tuple() helper to consolidate code.
11) Add nf_flow_table_offload_cmd() helper function.
12) A few whitespace cleanups in nf_tables in bitwise and the bitmap/hash
set types, from Jeremy Sowden.
13) Cleanup netlink attribute checks in bitwise, from Jeremy Sowden.
14) Replace goto by return in error path of nft_bitwise_dump(), from
Jeremy Sowden.
15) Add bitwise operation netlink attribute, also from Jeremy.
16) Add nft_bitwise_init_bool(), from Jeremy Sowden.
17) Add nft_bitwise_eval_bool(), also from Jeremy.
18) Add nft_bitwise_dump_bool(), from Jeremy Sowden.
19) Disallow hardware offload for other that NFT_BITWISE_BOOL,
from Jeremy Sowden.
20) Add NFTA_BITWISE_DATA netlink attribute, again from Jeremy.
21) Add support for bitwise shift operation, from Jeremy Sowden.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark function parameters as 'const' where possible.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If nf_flow_offload_add() fails to add the flow to hardware, then the
NF_FLOW_HW_REFRESH flag bit is set and the flow remains in the flowtable
software path.
If flowtable hardware offload is enabled, this patch enqueues a new
request to offload this flow to hardware.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This function checks for the NF_FLOWTABLE_HW_OFFLOAD flag, meaning that
the flowtable hardware offload is enabled.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Originally, all flow flag bits were set on only from the workqueue. With
the introduction of the flow teardown state and hardware offload this is
no longer true. Let's be safe and use atomic bitwise operation to
operation with flow flags.
Fixes: 59c466dd68 ("netfilter: nf_flow_table: add a new flow state for tearing down offloading")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The dying bit removes the conntrack entry if the netdev that owns this
flow is going down. Instead, use the teardown mechanism to push back the
flow to conntrack to let the classic software path decide what to do
with it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Daniel Borkmann says:
====================
pull-request: bpf 2020-01-15
The following pull-request contains BPF updates for your *net* tree.
We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 13 files changed, 95 insertions(+), 43 deletions(-).
The main changes are:
1) Fix refcount leak for TCP time wait and request sockets for socket lookup
related BPF helpers, from Lorenz Bauer.
2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.
3) Batch of several sockmap and related TLS fixes found while operating
more complex BPF programs with Cilium and OpenSSL, from John Fastabend.
4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue()
to avoid purging data upon teardown, from Lingpeng Chen.
5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly
dump a BPF map's value with BTF, from Martin KaFai Lau.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
don't push the write_space callback up and instead simply overwrite the
op with the psock stored previous op. This may or may not be correct so
to ensure we don't overwrite the TLS write space hook pass this field to
the ULP and have it fixup the ctx.
This completes a previous fix that pushed the ops through to the ULP
but at the time missed doing this for write_space, presumably because
write_space TLS hook was added around the same time.
Fixes: 95fa145479 ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
This enables passing ISO packets to the monitor socket.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
These adds the HCI definitions for handling CIS connections along with
ISO data packets.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
MGMT command is added to receive the list of blocked keys from
user-space.
The list is used to:
1) Block keys from being distributed by the device during
the ke distribution phase of SMP.
2) Filter out any keys that were previously saved so
they are no longer used.
Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
* -O3 enablement fallout, thanks to Arnd who ran this
* fixes for a few leaks, thanks to Felix
* channel 12 regulatory fix for custom regdomains
* check for a crash reported by syzbot
(NULL function is called on drivers that don't have it)
* fix TKIP replay protection after setup with some APs
(from Jouni)
* restrict obtaining some mesh data to avoid WARN_ONs
* fix deadlocks with auto-disconnect (socket owner)
* fix radar detection events with multiple devices
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl4e4voACgkQB8qZga/f
l8RVFg/+Ngjzje8xU6Ymh7VSwvBBz6CmSVwH/nRSsIAszDrQLoDCo7+eWeUHnAKG
2ViFQ4Lf5CP2J9kSgi7pbD0yGQRDGxH1CVSxMB64uZZRErlOCZYLRTIeBn7HGb8w
JhqAVXBjskzP9iB5jUZIkOgJ6L2ESM2KYKQVGtgn8p9fFdHKitec8UEKiPU3tcX1
zx94EbKldynuvDNb7VkTm4MdoY8LK5KI8ovVdVKwJ05qCtLvc6xvEBYhm33Fl6Pe
66cWklCrU093jQXUDTk0yBkwlcXGxJc143y1nXuurI8ZLH6q2wdDUAMHUEKdL6mz
ktHxABK4JA4Z/ylphR2Tp1TNh8xnz/ZZ5IdAe4P9K0/PerrjH8Yk7zf/48lrHTSv
LKb86Oaoq4MY4k5UNyjDUdgPB/q8CgxN2xDQaeQX69u740wrlxYtrwgHmqHrKI69
ppCzL7a3wmJzAz/97TMFhSseCH1WEQfWY+tlzktrvRofSgOGsIUtcPP10NmvGZHj
zdF79E9TiUPX4KVWssRb9A2MjBbvKA31hCQbH/OVKSWt/fItoBUbMQKtPxfnHcam
/W/gCWrOpw5Qqhc/BdFbYYsgY9J06KzifsV5e2Z6he+6Ex+rtdZNoLXc6/4QH3+9
HzsVYBP+ZUaqrMJi64KDN59C3ni4jDVs9zfYnVkC9LhzHASI5EE=
=i20C
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-net-2020-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A few fixes:
* -O3 enablement fallout, thanks to Arnd who ran this
* fixes for a few leaks, thanks to Felix
* channel 12 regulatory fix for custom regdomains
* check for a crash reported by syzbot
(NULL function is called on drivers that don't have it)
* fix TKIP replay protection after setup with some APs
(from Jouni)
* restrict obtaining some mesh data to avoid WARN_ONs
* fix deadlocks with auto-disconnect (socket owner)
* fix radar detection events with multiple devices
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It is now possible to propagate BSS color settings into the subsystem. Lets
make mac80211 also handle them so that we can send them further down the
stack into the drivers. We drop the old bss_color field and change iwlwifi
to use the new he_bss_color struct.
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191217141921.8114-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds the attributes, policy and parsing code to allow userland
to send the info about the BSS coloring settings to the kernel.
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191217141921.8114-1-john@phrozen.org
[johannes: remove the strict policy parsing, that was a misunderstanding]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case a radar event of CAC_FINISHED or RADAR_DETECTED
happens during another phy is during CAC we might need
to cancel that CAC.
If we got a radar in a channel that another phy is now
doing CAC on then the CAC should be canceled there.
If, for example, 2 phys doing CAC on the same channels,
or on comptable channels, once on of them will finish his
CAC the other might need to cancel his CAC, since it is no
longer relevant.
To fix that the commit adds an callback and implement it in
mac80211 to end CAC.
This commit also adds a call to said callback if after a radar
event we see the CAC is no longer relevant
Signed-off-by: Orr Mazor <Orr.Mazor@tandemg.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20191222145449.15792-1-Orr.Mazor@tandemg.com
[slightly reformat/reword commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In a similar fashion to previous patch, add "offload" and "trap"
indication to IPv6 routes.
This is done by using two unused bits in 'struct fib6_info' to hold
these indications. Capable drivers are expected to set these when
processing the various in-kernel route notifications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When performing L3 offload, routes and nexthops are usually programmed
into two different tables in the underlying device. Therefore, the fact
that a nexthop resides in hardware does not necessarily mean that all
the associated routes also reside in hardware and vice-versa.
While the kernel can signal to user space the presence of a nexthop in
hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag
for routes. In addition, the fact that a route resides in hardware does
not necessarily mean that the traffic is offloaded. For example,
unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap
packets to the CPU so that the kernel will be able to generate the
appropriate ICMP error packet.
This patch adds an "offload" and "trap" indications to IPv4 routes, so
that users will have better visibility into the offload process.
'struct fib_alias' is extended with two new fields that indicate if the
route resides in hardware or not and if it is offloading traffic from
the kernel or trapping packets to it. Note that the new fields are added
in the 6 bytes hole and therefore the struct still fits in a single
cache line [1].
Capable drivers are expected to invoke fib_alias_hw_flags_set() with the
route's key in order to set the flags.
The indications are dumped to user space via a new flags (i.e.,
'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the
ancillary header.
v2:
* Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set()
[1]
struct fib_alias {
struct hlist_node fa_list; /* 0 16 */
struct fib_info * fa_info; /* 16 8 */
u8 fa_tos; /* 24 1 */
u8 fa_type; /* 25 1 */
u8 fa_state; /* 26 1 */
u8 fa_slen; /* 27 1 */
u32 tb_id; /* 28 4 */
s16 fa_default; /* 32 2 */
u8 offload:1; /* 34: 0 1 */
u8 trap:1; /* 34: 1 1 */
u8 unused:6; /* 34: 2 1 */
/* XXX 5 bytes hole, try to pack */
struct callback_head rcu __attribute__((__aligned__(8))); /* 40 16 */
/* size: 56, cachelines: 1, members: 12 */
/* sum members: 50, holes: 1, sum holes: 5 */
/* sum bitfield members: 8 bits (1 bytes) */
/* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
/* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages
using the passed arguments. Currently, the function takes 11 arguments,
6 of which are attributes of the route being dumped (e.g., prefix, TOS).
The next patch will need the function to also dump to user space an
indication if the route is present in hardware or not. Instead of
passing yet another argument, change the function to take a struct
containing the different route attributes.
v2:
* Name last argument of fib_dump_info()
* Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could
later be passed to fib_alias_hw_flags_set()
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow to call macsec_pn_wrapped from hardware drivers to notify when a
PN rolls over. Some drivers might used an interrupt to implement this.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces MACsec ops for drivers to support offloading
MACsec operations.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the macsec_context structure. It will be used
in the kernel to exchange information between the common MACsec
implementation (macsec.c) and the MACsec hardware offloading
implementations. This structure contains pointers to MACsec specific
structures which contain the actual MACsec configuration, and to the
underlying device (phydev for now).
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves some structure, type and identifier definitions into a
MACsec specific header. This patch does not modify how the MACsec code
is running and only move things around. This is a preparation for the
future MACsec hardware offloading support, which will re-use those
definitions outside macsec.c.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function to obtain a unique snapshot id was mistakenly typo'd as
devlink_region_shapshot_id_get. Fix this typo by renaming the function
and all of its users.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Combine the documentation for devlink into a subfolder, and provide an
index.rst file that can be used to generally describe devlink.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "fw.psid" devlink info version is documented in devlink-info.rst,
and used by one driver. However, there is no associated macro for this
firmware version like there is for others. Add one now.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise we will find stray/unexpected/old extensions value on next
iteration.
On tcp_write_xmit() we can end-up splitting an already queued skb in two
parts, via tso_fragment(). The newly created skb can be allocated via
the tx cache and an upper layer will not be aware of it, so that upper
layer cannot set the ext properly.
Resetting the ext on recycle ensures that stale data is not propagated
in to packet headers or elsewhere.
An alternative would be add an additional hook in tso_fragment() or in
sk_stream_alloc_skb() to init the ext for upper layers that need it.
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MPTCP will make use of tcp_send_mss() and tcp_push() when sending
data to specific TCP subflows.
tcp_request_sock_ipvX_ops and ipvX_specific will be referenced
during TCP subflow creation.
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Coalesce and collapse of packets carrying MPTCP extensions is allowed
when the newer packet has no extension or the extensions carried by both
packets are equal.
This allows merging of TSO packet trains and even cross-TSO packets, and
does not require any additional action when moving data into existing
SKBs.
v3 -> v4:
- allow collapsing, under mptcp_skb_can_collapse() constraint
v5 -> v6:
- clarify MPTCP skb extensions must always be cleared at allocation
time
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add enum value for MPTCP and update config dependencies
v5 -> v6:
- fixed '__unused' field size
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If ULP is used on a listening socket, icsk_ulp_ops and icsk_ulp_data are
copied when the listener is cloned. Sometimes the clone is immediately
deleted, which will invoke the release op on the clone and likely
corrupt the listening socket's icsk_ulp_data.
The clone operation is invoked immediately after the clone is copied and
gives the ULP type an opportunity to set up the clone socket and its
icsk_ulp_data.
The MPTCP ULP clone will silently fallback to plain TCP on allocation
failure, so 'clone()' does not need to return an error code.
v6 -> v7:
- move and rename ulp clone helper to make it inline-friendly
v5 -> v6:
- clarified MPTCP clone usage in commit message
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP option 30 is allocated for MPTCP by the IANA.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Match the 16-bit width of skbuff->protocol. Fills an 8-bit hole so
sizeof(struct sock) does not change.
Also take care of BPF field access for sk_type/sk_protocol. Both of them
are now outside the bitfield, so we can use load instructions without
further shifting/masking.
v5 -> v6:
- update eBPF accessors, too (Intel's kbuild test robot)
v2 -> v3:
- keep 'sk_type' 2 bytes aligned (Eric)
v1 -> v2:
- preserve sk_pacing_shift as bit field (Eric)
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SK_PROTOCOL_MAX is only used in two places, for DECNet and AX.25. The
limits have more to do with the those protocol definitions than they do
with the data type of sk_protocol, so remove SK_PROTOCOL_MAX and use
U8_MAX directly.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ungrafting from PRIO bug fixes in net, when merged into net-next,
merge cleanly but create a build failure. The resolution used here is
from Petr Machata.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes "struct tcp_congestion_ops" to be the first user
of BPF STRUCT_OPS. It allows implementing a tcp_congestion_ops
in bpf.
The BPF implemented tcp_congestion_ops can be used like
regular kernel tcp-cc through sysctl and setsockopt. e.g.
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
net.ipv4.tcp_congestion_control = bpf_cubic
There has been attempt to move the TCP CC to the user space
(e.g. CCP in TCP). The common arguments are faster turn around,
get away from long-tail kernel versions in production...etc,
which are legit points.
BPF has been the continuous effort to join both kernel and
userspace upsides together (e.g. XDP to gain the performance
advantage without bypassing the kernel). The recent BPF
advancements (in particular BTF-aware verifier, BPF trampoline,
BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
possible in BPF. It allows a faster turnaround for testing algorithm
in the production while leveraging the existing (and continue growing)
BPF feature/framework instead of building one specifically for
userspace TCP CC.
This patch allows write access to a few fields in tcp-sock
(in bpf_tcp_ca_btf_struct_access()).
The optional "get_info" is unsupported now. It can be added
later. One possible way is to output the info with a btf-id
to describe the content.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
It is possible to stack multiple DSA switches in a way that they are not
part of the tree (disjoint) but the DSA master of a switch is a DSA
slave of another. When that happens switch drivers may have to know this
is the case so as to determine whether their tagging protocol has a
remove chance of working.
This is useful for specific switch drivers such as b53 where devices
have been known to be stacked in the wild without the Broadcom tag
protocol supporting that feature. This allows b53 to continue supporting
those devices by forcing the disabling of Broadcom tags on the outermost
switches if necessary.
The get_tag_protocol() function is therefore updated to gain an
additional enum dsa_tag_protocol argument which denotes the current
tagging protocol used by the DSA master we are attached to, else
DSA_TAG_PROTO_NONE for the top of the dsa_switch_tree.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible that a reporter recovery completion do not finish
successfully when recovery is triggered via
devlink_health_reporter_recover as recovery could be processed in
different context. In such scenario an error is returned by driver when
recover hook is invoked and successful recovery completion is
intimated later.
Expose devlink recover done API to update recovery stats.
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Missing netns context in arp_tables, from Florian Westphal.
2) Underflow in flowtable reference counter, from wenxu.
3) Fix incorrect ethernet destination address in flowtable offload,
from wenxu.
4) Check for status of neighbour entry, from wenxu.
5) Fix NAT port mangling, from wenxu.
6) Unbind callbacks from destroy path to cleanup hardware properly
on flowtable removal.
7) Fix missing casting statistics timestamp, add nf_flowtable_time_stamp
and use it.
8) NULL pointer exception when timeout argument is null in conntrack
dccp and sctp protocol helpers, from Florian Westphal.
9) Possible nul-dereference in ipset with IPSET_ATTR_LINENO, also from
Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds nf_flowtable_time_stamp and updates the existing code to
use it.
This patch is also implicitly fixing up hardware statistic fetching via
nf_flow_offload_stats() where casting to u32 is missing. Use
nf_flow_timeout_delta() to fix this.
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
The DSA drivers that implement .phylink_mac_link_state should normally
register an interrupt for the PCS, from which they should call
phylink_mac_change(). However not all switches implement this, and those
who don't should set this flag in dsa_switch in the .setup callback, so
that PHYLINK will poll for a few ms until the in-band AN link timer
expires and the PCS state settles.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are 3 things that are wrong with the DSA deferred xmit mechanism:
1. Its introduction has made the DSA hotpath ever so slightly more
inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs
to be initialized to false for every transmitted frame, in order to
figure out whether the driver requested deferral or not (a very rare
occasion, rare even for the only driver that does use this mechanism:
sja1105). That was necessary to avoid kfree_skb from freeing the skb.
2. Because L2 PTP is a link-local protocol like STP, it requires
management routes and deferred xmit with this switch. But as opposed
to STP, the deferred work mechanism needs to schedule the packet
rather quickly for the TX timstamp to be collected in time and sent
to user space. But there is no provision for controlling the
scheduling priority of this deferred xmit workqueue. Too bad this is
a rather specific requirement for a feature that nobody else uses
(more below).
3. Perhaps most importantly, it makes the DSA core adhere a bit too
much to the NXP company-wide policy "Innovate Where It Doesn't
Matter". The sja1105 is probably the only DSA switch that requires
some frames sent from the CPU to be routed to the slave port via an
out-of-band configuration (register write) rather than in-band (DSA
tag). And there are indeed very good reasons to not want to do that:
if that out-of-band register is at the other end of a slow bus such
as SPI, then you limit that Ethernet flow's throughput to effectively
the throughput of the SPI bus. So hardware vendors should definitely
not be encouraged to design this way. We do _not_ want more
widespread use of this mechanism.
Luckily we have a solution for each of the 3 issues:
For 1, we can just remove that variable in the skb->cb and counteract
the effect of kfree_skb with skb_get, much to the same effect. The
advantage, of course, being that anybody who doesn't use deferred xmit
doesn't need to do any extra operation in the hotpath.
For 2, we can create a kernel thread for each port's deferred xmit work.
If the user switch ports are named swp0, swp1, swp2, the kernel threads
will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15
character length limit on kernel thread names). With this, the user can
change the scheduling priority with chrt $(pidof swp2_xmit).
For 3, we can actually move the entire implementation to the sja1105
driver.
So this patch deletes the generic implementation from the DSA core and
adds a new one, more adequate to the requirements of PTP TX
timestamping, in sja1105_main.c.
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This handles LE PHY Update Complete event and store both tx_phy and
rx_phy into hci_conn.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The macro is really not needed and can be replaced with either usage of
bt_err_ratelimited or bt_dev_err_ratelimited.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The macro will be used to display rate limited warning messages in the
log.
Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.
With the key now based on an l3index, add the new parameter to the
lookup functions and consider the l3index when looking for a match.
The l3index comes from the skb when processing ingress packets leveraging
the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
v6 variants). When the sdif index is set it means the packet ingressed a
device that is part of an L3 domain and inet_iif points to the VRF device.
For egress, the L3 domain is determined from the socket binding and
sk_bound_dev_if.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Revert "net/sched: cls_u32: fix refcount leak in the error path of
u32_change()", and fix the u32 refcount leak in a more generic way that
preserves the semantic of rule dumping.
On tc filters that don't support lockless insertion/removal, there is no
need to guard against concurrent insertion when a removal is in progress.
Therefore, for most of them we can avoid a full walk() when deleting, and
just decrease the refcount, like it was done on older Linux kernels.
This fixes situations where walk() was wrongly detecting a non-empty
filter, like it happened with cls_u32 in the error path of change(), thus
leading to failures in the following tdc selftests:
6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id
On cls_flower, and on (future) lockless filters, this check is necessary:
move all the check_empty() logic in a callback so that each filter
can have its own implementation. For cls_flower, it's sufficient to check
if no IDRs have been allocated.
This reverts commit 275c44aa19.
Changes since v1:
- document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
is used, thanks to Vlad Buslov
- implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
- squash revert and new fix in a single patch, to be nice with bisect
tests that run tdc on u32 filter, thanks to Dave Miller
Fixes: 275c44aa19 ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
Fixes: 6676d5e416 ("net: sched: set dedicated tcf_walker flag when tp is empty")
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-12-27
The following pull-request contains BPF updates for your *net-next* tree.
We've added 127 non-merge commits during the last 17 day(s) which contain
a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).
There are three merge conflicts. Conflicts and resolution looks as follows:
1) Merge conflict in net/bpf/test_run.c:
There was a tree-wide cleanup c593642c8b ("treewide: Use sizeof_field() macro")
which gets in the way with b590cb5f80 ("bpf: Switch to offsetofend in
BPF_PROG_TEST_RUN"):
<<<<<<< HEAD
if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
sizeof_field(struct __sk_buff, priority),
=======
if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
>>>>>>> 7c8dce4b16
There are a few occasions that look similar to this. Always take the chunk with
offsetofend(). Note that there is one where the fields differ in here:
<<<<<<< HEAD
if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
sizeof_field(struct __sk_buff, tstamp),
=======
if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
>>>>>>> 7c8dce4b16
Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
850a88cc40 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").
2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:
(I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)
<<<<<<< HEAD
if (is_13b_check(off, insn))
return -1;
emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
=======
emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
>>>>>>> 7c8dce4b16
Result should look like:
emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);
3) Merge conflict in arch/riscv/include/asm/pgtable.h:
<<<<<<< HEAD
=======
#define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
#define VMALLOC_END (PAGE_OFFSET - 1)
#define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
#define BPF_JIT_REGION_SIZE (SZ_128M)
#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
#define BPF_JIT_REGION_END (VMALLOC_END)
/*
* Roughly size the vmemmap space to be large enough to fit enough
* struct pages to map half the virtual address space. Then
* position vmemmap directly below the VMALLOC region.
*/
#define VMEMMAP_SHIFT \
(CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
#define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
#define VMEMMAP_END (VMALLOC_START - 1)
#define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)
#define vmemmap ((struct page *)VMEMMAP_START)
>>>>>>> 7c8dce4b16
Only take the BPF_* defines from there and move them higher up in the
same file. Remove the rest from the chunk. The VMALLOC_* etc defines
got moved via 01f52e16b8 ("riscv: define vmemmap before pfn_to_page
calls"). Result:
[...]
#define __S101 PAGE_READ_EXEC
#define __S110 PAGE_SHARED_EXEC
#define __S111 PAGE_SHARED_EXEC
#define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
#define VMALLOC_END (PAGE_OFFSET - 1)
#define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
#define BPF_JIT_REGION_SIZE (SZ_128M)
#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
#define BPF_JIT_REGION_END (VMALLOC_END)
/*
* Roughly size the vmemmap space to be large enough to fit enough
* struct pages to map half the virtual address space. Then
* position vmemmap directly below the VMALLOC region.
*/
#define VMEMMAP_SHIFT \
(CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
#define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
#define VMEMMAP_END (VMALLOC_START - 1)
#define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)
[...]
Let me know if there are any other issues.
Anyway, the main changes are:
1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
to a provided BPF object file. This provides an alternative, simplified API
compared to standard libbpf interaction. Also, add libbpf extern variable
resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.
2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
add various BPF riscv JIT improvements, from Björn Töpel.
3) Extend bpftool to allow matching BPF programs and maps by name,
from Paul Chaignon.
4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
flag for allowing updates without service interruption, from Andrey Ignatov.
5) Cleanup and simplification of ring access functions for AF_XDP with a
bonus of 0-5% performance improvement, from Magnus Karlsson.
6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.
7) Move and extend test_select_reuseport into BPF program tests under
BPF selftests, from Jakub Sitnicki.
8) Various BPF sample improvements for xdpsock for customizing parameters
to set up and benchmark AF_XDP, from Jay Jayatheerthan.
9) Improve libbpf to provide a ulimit hint on permission denied errors.
Also change XDP sample programs to attach in driver mode by default,
from Toke Høiland-Jørgensen.
10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
programs, from Nikita V. Shirokov.
11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.
12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
libbpf conversion, from Jesper Dangaard Brouer.
13) Minor misc improvements from various others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that mlxsw is converted to use the new FIB notifications it is
possible to delete the old ones and use the new replace / append /
delete notifications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the purpose of route offload, when a single route is deleted, it is
only of interest if it is the first route in the node or if it is
sibling to such a route.
In the first case, distinguish between several possibilities:
1. Route is the last route in the node. Emit a delete notification
2. Route is followed by a non-multipath route. Emit a replace
notification for the non-multipath route.
3. Route is followed by a multipath route. Emit a replace notification
for the multipath route.
In the second case, only emit a delete notification to ensure the route
is no longer used as a valid nexthop.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subsequent patches are going to simplify the IPv6 route offload API,
which will only use three events - replace, delete and append.
Introduce a temporary version of replace and delete in order to make the
conversion easier to review. Note that append does not need a temporary
version, as it is currently not used.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.
So disable the neigh confirm for vxlan and geneve pmtu update.
v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
dst_ops.update_pmtu to control whether we should do neighbor confirm.
Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
Fixes: a93bf0ff44 ("vxlan: update skb dst pmtu on tx path")
Fixes: 52a589d51f ("geneve: update skb dst pmtu on tx path")
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Tested-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new function skb_dst_update_pmtu_no_confirm() for callers who need
update pmtu but should not do neighbor confirm.
v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
dst_ops.update_pmtu to control whether we should do neighbor confirm.
Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MTU update code is supposed to be invoked in response to real
networking events that update the PMTU. In IPv6 PMTU update function
__ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
confirmed time.
But for tunnel code, it will call pmtu before xmit, like:
- tnl_update_pmtu()
- skb_dst_update_pmtu()
- ip6_rt_update_pmtu()
- __ip6_rt_update_pmtu()
- dst_confirm_neigh()
If the tunnel remote dst mac address changed and we still do the neigh
confirm, we will not be able to update neigh cache and ping6 remote
will failed.
So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
should not be invoking dst_confirm_neigh() as we have no evidence
of successful two-way communication at this point.
On the other hand it is also important to keep the neigh reachability fresh
for TCP flows, so we cannot remove this dst_confirm_neigh() call.
To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
to choose whether we should do neigh update or not. I will add the parameter
in this patch and set all the callers to true to comply with the previous
way, and fix the tunnel code one by one on later patches.
v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
dst_ops.update_pmtu to control whether we should do neighbor confirm.
Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
Suggested-by: David Miller <davem@davemloft.net>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso,
including adding a missing ipv6 match description.
2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi
Bhat.
3) Fix uninit value in bond_neigh_init(), from Eric Dumazet.
4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold.
5) Fix use after free in tipc_disc_rcv(), from Tuong Lien.
6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul
Chaignon.
7) Multicast MAC limit test is off by one in qede, from Manish Chopra.
8) Fix established socket lookup race when socket goes from
TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening
RCU grace period. From Eric Dumazet.
9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet.
10) Fix active backup transition after link failure in bonding, from
Mahesh Bandewar.
11) Avoid zero sized hash table in gtp driver, from Taehee Yoo.
12) Fix wrong interface passed to ->mac_link_up(), from Russell King.
13) Fix DSA egress flooding settings in b53, from Florian Fainelli.
14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost.
15) Fix double free in dpaa2-ptp code, from Ioana Ciornei.
16) Reject invalid MTU values in stmmac, from Jose Abreu.
17) Fix refcount leak in error path of u32 classifier, from Davide
Caratti.
18) Fix regression causing iwlwifi firmware crashes on boot, from Anders
Kaseorg.
19) Fix inverted return value logic in llc2 code, from Chan Shu Tak.
20) Disable hardware GRO when XDP is attached to qede, frm Manish
Chopra.
21) Since we encode state in the low pointer bits, dst metrics must be
at least 4 byte aligned, which is not necessarily true on m68k. Add
annotations to fix this, from Geert Uytterhoeven.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits)
sfc: Include XDP packet headroom in buffer step size.
sfc: fix channel allocation with brute force
net: dst: Force 4-byte alignment of dst_metrics
selftests: pmtu: fix init mtu value in description
hv_netvsc: Fix unwanted rx_table reset
net: phy: ensure that phy IDs are correctly typed
mod_devicetable: fix PHY module format
qede: Disable hardware gro when xdp prog is installed
net: ena: fix issues in setting interrupt moderation params in ethtool
net: ena: fix default tx interrupt moderation interval
net/smc: unregister ib devices in reboot_event
net: stmmac: platform: Fix MDIO init for platforms without PHY
llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
net: hisilicon: Fix a BUG trigered by wrong bytes_compl
net: dsa: ksz: use common define for tag len
s390/qeth: don't return -ENOTSUPP to userspace
s390/qeth: fix promiscuous mode after reset
s390/qeth: handle error due to unsupported transport mode
cxgb4: fix refcount init for TC-MQPRIO offload
tc-testing: initial tdc selftests for cls_u32
...
When storing a pointer to a dst_metrics structure in dst_entry._metrics,
two flags are added in the least significant bits of the pointer value.
Hence this assumes all pointers to dst_metrics structures have at least
4-byte alignment.
However, on m68k, the minimum alignment of 32-bit values is 2 bytes, not
4 bytes. Hence in some kernel builds, dst_default_metrics may be only
2-byte aligned, leading to obscure boot warnings like:
WARNING: CPU: 0 PID: 7 at lib/refcount.c:28 refcount_warn_saturate+0x44/0x9a
refcount_t: underflow; use-after-free.
Modules linked in:
CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G W 5.5.0-rc2-atari-01448-g114a1a1038af891d-dirty #261
Stack from 10835e6c:
10835e6c 0038134f 00023fa6 00394b0f 0000001c 00000009 00321560 00023fea
00394b0f 0000001c 001a70f8 00000009 00000000 10835eb4 00000001 00000000
04208040 0000000a 00394b4a 10835ed4 00043aa8 001a70f8 00394b0f 0000001c
00000009 00394b4a 0026aba8 003215a4 00000003 00000000 0026d5a8 00000001
003215a4 003a4361 003238d6 000001f0 00000000 003215a4 10aa3b00 00025e84
003ddb00 10834000 002416a8 10aa3b00 00000000 00000080 000aa038 0004854a
Call Trace: [<00023fa6>] __warn+0xb2/0xb4
[<00023fea>] warn_slowpath_fmt+0x42/0x64
[<001a70f8>] refcount_warn_saturate+0x44/0x9a
[<00043aa8>] printk+0x0/0x18
[<001a70f8>] refcount_warn_saturate+0x44/0x9a
[<0026aba8>] refcount_sub_and_test.constprop.73+0x38/0x3e
[<0026d5a8>] ipv4_dst_destroy+0x5e/0x7e
[<00025e84>] __local_bh_enable_ip+0x0/0x8e
[<002416a8>] dst_destroy+0x40/0xae
Fix this by forcing 4-byte alignment of all dst_metrics structures.
Fixes: e5fd387ad5 ("ipv6: do not overwrite inetpeer metrics prematurely")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for tag format used in Atheros AR9331 built-in switch.
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the name of xsk_umem_discard_addr to xsk_umem_release_addr to
better reflect the new naming of the AF_XDP queue manipulation
functions. As this functions is used by drivers implementing support
for AF_XDP zero-copy, it requires a name change to these drivers. The
function xsk_umem_release_addr_rq has also changed name in the same
fashion.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1576759171-28550-10-git-send-email-magnus.karlsson@intel.com
Change the names of the validation functions to better reflect what
they are doing. The uppermost ones are reading entries from the rings
and only the bottom ones validate entries. So xskq_cons_read_ is a
better prefix name.
Also change the xskq_cons_read_ functions to return a bool
as the the descriptor or address is already returned by reference
in the parameters. Everyone is using the return value as a bool
anyway.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1576759171-28550-9-git-send-email-magnus.karlsson@intel.com
The xskmap flush list is used to track entries that need to flushed
from via the xdp_do_flush_map() function. This list used to be
per-map, but there is really no reason for that. Instead make the
flush list global for all xskmaps, which simplifies __xsk_map_flush()
and xsk_map_alloc().
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20191219061006.21980-5-bjorn.topel@gmail.com
There is currently no way for driver to reliably check that
the socket it has looked up is in fact RX offloaded. Add
a helper. This allows drivers to catch misbehaving firmware.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add hooks at appropriate points to make it possible to offload the ETS
Qdisc.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bit about negating HW backlog left me scratching my head. Clarify the
comment.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_pacing_shift can be read and written without lock
synchronization. This patch adds annotations to
document this fact and avoid future syzbot complains.
This might also avoid unexpected false sharing
in sk_pacing_shift_update(), as the compiler
could remove the conditional check and always
write over sk->sk_pacing_shift :
if (sk->sk_pacing_shift != val)
sk->sk_pacing_shift = val;
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sparse commit 6002ded74587 ("add a flag to warn on casts to/from
bitwise pointers") introduced a check for non-direct casts from/to
restricted datatypes (when -Wbitwise-pointer is enabled).
This triggered a warning in the 64 bit optimized ipv6_addr_is_*() functions
because sparse doesn't know that the buffer already points to some data in
the correct bitwise integer format. But these were correct and can
therefore be marked with __force to signalize sparse an intended cast to a
specific bitwise type.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sparse commit 6002ded74587 ("add a flag to warn on casts to/from
bitwise pointers") introduced a check for non-direct casts from/to
restricted datatypes (when -Wbitwise-pointer is enabled).
This triggered a warning in ipv6_get_dsfield() because sparse doesn't know
that the buffer already points to some data in the correct bitwise integer
format. This was already fixed in ipv6_change_dsfield() by the __force
attribute and can be fixed here the same way.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to how tcp_sendmsg() is implemented, we can have an empty
skb at the tail of the write queue.
Most [1] tcp_write_queue_empty() callers want to know if there is
anything to send (payload and/or FIN)
Instead of checking if the sk_write_queue is empty, we need
to test if tp->write_seq == tp->snd_nxt
[1] tcp_send_fin() was the only caller that expected to
see if an skb was in the write queue, I have changed the code
to reuse the tcp_write_queue_tail() result.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Michal Kubecek and Firo Yang did a very nice analysis of crashes
happening in __inet_lookup_established().
Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN
(via a close()/socket()/listen() cycle) without a RCU grace period,
I should not have changed listeners linkage in their hash table.
They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
so that a lookup can detect a socket in a hash list was moved in
another one.
Since we added code in commit d296ba60d8 ("soreuseport: Resolve
merge conflict for v4/v6 ordering fix"), we have to add
hlist_nulls_add_tail_rcu() helper.
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Firo Yang <firo.yang@suse.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
This patch adds a new transmit path for hardware that supports 802.11
encapsulation offloading. In those cases 802.3 frames get passed
directly to the driver allowing the hardware to handle the encapsulation.
Some features such as monitor mode and TKIP would break when encapsulation
offloading is enabled. If any of these get enabled, the code will alwyas
fallback to the normal sw encapsulation data path.
The patch defines a secondary netdev_ops struct that the device gets
assigned if 802.11 encap support is available and enabled. The driver
needs to enable the support on a per vif basis if it finds that all
pre-reqs are meet.
Signed-off-by: Vasanthakumar Thiagarajan <vthiagar@qti.qualcomm.com>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191125100438.16539-1-john@phrozen.org
[reword comments, remove SUPPORTS_80211_ENCAP HW flag, minor cleanups]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Function nl80211_validate_nested() is not specific to nl80211, it's
a counterpart to nla_validate_nested_deprecated() with strict validation.
For consistency with other validation and parse functions, rename it to
nla_validate_nested().
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unix sockets like a block box. You never know what is stored there:
there may be a file descriptor holding a mount or a block device,
or there may be whole universes with namespaces, sockets with receive
queues full of sockets etc.
The patch adds a little debug and accounts number of files (not recursive),
which is in receive queue of a unix socket. Sometimes this is useful
to determine, that socket should be investigated or which task should
be killed to put reference counter on a resourse.
v2: Pass correct argument to lockdep
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows to register a transport able to handle
local communication (loopback).
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
that disables TCP ssthresh metrics cache by default. Other parts of TCP
metrics cache, e.g. rtt, cwnd, remain unchanged.
As modern networks becoming more and more dynamic, TCP metrics cache
today often causes more harm than benefits. For example, the same IP
address is often shared by different subscribers behind NAT in residential
networks. Even if the IP address is not shared by different users,
caching the slow-start threshold of a previous short flow using loss-based
congestion control (e.g. cubic) often causes the future longer flows of
the same network path to exit slow-start prematurely with abysmal
throughput.
Caching ssthresh is very risky and can lead to terrible performance.
Therefore it makes sense to make disabling ssthresh caching by
default and opt-in for specific networks by the administrators.
This practice also has worked well for several years of deployment with
CUBIC congestion control at Google.
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().
This patch is generated using following script:
EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"
git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do
if [[ "$file" =~ $EXCLUDE_FILES ]]; then
continue
fi
sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done
Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
This is needed, because if the flag X25_ACCPT_APPRV_FLAG is not set on a
socket (manual call confirmation) and the channel is cleared by remote
before the manual call confirmation was sent, this situation needs to
be handled.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_cleanup() has not been used for seven years, and was a wrong design.
Messing with shared pointer in bond_neigh_init() without proper
memory barriers would at least trigger syzbot complains eventually.
It is time to remove this stuff.
Fixes: b63b70d877 ("IPoIB: Use a private hash table for path lookup in xmit path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP encapsulation of IKE and IPsec messages (RFC 8229) is implemented
as a TCP ULP, overriding in particular the sendmsg and recvmsg
operations. A Stream Parser is used to extract messages out of the TCP
stream using the first 2 bytes as length marker. Received IKE messages
are put on "ike_queue", waiting to be dequeued by the custom recvmsg
implementation. Received ESP messages are sent to XFRM, like with UDP
encapsulation.
Some of this code is taken from the original submission by Herbert
Xu. Currently, only IPv4 is supported, like for UDP encapsulation.
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This will be used by TCP encapsulation to write packets to the encap
socket without holding the user socket's lock. Without this reinjection,
we're already holding the lock of the user socket, and then try to lock
the encap socket as well when we enqueue the encrypted packet.
While at it, add a BUILD_BUG_ON like we usually do for skb->cb, since
it's missing for struct xfrm_trans_cb.
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
timestamp of the last synflood. Protect them with READ_ONCE() and
WRITE_ONCE() since reads and writes aren't serialised.
Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
introduced by a0f82f64e2 ("syncookies: remove last_synq_overflow from
struct tcp_sock"). But unprotected accesses were already there when
timestamp was stored in .last_synq_overflow.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When no synflood occurs, the synflood timestamp isn't updated.
Therefore it can be so old that time_after32() can consider it to be
in the future.
That's a problem for tcp_synq_no_recent_overflow() as it may report
that a recent overflow occurred while, in fact, it's just that jiffies
has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.
Spurious detection of recent overflows lead to extra syncookie
verification in cookie_v[46]_check(). At that point, the verification
should fail and the packet dropped. But we should have dropped the
packet earlier as we didn't even send a syncookie.
Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
only if jiffies is within the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
way, no spurious recent overflow is reported when jiffies wraps and
'last_overflow' becomes in the future from the point of view of
time_after32().
However, if jiffies wraps and enters the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
'last_overflow' being a stale synflood timestamp), then
tcp_synq_no_recent_overflow() still erroneously reports an
overflow. In such cases, we have to rely on syncookie verification
to drop the packet. We unfortunately have no way to differentiate
between a fresh and a stale syncookie timestamp.
In practice, using last_overflow as lower bound is problematic.
If the synflood timestamp is concurrently updated between the time
we read jiffies and the moment we store the timestamp in
'last_overflow', then 'now' becomes smaller than 'last_overflow' and
tcp_synq_no_recent_overflow() returns true, potentially dropping a
valid syncookie.
Reading jiffies after loading the timestamp could fix the problem,
but that'd require a memory barrier. Let's just accommodate for
potential timestamp growth instead and extend the interval using
'last_overflow - HZ' as lower bound.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If no synflood happens for a long enough period of time, then the
synflood timestamp isn't refreshed and jiffies can advance so much
that time_after32() can't accurately compare them any more.
Therefore, we can end up in a situation where time_after32(now,
last_overflow + HZ) returns false, just because these two values are
too far apart. In that case, the synflood timestamp isn't updated as
it should be, which can trick tcp_synq_no_recent_overflow() into
rejecting valid syncookies.
For example, let's consider the following scenario on a system
with HZ=1000:
* The synflood timestamp is 0, either because that's the timestamp
of the last synflood or, more commonly, because we're working with
a freshly created socket.
* We receive a new SYN, which triggers synflood protection. Let's say
that this happens when jiffies == 2147484649 (that is,
'synflood timestamp' + HZ + 2^31 + 1).
* Then tcp_synq_overflow() doesn't update the synflood timestamp,
because time_after32(2147484649, 1000) returns false.
With:
- 2147484649: the value of jiffies, aka. 'now'.
- 1000: the value of 'last_overflow' + HZ.
* A bit later, we receive the ACK completing the 3WHS. But
cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
says that we're not under synflood. That's because
time_after32(2147484649, 120000) returns false.
With:
- 2147484649: the value of jiffies, aka. 'now'.
- 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.
Of course, in reality jiffies would have increased a bit, but this
condition will last for the next 119 seconds, which is far enough
to accommodate for jiffie's growth.
Fix this by updating the overflow timestamp whenever jiffies isn't
within the [last_overflow, last_overflow + HZ] range. That shouldn't
have any performance impact since the update still happens at most once
per second.
Now we're guaranteed to have fresh timestamps while under synflood, so
tcp_synq_no_recent_overflow() can safely use it with time_after32() in
such situations.
Stale timestamps can still make tcp_synq_no_recent_overflow() return
the wrong verdict when not under synflood. This will be handled in the
next patch.
For 64 bits architectures, the problem was introduced with the
conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
cca9bab1b7 ("tcp: use monotonic timestamps for PAWS").
The problem has always been there on 32 bits architectures.
Fixes: cca9bab1b7 ("tcp: use monotonic timestamps for PAWS")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With indirect blocks, a driver can register for callbacks from a device
that is does not 'own', for example, a tunnel device. When registering to
or unregistering from a new device, a callback is triggered to generate
a bind/unbind event. This, in turn, allows the driver to receive any
existing rules or to properly clean up installed rules.
When first added, it was assumed that all indirect block registrations
would be for ingress offloads. However, the NFP driver can, in some
instances, support clsact qdisc binds for egress offload.
Change the name of the indirect block callback command in flow_offload to
remove the 'ingress' identifier from it. While this does not change
functionality, a follow up patch will implement a more more generic
callback than just those currently just supporting ingress offload.
Fixes: 4d12ba4278 ("nfp: flower: allow offloading of matches on 'internal' ports")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_stub uses the ip6_dst_lookup function to allow other modules to
perform IPv6 lookups. However, this function skips the XFRM layer
entirely.
All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the
ip_route_output_key and ip_route_output helpers) for their IPv4 lookups,
which calls xfrm_lookup_route(). This patch fixes this inconsistent
behavior by switching the stub to ip6_dst_lookup_flow, which also calls
xfrm_lookup_route().
This requires some changes in all the callers, as these two functions
take different arguments and have different return types.
Fixes: 5f81bd2e5d ("ipv6: export a stub for IPv6 symbols used by vxlan")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow,
as some modules currently pass a net argument without a socket to
ip6_dst_lookup. This is equivalent to commit 343d60aada ("ipv6: change
ipv6_stub_impl.ipv6_dst_lookup to take net argument").
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent commit 5c72299fba ("net: sched: cls_flower: Classify
packets using port ranges") had added filtering based on port ranges
to tc flower. However the commit missed necessary changes in hw-offload
code, so the feature gave rise to generating incorrect offloaded flow
keys in NIC.
One more detailed example is below:
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
dst_port 100-200 action drop
With the setup above, an exact match filter with dst_port == 0 will be
installed in NIC by hw-offload. IOW, the NIC will have a rule which is
equivalent to the following one.
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
dst_port 0 action drop
The behavior was caused by the flow dissector which extracts packet
data into the flow key in the tc flower. More specifically, regardless
of exact match or specified port ranges, fl_init_dissector() set the
FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
numbers from skb in skb_flow_dissect() called by fl_classify(). Note
that device drivers received the same struct flow_dissector object as
used in skb_flow_dissect(). Thus, offloaded drivers could not identify
which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was
set to struct flow_dissector in either case.
This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
tp_range field in struct fl_flow_key to recognize which filters are applied
to offloaded drivers. At this point, when filters based on port ranges
passed to drivers, drivers return the EOPNOTSUPP error because they do
not support the feature (the newly created FLOW_DISSECTOR_KEY_PORTS_RANGE
flag).
Fixes: 5c72299fba ("net: sched: cls_flower: Classify packets using port ranges")
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Partially sent record cleanup path increments an SG entry
directly instead of using sg_next(). This should not be a
problem today, as encrypted messages should be always
allocated as arrays. But given this is a cleanup path it's
easy to miss was this ever to change. Use sg_next(), and
simplify the code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looks like when BPF support was added by commit d3b18ad31f
("tls: add bpf support to sk_msg handling") and
commit d829e9c411 ("tls: convert to generic sk_msg interface")
it broke/removed the support for in-place crypto as added by
commit 4e6d47206c ("tls: Add support for inplace records
encryption").
The inplace_crypto member of struct tls_rec is dead, inited
to zero, and sometimes set to zero again. It used to be
set to 1 when record was allocated, but the skmsg code doesn't
seem to have been written with the idea of in-place crypto
in mind.
Since non trivial effort is required to bring the feature back
and we don't really have the HW to measure the benefit just
remove the left over support for now to avoid confusing readers.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"This is mostly to fix the iwlwifi regression:
1) Flush GRO state properly in iwlwifi driver, from Alexander Lobakin.
2) Validate TIPC link name with properly length macro, from John
Rutherford.
3) Fix completion init and device query timeouts in ibmvnic, from
Thomas Falcon.
4) Fix SKB size calculation for netlink messages in psample, from
Nikolay Aleksandrov.
5) Similar kind of fix for OVS flow dumps, from Paolo Abeni.
6) Handle queue allocation failure unwind properly in gve driver, we
could try to release pages we didn't allocate. From Jeroen de
Borst.
7) Serialize TX queue SKB list accesses properly in mscc ocelot
driver. From Yangbo Lu"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net:
net: usb: aqc111: Use the correct style for SPDX License Identifier
net: phy: Use the correct style for SPDX License Identifier
net: wireless: intel: iwlwifi: fix GRO_NORMAL packet stalling
net: mscc: ocelot: use skb queue instead of skbs list
net: mscc: ocelot: avoid incorrect consuming in skbs list
gve: Fix the queue page list allocated pages count
net: inet_is_local_reserved_port() port arg should be unsigned short
openvswitch: fix flow command message size
net: phy: dp83869: Fix return paths to return proper values
net: psample: fix skb_over_panic
net: usbnet: Fix -Wcast-function-type
net: hso: Fix -Wcast-function-type
net: port < inet_prot_sock(net) --> inet_port_requires_bind_service(net, port)
ibmvnic: Serialize device queries
ibmvnic: Bound waits for device queries
ibmvnic: Terminate waiting device threads after loss of service
ibmvnic: Fix completion structure initialization
net-sctp: replace some sock_net(sk) with just 'net'
net: Fix a documentation bug wrt. ip_unprivileged_port_start
tipc: fix link name length check
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- A comprehensive rewrite of the robust/PI futex code's exit handling
to fix various exit races. (Thomas Gleixner et al)
- Rework the generic REFCOUNT_FULL implementation using
atomic_fetch_* operations so that the performance impact of the
cmpxchg() loops is mitigated for common refcount operations.
With these performance improvements the generic implementation of
refcount_t should be good enough for everybody - and this got
confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
REFCOUNT_FULL entirely, leaving the generic implementation enabled
unconditionally. (Will Deacon)
- Other misc changes, fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
lkdtm: Remove references to CONFIG_REFCOUNT_FULL
locking/refcount: Remove unused 'refcount_error_report()' function
locking/refcount: Consolidate implementations of refcount_t
locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
locking/refcount: Move saturation warnings out of line
locking/refcount: Improve performance of generic REFCOUNT_FULL code
locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
locking/refcount: Remove unused refcount_*_checked() variants
locking/refcount: Ensure integer operands are treated as signed
locking/refcount: Define constants for saturation and max refcount values
futex: Prevent exit livelock
futex: Provide distinct return value when owner is exiting
futex: Add mutex around futex exit
futex: Provide state handling for exec() as well
futex: Sanitize exit state handling
futex: Mark the begin of futex exit explicitly
futex: Set task::futex_state to DEAD right after handling futex exit
futex: Split futex_mm_release() for exit/exec
exit/exec: Seperate mm_release()
futex: Replace PF_EXITPIDONE with a state
...
Any argument outside of that range would result in an out of bound
memory access, since the accessed array is 65536 bits long.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note that the sysctl write accessor functions guarantee that:
net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
invariant is maintained, and as such the max() in selinux hooks is actually spurious.
ie. even though
if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
per logic is the same as
if ((snum < inet_prot_sock(sock_net(sk)) && snum < low) || snum > high) {
it is actually functionally equivalent to:
if (snum < low || snum > high) {
which is equivalent to:
if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
even though the first clause is spurious.
But we want to hold on to it in case we ever want to change what what
inet_port_requires_bind_service() means (for example by changing
it from a, by default, [0..1024) range to some sort of set).
Test: builds, git 'grep inet_prot_sock' finds no other references
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Another merge window, another pull full of stuff:
1) Support alternative names for network devices, from Jiri Pirko.
2) Introduce per-netns netdev notifiers, also from Jiri Pirko.
3) Support MSG_PEEK in vsock/virtio, from Matias Ezequiel Vara
Larsen.
4) Allow compiling out the TLS TOE code, from Jakub Kicinski.
5) Add several new tracepoints to the kTLS code, also from Jakub.
6) Support set channels ethtool callback in ena driver, from Sameeh
Jubran.
7) New SCTP events SCTP_ADDR_ADDED, SCTP_ADDR_REMOVED,
SCTP_ADDR_MADE_PRIM, and SCTP_SEND_FAILED_EVENT. From Xin Long.
8) Add XDP support to mvneta driver, from Lorenzo Bianconi.
9) Lots of netfilter hw offload fixes, cleanups and enhancements,
from Pablo Neira Ayuso.
10) PTP support for aquantia chips, from Egor Pomozov.
11) Add UDP segmentation offload support to igb, ixgbe, and i40e. From
Josh Hunt.
12) Add smart nagle to tipc, from Jon Maloy.
13) Support L2 field rewrite by TC offloads in bnxt_en, from Venkat
Duvvuru.
14) Add a flow mask cache to OVS, from Tonghao Zhang.
15) Add XDP support to ice driver, from Maciej Fijalkowski.
16) Add AF_XDP support to ice driver, from Krzysztof Kazimierczak.
17) Support UDP GSO offload in atlantic driver, from Igor Russkikh.
18) Support it in stmmac driver too, from Jose Abreu.
19) Support TIPC encryption and auth, from Tuong Lien.
20) Introduce BPF trampolines, from Alexei Starovoitov.
21) Make page_pool API more numa friendly, from Saeed Mahameed.
22) Introduce route hints to ipv4 and ipv6, from Paolo Abeni.
23) Add UDP segmentation offload to cxgb4, Rahul Lakkireddy"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1857 commits)
libbpf: Fix usage of u32 in userspace code
mm: Implement no-MMU variant of vmalloc_user_node_flags
slip: Fix use-after-free Read in slip_open
net: dsa: sja1105: fix sja1105_parse_rgmii_delays()
macvlan: schedule bc_work even if error
enetc: add support Credit Based Shaper(CBS) for hardware offload
net: phy: add helpers phy_(un)lock_mdio_bus
mdio_bus: don't use managed reset-controller
ax88179_178a: add ethtool_op_get_ts_info()
mlxsw: spectrum_router: Fix use of uninitialized adjacency index
mlxsw: spectrum_router: After underlay moves, demote conflicting tunnels
bpf: Simplify __bpf_arch_text_poke poke type handling
bpf: Introduce BPF_TRACE_x helper for the tracing tests
bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JIT
bpf, testing: Add various tail call test cases
bpf, x86: Emit patchable direct jump as tail call
bpf: Constant map key tracking for prog array pokes
bpf: Add poke dependency tracking for prog array maps
bpf: Add initial poke descriptor table for jit images
bpf: Move owner type, jited info into array auxiliary data
...
Pull cgroup updates from Tejun Heo:
"There are several notable changes here:
- Single thread migrating itself has been optimized so that it
doesn't need threadgroup rwsem anymore.
- Freezer optimization to avoid unnecessary frozen state changes.
- cgroup ID unification so that cgroup fs ino is the only unique ID
used for the cgroup and can be used to directly look up live
cgroups through filehandle interface on 64bit ino archs. On 32bit
archs, cgroup fs ino is still the only ID in use but it is only
unique when combined with gen.
- selftest and other changes"
* 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
writeback: fix -Wformat compilation warnings
docs: cgroup: mm: Fix spelling of "list"
cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
cgroup: use cgrp->kn->id as the cgroup ID
kernfs: use 64bit inos if ino_t is 64bit
kernfs: implement custom exportfs ops and fid type
kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
kernfs: convert kernfs_node->id from union kernfs_node_id to u64
kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
netprio: use css ID instead of cgroup ID
writeback: use ino_t for inodes in tracepoints
kernfs: fix ino wrap-around detection
kselftests: cgroup: Avoid the reuse of fd after it is deallocated
cgroup: freezer: don't change task and cgroups status unnecessarily
cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
cgroup: remove cgroup_enable_task_cg_lists() optimization
cgroup: pids: use atomic64_t for pids->limit
selftests: cgroup: Run test_core under interfering stress
selftests: cgroup: Add task migration tests
...
This patch is to fix a data-race reported by syzbot:
BUG: KCSAN: data-race in sctp_assoc_migrate / sctp_hash_obj
write to 0xffff8880b67c0020 of 8 bytes by task 18908 on cpu 1:
sctp_assoc_migrate+0x1a6/0x290 net/sctp/associola.c:1091
sctp_sock_migrate+0x8aa/0x9b0 net/sctp/socket.c:9465
sctp_accept+0x3c8/0x470 net/sctp/socket.c:4916
inet_accept+0x7f/0x360 net/ipv4/af_inet.c:734
__sys_accept4+0x224/0x430 net/socket.c:1754
__do_sys_accept net/socket.c:1795 [inline]
__se_sys_accept net/socket.c:1792 [inline]
__x64_sys_accept+0x4e/0x60 net/socket.c:1792
do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffff8880b67c0020 of 8 bytes by task 12003 on cpu 0:
sctp_hash_obj+0x4f/0x2d0 net/sctp/input.c:894
rht_key_get_hash include/linux/rhashtable.h:133 [inline]
rht_key_hashfn include/linux/rhashtable.h:159 [inline]
rht_head_hashfn include/linux/rhashtable.h:174 [inline]
head_hashfn lib/rhashtable.c:41 [inline]
rhashtable_rehash_one lib/rhashtable.c:245 [inline]
rhashtable_rehash_chain lib/rhashtable.c:276 [inline]
rhashtable_rehash_table lib/rhashtable.c:316 [inline]
rht_deferred_worker+0x468/0xab0 lib/rhashtable.c:420
process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
worker_thread+0xa0/0x800 kernel/workqueue.c:2415
kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
It was caused by rhashtable access asoc->base.sk when sctp_assoc_migrate
is changing its value. However, what rhashtable wants is netns from asoc
base.sk, and for an asoc, its netns won't change once set. So we can
simply fix it by caching netns since created.
Fixes: d6c0256a60 ("sctp: add the rhashtable apis for sctp global transport hashtable")
Reported-by: syzbot+e3b35fe7918ff0ee474e@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock
from commit c8183f5489 ("s390/qeth: fix potential deadlock on
workqueue flush"), removed the code which was removed by commit
9897d583b0 ("s390/qeth: consolidate some duplicated HW cmd code").
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
In commit a82055af59 ("netfilter: nft_payload: add VLAN offload
support"), VLAN fields in struct flow_dissector_key_vlan were unionized
with the intention of introducing another field that covered the whole TCI
header. However without a wrapping struct the subfields end up sharing the
same bits. As a result, "tc filter add ... flower vlan_id 14" specifies not
only vlan_id, but also vlan_priority.
Fix by wrapping the individual VLAN fields in a struct.
Fixes: a82055af59 ("netfilter: nft_payload: add VLAN offload support")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous commit added the ability to throttle stations when they queue
too much airtime in the hardware. This commit enables the functionality by
calculating the expected airtime usage of each packet that is dequeued from
the TXQs in mac80211, and accounting that as pending airtime.
The estimated airtime for each skb is stored in the tx_info, so we can
subtract the same amount from the running total when the skb is freed or
recycled. The throttling mechanism relies on this accounting to be
accurate (i.e., that we are not freeing skbs without subtracting any
airtime they were accounted for), so we put the subtraction into
ieee80211_report_used_skb(). As an optimisation, we also subtract the
airtime on regular TX completion, zeroing out the value stored in the
packet afterwards, to avoid having to do an expensive lookup of the station
from the packet data on every packet.
This patch does *not* include any mechanism to wake a throttled TXQ again,
on the assumption that this will happen anyway as a side effect of whatever
freed the skb (most commonly a TX completion).
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-5-kyan@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In order for the Fq_CoDel algorithm integrated in mac80211 layer to operate
effectively to control excessive queueing latency, the CoDel algorithm
requires an accurate measure of how long packets stays in the queue, AKA
sojourn time. The sojourn time measured at the mac80211 layer doesn't
include queueing latency in the lower layer (firmware/hardware) and CoDel
expects lower layer to have a short queue. However, most 802.11ac chipsets
offload tasks such TX aggregation to firmware or hardware, thus have a deep
lower layer queue.
Without a mechanism to control the lower layer queue size, packets only
stay in mac80211 layer transiently before being sent to firmware queue.
As a result, the sojourn time measured by CoDel in the mac80211 layer is
almost always lower than the CoDel latency target, hence CoDel does little
to control the latency, even when the lower layer queue causes excessive
latency.
The Byte Queue Limits (BQL) mechanism is commonly used to address the
similar issue with wired network interface. However, this method cannot be
applied directly to the wireless network interface. "Bytes" is not a
suitable measure of queue depth in the wireless network, as the data rate
can vary dramatically from station to station in the same network, from a
few Mbps to over Gbps.
This patch implements an Airtime-based Queue Limit (AQL) to make CoDel work
effectively with wireless drivers that utilized firmware/hardware
offloading. AQL allows each txq to release just enough packets to the lower
layer to form 1-2 large aggregations to keep hardware fully utilized and
retains the rest of the frames in mac80211 layer to be controlled by the
CoDel algorithm.
Signed-off-by: Kan Yan <kyan@google.com>
[ Toke: Keep API to set pending airtime internal, fix nits in commit msg ]
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-4-kyan@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Felix recently added code to calculate airtime of packets to the mt76
driver. Import this into mac80211 so we can use it for airtime queue limit
calculations.
The airtime.c file is copied verbatim from the mt76 driver, and adjusted to
be usable in mac80211. This involves:
- Switching to mac80211 data structures.
- Adding support for 160 MHz channels and HE mode.
- Moving the symbol and duration calculations around a bit to avoid
rounding with the higher rates and longer symbol times used for HE rates.
The per-rate TX rate calculation is also split out to its own function so
it can be used directly for the AQL calculations later.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-3-kyan@google.com
[fix HE_GROUP_IDX() to use 3 * bw, since there are 3 _gi values]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is alike the previous change, with some additional ipv4 specific
quirk. Even when using the route hint we still have to do perform
additional per packet checks about source address validity: a new
helper is added to wrap them.
Hints are explicitly disabled if the destination is a local broadcast,
that keeps the code simple and local broadcast are a slower path anyway.
UDP flood performances vs recvmmsg() receiver:
vanilla patched delta
Kpps Kpps %
1683 1871 +11
In the worst case scenario - each packet has a different
destination address - the performance delta is within noise
range.
v3 -> v4:
- re-enable hints for forward
v2 -> v3:
- really fix build (sic) and hint usage check
- use fib4_has_custom_rules() helpers (David A.)
- add ip_extract_route_hint() helper (Edward C.)
- use prev skb as hint instead of copying data (Willem)
v1 -> v2:
- fix build issue with !CONFIG_IP_MULTIPLE_TABLES
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can use it in the next patch.
Additionally constify the helper argument.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a per namespace counter, increment it on successful creation
of any route using the source address, decrement it on deletion
of such routes.
This allows us to check easily if the routing decision in the
current namespace depends on the packet source. Will be used
by the next patch.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It wraps the namespace field with the same name, to easily
access it regardless of build options.
Suggested-by: David Ahern <dsahern@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce the following parameters in order to add the possibility to sync
DMA memory for device before putting allocated pages in the page_pool
caches:
- PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that
the driver gets from page_pool will be DMA-synced-for-device according
to the length provided by the device driver. Please note DMA-sync-for-CPU
is still device driver responsibility
- offset: DMA address offset where the DMA engine starts copying rx data
- max_len: maximum DMA memory size page_pool is allowed to flush. This
is currently used in __page_pool_alloc_pages_slow routine when pages
are allocated from page allocator
These parameters are supposed to be set by device drivers.
This optimization reduces the length of the DMA-sync-for-device.
The optimization is valid because pages are initially
DMA-synced-for-device as defined via max_len. At RX time, the driver
will perform a DMA-sync-for-CPU on the memory for the packet length.
What is important is the memory occupied by packet payload, because
this is the area CPU is allowed to read and modify. As we don't track
cache-lines written into by the CPU, simply use the packet payload length
as dma_sync_size at page_pool recycle time. This also take into account
any tail-extend.
Tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add page_pool_update_nid() to be called by page pool consumers when they
detect numa node changes.
It will update the page pool nid value to start allocating from the new
effective numa node.
This is to mitigate page pool allocating pages from a wrong numa node,
where the pool was originally allocated, and holding on to pages that
belong to a different numa node, which causes performance degradation.
For pages that are already being consumed and could be returned to the
pool by the consumer, in next patch we will add a check per page to avoid
recycling them back to the pool and return them to the page allocator.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Match on ethertype and set up protocol dependency. Check for protocol
dependency before accessing the tci field. Allow to match on the
encapsulated ethertype too.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware offload support at this stage assumes an ethernet device in
place. The flow dissector provides the intermediate representation to
express this selector, so extend it to allow to store the interface
type. Flower does not uses this, so skb_flow_dissect_meta() is not
extended to match on this new field.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch constifies the pointer to source register data that is passed
as an input parameter.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bring back tls_sw_sendpage_locked. sk_msg redirection into a socket
with TLS_TX takes the following path:
tcp_bpf_sendmsg_redir
tcp_bpf_push_locked
tcp_bpf_push
kernel_sendpage_locked
sock->ops->sendpage_locked
Also update the flags test in tls_sw_sendpage_locked to allow flag
MSG_NO_SHARED_FRAGS. bpf_tcp_sendmsg sets this.
Link: https://lore.kernel.org/netdev/CA+FuTSdaAawmZ2N8nfDDKu3XLpXBbMtcCT0q4FntDD2gn8ASUw@mail.gmail.com/T/#t
Link: https://github.com/wdebruij/kerneltools/commits/icept.2
Fixes: 0608c69c9a ("bpf: sk_msg, sock{map|hash} redirect through ULP")
Fixes: f3de19af0f ("Revert \"net/tls: remove unused function tls_sw_sendpage_locked\"")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When Jonathan change the page_pool to become responsible to its
own shutdown via deferred work queue, then the disconnect_cnt
counter was removed from xdp memory model tracepoint.
This patch change the page_pool_inflight tracepoint name to
page_pool_release, because it reflects the new responsability
better. And it reintroduces a counter that reflect the number of
times page_pool_release have been tried.
The counter is also used by the code, to only empty the alloc
cache once. With a stuck work queue running every second and
counter being 64-bit, it will overrun in approx 584 billion
years. For comparison, Earth lifetime expectancy is 7.5 billion
years, before the Sun will engulf, and destroy, the Earth.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Wildcard support for the net,iface set from Kristian Evensen.
2) Offload support for matching on the input interface.
3) Simplify matching on vlan header fields.
4) Add nft_payload_rebuild_vlan_hdr() function to rebuild the vlan
header from the vlan sk_buff metadata.
5) Pass extack to nft_flow_cls_offload_setup().
6) Add C-VLAN matching support.
7) Use time64_t in xt_time to fix y2038 overflow, from Arnd Bergmann.
8) Use time_t in nft_meta to fix y2038 overflow, also from Arnd.
9) Add flow_action_entry_next() helper function to flowtable offload
infrastructure.
10) Add IPv6 support to the flowtable offload infrastructure.
11) Support for input interface matching from postrouting,
from Phil Sutter.
12) Missing check for ndo callback in flowtable offload, from wenxu.
13) Remove conntrack parameter from flow_offload_fill_dir(), from wenxu.
14) Do not pass flow_rule object for rule removal, cookie is sufficient
to achieve this.
15) Release flow_rule object in case of error from the offload commit
path.
16) Undo offload ruleset updates if transaction fails.
17) Check for error when binding flowtable callbacks, from wenxu.
18) Always unbind flowtable callbacks when unregistering hooks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The page pool keeps track of the number of pages in flight, and
it isn't safe to remove the pool until all pages are returned.
Disallow removing the pool until all pages are back, so the pool
is always available for page producers.
Make the page pool responsible for its own delayed destruction
instead of relying on XDP, so the page pool can be used without
the xdp memory model.
When all pages are returned, free the pool and notify xdp if the
pool is registered with the xdp memory system. Have the callback
perform a table walk since some drivers (cpsw) may share the pool
among multiple xdp_rxq_info.
Note that the increment of pages_state_release_cnt may result in
inflight == 0, resulting in the pool being released.
Fixes: d956a048cd ("xdp: force mem allocator removal and periodic warning")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet
flowtable type definitions. Rename the nf_flow_rule_route() function to
nf_flow_rule_route_ipv4().
Adjust maximum number of actions, which now becomes 16 to leave
sufficient room for the IPv6 address mangling for NAT.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While it is entirely possible that this tagger format is in fact more
generic than just these 2 switch families, I don't have that knowledge.
The Seville switch in NXP T1040 has a similar frame format, but there
are enough differences (e.g. DEST field starts at bit 57 instead of 56)
that calling this file tag_vitesse.c is a bit of a stretch at the
moment. The frame format has been listed in a comment so that people who
add support for further Vitesse switches can rework this tagger while
keeping compatibility with Felix.
The "ocelot" name was chosen instead of "felix" because even the Ocelot
switch can act as a DSA device when it is used in NPI mode, and the Felix
tagger format is almost identical. Currently it is only used for the
Felix switch embedded in the NXP LS1028A chip.
The ABI for this tagger should be considered "not stable" at the moment.
The DSA tag is always placed before the Ethernet header and therefore,
we are using the long prefix for RX tags to avoid putting the DSA master
port in promiscuous mode. Once there will be an API in DSA for drivers
to request DSA masters to be in promiscuous mode unconditionally, we
will switch to the "no prefix" extraction frame header, which will save
16 padding bytes for each RX frame.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the ism module is unloaded return control from exit routine only,
if all link groups are freed.
If an IB device is thrown away return control from device removal only,
if all link groups belonging to this device are freed.
A counters for the total number of SMCD link groups per ISM device is
introduced. ism module unloading continues only if the total number of
SMCD link groups for all ISM devices is zero. ISM device
removal continues only it the total number of SMCD link groups per ISM
device has decreased to zero.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SMCD link group termination is called when peer signals its shutdown
of its corresponding link group. For regular shutdowns no connections
exist anymore. For abnormal shutdowns connections must be killed and
their DMBs must be unregistered immediately. That means the SMCR method
to delay the link group freeing several seconds does not fit.
This patch adds immediate termination of a link group and its SMCD
connections and makes sure all SMCD link group related cleanup steps
are finished.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds 'module' member in the 'struct vsock_transport'
in order to get/put the transport module. This prevents the
module unloading while sockets are assigned to it.
We increase the module refcnt when a socket is assigned to a
transport, and we decrease the module refcnt when the socket
is destructed.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the support of multiple transports in the
VSOCK core.
With the multi-transports support, we can use vsock with nested VMs
(using also different hypervisors) loading both guest->host and
host->guest transports at the same time.
Major changes:
- vsock core module can be loaded regardless of the transports
- vsock_core_init() and vsock_core_exit() are renamed to
vsock_core_register() and vsock_core_unregister()
- vsock_core_register() has a feature parameter (H2G, G2H, DGRAM)
to identify which directions the transport can handle and if it's
support DGRAM (only vmci)
- each stream socket is assigned to a transport when the remote CID
is set (during the connect() or when we receive a connection request
on a listener socket).
The remote CID is used to decide which transport to use:
- remote CID <= VMADDR_CID_HOST will use guest->host transport;
- remote CID == local_cid (guest->host transport) will use guest->host
transport for loopback (host->guest transports don't support loopback);
- remote CID > VMADDR_CID_HOST will use host->guest transport;
- listener sockets are not bound to any transports since no transport
operations are done on it. In this way we can create a listener
socket, also if the transports are not loaded or with VMADDR_CID_ANY
to listen on all transports.
- DGRAM sockets are handled as before, since only the vmci_transport
provides this feature.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All transports call __vsock_create() with the same parameters,
most of them depending on the parent socket. In order to simplify
the VSOCK core APIs exposed to the transports, this patch adds
the vsock_create_connected() callable from transports to create
a new socket when a connection request is received.
We also unexported the __vsock_create().
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio_transport and vmci_transport handle the buffer_size
sockopts in a very similar way.
In order to support multiple transports, this patch moves this
handling in the core to allow the user to change the options
also if the socket is not yet assigned to any transport.
This patch also adds the '.notify_buffer_size' callback in the
'struct virtio_transport' in order to inform the transport,
when the buffer_size is changed by the user. It is also useful
to limit the 'buffer_size' requested (e.g. virtio transports).
Acked-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since now the 'struct vsock_sock' object contains a pointer to
the transport, this patch adds a parameter to the
vsock_core_get_transport() to return the right transport
assigned to the socket.
This patch modifies also the virtio_transport_get_ops(), that
uses the vsock_core_get_transport(), adding the
'struct vsock_sock *' parameter.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a preparation to support multiple transports, this patch adds
the 'transport' member at the 'struct vsock_sock'.
This new field is initialized during the creation in the
__vsock_create() function.
This patch also renames the global 'transport' pointer to
'transport_single', since for now we're only supporting a single
transport registered at run-time.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This header file now only includes the "uapi/linux/vm_sockets.h".
We can include directly it when needed.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) New generic devlink param "enable_roce", for downstream devlink
reload support
2) Do vport ACL configuration on per vport basis when
enabling/disabling a vport. This enables to have vports enabled/disabled
outside of eswitch config for future
3) Split the code for legacy vs offloads mode and make it clear
4) Tide up vport locking and workqueue usage
5) Fix metadata enablement for ECPF
6) Make explicit use of VF property to publish IB_DEVICE_VIRTUAL_FUNCTION
7) E-Switch and flow steering core low level support and refactoring for
netfilter flowtables offload
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2019-11-13
1) Remove a unnecessary net_exit function from the xfrm interface.
From Xin Long.
2) Assign xfrm4_udp_encap_rcv to a UDP socket only if xfrm
is configured. From Alexey Dobriyan.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the dataplane hardware offload to the flowtable
infrastructure. Three new flags represent the hardware state of this
flow:
* FLOW_OFFLOAD_HW: This flow entry resides in the hardware.
* FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove
from hardware. This might be triggered by either packet path (via TCP
RST/FIN packet) or via aging.
* FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from
the hardware, the software garbage collector can remove it from the
software flowtable.
This patch supports for:
* IPv4 only.
* Aging via FLOW_CLS_STATS, no packet and byte counter synchronization
at this stage.
This patch also adds the action callback that specifies how to convert
the flow entry into the flow_rule object that is passed to the driver.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to
specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new
setup interface for the flowtable type to perform the flowtable offload
block callback configuration.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the infrastructure to support for flow entry types.
The initial type is NF_FLOW_OFFLOAD_ROUTE that stores the routing
information into the flow entry to define a fastpath for the classic
forwarding path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move rcu_head to struct flow_offload, then remove the flow_offload_entry
structure definition.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers do not have access to the flow_offload structure, hence remove
this union from this flow_offload object as well as the original comment
on top of it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify this code by storing the pointer to conntrack object in the
flow_offload structure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
after commit 4097e9d250 ("net: sched: don't use tc_action->order during
action dump"), 'act->order' is initialized but then it's no more read, so
we can just remove this member of struct tc_action.
CC: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devlink supports pair output of name and value. When the value is
binary, it must be presented in an array. If the length of the binary
value exceeds fmsg limitation, break the value into chunks internally.
Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netprio uses cgroup ID to index the priority mapping table. This is
currently okay as cgroup IDs are allocated using idr and packed.
However, cgroup IDs will be changed to use full 64bit range and won't
be packed making this impractical. netprio doesn't care what type of
IDs it uses as long as they can identify the controller instances and
are packed. Let's switch to css IDs instead of cgroup IDs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Namhyung Kim <namhyung@kernel.org>
New device parameter to enable/disable handling of RoCE traffic in the
device.
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
There is a race between driver code that does setup/cleanup of device
and devlink reload operation that in some drivers works with the same
code. Use after free could we easily obtained by running:
while true; do
echo "0000:00:10.0" >/sys/bus/pci/drivers/mlxsw_spectrum2/bind
devlink dev reload pci/0000:00:10.0 &
echo "0000:00:10.0" >/sys/bus/pci/drivers/mlxsw_spectrum2/unbind
done
Fix this by enabling reload only after setup of device is complete and
disabling it at the beginning of the cleanup process.
Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 2d8dc5bbf4 ("devlink: Add support for reload")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a new feature defined in section 5 of rfc7829: "Primary Path
Switchover". By introducing a new tunable parameter:
Primary.Switchover.Max.Retrans (PSMR)
The primary path will be changed to another active path when the path
error counter on the old primary path exceeds PSMR, so that "the SCTP
sender is allowed to continue data transmission on a new working path
even when the old primary destination address becomes active again".
This patch is to add this tunable parameter, 'ps_retrans' per netns,
sock, asoc and transport. It also allows a user to change ps_retrans
per netns by sysctl, and ps_retrans per sock/asoc/transport will be
initialized with it.
The check will be done in sctp_do_8_2_transport_strike() when this
feature is enabled.
Note this feature is disabled by initializing 'ps_retrans' per netns
as 0xffff by default, and its value can't be less than 'pf_retrans'
when changing by sysctl.
v3->v4:
- add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
sysctl 'ps_retrans'.
- add a new entry for ps_retrans on ip-sysctl.txt.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As said in rfc7829, section 3, point 12:
The SCTP stack SHOULD expose the PF state of its destination
addresses to the ULP as well as provide the means to notify the
ULP of state transitions of its destination addresses from
active to PF, and vice versa. However, it is recommended that
an SCTP stack implementing SCTP-PF also allows for the ULP to be
kept ignorant of the PF state of its destinations and the
associated state transitions, thus allowing for retention of the
simpler state transition model of [RFC4960] in the ULP.
Not only does it allow to expose the PF state to ULP, but also
allow to ignore sctp-pf to ULP.
So this patch is to add pf_expose per netns, sock and asoc. And in
sctp_assoc_control_transport(), ulp_notify will be set to false if
asoc->expose is not 'enabled' in next patch.
It also allows a user to change pf_expose per netns by sysctl, and
pf_expose per sock and asoc will be initialized with it.
Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt,
to not allow a user to query the state of a sctp-pf peer address
when pf_expose is 'disabled', as said in section 7.3.
v1->v2:
- Fix a build warning noticed by Nathan Chancellor.
v2->v3:
- set pf_expose to UNUSED by default to keep compatible with old
applications.
v3->v4:
- add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
- change this patch to 1/5, and move sctp_assoc_control_transport
change into 2/5, as Marcelo suggested.
- use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race between driver code that does setup/cleanup of device
and devlink reload operation that in some drivers works with the same
code. Use after free could we easily obtained by running:
while true; do
echo 10 > /sys/bus/netdevsim/new_device
devlink dev reload netdevsim/netdevsim10 &
echo 10 > /sys/bus/netdevsim/del_device
done
Fix this by enabling reload only after setup of device is complete and
disabling it at the beginning of the cleanup process.
Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 2d8dc5bbf4 ("devlink: Add support for reload")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* typo fixes in docs
* APIs for station separation using VLAN tags rather
than separate wifi netdevs
* some preparations for upcoming features (802.3 offload
and airtime queue limits (AQL)
* stack reduction in ieee80211_assoc_success()
* use DEFINE_DEBUGFS_ATTRIBUTE in hwsim
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl3FYNEACgkQB8qZga/f
l8TQig//aZIXq2qlrci+Tzydi5clhrog8mzGBOECRUrgAFUZB9Q1qSD0IvqQrU8o
3xY+9lDxYvaOYKAm83oyk2st8FIgGXwPmrfPMEodFlFWla4Dkev/n8+CdbVgJDGI
9j0+9tPsQFwvIIOwD1OUKzJyRUIwmJ7+f8DRhMZ8LlkHZwSx217yD1RGq8zbyXUf
tIndHM2yAY0H9ip1GPaTiGMyWN9wbpcllNq6h126OJZ0kGRLbY9X9ySoMIw8zSPF
ul1qofvhcdN8BEOqCd97JIsDF6d3fG/LZyNm+UM415C/Se9bSG1O9i9AakGkk2SW
zqSjkhqeBK5DUfMqGlh6uxx+1CH46vTZhBKuYIe2uWrdia14d88qkQE88R4cRcir
dTJHUP90XOXXh3DrDSgEsfb341E5/IMVgUIIu6U72efNqZ8eMCbsXnp8Bnpervtw
sBPYC4aTzU5wBWbl8XDgm0KWx+nh06Ah5+PXW7gzkKZQpW8qOJp4ATcDkqHrMPBf
OFd8zhYKULBVRH8dsshzXdeVOurNwazJnEN/h72U1eZTa7ZxFL+CMShpY20d4t9I
ldh6EG//b/D7WjBIpbs3VBBQ6fnblwtq41GUS0TgndoDUD1wmBf3mJL9ZvEsYHwC
OmsTaAcsAkHBPdOZDpSyyGYddxXp7YNCT6ua3lNdRAbVn5Y0A/8=
=gDUI
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-net-next-2019-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Some relatively small changes:
* typo fixes in docs
* APIs for station separation using VLAN tags rather
than separate wifi netdevs
* some preparations for upcoming features (802.3 offload
and airtime queue limits (AQL)
* stack reduction in ieee80211_assoc_success()
* use DEFINE_DEBUGFS_ATTRIBUTE in hwsim
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides an alternative mechanism for AP VLAN support where a
single netdev is used with VLAN tagged frames instead of separate
netdevs for each VLAN without tagged frames from the WLAN driver.
By setting NL80211_EXT_FEATURE_VLAN_OFFLOAD flag the driver indicates
support for a single netdev with VLAN tagged frames. Separate
VLAN-specific netdevs can be added using RTM_NEWLINK/IFLA_VLAN_ID
similarly to Ethernet. NL80211_CMD_NEW_KEY (for group keys),
NL80211_CMD_NEW_STATION, and NL80211_CMD_SET_STATION will optionally
specify vlan_id using NL80211_ATTR_VLAN_ID.
Signed-off-by: Gurumoorthi Gnanasambandhan <gguru@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20191031214640.5012-1-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>