The seg6local LWT provides the End.DT6 action, which allows to
decapsulate an outer IPv6 header containing a Segment Routing Header
(SRH), full specification is available here:
https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-05
This patch adds this action now to the seg6local BPF
interface. Since it is not mandatory that the inner IPv6 header also
contains a SRH, seg6_bpf_srh_state has been extended with a pointer to
a possible SRH of the outermost IPv6 header. This helps assessing if the
validation must be triggered or not, and avoids some calls to
ipv6_find_hdr.
v3: s/1/true, s/0/false for boolean values
v2: - changed true/false -> 1/0
- preempt_enable no longer called in first conditional block
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
kfree(NULL) is safe,so this removes NULL check before freeing the mem
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is similar TC_ACT_REDIRECT, but with a slightly different
semantic:
- on ingress the mirred skbs are passed to the target device
network stack without any additional check not scrubbing.
- the rcu-protected stats provided via the tcf_result struct
are updated on error conditions.
This new tcfa_action value is not exposed to the user-space
and can be used only internally by clsact.
v1 -> v2: do not touch TC_ACT_REDIRECT code path, introduce
a new action type instead
v2 -> v3:
- rename the new action value TC_ACT_REINJECT, update the
helper accordingly
- take care of uncloned reinjected packets in XDP generic
hook
v3 -> v4:
- renamed again the new action value (JiriP)
v4 -> v5:
- fix build error with !NET_CLS_ACT (kbuild bot)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The wait_address argument is always directly derived from the filp
argument, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an invalid MTU value is set through rtnetlink return extra error
information instead of putting message in kernel log. For other cases
where there is no visible API, keep the error report in the log.
Example:
# ip li set dev enp12s0 mtu 10000
Error: mtu greater than device maximum.
# ifconfig enp12s0 mtu 10000
SIOCSIFMTU: Invalid argument
# dmesg | tail -1
[ 2047.795467] enp12s0: mtu greater than device maximum
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report the minimum and maximum MTU allowed on a device
via netlink so that it can be displayed by tools like
ip link.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-07-28
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) API fixes for libbpf's BTF mapping of map key/value types in order
to make them compatible with iproute2's BPF_ANNOTATE_KV_PAIR()
markings, from Martin.
2) Fix AF_XDP to not report POLLIN prematurely by using the non-cached
consumer pointer of the RX queue, from Björn.
3) Fix __xdp_return() to check for NULL pointer after the rhashtable
lookup that retrieves the allocator object, from Taehee.
4) Fix x86-32 JIT to adjust ebp register in prologue and epilogue
by 4 bytes which got removed from overall stack usage, from Wang.
5) Fix bpf_skb_load_bytes_relative() length check to use actual
packet length, from Daniel.
6) Fix uninitialized return code in libbpf bpf_perf_event_read_simple()
handler, from Thomas.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The len > skb_headlen(skb) cannot be used as a maximum upper bound
for the packet length since it does not have any relation to the full
linear packet length when filtering is used from upper layers (e.g.
in case of reuseport BPF programs) as by then skb->data, skb->len
already got mangled through __skb_pull() and others.
Fixes: 4e1ec56cdc ("bpf: add skb_load_bytes_relative helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2018-07-27
1) Extend the output_mark to also support the input direction
and masking the mark values before applying to the skb.
2) Add a new lookup key for the upcomming xfrm interfaces.
3) Extend the xfrm lookups to match xfrm interface IDs.
4) Add virtual xfrm interfaces. The purpose of these interfaces
is to overcome the design limitations that the existing
VTI devices have.
The main limitations that we see with the current VTI are the
following:
VTI interfaces are L3 tunnels with configurable endpoints.
For xfrm, the tunnel endpoint are already determined by the SA.
So the VTI tunnel endpoints must be either the same as on the
SA or wildcards. In case VTI tunnel endpoints are same as on
the SA, we get a one to one correlation between the SA and
the tunnel. So each SA needs its own tunnel interface.
On the other hand, we can have only one VTI tunnel with
wildcard src/dst tunnel endpoints in the system because the
lookup is based on the tunnel endpoints. The existing tunnel
lookup won't work with multiple tunnels with wildcard
tunnel endpoints. Some usecases require more than on
VTI tunnel of this type, for example if somebody has multiple
namespaces and every namespace requires such a VTI.
VTI needs separate interfaces for IPv4 and IPv6 tunnels.
So when routing to a VTI, we have to know to which address
family this traffic class is going to be encapsulated.
This is a lmitation because it makes routing more complex
and it is not always possible to know what happens behind the
VTI, e.g. when the VTI is move to some namespace.
VTI works just with tunnel mode SAs. We need generic interfaces
that ensures transfomation, regardless of the xfrm mode and
the encapsulated address family.
VTI is configured with a combination GRE keys and xfrm marks.
With this we have to deal with some extra cases in the generic
tunnel lookup because the GRE keys on the VTI are actually
not GRE keys, the GRE keys were just reused for something else.
All extensions to the VTI interfaces would require to add
even more complexity to the generic tunnel lookup.
So to overcome this, we developed xfrm interfaces with the
following design goal:
It should be possible to tunnel IPv4 and IPv6 through the same
interface.
No limitation on xfrm mode (tunnel, transport and beet).
Should be a generic virtual interface that ensures IPsec
transformation, no need to know what happens behind the
interface.
Interfaces should be configured with a new key that must match a
new policy/SA lookup key.
The lookup logic should stay in the xfrm codebase, no need to
change or extend generic routing and tunnel lookups.
Should be possible to use IPsec hardware offloads of the underlying
interface.
5) Remove xfrm pcpu policy cache. This was added after the flowcache
removal, but it turned out to make things even worse.
From Florian Westphal.
6) Allow to update the set mark on SA updates.
From Nathan Harold.
7) Convert some timestamps to time64_t.
From Arnd Bergmann.
8) Don't check the offload_handle in xfrm code,
it is an opaque data cookie for the driver.
From Shannon Nelson.
9) Remove xfrmi interface ID from flowi. After this pach
no generic code is touched anymore to do xfrm interface
lookups. From Benedict Wong.
10) Allow to update the xfrm interface ID on SA updates.
From Nathan Harold.
11) Don't pass zero to ERR_PTR() in xfrm_resolve_and_create_bundle.
From YueHaibing.
12) Return more detailed errors on xfrm interface creation.
From Benedict Wong.
13) Use PTR_ERR_OR_ZERO instead of IS_ERR + PTR_ERR.
From the kbuild test robot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
rhashtable_lookup() can return NULL. so that NULL pointer
check routine should be added.
Fixes: 02b55e5657 ("xdp: add MEM_TYPE_ZERO_COPY")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fix dev_change_tx_queue_len so it rolls back original value
upon a failure in dev_qdisc_change_tx_queue_len.
This is already done for notifirers' failures, share the code.
In case of failure in dev_qdisc_change_tx_queue_len, some tx queues
would still be of the new length, while they should be reverted.
Currently, the revert is not done, and is marked with a TODO label
in dev_qdisc_change_tx_queue_len, and should find some nice solution
to do it.
Yet it is still better to not apply the newly requested value.
Fixes: 48bfd55e7e ("net_sched: plug in qdisc ops change_tx_queue_len")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Reported-by: Ran Rozenstein <ranro@mellanox.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files have extra line at end of file.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cited patch added a call to dev_change_tx_queue_len in
SIOCSIFTXQLEN case.
This obsoletes the new len comparison check done before the function call.
Remove it here.
For the desicion of keep/remove the negative value check, we examine the
range check in dev_change_tx_queue_len.
On 64-bit we will fail with -ERANGE. The 32-bit int ifr_qlen will be sign
extended to 64-bits when it is passed into dev_change_tx_queue_len(). And
then for negative values this test triggers:
if (new_len != (unsigned int)new_len)
return -ERANGE;
because:
if (0xffffffffWHATEVER != 0x00000000WHATEVER)
On 32-bit the signed value will be accepted, changing behavior.
Therefore, the negative value check is kept.
Fixes: 3f76df1982 ("net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current sg coalescing logic in sk_alloc_sg() (latter is used by tls and
sockmap) is not quite correct in that we do fetch the previous sg entry,
however the subsequent check whether the refilled page frag from the
socket is still the same as from the last entry with prior offset and
length matching the start of the current buffer is comparing always the
first sg list entry instead of the prior one.
Fixes: 3c4d755915 ("tls: kernel TLS support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_configure_link sets dev->rtnl_link_state to
RTNL_LINK_INITIALIZED and unconditionally calls
__dev_notify_flags to notify user-space of dev flags.
current call sequence for rtnl_configure_link
rtnetlink_newlink
rtnl_link_ops->newlink
rtnl_configure_link (unconditionally notifies userspace of
default and new dev flags)
If a newlink handler wants to call rtnl_configure_link
early, we will end up with duplicate notifications to
user-space.
This patch fixes rtnl_configure_link to check rtnl_link_state
and call __dev_notify_flags with gchanges = 0 if already
RTNL_LINK_INITIALIZED.
Later in the series, this patch will help the following sequence
where a driver implementing newlink can call rtnl_configure_link
to initialize the link early.
makes the following call sequence work:
rtnetlink_newlink
rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
link and notifies
user-space of default
dev flags)
rtnl_configure_link (updates dev flags if requested by user ifm
and notifies user-space of new dev flags)
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make net_ns_get_ownership() reusable by networking code outside of core.
This is useful, for example, to allow bridge related sysfs files to be
owned by container root.
Add a function comment since this is a potentially dangerous function to
use given the way that kobject_get_ownership() works by initializing uid
and gid before calling .get_ownership().
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating various objects in /sys/class/net/... make sure that they
belong to container's owner instead of global root (if they belong to a
container/namespace).
Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An upcoming change will allow container root to open some /sys/class/net
files for writing. The tx_maxrate attribute can result in changes
to actual hardware devices so err on the side of caution by requiring
CAP_NET_ADMIN in the init namespace in the corresponding attribute store
operation.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a patch by Sean Tranchetti.
Fixes: d4546c2509 ("net: Convert GRO SKB handling to list_head.")
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dissection of the tos and ttl from the ip tunnel headers
fields in case a match is needed on them.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GCC 8 complains:
net/core/pktgen.c: In function ‘pktgen_if_write’:
net/core/pktgen.c:1419:4: warning: ‘strncpy’ output may be truncated copying between 0 and 31 bytes from a string of length 127 [-Wstringop-truncation]
strncpy(pkt_dev->src_max, buf, len);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/core/pktgen.c:1399:4: warning: ‘strncpy’ output may be truncated copying between 0 and 31 bytes from a string of length 127 [-Wstringop-truncation]
strncpy(pkt_dev->src_min, buf, len);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/core/pktgen.c:1290:4: warning: ‘strncpy’ output may be truncated copying between 0 and 31 bytes from a string of length 127 [-Wstringop-truncation]
strncpy(pkt_dev->dst_max, buf, len);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/core/pktgen.c:1268:4: warning: ‘strncpy’ output may be truncated copying between 0 and 31 bytes from a string of length 127 [-Wstringop-truncation]
strncpy(pkt_dev->dst_min, buf, len);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There is no bug here, but the code is not perfect either. It copies
sizeof(pkt_dev->/member/) - 1 from user space into buf, and then does
a strcmp(pkt_dev->/member/, buf) hence assuming buf will be null-terminated
and shorter than pkt_dev->/member/ (pkt_dev->/member/ is never
explicitly null-terminated, and strncpy() doesn't have to null-terminate
so the assumption must be on buf). The use of strncpy() without explicit
null-termination looks suspicious. Convert to use straight strcpy().
strncpy() would also null-pad the output, but that's clearly unnecessary
since the author calls memset(pkt_dev->/member/, 0, sizeof(..)); prior
to strncpy(), anyway.
While at it format the code for "dst_min", "dst_max", "src_min" and
"src_max" in the same way by removing extra new lines in one case.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 784abe24c9 ("net: Add decrypted field to skb")
introduced a 'decrypted' field that is explicitly copied on skb
copy and clone.
Move it between headers_start[0] and headers_end[0], so that we
don't need to copy it explicitly as it's copied by the memcpy()
in __copy_skb_header().
While at it, drop the assignment in __skb_clone(), it was
already redundant.
This doesn't change the size of sk_buff or cacheline boundaries.
The 15-bits hole before tc_index becomes a 14-bits hole, and
will be again a 15-bits hole when this change is merged with
commit 8b7008620b ("net: Don't copy pfmemalloc flag in
__copy_skb_header()").
v2: as reported by kbuild test robot (oops, I forgot to build
with CONFIG_TLS_DEVICE it seems), we can't use
CHECK_SKB_FIELD() on a bit-field member. Just drop the
check for the moment being, perhaps we could think of some
magic to also check bit-field members one day.
Fixes: 784abe24c9 ("net: Add decrypted field to skb")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Smatch caught an uninitialized variable error which GCC seems
to miss.
Fixes: a25717d2b6 ("xdp: support simultaneous driver and hw XDP attachment")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
FIELD_SIZEOF() is in bytes, but we want bits.
Fixes: d9f37d01e2 ("net: convert gro_count to bitmask")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gro_hash size is 192 bytes, and uses 3 cache lines, if there is few
flows, gro_hash may be not fully used, so it is unnecessary to iterate
all gro_hash in napi_gro_flush(), to occupy unnecessary cacheline.
convert gro_count to a bitmask, and rename it as gro_bitmask, each bit
represents a element of gro_hash, only flush a gro_hash element if the
related bit is set, to speed up napi_gro_flush().
and update gro_bitmask only if it will be changed, to reduce cache
update
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a netdev feature to configure TLS RX inline crypto offload.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The decrypted bit is propogated to cloned/copied skbs.
This will be used later by the inline crypto receive side offload
of tls.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-15
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various different arm32 JIT improvements in order to optimize code emission
and make the JIT code itself more robust, from Russell.
2) Support simultaneous driver and offloaded XDP in order to allow for advanced
use-cases where some work is offloaded to the NIC and some to the host. Also
add ability for bpftool to load programs and maps beyond just the cgroup case,
from Jakub.
3) Add BPF JIT support in nfp for multiplication as well as division. For the
latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.
4) Add BTF pretty print functionality to bpftool in plain and JSON output
format, from Okash.
5) Add build and installation to the BPF helper man page into bpftool, from Quentin.
6) Add a TCP BPF callback for listening sockets which is triggered right after
the socket transitions to TCP_LISTEN state, from Andrey.
7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
tree and prints all attached programs, from Roman.
8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
packets, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-07-13
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix AF_XDP TX error reporting before final kernel release such that it
becomes consistent between copy mode and zero-copy, from Magnus.
2) Fix three different syzkaller reported issues: oob due to ld_abs
rewrite with too large offset, another oob in l3 based skb test run
and a bug leaving mangled prog in subprog JITing error path, from Daniel.
3) Fix BTF handling for bitfield extraction on big endian, from Okash.
4) Fix a missing linux/errno.h include in cgroup/BPF found by kbuild bot,
from Roman.
5) Fix xdp2skb_meta.sh sample by using just command names instead of
absolute paths for tc and ip and allow them to be redefined, from Taeung.
6) Fix availability probing for BPF seg6 helpers before final kernel ships
so they can be detected at prog load time, from Mathieu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8b7008620b ("net: Don't copy pfmemalloc flag in
__copy_skb_header()") introduced a different handling for the
pfmemalloc flag in copy and clone paths.
In __skb_clone(), now, the flag is set only if it was set in the
original skb, but not cleared if it wasn't. This is wrong and
might lead to socket buffers being flagged with pfmemalloc even
if the skb data wasn't allocated from pfmemalloc reserves. Copy
the flag instead of ORing it.
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Fixes: 8b7008620b ("net: Don't copy pfmemalloc flag in __copy_skb_header()")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the query of HW-attached program from the software one.
Introduce new .ndo_bpf command to query HW-attached program.
This will allow drivers to install different programs in HW
and SW at the same time. Netlink can now also carry multiple
programs on dump (in which case mode will be set to
XDP_ATTACHED_MULTI and user has to check per-attachment point
attributes, IFLA_XDP_PROG_ID will not be present). We reuse
IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size()
doesn't need to be updated.
Note that the installation side is still not there, since all
drivers currently reject installing more than one program at
the time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Basic operations drivers perform during xdp setup and query can
be moved to helpers in the core. Encapsulate program and flags
into a structure and add helpers. Note that the structure is
intended as the "main" program information source in the driver.
Most drivers will additionally place the program pointer in their
fast path or ring structures.
The helpers don't have a huge impact now, but they will
decrease the code duplication when programs can be installed
in HW and driver at the same time. Encapsulating the basic
operations in helpers will hopefully also reduce the number
of changes to drivers which adopt them.
Helpers could really be static inline, but they depend on
definition of struct netdev_bpf which means they'd have
to be placed in netdevice.h, an already 4500 line header.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
prog_attached of struct netdev_bpf should have been superseded
by simply setting prog_id long time ago, but we kept it around
to allow offloading drivers to communicate attachment mode (drv
vs hw). Subsequently drivers were also allowed to report back
attachment flags (prog_flags), and since nowadays only programs
attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell
the attachment mode from the flags driver reports. Remove
prog_attached member.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In preparation for support of simultaneous driver and hardware XDP
support add per-mode attributes. The catch-all IFLA_XDP_PROG_ID
will still be reported, but user space can now also access the
program ID in a new IFLA_XDP_<mode>_PROG_ID attribute.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
region_snapshot - When set enables capturing region snapshots
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for DEVLINK_CMD_REGION_READ_GET used for both reading
and dumping region data. Read allows reading from a region specific
address for given length. Dump allows reading the full region.
If only snapshot ID is provided a snapshot dump will be done.
If snapshot ID, Address and Length are provided a snapshot read
will done.
This is used for both snapshot access and will be used in the same
way to access current data on the region.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for DEVLINK_CMD_REGION_DEL used
for deleting a snapshot from a region. The snapshot ID is required.
Also added notification support for NEW and DEL of snapshots.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the support for DEVLINK_CMD_REGION_GET command to also
return the IDs of the snapshot currently present on the region.
Each reply will include a nested snapshots attribute that
can contain multiple snapshot attributes each with an ID.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for DEVLINK_CMD_REGION_GET command which is used for
querying for the supported DEV/REGION values of devlink devices.
The support is both for doit and dumpit.
Reply includes:
BUS_NAME, DEVICE_NAME, REGION_NAME, REGION_SIZE
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each device address region can store multiple snapshots,
each snapshot is identified using a different numerical ID.
This ID is used when deleting a snapshot or showing an address
region specific snapshot. This patch exposes a callback to add
a new snapshot to an address region.
The snapshot will be deleted using the destructor function
when destroying a region or when a snapshot delete command
from devlink user tool.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To restrict the driver with the snapshot ID selection a new callback
is introduced for the driver to get the snapshot ID before creating
a new snapshot. This will also allow giving the same ID for multiple
snapshots taken of different regions on the same time.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows a device to register its supported address regions.
Each address region can be accessed directly for example reading
the snapshots taken of this address space.
Drivers are not limited in the name selection for different regions.
An example of a region-name can be: pci cr-space, register-space.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pfmemalloc flag indicates that the skb was allocated from
the PFMEMALLOC reserves, and the flag is currently copied on skb
copy and clone.
However, an skb copied from an skb flagged with pfmemalloc
wasn't necessarily allocated from PFMEMALLOC reserves, and on
the other hand an skb allocated that way might be copied from an
skb that wasn't.
So we should not copy the flag on skb copy, and rather decide
whether to allow an skb to be associated with sockets unrelated
to page reclaim depending only on how it was allocated.
Move the pfmemalloc flag before headers_start[0] using an
existing 1-bit hole, so that __copy_skb_header() doesn't copy
it.
When cloning, we'll now take care of this flag explicitly,
contravening to the warning comment of __skb_clone().
While at it, restore the newline usage introduced by commit
b193722731 ("net: reorganize sk_buff for faster
__copy_skb_header()") to visually separate bytes used in
bitfields after headers_start[0], that was gone after commit
a9e419dc7b ("netfilter: merge ctinfo into nfct pointer storage
area"), and describe the pfmemalloc flag in the kernel-doc
structure comment.
This doesn't change the size of sk_buff or cacheline boundaries,
but consolidates the 15 bits hole before tc_index into a 2 bytes
hole before csum, that could now be filled more easily.
Reported-by: Patrick Talbert <ptalbert@redhat.com>
Fixes: c93bdd0e03 ("netvm: allow skb allocation to use PFMEMALLOC reserves")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09
This patch series is meant to allow support for the L2 forward offload, aka
MACVLAN offload without the need for using ndo_select_queue.
The existing solution currently requires that we use ndo_select_queue in
the transmit path if we want to associate specific Tx queues with a given
MACVLAN interface. In order to get away from this we need to repurpose the
tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
a means of accessing the queues on the lower device. As a result we cannot
offload a device that is configured as multiqueue, however it doesn't
really make sense to configure a macvlan interfaced as being multiqueue
anyway since it doesn't really have a qdisc of its own in the first place.
The big changes in this set are:
Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
Disable XPS for single queue devices
Replace accel_priv with sb_dev in ndo_select_queue
Add sb_dev parameter to fallback function for ndo_select_queue
Consolidated ndo_select_queue functions that appeared to be duplicates
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf_lwt_seg6_* helpers require CONFIG_IPV6_SEG6_BPF, and currently
return -EOPNOTSUPP to indicate unavailability. This patch forces the
BPF verifier to reject programs using these helpers when
!CONFIG_IPV6_SEG6_BPF, allowing users to more easily probe if they are
available or not.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The nes infiniband driver uses current_kernel_time() to get a nanosecond
granunarity timestamp to initialize its tcp sequence counters. This is
one of only a few remaining users of that deprecated function, so we
should try to get rid of it.
Aside from using a deprecated API, there are several problems I see here:
- Using a CLOCK_REALTIME based time source makes it predictable in
case the time base is synchronized.
- Using a coarse timestamp means it only gets updated once per jiffie,
making it even more predictable in order to avoid having to access
the hardware clock source
- The upper 2 bits are always zero because the nanoseconds are at most
999999999.
For the Linux TCP implementation, we use secure_tcp_seq(), which appears
to be appropriate here as well, and solves all the above problems.
i40iw uses a variant of the same code, so I do that same thing there
for ipv4. Unlike nes, i40e also supports ipv6, which needs to call
secure_tcpv6_seq instead.
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mark reported that syzkaller triggered a KASAN detected slab-out-of-bounds
bug in ___bpf_prog_run() with a BPF_LD | BPF_ABS word load at offset 0x8001.
After further investigation it became clear that the issue was the
BPF_LDX_MEM() which takes offset as an argument whereas it cannot encode
larger than S16_MAX offsets into it. For this synthetical case we need to
move the full address into tmp register instead and do the LDX without
immediate value.
Fixes: e0cea7ce98 ("bpf: implement ld_abs/ld_ind in native bpf")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__netif_receive_skb_core can free the skb, so we have to use the dequeue-
enqueue model when calling it from __netif_receive_skb_list_core.
Fixes: 88eb1944e1 ("net: core: propagate SKB lists through packet_type lookup")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netif_receive_skb_list_internal(), all of skb_defer_rx_timestamp(),
do_xdp_generic() and enqueue_to_backlog() can lead to kfree(skb). Thus,
we cannot wait until after they return to remove the skb from the list;
instead, we remove it first and, in the pass case, add it to a sublist
afterwards.
In the case of enqueue_to_backlog() we have already decided not to pass
when we call the function, so we do not need a sublist.
Fixes: 7da517a3bc ("net: core: Another step of skb receive list processing")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For most of these calls we can just pass NULL through to the fallback
function as the sb_dev. The only cases where we cannot are the cases where
we might be dealing with either an upper device or a driver that would
have configured things to support an sb_dev itself.
The only driver that has any significant change in this patch set should be
ixgbe as we can drop the redundant functionality that existed in both the
ndo_select_queue function and the fallback function that was passed through
to us.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused on possible on the exception cases.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch adds a generic version of the ndo_select_queue functions for
either returning 0 or selecting a queue based on the processor ID. This is
generally meant to just reduce the number of functions we have to change
in the future when we have to deal with ndo_select_queue changes.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.
The solution at is currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map as just a
way to skip the lookup of the lower device XPS map for now as that would
result in the wrong queue being picked.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.
The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUQUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.
In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to a s16 value and use the negative values to represent the subordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexei Starovoitov says:
====================
pull-request: bpf 2018-07-07
The following pull-request contains BPF updates for your *net* tree.
Plenty of fixes for different components:
1) A set of critical fixes for sockmap and sockhash, from John Fastabend.
2) fixes for several race conditions in af_xdp, from Magnus Karlsson.
3) hash map refcnt fix, from Mauricio Vasquez.
4) samples/bpf fixes, from Taeung Song.
5) ifup+mtu check for xdp_redirect, from Toshiaki Makita.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise we end up with attempting to send packets from down devices
or to send oversized packets, which may cause unexpected driver/device
behaviour. Generic XDP has already done this check, so reuse the logic
in native XDP.
Fixes: 814abfabef ("xdp: add bpf_redirect helper function")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In commit
'bpf: bpf_compute_data uses incorrect cb structure' (8108a77515)
we added the routine bpf_compute_data_end_sk_skb() to compute the
correct data_end values, but this has since been lost. In kernel
v4.14 this was correct and the above patch was applied in it
entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
we lost the piece that renamed bpf_compute_data_pointers to the
new function bpf_compute_data_end_sk_skb. This was done here,
e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
When it conflicted with the following rename patch,
6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Finally, after a refactor I thought even the function
bpf_compute_data_end_sk_skb() was no longer needed and it was
erroneously removed.
However, we never reverted the sk_skb_convert_ctx_access() usage of
tcp_skb_cb which had been committed and survived the merge conflict.
Here we fix this by adding back the helper and *_data_end_sk_skb()
usage. Using the bpf_skc_data_end mapping is not correct because it
expects a qdisc_skb_cb object but at the sock layer this is not the
case. Even though it happens to work here because we don't overwrite
any data in-use at the socket layer and the cb structure is cleared
later this has potential to create some subtle issues. But, even
more concretely the filter.c access check uses tcp_skb_cb.
And by some act of chance though,
struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb; /* 0 28 */
/* XXX 4 bytes hole, try to pack */
void * data_meta; /* 32 8 */
void * data_end; /* 40 8 */
/* size: 48, cachelines: 1, members: 3 */
/* sum members: 44, holes: 1, sum holes: 4 */
/* last cacheline: 48 bytes */
};
and then tcp_skb_cb,
struct tcp_skb_cb {
[...]
struct {
__u32 flags; /* 24 4 */
struct sock * sk_redir; /* 32 8 */
void * data_end; /* 40 8 */
} bpf; /* 24 */
};
So when we use offset_of() to track down the byte offset we get 40 in
either case and everything continues to work. Fix this mess and use
correct structures its unclear how long this might actually work for
until someone moves the structs around.
Reported-by: Martin KaFai Lau <kafai@fb.com>
Fixes: e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Fixes: 6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Multiple BPF helpers in use by sk_skb programs calculate the max
skb length using the __bpf_skb_max_len function. However, this
calculates the max length using the skb->dev pointer which can be
NULL when an sk_skb program is paired with an sk_msg program.
To force this a sk_msg program needs to redirect into the ingress
path of a sock with an attach sk_skb program. Then the the sk_skb
program would need to call one of the helpers that adjust the skb
size.
To fix the null ptr dereference use SKB_MAX_ALLOC size if no dev
is available.
Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dissect the QinQ packets to get both outer and inner vlan information,
then store to the extended flow keys.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change vlan dissector key to save vlan tpid to support both 802.1Q
and 802.1AD ethertype.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_configure_link sets dev->rtnl_link_state to
RTNL_LINK_INITIALIZED and unconditionally calls
__dev_notify_flags to notify user-space of dev flags.
current call sequence for rtnl_configure_link
rtnetlink_newlink
rtnl_link_ops->newlink
rtnl_configure_link (unconditionally notifies userspace of
default and new dev flags)
If a newlink handler wants to call rtnl_configure_link
early, we will end up with duplicate notifications to
user-space.
This patch fixes rtnl_configure_link to check rtnl_link_state
and call __dev_notify_flags with gchanges = 0 if already
RTNL_LINK_INITIALIZED.
Later in the series, this patch will help the following sequence
where a driver implementing newlink can call rtnl_configure_link
to initialize the link early.
makes the following call sequence work:
rtnetlink_newlink
rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
link and notifies
user-space of default
dev flags)
rtnl_configure_link (updates dev flags if requested by user ifm
and notifies user-space of new dev flags)
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Warning level 2 was used: -Wimplicit-fallthrough=2
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
enable_sriov - Enables Single-Root Input/Output Virtualization(SR-IOV)
characteristic of the device.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add 2 first generic parameters to devlink configuration parameters set:
internal_err_reset - When set enables reset device on internal errors.
max_macs - max number of MACs per ETH port.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add devlink_param_notify() function to support devlink param notifications.
Add notification call to devlink param set, register and unregister
functions.
Add devlink_param_value_changed() function to enable the driver notify
devlink on value change. Driver should use this function after value was
changed on any configuration mode part to driverinit.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"driverinit" configuration mode value is held by devlink to enable
the driver query the value after reload. Two additional functions
added to help the driver get/set the value from/to devlink:
devlink_param_driverinit_value_set() and
devlink_param_driverinit_value_get().
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add param set command to set value for a parameter.
Value can be set to any of the supported configuration modes.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add param get command which gets data per parameter.
Option to dump the parameters data per device.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define configuration parameters data structure.
Add functions to register and unregister the driver supported
configuration parameters table.
For each parameter registered, the driver should fill all the parameter's
fields. In case the only supported configuration mode is "driverinit"
the parameter's get()/set() functions are not required and should be set
to NULL, for any other configuration mode, these functions are required
and should be set by the driver.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 07d78363dc ("net: Convert NAPI gro list into a small hash
table.")' there is 8 hash buckets, which allows more flows to be held for
merging. but MAX_GRO_SKBS, the total held skb for merging, is 8 skb still,
limit the hash table performance.
keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb, not
the total 8 skb
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the socket error queue for reporting dropped packets if the
socket has enabled that feature through the SO_TXTIME API.
Packets are dropped either on enqueue() if they aren't accepted by the
qdisc or on dequeue() if the system misses their deadline. Those are
reported as different errors so applications can react accordingly.
Userspace can retrieve the errors through the socket error queue and the
corresponding cmsg interfaces. A struct sock_extended_err* is used for
returning the error data, and the packet's timestamp can be retrieved by
adding both ee_data and ee_info fields as e.g.:
((__u64) serr->ee_data << 32) + serr->ee_info
This feature is disabled by default and must be explicitly enabled by
applications. Enabling it can bring some overhead for the Tx cycles
of the application.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2). The argument to this socket option is a 8-bytes long struct
provided by the uapi header net_tstamp.h defined as:
struct sock_txtime {
clockid_t clockid;
u32 flags;
};
Note that new fields were added to struct sock by filling a 2-bytes
hole found in the struct. For that reason, neither the struct size or
number of cachelines were altered.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done in preparation for the upcoming time based transmission
patchset. Now that skb->tstamp will be used to hold packet's txtime,
we must ensure that it is being cleared when traversing namespaces.
Also, doing that from skb_scrub_packet() before the early return would
break our feature when tunnels are used.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The gen_stats facility will add a header for the toplevel nlattr of type
TCA_STATS2 that contains all stats added by qdisc callbacks. A reference
to this header is stored in the gnet_dump struct, and when all the
per-qdisc callbacks have finished adding their stats, the length of the
containing header will be adjusted to the right value.
However, on architectures that need padding (i.e., that don't set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS), the padding nlattr is added
before the stats, which means that the stored pointer will point to the
padding, and so when the header is fixed up, the result is just a very
big padding nlattr. Because most qdiscs also supply the legacy TCA_STATS
struct, this problem has been mostly invisible, but we exposed it with
the netlink attribute-based statistics in CAKE.
Fix the issue by fixing up the stored pointer if it points to a padding
nlattr.
Tested-by: Pete Heist <pete@heistp.net>
Tested-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generally the check should be very cheap, as the sk_buff_head is in cache.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
a major project in itself) we just let it call the regular okfn (in this
case ip_rcv_finish()) for any packets it steals, and have it give us back
a list of packets it's synchronously accepted (which normally NF_HOOK
would automatically call okfn() on, but we want to be able to potentially
pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
packet (see nf_hook_entry_hookfn()), but again, changing that can be left
for future work.
There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any
accepts earlier in the list. However, it was already possible for an
asynchronous accept to cause out-of-order receives, so presumably this is
considered OK.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__netif_receive_skb_core() does a depressingly large amount of per-packet
work that can't easily be listified, because the another_round looping
makes it nontrivial to slice up into smaller functions.
Fortunately, most of that work disappears in the fast path:
* Hardware devices generally don't have an rx_handler
* Unless you're tcpdumping or something, there is usually only one ptype
* VLAN processing comes before the protocol ptype lookup, so doesn't force
a pt_prev deliver
so normally, __netif_receive_skb_core() will run straight through and pass
back the one ptype found in ptype_base[hash of skb->protocol].
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First example of a layer splitting the list (rather than merely taking
individual packets off it).
Involves new list.h function, list_cut_before(), like list_cut_position()
but cuts on the other side of the given entry.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_receive_skb_list_internal() now processes a list and hands it
on to the next function.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple overlapping changes in stmmac driver.
Adjust skb_gro_flush_final_remcsum function signature to make GRO list
changes in net-next, as per Stephen Rothwell's example merge
resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Verify netlink attributes properly in nf_queue, from Eric Dumazet.
2) Need to bump memory lock rlimit for test_sockmap bpf test, from
Yonghong Song.
3) Fix VLAN handling in lan78xx driver, from Dave Stevenson.
4) Fix uninitialized read in nf_log, from Jann Horn.
5) Fix raw command length parsing in mlx5, from Alex Vesker.
6) Cleanup loopback RDS connections upon netns deletion, from Sowmini
Varadhan.
7) Fix regressions in FIB rule matching during create, from Jason A.
Donenfeld and Roopa Prabhu.
8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren.
9) More bpfilter build fixes/adjustments from Masahiro Yamada.
10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper
Dangaard Brouer.
11) fib_tests.sh file permissions were broken, from Shuah Khan.
12) Make sure BH/preemption is disabled in data path of mac80211, from
Denis Kenzior.
13) Don't ignore nla_parse_nested() return values in nl80211, from
Johannes berg.
14) Properly account sock objects ot kmemcg, from Shakeel Butt.
15) Adjustments to setting bpf program permissions to read-only, from
Daniel Borkmann.
16) TCP Fast Open key endianness was broken, it always took on the host
endiannness. Whoops. Explicitly make it little endian. From Yuching
Cheng.
17) Fix prefix route setting for link local addresses in ipv6, from
David Ahern.
18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva.
19) Various bpf sockmap fixes, from John Fastabend.
20) Use after free for GRO with ESP, from Sabrina Dubroca.
21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from
Eric Biggers.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
qede: Adverstise software timestamp caps when PHC is not available.
qed: Fix use of incorrect size in memcpy call.
qed: Fix setting of incorrect eswitch mode.
qed: Limit msix vectors in kdump kernel to the minimum required count.
ipvlan: call dev_change_flags when ipvlan mode is reset
ipv6: sr: fix passing wrong flags to crypto_alloc_shash()
net: fix use-after-free in GRO with ESP
tcp: prevent bogus FRTO undos with non-SACK flows
bpf: sockhash, add release routine
bpf: sockhash fix omitted bucket lock in sock_close
bpf: sockmap, fix smap_list_map_remove when psock is in many maps
bpf: sockmap, fix crash when ipv6 sock is added
net: fib_rules: bring back rule_exists to match rule during add
hv_netvsc: split sub-channel setup into async and sync
net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
atm: zatm: Fix potential Spectre v1
s390/qeth: consistently re-enable device features
s390/qeth: don't clobber buffer on async TX completion
s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]
s390/qeth: fix race when setting MAC address
...
Currently trace_sock_exceed_buf_limit() only show rmem info,
but wmem limit may also be hit.
So expose wmem info in this tracepoint as well.
Regarding memcg, I think it is better to introduce a new tracepoint(if
that is needed), i.e. trace_memcg_limit_hit other than show memcg info in
trace_sock_exceed_buf_limit.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend transmit queue sysfs attribute to configure Rx queue(s) map
per Tx queue. By default no receive queues are configured for the
Tx queue.
- /sys/class/net/eth0/queues/tx-*/xps_rxqs
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to pick Tx queue based on the Rx queue(s) map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive queue(s) map
does not apply, then the Tx queue selection falls back to CPU(s) map
based selection and finally to hashing.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use static_key for XPS maps to reduce the cost of extra map checks,
similar to how it is used for RPS and RFS. This includes static_key
'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
Rx queues map.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor XPS code to support Tx queue selection based on
CPU(s) map or Rx queue(s) map.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-07-01
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) A bpf_fib_lookup() helper fix to change the API before freeze to
return an encoding of the FIB lookup result and return the nexthop
device index in the params struct (instead of device index as return
code that we had before), from David.
2) Various BPF JIT fixes to address syzkaller fallout, that is, do not
reject progs when set_memory_*() fails since it could still be RO.
Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was
an issue, and a memory leak in s390 JIT found during review, from
Daniel.
3) Multiple fixes for sockmap/hash to address most of the syzkaller
triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(),
a missing sock_map_release() routine to hook up to callbacks, and a
fix for an omitted bucket lock in sock_close(), from John.
4) Two bpftool fixes to remove duplicated error message on program load,
and another one to close the libbpf object after program load. One
additional fix for nfp driver's BPF offload to avoid stopping offload
completely if replace of program failed, from Jakub.
5) Couple of BPF selftest fixes that bail out in some of the test
scripts if the user does not have the right privileges, from Jeffrin.
6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set
where we need to set the flag that some of the test cases are expected
to fail, from Kleber.
7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF
since it has no relation to it and lirc2 users often have configs
without cgroups enabled and thus would not be able to use it, from Sean.
8) Fix a selftest failure in sockmap by removing a useless setrlimit()
call that would set a too low limit where at the same time we are
already including bpf_rlimit.h that does the job, from Yonghong.
9) Fix BPF selftest config with missing missing NET_SCHED, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit f9d4b0c1e9 ("fib_rules: move common handling of newrule
delrule msgs into fib_nl2rule"), rule_exists got replaced by rule_find
for existing rule lookup in both the add and del paths. While this
is good for the delete path, it solves a few problems but opens up
a few invalid key matches in the add path.
$ip -4 rule add table main tos 10 fwmark 1
$ip -4 rule add table main tos 10
RTNETLINK answers: File exists
The problem here is rule_find does not check if the key masks in
the new and old rule are the same and hence ends up matching a more
secific rule. Rule key masks cannot be easily compared today without
an elaborate if-else block. Its best to introduce key masks for easier
and accurate rule comparison in the future. Until then, due to fear of
regressions this patch re-introduces older loose rule_exists during add.
Also fixes both rule_exists and rule_find to cover missing attributes.
Fixes: f9d4b0c1e9 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As noticed by Eric, we need to switch to the helper
dev_change_tx_queue_len() for SIOCSIFTXQLEN call path too,
otheriwse still miss dev_qdisc_change_tx_queue_len().
Fixes: 6a643ddb56 ("net: introduce helper dev_change_tx_queue_len()")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
which is just a noop and a little bit confusing.
__GFP_NORETRY was added by ed98df3361 ("net: use __GFP_NORETRY for
high order allocations") to prevent from the OOM killer. Yet this was
not enough because fb05e7a89f ("net: don't wait for order-3 page
allocation") didn't want an excessive reclaim for non-costly orders
so it made it completely NOWAIT while it preserved __GFP_NORETRY in
place which is now redundant.
Drop the pointless __GFP_NORETRY because this function is used as
copy&paste source for other places.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>