Commit abf4bb6b63 ("skbuff: Add the offload_mr_fwd_mark field") added
the 'offload_mr_fwd_mark' field to indicate that a packet has already
undergone L3 multicast routing by a capable device. The field is used to
prevent the kernel from forwarding a packet through a netdev through
which the device has already forwarded the packet.
Currently, no unicast packet is routed by both the device and the
kernel, but this is about to change by subsequent patches and we need to
be able to mark such packets, so that they will no be forwarded twice.
Instead of adding yet another field to 'struct sk_buff', we can just
rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet
either has a multicast or a unicast destination IP.
While at it, add a comment about both 'offload_fwd_mark' and
'offload_l3_fwd_mark'.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.
The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.
The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.
Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.
This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.
Changes
v3 -> v4
- Move skb_zcopy_set below the only kfree_skb that might cause
a premature uarg destroy before skb_zerocopy_put_abort
- Move the entire skb_shinfo assignment block, to keep that
cacheline access in one place
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.
This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweights the cost of
generating a completion notification.
tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:
ipv4 udp -t 1
tx=191312 (11938 MB) txc=0 zc=n
rx=191312 (11938 MB)
ipv4 udp -z -t 1
tx=304507 (19002 MB) txc=304507 zc=y
rx=304507 (19002 MB)
ok
ipv6 udp -t 1
tx=174485 (10888 MB) txc=0 zc=n
rx=174485 (10888 MB)
ipv6 udp -z -t 1
tx=294801 (18396 MB) txc=294801 zc=y
rx=294801 (18396 MB)
ok
Changes
v1 -> v2
- Fixup reverse christmas tree violation
v2 -> v3
- Split refcount avoidance optimization into separate patch
- Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
- Fix refcount inc on zero
- Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
This is needed since commit 5cf4a8532c ("tcp: really ignore
MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many drivers load the device's firmware image during the initialization
flow either from the flash or from the disk. Currently this option is not
controlled by the user and the driver decides from where to load the
firmware image.
'fw_load_policy' gives the ability to control this option which allows the
user to choose between different loading policies supported by the driver.
This parameter can be useful while testing and/or debugging the device. For
example, testing a firmware bug fix.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Standard kernel compilation produces the following warning:
net/core/rtnetlink.c: In function ‘rtnl_newlink’:
net/core/rtnetlink.c:3232:1: warning: the frame size of 1288 bytes is larger than 1024 bytes [-Wframe-larger-than=]
}
^
This should not really be an issue, as rtnl_newlink() stack is
generally quite shallow.
Fix the warning by allocating attributes with kmalloc() in a wrapper
and passing it down to rtnl_newlink(), avoiding complexities on error
paths.
Alternatively we could kmalloc() some structure within rtnl_newlink(),
slave attributes look like a good candidate. In practice it adds to
already rather high complexity and length of the function.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_newlink() used to create VLAs based on link kind. Since
commit ccf8dbcd06 ("rtnetlink: Remove VLA usage") statically
sized array is created on the stack, so there is no more use
for a separate code block that used to be the VLA's live range.
While at it christmas tree the variables. Note that there is
a goto-based retry so to be on the safe side the variables can
no longer be initialized in place. It doesn't seem to matter,
logically, but why make the code harder to read..
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trace events are already present for the receive entry points, to indicate
how the reception entered the stack.
This patch adds the corresponding exit trace events that will bound the
reception such that all events occurring between the entry and the exit
can be considered as part of the reception context. This greatly helps
for dependency and root cause analyses.
Without this, it is not possible with tracepoint instrumentation to
determine whether a sched_wakeup event following a netif_receive_skb
event is the result of the packet reception or a simple coincidence after
further processing by the thread. It is possible using other mechanisms
like kretprobes, but considering the "entry" points are already present,
it would be good to add the matching exit events.
In addition to linking packets with wakeups, the entry/exit event pair
can also be used to perform network stack latency analyses.
Signed-off-by: Geneviève Bastien <gbastien@versatic.net>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> (tracing side)
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
bpf-next 2018-11-30
The following pull-request contains BPF updates for your *net-next* tree.
(Getting out bit earlier this time to pull in a dependency from bpf.)
The main changes are:
1) Add libbpf ABI versioning and document API naming conventions
as well as ABI versioning process, from Andrey.
2) Add a new sk_msg_pop_data() helper for sk_msg based BPF
programs that is used in conjunction with sk_msg_push_data()
for adding / removing meta data to the msg data, from John.
3) Optimize convert_bpf_ld_abs() for 0 offset and fix various
lib and testsuite build failures on 32 bit, from David.
4) Make BPF prog dump for !JIT identical to how we dump subprogs
when JIT is in use, from Yonghong.
5) Rename btf_get_from_id() to make it more conform with libbpf
API naming conventions, from Martin.
6) Add a missing BPF kselftest config item, from Naresh.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a BPF SK_MSG program helper so that we can pop data from a
msg. We use this to pop metadata from a previous push data call.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Like the previous patch, the goal is to ease to convert nsids from one
netns to another netns.
A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when
NETNSA_TARGET_NSID is provided, thus the user can easily convert nsids.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Combined with NETNSA_TARGET_NSID, it enables to "translate" a nsid from one
netns to a nsid of another netns.
This is useful when using NETLINK_F_LISTEN_ALL_NSID because it helps the
user to interpret a nsid received from an other netns.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like it was done for link and address, add the ability to perform get/dump
in another netns by specifying a target nsid attribute.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a preparatory work. To avoid having to much arguments for the
function rtnl_net_fill(), a new structure is defined.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This argument is not used anymore.
Fixes: cab3c8ec8d ("netns: always provide the id to rtnl_net_fill()")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have been adding many new bridge options, a big number of which are
boolean but still take up netlink attribute ids and waste space in the skb.
Recently we discussed learning from link-local packets[1] and decided
yet another new boolean option will be needed, thus introducing this API
to save some bridge nl space.
The API supports changing the value of multiple boolean options at once
via the br_boolopt_multi struct which has an optmask (which options to
set, bit per opt) and optval (options' new values). Future boolean
options will only be added to the br_boolopt_id enum and then will have
to be handled in br_boolopt_toggle/get. The API will automatically
add the ability to change and export them via netlink, sysfs can use the
single boolopt function versions to do the same. The behaviour with
failing/succeeding is the same as with normal netlink option changing.
If an option requires mapping to internal kernel flag or needs special
configuration to be enabled then it should be handled in
br_boolopt_toggle. It should also be able to retrieve an option's current
state via br_boolopt_get.
v2: WARN_ON() on unsupported option as that shouldn't be possible and
also will help catch people who add new options without handling
them for both set and get. Pass down extack so if an option desires
it could set it on error and be more user-friendly.
[1] https://www.spinics.net/lists/netdev/msg532698.html
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
'offset' is constant and if it is zero, no need to subtract it
from BPF_REG_TMP.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-11-26
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Extend BTF to support function call types and improve the BPF
symbol handling with this info for kallsyms and bpftool program
dump to make debugging easier, from Martin and Yonghong.
2) Optimize LPM lookups by making longest_prefix_match() handle
multiple bytes at a time, from Eric.
3) Adds support for loading and attaching flow dissector BPF progs
from bpftool, from Stanislav.
4) Extend the sk_lookup() helper to be supported from XDP, from Nitin.
5) Enable verifier to support narrow context loads with offset > 0
to adapt to LLVM code generation (currently only offset of 0 was
supported). Add test cases as well, from Andrey.
6) Simplify passing device functions for offloaded BPF progs by
adding callbacks to bpf_prog_offload_ops instead of ndo_bpf.
Also convert nfp and netdevsim to make use of them, from Quentin.
7) Add support for sock_ops based BPF programs to send events to
the perf ring-buffer through perf_event_output helper, from
Sowmini and Daniel.
8) Add read / write support for skb->tstamp from tc BPF and cg BPF
programs to allow for supporting rate-limiting in EDT qdiscs
like fq from BPF side, from Vlad.
9) Extend libbpf API to support map in map types and add test cases
for it as well to BPF kselftests, from Nikita.
10) Account the maximum packet offset accessed by a BPF program in
the verifier and use it for optimizing nfp JIT, from Jiong.
11) Fix error handling regarding kprobe_events in BPF sample loader,
from Daniel T.
12) Add support for queue and stack map type in bpftool, from David.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
I do not see how one can effectively use skb_insert() without holding
some kind of lock. Otherwise other cpus could have changed the list
right before we have a chance of acquiring list->lock.
Only existing user is in drivers/infiniband/hw/nes/nes_mgt.c and this
one probably meant to use __skb_insert() since it appears nesqp->pau_list
is protected by nesqp->pau_lock. This looks like nesqp->pau_lock
could be removed, since nesqp->pau_list.lock could be used instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Faisal Latif <faisal.latif@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma <linux-rdma@vger.kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We very often have few flows/chains to look at, and we
might increase GRO_HASH_BUCKETS to 32 or 64 in the future.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric noted that with UDP GRO and NAPI timeout, we could keep a single
UDP packet inside the GRO hash forever, if the related NAPI instance
calls napi_gro_complete() at an higher frequency than the NAPI timeout.
Willem noted that even TCP packets could be trapped there, till the
next retransmission.
This patch tries to address the issue, flushing the old packets -
those with a NAPI_GRO_CB age before the current jiffy - before scheduling
the NAPI timeout. The rationale is that such a timeout should be
well below a jiffy and we are not flushing packets eligible for sane GRO.
v1 -> v2:
- clarified the commit message and comment
RFC -> v1:
- added 'Fixes tags', cleaned-up the wording.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 3b47d30396 ("net: gro: add a per device gro flush timer")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This could be used to rate limit egress traffic in concert with a qdisc
which supports Earliest Departure Time, such as FQ.
Write access from cg skb progs only with CAP_SYS_ADMIN, since the value
will be used by downstream qdiscs. It might make sense to relax this.
Changes v1 -> v2:
- allow access from cg skb, write only with CAP_SYS_ADMIN
Signed-off-by: Vlad Dumitrescu <vladum@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a packet is trapped and the corresponding SKB marked as
already-forwarded, it retains this marking even after it is forwarded
across veth links into another bridge. There, since it ingresses the
bridge over veth, which doesn't have offload_fwd_mark, it triggers a
warning in nbp_switchdev_frame_mark().
Then nbp_switchdev_allowed_egress() decides not to allow egress from
this bridge through another veth, because the SKB is already marked, and
the mark (of 0) of course matches. Thus the packet is incorrectly
blocked.
Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That
function is called from tunnels and also from veth, and thus catches the
cases where traffic is forwarded between bridges and transformed in a
way that invalidates the marking.
Fixes: 6bc506b4fb ("bridge: switchdev: Add forward mark support for stacked devices")
Fixes: abf4bb6b63 ("skbuff: Add the offload_mr_fwd_mark field")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a packet is trapped and the corresponding SKB marked as
already-forwarded, it retains this marking even after it is forwarded
across veth links into another bridge. There, since it ingresses the
bridge over veth, which doesn't have offload_fwd_mark, it triggers a
warning in nbp_switchdev_frame_mark().
Then nbp_switchdev_allowed_egress() decides not to allow egress from
this bridge through another veth, because the SKB is already marked, and
the mark (of 0) of course matches. Thus the packet is incorrectly
blocked.
Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That
function is called from tunnels and also from veth, and thus catches the
cases where traffic is forwarded between bridges and transformed in a
way that invalidates the marking.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eth_type_trans() assumes initial value for skb->pkt_type
is PACKET_HOST.
This is indeed the value right after a fresh skb allocation.
However, it is possible that GRO merged a packet with a different
value (like PACKET_OTHERHOST in case macvlan is used), so
we need to make sure napi->skb will have pkt_type set back to
PACKET_HOST.
Otherwise, valid packets might be dropped by the stack because
their pkt_type is not PACKET_HOST.
napi_reuse_skb() was added in commit 96e93eab20 ("gro: Add
internal interfaces for VLAN"), but this bug always has
been there.
Fixes: 96e93eab20 ("gro: Add internal interfaces for VLAN")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace VLAN_TAG_PRESENT with single bit flag and free up
VLAN.CFI overload. Now VLAN.CFI is visible in networking stack
and can be passed around intact.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make bpf_sk_lookup_tcp, bpf_sk_lookup_udp and bpf_sk_release helpers
available in programs of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR.
Such programs operate on sockets and have access to socket and struct
sockaddr passed by user to system calls such as sys_bind, sys_connect,
sys_sendmsg.
It's useful to be able to lookup other sockets from these programs.
E.g. sys_connect may lookup IP:port endpoint and if there is a server
socket bound to that endpoint ("server" can be defined by saddr & sport
being zero), redirect client connection to it by rewriting IP:port in
sockaddr passed to sys_connect.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lookup functions in sk_lookup have different expectations about byte
order of provided arguments.
Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
expect dport to be in network byte order and do ntohs(dport) internally.
At the same time __inet6_lookup expects dport to be in host byte order
and correspondingly name the argument hnum.
sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
__inet6_lookup with regard to dport. But in __udp6_lib_lookup case it
uses host instead of expected network byte order. It makes result
returned by bpf_sk_lookup_udp for IPv6 incorrect.
The patch fixes byte order of dport passed to __udp6_lib_lookup.
Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f0
fixes TCPv6 but breaks UDPv6.
Fixes: 5ef0ae84f0 ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
if list is NULL pointer, and the following access of list
will trigger panic, which is same as BUG_ON
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently netdev_rx_csum_fault() only shows a device name,
we need more information about the skb for debugging csum
failures.
Sample output:
ens3: hw csum failure
dev features: 0x0000000000014b89
skb len=84 data_len=0 pkt_type=0 gso_size=0 gso_type=0 nr_frags=0 ip_summed=0 csum=0 csum_complete_sw=0 csum_valid=0 csum_level=0
Note, I use pr_err() just to be consistent with the existing one.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a part of sk_reuseport support for sctp. It defines a helper
sctp_bind_addrs_check() to check if the bind_addrs in two socks are
matched. It will add sock_reuseport if they are completely matched,
and return err if they are partly matched, and alloc sock_reuseport
if all socks are not matched at all.
It will work until sk_reuseport support is added in
sctp_get_port_local() in the next patch.
v1->v2:
- use 'laddr->valid && laddr2->valid' check instead as Marcelo
pointed in sctp_bind_addrs_check().
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only first fragment has the sport/dport information,
not the following ones.
If we want consistent hash for all fragments, we need to
ignore ports even for first fragment.
This bug is visible for IPv6 traffic, if incoming fragments
do not have a flow label, since skb_get_hash() will give
different results for first fragment and following ones.
It is also visible if any routing rule wants dissection
and sport or dport.
See commit 5e5d6fed37 ("ipv6: route: dissect flow
in input path if fib rules need it") for details.
[edumazet] rewrote the changelog completely.
Fixes: 06635a35d1 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Signed-off-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch proposes to extend the sk_lookup() BPF API to the
XDP hookpoint. The sk_lookup() helper supports a lookup
on incoming packet to find the corresponding socket that will
receive this packet. Current support for this BPF API is
at the tc hookpoint. This patch will extend this API at XDP
hookpoint. A XDP program can map the incoming packet to the
5-tuple parameter and invoke the API to find the corresponding
socket structure.
Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch allows eBPF programs that use sock_ops to send perf
based event notifications using bpf_perf_event_output(). Our main
use case for this is the following:
We would like to monitor some subset of TCP sockets in user-space,
(the monitoring application would define 4-tuples it wants to monitor)
using TCP_INFO stats to analyze reported problems. The idea is to
use those stats to see where the bottlenecks are likely to be ("is
it application-limited?" or "is there evidence of BufferBloat in
the path?" etc).
Today we can do this by periodically polling for tcp_info, but this
could be made more efficient if the kernel would asynchronously
notify the application via tcp_info when some "interesting"
thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
are reached. And to make this effective, it is better if
we could apply the threshold check *before* constructing the
tcp_info netlink notification, so that we don't waste resources
constructing notifications that will be discarded by the filter.
This work solves the problem by adding perf event based notification
support for sock_ops. The eBPF program can thus be designed to apply
any desired filters to the bpf_sock_ops and trigger a perf event
notification based on the evaluation from the filter. The user space
component can use these perf event notifications to either read any
state managed by the eBPF program, or issue a TCP_INFO netlink call
if desired.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This removes assumptions about VLAN_TAG_PRESENT bit.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_checksum_complete_head() and __skb_checksum_complete()
are both declared in skbuff.h, they fit better in skbuff.c
than datagram.c.
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to avoid all table update, and only remove or add new
address, the auxiliary function exists, named __hw_addr_sync_dev().
It allows end driver do nothing when nothing changed and add/rm when
concrete address is firstly added or lastly removed. But it doesn't
include cases when an address of real device or vlan was reused by
other vlans or vlan/macval devices.
For handaling events when address was reused/unreused the patch adds
new auxiliary routine - __hw_addr_ref_sync_dev(). It allows to do
nothing when nothing was changed and do updates only for an address
being added/reused/deleted/unreused. Thus, clone address changes for
vlans can be mirrored in the table. The function is exclusive with
__hw_addr_sync_dev(). It's responsibility of the end driver to
identify address vlan device, if it needs so.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting the SO_MARK socket option, if the mark changes, the dst
needs to be reset so that a new route lookup is performed.
This fixes the case where an application wants to change routing by
setting a new sk_mark. If this is done after some packets have already
been sent, the dst is cached and has no effect.
Signed-off-by: David Barmann <david.barmann@stackpath.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure an unbound datagram skt is chosen when not in a VRF. The check
for a device match in compute_score() for UDP must be performed when
there is no device match. For this, a failure is returned when there is
no device match. This ensures that bound sockets are never selected,
even if there is no unbound socket.
Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
packets are currently blocked, as flowi6_oif was set to that of the
master vrf device, and the ipi6_ifindex is that of the slave device.
Allow these packets to be sent by checking the device with ipi6_ifindex
has the same L3 scope as that of the bound device of the skt, which is
the master vrf device. Note that this check always succeeds if the skt
is unbound.
Even though the right datagram skt is now selected by compute_score(),
a different skt is being returned that is bound to the wrong vrf. The
difference between these and stream sockets is the handling of the skt
option for SO_REUSEPORT. While the handling when adding a skt for reuse
correctly checks that the bound device of the skt is a match, the skts
in the hashslot are already incorrect. So for the same hash, a skt for
the wrong vrf may be selected for the required port. The root cause is
that the skt is immediately placed into a slot when it is created,
but when the skt is then bound using SO_BINDTODEVICE, it remains in the
same slot. The solution is to move the skt to the correct slot by
forcing a rehash.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack arg to the nla_parse_nested calls in rtnl_newlink, and
add messages for unknown device type and link network namespace id.
In particular, it improves the failure message when the wrong link
type is used. From
$ ip li add bond1 type bonding
RTNETLINK answers: Operation not supported
to
$ ip li add bond1 type bonding
Error: Unknown device type.
(The module name is bonding but the link type is bond.)
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack arg to rtnl_create_link and add messages for invalid
number of Tx or Rx queues.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPPROTO_RAW isn't registred as an inet protocol, so
inet_protos[protocol] is always NULL for it.
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Fixes: bf2ae2e4bf ("sock_diag: request _diag module only when the family or proto has been registered")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason to discard using source link local address when
remote netconsole IPv6 address is set to be link local one.
The patch allows administrators to use IPv6 netconsole without
explicitly configuring source address:
netconsole=@/,@fe80::5054:ff:fe2f:6012/
Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
For non-zero return from dumpit() we should break the loop
in rtnl_dump_all() and return the result. Otherwise, e.g.,
we could get the memory leak in inet6_dump_fib() [1]. The
pointer to the allocated struct fib6_walker there (saved
in cb->args) can be lost, reset on the next iteration.
Fix it by partially restoring the previous behavior before
commit c63586dc9b ("net: rtnl_dump_all needs to propagate
error from dumpit function"). The returned error from
dumpit() is still passed further.
[1]:
unreferenced object 0xffff88001322a200 (size 96):
comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s)
hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................
18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff ..A6......A6....
backtrace:
[<0000000095846b39>] kmem_cache_alloc_trace+0x151/0x220
[<000000007d12709f>] inet6_dump_fib+0x68d/0x940
[<000000002775a316>] rtnl_dump_all+0x1d9/0x2d0
[<00000000d7cd302b>] netlink_dump+0x945/0x11a0
[<000000002f43485f>] __netlink_dump_start+0x55d/0x800
[<00000000f76bbeec>] rtnetlink_rcv_msg+0x4fa/0xa00
[<000000009b5761f3>] netlink_rcv_skb+0x29c/0x420
[<0000000087a1dae1>] rtnetlink_rcv+0x15/0x20
[<00000000691b703b>] netlink_unicast+0x4e3/0x6c0
[<00000000b5be0204>] netlink_sendmsg+0x7f2/0xba0
[<0000000096d2aa60>] sock_sendmsg+0xba/0xf0
[<000000008c1b786f>] __sys_sendto+0x1e4/0x330
[<0000000019587b3f>] __x64_sys_sendto+0xe1/0x1a0
[<00000000071f4d56>] do_syscall_64+0x9f/0x300
[<000000002737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0000000057587684>] 0xffffffffffffffff
Fixes: c63586dc9b ("net: rtnl_dump_all needs to propagate error from dumpit function")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>