I changed to count sk_wmem_alloc by skb truesize instead of 1 to
fix the sk_wmem_alloc leak caused by later truesize's change in
xfrm in Commit 02968ccf01 ("sctp: count sk_wmem_alloc by skb
truesize in sctp_packet_transmit").
But I should have also increased sk_wmem_alloc when head->truesize
is increased in sctp_packet_gso_append() as xfrm does. Otherwise,
sctp gso packet will cause sk_wmem_alloc underflow.
Fixes: 02968ccf01 ("sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have at least one bool option, we can export all of the
supported bool options via optmask when dumping them.
v2: new patch
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new boolopt API to add an option which disables learning from
link-local packets. The default is kept as before and learning is
enabled. This is a simple map from a boolopt bit to a bridge private
flag that is tested before learning.
v2: pass NULL for extack via sysfs
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have been adding many new bridge options, a big number of which are
boolean but still take up netlink attribute ids and waste space in the skb.
Recently we discussed learning from link-local packets[1] and decided
yet another new boolean option will be needed, thus introducing this API
to save some bridge nl space.
The API supports changing the value of multiple boolean options at once
via the br_boolopt_multi struct which has an optmask (which options to
set, bit per opt) and optval (options' new values). Future boolean
options will only be added to the br_boolopt_id enum and then will have
to be handled in br_boolopt_toggle/get. The API will automatically
add the ability to change and export them via netlink, sysfs can use the
single boolopt function versions to do the same. The behaviour with
failing/succeeding is the same as with normal netlink option changing.
If an option requires mapping to internal kernel flag or needs special
configuration to be enabled then it should be handled in
br_boolopt_toggle. It should also be able to retrieve an option's current
state via br_boolopt_get.
v2: WARN_ON() on unsupported option as that shouldn't be possible and
also will help catch people who add new options without handling
them for both set and get. Pass down extack so if an option desires
it could set it on error and be more user-friendly.
[1] https://www.spinics.net/lists/netdev/msg532698.html
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
All lists that reach the tree_nodes_free() function have both zero
counter and true dead flag. The reason for this is that lists to be
release are selected by nf_conncount_gc_list() which already decrements
the list counter and sets on the dead flag. Therefore, this if statement
in tree_nodes_free() is unnecessary and wrong.
Fixes: 31568ec09e ("netfilter: nf_conncount: fix list_del corruption in conn_free")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
register_{netdevice/inetaddr/inet6addr}_notifier may return an error
value, this patch adds the code to handle these error paths.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When ip6_route_me_harder is invoked, it resets outgoing interface of:
- link-local scoped packets sent by neighbor discovery
- multicast packets sent by MLD host
- multicast packets send by MLD proxy daemon that sets outgoing
interface through IPV6_PKTINFO ipi6_ifindex
Link-local and multicast packets must keep their original oif after
ip6_route_me_harder is called.
Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
'offset' is constant and if it is zero, no need to subtract it
from BPF_REG_TMP.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-11-26
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Extend BTF to support function call types and improve the BPF
symbol handling with this info for kallsyms and bpftool program
dump to make debugging easier, from Martin and Yonghong.
2) Optimize LPM lookups by making longest_prefix_match() handle
multiple bytes at a time, from Eric.
3) Adds support for loading and attaching flow dissector BPF progs
from bpftool, from Stanislav.
4) Extend the sk_lookup() helper to be supported from XDP, from Nitin.
5) Enable verifier to support narrow context loads with offset > 0
to adapt to LLVM code generation (currently only offset of 0 was
supported). Add test cases as well, from Andrey.
6) Simplify passing device functions for offloaded BPF progs by
adding callbacks to bpf_prog_offload_ops instead of ndo_bpf.
Also convert nfp and netdevsim to make use of them, from Quentin.
7) Add support for sock_ops based BPF programs to send events to
the perf ring-buffer through perf_event_output helper, from
Sowmini and Daniel.
8) Add read / write support for skb->tstamp from tc BPF and cg BPF
programs to allow for supporting rate-limiting in EDT qdiscs
like fq from BPF side, from Vlad.
9) Extend libbpf API to support map in map types and add test cases
for it as well to BPF kselftests, from Nikita.
10) Account the maximum packet offset accessed by a BPF program in
the verifier and use it for optimizing nfp JIT, from Jiong.
11) Fix error handling regarding kprobe_events in BPF sample loader,
from Daniel T.
12) Add support for queue and stack map type in bpftool, from David.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot was able to trigger the WARN in cttimeout_default_get() by
passing UDPLITE as l4protocol. Alias UDPLITE to UDP, both use
same timeout values.
Furthermore, also fetch GRE timeouts. GRE is a bit more complicated,
as it still can be a module and its netns_proto_gre struct layout isn't
visible outside of the gre module. Can't move timeouts around, it
appears conntrack sysctl unregister assumes net_generic() returns
nf_proto_net, so we get crash. Expose layout of netns_proto_gre instead.
A followup nf-next patch could make gre tracker be built-in as well
if needed, its not that large.
Last, make the WARN() mention the missing protocol value in case
anything else is missing.
Reported-by: syzbot+2fae8fa157dd92618cae@syzkaller.appspotmail.com
Fixes: 8866df9264 ("netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ip_vs_dst_event is supposed to clean up all dst used in ipvs'
destinations when a net dev is going down. But it works only
when the dst's dev is the same as the dev from the event.
Now with the same priority but late registration,
ip_vs_dst_notifier is always called later than ipv6_dev_notf
where the dst's dev is set to lo for NETDEV_DOWN event.
As the dst's dev lo is not the same as the dev from the event
in ip_vs_dst_event, ip_vs_dst_notifier doesn't actually work.
Also as these dst have to wait for dest_trash_timer to clean
them up. It would cause some non-permanent kernel warnings:
unregister_netdevice: waiting for br0 to become free. Usage count = 3
To fix it, call ip_vs_dst_notifier earlier than ipv6_dev_notf
by increasing its priority to ADDRCONF_NOTIFY_PRIORITY + 5.
Note that for ipv4 route fib_netdev_notifier doesn't set dst's
dev to lo in NETDEV_DOWN event, so this fix is only needed when
IP_VS_IPV6 is defined.
Fixes: 7a4f0761fc ("IPVS: init and cleanup restructuring")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-11-25
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix an off-by-one bug when adjusting subprog start offsets after
patching, from Edward.
2) Fix several bugs such as overflow in size allocation in queue /
stack map creation, from Alexei.
3) Fix wrong IPv6 destination port byte order in bpf_sk_lookup_udp
helper, from Andrey.
4) Fix several bugs in bpftool such as preventing an infinite loop
in get_fdinfo, error handling and man page references, from Quentin.
5) Fix a warning in bpf_trace_printk() that wasn't catching an
invalid format string, from Martynas.
6) Fix a bug in BPF cgroup local storage where non-atomic allocation
was used in atomic context, from Roman.
7) Fix a NULL pointer dereference bug in bpftool from reallocarray()
error handling, from Jakub and Wen.
8) Add a copy of pkt_cls.h and tc_bpf.h uapi headers to the tools
include infrastructure so that bpftool compiles on older RHEL7-like
user space which does not ship these headers, from Yonghong.
9) Fix BPF kselftests for user space where to get ping test working
with ping6 and ping -6, from Li.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
I do not see how one can effectively use skb_insert() without holding
some kind of lock. Otherwise other cpus could have changed the list
right before we have a chance of acquiring list->lock.
Only existing user is in drivers/infiniband/hw/nes/nes_mgt.c and this
one probably meant to use __skb_insert() since it appears nesqp->pau_list
is protected by nesqp->pau_lock. This looks like nesqp->pau_lock
could be removed, since nesqp->pau_list.lock could be used instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Faisal Latif <faisal.latif@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma <linux-rdma@vger.kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent change added a null check on p->dev after p->dev was being
dereferenced by the ns_capable check on p->dev. It turns out that
neither the p->dev and p->br null checks are necessary, and can be
removed, which cleans up a static analyis warning.
As Nikolay Aleksandrov noted, these checks can be removed because:
"My reasoning of why it shouldn't be possible:
- On port add new_nbp() sets both p->dev and p->br before creating
kobj/sysfs
- On port del (trickier) del_nbp() calls kobject_del() before call_rcu()
to destroy the port which in turn calls sysfs_remove_dir() which uses
kernfs_remove() which deactivates (shouldn't be able to open new
files) and calls kernfs_drain() to drain current open/mmaped files in
the respective dir before continuing, thus making it impossible to
open a bridge port sysfs file with p->dev and p->br equal to NULL.
So I think it's safe to remove those checks altogether. It'd be nice to
get a second look over my reasoning as I might be missing something in
sysfs/kernfs call path."
Thanks to Nikolay Aleksandrov's suggestion to remove the check and
David Miller for sanity checking this.
Detected by CoverityScan, CID#751490 ("Dereference before null check")
Fixes: a5f3ea54f3 ("net: bridge: add support for raw sysfs port options")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip packet generation, pagedlen is initialized for each skb at the
start of the loop in __ip(6)_append_data, before label alloc_new_skb.
Depending on compiler options, code can be generated that jumps to
this label, triggering use of an an uninitialized variable.
In practice, at -O2, the generated code moves the initialization below
the label. But the code should not rely on that for correctness.
Fixes: 15e36f5b8e ("udp: paged allocation with gso")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a qdisc setup including pacing FQ is dismantled and recreated,
some TCP packets are sent earlier than instructed by TCP stack.
TCP can be fooled when ACK comes back, because the following
operation can return a negative value.
tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
Some paths in TCP stack were not dealing properly with this,
this patch addresses four of them.
Fixes: ab408b6dc7 ("tcp: switch tcp and sch_fq to new earliest departure time model")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Need to take mutex in ath9k_add_interface(), from Dan Carpenter.
2) Fix mt76 build without CONFIG_LEDS_CLASS, from Arnd Bergmann.
3) Fix socket wmem accounting in SCTP, from Xin Long.
4) Fix failed resume crash in ena driver, from Arthur Kiyanovski.
5) qed driver passes bytes instead of bits into second arg of
bitmap_weight(). From Denis Bolotin.
6) Fix reset deadlock in ibmvnic, from Juliet Kim.
7) skb_scrube_packet() needs to scrub the fwd marks too, from Petr
Machata.
8) Make sure older TCP stacks see enough dup ACKs, and avoid doing SACK
compression during this period, from Eric Dumazet.
9) Add atomicity to SMC protocol cursor handling, from Ursula Braun.
10) Don't leave dangling error pointer if bpf_prog_add() fails in
thunderx driver, from Lorenzo Bianconi. Also, when we unmap TSO
headers, set sq->tso_hdrs to NULL.
11) Fix race condition over state variables in act_police, from Davide
Caratti.
12) Disable guest csum in the presence of XDP in virtio_net, from Jason
Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (64 commits)
net: gemini: Fix copy/paste error
net: phy: mscc: fix deadlock in vsc85xx_default_config
dt-bindings: dsa: Fix typo in "probed"
net: thunderx: set tso_hdrs pointer to NULL in nicvf_free_snd_queue
net: amd: add missing of_node_put()
team: no need to do team_notify_peers or team_mcast_rejoin when disabling port
virtio-net: fail XDP set if guest csum is negotiated
virtio-net: disable guest csum during XDP set
net/sched: act_police: add missing spinlock initialization
net: don't keep lonely packets forever in the gro hash
net/ipv6: re-do dad when interface has IFF_NOARP flag change
packet: copy user buffers before orphan or clone
ibmvnic: Update driver queues after change in ring size support
ibmvnic: Fix RX queue buffer cleanup
net: thunderx: set xdp_prog to NULL if bpf_prog_add fails
net/dim: Update DIM start sample after each DIM iteration
net: faraday: ftmac100: remove netif_running(netdev) check before disabling interrupts
net/smc: use after free fix in smc_wr_tx_put_slot()
net/smc: atomic SMCD cursor handling
net/smc: add SMC-D shutdown signal
...
Due to an explicit check in rocker_world_port_obj_vlan_add(),
dsa_slave_switchdev_event() resp. port_switchdev_event(), VLAN objects
that are added to a device that is not a front-panel port device are
ignored. Therefore this check is immaterial.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of
this field from all clients, which were migrated to use switchdev
notification in the previous patches.
Add a new function switchdev_port_obj_notify() that sends the switchdev
notifications SWITCHDEV_PORT_OBJ_ADD and _DEL.
Update switchdev_port_obj_del_now() to dispatch to this new function.
Drop __switchdev_port_obj_add() and update switchdev_port_obj_add()
likewise.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the transition from switchdev operations to notifier chain (which
will take place in following patches), the onus is on the driver to find
its own devices below possible layer of LAG or other uppers.
The logic to do so is fairly repetitive: each driver is looking for its
own devices among the lowers of the notified device. For those that it
finds, it calls a handler. To indicate that the event was handled,
struct switchdev_notifier_port_obj_info.handled is set. The differences
lie only in what constitutes an "own" device and what handler to call.
Therefore abstract this logic into two helpers,
switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If
a driver only supports physical ports under a bridge device, it will
simply avoid this layer of indirection.
One area where this helper diverges from the current switchdev behavior
is the case of mixed lowers, some of which are switchdev ports and some
of which are not. Previously, such scenario would fail with -EOPNOTSUPP.
The helper could do that for lowers for which the passed-in predicate
doesn't hold. That would however break the case that switchdev ports
from several different drivers are stashed under one master, a scenario
that switchdev currently happily supports. Therefore tolerate any and
all unknown netdevices, whether they are backed by a switchdev driver
or not.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patches will change the way of distributing port object
changes from a switchdev operation to a switchdev notifier. The
switchdev code currently recursively descends through layers of lower
devices, eventually calling the op on a front-panel port device. The
notifier will instead be sent referencing the bridge port device, which
may be a stacking device that's one of front-panel ports uppers, or a
completely unrelated device.
DSA currently doesn't support any other uppers than bridge.
SWITCHDEV_OBJ_ID_HOST_MDB and _PORT_MDB objects are always notified on
the bridge port device. Thus the only case that a stacked device could
be validly referenced by port object notifications are bridge
notifications for VLAN objects added to the bridge itself. But the
driver explicitly rejects such notifications in dsa_port_vlan_add(). It
is therefore safe to assume that the only interesting case is that the
notification is on a front-panel port netdevice. Therefore keep the
filtering by dsa_slave_dev_check() in place.
To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking
notifier chain. Dispatch to rocker_port_obj_add() resp. _del() to
maintain the behavior that the switchdev operation based code currently
has.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In general one can't assume that a switchdev notifier is called in a
non-atomic context, and correspondingly, the switchdev notifier chain is
an atomic one.
However, port object addition and deletion messages are delivered from a
process context. Even the MDB addition messages, whose delivery is
scheduled from atomic context, are queued and the delivery itself takes
place in blocking context. For VLAN messages in particular, keeping the
blocking nature is important for error reporting.
Therefore introduce a blocking notifier chain and related service
functions to distribute the notifications for which a blocking context
can be assumed.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an rmb is no longer in use by a connection, unregister its rkey at
the remote peer with an LLC DELETE RKEY message. With this change,
unused buffers held in the buffer pool are no longer registered at the
remote peer. They are registered before the buffer is actually used and
unregistered when they are no longer used by a connection.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the infrastructure to send LLC messages of type DELETE RKEY to
unregister a shared memory region at the peer.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a send failed then don't start to wait for a response in
smc_llc_do_confirm_rkey.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For easier reading move the unlock of mutex smc_create_lgr_pending into
smc_listen_work(), i.e. into the function the mutex has been locked.
No functional change.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After sending one of the initial LLC messages CONFIRM LINK or
ADD LINK, there is already a wait for the LLC response. It does
not make sense to wait another long time for a CLC DECLINE. Thus
this patch introduces a shorter wait time for these cases.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a link is terminated that has never reached the active state,
there is no need to trigger an LLC DELETE LINK.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If connection initialization fails for the LLC CONFIRM LINK or the
LLC ADD LINK step, fallback to TCP should be enabled. Thus
the negative return code -EAGAIN should switch to a positive timeout
reason code in these cases, and the internal CLC socket should
not have a set sk_err.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to store the return value in sk_err, if it is
afterwards cleared again with sock_error(). This patch sets the
return value directly. Just cleanup, no functional change.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smc_lgr_free() is just called inside smc_core.c. Make it static.
Just cleanup, no functional change.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tcp_listen_worker is already initialized when socket is
created (in smc_sock_alloc()). Get rid of the duplicate
initialization in smc_listen(). No functional change.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We very often have few flows/chains to look at, and we
might increase GRO_HASH_BUCKETS to 32 or 64 in the future.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlv4ScMTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi1AlCACGgnN3hy/1AS2/fWVkPNZmfAyNC2vb
1MZcYY2eXV+gx5MGr9/DKAGgvlxDRjn+FQAXqTVGNGULTNBEujWa4Z+Hl/gzYXfX
LdK90pBe/E2WwcuDMK8WrMSuumJYElLpAcvEoxmAdJCDSXZ4ZGLfktGuaBqBGEJm
9NftKpJzqavuhVMt3wlNnaiZCD++BzMXTnMvcgpSWZIdlGpAXYYfeyFkPu5s1tUl
0PnsS2fP53JPR3nUz5EOksJidn0A9RYnYz/jKMvKFDLwURuRouHbugaZw/tXqUB3
atcd6u+XV3v7RS/fhIybJ7yoO5bE0TehcP7D7qY2R4R8bG+yWc1L124g
=yrY2
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.20-rc4' of https://github.com/ceph/ceph-client
Pullk ceph fix from Ilya Dryomov:
"A messenger fix, marked for stable"
* tag 'ceph-for-4.20-rc4' of https://github.com/ceph/ceph-client:
libceph: fall back to sendmsg for slab pages
Eric noted that with UDP GRO and NAPI timeout, we could keep a single
UDP packet inside the GRO hash forever, if the related NAPI instance
calls napi_gro_complete() at an higher frequency than the NAPI timeout.
Willem noted that even TCP packets could be trapped there, till the
next retransmission.
This patch tries to address the issue, flushing the old packets -
those with a NAPI_GRO_CB age before the current jiffy - before scheduling
the NAPI timeout. The rationale is that such a timeout should be
well below a jiffy and we are not flushing packets eligible for sane GRO.
v1 -> v2:
- clarified the commit message and comment
RFC -> v1:
- added 'Fixes tags', cleaned-up the wording.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 3b47d30396 ("net: gro: add a per device gro flush timer")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we add a new IPv6 address, we should also join corresponding solicited-node
multicast address, unless the interface has IFF_NOARP flag, as function
addrconf_join_solict() did. But if we remove IFF_NOARP flag later, we do
not do dad and add the mcast address. So we will drop corresponding neighbour
discovery message that came from other nodes.
A typical example is after creating a ipvlan with mode l3, setting up an ipv6
address and changing the mode to l2. Then we will not be able to ping this
address as the interface doesn't join related solicited-node mcast address.
Fix it by re-doing dad when interface changed IFF_NOARP flag. Then we will add
corresponding mcast group and check if there is a duplicate address on the
network.
Reported-by: Jianlin Shi <jishi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tpacket_snd sends packets with user pages linked into skb frags. It
notifies that pages can be reused when the skb is released by setting
skb->destructor to tpacket_destruct_skb.
This can cause data corruption if the skb is orphaned (e.g., on
transmit through veth) or cloned (e.g., on mirror to another psock).
Create a kernel-private copy of data in these cases, same as tun/tap
zerocopy transmission. Reuse that infrastructure: mark the skb as
SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx).
Unlike other zerocopy packets, do not set shinfo destructor_arg to
struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify
when the original skb is released and a timestamp is recorded. Do not
change this timestamp behavior. The ubuf_info->callback is not needed
anyway, as no zerocopy notification is expected.
Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The
resulting value is not a valid ubuf_info pointer, nor a valid
tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this.
The fix relies on features introduced in commit 52267790ef ("sock:
add MSG_ZEROCOPY"), so can be backported as is only to 4.14.
Tested with from `./in_netns.sh ./txring_overwrite` from
http://github.com/wdebruij/kerneltools/tests
Fixes: 69e3c75f4d ("net: TX_RING and packet mmap")
Reported-by: Anand H. Krishnan <anandhkrishnan@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This could be used to rate limit egress traffic in concert with a qdisc
which supports Earliest Departure Time, such as FQ.
Write access from cg skb progs only with CAP_SYS_ADMIN, since the value
will be used by downstream qdiscs. It might make sense to relax this.
Changes v1 -> v2:
- allow access from cg skb, write only with CAP_SYS_ADMIN
Signed-off-by: Vlad Dumitrescu <vladum@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
if loopback_idev is NULL pointer, and the following access of
loopback_idev will trigger panic, which is same as BUG_ON
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Allow querying bridge port flags so that drivers capable of performing
VxLAN learning will update the bridge driver only if learning is enabled
on its bridge port corresponding to the VxLAN device.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In smc_wr_tx_put_slot() field pend->idx is used after being
cleared. That means always idx 0 is cleared in the wr_tx_mask.
This results in a broken administration of available WR send
payload buffers.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Running uperf tests with SMCD on LPARs results in corrupted cursors.
SMCD cursors should be treated atomically to fix cursor corruption.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a SMC-D link group is freed, a shutdown signal should be sent to
the peer to indicate that the link group is invalid. This patch adds the
shutdown signal to the SMC code.
Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When searching for an existing link group the queue pair number is also
to be taken into consideration. When the SMC server sends a new number
in a CLC packet (keeping all other values equal) then a new link group
is to be created on the SMC client side.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of a non-blocking SMC socket, the initial CLC handshake is
performed over a blocking TCP connection in a worker. If the SMC socket
is released, smc_release has to wait for the blocking CLC socket
operations (e.g., kernel_connect) inside the worker.
This patch aborts a CLC connection when the respective non-blocking SMC
socket is released to avoid waiting on socket operations or timeouts.
Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jean-Louis reported a TCP regression and bisected to recent SACK
compression.
After a loss episode (receiver not able to keep up and dropping
packets because its backlog is full), linux TCP stack is sending
a single SACK (DUPACK).
Sender waits a full RTO timer before recovering losses.
While RFC 6675 says in section 5, "Algorithm Details",
(2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
indicating at least three segments have arrived above the current
cumulative acknowledgment point, which is taken to indicate loss
-- go to step (4).
...
(4) Invoke fast retransmit and enter loss recovery as follows:
there are old TCP stacks not implementing this strategy, and
still counting the dupacks before starting fast retransmit.
While these stacks probably perform poorly when receivers implement
LRO/GRO, we should be a little more gentle to them.
This patch makes sure we do not enable SACK compression unless
3 dupacks have been sent since last rcv_nxt update.
Ideally we should even rearm the timer to send one or two
more DUPACK if no more packets are coming, but that will
be work aiming for linux-4.21.
Many thanks to Jean-Louis for bisecting the issue, providing
packet captures and testing this patch.
Fixes: 5d9f4262b7 ("tcp: add SACK compression")
Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a packet is trapped and the corresponding SKB marked as
already-forwarded, it retains this marking even after it is forwarded
across veth links into another bridge. There, since it ingresses the
bridge over veth, which doesn't have offload_fwd_mark, it triggers a
warning in nbp_switchdev_frame_mark().
Then nbp_switchdev_allowed_egress() decides not to allow egress from
this bridge through another veth, because the SKB is already marked, and
the mark (of 0) of course matches. Thus the packet is incorrectly
blocked.
Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That
function is called from tunnels and also from veth, and thus catches the
cases where traffic is forwarded between bridges and transformed in a
way that invalidates the marking.
Fixes: 6bc506b4fb ("bridge: switchdev: Add forward mark support for stacked devices")
Fixes: abf4bb6b63 ("skbuff: Add the offload_mr_fwd_mark field")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
after 'police' configuration parameters were converted to use RCU instead
of spinlock, the state variables used to compute the traffic rate (namely
'tcfp_toks', 'tcfp_ptoks' and 'tcfp_t_c') are erroneously read/updated in
the traffic path without any protection.
Use a dedicated spinlock to avoid race conditions on these variables, and
ensure proper cache-line alignment. In this way, 'police' is still faster
than what we observed when 'tcf_lock' was used in the traffic path _ i.e.
reverting commit 2d550dbad8 ("net/sched: act_police: don't use spinlock
in the data path"). Moreover, we preserve the throughput improvement that
was obtained after 'police' started using per-cpu counters, when 'avrate'
is used instead of 'rate'.
Changes since v1 (thanks to Eric Dumazet):
- call ktime_get_ns() before acquiring the lock in the traffic path
- use a dedicated spinlock instead of tcf_lock
- improve cache-line usage
Fixes: 2d550dbad8 ("net/sched: act_police: don't use spinlock in the data path")
Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
During tcp coalescing ensure that the skb hardware timestamp refers to the
highest sequence number data.
Previously only the software timestamp was updated during coalescing.
Signed-off-by: Stephen Mallon <stephen.mallon@sydney.edu.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under stress, softirq rx handler often hits a socket owned by the user,
and has to queue the packet into socket backlog.
When this happens, skb dst refcount is taken before we escape rcu
protected region. This is done from __sk_add_backlog() calling
skb_dst_force().
Consumer will have to perform the opposite costly operation.
AFAIK nothing in tcp stack requests the dst after skb was stored
in the backlog. If this was the case, we would have had failures
already since skb_dst_force() can end up clearing skb dst anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two cases were we can avoid calling ktime_get_ns() :
1) Queue is empty.
2) Internal queue is not empty.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of egress offloads the class/flowid assigned by the filter
may be very important for offloaded Qdisc selection. Provide this
info to drivers.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers which offload GRED to report back statistics. Since
A lot of GRED stats is fairly ad hoc in nature pass to drivers the
standard struct gnet_stats_basic/gnet_stats_queue pairs, and
untangle the values in the core.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add basic offload for the GRED Qdisc. Inform the drivers any
time Qdisc or virtual queue configuration changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a packet is trapped and the corresponding SKB marked as
already-forwarded, it retains this marking even after it is forwarded
across veth links into another bridge. There, since it ingresses the
bridge over veth, which doesn't have offload_fwd_mark, it triggers a
warning in nbp_switchdev_frame_mark().
Then nbp_switchdev_allowed_egress() decides not to allow egress from
this bridge through another veth, because the SKB is already marked, and
the mark (of 0) of course matches. Thus the packet is incorrectly
blocked.
Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That
function is called from tunnels and also from veth, and thus catches the
cases where traffic is forwarded between bridges and transformed in a
way that invalidates the marking.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different from processing the addstrm_out request, The receiver handles
an addstrm_in request by sending back an addstrm_out request to the
sender who will increase its stream's in and incnt later.
Now stream->incnt has been increased since it sent out the addstrm_in
request in sctp_send_add_streams(), with the wrong stream->incnt will
even cause crash when copying stream info from the old stream's in to
the new one's in sctp_process_strreset_addstrm_out().
This patch is to fix it by simply removing the stream->incnt change
from sctp_send_add_streams().
Fixes: 242bd2d519 ("sctp: implement sender-side procedures for Add Incoming/Outgoing Streams Request Parameter")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 22d7be267e.
The dst's mtu in transport can be updated by a non sctp place like
in xfrm where the MTU information didn't get synced between asoc,
transport and dst, so it is still needed to do the pmtu check
in sctp_packet_config.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As rfc7496#section4.5 says about SCTP_PR_SUPPORTED:
This socket option allows the enabling or disabling of the
negotiation of PR-SCTP support for future associations. For existing
associations, it allows one to query whether or not PR-SCTP support
was negotiated on a particular association.
It means only sctp sock's prsctp_enable can be set.
Note that for the limitation of SCTP_{CURRENT|ALL}_ASSOC, we will
add it when introducing SCTP_{FUTURE|CURRENT|ALL}_ASSOC for linux
sctp in another patchset.
v1->v2:
- drop the params.assoc_id check as Neil suggested.
Fixes: 28aa4c26fc ("sctp: add SCTP_PR_SUPPORTED on sctp sockopt")
Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sctp increases sk_wmem_alloc by 1 when doing set_owner_w for the
skb allocked in sctp_packet_transmit and decreases by 1 when freeing
this skb.
But when this skb goes through networking stack, some subcomponents
might change skb->truesize and add the same amount on sk_wmem_alloc.
However sctp doesn't know the amount to decrease by, it would cause
a leak on sk->sk_wmem_alloc and the sock can never be freed.
Xiumei found this issue when it hit esp_output_head() by using sctp
over ipsec, where skb->truesize is added and so is sk->sk_wmem_alloc.
Since sctp has used sk_wmem_queued to count for writable space since
Commit cd305c74b0 ("sctp: use sk_wmem_queued to check for writable
space"), it's ok to fix it by counting sk_wmem_alloc by skb truesize
in sctp_packet_transmit.
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds sockopt SCTP_EVENT described in rfc6525#section-6.2.
With this sockopt users can subscribe to an event from a specified
asoc.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_event is a structure name defined in RFC for sockopt
SCTP_EVENT. To avoid the conflict, rename it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The member subscribe should be per asoc, so that sockopt SCTP_EVENT
in the next patch can subscribe a event from one asoc only.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The member subscribe in sctp_sock is used to indicate to which of
the events it is subscribed, more like a group of flags. So it's
better to be defined as __u16 (2 bytpes), instead of struct
sctp_event_subscribe (13 bytes).
Note that sctp_event_subscribe is an UAPI struct, used on sockopt
calls, and thus it will not be removed. This patch only changes
the internal storage of the flags.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix some potentially uninitialized variables and use-after-free in
kvaser_usb can drier, from Jimmy Assarsson.
2) Fix leaks in qed driver, from Denis Bolotin.
3) Socket leak in l2tp, from Xin Long.
4) RSS context allocation fix in bnxt_en from Michael Chan.
5) Fix cxgb4 build errors, from Ganesh Goudar.
6) Route leaks in ipv6 when removing exceptions, from Xin Long.
7) Memory leak in IDR allocation handling of act_pedit, from Davide
Caratti.
8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.
9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
Sabrina Dubroca.
10) When NAPI cached skb is reused, we must set it to the proper initial
state which includes skb->pkt_type. From Eric Dumazet.
11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.
12) Set RX queue properly in various tuntap receive paths, from Matthew
Cover.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tuntap: fix multiqueue rx
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
tipc: don't assume linear buffer when reading ancillary data
tipc: fix lockdep warning when reinitilaizing sockets
net-gro: reset skb->pkt_type in napi_reuse_skb()
tc-testing: tdc.py: Guard against lack of returncode in executed command
tc-testing: tdc.py: ignore errors when decoding stdout/stderr
ip_tunnel: don't force DF when MTU is locked
MAINTAINERS: Add entry for CAKE qdisc
net: bridge: fix vlan stats use-after-free on destruction
socket: do a generic_file_splice_read when proto_ops has no splice_read
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
net/sched: act_pedit: fix memory leak when IDR allocation fails
net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
ipv6: fix a dst leak when removing its exception
net: mvneta: Don't advertise 2.5G modes
drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
net/mlx4: Fix UBSAN warning of signed integer overflow
...
skb_can_coalesce() allows coalescing neighboring slab objects into
a single frag:
return page == skb_frag_page(frag) &&
off == frag->page_offset + skb_frag_size(frag);
ceph_tcp_sendpage() can be handed slab pages. One example of this is
XFS: it passes down sector sized slab objects for its metadata I/O. If
the kernel client is co-located on the OSD node, the skb may go through
loopback and pop on the receive side with the exact same set of frags.
When tcp_recvmsg() attempts to copy out such a frag, hardened usercopy
complains because the size exceeds the object's allocated size:
usercopy: kernel memory exposure attempt detected from ffff9ba917f20a00 (kmalloc-512) (1024 bytes)
Although skb_can_coalesce() could be taught to return false if the
resulting frag would cross a slab object boundary, we already have
a fallback for non-refcounted pages. Utilize it for slab pages too.
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Preethi reported that PMTU discovery for UDP/raw applications is not
working in the presence of VRF when the socket is not bound to a device.
The problem is that ip6_sk_update_pmtu does not consider the L3 domain
of the skb device if the socket is not bound. Update the function to
set oif to the L3 master device if relevant.
Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Reported-by: Preethi Ramachandra <preethir@juniper.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code for reading ancillary data from a received buffer is assuming
the buffer is linear. To make this assumption true we have to linearize
the buffer before message data is read.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eth_type_trans() assumes initial value for skb->pkt_type
is PACKET_HOST.
This is indeed the value right after a fresh skb allocation.
However, it is possible that GRO merged a packet with a different
value (like PACKET_OTHERHOST in case macvlan is used), so
we need to make sure napi->skb will have pkt_type set back to
PACKET_HOST.
Otherwise, valid packets might be dropped by the stack because
their pkt_type is not PACKET_HOST.
napi_reuse_skb() was added in commit 96e93eab20 ("gro: Add
internal interfaces for VLAN"), but this bug always has
been there.
Fixes: 96e93eab20 ("gro: Add internal interfaces for VLAN")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.
This patch makes setting the DF bit conditional on the route's MTU
locking state.
This issue seems to be older than git history.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reported a use-after-free of the global vlan context on port vlan
destruction. When I added per-port vlan stats I missed the fact that the
global vlan context can be freed before the per-port vlan rcu callback.
There're a few different ways to deal with this, I've chosen to add a
new private flag that is set only when per-port stats are allocated so
we can directly check it on destruction without dereferencing the global
context at all. The new field in net_bridge_vlan uses a hole.
v2: cosmetic change, move the check to br_process_vlan_info where the
other checks are done
v3: add change log in the patch, add private (in-kernel only) flags in a
hole in net_bridge_vlan struct and use that instead of mixing
user-space flags with private flags
Fixes: 9163a0fc1f ("net: bridge: add support for per-port vlan stats")
Reported-by: syzbot+04681da557a0e49a52e5@syzkaller.appspotmail.com
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
splice(2) fails with -EINVAL when called reading on a socket with no splice_read
set in its proto_ops (such as vsock sockets). Switch this to fallbacks to a
generic_file_splice_read instead.
Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends the ncsi-netlink interface with two new commands and
three new attributes to configure multiple packages and/or channels at
once, and configure specific failover modes.
NCSI_CMD_SET_PACKAGE mask and NCSI_CMD_SET_CHANNEL_MASK set a whitelist
of packages or channels allowed to be configured with the
NCSI_ATTR_PACKAGE_MASK and NCSI_ATTR_CHANNEL_MASK attributes
respectively. If one of these whitelists is set only packages or
channels matching the whitelist are considered for the channel queue in
ncsi_choose_active_channel().
These commands may also use the NCSI_ATTR_MULTI_FLAG to signal that
multiple packages or channels may be configured simultaneously. NCSI
hardware arbitration (HWA) must be available in order to enable
multi-package mode. Multi-channel mode is always available.
If the NCSI_ATTR_CHANNEL_ID attribute is present in the
NCSI_CMD_SET_CHANNEL_MASK command the it sets the preferred channel as
with the NCSI_CMD_SET_INTERFACE command. The combination of preferred
channel and channel whitelist defines a primary channel and the allowed
failover channels.
If the NCSI_ATTR_MULTI_FLAG attribute is also present then the preferred
channel is configured for Tx/Rx and the other channels are enabled only
for Rx.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the NCSI driver is stopped with ncsi_stop_dev() the channel
monitors are stopped and the state set to "inactive". However the
channels are still configured and active from the perspective of the
network controller. We should suspend each active channel but in the
context of ncsi_stop_dev() the transmit queue has been or is about to be
stopped so we won't have time to do so.
Instead when ncsi_start_dev() is called if the NCSI topology has already
been probed then call ncsi_reset_dev() to suspend any channels that were
previously active. This resets the network controller to a known state,
provides an up to date view of channel link state, and makes sure that
mode flags such as NCSI_MODE_TX_ENABLE are properly reset.
In addition to ncsi_start_dev() use ncsi_reset_dev() in ncsi-netlink.c
to update the channel configuration more cleanly.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The concepts of a channel being 'active' and it having link are slightly
muddled in the NCSI driver. Tweak this slightly so that
NCSI_CHANNEL_ACTIVE represents a channel that has been configured and
enabled, and NCSI_CHANNEL_INACTIVE represents a de-configured channel.
This distinction is important because a channel can be 'active' but have
its link down; in this case the channel may still need to be configured
so that it may receive AEN link-state-change packets.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a package is deselected all channels of that package cease
communication. If there are other channels active on the package of the
suspended channel this will disable them as well, so only send a
deselect-package command if no other channels are active.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the NCSI driver sends a select-package command to all possible
packages simultaneously to discover what packages are available. However
at this stage in the probe process the driver does not know if
hardware arbitration is available: if it isn't then this process could
cause collisions on the RMII bus when packages try to respond.
Update the probe loop to probe each package one by one, and once
complete check if HWA is universally supported.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NCSI hardware arbitration allows multiple packages to be enabled at once
and share the same wiring. If the NCSI driver recognises that HWA is
available it unconditionally enables all packages and channels; but that
is a configuration decision rather than something required by HWA.
Additionally the current implementation will not failover on link events
which can cause connectivity to be lost unless the interface is manually
bounced.
Retain basic HWA support but remove the separate configuration path to
enable all channels, leaving this to be handled by a later
implementation.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add TCP_NLA_SRTT to SCM_TIMESTAMPING_OPT_STATS that reports the smoothed
round trip time in microseconds (tcp_sock.srtt_us >> 3).
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the htable_create(), hinfo is allocated by vmalloc()
So that if error occurred, hinfo should be freed.
Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to support higher pps rates")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Allow users to set and dump RED flags (ECN enabled and harddrop)
on per-virtual queue basis. Validation of attributes is split
from changes to make sure we won't have to undo previous operations
when we find out configuration is invalid.
The objective is to allow changing per-Qdisc parameters without
overwriting the per-vq configured flags.
Old user space will not pass the TCA_GRED_VQ_FLAGS attribute and
per-Qdisc flags will always get propagated to the virtual queues.
New user space which wants to make use of per-vq flags should set
per-Qdisc flags to 0 and then configure per-vq flags as it
sees fit. Once per-vq flags are set per-Qdisc flags can't be
changed to non-zero. Vice versa - if the per-Qdisc flags are
non-zero the TCA_GRED_VQ_FLAGS attribute has to either be omitted
or set to the same value as per-Qdisc flags.
Update per-Qdisc parameters:
per-Qdisc | per-VQ | result
0 | 0 | all vq flags updated
0 | non-0 | error (vq flags in use)
non-0 | 0 | -- impossible --
non-0 | non-0 | all vq flags updated
Update per-VQ state (flags parameter not specified):
no change to flags
Update per-VQ state (flags parameter set):
per-Qdisc | per-VQ | result
0 | any | per-vq flags updated
non-0 | 0 | -- impossible --
non-0 | non-0 | error (per-Qdisc flags in use)
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now ECN marking and HARD drop (the common RED flags) can only
be configured for the entire Qdisc. In preparation for per-vq flags
store the values in the virtual queue structure. Setting per-vq
flags will only be allowed when no flags are set for the entire Qdisc.
For the new flags we will also make sure undefined bits are 0.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently all GRED's virtual queue data is dumped in a single
array in a single attribute. This makes it pretty much impossible
to add new fields. In order to expose more detailed stats add a
new set of attributes. We can now expose the 64 bit value of bytesin
and all the mark stats which were not part of the original design.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
32 bit counters for bytes are not really going to last long in modern
world. Make sch_gred count bytes on a 64 bit counter. It will still
get truncated during dump but follow up patch will add set of new
stat dump attributes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack messages to -EINVAL errors, to help users identify
their mistakes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case netlink wants to provide parsing error pass extack
to nla_parse_nested().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon want to add more code to the non-error path, separate
it from the error handling flow.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 60fb9567bf ("udp: implement complete book-keeping for
encap_needed") introduced a severe misuse of jump label APIs, which
syzbot, as reported by Eric, was able to exploit.
When multiple sockets/process can concurrently request (and than
disable) the udp encap, we need to track the activation counter with
*_inc()/*_dec() jump label variants, or we can experience bad things
at disable time.
Fixes: 60fb9567bf ("udp: implement complete book-keeping for encap_needed")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently on dequeue() ETF only drops the first expired packet, which
causes a problem if the next packet is already expired. When this
happens, the watchdog will be configured with a time in the past, fire
straight way and the packet will finally be dropped once the dequeue()
function of the qdisc is called again.
We can save quite a few cycles and improve the overall behavior of the
qdisc if we drop all expired packets if the next packet is expired.
This should allow ETF to recover faster from bad situations. But
packet drops are still a very serious warning that the requirements
imposed on the system aren't reasonable.
This was inspired by how the implementation of hrtimers use the
rb_tree inside the kernel.
Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is just a refactor that will simplify the implementation of the
next patch in this series which will drop all expired packets on the
dequeue flow.
Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ETF's peek() operation is heavily used so use an rb_root_cached instead
and leverage rb_first_cached() which will run in O(1) instead of
O(log n).
Even if on 'timesortedlist_clear()' we could be using rb_erase(), we
choose to use rb_erase_cached(), because if in the future we allow
runtime changes to ETF parameters, and need to do a '_clear()', this
might cause some hard to debug issues.
Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point in firing the qdisc watchdog if there are no future
skbs pending in the queue and the watchdog had been set previously.
Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we can use bpf or tcp tracepoint to conveniently trace the tcp
state transition at the run time.
So we don't need to do this stuff at the compile time anymore.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Explicitly pad short ELP packets with zeros, by Sven Eckelmann
- Fix packet size calculation when merging fragments,
by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlvsI0MWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeobY0D/9N/v4LVF1vU/mYHPyQA6a8tXxp
9iCazA0jTMym48gvupMVSw0O6VrWZDnqUp6QY/jeB1Hfi188ouJgeY7zMNTnOGSF
jlMd+8dIsrJfX0Rg10s9t8mWHtR0Lzfs2TRTwIbjvvFXSOLSCmDFBCKYZNhJBcZm
gPDhLVS13klVQud78eTAciIXTIBCk3mp92auvwu/7yYSAi1RHUtMuov6qM6oqXDQ
ZaQKlzQ/N1yoh+NwkhIRUPtWZ1Q/8coQk48E8/mxmdCMWf9OKHoxx4TeScG6YDWH
x6qSqKIlMHNRJtwt+SF0X4xVqyKJ28jEH8d2lfbm5G6Dvgv2WCGQV8FQ2hjNtlfd
VatKnRW94uCMVvaB2r1dN8zx0Dozi3fR8QCo75Wovi9gwKjg3Xe6rNMxJqwwuWNO
4Q9YbiYpT6uZsCb9j3Ym/ConnQ8QMn4PA+qC5iH+4p0e0JfWdKsFGXphzSZzDOBN
3cfODCSO3PyVt/rmnnls21hznkPBn5dKtYCFeyBvtAjddgrzEgME6kB6mJ/mtO/7
1Ks4scMmyto03OpIExBPz8VsUDhsgGPMA2Brq9cfjhfA/Sl7OIJXEbVctnC6GK7H
OOuF6QAiN+W92gb+EGZCl3pEirRzZcQsWAgWnR8jM0joMQTHgZLqVDQIYaBPxbuB
wFoGWYX14Z5ImSlkyg==
=nlBu
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20181114' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are two batman-adv bugfixes:
- Explicitly pad short ELP packets with zeros, by Sven Eckelmann
- Fix packet size calculation when merging fragments,
by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
tcf_idr_check_alloc() can return a negative value, on allocation failures
(-ENOMEM) or IDR exhaustion (-ENOSPC): don't leak keys_ex in these cases.
Fixes: 0190c1d452 ("net: sched: atomically check-allocate action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the vlan packet offloads are registered only upon 8021q module
load. However, even without this module loaded, the offloads could be
utilized, for example by openvswitch datapath. As reported by Michael,
that causes 2x to 5x performance improvement, depending on a testcase.
So move the vlan offload registrations into vlan_core and make this
available even without 8021q module loaded.
Reported-by: Michael Shteinbok <michaelsh86@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Michael Shteinbok <michaelsh86@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These is no need to hold dst before calling rt6_remove_exception_rt().
The call to dst_hold_safe() in ip6_link_failure() was for ip6_del_rt(),
which has been removed in Commit 93531c6743 ("net/ipv6: separate
handling of FIB entries from dst based routes"). Otherwise, it will
cause a dst leak.
This patch is to simply remove the dst_hold_safe() call before calling
rt6_remove_exception_rt() and also do the same in ip6_del_cached_rt().
It's safe, because the removal of the exception that holds its dst's
refcnt is protected by rt6_exception_lock.
Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Fixes: 23fb93a4d3 ("net/ipv6: Cleanup exception and cache route handling")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a missing indentation before the declaration of port. Add
it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace VLAN_TAG_PRESENT with single bit flag and free up
VLAN.CFI overload. Now VLAN.CFI is visible in networking stack
and can be passed around intact.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make bpf_sk_lookup_tcp, bpf_sk_lookup_udp and bpf_sk_release helpers
available in programs of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR.
Such programs operate on sockets and have access to socket and struct
sockaddr passed by user to system calls such as sys_bind, sys_connect,
sys_sendmsg.
It's useful to be able to lookup other sockets from these programs.
E.g. sys_connect may lookup IP:port endpoint and if there is a server
socket bound to that endpoint ("server" can be defined by saddr & sport
being zero), redirect client connection to it by rewriting IP:port in
sockaddr passed to sys_connect.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lookup functions in sk_lookup have different expectations about byte
order of provided arguments.
Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
expect dport to be in network byte order and do ntohs(dport) internally.
At the same time __inet6_lookup expects dport to be in host byte order
and correspondingly name the argument hnum.
sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
__inet6_lookup with regard to dport. But in __udp6_lib_lookup case it
uses host instead of expected network byte order. It makes result
returned by bpf_sk_lookup_udp for IPv6 incorrect.
The patch fixes byte order of dport passed to __udp6_lib_lookup.
Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f0
fixes TCPv6 but breaks UDPv6.
Fixes: 5ef0ae84f0 ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Bump version strings, by Simon Wunderlich
- Fixup includes, by Sven Eckelmann (3 patches)
- Separate BATMAN_ADV_DEBUG from DEBUGFS, by Sven Eckelmann
- Fixup tracing log documentation, by Sven Eckelmann
- Use exclusive locks to secure netlink information dump transfers,
by Sven Eckelmann (8 patches)
- Move CRC16 dependency, by Sven Eckelmann
- Enable MCAST by default, by Linus Luessing
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlvsK8YWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoUr9EACZvVvgqVQC/kiAWyyXuTUvTy7q
SYc0nwmHeG5L/+ekHMsfs8DZ/wofEo6LcZDrSRv79bLjCOLbvSMThVLRH2p1r2sG
OLHqTEbofuqc79C0gXKcJwiuauomjKNty7NKbrf0g7SYhmgRFXJyDamjXt6+4kAS
HVSPsyFt3hI7wo3VIzm4pxXsrjV3wtKAN4RdkwE0i0NCSvJpFuCPMOi53tabjokR
aOb04vLK/SVg426PNS+0iD7oqP5WYKyZSDFD9HHCRj1AHTCxR+7E25nRYKS5J+t4
gCn6Q9sfrJWO2k816xBl2PysA/kVT3GChs4y14LMCaLDyH0Ny4XFeR5pgjpc62fD
JZe/rQwAQQ9IbN1dO9GTww88vMvELcSJhSP2W4q82qsHdn/h/ghVaLKl+zSUO4oS
OByG6BJk0Dz3KpMcCcHRL+VXGUSVmRuOCP2LqM+c0aK9s56qhJM/aR/g7FePq4lQ
HhOCCRP/bmx7F75OZRdxwOQbupQ7P1AA/P2dwjs1xzZ/BHEdmHipsmWs/z2/tKWn
+A9dvLqiF6Dy7VgFUgp7PSi0QyDrgFHNvE14o4ako7QD2o9NqgPsdSlAK68JD0o9
CR14Tb23mNRulWP0GZXjS/MbHmNT7tY+sAc4tSj2VO++ozd5Qox5Y1qPPnL1K8vL
yLjA9NtJzrT36rCPpA==
=9x4D
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20181114' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature/cleanup patchset includes the following patches:
- Bump version strings, by Simon Wunderlich
- Fixup includes, by Sven Eckelmann (3 patches)
- Separate BATMAN_ADV_DEBUG from DEBUGFS, by Sven Eckelmann
- Fixup tracing log documentation, by Sven Eckelmann
- Use exclusive locks to secure netlink information dump transfers,
by Sven Eckelmann (8 patches)
- Move CRC16 dependency, by Sven Eckelmann
- Enable MCAST by default, by Linus Luessing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
netperf udp stream shows that eth_type_trans takes certain cpu,
so adjust the mac address check order, and firstly check if it
is device address, and only check if it is multicast address
only if not the device address.
After this change:
To unicast, and skb dst mac is device mac, this is most of time
reduce a comparision
To unicast, and skb dst mac is not device mac, nothing change
To multicast, increase a comparision
Before:
1.03% [kernel] [k] eth_type_trans
After:
0.78% [kernel] [k] eth_type_trans
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
if list is NULL pointer, and the following access of list
will trigger panic, which is same as BUG_ON
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When EDT conversion happened, fq lost the ability to enfore a maxrate
for all flows. It kept it for non EDT flows.
This commit restores the functionality.
Tested:
tc qd replace dev eth0 root fq maxrate 500Mbit
netperf -P0 -H host -- -O THROUGHPUT
489.75
Fixes: ab408b6dc7 ("tcp: switch tcp and sch_fq to new earliest departure time model")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added support in tc flower for filtering based on port ranges.
Example:
1. Match on a port range:
-------------------------
$ tc filter add dev enp4s0 protocol ip parent ffff:\
prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
action drop
$ tc -s filter show dev enp4s0 parent ffff:
filter protocol ip pref 1 flower chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
eth_type ipv4
ip_proto tcp
dst_port range 20-30
skip_hw
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 1 ref 1 bind 1 installed 85 sec used 3 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
2. Match on IP address and port range:
--------------------------------------
$ tc filter add dev enp4s0 protocol ip parent ffff:\
prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
skip_hw action drop
$ tc -s filter show dev enp4s0 parent ffff:
filter protocol ip pref 1 flower chain 0 handle 0x2
eth_type ipv4
ip_proto tcp
dst_ip 192.168.1.1
dst_port range 100-200
skip_hw
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 2 ref 1 bind 1 installed 58 sec used 2 sec
Action statistics:
Sent 920 bytes 20 pkt (dropped 20, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
v4:
1. Added condition before setting port key.
2. Organized setting and dumping port range keys into functions
and added validation of input range.
v3:
1. Moved new fields in UAPI enum to the end of enum.
2. Removed couple of empty lines.
v2:
Addressed Jiri's comments:
1. Added separate functions for dst and src comparisons.
2. Removed endpoint enum.
3. Added new bit TCA_FLOWER_FLAGS_RANGE to decide normal/range
lookup.
4. Cleaned up fl_lookup function.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently netdev_rx_csum_fault() only shows a device name,
we need more information about the skb for debugging csum
failures.
Sample output:
ens3: hw csum failure
dev features: 0x0000000000014b89
skb len=84 data_len=0 pkt_type=0 gso_size=0 gso_type=0 nr_frags=0 ip_summed=0 csum=0 csum_complete_sw=0 csum_valid=0 csum_level=0
Note, I use pr_err() just to be consistent with the existing one.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The life-checking function, which is used by kAFS to make sure that a call
is still live in the event of a pending signal, only samples the received
packet serial number counter; it doesn't actually provoke a change in the
counter, rather relying on the server to happen to give us a packet in the
time window.
Fix this by adding a function to force a ping to be transmitted.
kAFS then keeps track of whether there's been a stall, and if so, uses the
new function to ping the server, resetting the timeout to allow the reply
to come back.
If there's a stall, a ping and the call is *still* stalled in the same
place after another period, then the call will be aborted.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Fixes: f4d15fb6f9 ("rxrpc: Provide functions for allowing cleaner handling of signals")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King says:
Static analysis with CoverityScan found a potential issue [..]
It seems that pointer pol is set to NULL and then a check to see if it
is non-null is used to set pol to tmp; howeverm this check is always
going to be false because pol is always NULL.
Fix this and update test script to catch this. Updated script only:
./xfrm_policy.sh ; echo $?
RTNETLINK answers: No such file or directory
FAIL: ip -net ns3 xfrm policy get src 10.0.1.0/24 dst 10.0.2.0/24 dir out
RTNETLINK answers: No such file or directory
[..]
PASS: policy before exception matches
PASS: ping to .254 bypassed ipsec tunnel
PASS: direct policy matches
PASS: policy matches
1
Fixes: 6be3b0db6d ("xfrm: policy: add inexact policy search tree infrastructure")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There is a missing indentation before the goto statement. Add it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There is an indentation issue before the declaration of xfrm_ctx. Remove
spaces and replace with a tab.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Highlights include:
Stable fixes:
- Don't exit the NFSv4 state manager without clearing NFS4CLNT_MANAGER_RUNNING
Bugfixes:
- Fix an Oops when destroying the RPCSEC_GSS credential cache
- Fix an Oops during delegation callbacks
- Ensure that the NFSv4 state manager exits the loop on SIGKILL
- Fix a bogus get/put in generic_key_to_expire()
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb7Lh0AAoJEA4mA3inWBJc8uAQAIkrGChs3AFuEQ3G3H9RlxDX
WFsPghRGmDwXf2sD+nWjl0r60v0v5fQaUhW/7EPe2kbVTF/rnjieXNeFOw33ZMFk
MDq03nL1/I25DoNK/qg5GZ2NIltZ9oKKbwaN+0LxXKz69X5qIXYnDzYPHDR/PNTg
Go7PvG8rU31Wd67E2pquwC6zZ6rCPf2BtQjZdzouLAEUWXAMHyJmszpFUxhLMJoz
k6dZouphj8fkMse3cfKLnGDqbQ2bE6+Yb0B6Hi0p5nShYgZTaQNZ9KxrEJF7J05i
cxH6IvLEawEMWXYzGEwr1LUDDrpwveuNTt/OroTgOcSsVpZx1DE0sOZkQ4pt/uTe
c5NzZYKjEOb2DWxoGR2GEDkRasKVBkWvR5MegvyDgyAcXkAjN/6CgYXiniNYDxl6
qk7sIqkJfug7fv+VW5YHwORKnvRIEDlFcwy5yZ0ij/Qa0dqUR3aczINGLwS6kcfn
u7M42UR17FUo2zaI9pZhuijwntbtkXMIETWHGRctK7Mum6u37QSVySNCO2A4knBE
jEy+oYPFCIUqH+ESpNp73otrVt1CTexScIJNsEi1naLmOhjQRW7YjUPEH1Xjg0Ss
OGyqIjOf6ToF6ma39/XZI9miJe08k6x8b0aGUdG29Cko9UvjLH86ODEausSRAyFA
OyZFFuHHAau5FGpNvZfj
=AstN
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Don't exit the NFSv4 state manager without clearing
NFS4CLNT_MANAGER_RUNNING
Bugfixes:
- Fix an Oops when destroying the RPCSEC_GSS credential cache
- Fix an Oops during delegation callbacks
- Ensure that the NFSv4 state manager exits the loop on SIGKILL
- Fix a bogus get/put in generic_key_to_expire()"
* tag 'nfs-for-4.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix an Oops during delegation callbacks
SUNRPC: Fix a bogus get/put in generic_key_to_expire()
SUNRPC: Fix a Oops when destroying the RPCSEC_GSS credential cache
NFSv4: Ensure that the state manager exits the loop on SIGKILL
NFSv4: Don't exit the state manager without clearing NFS4CLNT_MANAGER_RUNNING
This issue happens when trying to add an existent tunnel. It
doesn't call sock_put() before returning -EEXIST to release
the sock refcnt that was held by calling sock_hold() before
the existence check.
This patch is to fix it by holding the sock after doing the
existence check.
Fixes: f6cd651b05 ("l2tp: fix race in duplicate tunnel detection")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
effort to hit, which might explain why they weren't found sooner.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb6zwdAAoJECebzXlCjuG+SLMP/AlpI+vPV7DdCLRWGCY1ZMjk
5pxIS+74mD2EopBYgZY58L1fxWgv2bLOiAs/baAlNpkjTNX3wlxXGTu9IzVdPOn7
3n+W2Rb+mXEFaag7mP8RFpOvt7Yb3p4DObGpg7TKWJZ6r/8xcxQWQO+e0iiS5+XK
EOiaFcGmYlOC1JtrRIL2fr16trXUhT1gz7qAZgKBzebbEdn4FfdsdwHm7nUyRB3I
LhCMV35RfzOBC2C/kQzlHaHYlo0dx5lKMtVzvtgMdpgXr4QXE/7Ke/ANQ7oGfhhO
9uX0Uf18HmeGRejK9QoMha7VWuwh5pyHBq0ppMpGL2jb11BD/l9iXgS+vTxpA2B0
YIiSOnaiDFsEk6hMsFqueVIdaTrarcjg/S2mh2QDjtkXKS3L0W6/7v97JJHu9J4l
6zxiT6Crq2p8pMZ5gY3RI1AYllW/K+TRoccLhO+q19g3q1HWxP6DyeFBgNF66/ha
NtmQP+94IkaCS70zirpEu/OeUMviQgX2x77OReyibHLA4+R+hNHwtR67BLl+xG0G
jmKHfqqX7offFaHmsoD8kK3gpKtit0/py9Hp7gXQg4vU5iL512gI83ICEOEkZMXn
Ppsrl1HyoO/ohY/USpMvRqYHjM1ZGew19ZzD7SId6vUVaYjQIEsjQVnycK3h+gSb
otk5pc3bWPCwa8csOWPs
=+1Ub
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.20-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Three nfsd bugfixes.
None are new bugs, but they all take a little effort to hit, which
might explain why they weren't found sooner"
* tag 'nfsd-4.20-1' of git://linux-nfs.org/~bfields/linux:
SUNRPC: drop pointless static qualifier in xdr_get_next_encode_buffer()
nfsd: COPY and CLONE operations require the saved filehandle to be set
sunrpc: correct the computation for page_ptr when truncating
RED qdisc's limit parameter changes the behaviour of the qdisc,
for instance if it's set to 0 qdisc will drop all the packets.
When replace operation happens and parameter is set to non-0
a new fifo qdisc will be instantiated and replace the old child
qdisc which will be destroyed.
Drivers need to know the parameter, even if they don't impose
the actual limit to be able to reliably reconstruct the Qdisc
hierarchy.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers offloading Qdiscs should have reasonable certainty
the offloaded behaviour matches the SW path. This is impossible
if the driver does not know about all Qdiscs or when Qdiscs move
and are reused. Send a graft notification from MQ.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers offloading Qdiscs should have reasonable certainty
the offloaded behaviour matches the SW path. This is impossible
if the driver does not know about all Qdiscs or when Qdiscs move
and are reused. Send a graft notification from RED. The drivers
are expected to simply stop offloading the Qdisc, if a non-standard
child is ever grafted onto it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers are currently not notified when a Qdisc is grafted as root.
This requires special casing Qdiscs added with parent = TC_H_ROOT in
the driver. Also there is no notification sent to the driver when
an existing Qdisc is grafted as root.
Add this very simple notifications, drivers should now be able to
track their Qdisc tree fully.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEENrCndlB/VnAEWuH5k9IU1zQoZfEFAlvlt0gTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRCT0hTXNChl8bMDB/9ElLCS/uh3CznHeX8w24t/LldHoy0q
eposGQ6+uWV/R7lUfNNUtIAcoSxzuOyXSMh9skz8NdExdQ0/9osnvNWemKTGrfhm
ndCVmMd7dMoWX2m1VTJ2jrij3MKPe8HmUei+kB9PrhHFNwofNSOvw2dEVjJDSwUW
gAvs6K/KrHh5ncd9O3JfaXqc9Cs95o0dz4U4AGZ68UjUemx1AmDse2q3JVPQcxn0
muXoWWFXBbKob/0qpFG0xP9ssdq75AL58dlEqRV+64EMgqWcgvdoPxGGIBbP4t0x
zMwE3hCaoC7Uogr28tnQrf4kSm5IC33AiMQDKmBQRtzFLxtCI1wE71M4
=eM20
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-4.20-20181109' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2018-11-09
this is a pull request of 20 patches for net/master.
First we have a patch by Oliver Hartkopp which changes the raw socket's
raw_sendmsg() to return an error value if the user tries to send a CANFD
frame to a CAN-2.0 device.
The next two patches are by Jimmy Assarsson and fix potential problems
in the kvaser_usb driver.
YueHaibing's patches for the ucan driver fix a compile time warning and
remove a duplicate include.
Eugeniu Rosca patch adds more binding documentation to the rcar_can
driver bindings. The next two patches are by Fabrizio Castro for the
rcar_can driver and fixes a problem in the driver's probe function and
document the r8a774a1 binding.
Lukas Wunner's patch fixes a recpetion problem in hi311x driver by
switching from edge to level triggered interruts.
The next three patches all target the flexcan driver. Pankaj Bansal's
patch unconditionally unlocks the last mailbox used for RX. Alexander
Stein provides a better workaround for a hardware limitation when
sending RTR frames, by using the last mailbox for TX, resulting in fewer
lost frames. The patch by me simplyfies the driver, by making a runtime
value a compile time constant.
The following 4 patches are by me and provide the groundwork for the
next patches by Oleksij Rempel. To avoid code duplication common code in
the common CAN driver infrastructure is factured out and error handling
is cleaned up.
The next 4 patches are by Oleksij Rempel and fix the problem in the
flexcan driver that other processes see TX frames arrive out of order
with ragards to a RX'ed frame (which are send by a different system on
the CAN bus as the result of our TX frame).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
nft_compat ops do not have static storage duration, unlike all other
expressions.
When nf_tables_expr_destroy() returns, expr->ops might have been
free'd already, so we need to store next address before calling
expression destructor.
For same reason, we can't deref match pointer after nft_xt_put().
This can be easily reproduced by adding msleep() before
nft_match_destroy() returns.
Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 07d02a67b7 causes a use-after free in the RPCSEC_GSS credential
destroy code, because the call to get_rpccred() in gss_destroying_context()
will now always fail to increment the refcount.
While we could just replace the get_rpccred() with a refcount_set(), that
would have the unfortunate consequence of resurrecting a credential in
the credential cache for which we are in the process of destroying the
RPCSEC_GSS context. Rather than do this, we choose to make a copy that
is never added to the cache and use that to destroy the context.
Fixes: 07d02a67b7 ("SUNRPC: Simplify lookup code")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When socks' sk_reuseport is set, the same port and address are allowed
to be bound into these socks who have the same uid.
Note that the difference from sk_reuse is that it allows multiple socks
to listen on the same port and address.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a part of sk_reuseport support for sctp. It defines a helper
sctp_bind_addrs_check() to check if the bind_addrs in two socks are
matched. It will add sock_reuseport if they are completely matched,
and return err if they are partly matched, and alloc sock_reuseport
if all socks are not matched at all.
It will work until sk_reuseport support is added in
sctp_get_port_local() in the next patch.
v1->v2:
- use 'laddr->valid && laddr2->valid' check instead as Marcelo
pointed in sctp_bind_addrs_check().
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a part of sk_reuseport support for sctp, and it selects a
sock by the hashkey of lport, paddr and dport by default. It will
work until sk_reuseport support is added in sctp_get_port_local()
in the next patch.
v1->v2:
- define lport as __be16 instead of __be32 as Marcelo pointed in
__sctp_rcv_lookup_endpoint().
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Its possible to set both HANDLE and POSITION when replacing a rule.
In this case, the rule at POSITION gets replaced using the
userspace-provided handle. Rule handles are supposed to be generated
by the kernel only.
Duplicate handles should be harmless, however better disable this "feature"
by only checking for the POSITION attribute on insert operations.
Fixes: 5e94846686 ("netfilter: nf_tables: add insert operation")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is no synchronization between packet path and the configuration plane.
The packet path uses two arrays with rules, one contains the current (active)
generation. The other either contains the last (obsolete) generation or
the future one.
Consider:
cpu1 cpu2
nft_do_chain(c);
delete c
net->gen++;
genbit = !!net->gen;
rules = c->rg[genbit];
cpu1 ignores c when updating if c is not active anymore in the new
generation.
On cpu2, we now use rules from wrong generation, as c->rg[old]
contains the rules matching 'c' whereas c->rg[new] was not updated and
can even point to rules that have been free'd already, causing a crash.
To fix this, make sure that 'current' to the 'next' generation are
identical for chains that are going away so that c->rg[new] will just
use the matching rules even if genbit was incremented already.
Fixes: 0cbc06b3fa ("netfilter: nf_tables: remove synchronize_rcu in commit phase")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When list->count is 0, the list is deleted by GC. But list->count is
never reached 0 because initial count value is 1 and it is increased
when node is inserted. So that initial value of list->count should be 0.
Originally GC always finds zero count list through deleting node and
decreasing count. However, list may be left empty since node insertion
may fail eg. allocaton problem. In order to solve this problem, GC
routine also finds zero count list without deleting node.
Fixes: cb2b36f5a9 ("netfilter: nf_conncount: Switch to plain list")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
conn_free() holds lock with spin_lock() and it is called by both
nf_conncount_lookup() and nf_conncount_gc_list(). nf_conncount_lookup()
is called from bottom-half context and nf_conncount_gc_list() from
process context. So that spin_lock() call is not safe. Hence
conn_free() should use spin_lock_bh() instead of spin_lock().
test commands:
%nft add table ip filter
%nft add chain ip filter input { type filter hook input priority 0\; }
%nft add rule filter input meter test { ip saddr ct count over 2 } \
counter
splat looks like:
[ 461.996507] ================================
[ 461.998999] WARNING: inconsistent lock state
[ 461.998999] 4.19.0-rc6+ #22 Not tainted
[ 461.998999] --------------------------------
[ 461.998999] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 461.998999] kworker/0:2/134 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 461.998999] 00000000a71a559a (&(&list->list_lock)->rlock){+.?.}, at: conn_free+0x69/0x2b0 [nf_conncount]
[ 461.998999] {IN-SOFTIRQ-W} state was registered at:
[ 461.998999] _raw_spin_lock+0x30/0x70
[ 461.998999] nf_conncount_add+0x28a/0x520 [nf_conncount]
[ 461.998999] nft_connlimit_eval+0x401/0x580 [nft_connlimit]
[ 461.998999] nft_dynset_eval+0x32b/0x590 [nf_tables]
[ 461.998999] nft_do_chain+0x497/0x1430 [nf_tables]
[ 461.998999] nft_do_chain_ipv4+0x255/0x330 [nf_tables]
[ 461.998999] nf_hook_slow+0xb1/0x160
[ ... ]
[ 461.998999] other info that might help us debug this:
[ 461.998999] Possible unsafe locking scenario:
[ 461.998999]
[ 461.998999] CPU0
[ 461.998999] ----
[ 461.998999] lock(&(&list->list_lock)->rlock);
[ 461.998999] <Interrupt>
[ 461.998999] lock(&(&list->list_lock)->rlock);
[ 461.998999]
[ 461.998999] *** DEADLOCK ***
[ 461.998999]
[ ... ]
Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Thanks to rigorous testing in wireless community mesh networks several
issues with multicast entries in the translation table were found and
fixed in the last 1.5 years. Now we see the first larger networks
(a few hundred nodes) with a batman-adv version with multicast
optimizations enabled arising, with no TT / multicast optimization
related issues so far.
Therefore it seems safe to enable multicast optimizations by default.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The commit ced72933a5 ("batman-adv: use CRC32C instead of CRC16 in TT
code") switched the translation table code from crc16 to crc32c. The
(optional) bridge loop avoidance code is the only user of this function.
batman-adv should only select CRC16 when it is actually using it.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The netlink dump functionality transfers a large number of entries from the
kernel to userspace. It is rather likely that the transfer has to
interrupted and later continued. During that time, it can happen that
either new entries are added or removed. The userspace could than either
receive some entries multiple times or miss entries.
Commit 670dc2833d ("netlink: advertise incomplete dumps") introduced a
mechanism to inform userspace about this problem. Userspace can then decide
whether it is necessary or not to retry dumping the information again.
The netlink dump functions have to be switched to exclusive locks to avoid
changes while the current message is prepared. The already existing
generation sequence counter from the hash helper can be used for this
simple hash.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The netlink dump functionality transfers a large number of entries from the
kernel to userspace. It is rather likely that the transfer has to
interrupted and later continued. During that time, it can happen that
either new entries are added or removed. The userspace could than either
receive some entries multiple times or miss entries.
Commit 670dc2833d ("netlink: advertise incomplete dumps") introduced a
mechanism to inform userspace about this problem. Userspace can then decide
whether it is necessary or not to retry dumping the information again.
The netlink dump functions have to be switched to exclusive locks to avoid
changes while the current message is prepared. The already existing
generation sequence counter from the hash helper can be used for this
simple hash.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The netlink dump functionality transfers a large number of entries from the
kernel to userspace. It is rather likely that the transfer has to
interrupted and later continued. During that time, it can happen that
either new entries are added or removed. The userspace could than either
receive some entries multiple times or miss entries.
Commit 670dc2833d ("netlink: advertise incomplete dumps") introduced a
mechanism to inform userspace about this problem. Userspace can then decide
whether it is necessary or not to retry dumping the information again.
The netlink dump functions have to be switched to exclusive locks to avoid
changes while the current message is prepared. The already existing
generation sequence counter from the hash helper can be used for this
simple hash.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The netlink dump functionality transfers a large number of entries from the
kernel to userspace. It is rather likely that the transfer has to
interrupted and later continued. During that time, it can happen that
either new entries are added or removed. The userspace could than either
receive some entries multiple times or miss entries.
Commit 670dc2833d ("netlink: advertise incomplete dumps") introduced a
mechanism to inform userspace about this problem. Userspace can then decide
whether it is necessary or not to retry dumping the information again.
The netlink dump functions have to be switched to exclusive locks to avoid
changes while the current message is prepared. The already existing
generation sequence counter from the hash helper can be used for this
simple hash.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The netlink dump functionality transfers a large number of entries from the
kernel to userspace. It is rather likely that the transfer has to
interrupted and later continued. During that time, it can happen that
either new entries are added or removed. The userspace could than either
receive some entries multiple times or miss entries.
Commit 670dc2833d ("netlink: advertise incomplete dumps") introduced a
mechanism to inform userspace about this problem. Userspace can then decide
whether it is necessary or not to retry dumping the information again.
The netlink dump functions have to be switched to exclusive locks to avoid
changes while the current message is prepared. The already existing
generation sequence counter from the hash helper can be used for this
simple hash.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Multiple datastructures use the hash helper functions to add and remove
entries from the simple hlist based hashes. These are often also dumped to
userspace via netlink and thus should have a generation sequence counter.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The netlink dump functionality transfers a large number of entries from the
kernel to userspace. It is rather likely that the transfer has to
interrupted and later continued. During that time, it can happen that
either new entries are added or removed. The userspace could than either
receive some entries multiple times or miss entries.
Commit 670dc2833d ("netlink: advertise incomplete dumps") introduced a
mechanism to inform userspace about this problem. Userspace can then decide
whether it is necessary or not to retry dumping the information again.
The netlink dump functions have to be switched to exclusive locks to avoid
changes while the current message is prepared. And an external generation
sequence counter is introduced which tracks all modifications of the list.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The netlink dump functionality transfers a large number of entries from the
kernel to userspace. It is rather likely that the transfer has to
interrupted and later continued. During that time, it can happen that
either new entries are added or removed. The userspace could than either
receive some entries multiple times or miss entries.
Commit 670dc2833d ("netlink: advertise incomplete dumps") introduced a
mechanism to inform userspace about this problem. Userspace can then decide
whether it is necessary or not to retry dumping the information again.
The netlink dump functions have to be switched to exclusive locks to avoid
changes while the current message is prepared. And an external generation
sequence counter is introduced which tracks all modifications of the list.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The debug messages of batman-adv are not printed to the kernel log at all
but can be stored (depending on the compile setting) in the tracing buffer
or the batadv specific log buffer. There is also no debug module parameter
but a batadv netdev specific log_level setting to enable/disable different
classes of debug messages at runtime.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The BATMAN_ADV_DEBUGFS portion of batman-adv is marked as deprecated. Thus
all required functionality should be available without it. The debug log
was already modified to also output via the kernel tracing function but
still retained its BATMAN_ADV_DEBUGFS functionality.
Separate the entry point for the debug log from the debugfs portions to
make it possible to build with BATMAN_ADV_DEBUG and without
BATMAN_ADV_DEBUGFS.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The batadv_dbg trace event uses different functionality and datastructures
which are not directly associated with the trace infrastructure. It should
not be expected that the trace headers indirectly provide them and instead
include the required headers directly.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The commit 00caf6a2b3 ("batman-adv: Mark debugfs functionality as
deprecated") introduced various messages to inform the user about the
deprecation of the debugfs based functionality. The messages also include
the context/task in which this problem was observed.
The datastructures and functions to access this information require special
headers. These should be included directly instead of depending on a more
complex and fragile include chain.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The commit dee222c7b2 ("batman-adv: Move OGM rebroadcast stats to
orig_ifinfo") removed all used functionality of the include linux/lockdep.h
from batadv_iv_ogm.c.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The complete size ("total_size") of the fragmented packet is stored in the
fragment header and in the size of the fragment chain. When the fragments
are ready for merge, the skbuff's tail of the first fragment is expanded to
have enough room after the data pointer for at least total_size. This means
that it gets expanded by total_size - first_skb->len.
But this is ignoring the fact that after expanding the buffer, the fragment
header is pulled by from this buffer. Assuming that the tailroom of the
buffer was already 0, the buffer after the data pointer of the skbuff is
now only total_size - len(fragment_header) large. When the merge function
is then processing the remaining fragments, the code to copy the data over
to the merged skbuff will cause an skb_over_panic when it tries to actually
put enough data to fill the total_size bytes of the packet.
The size of the skb_pull must therefore also be taken into account when the
buffer's tailroom is expanded.
Fixes: 610bfc6bc9 ("batman-adv: Receive fragmented packets and merge")
Reported-by: Martin Weinelt <martin@darmstadt.freifunk.net>
Co-authored-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The announcement messages of batman-adv COMPAT_VERSION 15 have the
possibility to announce additional information via a dynamic TVLV part.
This part is optional for the ELP packets and currently not parsed by the
Linux implementation. Still out-of-tree versions are using it to transport
things like neighbor hashes to optimize the rebroadcast behavior.
Since the ELP broadcast packets are smaller than the minimal ethernet
packet, it often has to be padded. This is often done (as specified in
RFC894) with octets of zero and thus work perfectly fine with the TVLV
part (making it a zero length and thus empty). But not all ethernet
compatible hardware seems to follow this advice. To avoid ambiguous
situations when parsing the TVLV header, just force the 4 bytes (TVLV
length + padding) after the required ELP header to zero.
Fixes: d6f94d91f7 ("batman-adv: ELP - adding basic infrastructure")
Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Similar to 80ba92fa1a ("codel: add ce_threshold attribute")
After EDT adoption, it became easier to implement DCTCP-like CE marking.
In many cases, queues are not building in the network fabric but on
the hosts themselves.
If packets leaving fq missed their Earliest Departure Time by XXX usec,
we mark them with ECN CE. This gives a feedback (after one RTT) to
the sender to slow down and find better operating mode.
Example :
tc qd replace dev eth0 root fq ce_threshold 2.5ms
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FQ pacing guarantees that paced packets queued by one flow do not
add head-of-line blocking for other flows.
After TCP GSO conversion, increasing limit_output_bytes to 1 MB is safe,
since this maps to 16 skbs at most in qdisc or device queues.
(or slightly more if some drivers lower {gso_max_segs|size})
We still can queue at most 1 ms worth of traffic (this can be scaled
by wifi drivers if they need to)
Tested:
# ethtool -c eth0 | egrep "tx-usecs:|tx-frames:" # 40 Gbit mlx4 NIC
tx-usecs: 16
tx-frames: 16
# tc qdisc replace dev eth0 root fq
# for f in {1..10};do netperf -P0 -H lpaa24,6 -o THROUGHPUT;done
Before patch:
27711
26118
27107
27377
27712
27388
27340
27117
27278
27509
After patch:
37434
36949
36658
36998
37711
37291
37605
36659
36544
37349
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_tso_should_defer() first heuristic is to not defer
if last send is "old enough".
Its current implementation uses jiffies and its low granularity.
TSO autodefer performance should not rely on kernel HZ :/
After EDT conversion, we have state variables in nanoseconds that
can allow us to properly implement the heuristic.
This patch increases TSO chunk sizes on medium rate flows,
especially when receivers do not use GRO or similar aggregation.
It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP
behavior more uniform.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_tso_should_defer() last step tries to check if the probable
next ACK packet is coming in less than half rtt.
Problem is that the head->tstamp might be in the future,
so we need to use signed arithmetics to avoid overflows.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Applications using MSG_EOR are giving a strong hint to TCP stack :
Subsequent sendmsg() can not append more bytes to skbs having
the EOR mark.
Do not try to TSO defer suchs skbs, there is really no hope.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bitwise operation is a little faster.
So I replace after() with using the flag FLAG_SND_UNA_ADVANCED as it is
already set before.
In addtion, there's another similar improvement in tcp_cwnd_reduction().
Cc: Joe Perches <joe@perches.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If sch_fq is used at ingress, skbs that might have been
timestamped by net_timestamp_set() if a packet capture
is requesting timestamps could be delayed by arbitrary
amount of time, since sch_fq time base is MONOTONIC.
Fix this problem by moving code from sch_netem.c to act_mirred.c.
Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link failure is detected locally, the link is reset, the flag
link->in_session is set to false, and a RESET_MSG with the 'stopping'
bit set is sent to the peer.
The purpose of this bit is to inform the peer that this endpoint just
is going down, and that the peer should handle the reception of this
particular RESET message as a local failure. This forces the peer to
accept another RESET or ACTIVATE message from this endpoint before it
can re-establish the link. This again is necessary to ensure that
link session numbers are properly exchanged before the link comes up
again.
If a failure is detected locally at the same time at the peer endpoint
this will do the same, which is also a correct behavior.
However, when receiving such messages, the endpoints will not
distinguish between 'stopping' RESETs and ordinary ones when it comes
to updating session numbers. Both endpoints will copy the received
session number and set their 'in_session' flags to true at the
reception, while they are still expecting another RESET from the
peer before they can go ahead and re-establish. This is contradictory,
since, after applying the validation check referred to below, the
'in_session' flag will cause rejection of all such messages, and the
link will never come up again.
We now fix this by not only handling received RESET/STOPPING messages
as a local failure, but also by omitting to set a new session number
and the 'in_session' flag in such cases.
Fixes: 7ea817f4e8 ("tipc: check session number before accepting link protocol messages")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the broadcast retransmission algorithm is using the
'prev_retr' field in struct tipc_link to time stamp the latest broadcast
retransmission occasion. This helps to restrict retransmission of
individual broadcast packets to max once per 10 milliseconds, even
though all other criteria for retransmission are met.
We now move this time stamp to the control block of each individual
packet, and remove other limiting criteria. This simplifies the
retransmission algorithm, and eliminates any risk of logical errors
in selecting which packets can be retransmitted.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently drivers can register to receive TC block bind/unbind callbacks
by implementing the setup_tc ndo in any of their given netdevs. However,
drivers may also be interested in binds to higher level devices (e.g.
tunnel drivers) to potentially offload filters applied to them.
Introduce indirect block devs which allows drivers to register callbacks
for block binds on other devices. The callback is triggered when the
device is bound to a block, allowing the driver to register for rules
applied to that block using already available functions.
Freeing an indirect block callback will trigger an unbind event (if
necessary) to direct the driver to remove any offloaded rules and unreg
any block rule callbacks. It is the responsibility of the implementing
driver to clean any registered indirect block callbacks before exiting,
if the block it still active at such a time.
Allow registering an indirect block dev callback for a device that is
already bound to a block. In this case (if it is an ingress block),
register and also trigger the callback meaning that any already installed
rules can be replayed to the calling driver.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same change as made to sctp_intl_store_reasm().
To be fully correct, an iterator has an undefined value when something
like skb_queue_walk() naturally terminates.
This will actually matter when SKB queues are converted over to
list_head.
Formalize what this code ends up doing with the current
implementation.
Signed-off-by: David S. Miller <davem@davemloft.net>
To be fully correct, an iterator has an undefined value when something
like skb_queue_walk() naturally terminates.
This will actually matter when SKB queues are converted over to
list_head.
Formalize what this code ends up doing with the current
implementation.
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate the assumption that SKBs and SKB list heads can
be cast to eachother in SKB list handling code.
This change also appears to fix a bug since the list->next pointer is
sampled outside of holding the SKB queue lock.
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out I missed one VLAN_TAG_PRESENT in OVS code while rebasing.
This fixes it.
Fixes: 9df46aefaf ("OVS: remove use of VLAN_TAG_PRESENT")
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCA_FLOWER_KEY_ENC_OPTS and TCA_FLOWER_KEY_ENC_OPTS_MASK can only
currently contain further nested attributes, which are parsed by
hand, so the policy is never actually used resulting in a W=1
build warning:
net/sched/cls_flower.c:492:1: warning: ‘enc_opts_policy’ defined but not used [-Wunused-const-variable=]
enc_opts_policy[TCA_FLOWER_KEY_ENC_OPTS_MAX + 1] = {
Add the validation anyway to avoid potential bugs when other
attributes are added and to make the attribute structure slightly
more clear. Validation will also set extact to point to bad
attribute on error.
Fixes: 0a6e77784f ("net/sched: allow flower to match tunnel options")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the udp6 code path, we needed multiple tests to select the correct
mib to be updated. Since we touch at least a counter at each iteration,
it's convenient to use the recently introduced __UDPX_MIB() helper once
and remove some code duplication.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only first fragment has the sport/dport information,
not the following ones.
If we want consistent hash for all fragments, we need to
ignore ports even for first fragment.
This bug is visible for IPv6 traffic, if incoming fragments
do not have a flow label, since skb_get_hash() will give
different results for first fragment and following ones.
It is also visible if any routing rule wants dissection
and sport or dport.
See commit 5e5d6fed37 ("ipv6: route: dissect flow
in input path if fib rules need it") for details.
[edumazet] rewrote the changelog completely.
Fixes: 06635a35d1 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Signed-off-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
if skb is NULL pointer, and the following access of skb's
skb_mstamp_ns will trigger panic, which is same as BUG_ON
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the socket is CAN FD enabled it can handle CAN FD frame
transmissions. Add an additional check in raw_sendmsg() as a CAN2.0 CAN
driver (non CAN FD) should never see a CAN FD frame. Due to the commonly
used can_dropped_invalid_skb() function the CAN 2.0 driver would drop
that CAN FD frame anyway - but with this patch the user gets a proper
-EINVAL return code.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This adds the fourth and final search class, containing policies
where both saddr and daddr have prefix lengths (i.e., not wildcards).
Inexact policies now end up in one of the following four search classes:
1. "Any:Any" list, containing policies where both saddr and daddr are
wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like.
2. "saddr:any" list, containing policies with a fixed saddr/prefixlen,
but without destination restrictions.
These lists are stored in rbtree nodes; each node contains those
policies matching saddr/prefixlen.
3. "Any:daddr" list. Similar to 2), except for policies where only the
destinations are specified.
4. "saddr:daddr" lists, containing only those policies that
match the given source/destination network.
The root of the saddr/daddr nodes gets stored in the nodes of the
'daddr' tree.
This diagram illustrates the list classes, and their
placement in the lookup hierarchy:
xfrm_pol_inexact_bin = hash(dir,type,family,if_id);
|
+---- root_d: sorted by daddr:prefix
| |
| xfrm_pol_inexact_node
| |
| +- root: sorted by saddr/prefix
| | |
| | xfrm_pol_inexact_node
| | |
| | + root: unused
| | |
| | + hhead: saddr:daddr policies
| |
| +- coarse policies and all any:daddr policies
|
+---- root_s: sorted by saddr:prefix
| |
| xfrm_pol_inexact_node
| |
| + root: unused
| |
| + hhead: saddr:any policies
|
+---- coarse policies and all any:any policies
lookup for an inexact policy returns pointers to the four relevant list
classes, after which each of the lists needs to be searched for the policy
with the higher priority.
This will only speed up lookups in case we have many policies and a
sizeable portion of these have disjunct saddr/daddr addresses.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This adds the 'saddr:any' search class. It contains all policies that have
a fixed saddr/prefixlen, but 'any' destination.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
validate the re-inserted policies match the lookup node.
Policies that fail this test won't be returned in the candidate set.
This is enabled by default for now, it should not cause noticeable
reinsert slow down.
Such reinserts are needed when we have to merge an existing node
(e.g. for 10.0.0.0/28 because a overlapping subnet was added (e.g.
10.0.0.0/24), so whenever this happens existing policies have to
be placed on the list of the new node.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This adds inexact lists per destination network, stored in a search tree.
Inexact lookups now return two 'candidate lists', the 'any' policies
('any' destionations), and a list of policies that share same
daddr/prefix.
Next patch will add a second search tree for 'saddr:any' policies
so we can avoid placing those on the 'any:any' list too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
At this time inexact policies are all searched in-order until the first
match is found. After removal of the flow cache, this resolution has
to be performed for every packetm resulting in major slowdown when
number of inexact policies is high.
This adds infrastructure to later sort inexact policies into a tree.
This only introduces a single class: any:any.
Next patch will add a search tree to pre-sort policies that
have a fixed daddr/prefixlen, so in this patch the any:any class
will still be used for all policies.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This avoids searches of polices that cannot match in the first
place due to different interface id by placing them in different bins.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Switch packet-path lookups for inexact policies to rhashtable.
In this initial version, we now no longer need to search policies with
non-matching address family and type.
Next patch will add the if_id as well so lookups from the xfrm interface
driver only need to search inexact policies for that device.
Future patches will augment the hlist in each rhash bucket with a tree
and pre-sort policies according to daddr/prefix.
A single rhashtable is used. In order to avoid a full rhashtable walk on
netns exit, the bins get placed on a pernet list, i.e. we add almost no
cost for network namespaces that had no xfrm policies.
The inexact lists are kept in place, and policies are added to both the
per-rhash-inexact list and a pernet one.
The latter is needed for the control plane to handle migrate -- these
requests do not consider the if_id, so if we'd remove the inexact_list
now we would have to search all hash buckets and then figure
out which matching policy candidate is the most recent one -- this appears
a bit harder than just keeping the 'old' inexact list for this purpose.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
currently policy_hash_bysel() returns the hash bucket list
(for exact policies), or the inexact list (when policy uses a prefix).
Searching this inexact list is slow, so it might be better to pre-sort
inexact lists into a tree or another data structure for faster
searching.
However, due to 'any' policies, that need to be searched in any case,
doing so will require that 'inexact' policies need to be handled
specially to decide the best search strategy. So change hash_bysel()
and return NULL if the policy can't be handled via the policy hash
table.
Right now, we simply use the inexact list when this happens, but
future patch can then implement a different strategy.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
... so we can reuse this later without code duplication when we add
policy to a second inexact list.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
currently all non-socket policies are either hashed in the dst table,
or placed on the 'inexact list'. When flushing, we first walk the
table, then the (per-direction) inexact lists.
When we try and get rid of the inexact lists to having "n" inexact
lists (e.g. per-af inexact lists, or sorted into a tree), this walk
would become more complicated.
Simplify this: walk the 'all' list and skip socket policies during
traversal so we don't need to handle exact and inexact policies
separately anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch proposes to extend the sk_lookup() BPF API to the
XDP hookpoint. The sk_lookup() helper supports a lookup
on incoming packet to find the corresponding socket that will
receive this packet. Current support for this BPF API is
at the tc hookpoint. This patch will extend this API at XDP
hookpoint. A XDP program can map the incoming packet to the
5-tuple parameter and invoke the API to find the corresponding
socket structure.
Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch allows eBPF programs that use sock_ops to send perf
based event notifications using bpf_perf_event_output(). Our main
use case for this is the following:
We would like to monitor some subset of TCP sockets in user-space,
(the monitoring application would define 4-tuples it wants to monitor)
using TCP_INFO stats to analyze reported problems. The idea is to
use those stats to see where the bottlenecks are likely to be ("is
it application-limited?" or "is there evidence of BufferBloat in
the path?" etc).
Today we can do this by periodically polling for tcp_info, but this
could be made more efficient if the kernel would asynchronously
notify the application via tcp_info when some "interesting"
thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
are reached. And to make this effective, it is better if
we could apply the threshold check *before* constructing the
tcp_info netlink notification, so that we don't waste resources
constructing notifications that will be discarded by the filter.
This work solves the problem by adding perf event based notification
support for sock_ops. The eBPF program can thus be designed to apply
any desired filters to the bpf_sock_ops and trigger a perf event
notification based on the evaluation from the filter. The user space
component can use these perf event notifications to either read any
state managed by the eBPF program, or issue a TCP_INFO netlink call
if desired.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently when an AP and STA interfaces are active in the same or different
radios, regulatory settings are restored whenever the STA disconnects. This
restores all channel information including dfs states in all radios.
For example, if an AP interface is active in one radio and STA in another,
when radar is detected on the AP interface, the dfs state of the channel
will be changed to UNAVAILABLE. But when the STA interface disconnects,
this issues a regulatory disconnect hint which restores all regulatory
settings in all the radios attached and thereby losing the stored dfs
state on the other radio where the channel was marked as unavailable
earlier. Hence prevent such regulatory restore whenever another active
beaconing interface is present in the same or other radios.
Signed-off-by: Sriram R <srirrama@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the FTM responder settings are changed simultaneously with
the CSA beacon, the buffer size allocated isn't sufficient and
we'll have a heap overrun. Fix this.
While at it, also clean up the ftm_responder assignment, doing
it only if ftm_responder is non-zero is valid as it's 0 to start
with, but not really useful to understand the code.
Fixes: bc847970f4 ("mac80211: support FTM responder configuration/statistics")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When FTM is enabled, doing a CSA will unexpectedly lose it since
the value of ftm_responder may be initialized to 0 instead of -1,
so fix that.
Fixes: 81e54d08d9 ("cfg80211: support FTM responder configuration/statistics")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This fixes stale beacon-int values that would keep a netdev
from going up.
To reproduce:
Create two VAP on one radio.
vap1 has beacon-int 100, start it.
vap2 has beacon-int 240, start it (and it will fail
because beacon-int mismatch).
reconfigure vap2 to have beacon-int 100 and start it.
It will fail because the stale beacon-int 240 will be used
in the ifup path and hostapd never gets a chance to set the
new beacon interval.
Cc: stable@vger.kernel.org
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
cfg80211_sta_opmode_change_notify needs a gfp_t flag to hint the nl80211
stack when allocating new skb, but it is called under tasklet context
here with GFP_KERNEL and kernel will yield a warning about it.
Cc: stable@vger.kernel.org
Fixes: ff84e7bfe1 ("mac80211: Add support to notify ht/vht opmode modification.")
Signed-off-by: Yan-Hsuan Chuang <yhchuang@realtek.com>
ACKed-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Do a logical vht_capa &= vht_capa_mask of user-supplied VHT mask with
the driver-supplied mask of modifiable VHT capabilities.
Fix whitespaces and comment typos.
Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add the missing unlock before return from function
ieee80211_mark_sta_auth() in the error handling case.
Cc: stable@vger.kernel.org
Fixes: fc107a9330 ("mac80211: Helper function for marking STA authenticated")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
[use result variable/label instead of duplicating]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Lookup functions in sk_lookup have different expectations about byte
order of provided arguments.
Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
expect dport to be in network byte order and do ntohs(dport) internally.
At the same time __inet6_lookup expects dport to be in host byte order
and correspondingly name the argument hnum.
sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
__inet6_lookup with regard to dport. But in __udp6_lib_lookup case it
uses host instead of expected network byte order. It makes result
returned by bpf_sk_lookup_udp for IPv6 incorrect.
The patch fixes byte order of dport passed to __udp6_lib_lookup.
Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f0
fixes TCPv6 but breaks UDPv6.
Fixes: 5ef0ae84f0 ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
To mirror software behaviour on offload more precisely inform
the drivers about the state of the harddrop flag.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently, in commit ab408b6dc7 ("tcp: switch tcp and sch_fq to new
earliest departure time model"), the TCP BBR code switched to a new
approach of using an explicit bbr_pacing_margin_percent for shaving a
pacing rate "haircut", rather than the previous implict
approach. Update an old comment to reflect the new approach.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes assumption than vlan_tci != 0 when tag is present.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>