Overlapping header include additions in macsec.c
A bug fix in 'net' overlapping with the removal of 'version'
string in ena_netdev.c
Overlapping test additions in selftests Makefile
Overlapping PCI ID table adjustments in iwlwifi driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix deadlock in bpf_send_signal() from Yonghong Song.
2) Fix off by one in kTLS offload of mlx5, from Tariq Toukan.
3) Add missing locking in iwlwifi mvm code, from Avraham Stern.
4) Fix MSG_WAITALL handling in rxrpc, from David Howells.
5) Need to hold RTNL mutex in tcindex_partial_destroy_work(), from Cong
Wang.
6) Fix producer race condition in AF_PACKET, from Willem de Bruijn.
7) cls_route removes the wrong filter during change operations, from
Cong Wang.
8) Reject unrecognized request flags in ethtool netlink code, from
Michal Kubecek.
9) Need to keep MAC in reset until PHY is up in bcmgenet driver, from
Doug Berger.
10) Don't leak ct zone template in act_ct during replace, from Paul
Blakey.
11) Fix flushing of offloaded netfilter flowtable flows, also from Paul
Blakey.
12) Fix throughput drop during tx backpressure in cxgb4, from Rahul
Lakkireddy.
13) Don't let a non-NULL skb->dev leave the TCP stack, from Eric
Dumazet.
14) TCP_QUEUE_SEQ socket option has to update tp->copied_seq as well,
also from Eric Dumazet.
15) Restrict macsec to ethernet devices, from Willem de Bruijn.
16) Fix reference leak in some ethtool *_SET handlers, from Michal
Kubecek.
17) Fix accidental disabling of MSI for some r8169 chips, from Heiner
Kallweit.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits)
net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build
net: ena: Add PCI shutdown handler to allow safe kexec
selftests/net/forwarding: define libs as TEST_PROGS_EXTENDED
selftests/net: add missing tests to Makefile
r8169: re-enable MSI on RTL8168c
net: phy: mdio-bcm-unimac: Fix clock handling
cxgb4/ptp: pass the sign of offset delta in FW CMD
net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop
net: cbs: Fix software cbs to consider packet sending time
net/mlx5e: Do not recover from a non-fatal syndrome
net/mlx5e: Fix ICOSQ recovery flow with Striding RQ
net/mlx5e: Fix missing reset of SW metadata in Striding RQ reset
net/mlx5e: Enhance ICOSQ WQE info fields
net/mlx5_core: Set IB capability mask1 to fix ib_srpt connection failure
selftests: netfilter: add nfqueue test case
netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress
netfilter: nft_fwd_netdev: validate family and chain type
netfilter: nft_set_rbtree: Detect partial overlaps on insertion
netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start()
netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion
...
Commit 3f8fd02b1b ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path. While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.
Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.
To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:
* vmalloc_sync_mappings(), and
* vmalloc_sync_unmappings()
Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized. The only exception is the new call-site added in the
above mentioned commit.
Shile Zhang directed us to a report of an 80% regression in reaim
throughput.
Fixes: 3f8fd02b1b ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
prevent inodes from vanishing, but ihold() does not guarantee inode
persistence. Replace the inode pointer with a per boot, machine wide,
unique inode identifier. The second commit fixes the breakage of the hash
mechanism whihc causes a 100% performance regression.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uQJsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYp4EAC5fr/AyRaIn/AEIZUmoyK6ELUaknfH
Z788avxDB/t5GkzC9A2dMpybYi78tzLSAEfB8jYgwbrqExapVtiqvjGZ1RIi3HoN
f/DWLnOb2s+yYQ3BQlHu4RdKONEzCqBwKFpElGRv3JzCY8Qeh5cQBzdqzvOEFmYw
P7DJVtJRZ2dud7AzJ+xk6KuNIKCT2F7Djmtop6nq1EVw0J/2oYOVgQu76APBj7cj
32srLmpP4xcQiJmWLC5ksXKiZrMPnyNfwXhHFufNvJ2Re6+Wf8mmglqG/5DmA+Ns
Sq3L7D7yXwtWQZ8Po1qnWhPDZVXQbWzHyTn4YAMJAK7yoO7mut8jgECt+A8vf4L+
hsc41c6THfdCQQ9gmxLL+c08nZGlmvIC4/1RsihNZ3kd2o4k6Ah9xFp8lBFcpjWd
7tuhakNqJvUOvB34t2AYqzMFbZ/FJG+QNGyIW0bTUn4YIgRPxI/zsdMxqGVBZ4oN
0iuy1kPLGbGAnLU9thkiVMmAyaPesuiB6f+mmzobEUgGI35GrCJi6a4YaTG1sqFn
Gl8oPzcU2n+DWbVBfJrVFHJye7oi78kCw6wpNLBCJQp8NP8doAH0Sgspglg52E/p
G4GGLz0vGauHBC5wQ3WYiGLImWbzC1dwKdcNE7dhuTgXbhz8ChVlOSU9Fu4+pGpq
6URL6DVTLwDZPg==
=e2iB
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fix from Thomas Gleixner:
"Fix for yet another subtle futex issue.
The futex code used ihold() to prevent inodes from vanishing, but
ihold() does not guarantee inode persistence. Replace the inode
pointer with a per boot, machine wide, unique inode identifier.
The second commit fixes the breakage of the hash mechanism which
causes a 100% performance regression"
* tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Unbreak futex hashing
futex: Fix inode life-time issue
which caused sys/sysinfo to be inconsistent with /proc/uptime when read
from a task inside a time namespace.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uRM8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodEtD/9G/4q0bnI9Oyghb1vSFK/cXTUq+gFZ
oRpBbcs7GlN8Nj6I51hE7iWqPz+wvKuQQkuTvjewg+7yO+19IecD+SHxu2NQ7PCz
W0PHRCqX32coPVeya9fxc5mdqtrG32uFaOfiL3UVGTwmkwfapfOOQgDPDFCf9vcu
bBP8YpvQOkU/bqH5sXBCO3u34n9NK6dpQLjcnPSYF+recSUiJVa17F3LVHFcCuXH
5Ck4lY2W+xARIBwapZzz5rey4U8SIMEaHANkSmS11jpg5WB3jUOX80z6zGus15sN
haoaxYSDjUwwKUOPrqglL/Dq4DkCVYyPSzyYM3IRaTV/LpH1Fh/e6i6kIjpXtCth
d19rcklU8o2TIFw2qCK9V2Z7j71BOF10EYnf0MWNmAF5q2v/D6isQfwDz6sbU4mU
TLPCOiAgrGbWAS87ywBCdLjom4sejuL9tEf+O9i0YuqLp95BE2F4zjY/m19r5UNm
vCX1XEbga68Gxi+Oy39hhEanSAo6MhK6PEGH8KabAdkC4/ioyE5PzsccQA2lWwQp
cvpEMhGyDSzi7BcCw3uwp7DipBMpoIQu54OZ7cSoVeM9qNcj9dh4Y/Z0mF8v+Ibf
RucI7P68iloPMw8vWJV8gOm1xl2B7iCKrIs4TVxqf3J+ZzprWmtUPCo75vKo2vlf
l4lvX4dTRQB6Fw==
=Is++
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single fix adding the missing time namespace adjustment in
sys/sysinfo which caused sys/sysinfo to be inconsistent with
/proc/uptime when read from a task inside a time namespace"
* tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sys/sysinfo: Respect boottime inside time namespace
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-03-13
The following pull-request contains BPF updates for your *net-next* tree.
We've added 86 non-merge commits during the last 12 day(s) which contain
a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).
The main changes are:
1) Add modify_return attach type which allows to attach to a function via
BPF trampoline and is run after the fentry and before the fexit programs
and can pass a return code to the original caller, from KP Singh.
2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
objects to be visible in /proc/kallsyms so they can be annotated in
stack traces, from Jiri Olsa.
3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
in order to enable this for BPF based socket dispatch, from Lorenz Bauer.
4) Introduce a new bpftool 'prog profile' command which attaches to existing
BPF programs via fentry and fexit hooks and reads out hardware counters
during that period, from Song Liu. Example usage:
bpftool prog profile id 337 duration 3 cycles instructions llc_misses
4228 run_cnt
3403698 cycles (84.08%)
3525294 instructions # 1.04 insn per cycle (84.05%)
13 llc_misses # 3.69 LLC misses per million isns (83.50%)
5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
of a new bpf_link abstraction to keep in particular BPF tracing programs
attached even when the applicaion owning them exits, from Andrii Nakryiko.
6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering
and which returns the PID as seen by the init namespace, from Carlos Neira.
7) Refactor of RISC-V JIT code to move out common pieces and addition of a
new RV32G BPF JIT compiler, from Luke Nelson.
8) Add gso_size context member to __sk_buff in order to be able to know whether
a given skb is GSO or not, from Willem de Bruijn.
9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at __bpf_prog_enter() and __bpf_prog_exit()
warning: context imbalance in __bpf_prog_enter() - wrong count at exit
warning: context imbalance in __bpf_prog_exit() - unexpected unlock
The root cause is the missing annotation at __bpf_prog_enter()
and __bpf_prog_exit()
Add the missing __acquires(RCU) annotation
Add the missing __releases(RCU) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200311010908.42366-2-jbi.octave@gmail.com
Now that we have all the objects (bpf_prog, bpf_trampoline,
bpf_dispatcher) linked in bpf_tree, there's no need to have
separate bpf_image tree for images.
Reverting the bpf_image tree together with struct bpf_image,
because it's no longer needed.
Also removing bpf_image_alloc function and adding the original
bpf_jit_alloc_exec_page interface instead.
The kernel_text_address function can now rely only on is_bpf_text_address,
because it checks the bpf_tree that contains all the objects.
Keeping bpf_image_ksym_add and bpf_image_ksym_del because they are
useful wrappers with perf's ksymbol interface calls.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-13-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding dispatchers to kallsyms. It's displayed as
bpf_dispatcher_<NAME>
where NAME is the name of dispatcher.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-12-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding trampolines to kallsyms. It's displayed as
bpf_trampoline_<ID> [bpf]
where ID is the BTF id of the trampoline function.
Adding bpf_image_ksym_add/del functions that setup
the start/end values and call KSYMBOL perf events
handlers.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-11-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Separating /proc/kallsyms add/del code and adding bpf_ksym_add/del
functions for that.
Moving bpf_prog_ksym_node_add/del functions to __bpf_ksym_add/del
and changing their argument to 'struct bpf_ksym' object. This way
we can call them for other bpf objects types like trampoline and
dispatcher.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-10-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding 'prog' bool flag to 'struct bpf_ksym' to mark that
this object belongs to bpf_prog object.
This change allows having bpf_prog objects together with
other types (trampolines and dispatchers) in the single
bpf_tree. It's used when searching for bpf_prog exception
tables by the bpf_prog_ksym_find function, where we need
to get the bpf_prog pointer.
>From now we can safely add bpf_ksym support for trampoline
or dispatcher objects, because we can differentiate them
from bpf_prog objects.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-9-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding bpf_ksym_find function that is used bpf bpf address
lookup functions:
__bpf_address_lookup
is_bpf_text_address
while keeping bpf_prog_kallsyms_find to be used only for lookup
of bpf_prog objects (will happen in following changes).
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-8-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Moving ksym_tnode list node to 'struct bpf_ksym' object,
so the symbol itself can be chained and used in other
objects like bpf_trampoline and bpf_dispatcher.
We need bpf_ksym object to be linked both in bpf_kallsyms
via lnode for /proc/kallsyms and in bpf_tree via tnode for
bpf address lookup functions like __bpf_address_lookup or
bpf_prog_kallsyms_find.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200312195610.346362-7-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding lnode list node to 'struct bpf_ksym' object,
so the struct bpf_ksym itself can be chained and used
in other objects like bpf_trampoline and bpf_dispatcher.
Changing iterator to bpf_ksym in bpf_get_kallsym function.
The ksym->start is holding the prog->bpf_func value,
so it's ok to use it as value in bpf_get_kallsym.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200312195610.346362-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding name to 'struct bpf_ksym' object to carry the name
of the symbol for bpf_prog, bpf_trampoline, bpf_dispatcher
objects.
The current benefit is that name is now generated only when
the symbol is added to the list, so we don't need to generate
it every time it's accessed.
The future benefit is that we will have all the bpf objects
symbols represented by struct bpf_ksym.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200312195610.346362-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding 'struct bpf_ksym' object that will carry the
kallsym information for bpf symbol. Adding the start
and end address to begin with. It will be used by
bpf_prog, bpf_trampoline, bpf_dispatcher objects.
The symbol_start/symbol_end values were originally used
to sort bpf_prog objects. For the address displayed in
/proc/kallsyms we are using prog->bpf_func value.
I'm using the bpf_func value for program symbol start
instead of the symbol_start, because it makes no difference
for sorting bpf_prog objects and we can use it directly as
an address to display it in /proc/kallsyms.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200312195610.346362-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of requiring users to do three steps for cleaning up bpf_link, its
anon_inode file, and unused fd, abstract that away into bpf_link_cleanup()
helper. bpf_link_defunct() is removed, as it shouldn't be needed as an
individual operation anymore.
v1->v2:
- keep bpf_link_cleanup() static for now (Daniel).
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200313002128.2028680-1-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf 2020-03-12
The following pull-request contains BPF updates for your *net* tree.
We've added 12 non-merge commits during the last 8 day(s) which contain
a total of 12 files changed, 161 insertions(+), 15 deletions(-).
The main changes are:
1) Andrii fixed two bugs in cgroup-bpf.
2) John fixed sockmap.
3) Luke fixed x32 jit.
4) Martin fixed two issues in struct_ops.
5) Yonghong fixed bpf_send_signal.
6) Yoshiki fixed BTF enum.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new helper that reuses existing xdp perf_event output
implementation, but can be called from raw_tracepoint programs
that receive 'struct xdp_buff *' as a tracepoint argument.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial
New bpf helper bpf_get_ns_current_pid_tgid,
This helper will return pid and tgid from current task
which namespace matches dev_t and inode number provided,
this will allows us to instrument a process inside a container.
Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200304204157.58695-3-cneirabustos@gmail.com
Pull networking fixes from David Miller:
"It looks like a decent sized set of fixes, but a lot of these are one
liner off-by-one and similar type changes:
1) Fix netlink header pointer to calcular bad attribute offset
reported to user. From Pablo Neira Ayuso.
2) Don't double clear PHY interrupts when ->did_interrupt is set,
from Heiner Kallweit.
3) Add missing validation of various (devlink, nl802154, fib, etc.)
attributes, from Jakub Kicinski.
4) Missing *pos increments in various netfilter seq_next ops, from
Vasily Averin.
5) Missing break in of_mdiobus_register() loop, from Dajun Jin.
6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.
7) Work around FMAN erratum A050385, from Madalin Bucur.
8) Make sure ARP header is pulled early enough in bonding driver,
from Eric Dumazet.
9) Do a cond_resched() during multicast processing of ipvlan and
macvlan, from Mahesh Bandewar.
10) Don't attach cgroups to unrelated sockets when in interrupt
context, from Shakeel Butt.
11) Fix tpacket ring state management when encountering unknown GSO
types. From Willem de Bruijn.
12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
only in the suspend context. From Heiner Kallweit"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
net: systemport: fix index check to avoid an array out of bounds access
tc-testing: add ETS scheduler to tdc build configuration
net: phy: fix MDIO bus PM PHY resuming
net: hns3: clear port base VLAN when unload PF
net: hns3: fix RMW issue for VLAN filter switch
net: hns3: fix VF VLAN table entries inconsistent issue
net: hns3: fix "tc qdisc del" failed issue
taprio: Fix sending packets without dequeueing them
net: mvmdio: avoid error message for optional IRQ
net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
net: memcg: fix lockdep splat in inet_csk_accept()
s390/qeth: implement smarter resizing of the RX buffer pool
s390/qeth: refactor buffer pool code
s390/qeth: use page pointers to manage RX buffer pool
seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
net/packet: tpacket_rcv: do not increment ring index on drop
sxgbe: Fix off by one in samsung driver strncpy size arg
net: caif: Add lockdep expression to RCU traversal primitive
MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXmebNQAKCRCRxhvAZXjc
ohidAP4y7sujHKMe87Qd6RFQ+aPTB1cGVgBSyMV5DuvbTW0R9QEA/bWSUtye5+Ln
WRDkXapGM2l36s02xspgokaAhYiFoAE=
=a6eO
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread fix from Christian Brauner:
"This contains a single fix for a regression which was introduced when
we introduced the ability to select a specific pid at process creation
time.
When this feature is requested, the error value will be set to -EPERM
after exiting the pid allocation loop. This caused EPERM to be
returned when e.g. the init process/child subreaper of the pid
namespace has already died where we used to return ENOMEM before.
The first patch here simply fixes the regression by unconditionally
setting the return value back to ENOMEM again once we've successfully
allocated the requested pid number. This should be easy to backport to
v5.5.
The second patch adds a comment explaining that we must keep returning
ENOMEM since we've been doing it for a long time and have explicitly
documented this behavior for userspace. This seemed worthwhile because
we now have at least two separate example where people tried to change
the return value to something other than ENOMEM (The first version of
the regression fix did that too and the commit message links to an
earlier patch that tried to do the same.).
I have a simple regression test to make sure we catch this regression
in the future but since that introduces a whole new selftest subdir
and test files I'll keep this for v5.7"
* tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
pid: make ENOMEM return value more obvious
pid: Fix error return value in some cases
can break live patching.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXmj5ExQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quQGAQDO35RBAQDGmpxnSCQPNwrzqokM8p8d
1e1xshwOVnwqgAEA7csC4u1n5Z8ncIl5Pd8ygt4nXeqw4AenHLeZIdhfegc=
=+AeW
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Have ftrace lookup_rec() return a consistent record otherwise it can
break live patching"
* tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Return the first found result in lookup_rec()
It appears that ip ranges can overlap so. In that case lookup_rec()
returns whatever results it got last even if it found nothing in last
searched page.
This breaks an obscure livepatch late module patching usecase:
- load livepatch
- load the patched module
- unload livepatch
- try to load livepatch again
To fix this return from lookup_rec() as soon as it found the record
containing searched-for ip. This used to be this way prior lookup_rec()
introduction.
Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com
Cc: stable@vger.kernel.org
Fixes: 7e16f581a8 ("ftrace: Separate out functionality from ftrace_location_range()")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add bpf_link_new_file() API for cases when we need to ensure anon_inode is
successfully created before we proceed with expensive BPF program attachment
procedure, which will require equally (if not more so) expensive and
potentially failing compensation detachment procedure just because anon_inode
creation failed. This API allows to simplify code by ensuring first that
anon_inode is created and after BPF program is attached proceed with
fd_install() that can't fail.
After anon_inode file is created, link can't be just kfree()'d anymore,
because its destruction will be performed by deferred file_operations->release
call. For this, bpf_link API required specifying two separate operations:
release() and dealloc(), former performing detachment only, while the latter
frees memory used by bpf_link itself. dealloc() needs to be specified, because
struct bpf_link is frequently embedded into link type-specific container
struct (e.g., struct bpf_raw_tp_link), so bpf_link itself doesn't know how to
properly free the memory. In case when anon_inode file was successfully
created, but subsequent BPF attachment failed, bpf_link needs to be marked as
"defunct", so that file's release() callback will perform only memory
deallocation, but no detachment.
Convert raw tracepoint and tracing attachment to new API and eliminate
detachment from error handling path.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309231051.1270337-1-andriin@fb.com
We are testing network memory accounting in our setup and noticed
inconsistent network memory usage and often unrelated cgroups network
usage correlates with testing workload. On further inspection, it
seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
irq context specially for cgroup v1.
mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
and kind of assumes that this can only happen from sk_clone_lock()
and the source sock object has already associated cgroup. However in
cgroup v1, where network memory accounting is opt-in, the source sock
can be unassociated with any cgroup and the new cloned sock can get
associated with unrelated interrupted cgroup.
Cgroup v2 can also suffer if the source sock object was created by
process in the root cgroup or if sk_alloc() is called in irq context.
The fix is to just do nothing in interrupt.
WARNING: Please note that about half of the TCP sockets are allocated
from the IRQ context, so, memory used by such sockets will not be
accouted by the memcg.
The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
Hardware name: ...
Call Trace:
<IRQ>
dump_stack+0x57/0x75
mem_cgroup_sk_alloc+0xe9/0xf0
sk_clone_lock+0x2a7/0x420
inet_csk_clone_lock+0x1b/0x110
tcp_create_openreq_child+0x23/0x3b0
tcp_v6_syn_recv_sock+0x88/0x730
tcp_check_req+0x429/0x560
tcp_v6_rcv+0x72d/0xa40
ip6_protocol_deliver_rcu+0xc9/0x400
ip6_input+0x44/0xd0
? ip6_protocol_deliver_rcu+0x400/0x400
ip6_rcv_finish+0x71/0x80
ipv6_rcv+0x5b/0xe0
? ip6_sublist_rcv+0x2e0/0x2e0
process_backlog+0x108/0x1e0
net_rx_action+0x26b/0x460
__do_softirq+0x104/0x2a6
do_softirq_own_stack+0x2a/0x40
</IRQ>
do_softirq.part.19+0x40/0x50
__local_bh_enable_ip+0x51/0x60
ip6_finish_output2+0x23d/0x520
? ip6table_mangle_hook+0x55/0x160
__ip6_finish_output+0xa1/0x100
ip6_finish_output+0x30/0xd0
ip6_output+0x73/0x120
? __ip6_finish_output+0x100/0x100
ip6_xmit+0x2e3/0x600
? ipv6_anycast_cleanup+0x50/0x50
? inet6_csk_route_socket+0x136/0x1e0
? skb_free_head+0x1e/0x30
inet6_csk_xmit+0x95/0xf0
__tcp_transmit_skb+0x5b4/0xb20
__tcp_send_ack.part.60+0xa3/0x110
tcp_send_ack+0x1d/0x20
tcp_rcv_state_process+0xe64/0xe80
? tcp_v6_connect+0x5d1/0x5f0
tcp_v6_do_rcv+0x1b1/0x3f0
? tcp_v6_do_rcv+0x1b1/0x3f0
__release_sock+0x7f/0xd0
release_sock+0x30/0xa0
__inet_stream_connect+0x1c3/0x3b0
? prepare_to_wait+0xb0/0xb0
inet_stream_connect+0x3b/0x60
__sys_connect+0x101/0x120
? __sys_getsockopt+0x11b/0x140
__x64_sys_connect+0x1a/0x20
do_syscall_64+0x51/0x200
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
Fixes: 2d75807383 ("mm: memcontrol: consolidate cgroup socket tracking")
Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup fixes from Tejun Heo:
- cgroup.procs listing related fixes.
It didn't interlock properly with exiting tasks leaving a short
window where a cgroup has empty cgroup.procs but still can't be
removed and misbehaved on short reads.
- psi_show() crash fix on 32bit ino archs
- Empty release_agent handling fix
* 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup1: don't call release_agent when it is ""
cgroup: fix psi_show() crash on 32bit ino archs
cgroup: Iterate tasks that did not finish do_exit()
cgroup: cgroup_procs_next should increase position index
cgroup-v1: cgroup_pidlist_next should update position index
Pull workqueue fixes from Tejun Heo:
"Workqueue has been incorrectly round-robining per-cpu work items.
Hillf's patch fixes that.
The other patch documents memory-ordering properties of workqueue
operations"
* 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: don't use wq_select_unbound_cpu() for bound works
workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()
btf_enum_check_member() was currently sure to recognize the size of
"enum" type members in struct/union as the size of "int" even if
its size was packed.
This patch fixes BTF enum verification to use the correct size
of member in BPF programs.
Fixes: 179cde8cef ("bpf: btf: Check members of struct/union")
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1583825550-18606-2-git-send-email-komachi.yoshiki@gmail.com
wq_select_unbound_cpu() is designed for unbound workqueues only, but
it's wrongly called when using a bound workqueue too.
Fixing this ensures work queued to a bound workqueue with
cpu=WORK_CPU_UNBOUND always runs on the local CPU.
Before, that would happen only if wq_unbound_cpumask happened to include
it (likely almost always the case), or was empty, or we got lucky with
forced round-robin placement. So restricting
/sys/devices/virtual/workqueue/cpumask to a small subset of a machine's
CPUs would cause some bound work items to run unexpectedly there.
Fixes: ef55718044 ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Hillf Danton <hdanton@sina.com>
[dj: massage changelog]
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
There is no compensating cgroup_bpf_put() for each ancestor cgroup in
cgroup_bpf_inherit(). If compute_effective_progs returns error, those cgroups
won't be freed ever. Fix it by putting them in cleanup code path.
Fixes: e10360f815 ("bpf: cgroup: prevent out-of-order release of cgroup bpf")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20200309224017.1063297-1-andriin@fb.com
The alloc_pid() codepath used to be simpler. With the introducation of the
ability to choose specific pids in 49cb2fc42c ("fork: extend clone3() to
support setting a PID") it got more complex. It hasn't been super obvious
that ENOMEM is returned when the pid namespace init process/child subreaper
of the pid namespace has died. As can be seen from multiple attempts to
improve this see e.g. [1] and most recently [2].
We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
comment on top explaining that this is historic and documented behavior and
cannot easily be changed.
[1]: 35f71bc0a0 ("fork: report pid reservation failure properly")
[2]: b26ebfe12f ("pid: Fix error return value in some cases")
[3]: 49cb2fc42c ("fork: extend clone3() to support setting a PID")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The recent futex inode life time fix changed the ordering of the futex key
union struct members, but forgot to adjust the hash function accordingly,
As a result the hashing omits the leading 64bit and even hashes beyond the
futex key causing a bad hash distribution which led to a ~100% performance
regression.
Hand in the futex key pointer instead of a random struct member and make
the size calculation based of the struct offset.
Fixes: 8019ad13ef ("futex: Fix inode life-time issue")
Reported-by: Rong Chen <rong.a.chen@intel.com>
Decoded-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Rong Chen <rong.a.chen@intel.com>
Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de
Recent changes to alloc_pid() allow the pid number to be specified on
the command line. If set_tid_size is set, then the code scanning the
levels will hard-set retval to -EPERM, overriding it's previous -ENOMEM
value.
After the code scanning the levels, there are error returns that do not
set retval, assuming it is still set to -ENOMEM.
So set retval back to -ENOMEM after scanning the levels.
Fixes: 49cb2fc42c ("fork: extend clone3() to support setting a PID")
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <areber@redhat.com>
Cc: <stable@vger.kernel.org> # 5.5
Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
[christian.brauner@ubuntu.com: fixup commit message]
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5j8hwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnjID/4/XVrqtVNUzVoVOtkOyxyesBrJVMHEQEpJ
PZssv835IStw0ENhxQJfGjPaIFc9Ff6PMkeN5KRAlMoEc+NkrJShF3owGf+6Bps7
rxpblPxaw+CJFa31YBDZVjMCvbVkDm40G5SsJh+xzdIjlWz7MppkkMPdrErPwY8V
0vnrIc+mKBKfBMZTwVkycYtp17LVgfXguledoWzxM1y47IW5UasKh8jdzhbu8Hvt
zztdQrigUdb+9XnLGCZIY0JQOyrhJ5zQpZ40FzbvxdYrQZXOoYT8L7iFu/z0Wi7K
p3a+G+B4WowtLYW78me4Uut5RrHq2XOehSypfujanQlpgXPGjS3TdHT3an2T8XPQ
NyGsZsn/eLm3btNbhGUd8vqpQy5EmWhqmwvYk9tFAoSFLiLcvCC624b/TCYPL+gk
3ZiI7mXBMjHnUZ0J/RF6kZWTAZDvr/tE7UZt1f8r1eEr8VDzCNp5Pst+HCVIguYD
g9eWF8oH6wYoj39UKf1k+vW2GjXGFsnfivObaxhyz03sAPXK2wQlzAe/4jZ24XNr
TRtOXh97c3CbLAwdUHehlzzdR3U7h0n2KsmrTC5AGmLABmR79s7BJ0+pexuZituO
LwU8+gpf7AugHTrLg1eNXAmBHW44I1ticXYiWcT4iSPn99kNIhlW+Jb1iTGoiu7n
nXyS3b5SCw==
=xwKl
-----END PGP SIGNATURE-----
Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Here are a few fixes that should go into this release. This contains:
- Revert of a bad bcache patch from this merge window
- Removed unused function (Daniel)
- Fixup for the blktrace fix from Jan from this release (Cengiz)
- Fix of deeper level bfqq overwrite in BFQ (Carlo)"
* tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
blktrace: fix dereference after null check
Revert "bcache: ignore pending signals when creating gc and allocator thread"
block: Remove used kblockd_schedule_work_on()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXmNvpgAKCRCRxhvAZXjc
ouFvAQDCzfOx1vcEP/nNhYBP2MPuafKclJcoJggC9rSmIvcLiQD/TI+LyHzplD+m
MWSu9NZJ6h6qyjKJivja3/bs8DVEewU=
=4gyS
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux
Pull thread fixes from Christian Brauner:
"Here are a few hopefully uncontroversial fixes:
- Use RCU_INIT_POINTER() when initializing rcu protected members in
task_struct to fix sparse warnings.
- Add pidfd_fdinfo_test binary to .gitignore file"
* tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
selftests: pidfd: Add pidfd_fdinfo_test in .gitignore
exit: Fix Sparse errors and warnings
fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.
This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
test_run.o is not built when CONFIG_NET is not set and
bpf_prog_test_run_tracing being referenced in bpf_trace.o causes the
linker error:
ld: kernel/trace/bpf_trace.o:(.rodata+0x38): undefined reference to
`bpf_prog_test_run_tracing'
Add a __weak function in bpf_trace.c to handle this.
Fixes: da00d2f117 ("bpf: Add test ops for BPF_PROG_TYPE_TRACING")
Signed-off-by: KP Singh <kpsingh@google.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305220127.29109-1-kpsingh@chromium.org
While well intentioned, checking CAP_MAC_ADMIN for attaching
BPF_MODIFY_RETURN tracing programs to "security_" functions is not
necessary as tracing BPF programs already require CAP_SYS_ADMIN.
Fixes: 6ba43b761c ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305204955.31123-1-kpsingh@chromium.org
struct_ops map cannot support map_freeze. Otherwise, a struct_ops
cannot be unregistered from the subsystem.
Fixes: 85d33df357 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305013454.535397-1-kafai@fb.com
The current always succeed behavior in bpf_struct_ops_map_delete_elem()
is not ideal for userspace tool. It can be improved to return proper
error value.
If it is in TOBEFREE, it means unregistration has been already done
before but it is in progress and waiting for the subsystem to clear
the refcnt to zero, so -EINPROGRESS.
If it is INIT, it means the struct_ops has not been registered yet,
so -ENOENT.
Fixes: 85d33df357 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305013447.535326-1-kafai@fb.com
When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
#5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
#6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
#7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
#8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
#9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
#10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
#11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
#12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
#13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
#14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
#15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
#16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
#17 [ffffc9002219fc90] schedule at ffffffff81a3e219
#18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
#19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
#20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
#21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
#22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
#23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068
The above call stack is actually very similar to an issue
reported by Commit eac9153f2b ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
bpf_send_signal() helper instead of bpf_get_stack() helper.
The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153f2b, the below almost identical reproducer
used tracepoint point sched/sched_switch so the issue can be easily caught.
/* stress_test.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define THREAD_COUNT 1000
char *filename;
void *worker(void *p)
{
void *ptr;
int fd;
char *pptr;
fd = open(filename, O_RDONLY);
if (fd < 0)
return NULL;
while (1) {
struct timespec ts = {0, 1000 + rand() % 2000};
ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
usleep(1);
if (ptr == MAP_FAILED) {
printf("failed to mmap\n");
break;
}
munmap(ptr, 4096 * 64);
usleep(1);
pptr = malloc(1);
usleep(1);
pptr[0] = 1;
usleep(1);
free(pptr);
usleep(1);
nanosleep(&ts, NULL);
}
close(fd);
return NULL;
}
int main(int argc, char *argv[])
{
void *ptr;
int i;
pthread_t threads[THREAD_COUNT];
if (argc < 2)
return 0;
filename = argv[1];
for (i = 0; i < THREAD_COUNT; i++) {
if (pthread_create(threads + i, NULL, worker, NULL)) {
fprintf(stderr, "Error creating thread\n");
return 0;
}
}
for (i = 0; i < THREAD_COUNT; i++)
pthread_join(threads[i], NULL);
return 0;
}
and the following command:
1. run `stress_test /bin/ls` in one windown
2. hack bcc trace.py with the following change:
--- a/tools/trace.py
+++ b/tools/trace.py
@@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
__data.tgid = __tgid;
__data.pid = __pid;
bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
+ bpf_send_signal(10);
%s
%s
%s.perf_submit(%s, &__data, sizeof(__data));
3. in a different window run
./trace.py -p $(pidof stress_test) t:sched:sched_switch
The deadlock can be reproduced in our production system.
Similar to Song's fix, the fix is to delay sending signal if
irqs is disabled to avoid deadlocks involving with rq_lock.
With this change, my above stress-test in our production system
won't cause deadlock any more.
I also implemented a scale-down version of reproducer in the
selftest (a subsequent commit). With latest bpf-next,
it complains for the following potential deadlock.
[ 32.832450] -> #1 (&p->pi_lock){-.-.}:
[ 32.833100] _raw_spin_lock_irqsave+0x44/0x80
[ 32.833696] task_rq_lock+0x2c/0xa0
[ 32.834182] task_sched_runtime+0x59/0xd0
[ 32.834721] thread_group_cputime+0x250/0x270
[ 32.835304] thread_group_cputime_adjusted+0x2e/0x70
[ 32.835959] do_task_stat+0x8a7/0xb80
[ 32.836461] proc_single_show+0x51/0xb0
...
[ 32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
[ 32.840275] __lock_acquire+0x1358/0x1a20
[ 32.840826] lock_acquire+0xc7/0x1d0
[ 32.841309] _raw_spin_lock_irqsave+0x44/0x80
[ 32.841916] __lock_task_sighand+0x79/0x160
[ 32.842465] do_send_sig_info+0x35/0x90
[ 32.842977] bpf_send_signal+0xa/0x10
[ 32.843464] bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
[ 32.844301] trace_call_bpf+0x115/0x270
[ 32.844809] perf_trace_run_bpf_submit+0x4a/0xc0
[ 32.845411] perf_trace_sched_switch+0x10f/0x180
[ 32.846014] __schedule+0x45d/0x880
[ 32.846483] schedule+0x5f/0xd0
...
[ 32.853148] Chain exists of:
[ 32.853148] &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
[ 32.853148]
[ 32.854451] Possible unsafe locking scenario:
[ 32.854451]
[ 32.855173] CPU0 CPU1
[ 32.855745] ---- ----
[ 32.856278] lock(&rq->lock);
[ 32.856671] lock(&p->pi_lock);
[ 32.857332] lock(&rq->lock);
[ 32.857999] lock(&(&sighand->siglock)->rlock);
Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
but it has been held by CPU1 and CPU1 tries to grab &rq->lock
and cannot get it.
This is not exactly the callstack in our production environment,
but sympotom is similar and both locks are using spin_lock_irqsave()
to acquire the lock, and both involves rq_lock. The fix to delay
sending signal when irq is disabled also fixed this issue.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
There was a recent change in blktrace.c that added a RCU protection to
`q->blk_trace` in order to fix a use-after-free issue during access.
However the change missed an edge case that can lead to dereferencing of
`bt` pointer even when it's NULL:
Coverity static analyzer marked this as a FORWARD_NULL issue with CID
1460458.
```
/kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
1898 ret = 0;
1899 if (bt == NULL)
1900 ret = blk_trace_setup_queue(q, bdev);
1901
1902 if (ret == 0) {
1903 if (attr == &dev_attr_act_mask)
>>> CID 1460458: Null pointer dereferences (FORWARD_NULL)
>>> Dereferencing null pointer "bt".
1904 bt->act_mask = value;
1905 else if (attr == &dev_attr_pid)
1906 bt->pid = value;
1907 else if (attr == &dev_attr_start_lba)
1908 bt->start_lba = value;
1909 else if (attr == &dev_attr_end_lba)
```
Added a reassignment with RCU annotation to fix the issue.
Fixes: c780e86dd4 ("blktrace: Protect q->blk_trace with RCU")
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current fexit and fentry tests rely on a different program to
exercise the functions they attach to. Instead of doing this, implement
the test operations for tracing which will also be used for
BPF_MODIFY_RETURN in a subsequent patch.
Also, clean up the fexit test to use the generated skeleton.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org
- Allow BPF_MODIFY_RETURN attachment only to functions that are:
* Whitelisted for error injection by checking
within_error_injection_list. Similar discussions happened for the
bpf_override_return helper.
* security hooks, this is expected to be cleaned up with the LSM
changes after the KRSI patches introduce the LSM_HOOK macro:
https://lore.kernel.org/bpf/20200220175250.10795-1-kpsingh@chromium.org/
- The attachment is currently limited to functions that return an int.
This can be extended later other types (e.g. PTR).
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-5-kpsingh@chromium.org
When multiple programs are attached, each program receives the return
value from the previous program on the stack and the last program
provides the return value to the attached function.
The fmod_ret bpf programs are run after the fentry programs and before
the fexit programs. The original function is only called if all the
fmod_ret programs return 0 to avoid any unintended side-effects. The
success value, i.e. 0 is not currently configurable but can be made so
where user-space can specify it at load time.
For example:
int func_to_be_attached(int a, int b)
{ <--- do_fentry
do_fmod_ret:
<update ret by calling fmod_ret>
if (ret != 0)
goto do_fexit;
original_function:
<side_effects_happen_here>
} <--- do_fexit
The fmod_ret program attached to this function can be defined as:
SEC("fmod_ret/func_to_be_attached")
int BPF_PROG(func_name, int a, int b, int ret)
{
// This will skip the original function logic.
return 1;
}
The first fmod_ret program is passed 0 in its return argument.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org
As we need to introduce a third type of attachment for trampolines, the
flattened signature of arch_prepare_bpf_trampoline gets even more
complicated.
Refactor the prog and count argument to arch_prepare_bpf_trampoline to
use bpf_tramp_progs to simplify the addition and accounting for new
attachment types.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org