Commit Graph

900710 Commits

Daniel Borkmann
eb1e1478b6 Merge branch 'bpf-sockmap-listen'
Jakub Sitnicki says:

====================
This patch set turns SOCK{MAP,HASH} into generic collections for TCP
sockets, both listening and established. Adding support for listening
sockets enables us to use these BPF map types with reuseport BPF programs.

Why? SOCKMAP and SOCKHASH, in comparison to REUSEPORT_SOCKARRAY, allow
the socket to be in more than one map at the same time.
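
A rough user-space sketch of what this enables (the map fds, the
single-slot key, and the helper name are illustrative, and the maps are
assumed to be created with 8-byte values):

  #include <bpf/bpf.h>

  /* Sketch: insert one TCP socket into two SOCKMAPs; a
   * REUSEPORT_SOCKARRAY would reject the second insert. */
  int insert_into_both(int map_fd_a, int map_fd_b, int sock_fd)
  {
      __u32 key = 0;
      __u64 value = sock_fd;

      if (bpf_map_update_elem(map_fd_a, &key, &value, BPF_ANY))
          return -1;
      return bpf_map_update_elem(map_fd_b, &key, &value, BPF_ANY);
  }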

Having a BPF map type that can hold listening sockets, and gracefully
co-exist with reuseport BPF is important if, in the future, we want
BPF programs that run at socket lookup time [0]. Cover letter for v1 of
this series tells the full story of how we got here [1].

Although SOCK{MAP,HASH} are not a drop-in replacement for SOCKARRAY just
yet, because UDP support is lacking, it's a step in this direction. We're
working with Lorenz on extending SOCK{MAP,HASH} to hold UDP sockets, and
expect to post an RFC series for sockmap + UDP in the near future.

I've dropped Acks from all patches that have been touched since v6.

The audit for missing READ_ONCE annotations for access to sk_prot is
ongoing. Thus far I've found one location specific to TCP listening sockets
that needed annotating. This got fixed in this iteration. I wonder if the
sparse checker could be put to work to identify places where we have
sk_prot access while not holding sk_lock...

The patch series depends on another one, posted earlier [2], that has
been split out of it.

v6 -> v7:

- Extended the series to cover SOCKHASH. (patches 4-8, 10-11) (John)

- Rebased onto recent bpf-next. Resolved conflicts in recent fixes to
  sk_state checks on sockmap/sockhash update path. (patch 4)

- Added missing READ_ONCE annotation in sock_copy. (patch 1)

- Split out patches that simplify sk_psock_restore_proto [2].

v5 -> v6:

- Added a fix-up for patch 1 which I forgot to commit in v5. Sigh.

v4 -> v5:

- Rebase onto recent bpf-next to resolve conflicts. (Daniel)

v3 -> v4:

- Make tcp_bpf_clone parameter names consistent across function declaration
  and definition. (Martin)

- Use sock_map_redirect_okay helper everywhere we need to take a different
  action for listening sockets. (Lorenz)

- Expand comment explaining the need for a callback from reuseport to
  sockarray code in reuseport_detach_sock. (Martin)

- Mention the possibility of using a u64 counter for reuseport IDs in the
  future in the description for patch 10. (Martin)

v2 -> v3:

- Generate reuseport ID when group is created. Please see patch 10
  description for details. (Martin)

- Fix the build when CONFIG_NET_SOCK_MSG is not selected by either
  CONFIG_BPF_STREAM_PARSER or CONFIG_TLS. (kbuild bot & John)

- Allow updating sockmap from BPF on BPF_SOCK_OPS_TCP_LISTEN_CB callback. An
  oversight in previous iterations. Users may want to populate the sockmap with
  listening sockets from BPF as well.

- Removed RCU read lock assertion in sock_map_lookup_sys. (Martin)

- Get rid of a warning when child socket was cloned with parent's psock
  state. (John)

- Check for tcp_bpf_unhash rather than tcp_bpf_recvmsg when deciding if
  sk_proto needs restoring on clone. Checking for recvmsg in the context of
  listening socket cloning was confusing. (Martin)

- Consolidate sock_map_sk_is_suitable with sock_map_update_okay. This led
  to adding dedicated predicates for sockhash. Update self-tests
  accordingly. (John)

- Annotate unlikely branch in bpf_{sk,msg}_redirect_map when socket isn't
  in a map, or isn't a valid redirect target. (John)

- Document paired READ/WRITE_ONCE annotations and cover shared access in
  more detail in patch 2 description. (John)

- Correct a couple of log messages in sockmap_listen self-tests so the
  message reflects the actual failure.

- Rework reuseport tests from sockmap_listen suite so that ENOENT error
  from bpf_sk_select_reuseport handler does not happen on happy path.

v1 -> v2:

- af_ops->syn_recv_sock callback is no longer overridden and burdened with
  restoring sk_prot and clearing sk_user_data in the child socket. As the
  child socket is already hashed when syn_recv_sock returns, it is too late
  to put it in the right state. Instead, patches 3 & 4 address restoring
  sk_prot and clearing sk_user_data before we hash the child socket.
  (Pointed out by Martin Lau)

- Annotate shared access to sk->sk_prot with READ_ONCE/WRITE_ONCE macros as
  we write to it from sk_msg while socket might be getting cloned on
  another CPU. (Suggested by John Fastabend)

- Convert tests for SOCKMAP holding listening sockets to return-on-error
  style, and hook them up to test_progs. Also use BPF skeleton for setup.
  Add new tests to cover the race scenario discovered during v1 review.

RFC -> v1:

- Switch from overriding proto->accept to af_ops->syn_recv_sock, which
  happens earlier. Clearing the psock state after accept() does not work
  for child sockets that become orphaned (never got accepted). v4-mapped
  sockets need special care.

- Return the socket cookie on SOCKMAP lookup from syscall to be on par with
  REUSEPORT_SOCKARRAY. Requires SOCKMAP to take u64 on lookup/update from
  syscall.

- Make bpf_sk_redirect_map (ingress) and bpf_msg_redirect_map (egress)
  SOCKMAP helpers fail when target socket is a listening one.

- Make bpf_sk_select_reuseport helper fail when target is a TCP established
  socket.

- Teach libbpf to recognize SK_REUSEPORT program type from section name.

- Add a dedicated set of tests for SOCKMAP holding listening sockets,
  covering map operations, overridden socket callbacks, and BPF helpers.

[0] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
[1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/
[2] https://lore.kernel.org/bpf/20200217121530.754315-1-jakub@cloudflare.com/
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-02-21 22:31:41 +01:00
Jakub Sitnicki
44d28be2b8 selftests/bpf: Tests for sockmap/sockhash holding listening sockets
Now that the SOCKMAP and SOCKHASH map types can store listening sockets,
the user-space and BPF APIs are open to a new set of potential pitfalls.

Exercise the map operations, with extra attention to code paths susceptible
to races between map ops and socket cloning, and the BPF helpers that work
with SOCKMAP/SOCKHASH, to gain confidence that all works as expected.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-12-jakub@cloudflare.com
2020-02-21 22:29:46 +01:00
Jakub Sitnicki
11318ba8ca selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH
Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
is not hard-coded in the test setup routine.

This, together with careful state cleaning after the tests, lets us run the
test cases for REUSEPORT_ARRAY, SOCKMAP, and SOCKHASH to have test coverage
for all supported map types. The last two support only TCP sockets at the
moment.
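
A sketch of the parametrization (the test driver name is hypothetical):

  #include <linux/bpf.h>

  #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

  extern void run_reuseport_tests(enum bpf_map_type type); /* hypothetical */

  static const enum bpf_map_type map_types[] = {
      BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
      BPF_MAP_TYPE_SOCKMAP,
      BPF_MAP_TYPE_SOCKHASH,
  };

  void test_select_reuseport(void)
  {
      unsigned int i;

      /* Same test body, once per supported map type. */
      for (i = 0; i < ARRAY_SIZE(map_types); i++)
          run_reuseport_tests(map_types[i]);
  }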

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-11-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
035ff358f2 net: Generate reuseport group ID on group creation
Commit 736b46027e ("net: Add ID (if needed) to sock_reuseport and expose
reuseport_lock") introduced lazy generation of reuseport group IDs that
survive group resize.

By comparing the identifiers, we check that a BPF reuseport program is not
trying to select a socket from a BPF map that belongs to a different
reuseport group than the one the packet is for.

Because SOCKARRAY used to be the only BPF map type that could be used with
reuseport BPF, it was possible to delay the generation of the reuseport
group ID until a socket from the group was inserted into a BPF map for the
first time.

Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options,
either generate the reuseport ID on map update, like SOCKARRAY does, or
allocate an ID from the start when reuseport group gets created.

This patch takes the latter approach to keep sockmap free of calls into
reuseport code. This streamlines the reuseport_id access, as its lifetime
now matches the longevity of the reuseport object.

The cost of this simplification, however, is that we allocate reuseport IDs
for all SO_REUSEPORT users, even those that don't use SOCKARRAY in their
setups. With the way identifiers are currently generated, we can have at
most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
get close to the limit, we can switch to a u64 counter like sk_cookie.
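
A minimal sketch of the eager allocation (the IDA-based allocator and the
helper's shape are illustrative, not the exact kernel code):

  #include <linux/idr.h>
  #include <linux/slab.h>
  #include <net/sock_reuseport.h>

  static DEFINE_IDA(reuseport_ida);

  /* Sketch: allocate the group ID in reuseport_alloc() instead of
   * lazily on first BPF map insert. */
  static struct sock_reuseport *reuseport_alloc_sketch(unsigned int max_socks)
  {
      struct sock_reuseport *reuse;
      int id;

      reuse = kzalloc(struct_size(reuse, socks, max_socks), GFP_ATOMIC);
      if (!reuse)
          return NULL;

      id = ida_alloc(&reuseport_ida, GFP_ATOMIC); /* ids stay <= S32_MAX */
      if (id < 0) {
          kfree(reuse);
          return NULL;
      }
      reuse->reuseport_id = id;
      return reuse;
  }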

Another change is that we now always call into SOCKARRAY logic to unlink
the socket from the map when unhashing or closing the socket. Previously we
did it only when at least one socket from the group was in a BPF map.

It is worth noting that this doesn't conflict with sockmap tear-down in
case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
group. sockmap tear-down happens first:

  prot->unhash
  `- tcp_bpf_unhash
     |- tcp_bpf_remove
     |  `- while (sk_psock_link_pop(psock))
     |     `- sk_psock_unlink
     |        `- sock_map_delete_from_link
     |           `- __sock_map_delete
     |              `- sock_map_unref
     |                 `- sk_psock_put
     |                    `- sk_psock_drop
     |                       `- rcu_assign_sk_user_data(sk, NULL)
     `- inet_unhash
        `- reuseport_detach_sock
           `- bpf_sk_reuseport_detach
              `- WRITE_ONCE(sk->sk_user_data, NULL)

Suggested-by: Martin Lau <kafai@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
9fed9000c5 bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH
SOCKMAP & SOCKHASH now support storing references to listening
sockets. Nothing keeps us from using these map types as a collection of
sockets to select from in BPF reuseport programs. Whitelist the map types
with the bpf_sk_select_reuseport helper.

The restriction that the socket has to be a member of a reuseport group
still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb
set are not a valid target and we signal it with -EINVAL.

The main benefit from this change is that, in contrast to
REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose a restriction that a
listening socket can be in just one BPF map at a time.
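
For illustration, a minimal sk_reuseport program selecting from a SOCKMAP
might look like this (the map layout and key scheme are assumptions):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
      __uint(type, BPF_MAP_TYPE_SOCKMAP);
      __uint(max_entries, 16);
      __type(key, __u32);
      __type(value, __u64);
  } redir_map SEC(".maps");

  SEC("sk_reuseport")
  int select_from_sockmap(struct sk_reuseport_md *reuse_md)
  {
      __u32 key = 0; /* always pick slot 0, for the sketch */

      /* Fails if the target socket has no sk_reuseport_cb set. */
      if (bpf_sk_select_reuseport(reuse_md, &redir_map, &key, 0))
          return SK_DROP;
      return SK_PASS;
  }

  char _license[] SEC("license") = "GPL";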

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
1d59f3bcee bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH
Don't require the kernel code, like BPF helpers, that needs access to
SOCK{MAP,HASH} map contents to live in net/core/sock_map.c. Expose the
lookup operation to all kernel-land.

Lookup from BPF context is not whitelisted yet, while syscalls have a
dedicated lookup handler.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-8-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
c1cdf65da0 bpf, sockmap: Return socket cookie on lookup from syscall
Tooling that populates the SOCK{MAP,HASH} with sockets from user-space
needs a way to inspect its contents. Returning the struct sock * that the
map holds to user-space is neither safe nor useful. An approach established
by REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
instead.

Since socket cookies are u64 values, SOCK{MAP,HASH} need to support such a
value size for lookup to be possible. This requires special handling on
update, though. Attempts to do a lookup on a map holding u32 values will be
met with ENOSPC error.
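
A user-space sketch of the lookup (the map is assumed to have been
created with an 8-byte value size):

  #include <stdio.h>
  #include <bpf/bpf.h>

  /* Sketch: read the socket cookie stored at slot 0. On a map created
   * with 4-byte values, this lookup fails with ENOSPC instead. */
  int read_cookie(int map_fd)
  {
      __u32 key = 0;
      __u64 cookie;

      if (bpf_map_lookup_elem(map_fd, &key, &cookie))
          return -1;
      printf("socket cookie: %llu\n", (unsigned long long)cookie);
      return 0;
  }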

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
6e830c2f6c bpf, sockmap: Don't set up upcalls and progs for listening sockets
Now that sockmap/sockhash can hold listening sockets, when setting up the
psock we will (1) grab references to verdict/parser progs, and (2) override
socket upcalls sk_data_ready and sk_write_space.

However, since we cannot redirect to listening sockets, we don't need to
link the socket to the BPF progs. And more importantly, we don't want the
listening socket to have overridden upcalls, because they would get
inherited by child sockets cloned from it.

Introduce a separate initialization path for listening sockets that does
not change the upcalls and ignores the BPF progs.
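
A sketch of the split (sock_map_link_no_progs is named elsewhere in this
series; the exact signatures here are assumptions):

  /* Sketch: pick the psock setup path based on socket state. */
  static int sock_map_link_sketch(struct bpf_map *map,
                                  struct sk_psock_progs *progs,
                                  struct sock *sk)
  {
      if (sk->sk_state == TCP_LISTEN)
          /* no prog refs, no sk_data_ready/sk_write_space override */
          return sock_map_link_no_progs(map, sk);
      return sock_map_link(map, progs, sk);
  }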

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-6-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
8ca30379a4 bpf, sockmap: Allow inserting listening TCP sockets into sockmap
In order for sockmap/sockhash types to become generic collections for
storing TCP sockets we need to loosen the checks during map update, while
tightening the checks in redirect helpers.

Currently sock{map,hash} require the TCP socket to be in established state,
which prevents inserting listening sockets.

Change the update pre-checks so the socket can also be in listening state.

Since it doesn't make sense to redirect with sock{map,hash} to listening
sockets, add appropriate socket state checks to BPF redirect helpers too.
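
A sketch of the two checks (the redirect helper name mirrors the cover
letter's sock_map_redirect_okay; the bodies are illustrative):

  #include <net/tcp_states.h>

  /* Update path: established and listening TCP sockets are allowed. */
  static bool sock_map_sk_state_allowed_sketch(const struct sock *sk)
  {
      return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
  }

  /* Redirect path: listening sockets are not valid targets. */
  static bool sock_map_redirect_okay_sketch(const struct sock *sk)
  {
      return sk->sk_state != TCP_LISTEN;
  }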

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-5-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
e80251555f tcp_bpf: Don't let child socket inherit parent protocol ops on copy
Prepare for cloning listening sockets that have their protocol callbacks
overridden by sk_msg. Child sockets must not inherit parent callbacks that
access state stored in sk_user_data owned by the parent.

Restore the child socket protocol callbacks before it gets hashed and any
of the callbacks can get invoked.
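
Based on the cover letter's note about checking for tcp_bpf_unhash, a
sketch of the restore step (treat the exact shape as illustrative):

  /* Sketch: if the parent's callbacks were overridden by tcp_bpf,
   * point the child at the original proto before it gets hashed. */
  void tcp_bpf_clone_sketch(const struct sock *sk, struct sock *newsk)
  {
      if (newsk->sk_prot->unhash == tcp_bpf_unhash)
          newsk->sk_prot = sk->sk_prot_creator;
  }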

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-4-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
f1ff5ce2cd net, sk_msg: Clear sk_user_data pointer on clone if tagged
sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at a reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, as we don't hold any locks at that time.

This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned at least 2 bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms our assumption.

Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
if the flag bit is set when setting sk_user_data, to catch any future
misuses.
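
The tagging trick itself, as a self-contained user-space demo (the
kernel's actual helpers and flag handling differ):

  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  #define NOCOPY_FLAG 1UL /* demo stand-in for the patch's flag */

  /* Tag bit 0; valid only for objects aligned to at least 2 bytes. */
  static void *set_nocopy(void *p)
  {
      assert(((uintptr_t)p & NOCOPY_FLAG) == 0);
      return (void *)((uintptr_t)p | NOCOPY_FLAG);
  }

  /* On "clone", drop the pointer if it carries the no-copy tag. */
  static void *copy_for_child(void *p)
  {
      return ((uintptr_t)p & NOCOPY_FLAG) ? NULL : p;
  }

  int main(void)
  {
      long obj = 42; /* at least 2-byte aligned */

      /* Prints a NULL pointer: the child lost the copy. */
      printf("child copy: %p\n", copy_for_child(set_nocopy(&obj)));
      return 0;
  }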

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
b8e202d1d1 net, sk_msg: Annotate lockless access to sk_prot on clone
sk_msg and ULP frameworks override the protocol callbacks pointer in
sk->sk_prot, while tcp accesses it locklessly when cloning the listening
socket, that is, with neither sk_lock nor sk_callback_lock held.

Once we enable use of listening sockets with sockmap (and hence sk_msg),
there will be shared access to sk->sk_prot if the socket is getting cloned
while being inserted into or deleted from the sockmap on another CPU:

Read side:

tcp_v4_rcv
  sk = __inet_lookup_skb(...)
  tcp_check_req(sk)
    inet_csk(sk)->icsk_af_ops->syn_recv_sock
      tcp_v4_syn_recv_sock
        tcp_create_openreq_child
          inet_csk_clone_lock
            sk_clone_lock
              READ_ONCE(sk->sk_prot)

Write side:

sock_map_ops->map_update_elem
  sock_map_update_elem
    sock_map_update_common
      sock_map_link_no_progs
        tcp_bpf_init
          tcp_bpf_update_sk_prot
            sk_psock_update_proto
              WRITE_ONCE(sk->sk_prot, ops)

sock_map_ops->map_delete_elem
  sock_map_delete_elem
    __sock_map_delete
     sock_map_unref
       sk_psock_put
         sk_psock_drop
           sk_psock_restore_proto
             tcp_update_ulp
               WRITE_ONCE(sk->sk_prot, proto)

Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
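
In sketch form (simplified from the call chains above):

  #include <net/sock.h>

  /* Reader: tcp clone path, runs with no locks held. */
  static struct proto *clone_read_prot_sketch(const struct sock *sk)
  {
      return READ_ONCE(sk->sk_prot);
  }

  /* Writer: sockmap update/delete path on another CPU. */
  static void psock_write_prot_sketch(struct sock *sk, struct proto *ops)
  {
      WRITE_ONCE(sk->sk_prot, ops);
  }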

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Yonghong Song
e42da4c62a docs/bpf: Update bpf development Q/A file
bpf now has its own mailing list, bpf@vger.kernel.org.
Update the bpf_devel_QA.rst file to reflect this.

Also, llvm has switched to github, with llvm and clang
in the same repo: https://github.com/llvm/llvm-project.git.
Update the QA file with newer build instructions.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200221004354.930952-1-yhs@fb.com
2020-02-20 18:05:37 -08:00
Andrii Nakryiko
006ed53e8c selftests/bpf: Fix trampoline_count clean up logic
Libbpf's Travis CI tests caught this issue. Ensure bpf_link and bpf_object
cleanup is performed correctly.

Fixes: d633d57902 ("selftest/bpf: Add test for allowed trampolines count")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20200220230546.769250-1-andriin@fb.com
2020-02-20 18:03:10 -08:00
Alexei Starovoitov
2c3a368127 Merge branch 'set_attach_target'
Eelco Chaudron says:

====================
Currently, when you want to attach a trace program to a bpf program,
the section name needs to match the tracepoint/function semantics.

However the addition of the bpf_program__set_attach_target() API
allows you to specify the tracepoint/function dynamically.

The call flow would look something like this:

  xdp_fd = bpf_prog_get_fd_by_id(id);
  trace_obj = bpf_object__open_file("func.o", NULL);
  prog = bpf_object__find_program_by_title(trace_obj,
                                           "fentry/myfunc");
  bpf_program__set_expected_attach_type(prog, BPF_TRACE_FENTRY);
  bpf_program__set_attach_target(prog, xdp_fd,
                                 "xdpfilt_blk_all");
  bpf_object__load(trace_obj);

v1 -> v2: Remove requirement for attach type hint in API
v2 -> v3: Moved common warning to __find_vmlinux_btf_id, requested by Andrii
          Updated the xdp_bpf2bpf test to use this new API
v3 -> v4: Split up patch, update libbpf.map version
v4 -> v5: Fix return code, and prog assignment in test case
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-02-20 17:51:40 -08:00
Eelco Chaudron
933ce62d68 selftests/bpf: Update xdp_bpf2bpf test to use new set_attach_target API
Use the new bpf_program__set_attach_target() API in the xdp_bpf2bpf
selftest so it can be referenced as an example on how to use it.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/158220520562.127661.14289388017034825841.stgit@xdp-tutorial
2020-02-20 17:48:40 -08:00
Eelco Chaudron
ff26ce5cd7 libbpf: Add support for dynamic program attach target
Currently, when you want to attach a trace program to a bpf program,
the section name needs to match the tracepoint/function semantics.

However the addition of the bpf_program__set_attach_target() API
allows you to specify the tracepoint/function dynamically.

The call flow would look something like this:

  xdp_fd = bpf_prog_get_fd_by_id(id);
  trace_obj = bpf_object__open_file("func.o", NULL);
  prog = bpf_object__find_program_by_title(trace_obj,
                                           "fentry/myfunc");
  bpf_program__set_expected_attach_type(prog, BPF_TRACE_FENTRY);
  bpf_program__set_attach_target(prog, xdp_fd,
                                 "xdpfilt_blk_all");
  bpf_object__load(trace_obj);

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158220519486.127661.7964708960649051384.stgit@xdp-tutorial
2020-02-20 17:48:40 -08:00
Eelco Chaudron
dd88aed92d libbpf: Bump libpf current version to v0.0.8
A new development cycle starts, bump to v0.0.8.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158220518424.127661.8278643006567775528.stgit@xdp-tutorial
2020-02-20 17:48:40 -08:00
Andrii Nakryiko
5327644614 libbpf: Relax check whether BTF is mandatory
If a BPF program is using BTF-defined maps, BTF is required only for
libbpf itself to process the map definitions. If after that BTF fails to
be loaded into the kernel (e.g., because the kernel doesn't support BTF
at all), this shouldn't prevent a valid BPF program from loading. The
existing retry-without-BTF logic for creating maps will succeed in
creating such maps without any problems. So, the presence of a .maps
section shouldn't make BTF required for the kernel. Update the check
accordingly.

Validated by ensuring simple BPF program with BTF-defined maps is still
loaded on old kernel without BTF support and map is correctly parsed and
created.

Fixes: abd29c9314 ("libbpf: allow specifying map definitions using BTF")
Reported-by: Julia Kartseva <hex@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200220062635.1497872-1-andriin@fb.com
2020-02-20 11:03:39 -08:00
Alexei Starovoitov
500897804a selftests/bpf: Fix build of sockmap_ktls.c
The selftest fails to build with:
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c: In function ‘test_sockmap_ktls_disconnect_after_delete’:
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c:72:37: error: ‘TCP_ULP’ undeclared (first use in this function)
   72 |  err = setsockopt(cli, IPPROTO_TCP, TCP_ULP, "tls", strlen("tls"));
      |                                     ^~~~~~~

Similar to the commit that fixes the build of sockmap_basic.c on systems
with an old /usr/include, fix the build of sockmap_ktls.c.
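
The usual shape of such a fix is a fallback define; 31 matches the value
in include/uapi/linux/tcp.h:

  /* Fallback for systems whose /usr/include predates TCP_ULP. */
  #ifndef TCP_ULP
  #define TCP_ULP 31
  #endif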

Fixes: d1ba1204f2 ("selftests/bpf: Test unhashing kTLS socket after removing from map")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200219205514.3353788-1-ast@kernel.org
2020-02-20 01:17:24 +01:00
Yonghong Song
83250f2b69 selftests/bpf: Change llvm flag -mcpu=probe to -mcpu=v3
The latest llvm supports cpu version v3, which is cpu version v1
plus some additional 64-bit jmp insns and 32-bit jmp insn support.

In the selftests/bpf Makefile, the llvm flag -mcpu=probe does a runtime
probe of the host system. Depending on the compilation environment,
it is possible that the runtime probe may fail, e.g., due to a
memlock issue. This will cause code to be generated for cpu version v1.
This may cause confusion, as the same compiler and the same C code then
generate different byte codes in different environments.

Let us change the llvm flag -mcpu=probe to -mcpu=v3 so the
generated code will be the same regardless of the compilation
environment.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200219004236.2291125-1-yhs@fb.com
2020-02-19 15:15:07 -08:00
Alexei Starovoitov
03aa39558e Merge branch 'bpf_read_branch_records'
Daniel Xu says:

====================
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse-grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.

Changes in v8:
- Use globals instead of perf buffer
- Call test_perf_branches__detach() before destroying skeleton
- Fix typo in docs

Changes in v7:
- Const-ify and static-ify local var
- Documentation formatting

Changes in v6:
- Move #ifdef a little to avoid unused variable warnings on !x86
- Test negative condition in selftest (-EINVAL on improperly configured
  perf event)
- Skip positive condition selftest on setups that don't support branch
  records

Changes in v5:
- Rename bpf_perf_prog_read_branches() -> bpf_read_branch_records()
- Rename BPF_F_GET_BR_SIZE -> BPF_F_GET_BRANCH_RECORDS_SIZE
- Squash tools/ bpf.h sync into selftest commit

Changes in v4:
- Add BPF_F_GET_BR_SIZE flag
- Return -ENOENT on unsupported architectures
- Only accept initialized memory in helper
- Check buffer size is multiple of sizeof(struct perf_branch_entry)
- Use bpf skeleton in selftest
- Add commit messages
- Spelling and formatting

Changes in v3:
- Document filling unused buffer with zero
- Formatting fixes
- Rebase

Changes in v2:
- Change to a bpf helper instead of context access
- Avoid mentioning Intel specific things
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-02-19 15:01:12 -08:00
Daniel Xu
67306f84ca selftests/bpf: Add bpf_read_branch_records() selftest
Add a selftest to test:

* default bpf_read_branch_records() behavior
* BPF_F_GET_BRANCH_RECORDS_SIZE flag behavior
* error path on non branch record perf events
* using helper to write to stack
* using helper to write to global

On host with hardware counter support:

    # ./test_progs -t perf_branches
    #27/1 perf_branches_hw:OK
    #27/2 perf_branches_no_hw:OK
    #27 perf_branches:OK
    Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED

On host without hardware counter support (VM):

    # ./test_progs -t perf_branches
    #27/1 perf_branches_hw:OK
    #27/2 perf_branches_no_hw:OK
    #27 perf_branches:OK
    Summary: 1/2 PASSED, 1 SKIPPED, 0 FAILED

Also sync tools/include/uapi/linux/bpf.h.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-3-dxu@dxuuu.xyz
2020-02-19 15:01:07 -08:00
Daniel Xu
fff7b64355 bpf: Add bpf_read_branch_records() helper
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse-grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.
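
A minimal perf_event program using the helper might look like this (the
entry count is an arbitrary choice that fits within the BPF stack):

  #include <linux/bpf.h>
  #include <linux/bpf_perf_event.h>
  #include <linux/perf_event.h>
  #include <bpf/bpf_helpers.h>

  SEC("perf_event")
  int read_branches(struct bpf_perf_event_data *ctx)
  {
      /* The helper requires initialized memory, sized as a multiple
       * of sizeof(struct perf_branch_entry). */
      struct perf_branch_entry entries[16] = {};
      long written;

      written = bpf_read_branch_records(ctx, entries,
                                        sizeof(entries), 0);
      if (written < 0)
          return 0; /* e.g. no branch records, or unsupported arch */
      /* entries[0].from / entries[0].to describe a recorded branch */
      return 0;
  }

  char _license[] SEC("license") = "GPL";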

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz
2020-02-19 14:37:36 -08:00
Daniel Borkmann
2f14b2d9dd Merge branch 'bpf-skmsg-simplify-restore'
Jakub Sitnicki says:

====================
This series has been split out from "Extend SOCKMAP to store listening
sockets" [0]. I think it stands on its own, and makes the latter series
smaller, which will make the review easier, hopefully.

The essence is that we don't need to do a complicated dance in
sk_psock_restore_proto, if we agree that the contract with tcp_update_ulp
is to restore callbacks even when the socket doesn't use ULP. This is what
tcp_update_ulp currently does, and we just make use of it.

Series is accompanied by a test for a particularly tricky case of restoring
callbacks when we have both sockmap and tls callbacks configured in
sk->sk_prot.

[0] https://lore.kernel.org/bpf/20200127131057.150941-1-jakub@cloudflare.com/
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-02-19 16:54:24 +01:00
Jakub Sitnicki
d1ba1204f2 selftests/bpf: Test unhashing kTLS socket after removing from map
When a TCP socket gets inserted into a sockmap, its sk_prot callbacks get
replaced with tcp_bpf callbacks built from regular tcp callbacks. If TLS
gets enabled on the same socket, sk_prot callbacks get replaced once again,
this time with kTLS callbacks built from tcp_bpf callbacks.

Now, we allow removing a socket from a sockmap that has kTLS enabled. After
removal, the socket remains with kTLS configured. This is where things get
tricky.

Since the socket has a set of sk_prot callbacks that are a mix of kTLS and
tcp_bpf callbacks, we need to restore just the tcp_bpf callbacks to the
original ones. At the moment, it comes down to the unhash operation.

We had a regression recently because tcp_bpf callbacks were not cleared in
this particular scenario of removing a kTLS socket from a sockmap. It got
fixed in commit 4da6a196f9 ("bpf: Sockmap/tls, during free we may call
tcp_bpf_unhash() in loop").

Add a test that triggers the regression so that we don't reintroduce it in
the future.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-4-jakub@cloudflare.com
2020-02-19 16:54:05 +01:00
Jakub Sitnicki
a178b45858 bpf, sk_msg: Don't clear saved sock proto on restore
There is no need to clear psock->sk_proto when restoring socket protocol
callbacks in sk->sk_prot. The psock is about to get detached from the sock
and eventually destroyed. At worst we will restore the protocol callbacks
and the write callback twice.

This makes reasoning about psock state easier. Once psock is initialized,
we can count on psock->sk_proto always being set.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-3-jakub@cloudflare.com
2020-02-19 16:54:05 +01:00
Jakub Sitnicki
a4393861a3 bpf, sk_msg: Let ULP restore sk_proto and write_space callback
We don't need a fallback for when the socket is not using ULP.
tcp_update_ulp handles this case exactly the same as we do in
sk_psock_restore_proto. Get rid of the duplicated code.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200217121530.754315-2-jakub@cloudflare.com
2020-02-19 16:54:05 +01:00
Song Liu
b80b033bed bpf: Allow bpf_perf_event_read_value in all BPF programs
bpf_perf_event_read_value() is NMI safe. Enable it for all BPF programs.
This can be used in fentry/fexit to profile BPF program and individual
kernel function with hardware counters.
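
A sketch of such a profiling program (the attach target and map sizing
are assumptions; the map slots must be populated with perf event fds
from user space):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
      __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
      __uint(max_entries, 64); /* >= number of possible CPUs */
      __uint(key_size, sizeof(__u32));
      __uint(value_size, sizeof(__u32));
  } counters SEC(".maps");

  SEC("fentry/tcp_v4_rcv") /* hypothetical target function */
  int read_counter(void *ctx)
  {
      struct bpf_perf_event_value v = {};

      if (!bpf_perf_event_read_value(&counters, BPF_F_CURRENT_CPU,
                                     &v, sizeof(v)))
          bpf_printk("counter %llu enabled %llu",
                     v.counter, v.enabled);
      return 0;
  }

  char _license[] SEC("license") = "GPL";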

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200214234146.2910011-1-songliubraving@fb.com
2020-02-18 16:08:27 +01:00
YueHaibing
b182a66792 net: ena: remove set but not used variable 'hash_key'
drivers/net/ethernet/amazon/ena/ena_com.c: In function ena_com_hash_key_allocate:
drivers/net/ethernet/amazon/ena/ena_com.c:1070:50:
 warning: variable hash_key set but not used [-Wunused-but-set-variable]

commit 6a4f7dc82d ("net: ena: rss: do not allocate key when not supported")
introduced this variable but never used it, so remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 22:32:50 -08:00
Gustavo A. R. Silva
2b73812483 net: netlink: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
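
A self-contained user-space illustration of the pattern (the kernel
additionally provides struct_size() for the allocation arithmetic):

  #include <stdio.h>
  #include <stdlib.h>

  struct foo {
          int stuff;
          int array[]; /* flexible array member, C99 */
  };

  int main(void)
  {
          size_t n = 4;
          /* sizeof(struct foo) excludes the flexible array, so the
           * allocation adds space for the elements explicitly. */
          struct foo *f = malloc(sizeof(*f) + n * sizeof(f->array[0]));

          if (!f)
                  return 1;
          f->array[n - 1] = 42;
          printf("sizeof(struct foo) = %zu\n", sizeof(struct foo));
          free(f);
          return 0;
  }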

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:06 -08:00
Gustavo A. R. Silva
fbfc8502af net: switchdev: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:06 -08:00
Gustavo A. R. Silva
45a4296b6e bpf, sockmap: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:05 -08:00
Gustavo A. R. Silva
9814428a44 NFC: digital: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:05 -08:00
Gustavo A. R. Silva
dc3cc347d2 net: usb: cdc-phonet: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:05 -08:00
Russell King
725d23b59c net: phy: allow bcm84881 to be a module
Now that the phylib module loading issue has been resolved, we can
allow this PHY driver to be built as a module.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 15:08:55 -08:00
David S. Miller
4c08222170 Merge branch 'net-smc-next'
Ursula Braun says:

====================
net/smc: patches 2020-02-17

here are patches for SMC, making termination handling more robust.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:25 -08:00
Ursula Braun
5613f20c93 net/smc: reduce port_event scheduling
IB event handlers schedule the port event worker for further
processing of port state changes. This patch reduces the number of
schedules to avoid duplicate processing of the same port change.

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
5f78fe968d net/smc: simplify normal link termination
smc_lgr_terminate() and smc_lgr_terminate_sched() both result in soft
link termination; smc_lgr_terminate_sched() schedules a worker for
this task. Reduce complexity by always using the termination worker
and getting rid of smc_lgr_terminate() completely.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
ba95206042 net/smc: remove unused parameter of smc_lgr_terminate()
The soft parameter of smc_lgr_terminate() is unused and obsolete.
Remove it.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
3739707c45 net/smc: do not delete lgr from list twice
When two callers call smc_lgr_terminate() at the same time for the same
lgr, one takes the lgr_lock, deletes the lgr from the list, and releases
the lock. Then the second caller gets the lock and tries to delete it
again.
In smc_lgr_terminate(), add a check whether the link group lgr has already
been deleted from the link group list, and prevent trying to delete it a
second time.
Also add a check whether the lgr is marked as freeing, which means that a
termination is already pending.
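
A sketch of the guard (simplified; the real function does more):

  spin_lock_bh(&smc_lgr_list.lock);
  if (list_empty(&lgr->list) || lgr->freeing) {
      /* already deleted, or a termination is pending */
      spin_unlock_bh(&smc_lgr_list.lock);
      return;
  }
  list_del_init(&lgr->list); /* a second caller now sees an empty node */
  spin_unlock_bh(&smc_lgr_list.lock);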

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
354ea2baa3 net/smc: use termination worker under send_lock
smc_tx_rdma_write() is called under the send_lock and should not call
smc_lgr_terminate() directly. Call smc_lgr_terminate_sched() instead
which schedules a worker.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
55dd575817 net/smc: improve smc_lgr_cleanup()
smc_lgr_cleanup() is called during termination processing, there is no
need to send a DELETE_LINK at that time. A DELETE_LINK should have been
sent before the termination is initiated, if needed.
And remove the extra call to wake_up(&lnk->wr_reg_wait) because
smc_llc_link_inactive() already calls the related helper function
smc_wr_wakeup_reg_wait().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
David S. Miller
790a9a7cce Merge branch 'mlxsw-Reduce-dependency-between-bridge-and-router-code'
Ido Schimmel says:

====================
mlxsw: Reduce dependency between bridge and router code

This patch set reduces the dependency between the bridge and the router
code in preparation for RTNL removal from the route insertion path in
mlxsw.

The motivation and solution are explained in detail in patch #3. The
main idea is that we need to stop special-casing the VXLAN devices with
regards to the reference counting of the FIDs. Otherwise, we can bump
into the situation described in patch #3, where the routing code calls
into the bridge code which calls back into the routing code. After
adding a mutex to protect router data structures to remove RTNL
dependency, this can result in an AA deadlock.

Patches #1 and #2 are preparations. They convert the FIDs to use
'refcount_t' for reference counting in order to catch over/under flows
and add extack to the bridge creation function.
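
As a sketch of the first preparation (field placement is illustrative),
refcount_t saturates and warns on over/underflow instead of silently
wrapping like a plain counter:

  #include <linux/refcount.h>

  struct mlxsw_sp_fid_sketch {
      refcount_t ref_count;
  };

  static void fid_get(struct mlxsw_sp_fid_sketch *fid)
  {
      refcount_inc(&fid->ref_count); /* WARNs rather than wrapping */
  }

  static bool fid_put(struct mlxsw_sp_fid_sketch *fid)
  {
      /* true when the last reference drops */
      return refcount_dec_and_test(&fid->ref_count);
  }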

Patches #3-#5 reduce the dependency between the bridge and the router
code. First, by having the VXLAN device take a reference on the FID in
patch #3 and then by removing unnecessary code following the change in
patch #3.

Patches #6-#10 adjust existing selftests and add new ones to exercise
the new code paths.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:54 -08:00
Ido Schimmel
495c3da648 selftests: mlxsw: vxlan: Add test for error path
Test that when two VXLAN tunnels with conflicting configurations (i.e.,
different TTL) are enslaved to the same VLAN-aware bridge, then the
enslavement of a port to the bridge is denied.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel
58ba0238e9 selftests: mlxsw: vxlan: Adjust test to recent changes
After recent changes, the VXLAN tunnel will be offloaded regardless of
whether any local ports are members of the FID. Adjust the test to make
sure the tunnel is offloaded in this case.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel
6c4e61ff5f selftests: mlxsw: extack: Test creation of multiple VLAN-aware bridges
The driver supports a single VLAN-aware bridge. Test that the
enslavement of a port to the second VLAN-aware bridge fails with an
extack.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel
bdc58bea0d selftests: mlxsw: extack: Test bridge creation with VXLAN
Test that creation of a bridge (both VLAN-aware and VLAN-unaware) fails
with an extack when a VXLAN device with an unsupported configuration is
already enslaved to it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel
745a7ea72d selftests: mlxsw: Remove deprecated test
The addition of a VLAN on a bridge slave prompts the driver to have the
local port in question join the FID corresponding to this VLAN.

Before recent changes, the operation of joining the FID would also mean
that the driver would enable VXLAN tunneling if a VXLAN device was also
member in the VLAN. In case the configuration of the VXLAN tunnel was
not supported, an extack error would be returned.

Since the operation of joining the FID no longer means that VXLAN
tunneling is potentially enabled, the test is no longer relevant. Remove
it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00
Ido Schimmel
da1f9f8cb7 mlxsw: spectrum: Reduce dependency between bridge and router code
Commit f40be47a3e ("mlxsw: spectrum_router: Do not force specific
configuration order") added a call from the routing code to the bridge
code in order to handle the case where VNI should be set on a FID
following the joining of the router port to the FID.

This is no longer required, as previous patches made VXLAN devices
explicitly take a reference on the FID and set VNI on it.

Therefore, remove the unnecessary call and simply have the RIF take a
reference on the FID without checking if VNI should also be set on it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:42:53 -08:00