linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-17 12:48:25 +07:00

Author	SHA1	Message	Date
Alexei Starovoitov	0d7f68270b	Merge branch 'bpf_skb_ecn_set_ce' Lawrence Brakmo says: ==================== Host Bandwidth Manager is a framework for limiting the bandwidth used by v2 cgroups. It consists of 1 BPF helper, a sample BPF program to limit egress bandwdith as well as a sample user program and script to simplify HBM testing. The sample HBM BPF program is not meant to be production quality, it is provided as proof of concept. A lot more information, including sample runs in some cases, are provided in the commit messages of the individual patches. A future patch will add support for reducing TCP's cwnd (we are evaluating alternatives). Another patch will add support for fair queueing's Earliest Departure Time. Until then, HBM is better suited for flows supporitng ECN. In addition, A BPF program to limit ingress bandwidth will be provided in an upcomming patchset. Changes from v1 to v2: * bpf_tcp_enter_cwr can only be called from a cgroup skb egress BPF program (otherwise load or attach will fail) where we already hold the sk lock. Also only applies for ESTABLISHED state. * bpf_skb_ecn_set_ce uses INET_ECN_set_ce() * bpf_tcp_check_probe_timer now uses tcp_reset_xmit_timer. Can only be used by egress cgroup skb programs. * removed load_cg_skb user program. * nrm bpf egress program checks packet header in skb to determine ECN value. Now also works for ECN enabled UDP packets. Using ECN_ defines instead of integers. * NRM script test program now uses bpftool instead of load_cg_skb Changes from v2 to v3: * Changed name from NRM (Network Resource Manager) to HBM (Host Bandwdith Manager) * The bpf helper to set ECN ce now checks that the header is writeable * Removed helper bpf functions that modified TCP state due to a concern about whether the socket is locked by the current thread. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-03-02 10:48:28 -08:00
brakmo	4ffd44cfd1	bpf: HBM test script Script for testing HBM (Host Bandwidth Manager) framework. It creates a cgroup to use for testing and load a BPF program to limit egress bandwidht. It then uses iperf3 or netperf to create loads. The output is the goodput in Mbps (unless -D is used). It can work on a single host using loopback or among two hosts (with netperf). When using loopback, it is recommended to also introduce a delay of at least 1ms (-d=1), otherwise the assigned bandwidth is likely to be underutilized. USAGE: $name [out] [-b=<prog>\|--bpf=<prog>] [-c=<cc>\|--cc=<cc>] [-D] [-d=<delay>\|--delay=<delay>] [--debug] [-E] [-f=<#flows>\|--flows=<#flows>] [-h] [-i=<id>\|--id=<id >] [-l] [-N] [-p=<port>\|--port=<port>] [-P] [-q=<qdisc>] [-R] [-s=<server>\|--server=<server] [--stats] [-t=<time>\|--time=<time>] [-w] [cubic\|dctcp] Where: out Egress (default egress) -b or --bpf BPF program filename to load and attach. Default is nrm_out_kern.o for egress, -c or -cc TCP congestion control (cubic or dctcp) -d or --delay Add a delay in ms using netem -D In addition to the goodput in Mbps, it also outputs other detailed information. This information is test dependent (i.e. iperf3 or netperf). --debug Print BPF trace buffer -E Enable ECN (not required for dctcp) -f or --flows Number of concurrent flows (default=1) -i or --id cgroup id (an integer, default is 1) -l Do not limit flows using loopback -N Use netperf instead of iperf3 -h Help -p or --port iperf3 port (default is 5201) -P Use an iperf3 instance for each flow -q Use the specified qdisc. -r or --rate Rate in Mbps (default 1s 1Gbps) -R Use TCP_RR for netperf. 1st flow has req size of 10KB, rest of 1MB. Reply in all cases is 1 byte. More detailed output for each flow can be found in the files netperf.<cg>.<flow>, where <cg> is the cgroup id as specified with the -i flag, and <flow> is the flow id starting at 1 and increasing by 1 for flow (as specified by -f). -s or --server hostname of netperf server. Used to create netperf test traffic between to hosts (default is within host) netserver must be running on the host. --stats Get HBM stats (marked, dropped, etc.) -t or --time duration of iperf3 in seconds (default=5) -w Work conserving flag. cgroup can increase its bandwidth beyond the rate limit specified while there is available bandwidth. Current implementation assumes there is only one NIC (eth0), but can be extended to support multiple NICs. This is just a proof of concept. cubic or dctcp specify TCP CC to use Examples: ./do_hbm_test.sh -l -d=1 -D --stats Runs a 5 second test, using a single iperf3 flow and with the default rate limit of 1Gbps and a delay of 1ms (using netem) using the default TCP congestion control on the loopback device (hence we use "-l" to enforce bandwidth limit on loopback device). Since no direction is specified, it defaults to egress. Since no TCP CC algorithm is specified it uses the system default (Cubic for this test). With no -D flag, only the value of the AGGREGATE OUTPUT would show. id refers to the cgroup id and is useful when running multi cgroup tests (supported by a future patch). This patchset does not support calling TCP's congesion window reduction, even when packets are dropped by the BPF program, resulting in a large number of packets dropped. It is recommended that the current HBM implemenation only be used with ECN enabled flows. A future patch will add support for reducing TCP's cwnd and will increase the performance of non-ECN enabled flows. Output: Details for HBM in cgroup 1 id:1 rate_mbps:493 duration:4.8 secs packets:11355 bytes_MB:590 pkts_dropped:4497 bytes_dropped_MB:292 pkts_marked_percent: 39.60 bytes_marked_percent: 49.49 pkts_dropped_percent: 39.60 bytes_dropped_percent: 49.49 PING AVG DELAY:2.075 AGGREGATE_GOODPUT:505 ./do_nrm_test.sh -l -d=1 -D --stats dctcp Same as above but using dctcp. Note that fewer bytes are dropped (0.01% vs. 49%). Output: Details for HBM in cgroup 1 id:1 rate_mbps:945 duration:4.9 secs packets:16859 bytes_MB:578 pkts_dropped:1 bytes_dropped_MB:0 pkts_marked_percent: 28.74 bytes_marked_percent: 45.15 pkts_dropped_percent: 0.01 bytes_dropped_percent: 0.01 PING AVG DELAY:2.083 AGGREGATE_GOODPUT:965 ./do_nrm_test.sh -d=1 -D --stats As first example, but without limiting loopback device (i.e. no "-l" flag). Since there is no bandwidth limiting, no details for HBM are printed out. Output: Details for HBM in cgroup 1 PING AVG DELAY:2.019 AGGREGATE_GOODPUT:42655 ./do_hbm.sh -l -d=1 -D --stats -f=2 Uses iper3 and does 2 flows ./do_hbm.sh -l -d=1 -D --stats -f=4 -P Uses iperf3 and does 4 flows, each flow as a separate process. ./do_hbm.sh -l -d=1 -D --stats -f=4 -N Uses netperf, 4 flows ./do_hbm.sh -f=1 -r=2000 -t=5 -N -D --stats dctcp -s=<server-name> Uses netperf between two hosts. The remote host name is specified with -s= and you need to start the program netserver manually on the remote host. It will use 1 flow, a rate limit of 2Gbps and dctcp. ./do_hbm.sh -f=1 -r=2000 -t=5 -N -D --stats -w dctcp \ -s=<server-name> As previous, but allows use of extra bandwidth. For this test the rate is 8Gbps vs. 1Gbps of the previous test. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-03-02 10:48:27 -08:00
brakmo	a1270fe95b	bpf: User program for testing HBM The program nrm creates a cgroup and attaches a BPF program to the cgroup for testing HBM (Host Bandwidth Manager) for egress traffic. One still needs to create network traffic. This can be done through netesto, netperf or iperf3. A follow-up patch contains a script to create traffic. USAGE: hbm [-d] [-l] [-n <id>] [-r <rate>] [-s] [-t <secs>] [-w] [-h] [prog] Where: -d Print BPF trace debug buffer -l Also limit flows doing loopback -n <#> To create cgroup "/hbm#" and attach prog. Default is /nrm1 This is convenient when testing HBM in more than 1 cgroup -r <rate> Rate limit in Mbps -s Get HBM stats (marked, dropped, etc.) -t <time> Exit after specified seconds (deault is 0) -w Work conserving flag. cgroup can increase its bandwidth beyond the rate limit specified while there is available bandwidth. Current implementation assumes there is only NIC (eth0), but can be extended to support multiple NICs. Currrently only supported for egress. Note, this is just a proof of concept. -h Print this info prog BPF program file name. Name defaults to hbm_out_kern.o More information about HBM can be found in the paper "BPF Host Resource Management" presented at the 2018 Linux Plumbers Conference, Networking Track (http://vger.kernel.org/lpc_net2018_talks/LPC%20BPF%20Network%20Resource%20Paper.pdf) Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-03-02 10:48:27 -08:00
brakmo	187d0738ff	bpf: Sample HBM BPF program to limit egress bw A cgroup skb BPF program to limit cgroup output bandwidth. It uses a modified virtual token bucket queue to limit average egress bandwidth. The implementation uses credits instead of tokens. Negative credits imply that queueing would have happened (this is a virtual queue, so no queueing is done by it. However, queueing may occur at the actual qdisc (which is not used for rate limiting). This implementation uses 3 thresholds, one to start marking packets and the other two to drop packets: CREDIT - <--------------------------\|------------------------> + \| \| \| 0 \| Large pkt \| \| drop thresh \| Small pkt drop Mark threshold thresh The effect of marking depends on the type of packet: a) If the packet is ECN enabled, then the packet is ECN ce marked. The current mark threshold is tuned for DCTCP. c) Else, it is dropped if it is a large packet. If the credit is below the drop threshold, the packet is dropped. Note that dropping a packet through the BPF program does not trigger CWR (Congestion Window Reduction) in TCP packets. A future patch will add support for triggering CWR. This BPF program actually uses 2 drop thresholds, one threshold for larger packets (>= 120 bytes) and another for smaller packets. This protects smaller packets such as SYNs, ACKs, etc. The default bandwidth limit is set at 1Gbps but this can be changed by a user program through a shared BPF map. In addition, by default this BPF program does not limit connections using loopback. This behavior can be overwritten by the user program. There is also an option to calculate some statistics, such as percent of packets marked or dropped, which the user program can access. A latter patch provides such a program (hbm.c) Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-03-02 10:48:27 -08:00
brakmo	5cce85c640	bpf: sync bpf.h to tools and update bpf_helpers.h This patch syncs the uapi bpf.h to tools/ and also updates bpf_herlpers.h in tools/ Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-03-02 10:48:27 -08:00
brakmo	f7c917ba11	bpf: add bpf helper bpf_skb_ecn_set_ce This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can be attached to the ingress and egress path. The helper is needed because his type of bpf_prog cannot modify the skb directly. This helper is used to set the ECN field of ECN capable IP packets to ce (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be used by a bpf_prog to manage egress or ingress network bandwdith limit per cgroupv2 by inducing an ECN response in the TCP sender. This works best when using DCTCP. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-03-02 10:48:27 -08:00
Yonghong Song	b74e21ab7d	samples/bpf: silence compiler warning for xdpsock_user.c Compiling xdpsock_user.c with 4.8.5, I hit the following compilation warning: HOSTCC samples/bpf/xdpsock_user.o /data/users/yhs/work/net-next/samples/bpf/xdpsock_user.c: In function ‘main’: /data/users/yhs/work/net-next/samples/bpf/xdpsock_user.c:449:6: warning: ‘idx_cq’ may be used unini tialized in this function [-Wmaybe-uninitialized] u32 idx_cq, idx_fq; ^ /data/users/yhs/work/net-next/samples/bpf/xdpsock_user.c:606:7: warning: ‘idx_rx’ may be used unini tialized in this function [-Wmaybe-uninitialized] u32 idx_rx, idx_tx = 0; ^ /data/users/yhs/work/net-next/samples/bpf/xdpsock_user.c:506:6: warning: ‘idx_rx’ may be used unini tialized in this function [-Wmaybe-uninitialized] u32 idx_rx, idx_fq = 0; As an example, the code pattern looks like: u32 idx_cq; ... ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq); if (ret) { ... } ... idx_fq ... The compiler warns since it does not know whether &idx_fq is assigned or not inside the library function xsk_ring_prod__reserve(). Let us assign an initial value 0 to such auto variables to silence compiler warning. Fixes: `248c7f9c0e` ("samples/bpf: convert xdpsock to use libbpf for AF_XDP access") Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-02 01:07:10 +01:00
Yonghong Song	a83de90658	selftests/bpf: set unlimited RLIMIT_MEMLOCK for test_sock_fields This is to avoid permission denied error. A lot of systems may have a much lower number, e.g., 64KB, for RLIMIT_MEMLOCK, which may not be sufficient for the test to run successfully. Fixes: `e0b27b3f97` ("bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock") Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-02 01:06:23 +01:00
Daniel Borkmann	4269f69bc9	Merge branch 'bpf-doc-improvements' Andrii Nakryiko says: ==================== A bunch of BPF-related docs typo, wording and formatting fixes. v1->v2: - split off non-documentation changes into separate patchset ==================== Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-02 00:40:05 +01:00
Andrii Nakryiko	46604676c8	docs/bpf: minor casing/punctuation fixes Fix few casing and punctuation glitches. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-02 00:40:04 +01:00
Andrii Nakryiko	9ab5305dbe	docs/btf: reflow text to fill up to 78 characters Reflow paragraphs to more fully and evenly fill 78 character lines. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-02 00:40:04 +01:00
Andrii Nakryiko	5efc529fb4	docs/btf: fix typos, improve wording Fix various typos, some of the formatting and wording for Documentation/btf.rst. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-02 00:40:04 +01:00
Eric Dumazet	4b9113045b	bpf: fix u64_stats_init() usage in bpf_prog_alloc() We need to iterate through all possible cpus. Fixes: `492ecee892` ("bpf: enable program stats") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-02 00:31:36 +01:00
Daniel Borkmann	3860d38f28	Merge branch 'bpf-dedup-fixes' Andrii Nakryiko says: ==================== This patchset fixes a bug in btf_dedup() algorithm, which under specific hash collision causes infinite loop. It also exposes ability to tune BTF deduplication table size, with double purpose of allowing applications to adjust size according to the size of BTF data, as well as allowing a simple way to force hash collisions by setting table size to 1. - Patch #1 fixes bug in btf_dedup testing code that's checking strings - Patch #2 fixes pointer arg formatting in btf.h - Patch #3 adds option to specify custom dedup table size - Patch #4 fixes aforementioned bug in btf_dedup - Patch #5 adds test that validates the fix v1->v2: - remove "Fixes" from formatting change patch - extract roundup_pow2_max func for dedup table size - btf_equal_struct -> btf_shallow_equal_struct - explain in comment why we can't rely on just btf_dedup_is_equiv ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 01:31:49 +01:00
Andrii Nakryiko	7c7a4890c8	selftests/bpf: add btf_dedup test of FWD/STRUCT resolution This patch adds a btf_dedup test exercising logic of STRUCT<->FWD resolution and validating that STRUCT is not resolved to a FWD. It also forces hash collisions, forcing both FWD and STRUCT to be candidates for each other. Previously this condition caused infinite loop due to FWD pointing to STRUCT and STRUCT pointing to its FWD. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 01:31:48 +01:00
Andrii Nakryiko	91097fbee4	btf: fix bug with resolving STRUCT/UNION into corresponding FWD When checking available canonical candidates for struct/union algorithm utilizes btf_dedup_is_equiv to determine if candidate is suitable. This check is not enough when candidate is corresponding FWD for that struct/union, because according to equivalence logic they are equivalent. When it so happens that FWD and STRUCT/UNION end in hashing to the same bucket, it's possible to create remapping loop from FWD to STRUCT and STRUCT to same FWD, which will cause btf_dedup() to loop forever. This patch fixes the issue by additionally checking that type and canonical candidate are strictly equal (utilizing btf_equal_struct). Fixes: `d5caef5b56` ("btf: add BTF types deduplication algorithm") Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 01:31:48 +01:00
Andrii Nakryiko	51edf5f6e0	btf: allow to customize dedup hash table size Default size of dedup table (16k) is good enough for most binaries, even typical vmlinux images. But there are cases of binaries with huge amount of BTF types (e.g., allyesconfig variants of kernel), which benefit from having bigger dedup table size to lower amount of unnecessary hash collisions. Tools like pahole, thus, can tune this parameter to reach optimal performance. This change also serves double purpose of allowing tests to force hash collisions to test some corner cases, used in follow up patch. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 01:31:47 +01:00
Andrii Nakryiko	1baabdc108	libbpf: fix formatting for btf_ext__get_raw_data Fix invalid formatting of pointer arg. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 01:31:47 +01:00
Andrii Nakryiko	8054d51f76	selftests/bpf: fix btf_dedup testing code btf_dedup testing code doesn't account for length of struct btf_header when calculating the start of a string section. This patch fixes this problem. Fixes: `49b57e0d01` ("tools/bpf: remove btf__get_strings() superseded by raw data API") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 01:31:47 +01:00
Dan Carpenter	3d8669e637	tools/libbpf: signedness bug in btf_dedup_ref_type() The "ref_type_id" variable needs to be signed for the error handling to work. Fixes: `d5caef5b56` ("btf: add BTF types deduplication algorithm") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:56:06 +01:00
Daniel Borkmann	74b3881908	Merge branch 'bpf-samples-improvements' Jakub Kicinski says: ==================== This set is next part of a quest to get rid of the bpf_load ELF loader. It fixes some minor issues with the samples and starts the conversion. First patch fixes ping invocations, ping localhost defaults to IPv6 on modern setups. Next load_sock_ops sample is removed and users are directed towards using bpftool directly. Patch 4 removes the use of bpf_load from samples which don't need the auto-attachment functionality at all. Patch 5 improves symbol counting in libbpf, it's not currently an issue but it will be when anyone adds a symbol with a long name. Let's make sure that person doesn't have to spend time scratching their head and wondering why .a and .so symbol counts don't match. v2: - specify prog_type where possible (Andrii). ==================== Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:53:47 +01:00
Jakub Kicinski	771744f9dc	tools: libbpf: make sure readelf shows full names in build checks readelf truncates its output by default to attempt to make it more readable. This can lead to function names getting aliased if they differ late in the string. Use --wide parameter to avoid truncation. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:53:46 +01:00
Jakub Kicinski	1a9b268c90	samples: bpf: use libbpf where easy Some samples don't really need the magic of bpf_load, switch them to libbpf. v2: - specify program types. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:53:45 +01:00
Jakub Kicinski	f74a53d9a5	tools: libbpf: add a correctly named define for map iteration For historical reasons the helper to loop over maps in an object is called bpf_map__for_each while it really should be called bpf_object__for_each_map. Rename and add a correctly named define for backward compatibility. Switch all in-tree users to the correct name (Quentin). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:53:45 +01:00
Jakub Kicinski	ea9b636201	samples: bpf: remove load_sock_ops in favour of bpftool bpftool can do all the things load_sock_ops used to do, and more. Point users to bpftool instead of maintaining this sample utility. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:53:45 +01:00
Jakub Kicinski	5c3cf87d47	samples: bpf: force IPv4 in ping ping localhost may default of IPv6 on modern systems, but samples are trying to only parse IPv4. Force IPv4. samples/bpf/tracex1_user.c doesn't interpret the packet so we don't care which IP version will be used there. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:53:45 +01:00
Stanislav Fomichev	ebace0e981	selftests/bpf: use __bpf_constant_htons in test_prog.c for flow dissector Older GCC (<4.8) isn't smart enough to optimize !__builtin_constant_p() branch in bpf_htons. I recently fixed it for pkt_v4 and pkt_v6 in commit `a0517a0f7e` ("selftests/bpf: use __bpf_constant_htons in test_prog.c"), but later added another bunch of bpf_htons in commit `bf0f0fd939` ("selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow dissector"). Fixes: `bf0f0fd939` ("selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow dissector") Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:48:51 +01:00
Willem de Bruijn	f2bb53887e	bpf: add missing entries to bpf_helpers.h This header defines the BPF functions enumerated in uapi/linux.bpf.h in a callable format. Expand to include all registered functions. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:46:55 +01:00
Alexei Starovoitov	3fcc5530bc	bpf: fix build without bpf_syscall wrap bpf_stats_enabled sysctl with #ifdef Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: `492ecee892` ("bpf: enable program stats") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-03-01 00:44:58 +01:00
Alexei Starovoitov	3bcd604445	Merge branch 'inner_map_spin_lock-fix' Yonghong Song says: ==================== The inner_map_meta->spin_lock_off is not set correctly during map creation for BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS. This may lead verifier error due to misinformation. This patch set fixed the issue with Patch #1 for the kernel change and Patch #2 for enhanced selftest test_maps. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-02-27 17:03:14 -08:00
Yonghong Song	9eca508375	tools/bpf: selftests: add map lookup to test_map_in_map bpf prog The bpf_map_lookup_elem is added in the bpf program. Without previous patch, the test change will trigger the following error: $ ./test_maps ... ; value_p = bpf_map_lookup_elem(map, &key); 20: (bf) r1 = r7 21: (bf) r2 = r8 22: (85) call bpf_map_lookup_elem#1 ; if (!value_p \|\| value_p != 123) 23: (15) if r0 == 0x0 goto pc+16 R0=map_value(id=2,off=0,ks=4,vs=4,imm=0) R6=inv1 R7=map_ptr(id=0,off=0,ks=4,vs=4,imm=0) R8=fp-8,call_-1 R10=fp0,call_-1 fp-8=mmmmmmmm ; if (!value_p \|\| value_p != 123) 24: (61) r1 = (u32 )(r0 +0) R0=map_value(id=2,off=0,ks=4,vs=4,imm=0) R6=inv1 R7=map_ptr(id=0,off=0,ks=4,vs=4,imm=0) R8=fp-8,call_-1 R10=fp0,call_-1 fp-8=mmmmmmmm bpf_spin_lock cannot be accessed directly by load/store With the kernel fix in the previous commit, the error goes away. Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-02-27 17:03:13 -08:00
Yonghong Song	a115d0ed72	bpf: set inner_map_meta->spin_lock_off correctly Commit `d83525ca62` ("bpf: introduce bpf_spin_lock") introduced bpf_spin_lock and the field spin_lock_off in kernel internal structure bpf_map has the following meaning: >=0 valid offset, <0 error For every map created, the kernel will ensure spin_lock_off has correct value. Currently, bpf_map->spin_lock_off is not copied from the inner map to the map_in_map inner_map_meta during a map_in_map type map creation, so inner_map_meta->spin_lock_off = 0. This will give verifier wrong information that inner_map has bpf_spin_lock and the bpf_spin_lock is defined at offset 0. An access to offset 0 of a value pointer will trigger the following error: bpf_spin_lock cannot be accessed directly by load/store This patch fixed the issue by copy inner map's spin_lock_off value to inner_map_meta->spin_lock_off. Fixes: `d83525ca62` ("bpf: introduce bpf_spin_lock") Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2019-02-27 17:03:13 -08:00
Daniel T. Lee	d2e614cb07	samples: bpf: fix: broken sample regarding removed function Currently, running sample "task_fd_query" and "tracex3" occurs the following error. On kernel v5.0-rc* this sample will be unavailable due to the removal of function 'blk_start_request' at commit "a1ce35f". (function removed, as "Single Queue IO scheduler" no longer exists) $ sudo ./task_fd_query failed to create kprobe 'blk_start_request' error 'No such file or directory' This commit will change the function 'blk_start_request' to 'blk_mq_start_request' to fix the broken sample. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-27 17:27:22 +01:00
Daniel Borkmann	da4e023e45	Merge branch 'bpf-prog-stats' Alexei Starovoitov says: ==================== Introduce per program stats to monitor the usage BPF. v2->v3: - rename to run_time_ns/run_cnt everywhere v1->v2: - fixed u64 stats on 32-bit archs. Thanks Eric - use more verbose run_time_ns in json output as suggested by Andrii - refactored prog_alloc and clarified behavior of stats in subprogs ==================== Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-27 17:22:52 +01:00
Alexei Starovoitov	88ad472b8a	tools/bpftool: recognize bpf_prog_info run_time_ns and run_cnt $ bpftool p s 1: kprobe tag a56587d488d216c9 gpl run_time_ns 79786 run_cnt 8 loaded_at 2019-02-22T12:22:51-0800 uid 0 xlated 352B not jited memlock 4096B $ bpftool --json --pretty p s [{ "id": 1, "type": "kprobe", "tag": "a56587d488d216c9", "gpl_compatible": true, "run_time_ns": 79786, "run_cnt": 8, "loaded_at": 1550866971, "uid": 0, "bytes_xlated": 352, "jited": false, "bytes_memlock": 4096 } ] Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-27 17:22:51 +01:00
Alexei Starovoitov	b1eca86db6	tools/bpf: sync bpf.h into tools sync bpf.h into tools directory Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-27 17:22:51 +01:00
Alexei Starovoitov	5f8f8b93ae	bpf: expose program stats via bpf_prog_info Return bpf program run_time_ns and run_cnt via bpf_prog_info Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-27 17:22:50 +01:00
Alexei Starovoitov	492ecee892	bpf: enable program stats JITed BPF programs are indistinguishable from kernel functions, but unlike kernel code BPF code can be changed often. Typical approach of "perf record" + "perf report" profiling and tuning of kernel code works just as well for BPF programs, but kernel code doesn't need to be monitored whereas BPF programs do. Users load and run large amount of BPF programs. These BPF stats allow tools monitor the usage of BPF on the server. The monitoring tools will turn sysctl kernel.bpf_stats_enabled on and off for few seconds to sample average cost of the programs. Aggregated data over hours and days will provide an insight into cost of BPF and alarms can trigger in case given program suddenly gets more expensive. The cost of two sched_clock() per program invocation adds ~20 nsec. Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down from ~10 nsec to ~30 nsec. static_key minimizes the cost of the stats collection. There is no measurable difference before/after this patch with kernel.bpf_stats_enabled=0 Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-27 17:22:50 +01:00
Daniel Borkmann	143bdc2e27	Merge branch 'bpf-libbpf-af-xdp' Magnus Karlsson says: ==================== This patch proposes to add AF_XDP support to libbpf. The main reason for this is to facilitate writing applications that use AF_XDP by offering higher-level APIs that hide many of the details of the AF_XDP uapi. This is in the same vein as libbpf facilitates XDP adoption by offering easy-to-use higher level interfaces of XDP functionality. Hopefully this will facilitate adoption of AF_XDP, make applications using it simpler and smaller, and finally also make it possible for applications to benefit from optimizations in the AF_XDP user space access code. Previously, people just copied and pasted the code from the sample application into their application, which is not desirable. The proposed interface is composed of two parts: * Low-level access interface to the four rings and the packet * High-level control plane interface for creating and setting up umems and AF_XDP sockets. This interface also loads a simple XDP program that routes all traffic on a queue up to the AF_XDP socket. The sample program has been updated to use this new interface and in that process it lost roughly 300 lines of code. I cannot detect any performance degradations due to the use of this library instead of the previous functions that were inlined in the sample application. But I did measure this on a slower machine and not the Broadwell that we normally use. The rings are now called xsk_ring and when a producer operates on it. It is xsk_ring_prod and for a consumer it is xsk_ring_cons. This way we can get some compile time error checking that the rings are used correctly. Comments and contenplations: * The current behaviour is that the library loads an XDP program (if requested to do so) but the clean up of this program is left to the application. It would be possible to implement this cleanup in the library, but it would require state to be kept on netdev level, which there is none at the moment, and the synchronization of this between processes. All this adding complexity. But when we get an XDP program per queue id, then it becomes trivial to also remove the XDP program when the application exits. This proposal from Jesper, Björn and others will also improve the performance of libbpf, since most of the XDP program code can be removed when that feature is supported. * In a future release, I am planning on adding a higher level data plane interface too. This will be based around recvmsg and sendmsg with the use of struct iovec for batching, without the user having to know anything about the underlying four rings of an AF_XDP socket. There will be one semantic difference though from the standard recvmsg and that is that the kernel will fill in the iovecs instead of the application. But the rest should be the same as the libc versions so that application writers feel at home. Patch 1: adds AF_XDP support in libbpf Patch 2: updates the xdpsock sample application to use the libbpf functions Patch 3: Documentation update to help first time users Changes v5 to v6: * Fixed prog_fd bug found by Xiaolong Ye. Thanks! Changes v4 to v5: * Added a FAQ to the documentation * Removed xsk_umem__get_data and renamed xsk_umem__get_dat_raw to xsk_umem__get_data * Replaced the netlink code with bpf_get_link_xdp_id() * Dynamic allocation of the map sizes. They are now sized after the max number of queueus on the netdev in question. Changes v3 to v4: * Dropped the pr_() patch in favor of Yonghong Song's patch set Addressed the review comments of Daniel Borkmann, mainly leaking of file descriptors at clean up and making the data plane APIs all static inline (with the exception of xsk_umem__get_data that uses an internal structure I do not want to expose). * Fixed the netlink callback as suggested by Maciej Fijalkowski. * Removed an unecessary include in the sample program as spotted by Ilia Fillipov. Changes v2 to v3: * Added automatic loading of a simple XDP program that routes all traffic on a queue up to the AF_XDP socket. This program loading can be disabled. * Updated function names to be consistent with the libbpf naming convention * Moved all code to xsk.[ch] * Removed all the XDP program loading code from the sample since this is now done by libbpf * The initialization functions now return a handle as suggested by Alexei * const statements added in the API where applicable. Changes v1 to v2: * Fixed cleanup of library state on error. * Moved API to initial version * Prefixed all public functions by xsk__ instead of xsk_ * Added comment about changed default ring sizes, batch size and umem size in the sample application commit message * The library now only creates an Rx or Tx ring if the respective parameter is != NULL ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-25 23:26:36 +01:00
Magnus Karlsson	0f4a9b7d4e	xsk: add FAQ to facilitate for first time users Added an FAQ section in Documentation/networking/af_xdp.rst to help first time users with common problems. As problems are getting identified, entries will be added to the FAQ. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-25 23:21:42 +01:00
Magnus Karlsson	248c7f9c0e	samples/bpf: convert xdpsock to use libbpf for AF_XDP access This commit converts the xdpsock sample application to use the AF_XDP functions present in libbpf. This cuts down the size of it by nearly 300 lines of code. The default ring sizes plus the batch size has been increased and the size of the umem area has decreased. This so that the sample application will provide higher throughput. Note also that the shared umem code has been removed from the sample as this is not supported by libbpf at this point in time. Tested-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-25 23:21:42 +01:00
Magnus Karlsson	1cad078842	libbpf: add support for using AF_XDP sockets This commit adds AF_XDP support to libbpf. The main reason for this is to facilitate writing applications that use AF_XDP by offering higher-level APIs that hide many of the details of the AF_XDP uapi. This is in the same vein as libbpf facilitates XDP adoption by offering easy-to-use higher level interfaces of XDP functionality. Hopefully this will facilitate adoption of AF_XDP, make applications using it simpler and smaller, and finally also make it possible for applications to benefit from optimizations in the AF_XDP user space access code. Previously, people just copied and pasted the code from the sample application into their application, which is not desirable. The interface is composed of two parts: * Low-level access interface to the four rings and the packet * High-level control plane interface for creating and setting up umems and af_xdp sockets as well as a simple XDP program. Tested-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-25 23:21:42 +01:00
Stanislav Fomichev	740f8a6572	selftests/bpf: make sure signal interrupts BPF_PROG_TEST_RUN Simple test that I used to reproduce the issue in the previous commit: Do BPF_PROG_TEST_RUN with max iterations, each program is 4096 simple move instructions. File alarm in 0.1 second and check that bpf_prog_test_run is interrupted (i.e. test doesn't hang). Note: reposting this for bpf-next to avoid linux-next conflict. In this version I test both BPF_PROG_TYPE_SOCKET_FILTER (which uses generic bpf_test_run implementation) and BPF_PROG_TYPE_FLOW_DISSECTOR (which has it own loop with preempt handling in bpf_prog_test_run_flow_dissector). Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-25 22:24:19 +01:00
Stanislav Fomichev	a439184d51	bpf/test_run: fix unkillable BPF_PROG_TEST_RUN for flow dissector Syzbot found out that running BPF_PROG_TEST_RUN with repeat=0xffffffff makes process unkillable. The problem is that when CONFIG_PREEMPT is enabled, we never see need_resched() return true. This is due to the fact that preempt_enable() (which we do in bpf_test_run_one on each iteration) now handles resched if it's needed. Let's disable preemption for the whole run, not per test. In this case we can properly see whether resched is needed. Let's also properly return -EINTR to the userspace in case of a signal interrupt. This is a follow up for a recently fixed issue in bpf_test_run, see commit `df1a2cb7c7` ("bpf/test_run: fix unkillable BPF_PROG_TEST_RUN"). Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-25 22:21:22 +01:00
Anders Roxell	fd92d6648f	bpf: test_bpf: turn off preemption in function __run_once When running BPF test suite the following splat occurs: [ 415.930950] test_bpf: #0 TAX jited:0 [ 415.931067] BUG: assuming atomic context at lib/test_bpf.c:6674 [ 415.946169] in_atomic(): 0, irqs_disabled(): 0, pid: 11556, name: modprobe [ 415.953176] INFO: lockdep is turned off. [ 415.957207] CPU: 1 PID: 11556 Comm: modprobe Tainted: G W 5.0.0-rc7-next-20190220 #1 [ 415.966328] Hardware name: HiKey Development Board (DT) [ 415.971592] Call trace: [ 415.974069] dump_backtrace+0x0/0x160 [ 415.977761] show_stack+0x24/0x30 [ 415.981104] dump_stack+0xc8/0x114 [ 415.984534] __cant_sleep+0xf0/0x108 [ 415.988145] test_bpf_init+0x5e0/0x1000 [test_bpf] [ 415.992971] do_one_initcall+0x90/0x428 [ 415.996837] do_init_module+0x60/0x1e4 [ 416.000614] load_module+0x1de0/0x1f50 [ 416.004391] __se_sys_finit_module+0xc8/0xe0 [ 416.008691] __arm64_sys_finit_module+0x24/0x30 [ 416.013255] el0_svc_common+0x78/0x130 [ 416.017031] el0_svc_handler+0x38/0x78 [ 416.020806] el0_svc+0x8/0xc Rework so that preemption is disabled when we loop over function 'BPF_PROG_RUN(...)'. Fixes: `568f196756` ("bpf: check that BPF programs run with preemption disabled") Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-25 22:18:07 +01:00
Toke Høiland-Jørgensen	915654fd71	samples/bpf: Fix dummy program unloading for xdp_redirect samples The xdp_redirect and xdp_redirect_map sample programs both load a dummy program onto the egress interfaces. However, the unload code checks these programs against the wrong id number, and thus refuses to unload them. Fix the comparison to avoid this. Fixes: `3b7a8ec2de` ("samples/bpf: Check the prog id before exiting") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-22 16:21:59 +01:00
Alexei Starovoitov	e80d02dd76	seccomp, bpf: disable preemption before calling into bpf prog All BPF programs must be called with preemption disabled. Fixes: `568f196756` ("bpf: check that BPF programs run with preemption disabled") Reported-by: syzbot+8bf19ee2aa580de7a2a7@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-22 00:14:19 +01:00
Jesper Dangaard Brouer	74e31ca850	bpf: add skb->queue_mapping write access from tc clsact The skb->queue_mapping already have read access, via __sk_buff->queue_mapping. This patch allow BPF tc qdisc clsact write access to the queue_mapping via tc_cls_act_is_valid_access. Also handle that the value NO_QUEUE_MAPPING is not allowed. It is already possible to change this via TC filter action skbedit tc-skbedit(8). Due to the lack of TC examples, lets show one: # tc qdisc add dev ixgbe1 clsact # tc filter add dev ixgbe1 ingress matchall action skbedit queue_mapping 5 # tc filter list dev ixgbe1 ingress The most common mistake is that XPS (Transmit Packet Steering) takes precedence over setting skb->queue_mapping. XPS is configured per DEVICE via /sys/class/net/DEVICE/queues/tx-*/xps_cpus via a CPU hex mask. To disable set mask=00. The purpose of changing skb->queue_mapping is to influence the selection of the net_device "txq" (struct netdev_queue), which influence selection of the qdisc "root_lock" (via txq->qdisc->q.lock) and txq->_xmit_lock. When using the MQ qdisc the txq->qdisc points to different qdiscs and associated locks, and HARD_TX_LOCK (txq->_xmit_lock), allowing for CPU scalability. Due to lack of TC examples, lets show howto attach clsact BPF programs: # tc qdisc add dev ixgbe2 clsact # tc filter add dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu # tc filter list dev ixgbe2 egress Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-19 21:56:05 +01:00
Peter Zijlstra	568f196756	bpf: check that BPF programs run with preemption disabled Introduce cant_sleep() macro for annotation of functions that cannot sleep. Use it in BPF_PROG_RUN to catch execution of BPF programs in preemptable context. Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-19 21:53:07 +01:00
Alban Crequy	a5d9265e01	bpf: bpftool, fix documentation for attach types bpftool has support for attach types "stream_verdict" and "stream_parser" but the documentation was referring to them as "skb_verdict" and "skb_parse". The inconsistency comes from commit `b7d3826c2e` ("bpf: bpftool, add support for attaching programs to maps"). This patch changes the documentation to match the implementation: - "bpftool prog help" - man pages - bash completion Signed-off-by: Alban Crequy <alban@kinvolk.io> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-02-19 17:23:18 +01:00

1 2 3 4 5 ...

813654 Commits