As Naresh reported, test_stacktrace_build_id() causes panic on i386 and
arm32 systems. This is caused by page_address() returns NULL in certain
cases.
This patch fixes this error by using kmap_atomic/kunmap_atomic instead
of page_address.
Fixes: 615755a77b (" bpf: extend stackmap to save binary_build_id+offset instead of address")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While 979d63d50c ("bpf: prevent out of bounds speculation on pointer
arithmetic") took care of rejecting alu op on pointer when e.g. pointer
came from two different map values with different map properties such as
value size, Jann reported that a case was not covered yet when a given
alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
different branches where we would incorrectly try to sanitize based
on the pointer's limit. Catch this corner case and reject the program
instead.
Fixes: 979d63d50c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann reported that the original commit back in b2157399cc
("bpf: prevent out-of-bounds speculation") was not sufficient
to stop CPU from speculating out of bounds memory access:
While b2157399cc only focussed on masking array map access
for unprivileged users for tail calls and data access such
that the user provided index gets sanitized from BPF program
and syscall side, there is still a more generic form affected
from BPF programs that applies to most maps that hold user
data in relation to dynamic map access when dealing with
unknown scalars or "slow" known scalars as access offset, for
example:
- Load a map value pointer into R6
- Load an index into R7
- Do a slow computation (e.g. with a memory dependency) that
loads a limit into R8 (e.g. load the limit from a map for
high latency, then mask it to make the verifier happy)
- Exit if R7 >= R8 (mispredicted branch)
- Load R0 = R6[R7]
- Load R0 = R6[R0]
For unknown scalars there are two options in the BPF verifier
where we could derive knowledge from in order to guarantee
safe access to the memory: i) While </>/<=/>= variants won't
allow to derive any lower or upper bounds from the unknown
scalar where it would be safe to add it to the map value
pointer, it is possible through ==/!= test however. ii) another
option is to transform the unknown scalar into a known scalar,
for example, through ALU ops combination such as R &= <imm>
followed by R |= <imm> or any similar combination where the
original information from the unknown scalar would be destroyed
entirely leaving R with a constant. The initial slow load still
precedes the latter ALU ops on that register, so the CPU
executes speculatively from that point. Once we have the known
scalar, any compare operation would work then. A third option
only involving registers with known scalars could be crafted
as described in [0] where a CPU port (e.g. Slow Int unit)
would be filled with many dependent computations such that
the subsequent condition depending on its outcome has to wait
for evaluation on its execution port and thereby executing
speculatively if the speculated code can be scheduled on a
different execution port, or any other form of mistraining
as described in [1], for example. Given this is not limited
to only unknown scalars, not only map but also stack access
is affected since both is accessible for unprivileged users
and could potentially be used for out of bounds access under
speculation.
In order to prevent any of these cases, the verifier is now
sanitizing pointer arithmetic on the offset such that any
out of bounds speculation would be masked in a way where the
pointer arithmetic result in the destination register will
stay unchanged, meaning offset masked into zero similar as
in array_index_nospec() case. With regards to implementation,
there are three options that were considered: i) new insn
for sanitation, ii) push/pop insn and sanitation as inlined
BPF, iii) reuse of ax register and sanitation as inlined BPF.
Option i) has the downside that we end up using from reserved
bits in the opcode space, but also that we would require
each JIT to emit masking as native arch opcodes meaning
mitigation would have slow adoption till everyone implements
it eventually which is counter-productive. Option ii) and iii)
have both in common that a temporary register is needed in
order to implement the sanitation as inlined BPF since we
are not allowed to modify the source register. While a push /
pop insn in ii) would be useful to have in any case, it
requires once again that every JIT needs to implement it
first. While possible, amount of changes needed would also
be unsuitable for a -stable patch. Therefore, the path which
has fewer changes, less BPF instructions for the mitigation
and does not require anything to be changed in the JITs is
option iii) which this work is pursuing. The ax register is
already mapped to a register in all JITs (modulo arm32 where
it's mapped to stack as various other BPF registers there)
and used in constant blinding for JITs-only so far. It can
be reused for verifier rewrites under certain constraints.
The interpreter's tmp "register" has therefore been remapped
into extending the register set with hidden ax register and
reusing that for a number of instructions that needed the
prior temporary variable internally (e.g. div, mod). This
allows for zero increase in stack space usage in the interpreter,
and enables (restricted) generic use in rewrites otherwise as
long as such a patchlet does not make use of these instructions.
The sanitation mask is dynamic and relative to the offset the
map value or stack pointer currently holds.
There are various cases that need to be taken under consideration
for the masking, e.g. such operation could look as follows:
ptr += val or val += ptr or ptr -= val. Thus, the value to be
sanitized could reside either in source or in destination
register, and the limit is different depending on whether
the ALU op is addition or subtraction and depending on the
current known and bounded offset. The limit is derived as
follows: limit := max_value_size - (smin_value + off). For
subtraction: limit := umax_value + off. This holds because
we do not allow any pointer arithmetic that would
temporarily go out of bounds or would have an unknown
value with mixed signed bounds where it is unclear at
verification time whether the actual runtime value would
be either negative or positive. For example, we have a
derived map pointer value with constant offset and bounded
one, so limit based on smin_value works because the verifier
requires that statically analyzed arithmetic on the pointer
must be in bounds, and thus it checks if resulting
smin_value + off and umax_value + off is still within map
value bounds at time of arithmetic in addition to time of
access. Similarly, for the case of stack access we derive
the limit as follows: MAX_BPF_STACK + off for subtraction
and -off for the case of addition where off := ptr_reg->off +
ptr_reg->var_off.value. Subtraction is a special case for
the masking which can be in form of ptr += -val, ptr -= -val,
or ptr -= val. In the first two cases where we know that
the value is negative, we need to temporarily negate the
value in order to do the sanitation on a positive value
where we later swap the ALU op, and restore original source
register if the value was in source.
The sanitation of pointer arithmetic alone is still not fully
sufficient as is, since a scenario like the following could
happen ...
PTR += 0x1000 (e.g. K-based imm)
PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
PTR += 0x1000
PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
[...]
... which under speculation could end up as ...
PTR += 0x1000
PTR -= 0 [ truncated by mitigation ]
PTR += 0x1000
PTR -= 0 [ truncated by mitigation ]
[...]
... and therefore still access out of bounds. To prevent such
case, the verifier is also analyzing safety for potential out
of bounds access under speculative execution. Meaning, it is
also simulating pointer access under truncation. We therefore
"branch off" and push the current verification state after the
ALU operation with known 0 to the verification stack for later
analysis. Given the current path analysis succeeded it is
likely that the one under speculation can be pruned. In any
case, it is also subject to existing complexity limits and
therefore anything beyond this point will be rejected. In
terms of pruning, it needs to be ensured that the verification
state from speculative execution simulation must never prune
a non-speculative execution path, therefore, we mark verifier
state accordingly at the time of push_stack(). If verifier
detects out of bounds access under speculative execution from
one of the possible paths that includes a truncation, it will
reject such program.
Given we mask every reg-based pointer arithmetic for
unprivileged programs, we've been looking into how it could
affect real-world programs in terms of size increase. As the
majority of programs are targeted for privileged-only use
case, we've unconditionally enabled masking (with its alu
restrictions on top of it) for privileged programs for the
sake of testing in order to check i) whether they get rejected
in its current form, and ii) by how much the number of
instructions and size will increase. We've tested this by
using Katran, Cilium and test_l4lb from the kernel selftests.
For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
we've used test_l4lb.o as well as test_l4lb_noinline.o. We
found that none of the programs got rejected by the verifier
with this change, and that impact is rather minimal to none.
balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
7,797 bytes JITed before and after the change. Most complex
program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
and 18,538 bytes JITed before and after and none of the other
tail call programs in bpf_lxc.o had any changes either. For
the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
after the change. Other programs from that object file had
similar small increase. Both test_l4lb.o had no change and
remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
JITed and for test_l4lb_noinline.o constant at 5,080 bytes
(634 insns) xlated and 3,313 bytes JITed. This can be explained
in that LLVM typically optimizes stack based pointer arithmetic
by using K-based operations and that use of dynamic map access
is not overly frequent. However, in future we may decide to
optimize the algorithm further under known guarantees from
branch and value speculation. Latter seems also unclear in
terms of prediction heuristics that today's CPUs apply as well
as whether there could be collisions in e.g. the predictor's
Value History/Pattern Table for triggering out of bounds access,
thus masking is performed unconditionally at this point but could
be subject to relaxation later on. We were generally also
brainstorming various other approaches for mitigation, but the
blocker was always lack of available registers at runtime and/or
overhead for runtime tracking of limits belonging to a specific
pointer. Thus, we found this to be minimally intrusive under
given constraints.
With that in place, a simple example with sanitized access on
unprivileged load at post-verification time looks as follows:
# bpftool prog dump xlated id 282
[...]
28: (79) r1 = *(u64 *)(r7 +0)
29: (79) r2 = *(u64 *)(r7 +8)
30: (57) r1 &= 15
31: (79) r3 = *(u64 *)(r0 +4608)
32: (57) r3 &= 1
33: (47) r3 |= 1
34: (2d) if r2 > r3 goto pc+19
35: (b4) (u32) r11 = (u32) 20479 |
36: (1f) r11 -= r2 | Dynamic sanitation for pointer
37: (4f) r11 |= r2 | arithmetic with registers
38: (87) r11 = -r11 | containing bounded or known
39: (c7) r11 s>>= 63 | scalars in order to prevent
40: (5f) r11 &= r2 | out of bounds speculation.
41: (0f) r4 += r11 |
42: (71) r4 = *(u8 *)(r4 +0)
43: (6f) r4 <<= r1
[...]
For the case where the scalar sits in the destination register
as opposed to the source register, the following code is emitted
for the above example:
[...]
16: (b4) (u32) r11 = (u32) 20479
17: (1f) r11 -= r2
18: (4f) r11 |= r2
19: (87) r11 = -r11
20: (c7) r11 s>>= 63
21: (5f) r2 &= r11
22: (0f) r2 += r0
23: (61) r0 = *(u32 *)(r2 +0)
[...]
JIT blinding example with non-conflicting use of r10:
[...]
d5: je 0x0000000000000106 _
d7: mov 0x0(%rax),%edi |
da: mov $0xf153246,%r10d | Index load from map value and
e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f.
e7: and %r10,%rdi |_
ea: mov $0x2f,%r10d |
f0: sub %rdi,%r10 | Sanitized addition. Both use r10
f3: or %rdi,%r10 | but do not interfere with each
f6: neg %r10 | other. (Neither do these instructions
f9: sar $0x3f,%r10 | interfere with the use of ax as temp
fd: and %r10,%rdi | in interpreter.)
100: add %rax,%rdi |_
103: mov 0x0(%rdi),%eax
[...]
Tested that it fixes Jann's reproducer, and also checked that test_verifier
and test_progs suite with interpreter, JIT and JIT with hardening enabled
on x86-64 and arm64 runs successfully.
[0] Speculose: Analyzing the Security Implications of Speculative
Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
https://arxiv.org/pdf/1801.04084.pdf
[1] A Systematic Evaluation of Transient Execution Attacks and
Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
Dmitry Evtyushkin, Daniel Gruss,
https://arxiv.org/pdf/1811.05441.pdf
Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In check_map_access() we probe actual bounds through __check_map_access()
with offset of reg->smin_value + off for lower bound and offset of
reg->umax_value + off for the upper bound. However, even though the
reg->smin_value could have a negative value, the final result of the
sum with off could be positive when pointer arithmetic with known and
unknown scalars is combined. In this case we reject the program with
an error such as "R<x> min value is negative, either use unsigned index
or do a if (index >=0) check." even though the access itself would be
fine. Therefore extend the check to probe whether the actual resulting
reg->smin_value + off is less than zero.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For unknown scalars of mixed signed bounds, meaning their smin_value is
negative and their smax_value is positive, we need to reject arithmetic
with pointer to map value. For unprivileged the goal is to mask every
map pointer arithmetic and this cannot reliably be done when it is
unknown at verification time whether the scalar value is negative or
positive. Given this is a corner case, the likelihood of breaking should
be very small.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Restrict stack pointer arithmetic for unprivileged users in that
arithmetic itself must not go out of bounds as opposed to the actual
access later on. Therefore after each adjust_ptr_min_max_vals() with
a stack pointer as a destination we simulate a check_stack_access()
of 1 byte on the destination and once that fails the program is
rejected for unprivileged program loads. This is analog to map
value pointer arithmetic and needed for masking later on.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Restrict map value pointer arithmetic for unprivileged users in that
arithmetic itself must not go out of bounds as opposed to the actual
access later on. Therefore after each adjust_ptr_min_max_vals() with a
map value pointer as a destination it will simulate a check_map_access()
of 1 byte on the destination and once that fails the program is rejected
for unprivileged program loads. We use this later on for masking any
pointer arithmetic with the remainder of the map value space. The
likelihood of breaking any existing real-world unprivileged eBPF
program is very small for this corner case.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Right now we are using BPF ax register in JIT for constant blinding as
well as in interpreter as temporary variable. Verifier will not be able
to use it simply because its use will get overridden from the former in
bpf_jit_blind_insn(). However, it can be made to work in that blinding
will be skipped if there is prior use in either source or destination
register on the instruction. Taking constraints of ax into account, the
verifier is then open to use it in rewrites under some constraints. Note,
ax register already has mappings in every eBPF JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
into the hidden ax register. The latter is currently only used in JITs
for constant blinding as a temporary scratch register, meaning the BPF
interpreter will never see the use of ax. Therefore it is safe to use
it for the cases where tmp has been used earlier. This is needed to later
on allow restricted hidden use of ax in both interpreter and JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move prev_insn_idx and insn_idx from the do_check() function into
the verifier environment, so they can be read inside the various
helper functions for handling the instructions. It's easier to put
this into the environment rather than changing all call-sites only
to pass it along. insn_idx is useful in particular since this later
on allows to hold state in env->insn_aux_data[env->insn_idx].
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-12-21
The following pull-request contains BPF updates for your *net-next* tree.
There is a merge conflict in test_verifier.c. Result looks as follows:
[...]
},
{
"calls: cross frame pruning",
.insns = {
[...]
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
.result_unpriv = REJECT,
.errstr = "!read_ok",
.result = REJECT,
},
{
"jset: functional",
.insns = {
[...]
{
"jset: unknown const compare not taken",
.insns = {
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32),
BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1),
BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0),
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.errstr_unpriv = "!read_ok",
.result_unpriv = REJECT,
.errstr = "!read_ok",
.result = REJECT,
},
[...]
{
"jset: range",
.insns = {
[...]
},
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.result_unpriv = ACCEPT,
.result = ACCEPT,
},
The main changes are:
1) Various BTF related improvements in order to get line info
working. Meaning, verifier will now annotate the corresponding
BPF C code to the error log, from Martin and Yonghong.
2) Implement support for raw BPF tracepoints in modules, from Matt.
3) Add several improvements to verifier state logic, namely speeding
up stacksafe check, optimizations for stack state equivalence
test and safety checks for liveness analysis, from Alexei.
4) Teach verifier to make use of BPF_JSET instruction, add several
test cases to kselftests and remove nfp specific JSET optimization
now that verifier has awareness, from Jakub.
5) Improve BPF verifier's slot_type marking logic in order to
allow more stack slot sharing, from Jiong.
6) Add sk_msg->size member for context access and add set of fixes
and improvements to make sock_map with kTLS usable with openssl
based applications, from John.
7) Several cleanups and documentation updates in bpftool as well as
auto-mount of tracefs for "bpftool prog tracelog" command,
from Quentin.
8) Include sub-program tags from now on in bpf_prog_info in order to
have a reliable way for user space to get all tags of the program
e.g. needed for kallsyms correlation, from Song.
9) Add BTF annotations for cgroup_local_storage BPF maps and
implement bpf fs pretty print support, from Roman.
10) Fix bpftool in order to allow for cross-compilation, from Ivan.
11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order
to be compatible with libbfd and allow for Debian packaging,
from Jakub.
12) Remove an obsolete prog->aux sanitation in dump and get rid of
version check for prog load, from Daniel.
13) Fix a memory leak in libbpf's line info handling, from Prashant.
14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info
does not get unaligned, from Jesper.
15) Fix test_progs kselftest to work with older compilers which are less
smart in optimizing (and thus throwing build error), from Stanislav.
16) Cleanup and simplify AF_XDP socket teardown, from Björn.
17) Fix sk lookup in BPF kselftest's test_sock_addr with regards
to netns_id argument, from Andrey.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The frame_size passed to build_skb must be aligned, else it is
possible that the embedded struct skb_shared_info gets unaligned.
For correctness make sure that xdpf->headroom in included in the
alignment. No upstream drivers can hit this, as all XDP drivers provide
an aligned headroom. This was discovered when playing with implementing
XDP support for mvneta, which have a 2 bytes DSA header, and this
Marvell ARM64 platform didn't like doing atomic operations on an
unaligned skb_shinfo(skb)->dataref addresses.
Fixes: 1c601d829a ("bpf: cpumap xdp_buff to skb conversion and allocation")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.
Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reorder the calls to check_max_stack_depth() and sanitize_dead_code()
to separate functions which can rewrite instructions from pure checks.
No functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Some JITs (nfp) try to optimize code on their own. It could make
sense in case of BPF_JSET instruction which is currently not interpreted
by the verifier, meaning for instance that dead could would not be
detected if it was under BPF_JSET branch.
Teach the verifier basics of BPF_JSET, JIT optimizations will be
removed shortly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch rejects a line_info if the bpf insn code referred by
line_info.insn_off is 0. F.e. a broken userspace tool might generate
a line_info.insn_off that points to the second 8 bytes of a BPF_LD_IMM64.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Current btf internal verbose logger logs fwd type as
[2] FWD A type_id=0
where A is the type name.
Commit 9d5f9f701b ("bpf: btf: fix struct/union/fwd types
with kind_flag") introduced kind_flag which can be used
to distinguish whether a forward type is a struct or
union.
Also, "type_id=0" does not carry any meaningful
information for fwd type as btf_type.type = 0 is simply
enforced during btf verification and is not used
anywhere else.
This commit changed the log to
[2] FWD A struct
if kind_flag = 0, or
[2] FWD A union
if kind_flag = 1.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Verifier is supposed to support sharing stack slot allocated to ptr with
SCALAR_VALUE for privileged program. However this doesn't happen for some
cases.
The reason is verifier is not clearing slot_type STACK_SPILL for all bytes,
it only clears part of them, while verifier is using:
slot_type[0] == STACK_SPILL
as a convention to check one slot is ptr type.
So, the consequence of partial clearing slot_type is verifier could treat a
partially overridden ptr slot, which should now be a SCALAR_VALUE slot,
still as ptr slot, and rejects some valid programs.
Before this patch, test_xdp_noinline.o under bpf selftests, bpf_lxc.o and
bpf_netdev.o under Cilium bpf repo, when built with -mattr=+alu32 are
rejected due to this issue. After this patch, they all accepted.
There is no processed insn number change before and after this patch on
Cilium bpf programs.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Distributions build drivers as modules, including network and filesystem
drivers which export numerous tracepoints. This enables
bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 970289fc0a83 ("bpf: add bpffs pretty print for cgroup
local storage maps") added bpffs pretty print for cgroup
local storage maps. The commit worked for struct without kind_flag
set.
This patch refactored and made pretty print also work
with kind_flag set for the struct.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch fixed two issues with BTF. One is related to
struct/union bitfield encoding and the other is related to
forward type.
Issue #1 and solution:
======================
Current btf encoding of bitfield follows what pahole generates.
For each bitfield, pahole will duplicate the type chain and
put the bitfield size at the final int or enum type.
Since the BTF enum type cannot encode bit size,
pahole workarounds the issue by generating
an int type whenever the enum bit size is not 32.
For example,
-bash-4.4$ cat t.c
typedef int ___int;
enum A { A1, A2, A3 };
struct t {
int a[5];
___int b:4;
volatile enum A c:4;
} g;
-bash-4.4$ gcc -c -O2 -g t.c
The current kernel supports the following BTF encoding:
$ pahole -JV t.o
[1] TYPEDEF ___int type_id=2
[2] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
[3] ENUM A size=4 vlen=3
A1 val=0
A2 val=1
A3 val=2
[4] STRUCT t size=24 vlen=3
a type_id=5 bits_offset=0
b type_id=9 bits_offset=160
c type_id=11 bits_offset=164
[5] ARRAY (anon) type_id=2 index_type_id=2 nr_elems=5
[6] INT sizetype size=8 bit_offset=0 nr_bits=64 encoding=(none)
[7] VOLATILE (anon) type_id=3
[8] INT int size=1 bit_offset=0 nr_bits=4 encoding=(none)
[9] TYPEDEF ___int type_id=8
[10] INT (anon) size=1 bit_offset=0 nr_bits=4 encoding=SIGNED
[11] VOLATILE (anon) type_id=10
Two issues are in the above:
. by changing enum type to int, we lost the original
type information and this will not be ideal later
when we try to convert BTF to a header file.
. the type duplication for bitfields will cause
BTF bloat. Duplicated types cannot be deduplicated
later if the bitfield size is different.
To fix this issue, this patch implemented a compatible
change for BTF struct type encoding:
. the bit 31 of struct_type->info, previously reserved,
now is used to indicate whether bitfield_size is
encoded in btf_member or not.
. if bit 31 of struct_type->info is set,
btf_member->offset will encode like:
bit 0 - 23: bit offset
bit 24 - 31: bitfield size
if bit 31 is not set, the old behavior is preserved:
bit 0 - 31: bit offset
So if the struct contains a bit field, the maximum bit offset
will be reduced to (2^24 - 1) instead of MAX_UINT. The maximum
bitfield size will be 256 which is enough for today as maximum
bitfield in compiler can be 128 where int128 type is supported.
This kernel patch intends to support the new BTF encoding:
$ pahole -JV t.o
[1] TYPEDEF ___int type_id=2
[2] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
[3] ENUM A size=4 vlen=3
A1 val=0
A2 val=1
A3 val=2
[4] STRUCT t kind_flag=1 size=24 vlen=3
a type_id=5 bitfield_size=0 bits_offset=0
b type_id=1 bitfield_size=4 bits_offset=160
c type_id=7 bitfield_size=4 bits_offset=164
[5] ARRAY (anon) type_id=2 index_type_id=2 nr_elems=5
[6] INT sizetype size=8 bit_offset=0 nr_bits=64 encoding=(none)
[7] VOLATILE (anon) type_id=3
Issue #2 and solution:
======================
Current forward type in BTF does not specify whether the original
type is struct or union. This will not work for type pretty print
and BTF-to-header-file conversion as struct/union must be specified.
$ cat tt.c
struct t;
union u;
int foo(struct t *t, union u *u) { return 0; }
$ gcc -c -g -O2 tt.c
$ pahole -JV tt.o
[1] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
[2] FWD t type_id=0
[3] PTR (anon) type_id=2
[4] FWD u type_id=0
[5] PTR (anon) type_id=4
To fix this issue, similar to issue #1, type->info bit 31
is used. If the bit is set, it is union type. Otherwise, it is
a struct type.
$ pahole -JV tt.o
[1] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
[2] FWD t kind_flag=0 type_id=0
[3] PTR (anon) kind_flag=0 type_id=2
[4] FWD u kind_flag=1 type_id=0
[5] PTR (anon) kind_flag=0 type_id=4
Pahole/LLVM change:
===================
The new kind_flag functionality has been implemented in pahole
and llvm:
https://github.com/yonghong-song/pahole/tree/bitfieldhttps://github.com/yonghong-song/llvm/tree/bitfield
Note that pahole hasn't implemented func/func_proto kind
and .BTF.ext. So to print function signature with bpftool,
the llvm compiler should be used.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Refactor function btf_int_bits_seq_show() by creating
function btf_bitfield_seq_show() which has no dependence
on btf and btf_type. The function btf_bitfield_seq_show()
will be in later patch to directly dump bitfield member values.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Existing libraries and tracing frameworks work around this kernel
version check by automatically deriving the kernel version from
uname(3) or similar such that the user does not need to do it
manually; these workarounds also make the version check useless
at the same time.
Moreover, most other BPF tracing types enabling bpf_probe_read()-like
functionality have /not/ adapted this check, and in general these
days it is well understood anyway that all the tracing programs are
not stable with regards to future kernels as kernel internal data
structures are subject to change from release to release.
Back at last netconf we discussed [0] and agreed to remove this
check from bpf_prog_load() and instead document it here in the uapi
header that there is no such guarantee for stable API for these
programs.
[0] http://vger.kernel.org/netconf2018_files/DanielBorkmann_netconf2018.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce REG_LIVE_DONE to check the liveness propagation
and prepare the states for merging.
See algorithm description in clean_live_states().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
"if (old->allocated_stack > cur->allocated_stack)" check is too conservative.
In some cases explored stack could have allocated more space,
but that stack space was not live.
The test case improves from 19 to 15 processed insns
and improvement on real programs is significant as well:
before after
bpf_lb-DLB_L3.o 1940 1831
bpf_lb-DLB_L4.o 3089 3029
bpf_lb-DUNKNOWN.o 1065 1064
bpf_lxc-DDROP_ALL.o 28052 26309
bpf_lxc-DUNKNOWN.o 35487 33517
bpf_netdev.o 10864 9713
bpf_overlay.o 6643 6184
bpf_lcx_jit.o 38437 37335
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Don't check the same stack liveness condition 8 times.
once is enough.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds bpf_line_info during the verifier's verbose.
It can give error context for debug purpose.
~~~~~~~~~~
Here is the verbose log for backedge:
while (a) {
a += bpf_get_smp_processor_id();
bpf_trace_printk(fmt, sizeof(fmt), a);
}
~> bpftool prog load ./test_loop.o /sys/fs/bpf/test_loop type tracepoint
13: while (a) {
3: a += bpf_get_smp_processor_id();
back-edge from insn 13 to 3
~~~~~~~~~~
Here is the verbose log for invalid pkt access:
Modification to test_xdp_noinline.c:
data = (void *)(long)xdp->data;
data_end = (void *)(long)xdp->data_end;
/*
if (data + 4 > data_end)
return XDP_DROP;
*/
*(u32 *)data = dst->dst;
~> bpftool prog load ./test_xdp_noinline.o /sys/fs/bpf/test_xdp_noinline type xdp
; data = (void *)(long)xdp->data;
224: (79) r2 = *(u64 *)(r10 -112)
225: (61) r2 = *(u32 *)(r2 +0)
; *(u32 *)data = dst->dst;
226: (63) *(u32 *)(r2 +0) = r1
invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0)
R2 offset is outside of the packet
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current btf_name_by_offset() is returning "(anon)" type name for
the offset == 0 case and "(invalid-name-offset)" for the out-of-bound
offset case.
It fits well for the internal BTF verbose log purpose which
is focusing on type. For example,
offset == 0 => "(anon)" => anonymous type/name.
Returning non-NULL for the bad offset case is needed
during the BTF verification process because the BTF verifier may
complain about another field first before discovering the name_off
is invalid.
However, it may not be ideal for the newer use case which does not
necessary mean type name. For example, when logging line_info
in the BPF verifier in the next patch, it is better to log an
empty src line instead of logging "(anon)".
The existing bpf_name_by_offset() is renamed to __bpf_name_by_offset()
and static to btf.c.
A new bpf_name_by_offset() is added for generic context usage. It
returns "\0" for name_off == 0 (note that btf->strings[0] is "\0")
and NULL for invalid offset. It allows the caller to decide
what is the best output in its context.
The new btf_name_by_offset() is overlapped with btf_name_offset_valid().
Hence, btf_name_offset_valid() is removed from btf.h to keep the btf.h API
minimal. The existing btf_name_offset_valid() usage in btf.c could also be
replaced later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This logic is not needed anymore since we got rid of the verifier
rewrite that was using prog->aux address in f6069b9aa9 ("bpf:
fix redirect to map under tail calls").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently for liveness and state pruning the register parentage
chains don't include states of the callee. This makes some sense
as the callee can't access those registers. However, this means
that READs done after the callee returns will not propagate into
the states of the callee. Callee will then perform pruning
disregarding differences in caller state.
Example:
0: (85) call bpf_user_rnd_u32
1: (b7) r8 = 0
2: (55) if r0 != 0x0 goto pc+1
3: (b7) r8 = 1
4: (bf) r1 = r8
5: (85) call pc+4
6: (15) if r8 == 0x1 goto pc+1
7: (05) *(u64 *)(r9 - 8) = r3
8: (b7) r0 = 0
9: (95) exit
10: (15) if r1 == 0x0 goto pc+0
11: (95) exit
Here we acquire unknown state with call to get_random() [1]. Then
we store this random state in r8 (either 0 or 1) [1 - 3], and make
a call on line 5. Callee does nothing but a trivial conditional
jump (to create a pruning point). Upon return caller checks the
state of r8 and either performs an unsafe read or not.
Verifier will first explore the path with r8 == 1, creating a pruning
point at [11]. The parentage chain for r8 will include only callers
states so once verifier reaches [6] it will mark liveness only on states
in the caller, and not [11]. Now when verifier walks the paths with
r8 == 0 it will reach [11] and since REG_LIVE_READ on r8 was not
propagated there it will prune the walk entirely (stop walking
the entire program, not just the callee). Since [6] was never walked
with r8 == 0, [7] will be considered dead and replaced with "goto -1"
causing hang at runtime.
This patch weaves the callee's explored states onto the callers
parentage chain. Rough parentage for r8 would have looked like this
before:
[0] [1] [2] [3] [4] [5] [10] [11] [6] [7]
| | ,---|----. | | |
sl0: sl0: / sl0: \ sl0: sl0: sl0:
fr0: r8 <-- fr0: r8<+--fr0: r8 `fr0: r8 ,fr0: r8<-fr0: r8
\ fr1: r8 <- fr1: r8 /
\__________________/
after:
[0] [1] [2] [3] [4] [5] [10] [11] [6] [7]
| | | | | |
sl0: sl0: sl0: sl0: sl0: sl0:
fr0: r8 <-- fr0: r8 <- fr0: r8 <- fr0: r8 <-fr0: r8<-fr0: r8
fr1: r8 <- fr1: r8
Now the mark from instruction 6 will travel through callees states.
Note that we don't have to connect r0 because its overwritten by
callees state on return and r1 - r5 because those are not alive
any more once a call is made.
v2:
- don't connect the callees registers twice (Alexei: suggestion & code)
- add more details to the comment (Ed & Alexei)
v1: don't unnecessarily link caller saved regs (Jiong)
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Changes v2 -> v3:
1. remove check for bpf_dump_raw_ok().
Changes v1 -> v2:
1. Fix error path as Martin suggested.
This patch adds nr_prog_tags and prog_tags to bpf_prog_info. This is a
reliable way for user space to get tags of all sub programs. Before this
patch, user space need to find sub program tags via kallsyms.
This feature will be used in BPF introspection, where user space queries
information about BPF programs via sys_bpf.
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The func_info and line_info have the bpf insn offset but
they do not contain kernel address. They will still be useful
for the userspace tool to annotate the xlated insn.
This patch removes the bpf_dump_raw_ok() guard for the
func_info and line_info during bpf_prog_get_info_by_fd().
The guard stays for jited_line_info which contains the kernel
address.
Although this bpf_dump_raw_ok() guard behavior has started since
the earlier func_info patch series, I marked the Fixes tag to the
latest line_info patch series which contains both func_info and
line_info and this patch is fixing for both of them.
Fixes: c454a46b5e ("bpf: Add bpf_line_info support")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Implement bpffs pretty printing for cgroup local storage maps
(both shared and per-cpu).
Output example (captured for tools/testing/selftests/bpf/netcnt_prog.c):
Shared:
$ cat /sys/fs/bpf/map_2
# WARNING!! The output is for debug purpose only
# WARNING!! The output format will change
{4294968594,1}: {9999,1039896}
Per-cpu:
$ cat /sys/fs/bpf/map_1
# WARNING!! The output is for debug purpose only
# WARNING!! The output format will change
{4294968594,1}: {
cpu0: {0,0,0,0,0}
cpu1: {0,0,0,0,0}
cpu2: {1,104,0,0,0}
cpu3: {0,0,0,0,0}
}
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If key_type or value_type are of non-trivial data types
(e.g. structure or typedef), it's not possible to check them without
the additional information, which can't be obtained without a pointer
to the btf structure.
So, let's pass btf pointer to the map_check_btf() callbacks.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Michael and Sandipan report:
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
and is adjusted again at init time if MODULES_VADDR is defined.
For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c400000 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:
root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
-1673527296
and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:
setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
16) = -1 ENOTSUPP (Unknown error 524)
and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
host would time out on systemd/netplan configuring a virtio-net NIC
with no noticeable errors in the logs.
Given this and also given that in near future some architectures like
arm64 will have a custom area for BPF JIT image allocations we should
get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
so therefore add another overridable bpf_jit_alloc_exec_limit() helper
function which returns the possible size of the memory area for deriving
the default heuristic in bpf_jit_charge_init().
Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
JIT memory provider, and therefore in case archs implement their custom
module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
Additionally, for archs supporting large page sizes, we should change
the sysctl to be handled as long to not run into sysctl restrictions
in future.
Fixes: ede95a63b5 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-12-11
The following pull-request contains BPF updates for your *net-next* tree.
It has three minor merge conflicts, resolutions:
1) tools/testing/selftests/bpf/test_verifier.c
Take first chunk with alignment_prevented_execution.
2) net/core/filter.c
[...]
case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
case bpf_ctx_range(struct __sk_buff, wire_len):
return false;
[...]
3) include/uapi/linux/bpf.h
Take the second chunk for the two cases each.
The main changes are:
1) Add support for BPF line info via BTF and extend libbpf as well
as bpftool's program dump to annotate output with BPF C code to
facilitate debugging and introspection, from Martin.
2) Add support for BPF_ALU | BPF_ARSH | BPF_{K,X} in interpreter
and all JIT backends, from Jiong.
3) Improve BPF test coverage on archs with no efficient unaligned
access by adding an "any alignment" flag to the BPF program load
to forcefully disable verifier alignment checks, from David.
4) Add a new bpf_prog_test_run_xattr() API to libbpf which allows for
proper use of BPF_PROG_TEST_RUN with data_out, from Lorenz.
5) Extend tc BPF programs to use a new __sk_buff field called wire_len
for more accurate accounting of packets going to wire, from Petar.
6) Improve bpftool to allow dumping the trace pipe from it and add
several improvements in bash completion and map/prog dump,
from Quentin.
7) Optimize arm64 BPF JIT to always emit movn/movk/movk sequence for
kernel addresses and add a dedicated BPF JIT backend allocator,
from Ard.
8) Add a BPF helper function for IR remotes to report mouse movements,
from Sean.
9) Various cleanups in BPF prog dump e.g. to make UAPI bpf_prog_info
member naming consistent with existing conventions, from Yonghong
and Song.
10) Misc cleanups and improvements in allowing to pass interface name
via cmdline for xdp1 BPF example, from Matteo.
11) Fix a potential segfault in BPF sample loader's kprobes handling,
from Daniel T.
12) Fix SPDX license in libbpf's README.rst, from Andrey.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In uapi bpf.h, currently we have the following fields in
the struct bpf_prog_info:
__u32 func_info_cnt;
__u32 line_info_cnt;
__u32 jited_line_info_cnt;
The above field names "func_info_cnt" and "line_info_cnt"
also appear in union bpf_attr for program loading.
The original intention is to keep the names the same
between bpf_prog_info and bpf_attr
so it will imply what we returned to user space will be
the same as what the user space passed to the kernel.
Such a naming convention in bpf_prog_info is not consistent
with other fields like:
__u32 nr_jited_ksyms;
__u32 nr_jited_func_lens;
This patch made this adjustment so in bpf_prog_info
newly introduced *_info_cnt becomes nr_*_info.
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
info.nr_jited_ksyms and info.nr_jited_func_lens cannot be 0 in these two
statements, so we don't need to check them.
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, the destination register is marked as unknown for 32-bit
sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
SCALAR_VALUE.
This is too conservative that some valid cases will be rejected.
Especially, this may turn a constant scalar value into unknown value that
could break some assumptions of verifier.
For example, test_l4lb_noinline.c has the following C code:
struct real_definition *dst
1: if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
2: return TC_ACT_SHOT;
3:
4: if (dst->flags & F_IPV6) {
get_packet_dst is responsible for initializing "dst" into valid pointer and
return true (1), otherwise return false (0). The compiled instruction
sequence using alu32 will be:
412: (54) (u32) r7 &= (u32) 1
413: (bc) (u32) r0 = (u32) r7
414: (95) exit
insn 413, a BPF_MOV | BPF_ALU, however will turn r0 into unknown value even
r7 contains SCALAR_VALUE 1.
This causes trouble when verifier is walking the code path that hasn't
initialized "dst" inside get_packet_dst, for which case 0 is returned and
we would then expect verifier concluding line 1 in the above C code pass
the "if" check, therefore would skip fall through path starting at line 4.
Now, because r0 returned from callee has became unknown value, so verifier
won't skip analyzing path starting at line 4 and "dst->flags" requires
dereferencing the pointer "dst" which actually hasn't be initialized for
this path.
This patch relaxed the code marking sub-register move destination. For a
SCALAR_VALUE, it is safe to just copy the value from source then truncate
it into 32-bit.
A unit test also included to demonstrate this issue. This test will fail
before this patch.
This relaxation could let verifier skipping more paths for conditional
comparison against immediate. It also let verifier recording a more
accurate/strict value for one register at one state, if this state end up
with going through exit without rejection and it is used for state
comparison later, then it is possible an inaccurate/permissive value is
better. So the real impact on verifier processed insn number is complex.
But in all, without this fix, valid program could be rejected.
>From real benchmarking on kernel selftests and Cilium bpf tests, there is
no impact on processed instruction number when tests ares compiled with
default compilation options. There is slightly improvements when they are
compiled with -mattr=+alu32 after this patch.
Also, test_xdp_noinline/-mattr=+alu32 now passed verification. It is
rejected before this fix.
Insn processed before/after this patch:
default -mattr=+alu32
Kernel selftest
===
test_xdp.o 371/371 369/369
test_l4lb.o 6345/6345 5623/5623
test_xdp_noinline.o 2971/2971 rejected/2727
test_tcp_estates.o 429/429 430/430
Cilium bpf
===
bpf_lb-DLB_L3.o: 2085/2085 1685/1687
bpf_lb-DLB_L4.o: 2287/2287 1986/1982
bpf_lb-DUNKNOWN.o: 690/690 622/622
bpf_lxc.o: 95033/95033 N/A
bpf_netdev.o: 7245/7245 N/A
bpf_overlay.o: 2898/2898 3085/2947
NOTE:
- bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
verifier due to another issue inside verifier on supporting alu32
binary.
- Each cilium bpf program could generate several processed insn number,
above number is sum of them.
v1->v2:
- Restrict the change on SCALAR_VALUE.
- Update benchmark numbers on Cilium bpf tests.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Several conflicts, seemingly all over the place.
I used Stephen Rothwell's sample resolutions for many of these, if not
just to double check my own work, so definitely the credit largely
goes to him.
The NFP conflict consisted of a bug fix (moving operations
past the rhashtable operation) while chaning the initial
argument in the function call in the moved code.
The net/dsa/master.c conflict had to do with a bug fix intermixing of
making dsa_master_set_mtu() static with the fixing of the tagging
attribute location.
cls_flower had a conflict because the dup reject fix from Or
overlapped with the addition of port range classifiction.
__set_phy_supported()'s conflict was relatively easy to resolve
because Andrew fixed it in both trees, so it was just a matter
of taking the net-next copy. Or at least I think it was :-)
Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
intermixed with changes on how the sdif and caller_net are calculated
in these code paths in net-next.
The remaining BPF conflicts were largely about the addition of the
__bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
to the relevant data structure where the MD pointer macros are used.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds bpf_line_info support.
It accepts an array of bpf_line_info objects during BPF_PROG_LOAD.
The "line_info", "line_info_cnt" and "line_info_rec_size" are added
to the "union bpf_attr". The "line_info_rec_size" makes
bpf_line_info extensible in the future.
The new "check_btf_line()" ensures the userspace line_info is valid
for the kernel to use.
When the verifier is translating/patching the bpf_prog (through
"bpf_patch_insn_single()"), the line_infos' insn_off is also
adjusted by the newly added "bpf_adj_linfo()".
If the bpf_prog is jited, this patch also provides the jited addrs (in
aux->jited_linfo) for the corresponding line_info.insn_off.
"bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo.
It is currently called by the x86 jit. Other jits can also use
"bpf_prog_fill_jited_linfo()" and it will be done in the followup patches.
In the future, if it deemed necessary, a particular jit could also provide
its own "bpf_prog_fill_jited_linfo()" implementation.
A few "*line_info*" fields are added to the bpf_prog_info such
that the user can get the xlated line_info back (i.e. the line_info
with its insn_off reflecting the translated prog). The jited_line_info
is available if the prog is jited. It is an array of __u64.
If the prog is not jited, jited_line_info_cnt is 0.
The verifier's verbose log with line_info will be done in
a follow up patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch remove the rejection on BPF_ALU | BPF_ARSH as we have supported
them on interpreter and all JIT back-ends
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch implements interpreting BPF_ALU | BPF_ARSH. Do arithmetic right
shift on low 32-bit sub-register, and zero the high 32 bits.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The later patch will introduce "struct bpf_line_info" which
has member "line_off" and "file_off" referring back to the
string section in btf. The line_"off" and file_"off"
are more consistent to the naming convention in btf.h that
means "offset" (e.g. name_off in "struct btf_type").
The to-be-added "struct bpf_line_info" also has another
member, "insn_off" which is the same as the "insn_offset"
in "struct bpf_func_info". Hence, this patch renames "insn_offset"
to "insn_off" for "struct bpf_func_info".
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
1) When bpf_dump_raw_ok() == false and the kernel can provide >=1
func_info to the userspace, the current behavior is setting
the info.func_info_cnt to 0 instead of setting info.func_info
to 0.
It is different from the behavior in jited_func_lens/nr_jited_func_lens,
jited_ksyms/nr_jited_ksyms...etc.
This patch fixes it. (i.e. set func_info to 0 instead of
func_info_cnt to 0 when bpf_dump_raw_ok() == false).
2) When the userspace passed in info.func_info_cnt == 0, the kernel
will set the expected func_info size back to the
info.func_info_rec_size. It is a way for the userspace to learn
the kernel expected func_info_rec_size introduced in
commit 838e96904f ("bpf: Introduce bpf_func_info").
An exception is the kernel expected size is not set when
func_info is not available for a bpf_prog. This makes the
returned info.func_info_rec_size has different values
depending on the returned value of info.func_info_cnt.
This patch sets the kernel expected size to info.func_info_rec_size
independent of the info.func_info_cnt.
3) The current logic only rejects invalid func_info_rec_size if
func_info_cnt is non zero. This patch also rejects invalid
nonzero info.func_info_rec_size and not equal to the kernel
expected size.
4) Set info.btf_id as long as prog->aux->btf != NULL. That will
setup the later copy_to_user() codes look the same as others
which then easier to understand and maintain.
prog->aux->btf is not NULL only if prog->aux->func_info_cnt > 0.
Breaking up info.btf_id from prog->aux->func_info_cnt is needed
for the later line info patch anyway.
A similar change is made to bpf_get_prog_name().
Fixes: 838e96904f ("bpf: Introduce bpf_func_info")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
By default, BPF uses module_alloc() to allocate executable memory,
but this is not necessary on all arches and potentially undesirable
on some of them.
So break out the module_alloc() and module_memfree() calls into __weak
functions to allow them to be overridden in arch code.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
malicious bpf program may try to force the verifier to remember
a lot of distinct verifier states.
Put a limit to number of per-insn 'struct bpf_verifier_state'.
Note that hitting the limit doesn't reject the program.
It potentially makes the verifier do more steps to analyze the program.
It means that malicious programs will hit BPF_COMPLEXITY_LIMIT_INSNS sooner
instead of spending cpu time walking long link list.
The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs
with slight increase in number of "steps" it takes to successfully verify
the programs:
before after
bpf_lb-DLB_L3.o 1940 1940
bpf_lb-DLB_L4.o 3089 3089
bpf_lb-DUNKNOWN.o 1065 1065
bpf_lxc-DDROP_ALL.o 28052 | 28162
bpf_lxc-DUNKNOWN.o 35487 | 35541
bpf_netdev.o 10864 10864
bpf_overlay.o 6643 6643
bpf_lcx_jit.o 38437 38437
But it also makes malicious program to be rejected in 0.4 seconds vs 6.5
Hence apply this limit to unprivileged programs only.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
pathological bpf programs may try to force verifier to explode in
the number of branch states:
20: (d5) if r1 s<= 0x24000028 goto pc+0
21: (b5) if r0 <= 0xe1fa20 goto pc+2
22: (d5) if r1 s<= 0x7e goto pc+0
23: (b5) if r0 <= 0xe880e000 goto pc+0
24: (c5) if r0 s< 0x2100ecf4 goto pc+0
25: (d5) if r1 s<= 0xe880e000 goto pc+1
26: (c5) if r0 s< 0xf4041810 goto pc+0
27: (d5) if r1 s<= 0x1e007e goto pc+0
28: (b5) if r0 <= 0xe86be000 goto pc+0
29: (07) r0 += 16614
30: (c5) if r0 s< 0x6d0020da goto pc+0
31: (35) if r0 >= 0x2100ecf4 goto pc+0
Teach verifier to recognize always taken and always not taken branches.
This analysis is already done for == and != comparison.
Expand it to all other branches.
It also helps real bpf programs to be verified faster:
before after
bpf_lb-DLB_L3.o 2003 1940
bpf_lb-DLB_L4.o 3173 3089
bpf_lb-DUNKNOWN.o 1080 1065
bpf_lxc-DDROP_ALL.o 29584 28052
bpf_lxc-DUNKNOWN.o 36916 35487
bpf_netdev.o 11188 10864
bpf_overlay.o 6679 6643
bpf_lcx_jit.o 39555 38437
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Malicious user space may try to force the verifier to use as much cpu
time and memory as possible. Hence check for pending signals
while verifying the program.
Note that suspend of sys_bpf(PROG_LOAD) syscall will lead to EAGAIN,
since the kernel has to release the resources used for program verification.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The aux->func_info and aux->btf are leaked in the error out cases
during bpf_prog_load(). This patch fixes it.
Fixes: ba64e7d852 ("bpf: btf: support proper non-jit func info")
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Often we want to write tests cases that check things like bad context
offset accesses. And one way to do this is to use an odd offset on,
for example, a 32-bit load.
This unfortunately triggers the alignment checks first on platforms
that do not set CONFIG_EFFICIENT_UNALIGNED_ACCESS. So the test
case see the alignment failure rather than what it was testing for.
It is often not completely possible to respect the original intention
of the test, or even test the same exact thing, while solving the
alignment issue.
Another option could have been to check the alignment after the
context and other validations are performed by the verifier, but
that is a non-trivial change to the verifier.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
bpf-next 2018-11-30
The following pull-request contains BPF updates for your *net-next* tree.
(Getting out bit earlier this time to pull in a dependency from bpf.)
The main changes are:
1) Add libbpf ABI versioning and document API naming conventions
as well as ABI versioning process, from Andrey.
2) Add a new sk_msg_pop_data() helper for sk_msg based BPF
programs that is used in conjunction with sk_msg_push_data()
for adding / removing meta data to the msg data, from John.
3) Optimize convert_bpf_ld_abs() for 0 offset and fix various
lib and testsuite build failures on 32 bit, from David.
4) Make BPF prog dump for !JIT identical to how we dump subprogs
when JIT is in use, from Yonghong.
5) Rename btf_get_from_id() to make it more conform with libbpf
API naming conventions, from Martin.
6) Add a missing BPF kselftest config item, from Naresh.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial conflict in net/core/filter.c, a locally computed
'sdif' is now an argument to the function.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added name checking for the following types:
. BTF_KIND_PTR, BTF_KIND_ARRAY, BTF_KIND_VOLATILE,
BTF_KIND_CONST, BTF_KIND_RESTRICT:
the name must be null
. BTF_KIND_STRUCT, BTF_KIND_UNION: the struct/member name
is either null or a valid identifier
. BTF_KIND_ENUM: the enum type name is either null or a valid
identifier; the enumerator name must be a valid identifier.
. BTF_KIND_FWD: the name must be a valid identifier
. BTF_KIND_TYPEDEF: the name must be a valid identifier
For those places a valid name is required, the name must be
a valid C identifier. This can be relaxed later if we found
use cases for a different (non-C) frontend.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Function btf_name_valid_identifier() have been implemented in
bpf-next commit 2667a2626f ("bpf: btf: Add BTF_KIND_FUNC and
BTF_KIND_FUNC_PROTO"). Backport this function so later patch
can use it.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 838e96904f ("bpf: Introduce bpf_func_info")
added bpf func info support. The userspace is able
to get better ksym's for bpf programs with jit, and
is able to print out func prototypes.
For a program containing func-to-func calls, the existing
implementation returns user specified number of function
calls and BTF types if jit is enabled. If the jit is not
enabled, it only returns the type for the main function.
This is undesirable. Interpreter may still be used
and we should keep feature identical regardless of
whether jit is enabled or not.
This patch fixed this discrepancy.
Fixes: 838e96904f ("bpf: Introduce bpf_func_info")
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make fetching of the BPF call address from ppc64 JIT generic. ppc64
was using a slightly different variant rather than through the insns'
imm field encoding as the target address would not fit into that space.
Therefore, the target subprog number was encoded into the insns' offset
and fetched through fp->aux->func[off]->bpf_func instead. Given there
are other JITs with this issue and the mechanism of fetching the address
is JIT-generic, move it into the core as a helper instead. On the JIT
side, we get information on whether the retrieved address is a fixed
one, that is, not changing through JIT passes, or a dynamic one. For
the former, JITs can optimize their imm emission because this doesn't
change jump offsets throughout JIT process.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Tested-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is a spelling mistake in a btf_verifier_log_member message,
fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Building tags produces warning:
ctags: Warning: kernel/bpf/local_storage.c:10: null expansion of name pattern "\1"
Let's use the same fix as in commit 25528213fe ("tags: Fix DEFINE_PER_CPU
expansions"), even though it violates the usual code style.
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
At LPC 2018 in Vancouver, Vlad Dumitrescu mentioned that longest_prefix_match()
has a high cost [1].
One reason for that cost is a loop handling one byte at a time.
We can handle more bytes at a time, if enough attention is paid
to endianness.
I was able to remove ~55 % of longest_prefix_match() cpu costs.
[1] https://linuxplumbersconf.org/event/2/contributions/88/attachments/76/87/lpc-bpf-2018-shaping.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Dumitrescu <vladum@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch added interface to load a program with the following
additional information:
. prog_btf_fd
. func_info, func_info_rec_size and func_info_cnt
where func_info will provide function range and type_id
corresponding to each function.
The func_info_rec_size is introduced in the UAPI to specify
struct bpf_func_info size passed from user space. This
intends to make bpf_func_info structure growable in the future.
If the kernel gets a different bpf_func_info size from userspace,
it will try to handle user request with part of bpf_func_info
it can understand. In this patch, kernel can understand
struct bpf_func_info {
__u32 insn_offset;
__u32 type_id;
};
If user passed a bpf func_info record size of 16 bytes, the
kernel can still handle part of records with the above definition.
If verifier agrees with function range provided by the user,
the bpf_prog ksym for each function will use the func name
provided in the type_id, which is supposed to provide better
encoding as it is not limited by 16 bytes program name
limitation and this is better for bpf program which contains
multiple subprograms.
The bpf_prog_info interface is also extended to
return btf_id, func_info, func_info_rec_size and func_info_cnt
to userspace, so userspace can print out the function prototype
for each xlated function. The insn_offset in the returned
func_info corresponds to the insn offset for xlated functions.
With other jit related fields in bpf_prog_info, userspace can also
print out function prototypes for each jited function.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
to support the function debug info.
BTF_KIND_FUNC_PROTO must not have a name (i.e. !t->name_off)
and it is followed by >= 0 'struct bpf_param' objects to
describe the function arguments.
The BTF_KIND_FUNC must have a valid name and it must
refer back to a BTF_KIND_FUNC_PROTO.
The above is the conclusion after the discussion between
Edward Cree, Alexei, Daniel, Yonghong and Martin.
By combining BTF_KIND_FUNC and BTF_LIND_FUNC_PROTO,
a complete function signature can be obtained. It will be
used in the later patches to learn the function signature of
a running bpf program.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch breaks up btf_type_is_void() into
btf_type_is_void() and btf_type_is_fwd().
It also adds btf_type_nosize() to better describe it is
testing a type has nosize info.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a new flag BPF_F_ZERO_SEED, which forces a hash map
to initialize the seed to zero. This is useful when doing
performance analysis both on individual BPF programs, as
well as the kernel's hash table implementation.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When patching in a new sequence for the first insn of a subprog, the start
of that subprog does not change (it's the first insn of the sequence), so
adjust_subprog_starts should check start <= off (rather than < off).
Also added a test to test_verifier.c (it's essentially the syz reproducer).
Fixes: cc8b0b92a1 ("bpf: introduce function calls (function boundaries)")
Reported-by: syzbot+4fc427c7af994b0948be@syzkaller.appspotmail.com
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pointer offload is being null checked however the following statement
dereferences the potentially null pointer offload when assigning
offload->dev_state. Fix this by only assigning it if offload is not
null.
Detected by CoverityScan, CID#1475437 ("Dereference after null check")
Fixes: 00db12c3d1 ("bpf: call verifier_prep from its callback in struct bpf_offload_dev")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently BPF verifier allows narrow loads for a context field only with
offset zero. E.g. if there is a __u32 field then only the following
loads are permitted:
* off=0, size=1 (narrow);
* off=0, size=2 (narrow);
* off=0, size=4 (full).
On the other hand LLVM can generate a load with offset different than
zero that make sense from program logic point of view, but verifier
doesn't accept it.
E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
#define DST_IP4 0xC0A801FEU /* 192.168.1.254 */
...
if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
where ctx is struct bpf_sock_addr.
Some versions of LLVM can produce the following byte code for it:
8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7)
9: 67 02 00 00 18 00 00 00 r2 <<= 24
10: 18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00 r3 = 4261412864 ll
12: 5d 32 07 00 00 00 00 00 if r2 != r3 goto +7 <LBB0_6>
where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
rejected by verifier.
Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok()
what means any is_valid_access implementation, that uses the function,
works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
sock_addr_is_valid_access() for bpf_sock_addr.
The patch makes such loads supported. Offset can be in [0; size_default)
but has to be multiple of load size. E.g. for __u32 field the following
loads are supported now:
* off=0, size=1 (narrow);
* off=1, size=1 (narrow);
* off=2, size=1 (narrow);
* off=3, size=1 (narrow);
* off=0, size=2 (narrow);
* off=2, size=2 (narrow);
* off=0, size=4 (full).
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The kernel functions to prepare verifier and translate for offloaded
program retrieve "offload" from "prog", and "netdev" from "offload".
Then both "prog" and "netdev" are passed to the callbacks.
Simplify this by letting the drivers retrieve the net device themselves
from the offload object attached to prog - if they need it at all. There
is currently no need to pass the netdev as an argument to those
functions.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Function bpf_prog_offload_verifier_prep(), called from the kernel BPF
verifier to run a driver-specific callback for preparing for the
verification step for offloaded programs, takes a pointer to a struct
bpf_verifier_env object. However, no driver callback needs the whole
structure at this time: the two drivers supporting this, nfp and
netdevsim, only need a pointer to the struct bpf_prog instance held by
env.
Update the callback accordingly, on kernel side and in these two
drivers.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As part of the transition from ndo_bpf() to callbacks attached to struct
bpf_offload_dev for some of the eBPF offload operations, move the
functions related to program destruction to the struct and remove the
subcommand that was used to call them through the NDO.
Remove function __bpf_offload_ndo(), which is no longer used.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As part of the transition from ndo_bpf() to callbacks attached to struct
bpf_offload_dev for some of the eBPF offload operations, move the
functions related to code translation to the struct and remove the
subcommand that was used to call them through the NDO.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In a way similar to the change previously brought to the verify_insn
hook and to the finalize callback, switch to the newly added ops in
struct bpf_prog_offload for calling the functions used to prepare driver
verifiers.
Since the dev_ops pointer in struct bpf_prog_offload is no longer used
by any callback, we can now remove it from struct bpf_prog_offload.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In a way similar to the change previously brought to the verify_insn
hook, switch to the newly added ops in struct bpf_prog_offload for
calling the functions used to perform final verification steps for
offloaded programs.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We intend to remove the dev_ops in struct bpf_prog_offload, and to only
keep the ops in struct bpf_offload_dev instead, which is accessible from
more locations for passing function pointers.
But dev_ops is used for calling the verify_insn hook. Switch to the
newly added ops in struct bpf_prog_offload instead.
To avoid table lookups for each eBPF instruction to verify, we remember
the offdev attached to a netdev and modify bpf_offload_find_netdev() to
avoid performing more than once a lookup for a given offload object.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For passing device functions for offloaded eBPF programs, there used to
be no place where to store the pointer without making the non-offloaded
programs pay a memory price.
As a consequence, three functions were called with ndo_bpf() through
specific commands. Now that we have struct bpf_offload_dev, and since
none of those operations rely on RTNL, we can turn these three commands
into hooks inside the struct bpf_prog_offload_ops, and pass them as part
of bpf_offload_dev_create().
This commit effectively passes a pointer to the struct to
bpf_offload_dev_create(). We temporarily have two struct
bpf_prog_offload_ops instances, one under offdev->ops and one under
offload->dev_ops. The next patches will make the transition towards the
former, so that offload->dev_ops can be removed, and callbacks relying
on ndo_bpf() added to offdev->ops as well.
While at it, rename "nfp_bpf_analyzer_ops" as "nfp_bpf_dev_ops" (and
similarly for netdevsim).
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In check_packet_access, update max_pkt_offset after the offset has passed
__check_packet_access.
It should be safe to use u32 for max_pkt_offset as explained in code
comment.
Also, when there is tail call, the max_pkt_offset of the called program is
unknown, so conservatively set max_pkt_offset to MAX_PACKET_OFF for such
case.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While dbecd73884 ("bpf: get kernel symbol addresses via syscall")
zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries
from unprivileged users, commit 815581c11c ("bpf: get JITed image
lengths of functions via syscall") forgot about doing so and therefore
returns the #elems of the user set up buffer which is incorrect. It
also needs to indicate a info.nr_jited_func_lens of zero.
Fixes: 815581c11c ("bpf: get JITed image lengths of functions via syscall")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Sandipan Das <sandipan@linux.vnet.ibm.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when there is no subprog (prog->aux->func_cnt == 0),
bpf_prog_info does not return any jited_ksyms or jited_func_lens. This
patch adds main program address (prog->bpf_func) and main program
length (prog->jited_len) to bpf_prog_info.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, jited_ksyms in bpf_prog_info shows page addresses of jited
bpf program. The main reason here is to not expose randomized start
address. However, this is not ideal for detailed profiling (find hot
instructions from stack traces). This patch replaces the page address
with real prog start address.
This change is OK because bpf_prog_get_info_by_fd() is only available
to root.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, /proc/kallsyms shows page address of jited bpf program. The
main reason here is to not expose randomized start address. However,
This is not ideal for detailed profiling (find hot instructions from
stack traces). This patch replaces the page address with real prog start
address.
This change is OK because these addresses are still protected by sysctl
kptr_restrict (see kallsyms_show_value()), and only programs loaded by
root are added to kallsyms (see bpf_prog_kallsyms_add()).
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In the verifier there is no such semantics where registers with
PTR_TO_MAP_VALUE type have an id assigned to them. This is only
used in PTR_TO_MAP_VALUE_OR_NULL and later on nullified once the
test against NULL has been pattern matched and type transformed
into PTR_TO_MAP_VALUE.
Fixes: 3e6a4b3e02 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Roman Gushchin <guro@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
ALU operations on pointers such as scalar_reg += map_value_ptr are
handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
and range in the register state share a union, so transferring state
through dst_reg->range = ptr_reg->range is just buggy as any new
map_ptr in the dst_reg is then truncated (or null) for subsequent
checks. Fix this by adding a raw member and use it for copying state
over to dst_reg.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Rick reported that the BPF JIT could potentially fill the entire module
space with BPF programs from unprivileged users which would prevent later
attempts to load normal kernel modules or privileged BPF programs, for
example. If JIT was enabled but unsuccessful to generate the image, then
before commit 290af86629 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
we would always fall back to the BPF interpreter. Nowadays in the case
where the CONFIG_BPF_JIT_ALWAYS_ON could be set, then the load will abort
with a failure since the BPF interpreter was compiled out.
Add a global limit and enforce it for unprivileged users such that in case
of BPF interpreter compiled out we fail once the limit has been reached
or we fall back to BPF interpreter earlier w/o using module mem if latter
was compiled in. In a next step, fair share among unprivileged users can
be resolved in particular for the case where we would fail hard once limit
is reached.
Fixes: 290af86629 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
Fixes: 0a14842f5a ("net: filter: Just In Time compiler for x86-64")
Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Given this seems to be quite fragile and can easily slip through the
cracks, lets make direct packet write more robust by requiring that
future program types which allow for such write must provide a prologue
callback. In case of XDP and sk_msg it's noop, thus add a generic noop
handler there. The latter starts out with NULL data/data_end unconditionally
when sg pages are shared.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit f1a2e44a3a ("bpf: add queue and stack maps") added helpers
with ARG_PTR_TO_UNINIT_MAP_VALUE. Meaning, the helper is supposed to
fill the map value buffer with data instead of reading from it like
in other helpers such as map update. However, given the buffer is
allowed to be uninitialized (since we fill it in the helper anyway),
it also means that the helper is obliged to wipe the memory in case
of an error in order to not allow for leaking uninitialized memory.
Given pop/peek is both handled inside __{stack,queue}_map_get(),
lets wipe it there on error case, that is, empty stack/queue.
Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Mauricio Vasquez B<mauricio.vasquez@polito.it>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit f1a2e44a3a ("bpf: add queue and stack maps") probably just
copy-pasted .pkt_access for bpf_map_{pop,peek}_elem() helpers, but
this is buggy in this context since it would allow writes into cloned
skbs which is invalid. Therefore, disable .pkt_access for the two.
Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Mauricio Vasquez B<mauricio.vasquez@polito.it>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit b39b5f411d ("bpf: add cg_skb_is_valid_access for
BPF_PROG_TYPE_CGROUP_SKB") added direct packet access for skbs in
cg_skb program types, however allowed access type was not added to
the may_access_direct_pkt_data() helper. Therefore the latter always
returns false. This is not directly an issue, it just means writes
are unconditionally disabled (which is correct) but also reads.
Latter is relevant in this function when BPF helpers may read direct
packet data which is unconditionally disabled then. Fix it by properly
adding BPF_PROG_TYPE_CGROUP_SKB to may_access_direct_pkt_data().
Fixes: b39b5f411d ("bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit d58e468b11 ("flow_dissector: implements flow dissector BPF
hook") added direct packet access for skbs in may_access_direct_pkt_data()
function where this enables read and write access to the skb->data. This
is buggy because without a prologue generator such as bpf_unclone_prologue()
we would allow for writing into cloned skbs. Original intention might have
been to only allow read access where this is not needed (similar as the
flow_dissector_func_proto() indicates which enables only bpf_skb_load_bytes()
as well), therefore this patch fixes it to restrict to read-only.
Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Wenwen Wang reported:
In btf_parse(), the header of the user-space btf data 'btf_data'
is firstly parsed and verified through btf_parse_hdr().
In btf_parse_hdr(), the header is copied from user-space 'btf_data'
to kernel-space 'btf->hdr' and then verified. If no error happens
during the verification process, the whole data of 'btf_data',
including the header, is then copied to 'data' in btf_parse(). It
is obvious that the header is copied twice here. More importantly,
no check is enforced after the second copy to make sure the headers
obtained in these two copies are same. Given that 'btf_data' resides
in the user space, a malicious user can race to modify the header
between these two copies. By doing so, the user can inject
inconsistent data, which can cause undefined behavior of the
kernel and introduce potential security risk.
This issue is similar to the one fixed in commit 8af03d1ae2 ("bpf:
btf: Fix a missing check bug"). To fix it, this patch copies the user
'btf_data' *before* parsing / verifying the BTF header.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Co-developed-by: Wenwen Wang <wang6495@umn.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The dev_map_notification() removes interface in devmap if
unregistering interface's ifindex is same.
But only checking ifindex is not enough because other netns can have
same ifindex. so that wrong interface selection could occurred.
Hence netdev pointer comparison code is added.
v2: compare netdev pointer instead of using net_eq() (Daniel Borkmann)
v1: Initial patch
Fixes: 2ddf71e23c ("net: add notifier hooks for devmap bpf map")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-21
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Implement two new kind of BPF maps, that is, queue and stack
map along with new peek, push and pop operations, from Mauricio.
2) Add support for MSG_PEEK flag when redirecting into an ingress
psock sk_msg queue, and add a new helper bpf_msg_push_data() for
insert data into the message, from John.
3) Allow for BPF programs of type BPF_PROG_TYPE_CGROUP_SKB to use
direct packet access for __skb_buff, from Song.
4) Use more lightweight barriers for walking perf ring buffer for
libbpf and perf tool as well. Also, various fixes and improvements
from verifier side, from Daniel.
5) Add per-symbol visibility for DSO in libbpf and hide by default
global symbols such as netlink related functions, from Andrey.
6) Two improvements to nfp's BPF offload to check vNIC capabilities
in case prog is shared with multiple vNICs and to protect against
mis-initializing atomic counters, from Jakub.
7) Fix for bpftool to use 4 context mode for the nfp disassembler,
also from Jakub.
8) Fix a return value comparison in test_libbpf.sh and add several
bpftool improvements in bash completion, documentation of bpf fs
restrictions and batch mode summary print, from Quentin.
9) Fix a file resource leak in BPF selftest's load_kallsyms()
helper, from Peng.
10) Fix an unused variable warning in map_lookup_and_delete_elem(),
from Alexei.
11) Fix bpf_skb_adjust_room() signature in BPF UAPI helper doc,
from Nicolas.
12) Add missing executables to .gitignore in BPF selftests, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend prior work from 09772d92cd ("bpf: avoid retpoline for
lookup/update/delete calls on maps") to also apply to the recently
added map helpers that perform push/pop/peek operations so that
the indirect call can be avoided.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
They PTR_TO_FLOW_KEYS is not used today to be passed into a helper
as memory, so it can be removed from check_helper_mem_access().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We should not enable xadd operation for flow key memory if not
needed there anyway. There is no such issue as described in the
commit f37a8cb84c ("bpf: reject stores into ctx via st and xadd")
since there's no context rewriter for flow keys today, but it
also shouldn't become part of the user facing behavior to allow
for it. After patch:
0: (79) r7 = *(u64 *)(r1 +144)
1: (b7) r3 = 4096
2: (db) lock *(u64 *)(r7 +0) += r3
BPF_XADD stores into R7 flow_keys is not allowed
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Using reg_type_str[insn->dst_reg] is incorrect since insn->dst_reg
contains the register number but not the actual register type. Add
a small reg_state() helper and use it to get to the type. Also fix
up the test_verifier test cases that have an incorrect errstr.
Fixes: 9d2be44a7f ("bpf: Reuse canonical string formatter for ctx errs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct access of skb for these programs.
Two helper functions bpf_compute_and_save_data_end() and
bpf_restore_data_end() are introduced. There are used in
__cgroup_bpf_run_filter_skb(), to compute proper data_end for the
BPF program, and restore original data afterwards.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The previous patch implemented a bpf queue/stack maps that
provided the peek/pop/push functions. There is not a direct
relationship between those functions and the current maps
syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added,
this is mapped to the pop operation in the queue/stack maps
and it is still to implement in other kind of maps.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map[peek/pop/push] helpers. Those operations
are exposed to userspace applications through the already existing
syscalls in the following way:
BPF_MAP_LOOKUP_ELEM -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM -> push
Queue/stack maps are implemented using a buffer, tail and head indexes,
hence BPF_F_NO_PREALLOC is not supported.
As opposite to other maps, queue and stack do not use RCU for protecting
maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
argument that is a pointer to a memory zone where to save the value of a
map. Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
be passed as an extra argument.
Our main motivation for implementing queue/stack maps was to keep track
of a pool of elements, like network ports in a SNAT, however we forsee
other use cases, like for exampling saving last N kernel events in a map
and then analysing from userspace.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone
used to save the value of a map. Basically the same as
ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra
argument.
This will be used in the following patch that implements some new
helpers that receive a pointer to be filled with a map value.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit adds the required logic to allow key being NULL
in case the key_size of the map is 0.
A new __bpf_copy_key function helper only copies the key from
userpsace when key_size != 0, otherwise it enforces that key must be
null.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the following patches queue and stack maps (FIFO and LIFO
datastructures) will be implemented. In order to avoid confusion and
a possible name clash rename stack_map_ops to stack_trace_map_ops
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.
net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'. Thanks to David
Ahern for the heads up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In order to prepare sockmap logic to be used in combination with kTLS
we need to detangle it from ULP, and further split it in later commits
into a generic API.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The map-in-map frequently serves as a mechanism for atomic
snapshotting of state that a BPF program might record. The current
implementation is dangerous to use in this way, however, since
userspace has no way of knowing when all programs that might have
retrieved the "old" value of the map may have completed.
This change ensures that map update operations on map-in-map map types
always wait for all references to the old map to drop before returning
to userspace.
Signed-off-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The XSKMAP update and delete functions called synchronize_net(), which
can sleep. It is not allowed to sleep during an RCU read section.
Instead we need to make sure that the sock sk_destruct (xsk_destruct)
function is asynchronously called after an RCU grace period. Setting
the SOCK_RCU_FREE flag for XDP sockets takes care of this.
Fixes: fbfc504a24 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The error value returned by map_lookup_elem doesn't differentiate
whether lookup was failed because of invalid key or lookup is not
supported.
Lets add handling for -EOPNOTSUPP return value of map_lookup_elem()
method of map, with expectation from map's implementation that it
should return -EOPNOTSUPP if lookup is not supported.
The errno for bpf syscall for BPF_MAP_LOOKUP_ELEM command will be set
to EOPNOTSUPP if map lookup is not supported.
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In btf_parse_hdr(), the length of the btf data header is firstly copied
from the user space to 'hdr_len' and checked to see whether it is larger
than 'btf_data_size'. If yes, an error code EINVAL is returned. Otherwise,
the whole header is copied again from the user space to 'btf->hdr'.
However, after the second copy, there is no check between
'btf->hdr->hdr_len' and 'hdr_len' to confirm that the two copies get the
same value. Given that the btf data is in the user space, a malicious user
can race to change the data between the two copies. By doing so, the user
can provide malicious data to the kernel and cause undefined behavior.
This patch adds a necessary check after the second copy, to make sure
'btf->hdr->hdr_len' has the same value as 'hdr_len'. Otherwise, an error
code EINVAL will be returned.
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-10-08
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer which allow
BPF programs to perform lookups for sockets in a network namespace. This would
allow programs to determine early on in processing whether the stack is
expecting to receive the packet, and perform some action (eg drop,
forward somewhere) based on this information.
2) per-cpu cgroup local storage from Roman Gushchin.
Per-cpu cgroup local storage is very similar to simple cgroup storage
except all the data is per-cpu. The main goal of per-cpu variant is to
implement super fast counters (e.g. packet counters), which don't require
neither lookups, neither atomic operations in a fast path.
The example of these hybrid counters is in selftests/bpf/netcnt_prog.c
3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet
4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski
5) rename of libbpf interfaces from Andrey Ignatov.
libbpf is maturing as a library and should follow good practices in
library design and implementation to play well with other libraries.
This patch set brings consistent naming convention to global symbols.
6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov
to let Apache2 projects use libbpf
7) various AF_XDP fixes from Björn and Magnus
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that there is at least one driver supporting BPF-to-BPF function
calls, lift the restriction, in the verifier, on hardware offload of
eBPF programs containing such calls. But prevent jit_subprogs(), still
in the verifier, from being run for offloaded programs.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In preparation for BPF-to-BPF calls in offloaded programs, add a new
function attribute to the struct bpf_prog_offload_ops so that drivers
supporting eBPF offload can hook at the end of program verification, and
potentially extract information collected by the verifier.
Implement a minimal callback (returning 0) in the drivers providing the
structs, namely netdevsim and nfp.
This will be useful in the nfp driver, in later commits, to extract the
number of subprograms as well as the stack depth for those subprograms.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When I wrote commit 468f6eafa6 ("bpf: fix 32-bit ALU op verification"), I
assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
is sufficient to just truncate the output to 32 bits; and so I just moved
the register size coercion that used to be at the start of the function to
the end of the function.
That assumption is true for almost every op, but not for 32-bit right
shifts, because those can propagate information towards the least
significant bit. Fix it by always truncating inputs for 32-bit ops to 32
bits.
Also get rid of the coerce_reg_to_size() after the ALU op, since that has
no effect.
Fixes: 468f6eafa6 ("bpf: fix 32-bit ALU op verification")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
socket listening on this host, and returns a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.
By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:
struct bpf_sock_tuple tuple;
struct bpf_sock_ops *sk;
populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
if (!sk) {
// Couldn't find a socket listening for this traffic. Drop.
return TC_ACT_SHOT;
}
bpf_sk_release(sk, 0);
return TC_ACT_OK;
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.
To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state', then when the function or program exits,
verifies that all of the acquired references have been freed. When the
pointer is passed to a function that frees the reference, it is removed
from the 'bpf_func_state` and all existing copies of the pointer in
registers are marked invalid.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros to
simplify it.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The array "reg_type_str" provides canonical formatting of register
types, however a couple of places would previously check whether a
register represented the context and write the name "context" directly.
An upcoming commit will add another pointer type to these statements, so
to provide more accurate error messages in the verifier, update these
error messages to use "reg_type_str" instead.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add this iterator for spilled registers, it concentrates the details of
how to get the current frame's spilled registers into a single macro
while clarifying the intention of the code which is calling the macro.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Explicitly forbid creating cgroup local storage maps with zero value
size, as it makes no sense and might even cause a panic.
Reported-by: syzbot+18628320d3b14a5c459c@syzkaller.appspotmail.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Explicitly forbid creating map of per-cpu cgroup local storages.
This behavior matches the behavior of shared cgroup storages.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit introduced per-cpu cgroup local storage.
Per-cpu cgroup local storage is very similar to simple cgroup storage
(let's call it shared), except all the data is per-cpu.
The main goal of per-cpu variant is to implement super fast
counters (e.g. packet counters), which don't require neither
lookups, neither atomic operations.
>From userspace's point of view, accessing a per-cpu cgroup storage
is similar to other per-cpu map types (e.g. per-cpu hashmaps and
arrays).
Writing to a per-cpu cgroup storage is not atomic, but is performed
by copying longs, so some minimal atomicity is here, exactly
as with other per-cpu maps.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
To simplify the following introduction of per-cpu cgroup storage,
let's rework a bit a mechanism of passing a pointer to a cgroup
storage into the bpf_get_local_storage(). Let's save a pointer
to the corresponding bpf_cgroup_storage structure, instead of
a pointer to the actual buffer.
It will help us to handle per-cpu storage later, which has
a different way of accessing to the actual data.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In order to introduce per-cpu cgroup storage, let's generalize
bpf cgroup core to support multiple cgroup storage types.
Potentially, per-node cgroup storage can be added later.
This commit is mostly a formal change that replaces
cgroup_storage pointer with a array of cgroup_storage pointers.
It doesn't actually introduce a new storage type,
it will be done later.
Each bpf program is now able to have one cgroup storage of each type.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
cgroup_storage_update_elem() shouldn't accept any flags
argument values except BPF_ANY and BPF_EXIST to guarantee
the backward compatibility, had a new flag value been added.
Fixes: de9cbbaadb ("bpf: introduce cgroup storage maps")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, helper bpf_get_current_cgroup_id() is not permitted
for CGROUP_DEVICE type of programs. If the helper is used
in such cases, the verifier will log the following error:
0: (bf) r6 = r1
1: (69) r7 = *(u16 *)(r6 +0)
2: (85) call bpf_get_current_cgroup_id#80
unknown func bpf_get_current_cgroup_id#80
The bpf_get_current_cgroup_id() is useful for CGROUP_DEVICE
type of programs in order to customize action based on cgroup id.
This patch added such a support.
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-09-25
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow for RX stack hardening by implementing the kernel's flow
dissector in BPF. Idea was originally presented at netconf 2017 [0].
Quote from merge commit:
[...] Because of the rigorous checks of the BPF verifier, this
provides significant security guarantees. In particular, the BPF
flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate.
It cannot read outside of packet bounds, because all memory accesses
are checked. Also, with BPF the administrator can decide which
protocols to support, reducing potential attack surface. Rarely
encountered protocols can be excluded from dissection and the
program can be updated without kernel recompile or reboot if a
bug is discovered. [...]
Also, a sample flow dissector has been implemented in BPF as part
of this work, from Petar and Willem.
[0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
2) Add support for bpftool to list currently active attachment
points of BPF networking programs providing a quick overview
similar to bpftool's perf subcommand, from Yonghong.
3) Fix a verifier pruning instability bug where a union member
from the register state was not cleared properly leading to
branches not being pruned despite them being valid candidates,
from Alexei.
4) Various smaller fast-path optimizations in XDP's map redirect
code, from Jesper.
5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
in bpftool, from Roman.
6) Remove a duplicate check in libbpf that probes for function
storage, from Taeung.
7) Fix an issue in test_progs by avoid checking for errno since
on success its value should not be checked, from Mauricio.
8) Fix unused variable warning in bpf_getsockopt() helper when
CONFIG_INET is not configured, from Anders.
9) Fix a compilation failure in the BPF sample code's use of
bpf_flow_keys, from Prashant.
10) Minor cleanups in BPF code, from Yue and Zhong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Version bump conflict in batman-adv, take what's in net-next.
iavf conflict, adjustment of netdev_ops in net-next conflicting
with poll controller method removal in net.
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible (via shutdown()) for TCP socks to go trough TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call our bpf_tcp_close() callback. Because of this a user
could disconnect a socket then put it in a LISTEN state which would
break our assumptions about sockets always being ESTABLISHED state.
To resolve this rely on the unhash hook, which is called in the
disconnect case, to remove the sock from the sockmap.
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. By allowing sock_ops events we allow users to
manage sockmaps directly from sock ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.
Similar to TLS ULP this ensures sk_user_data is correct.
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
consume_skb has taken the null pointer into account. hence it is safe
to remove the redundant null pointer check before consume_skb.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Two new tls tests added in parallel in both net and net-next.
Used Stephen Rothwell's linux-next resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Subtraction of pointers was accidentally allowed for unpriv programs
by commit 82abbf8d2f. Revert that part of commit.
Fixes: 82abbf8d2f ("bpf: do not allow root to mangle valid pointers")
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The end boundary math for type section is incorrect in
btf_check_all_metas(). It just happens that hdr->type_off
is always 0 for now because there are only two sections
(type and string) and string section must be at the end (ensured
in btf_parse_str_sec).
However, type_off may not be 0 if a new section would be added later.
This patch fixes it.
Fixes: f80442a4cd ("bpf: btf: Change how section is supported in btf_header")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Added bpffs pretty print for program array map. For a particular
array index, if the program array points to a valid program,
the "<index>: <prog_id>" will be printed out like
0: 6
which means bpf program with id "6" is installed at index "0".
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Edward Cree says:
In check_mem_access(), for the PTR_TO_CTX case, after check_ctx_access()
has supplied a reg_type, the other members of the register state are set
appropriately. Previously reg.range was set to 0, but as it is in a
union with reg.map_ptr, which is larger, upper bytes of the latter were
left in place. This then caused the memcmp() in regsafe() to fail,
preventing some branches from being pruned (and occasionally causing the
same program to take a varying number of processed insns on repeated
verifier runs).
Fix the instability by clearing bpf_reg_state in __mark_reg_[un]known()
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Debugged-by: Edward Cree <ecree@solarflare.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently we check sk_user_data is non NULL to determine if the sk
exists in a map. However, this is not sufficient to ensure the psock
or the ULP ops are not in use by another user, such as kcm or TLS. To
avoid this when adding a sock to a map also verify it is of the
correct ULP type. Additionally, when releasing a psock verify that
it is the TCP_ULP_BPF type before releasing the ULP. The error case
where we abort an update due to ULP collision can cause this error
path.
For example,
__sock_map_ctx_update_elem()
[...]
err = tcp_set_ulp_id(sock, TCP_ULP_BPF) <- collides with TLS
if (err) <- so err out here
goto out_free
[...]
out_free:
smap_release_sock() <- calling tcp_cleanup_ulp releases the
TLS ULP incorrectly.
Fixes: 2f857d0460 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Added bpffs pretty print for percpu arraymap, percpu hashmap
and percpu lru hashmap.
For each map <key, value> pair, the format is:
<key_value>: {
cpu0: <value_on_cpu0>
cpu1: <value_on_cpu1>
...
cpun: <value_on_cpun>
}
For example, on my VM, there are 4 cpus, and
for test_btf test in the next patch:
cat /sys/fs/bpf/pprint_test_percpu_hash
You may get:
...
43602: {
cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
}
72847: {
cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
}
...
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If a stack slot does not hold a spilled register (STACK_SPILL), then each
of its eight bytes could potentially have a different slot_type. This
information can be important for debugging, and previously we either did
not print anything for the stack slot, or just printed fp-X=0 in the case
where its first byte was STACK_ZERO.
Instead, print eight characters with either 0 (STACK_ZERO), m (STACK_MISC)
or ? (STACK_INVALID) for any stack slot which is neither STACK_SPILL nor
entirely STACK_INVALID.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
By giving each register its own liveness chain, we elide the skip_callee()
logic. Instead, each register's parent is the state it inherits from;
both check_func_call() and prepare_func_exit() automatically connect
reg states to the correct chain since when they copy the reg state across
(r1-r5 into the callee as args, and r0 out as the return value) they also
copy the parent pointer.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when a redirect occurs in sockmap and an error occurs in
the redirect call we unwind the scatterlist once in the error path
of bpf_tcp_sendmsg_do_redirect() and then again in sendmsg(). Then
in the error path of sendmsg we decrement the copied count by the
send size.
However, its possible we partially sent data before the error was
generated. This can happen if do_tcp_sendpages() partially sends the
scatterlist before encountering a memory pressure error. If this
happens we need to decrement the copied value (the value tracking
how many bytes were actually sent to TCP stack) by the number of
remaining bytes _not_ the entire send size. Otherwise we risk
confusing userspace.
Also we don't need two calls to free the scatterlist one is
good enough. So remove the one in bpf_tcp_sendmsg_do_redirect() and
then properly reduce copied by the number of remaining bytes which
may in fact be the entire send size if no bytes were sent.
To do this use bool to indicate if free_start_sg() should do mem
accounting or not.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In bpf_tcp_recvmsg() we first took a reference on the psock, however
once we find that there are skbs in the normal socket's receive queue
we return with processing them through tcp_recvmsg(). Problem is that
we leak the taken reference on the psock in that path. Given we don't
really do anything with the psock at this point, move the skb_queue_empty()
test before we fetch the psock to fix this case.
Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_tcp_close() we pop the psock linkage to a map via psock_map_pop().
A parallel update on the sock hash map can happen between psock_map_pop()
and lookup_elem_raw() where we override the element under link->hash /
link->key. In bpf_tcp_close()'s lookup_elem_raw() we subsequently only
test whether an element is present, but we do not test whether the
element is infact the element we were looking for.
We lock the sock in bpf_tcp_close() during that time, so do we hold
the lock in sock_hash_update_elem(). However, the latter locks the
sock which is newly updated, not the one we're purging from the hash
table. This means that while one CPU is doing the lookup from bpf_tcp_close(),
another CPU is doing the map update in parallel, dropped our sock from
the hlist and released the psock.
Subsequently the first CPU will find the new sock and attempts to drop
and release the old sock yet another time. Fix is that we need to check
the elements for a match after lookup, similar as we do in the sock map.
Note that the hash tab elems are freed via RCU, so access to their
link->hash / link->key is fine since we're under RCU read side there.
Fixes: e9db4ef6bf ("bpf: sockhash fix omitted bucket lock in sock_close")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
All BPF hash and LRU maps currently have a known and global seed
we feed into jhash() which is 0. This is suboptimal, thus fix it
by generating a random seed upon hashtab setup time which we can
later on feed into jhash() on lookup, update and deletions.
Fixes: 0f8e4bd8a1 ("bpf: add hashtable type of eBPF maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
When sockmap code is using the stream parser it also handles the write
space events in order to handle the case where (a) verdict redirects
skb to another socket and (b) the sockmap then sends the skb but due
to memory constraints (or other EAGAIN errors) needs to do a retry.
But the initial code missed a third case where the
skb_send_sock_locked() triggers an sk_wait_event(). A typically case
would be when sndbuf size is exceeded. If this happens because we
do not pass the write_space event to the lower layers we never wake
up the event and it will wait for sndtimeo. Which as noted in ktls
fix may be rather large and look like a hang to the user.
To reproduce the best test is to reduce the sndbuf size and send
1B data chunks to stress the memory handling. To fix this pass the
event from the upper layer to the lower layer.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When we try to allocate a new sock hash entry and the allocation
fails, then sock hash map fails to reduce the map element counter,
meaning we keep accounting this element although it was never used.
Fix it by dropping the element counter on error.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Currently, it is possible to create a sock hash map with key size
of 0 and have the kernel return a fd back to user space. This is
invalid for hash maps (and kernel also hasn't been tested for zero
key size support in general at this point). Thus, reject such
configuration.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Commits 109980b894 ("bpf: don't select potentially stale ri->map
from buggy xdp progs") and 7c30013133 ("bpf: fix ri->map_owner
pointer on bpf_prog_realloc") tried to mitigate that buggy programs
using bpf_redirect_map() helper call do not leave stale maps behind.
Idea was to add a map_owner cookie into the per CPU struct redirect_info
which was set to prog->aux by the prog making the helper call as a
proof that the map is not stale since the prog is implicitly holding
a reference to it. This owner cookie could later on get compared with
the program calling into BPF whether they match and therefore the
redirect could proceed with processing the map safely.
In (obvious) hindsight, this approach breaks down when tail calls are
involved since the original caller's prog->aux pointer does not have
to match the one from one of the progs out of the tail call chain,
and therefore the xdp buffer will be dropped instead of redirected.
A way around that would be to fix the issue differently (which also
allows to remove related work in fast path at the same time): once
the life-time of a redirect map has come to its end we use it's map
free callback where we need to wait on synchronize_rcu() for current
outstanding xdp buffers and remove such a map pointer from the
redirect info if found to be present. At that time no program is
using this map anymore so we simply invalidate the map pointers to
NULL iff they previously pointed to that instance while making sure
that the redirect path only reads out the map once.
Fixes: 97f91a7cf0 ("bpf: add bpf_redirect_map helper routine")
Fixes: 109980b894 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current code in sock_map_ctx_update_elem() allows for BPF_EXIST
and BPF_NOEXIST map update flags. While on array-like maps this approach
is rather uncommon, e.g. bpf_fd_array_map_update_elem() and others
enforce map update flags to be BPF_ANY such that xchg() can be used
directly, the current implementation in sock map does not guarantee
that such operation with BPF_EXIST / BPF_NOEXIST is atomic.
The initial test does a READ_ONCE(stab->sock_map[i]) to fetch the
socket from the slot which is then tested for NULL / non-NULL. However
later after __sock_map_ctx_update_elem(), the actual update is done
through osock = xchg(&stab->sock_map[i], sock). Problem is that in
the meantime a different CPU could have updated / deleted a socket
on that specific slot and thus flag contraints won't hold anymore.
I've been thinking whether best would be to just break UAPI and do
an enforcement of BPF_ANY to check if someone actually complains,
however trouble is that already in BPF kselftest we use BPF_NOEXIST
for the map update, and therefore it might have been copied into
applications already. The fix to keep the current behavior intact
would be to add a map lock similar to the sock hash bucket lock only
for covering the whole map.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The smap_start_sock() and smap_stop_sock() are each protected under
the sock->sk_callback_lock from their call-sites except in the case
of sock_map_delete_elem() where we drop the old socket from the map
slot. This is racy because the same sock could be part of multiple
sock maps, so we run smap_stop_sock() in parallel, and given at that
point psock->strp_enabled might be true on both CPUs, we might for
example wrongly restore the sk->sk_data_ready / sk->sk_write_space.
Therefore, hold the sock->sk_callback_lock as well on delete. Looks
like 2f857d0460 ("bpf: sockmap, remove STRPARSER map_flags and add
multi-map support") had this right, but later on e9db4ef6bf ("bpf:
sockhash fix omitted bucket lock in sock_close") removed it again
from delete leaving this smap_stop_sock() instance unprotected.
Fixes: e9db4ef6bf ("bpf: sockhash fix omitted bucket lock in sock_close")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While working on sockmap I noticed that we do not always kfree the
struct smap_psock_map_entry list elements which track psocks attached
to maps. In the case of sock_hash_ctx_update_elem(), these map entries
are allocated outside of __sock_map_ctx_update_elem() with their
linkage to the socket hash table filled. In the case of sock array,
the map entries are allocated inside of __sock_map_ctx_update_elem()
and added with their linkage to the psock->maps. Both additions are
under psock->maps_lock each.
Now, we drop these elements from their psock->maps list in a few
occasions: i) in sock array via smap_list_map_remove() when an entry
is either deleted from the map from user space, or updated via
user space or BPF program where we drop the old socket at that map
slot, or the sock array is freed via sock_map_free() and drops all
its elements; ii) for sock hash via smap_list_hash_remove() in exactly
the same occasions as just described for sock array; iii) in the
bpf_tcp_close() where we remove the elements from the list via
psock_map_pop() and iterate over them dropping themselves from either
sock array or sock hash; and last but not least iv) once again in
smap_gc_work() which is a callback for deferring the work once the
psock refcount hit zero and thus the socket is being destroyed.
Problem is that the only case where we kfree() the list entry is
in case iv), which at that point should have an empty list in
normal cases. So in cases from i) to iii) we unlink the elements
without freeing where they go out of reach from us. Hence fix is
to properly kfree() them as well to stop the leakage. Given these
are all handled under psock->maps_lock there is no need for deferred
RCU freeing.
I later also ran with kmemleak detector and it confirmed the finding
as well where in the state before the fix the object goes unreferenced
while after the patch no kmemleak report related to BPF showed up.
[...]
unreferenced object 0xffff880378eadae0 (size 64):
comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................
50 4d 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00 PMu]............
backtrace:
[<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
[<0000000045dd6d3c>] bpf_sock_map_update+0x29/0x60
[<00000000877723aa>] ___bpf_prog_run+0x1e1f/0x4960
[<000000002ef89e83>] 0xffffffffffffffff
unreferenced object 0xffff880378ead240 (size 64):
comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................
00 44 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00 .Du]............
backtrace:
[<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
[<0000000030e37a3a>] sock_map_update_elem+0x125/0x240
[<000000002e5ce36e>] map_update_elem+0x4eb/0x7b0
[<00000000db453cc9>] __x64_sys_bpf+0x1f9/0x360
[<0000000000763660>] do_syscall_64+0x9a/0x300
[<00000000422a2bb2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<000000002ef89e83>] 0xffffffffffffffff
[...]
Fixes: e9db4ef6bf ("bpf: sockhash fix omitted bucket lock in sock_close")
Fixes: 54fedb42c6 ("bpf: sockmap, fix smap_list_map_remove when psock is in many maps")
Fixes: 2f857d0460 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 394e40a297 ("bpf: extend bpf_prog_array to store pointers
to the cgroup storage") refactored the bpf_prog_array_copy_core()
to accommodate new structure bpf_prog_array_item which contains
bpf_prog array itself.
In the old code, we had
perf_event_query_prog_array():
mutex_lock(...)
bpf_prog_array_copy_call():
prog = rcu_dereference_check(array, 1)->progs
bpf_prog_array_copy_core(prog, ...)
mutex_unlock(...)
With the above commit, we had
perf_event_query_prog_array():
mutex_lock(...)
bpf_prog_array_copy_call():
bpf_prog_array_copy_core(array, ...):
item = rcu_dereference(array)->items;
...
mutex_unlock(...)
The new code will trigger a lockdep rcu checking warning.
The fix is to change rcu_dereference() to rcu_dereference_check()
to prevent such a warning.
Reported-by: syzbot+6e72317008eef84a216b@syzkaller.appspotmail.com
Fixes: 394e40a297 ("bpf: extend bpf_prog_array to store pointers to the cgroup storage")
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull networking updates from David Miller:
"Highlights:
- Gustavo A. R. Silva keeps working on the implicit switch fallthru
changes.
- Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
Luca Coelho.
- Re-enable ASPM in r8169, from Kai-Heng Feng.
- Add virtual XFRM interfaces, which avoids all of the limitations of
existing IPSEC tunnels. From Steffen Klassert.
- Convert GRO over to use a hash table, so that when we have many
flows active we don't traverse a long list during accumluation.
- Many new self tests for routing, TC, tunnels, etc. Too many
contributors to mention them all, but I'm really happy to keep
seeing this stuff.
- Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.
- Lots of cleanups and fixes in L2TP code from Guillaume Nault.
- Add IPSEC offload support to netdevsim, from Shannon Nelson.
- Add support for slotting with non-uniform distribution to netem
packet scheduler, from Yousuk Seung.
- Add UDP GSO support to mlx5e, from Boris Pismenny.
- Support offloading of Team LAG in NFP, from John Hurley.
- Allow to configure TX queue selection based upon RX queue, from
Amritha Nambiar.
- Support ethtool ring size configuration in aquantia, from Anton
Mikaev.
- Support DSCP and flowlabel per-transport in SCTP, from Xin Long.
- Support list based batching and stack traversal of SKBs, this is
very exciting work. From Edward Cree.
- Busyloop optimizations in vhost_net, from Toshiaki Makita.
- Introduce the ETF qdisc, which allows time based transmissions. IGB
can offload this in hardware. From Vinicius Costa Gomes.
- Add parameter support to devlink, from Moshe Shemesh.
- Several multiplication and division optimizations for BPF JIT in
nfp driver, from Jiong Wang.
- Lots of prepatory work to make more of the packet scheduler layer
lockless, when possible, from Vlad Buslov.
- Add ACK filter and NAT awareness to sch_cake packet scheduler, from
Toke Høiland-Jørgensen.
- Support regions and region snapshots in devlink, from Alex Vesker.
- Allow to attach XDP programs to both HW and SW at the same time on
a given device, with initial support in nfp. From Jakub Kicinski.
- Add TLS RX offload and support in mlx5, from Ilya Lesokhin.
- Use PHYLIB in r8169 driver, from Heiner Kallweit.
- All sorts of changes to support Spectrum 2 in mlxsw driver, from
Ido Schimmel.
- PTP support in mv88e6xxx DSA driver, from Andrew Lunn.
- Make TCP_USER_TIMEOUT socket option more accurate, from Jon
Maxwell.
- Support for templates in packet scheduler classifier, from Jiri
Pirko.
- IPV6 support in RDS, from Ka-Cheong Poon.
- Native tproxy support in nf_tables, from Máté Eckl.
- Maintain IP fragment queue in an rbtree, but optimize properly for
in-order frags. From Peter Oskolkov.
- Improvde handling of ACKs on hole repairs, from Yuchung Cheng"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
hv/netvsc: Fix NULL dereference at single queue mode fallback
net: filter: mark expected switch fall-through
xen-netfront: fix warn message as irq device name has '/'
cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
net: dsa: mv88e6xxx: missing unlock on error path
rds: fix building with IPV6=m
inet/connection_sock: prefer _THIS_IP_ to current_text_addr
net: dsa: mv88e6xxx: bitwise vs logical bug
net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
ieee802154: hwsim: using right kind of iteration
net: hns3: Add vlan filter setting by ethtool command -K
net: hns3: Set tx ring' tc info when netdev is up
net: hns3: Remove tx ring BD len register in hns3_enet
net: hns3: Fix desc num set to default when setting channel
net: hns3: Fix for phy link issue when using marvell phy driver
net: hns3: Fix for information of phydev lost problem when down/up
net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
net: hns3: Add support for serdes loopback selftest
bnxt_en: take coredump_record structure off stack
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-13
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add driver XDP support for veth. This can be used in conjunction with
redirect of another XDP program e.g. sitting on NIC so the xdp_frame
can be forwarded to the peer veth directly without modification,
from Toshiaki.
2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
in order to provide more control and visibility on where a SO_REUSEPORT
sk should be located, and the latter enables to directly select a sk
from the bpf map. This also enables map-in-map for application migration
use cases, from Martin.
3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
of cgroup v2 that is the ancestor of the cgroup associated with the
skb at the ancestor_level, from Andrey.
4) Implement BPF fs map pretty-print support based on BTF data for regular
hash table and LRU map, from Yonghong.
5) Decouple the ability to attach BTF for a map from the key and value
pretty-printer in BPF fs, and enable further support of BTF for maps for
percpu and LPM trie, from Daniel.
6) Implement a better BPF sample of using XDP's CPU redirect feature for
load balancing SKB processing to remote CPU. The sample implements the
same XDP load balancing as Suricata does which is symmetric hash based
on IP and L4 protocol, from Jesper.
7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
critical path as it is ensured that the allocator is present, from Björn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a26ca7c982 ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.
The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].
At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this
point, it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.
For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb(). Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.
Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.
The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).
In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages. Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb. The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.
The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").
The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk. It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()". Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case). The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want. This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match). The bpf prog can decide if it wants to do SK_DROP as its
discretion.
When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk. If not, the kernel
will select one from "reuse->socks[]" (as before this patch).
The SK_DROP and SK_PASS handling logic will be in the next patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision. In this case, decide which
SO_REUSEPORT sk to serve the incoming request.
By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map. That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).
For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map. The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information. This connection info contact/API stays within the
userspace.
Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process. The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds. [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]
During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP. These conditions are
ensured in "reuseport_array_update_check()".
A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map). SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.
When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also. It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]". Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.
The reuseport_array will also support lookup from the syscall
side. It will return a sock_gen_cookie(). The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).
The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update. Supporting different
value_size between lookup and update seems unintuitive also.
We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4. The syscall's lookup
will return ENOSPC on value_size=4. It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit a26ca7c982 ("bpf: btf: Add pretty print support to
the basic arraymap") added pretty print support to array map.
This patch adds pretty print for hash and lru_hash maps.
The following example shows the pretty-print result of
a pinned hashmap:
struct map_value {
int count_a;
int count_b;
};
cat /sys/fs/bpf/pinned_hash_map:
87907: {87907,87908}
57354: {37354,57355}
76625: {76625,76626}
...
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In function map_seq_next() of kernel/bpf/inode.c,
the first key will be the "0" regardless of the map type.
This works for array. But for hash type, if it happens
key "0" is in the map, the bpffs map show will miss
some items if the key "0" is not the first element of
the first bucket.
This patch fixed the issue by guaranteeing to get
the first element, if the seq_show is just started,
by passing NULL pointer key to map_get_next_key() callback.
This way, no missing elements will occur for
bpffs hash table show even if key "0" is in the map.
Fixes: a26ca7c982 ("bpf: btf: Add pretty print support to the basic arraymap")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Like cpumap teardown, the devmap teardown code also flush remaining
xdp_frames, via bq_xmit_all() in case map entry is removed. The code
can call xdp_return_frame_rx_napi, from the the wrong context, in-case
ndo_xdp_xmit() fails.
Fixes: 389ab7f01a ("xdp: introduce xdp_return_frame_rx_napi")
Fixes: 735fc4054b ("xdp: change ndo_xdp_xmit API to support bulking")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When removing a cpumap entry, a number of syncronization steps happen.
Eventually the teardown code __cpu_map_entry_free is invoked from/via
call_rcu.
The teardown code __cpu_map_entry_free() flushes remaining xdp_frames,
by invoking bq_flush_to_queue, which calls xdp_return_frame_rx_napi().
The issues is that the teardown code is not running in the RX NAPI
code path. Thus, it is not allowed to invoke the NAPI variant of
xdp_return_frame.
This bug was found and triggered by using the --stress-mode option to
the samples/bpf program xdp_redirect_cpu. It is hard to trigger,
because the ptr_ring have to be full and cpumap bulk queue max
contains 8 packets, and a remote CPU is racing to empty the ptr_ring
queue.
Fixes: 389ab7f01a ("xdp: introduce xdp_return_frame_rx_napi")
Tested-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In bpf_tcp_sendmsg() the sk_alloc_sg() may fail. In the case of
ENOMEM, it may also mean that we've partially filled the scatterlist
entries with pages. Later jumping to sk_stream_wait_memory()
we could further fail with an error for several reasons, however
we miss to call free_start_sg() if the local sk_msg_buff was used.
Fixes: 4f738adba3 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While working on bpf_tcp_sendmsg() code, I noticed that when a
sk->sk_err is set we error out with err = sk->sk_err. However
this is problematic since sk->sk_err is a positive error value
and therefore we will neither go into sk_stream_error() nor will
we report an error back to user space. I had this case with EPIPE
and user space was thinking sendmsg() succeeded since EPIPE is
a positive value, thinking we submitted 32 bytes. Fix it by
negating the sk->sk_err value.
Fixes: 4f738adba3 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-07
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add cgroup local storage for BPF programs, which provides a fast
accessible memory for storing various per-cgroup data like number
of transmitted packets, etc, from Roman.
2) Support bpf_get_socket_cookie() BPF helper in several more program
types that have a full socket available, from Andrey.
3) Significantly improve the performance of perf events which are
reported from BPF offload. Also convert a couple of BPF AF_XDP
samples overto use libbpf, both from Jakub.
4) seg6local LWT provides the End.DT6 action, which allows to
decapsulate an outer IPv6 header containing a Segment Routing Header.
Adds this action now to the seg6local BPF interface, from Mathieu.
5) Do not mark dst register as unbounded in MOV64 instruction when
both src and dst register are the same, from Arthur.
6) Define u_smp_rmb() and u_smp_wmb() to their respective barrier
instructions on arm64 for the AF_XDP sample code, from Brian.
7) Convert the tcp_client.py and tcp_server.py BPF selftest scripts
over from Python 2 to Python 3, from Jeremy.
8) Enable BTF build flags to the BPF sample code Makefile, from Taeung.
9) Remove an unnecessary rcu_read_lock() in run_lwt_bpf(), from Taehee.
10) Several improvements to the README.rst from the BPF documentation
to make it more consistent with RST format, from Tobin.
11) Replace all occurrences of strerror() by calls to strerror_r()
in libbpf and fix a FORTIFY_SOURCE build error along with it,
from Thomas.
12) Fix a bug in bpftool's get_btf() function to correctly propagate
an error via PTR_ERR(), from Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
__cgroup_bpf_attach() and __cgroup_bpf_detach() functions have
a good amount of duplicated code, which is possible to eliminate
by introducing the update_effective_progs() helper function.
The update_effective_progs() calls compute_effective_progs()
and then in case of success it calls activate_effective_progs()
for each descendant cgroup. In case of failure (OOM), it releases
allocated prog arrays and return the error code.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.
It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and flags
argument has to be 0. Further it can be extended to support
other types of local storage: e.g. thread local storage etc.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
As there is one-to-one relation between a bpf program
and cgroup local storage map, there is no sense in
creating a map of cgroup local storage maps.
Forbid it explicitly to avoid possible side effects.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
BPF_MAP_TYPE_CGROUP_STORAGE maps are special in a way
that the access from the bpf program side is lookup-free.
That means the result is guaranteed to be a valid
pointer to the cgroup storage; no NULL-check is required.
This patch introduces BPF_PTR_TO_MAP_VALUE return type,
which is required to cause the verifier accept programs,
which are not checking the map value pointer for being NULL.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch converts bpf_prog_array from an array of prog pointers
to the array of struct bpf_prog_array_item elements.
This allows to save a cgroup storage pointer for each bpf program
efficiently attached to a cgroup.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If a bpf program is using cgroup local storage, allocate
a bpf_cgroup_storage structure automatically on attaching the program
to a cgroup and save the pointer into the corresponding bpf_prog_list
entry.
Analogically, release the cgroup local storage on detaching
of the bpf program.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit introduces the bpf_cgroup_storage_set() helper,
which will be used to pass a pointer to a cgroup storage
to the bpf helper.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
a special type of maps which are implementing the cgroup storage.
>From the userspace point of view it's almost a generic
hash map with the (cgroup inode id, attachment type) pair
used as a key.
The only difference is that some operations are restricted:
1) a user can't create new entries,
2) a user can't remove existing entries.
The lookup from userspace is o(log(n)).
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commits extends existing bpf maps memory charging API
to support dynamic charging/uncharging.
This is required to account memory used by maps,
if all entries are created dynamically after
the map initialization.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The BTF conflicts were simple overlapping changes.
The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.
Signed-off-by: David S. Miller <davem@davemloft.net>
When check_alu_op() handles a BPF_MOV64 between two registers,
it calls check_reg_arg(DST_OP) on the dst register, marking it
as unbounded. If the src and dst register are the same, this
marks the src as unbounded, which can lead to unexpected errors
for further checks that rely on bounds info. For example:
BPF_MOV64_IMM(BPF_REG_2, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_2),
BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
Results in:
"math between ctx pointer and register with unbounded
min value is not allowed"
check_alu_op() now uses check_reg_arg(DST_OP_NO_MARK), and MOVs
that need to mark the dst register (MOVIMM, MOV32) do so.
Added a test case for MOV64 dst == src, and dst != src.
Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The current map_check_btf() in BPF_MAP_TYPE_ARRAY rejects
'> map->value_size' to ensure map_seq_show_elem() will not
access things beyond an array element.
Yonghong suggested that using '!=' is a more correct
check. The 8 bytes round_up on value_size is stored
in array->elem_size. Hence, using '!=' on map->value_size
is a proper check.
This patch also adds new tests to check the btf array
key type and value type. Two of these new tests verify
the btf's value_size (the change in this patch).
It also fixes two existing tests that wrongly encoded
a btf's type size (pprint_test) and the value_type_id (in one
of the raw_tests[]). However, that do not affect these two
BTF verification tests before or after this test changes.
These two tests mainly failed at array creation time after
this patch.
Fixes: a26ca7c982 ("bpf: btf: Add pretty print support to the basic arraymap")
Suggested-by: Yonghong Song <yhs@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch ensures the member->offset of a struct
is in the correct order (i.e the later member's offset cannot
go backward).
The current "pahole -J" BTF encoder does not generate something
like this. However, checking this can ensure future encoder
will not violate this.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-20
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add sharing of BPF objects within one ASIC: this allows for reuse of
the same program on multiple ports of a device, and therefore gains
better code store utilization. On top of that, this now also enables
sharing of maps between programs attached to different ports of a
device, from Jakub.
2) Cleanup in libbpf and bpftool's Makefile to reduce unneeded feature
detections and unused variable exports, also from Jakub.
3) First batch of RCU annotation fixes in prog array handling, i.e.
there are several __rcu markers which are not correct as well as
some of the RCU handling, from Roman.
4) Two fixes in BPF sample files related to checking of the prog_cnt
upper limit from sample loader, from Dan.
5) Minor cleanup in sockmap to remove a set but not used variable,
from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch shrinks the BTF_INT_BITS() mask. The current
btf_int_check_meta() ensures the nr_bits of an integer
cannot exceed 64. Hence, it is mostly an uapi cleanup.
The actual btf usage (i.e. seq_show()) is also modified
to use u8 instead of u16. The verification (e.g. btf_int_check_meta())
path stays as is to deal with invalid BTF situation.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow programs and maps to be re-used across different netdevs,
as long as they belong to the same struct bpf_offload_dev.
Update the bpf_offload_prog_map_match() helper for the verifier
and export a new helper for the drivers to use when checking
programs at attachment time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Create a higher-level entity to represent a device/ASIC to allow
programs and maps to be shared between device ports. The extra
work is required to make sure we don't destroy BPF objects as
soon as the netdev for which they were loaded gets destroyed,
as other ports may still be using them. When netdev goes away
all of its BPF objects will be moved to other netdevs of the
device, and only destroyed when last netdev is unregistered.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently we have two lists of offloaded objects - programs and maps.
Netdevice unregister notifier scans those lists to orphan objects
associated with device being unregistered. This puts unnecessary
(even if negligible) burden on all netdev unregister calls in BPF-
-enabled kernel. The lists of objects may potentially get long
making the linear scan even more problematic. There haven't been
complaints about this mechanisms so far, but it is suboptimal.
Instead of relying on notifiers, make the few BPF-capable drivers
register explicitly for BPF offloads. The programs and maps will
now be collected per-device not on a global list, and only scanned
for removal when driver unregisters from BPF offloads.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
A set of new API functions exported for the drivers will soon use
'bpf_offload_dev_' as a prefix. Rename the bpf_offload_dev_match()
which is internal to the core (used by the verifier) to avoid any
confusion.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pointer sg is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
warning: variable 'sg' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The progs local variable in compute_effective_progs() is marked
as __rcu, which is not correct. This is a local pointer, which
is initialized by bpf_prog_array_alloc(), which also now
returns a generic non-rcu pointer.
The real rcu-protected pointer is *array (array is a pointer
to an RCU-protected pointer), so the assignment should be performed
using rcu_assign_pointer().
Fixes: 324bda9e6c ("bpf: multi program support for cgroup+bpf")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently the return type of the bpf_prog_array_alloc() is
struct bpf_prog_array __rcu *, which is not quite correct.
Obviously, the returned pointer is a generic pointer, which
is valid for an indefinite amount of time and it's not shared
with anyone else, so there is no sense in marking it as __rcu.
This change eliminate the following sparse warnings:
kernel/bpf/core.c:1544:31: warning: incorrect type in return expression (different address spaces)
kernel/bpf/core.c:1544:31: expected struct bpf_prog_array [noderef] <asn:4>*
kernel/bpf/core.c:1544:31: got void *
kernel/bpf/core.c:1548:17: warning: incorrect type in return expression (different address spaces)
kernel/bpf/core.c:1548:17: expected struct bpf_prog_array [noderef] <asn:4>*
kernel/bpf/core.c:1548:17: got struct bpf_prog_array *<noident>
kernel/bpf/core.c:1681:15: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1681:15: expected struct bpf_prog_array *array
kernel/bpf/core.c:1681:15: got struct bpf_prog_array [noderef] <asn:4>*
Fixes: 324bda9e6c ("bpf: multi program support for cgroup+bpf")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
syzkaller managed to trigger the following bug through fault injection:
[...]
[ 141.043668] verifier bug. No program starts at insn 3
[ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
[ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
fixup_call_args kernel/bpf/verifier.c:5587 [inline]
[ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
[ 141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51
[ 141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014
[ 141.049877] Call Trace:
[ 141.050324] __dump_stack lib/dump_stack.c:77 [inline]
[ 141.050324] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
[ 141.050950] ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60
[ 141.051837] panic+0x238/0x4e7 kernel/panic.c:184
[ 141.052386] ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385
[ 141.053101] ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537
[ 141.053814] ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530
[ 141.054506] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
[ 141.054506] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
[ 141.054506] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
[ 141.055163] __warn.cold.8+0x163/0x1ba kernel/panic.c:538
[ 141.055820] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
[ 141.055820] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
[ 141.055820] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
[...]
What happens in jit_subprogs() is that kcalloc() for the subprog func
buffer is failing with NULL where we then bail out. Latter is a plain
return -ENOMEM, and this is definitely not okay since earlier in the
loop we are walking all subprogs and temporarily rewrite insn->off to
remember the subprog id as well as insn->imm to temporarily point the
call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing
out in such state and handing this over to the interpreter is troublesome
since later/subsequent e.g. find_subprog() lookups are based on wrong
insn->imm.
Therefore, once we hit this point, we need to jump to out_free path
where we undo all changes from earlier loop, so that interpreter can
work on unmodified insn->{off,imm}.
Another point is that should find_subprog() fail in jit_subprogs() due
to a verifier bug, then we also should not simply defer the program to
the interpreter since also here we did partial modifications. Instead
we should just bail out entirely and return an error to the user who is
trying to load the program.
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When extracting bitfield from a number, btf_int_bits_seq_show() builds
a mask and accesses least significant byte of the number in a way
specific to little-endian. This patch fixes that by checking endianness
of the machine and then shifting left and right the unneeded bits.
Thanks to Martin Lau for the help in navigating potential pitfalls when
dealing with endianess and for the final solution.
Fixes: b00b8daec8 ("bpf: btf: Add pretty print capability for data with BTF type info")
Signed-off-by: Okash Khawaja <osk@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Otherwise we end up with attempting to send packets from down devices
or to send oversized packets, which may cause unexpected driver/device
behaviour. Generic XDP has already done this check, so reuse the logic
in native XDP.
Fixes: 814abfabef ("xdp: add bpf_redirect helper function")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In commit
'bpf: bpf_compute_data uses incorrect cb structure' (8108a77515)
we added the routine bpf_compute_data_end_sk_skb() to compute the
correct data_end values, but this has since been lost. In kernel
v4.14 this was correct and the above patch was applied in it
entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
we lost the piece that renamed bpf_compute_data_pointers to the
new function bpf_compute_data_end_sk_skb. This was done here,
e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
When it conflicted with the following rename patch,
6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Finally, after a refactor I thought even the function
bpf_compute_data_end_sk_skb() was no longer needed and it was
erroneously removed.
However, we never reverted the sk_skb_convert_ctx_access() usage of
tcp_skb_cb which had been committed and survived the merge conflict.
Here we fix this by adding back the helper and *_data_end_sk_skb()
usage. Using the bpf_skc_data_end mapping is not correct because it
expects a qdisc_skb_cb object but at the sock layer this is not the
case. Even though it happens to work here because we don't overwrite
any data in-use at the socket layer and the cb structure is cleared
later this has potential to create some subtle issues. But, even
more concretely the filter.c access check uses tcp_skb_cb.
And by some act of chance though,
struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb; /* 0 28 */
/* XXX 4 bytes hole, try to pack */
void * data_meta; /* 32 8 */
void * data_end; /* 40 8 */
/* size: 48, cachelines: 1, members: 3 */
/* sum members: 44, holes: 1, sum holes: 4 */
/* last cacheline: 48 bytes */
};
and then tcp_skb_cb,
struct tcp_skb_cb {
[...]
struct {
__u32 flags; /* 24 4 */
struct sock * sk_redir; /* 32 8 */
void * data_end; /* 40 8 */
} bpf; /* 24 */
};
So when we use offset_of() to track down the byte offset we get 40 in
either case and everything continues to work. Fix this mess and use
correct structures its unclear how long this might actually work for
until someone moves the structs around.
Reported-by: Martin KaFai Lau <kafai@fb.com>
Fixes: e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Fixes: 6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when a sock is closed and the bpf_tcp_close() callback is
used we remove memory but do not free the skb. Call consume_skb() if
the skb is attached to the buffer.
Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After latest lock updates there is no longer anything preventing a
close and recvmsg call running in parallel. Additionally, we can
race update with close if we close a socket and simultaneously update
if via the BPF userspace API (note the cgroup ops are already run
with sock_lock held).
To resolve this take sock_lock in close and update paths.
Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
Fixes: e9db4ef6bf ("bpf: sockhash fix omitted bucket lock in sock_close")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This removes locking from readers of RCU hash table. Its not
necessary.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current code, in the error path of sock_hash_ctx_update_elem,
checks if the sock has a psock in the user data and if so decrements
the reference count of the psock. However, if the error happens early
in the error path we may have never incremented the psock reference
count and if the psock exists because the sock is in another map then
we may inadvertently decrement the reference count.
Fix this by making the error path only call smap_release_sock if the
error happens after the increment.
Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Decrement the number of elements in the map in case the allocation
of a new node fails.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add map_release_uref pointer to hashmap ops. This was dropped when
original sockhash code was ported into bpf-next before initial
commit.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
First the sk_callback_lock() was being used to protect both the
sock callback hooks and the psock->maps list. This got overly
convoluted after the addition of sockhash (in sockmap it made
some sense because masp and callbacks were tightly coupled) so
lets split out a specific lock for maps and only use the callback
lock for its intended purpose. This fixes a couple cases where
we missed using maps lock when it was in fact needed. Also this
makes it easier to follow the code because now we can put the
locking closer to the actual code its serializing.
Next, in sock_hash_delete_elem() the pattern was as follows,
sock_hash_delete_elem()
[...]
spin_lock(bucket_lock)
l = lookup_elem_raw()
if (l)
hlist_del_rcu()
write_lock(sk_callback_lock)
.... destroy psock ...
write_unlock(sk_callback_lock)
spin_unlock(bucket_lock)
The ordering is necessary because we only know the {p}sock after
dereferencing the hash table which we can't do unless we have the
bucket lock held. Once we have the bucket lock and the psock element
it is deleted from the hashmap to ensure any other path doing a lookup
will fail. Finally, the refcnt is decremented and if zero the psock
is destroyed.
In parallel with the above (or free'ing the map) a tcp close event
may trigger tcp_close(). Which at the moment omits the bucket lock
altogether (oops!) where the flow looks like this,
bpf_tcp_close()
[...]
write_lock(sk_callback_lock)
for each psock->maps // list of maps this sock is part of
hlist_del_rcu(ref_hash_node);
.... destroy psock ...
write_unlock(sk_callback_lock)
Obviously, and demonstrated by syzbot, this is broken because
we can have multiple threads deleting entries via hlist_del_rcu().
To fix this we might be tempted to wrap the hlist operation in a
bucket lock but that would create a lock inversion problem. In
summary to follow locking rules the psocks maps list needs the
sk_callback_lock (after this patch maps_lock) but we need the bucket
lock to do the hlist_del_rcu.
To resolve the lock inversion problem pop the head of the maps list
repeatedly and remove the reference until no more are left. If a
delete happens in parallel from the BPF API that is OK as well because
it will do a similar action, lookup the lock in the map/hash, delete
it from the map/hash, and dec the refcnt. We check for this case
before doing a destroy on the psock to ensure we don't have two
threads tearing down a psock. The new logic is as follows,
bpf_tcp_close()
e = psock_map_pop(psock->maps) // done with map lock
bucket_lock() // lock hash list bucket
l = lookup_elem_raw(head, hash, key, key_size);
if (l) {
//only get here if elmnt was not already removed
hlist_del_rcu()
... destroy psock...
}
bucket_unlock()
And finally for all the above to work add missing locking around map
operations per above. Then add RCU annotations and use
rcu_dereference/rcu_assign_pointer to manage values relying on RCU so
that the object is not free'd from sock_hash_free() while it is being
referenced in bpf_tcp_close().
Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If a hashmap is free'd with open socks it removes the reference to
the hash entry from the psock. If that is the last reference to the
psock then it will also be free'd by the reference counting logic.
However the current logic that removes the hash reference from the
list of references is broken. In smap_list_remove() we first check
if the sockmap entry matches and then check if the hashmap entry
matches. But, the sockmap entry sill always match because its NULL in
this case which causes the first entry to be removed from the list.
If this is always the "right" entry (because the user adds/removes
entries in order) then everything is OK but otherwise a subsequent
bpf_tcp_close() may reference a free'd object.
To fix this create two list handlers one for sockmap and one for
sockhash.
Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.
Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used.
Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Partially undo commit 9facc33687 ("bpf: reject any prog that failed
read-only lock") since it caused a regression, that is, syzkaller was
able to manage to cause a panic via fault injection deep in set_memory_ro()
path by letting an allocation fail: In x86's __change_page_attr_set_clr()
it was able to change the attributes of the primary mapping but not in
the alias mapping via cpa_process_alias(), so the second, inner call
to the __change_page_attr() via __change_page_attr_set_clr() had to split
a larger page and failed in the alloc_pages() with the artifically triggered
allocation error which is then propagated down to the call site.
Thus, for set_memory_ro() this means that it returned with an error, but
from debugging a probe_kernel_write() revealed EFAULT on that memory since
the primary mapping succeeded to get changed. Therefore the subsequent
hdr->locked = 0 reset triggered the panic as it was performed on read-only
memory, so call-site assumptions were infact wrong to assume that it would
either succeed /or/ not succeed at all since there's no such rollback in
set_memory_*() calls from partial change of mappings, in other words, we're
left in a state that is "half done". A later undo via set_memory_rw() is
succeeding though due to matching permissions on that part (aka due to the
try_preserve_large_page() succeeding). While reproducing locally with
explicitly triggering this error, the initial splitting only happens on
rare occasions and in real world it would additionally need oom conditions,
but that said, it could partially fail. Therefore, it is definitely wrong
to bail out on set_memory_ro() error and reject the program with the
set_memory_*() semantics we have today. Shouldn't have gone the extra mile
since no other user in tree today infact checks for any set_memory_*()
errors, e.g. neither module_enable_ro() / module_disable_ro() for module
RO/NX handling which is mostly default these days nor kprobes core with
alloc_insn_page() / free_insn_page() as examples that could be invoked long
after bootup and original 314beb9bca ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks") did neither when it got first introduced to BPF
so "improving" with bailing out was clearly not right when set_memory_*()
cannot handle it today.
Kees suggested that if set_memory_*() can fail, we should annotate it with
__must_check, and all callers need to deal with it gracefully given those
set_memory_*() markings aren't "advisory", but they're expected to actually
do what they say. This might be an option worth to move forward in future
but would at the same time require that set_memory_*() calls from supporting
archs are guaranteed to be "atomic" in that they provide rollback if part
of the range fails, once that happened, the transition from RW -> RO could
be made more robust that way, while subsequent RO -> RW transition /must/
continue guaranteeing to always succeed the undo part.
Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
Fixes: 9facc33687 ("bpf: reject any prog that failed read-only lock")
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
possible to attach, detach or query IR BPF programs to /dev/lircN devices,
making them impossible to use. For embedded devices, it should be possible
to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.
This change requires some refactoring, since bpf_prog_{attach,detach,query}
functions are now always compiled, but their code paths for cgroups need
moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not
require #ifdefs since that is already conditionally compiled.
Fixes: f4364dcfc8 ("media: rc: introduce BPF_PROG_LIRC_MODE2")
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While __atomic_add_unless() was originally intended as a building-block
for atomic_add_unless(), it's now used in a number of places around the
kernel. It's the only common atomic operation named __atomic*(), rather
than atomic_*(), and for consistency it would be better named
atomic_fetch_add_unless().
This lack of consistency is slightly confusing, and gets in the way of
scripting atomics. Given that, let's clean things up and promote it to
an official part of the atomics API, in the form of
atomic_fetch_add_unless().
This patch converts definitions and invocations over to the new name,
including the instrumented version, using the following script:
----
git grep -w __atomic_add_unless | while read line; do
sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
done
git grep -w __arch_atomic_add_unless | while read line; do
sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
done
----
Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
be introduced by later patches.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-06-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a panic in devmap handling in generic XDP where return type
of __devmap_lookup_elem() got changed recently but generic XDP
code missed the related update, from Toshiaki.
2) Fix a freeze when BPF progs are loaded that include BPF to BPF
calls when JIT is enabled where we would later bail out via error
path w/o dropping kallsyms, and another one to silence syzkaller
splats from locking prog read-only, from Daniel.
3) Fix a bug in test_offloads.py BPF selftest which must not assume
that the underlying system have no BPF progs loaded prior to test,
and one in bpftool to fix accuracy of program load time, from Jakub.
4) Fix a bug in bpftool's probe for availability of the bpf(2)
BPF_TASK_FD_QUERY subcommand, from Yonghong.
5) Fix a regression in AF_XDP's XDP_SKB receive path where queue
id check got erroneously removed, from Björn.
6) Fix missing state cleanup in BPF's xfrm tunnel test, from William.
7) Check tunnel type more accurately in BPF's tunnel collect metadata
kselftest, from Jian.
8) Fix missing Kconfig fragments for BPF kselftests, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Various netfilter fixlets from Pablo and the netfilter team.
2) Fix regression in IPVS caused by lack of PMTU exceptions on local
routes in ipv6, from Julian Anastasov.
3) Check pskb_trim_rcsum for failure in DSA, from Zhouyang Jia.
4) Don't crash on poll in TLS, from Daniel Borkmann.
5) Revert SO_REUSE{ADDR,PORT} change, it regresses various things
including Avahi mDNS. From Bart Van Assche.
6) Missing of_node_put in qcom/emac driver, from Yue Haibing.
7) We lack checking of the TCP checking in one special case during SYN
receive, from Frank van der Linden.
8) Fix module init error paths of mac80211 hwsim, from Johannes Berg.
9) Handle 802.1ad properly in stmmac driver, from Elad Nachman.
10) Must grab HW caps before doing quirk checks in stmmac driver, from
Jose Abreu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
net: stmmac: Run HWIF Quirks after getting HW caps
neighbour: skip NTF_EXT_LEARNED entries during forced gc
net: cxgb3: add error handling for sysfs_create_group
tls: fix waitall behavior in tls_sw_recvmsg
tls: fix use-after-free in tls_push_record
l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl()
l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels
mlxsw: spectrum_switchdev: Fix port_vlan refcounting
mlxsw: spectrum_router: Align with new route replace logic
mlxsw: spectrum_router: Allow appending to dev-only routes
ipv6: Only emit append events for appended routes
stmmac: added support for 802.1ad vlan stripping
cfg80211: fix rcu in cfg80211_unregister_wdev
mac80211: Move up init of TXQs
mac80211_hwsim: fix module init error paths
cfg80211: initialize sinfo in cfg80211_get_station
nl80211: fix some kernel doc tag mistakes
hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
rds: avoid unenecessary cong_update in loop transport
l2tp: clean up stale tunnel or session in pppol2tp_connect's error path
...
Commit 67f29e07e1 ("bpf: devmap introduce dev_map_enqueue") changed
the return value type of __devmap_lookup_elem() from struct net_device *
to struct bpf_dtab_netdev * but forgot to modify generic XDP code
accordingly.
Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct
net_device is expected, then skb->dev was set to invalid value.
v2:
- Fix compiler warning without CONFIG_BPF_SYSCALL.
Fixes: 67f29e07e1 ("bpf: devmap introduce dev_map_enqueue")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
as well as the BPF image as read-only through bpf_prog_lock_ro(). In
the case any of these would fail we throw a WARN_ON_ONCE() in order to
yell loudly to the log. Perhaps, to some extend, this may be comparable
to an allocation where __GFP_NOWARN is explicitly not set.
Added via 65869a47f3 ("bpf: improve read-only handling"), this behavior
is slightly different compared to any of the other in-kernel set_memory_ro()
users who do not check the return code of set_memory_ro() and friends /at
all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
in BPF this is mandatory hardening step, we want to know whether there
are any issues that would leave both BPF data writable. So it happens
that syzkaller enabled fault injection and it triggered memory allocation
failure deep inside x86's change_page_attr_set_clr() which was triggered
from set_memory_ro().
Now, there are two options: i) leaving everything as is, and ii) reworking
the image locking code in order to have a final checkpoint out of the
central bpf_prog_select_runtime() which probes whether any of the calls
during prog setup weren't successful, and then bailing out with an error.
Option ii) is a better approach since this additional paranoia avoids
altogether leaving any potential W+X pages from BPF side in the system.
Therefore, lets be strict about it, and reject programs in such unlikely
occasion. While testing I noticed also that one bpf_prog_lock_ro()
call was missing on the outer dummy prog in case of calls, e.g. in the
destructor we call bpf_prog_free_deferred() on the main prog where we
try to bpf_prog_unlock_free() the program, and since we go via
bpf_prog_select_runtime() do that as well.
Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While testing I found that when hitting error path in bpf_prog_load()
where we jump to free_used_maps and prog contained BPF to BPF calls
that were JITed earlier, then we never clean up the bpf_prog_kallsyms_add()
done under jit_subprogs(). Add proper API to make BPF kallsyms deletion
more clear and fix that.
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
syzkaller was able to trigger the following warning in
do_dentry_open():
WARNING: CPU: 1 PID: 4508 at fs/open.c:778 do_dentry_open+0x4ad/0xe40 fs/open.c:778
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 4508 Comm: syz-executor867 Not tainted 4.17.0+ #90
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
[...]
vfs_open+0x139/0x230 fs/open.c:908
do_last fs/namei.c:3370 [inline]
path_openat+0x1717/0x4dc0 fs/namei.c:3511
do_filp_open+0x249/0x350 fs/namei.c:3545
do_sys_open+0x56f/0x740 fs/open.c:1101
__do_sys_openat fs/open.c:1128 [inline]
__se_sys_openat fs/open.c:1122 [inline]
__x64_sys_openat+0x9d/0x100 fs/open.c:1122
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Problem was that prog and map inodes in bpf fs did not
implement a dummy file open operation that would return an
error. The patch in do_dentry_open() checks whether f_ops
are present and if not bails out with an error. While this
may be fine, we really shouldn't be throwing a warning
though. Thus follow the model similar to bad_file_ops and
reject the request unconditionally with -EIO.
Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Reported-by: syzbot+2e7fcab0f56fdbb330b8@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As commit 28e33f9d78 ("bpf: disallow arithmetic operations on
context pointer") already describes, f1174f77b5 ("bpf/verifier:
rework value tracking") removed the specific white-listed cases
we had previously where we would allow for pointer arithmetic in
order to further generalize it, and allow e.g. context access via
modified registers. While the dereferencing of modified context
pointers had been forbidden through 28e33f9d78, syzkaller did
recently manage to trigger several KASAN splats for slab out of
bounds access and use after frees by simply passing a modified
context pointer to a helper function which would then do the bad
access since verifier allowed it in adjust_ptr_min_max_vals().
Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals()
generally could break existing programs as there's a valid use
case in tracing in combination with passing the ctx to helpers as
bpf_probe_read(), where the register then becomes unknown at
verification time due to adding a non-constant offset to it. An
access sequence may look like the following:
offset = args->filename; /* field __data_loc filename */
bpf_probe_read(&dst, len, (char *)args + offset); // args is ctx
There are two options: i) we could special case the ctx and as
soon as we add a constant or bounded offset to it (hence ctx type
wouldn't change) we could turn the ctx into an unknown scalar, or
ii) we generalize the sanity test for ctx member access into a
small helper and assert it on the ctx register that was passed
as a function argument. Fwiw, latter is more obvious and less
complex at the same time, and one case that may potentially be
legitimate in future for ctx member access at least would be for
ctx to carry a const offset. Therefore, fix follows approach
from ii) and adds test cases to BPF kselftests.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Reported-by: syzbot+3d0b2441dbb71751615e@syzkaller.appspotmail.com
Reported-by: syzbot+c8504affd4fdd0c1b626@syzkaller.appspotmail.com
Reported-by: syzbot+e5190cb881d8660fb1a3@syzkaller.appspotmail.com
Reported-by: syzbot+efae31b384d5badbd620@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf has been used extensively for tracing. For example, bcc
contains an almost full set of bpf-based tools to trace kernel
and user functions/events. Most tracing tools are currently
either filtered based on pid or system-wide.
Containers have been used quite extensively in industry and
cgroup is often used together to provide resource isolation
and protection. Several processes may run inside the same
container. It is often desirable to get container-level tracing
results as well, e.g. syscall count, function count, I/O
activity, etc.
This patch implements a new helper, bpf_get_current_cgroup_id(),
which will return cgroup id based on the cgroup within which
the current task is running.
The later patch will provide an example to show that
userspace can get the same cgroup id so it could
configure a filter or policy in the bpf program based on
task cgroup id.
The helper is currently implemented for tracing. It can
be added to other program types as well when needed.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The XDP_REDIRECT map devmap can avoid using ndo_xdp_flush, by instead
instructing ndo_xdp_xmit to flush via XDP_XMIT_FLUSH flag in
appropriate places.
Notice after this patch it is possible to remove ndo_xdp_flush
completely, as this is the last user of ndo_xdp_flush. This is left
for later patches, to keep driver changes separate.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch only change the API and reject any use of flags. This is an
intermediate step that allows us to implement the flush flag operation
later, for each individual driver in a separate patch.
The plan is to implement flush operation via XDP_XMIT_FLUSH flag
and then remove XDP_XMIT_FLAGS_NONE when done.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
program type in test_verifier report the following errors on x86_32:
172/p unpriv: spill/fill of different pointers ldx FAIL
Unexpected error message!
0: (bf) r6 = r10
1: (07) r6 += -8
2: (15) if r1 == 0x0 goto pc+3
R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
3: (bf) r2 = r10
4: (07) r2 += -76
5: (7b) *(u64 *)(r6 +0) = r2
6: (55) if r1 != 0x0 goto pc+1
R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp
7: (7b) *(u64 *)(r6 +0) = r1
8: (79) r1 = *(u64 *)(r6 +0)
9: (79) r1 = *(u64 *)(r1 +68)
invalid bpf_context access off=68 size=8
378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
Failed to load prog 'Permission denied'!
0: (b7) r0 = 0
1: (71) r0 = *(u8 *)(r1 +68)
invalid bpf_context access off=68 size=1
379/p check bpf_perf_event_data->sample_period half load permitted FAIL
Failed to load prog 'Permission denied'!
0: (b7) r0 = 0
1: (69) r0 = *(u16 *)(r1 +68)
invalid bpf_context access off=68 size=2
380/p check bpf_perf_event_data->sample_period word load permitted FAIL
Failed to load prog 'Permission denied'!
0: (b7) r0 = 0
1: (61) r0 = *(u32 *)(r1 +68)
invalid bpf_context access off=68 size=4
381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
Failed to load prog 'Permission denied'!
0: (b7) r0 = 0
1: (79) r0 = *(u64 *)(r1 +68)
invalid bpf_context access off=68 size=8
Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte
boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok()
will then bail out saying that off & (size_default - 1) which is 68 & 7
doesn't cleanly align in the case of sample_period access from struct
bpf_perf_event_data, hence verifier wrongly thinks we might be doing an
unaligned access here though underlying arch can handle it just fine.
Therefore adjust this down to machine size and check and rewrite the
offset for narrow access on that basis. We also need to fix corresponding
pe_prog_is_valid_access(), since we hit the check for off % size != 0
(e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs
for tracing work on x86_32.
Reported-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While some of the BPF map lookup helpers provide a ->map_gen_lookup()
callback for inlining the map lookup altogether it is not available
for every map, so the remaining ones have to call bpf_map_lookup_elem()
helper which does a dispatch to map->ops->map_lookup_elem(). In
times of retpolines, this will control and trap speculative execution
rather than letting it do its work for the indirect call and will
therefore cause a slowdown. Likewise, bpf_map_update_elem() and
bpf_map_delete_elem() do not have an inlined version and need to call
into their map->ops->map_update_elem() resp. map->ops->map_delete_elem()
handlers.
Before:
# bpftool prog dump xlated id 1
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = map[id:1]
5: (85) call __htab_map_lookup_elem#232656
6: (15) if r0 == 0x0 goto pc+4
7: (71) r1 = *(u8 *)(r0 +35)
8: (55) if r1 != 0x0 goto pc+1
9: (72) *(u8 *)(r0 +35) = 1
10: (07) r0 += 56
11: (15) if r0 == 0x0 goto pc+4
12: (bf) r2 = r0
13: (18) r1 = map[id:1]
15: (85) call bpf_map_delete_elem#215008 <-- indirect call via
16: (95) exit helper
After:
# bpftool prog dump xlated id 1
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = map[id:1]
5: (85) call __htab_map_lookup_elem#233328
6: (15) if r0 == 0x0 goto pc+4
7: (71) r1 = *(u8 *)(r0 +35)
8: (55) if r1 != 0x0 goto pc+1
9: (72) *(u8 *)(r0 +35) = 1
10: (07) r0 += 56
11: (15) if r0 == 0x0 goto pc+4
12: (bf) r2 = r0
13: (18) r1 = map[id:1]
15: (85) call htab_lru_map_delete_elem#238240 <-- direct call
16: (95) exit
In all three lookup/update/delete cases however we can use the actual
address of the map callback directly if we find that there's only a
single path with a map pointer leading to the helper call, meaning
when the map pointer has not been poisoned from verifier side.
Example code can be seen above for the delete case.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Its trivial and straight forward to expose it for scripts that can
then use it along with bpftool in order to inspect an individual
application's used maps and progs. Right now we dump some basic
information in the fdinfo file but with the help of the map/prog
id full introspection becomes possible now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stating 'proprietary program' in the error is just silly since it
can also be a different open source license than that which is just
not compatible.
Reference: https://twitter.com/majek04/status/998531268039102465
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The t->type in BTF_KIND_FWD is not used. It must be 0.
This patch ensures that and also adds a test case in test_btf.c
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch ensures array's t->size is 0.
The array size is decided by its individual elem's size and the
number of elements. Hence, t->size is not used and
it must be 0.
A test case is added to test_btf.c
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The assignment dev = dev is redundant and should be removed.
Detected by CoverityScan, CID#1469486 ("Evaluation order violation")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support for BPF_PROG_LIRC_MODE2. This type of BPF program can call
rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report
that the last key should be repeated.
The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall;
the target_fd must be the /dev/lircN device.
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This makes is it possible for bpf prog detach to return -ENOENT.
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In addition to already existing BPF hooks for sys_bind and sys_connect,
the patch provides new hooks for sys_sendmsg.
It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
that provides access to socket itlself (properties like family, type,
protocol) and user-passed `struct sockaddr *` so that BPF program can
override destination IP and port for system calls such as sendto(2) or
sendmsg(2) and/or assign source IP to the socket.
The hooks are implemented as two new attach types:
`BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
UDPv6 correspondingly.
UDPv4 and UDPv6 separate attach types for same reason as sys_bind and
sys_connect hooks, i.e. to prevent reading from / writing to e.g.
user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
The difference with already existing hooks is sys_sendmsg are
implemented only for unconnected UDP.
For TCP it doesn't make sense to change user-provided `struct sockaddr *`
at sendto(2)/sendmsg(2) time since socket either was already connected
and has source/destination set or wasn't connected and call to
sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
Connected UDP is already handled by sys_connect hooks that can override
source/destination at connect time and use fast-path later, i.e. these
hooks don't affect UDP fast-path.
Rewriting source IP is implemented differently than that in sys_connect
hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
just bind socket to desired local IP address since source IP can be set
on per-packet basis by using ancillary data (cmsg(3)). So no matter if
socket is bound or not, source IP has to be rewritten on every call to
sys_sendmsg.
To do so two new fields are added to UAPI `struct bpf_sock_addr`;
* `msg_src_ip4` to set source IPv4 for UDPv4;
* `msg_src_ip6` to set source IPv6 for UDPv6.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The stack_map_get_build_id_offset() function is too long for gcc to track
whether 'work' may or may not be initialized at the end of it, leading
to a false-positive warning:
kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset':
kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This removes the 'in_nmi_ctx' flag and uses the state of that variable
itself to see if it got initialized.
Fixes: bae77c5eb5 ("bpf: enable stackmap with build_id in nmi context")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
gcc warns about a noreturn function possibly returning in
some configurations:
kernel/bpf/btf.c: In function 'env_type_is_resolve_sink':
kernel/bpf/btf.c:729:1: error: control reaches end of non-void function [-Werror=return-type]
Using BUG() instead of BUG_ON() avoids that warning and otherwise
does the exact same thing.
Fixes: eb3f595dab ("bpf: btf: Validate type reference")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull networking fixes from David Miller:
"Let's begin the holiday weekend with some networking fixes:
1) Whoops need to restrict cfg80211 wiphy names even more to 64
bytes. From Eric Biggers.
2) Fix flags being ignored when using kernel_connect() with SCTP,
from Xin Long.
3) Use after free in DCCP, from Alexey Kodanev.
4) Need to check rhltable_init() return value in ipmr code, from Eric
Dumazet.
5) XDP handling fixes in virtio_net from Jason Wang.
6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu.
7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack
Morgenstein.
8) Prevent out-of-bounds speculation using indexes in BPF, from
Daniel Borkmann.
9) Fix regression added by AF_PACKET link layer cure, from Willem de
Bruijn.
10) Correct ENIC dma mask, from Govindarajulu Varadarajan.
11) Missing config options for PMTU tests, from Stefano Brivio"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
ibmvnic: Fix partial success login retries
selftests/net: Add missing config options for PMTU tests
mlx4_core: allocate ICM memory in page size chunks
enic: set DMA mask to 47 bit
ppp: remove the PPPIOCDETACH ioctl
ipv4: remove warning in ip_recv_error
net : sched: cls_api: deal with egdev path only if needed
vhost: synchronize IOTLB message with dev cleanup
packet: fix reserve calculation
net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands
net/mlx5e: When RXFCS is set, add FCS data into checksum calculation
bpf: properly enforce index mask to prevent out-of-bounds speculation
net/mlx4: Fix irq-unsafe spinlock usage
net: phy: broadcom: Fix bcm_write_exp()
net: phy: broadcom: Fix auxiliary control register reads
net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy
net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
ibmvnic: Only do H_EOI for mobility events
tuntap: correctly set SOCKWQ_ASYNC_NOSPACE
virtio-net: fix leaking page for gso packet during mergeable XDP
...
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-05-24
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).
2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.
3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.
4) Jiong Wang adds support for indirect and arithmetic shifts to NFP
5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.
6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing
to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions.
7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.
8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.
9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Extending tracepoint xdp:xdp_devmap_xmit in devmap with an err code
allow people to easier identify the reason behind the ndo_xdp_xmit
call to a given driver is failing.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch change the API for ndo_xdp_xmit to support bulking
xdp_frames.
When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
Most of the slowdown is caused by DMA API indirect function calls, but
also the net_device->ndo_xdp_xmit() call.
Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with
single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
performance improved:
for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
With frames avail as a bulk inside the driver ndo_xdp_xmit call,
further optimizations are possible, like bulk DMA-mapping for TX.
Testing without CONFIG_RETPOLINE show the same performance for
physical NIC drivers.
The virtual NIC driver tun sees a huge performance boost, as it can
avoid doing per frame producer locking, but instead amortize the
locking cost over the bulk.
V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When sending an xdp_frame through xdp_do_redirect call, then error
cases can happen where the xdp_frame needs to be dropped, and
returning an -errno code isn't sufficient/possible any-longer
(e.g. for cpumap case). This is already fully supported, by simply
calling xdp_return_frame.
This patch is an optimization, which provides xdp_return_frame_rx_napi,
which is a faster variant for these error cases. It take advantage of
the protection provided by XDP RX running under NAPI protection.
This change is mostly relevant for drivers using the page_pool
allocator as it can take advantage of this. (Tested with mlx5).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Notice how this allow us get XDP statistic without affecting the XDP
performance, as tracepoint is no-longer activated on a per packet basis.
V5: Spotted by John Fastabend.
Fix 'sent' also counted 'drops' in this patch, a later patch corrected
this, but it was a mistake in this intermediate step.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Like cpumap create queue for xdp frames that will be bulked. For now,
this patch simply invoke ndo_xdp_xmit foreach frame. This happens,
either when the map flush operation is envoked, or when the limit
DEV_MAP_BULK_SIZE is reached.
V5: Avoid memleak on error path in dev_map_update_elem()
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Functionality is the same, but the ndo_xdp_xmit call is now
simply invoked from inside the devmap.c code.
V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
V5: Cleanups requested by Daniel
- Newlines before func definition
- Use BUILD_BUG_ON checks
- Remove unnecessary use return value store in dev_map_enqueue
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, suppose a userspace application has loaded a bpf program
and attached it to a tracepoint/kprobe/uprobe, and a bpf
introspection tool, e.g., bpftool, wants to show which bpf program
is attached to which tracepoint/kprobe/uprobe. Such attachment
information will be really useful to understand the overall bpf
deployment in the system.
There is a name field (16 bytes) for each program, which could
be used to encode the attachment point. There are some drawbacks
for this approaches. First, bpftool user (e.g., an admin) may not
really understand the association between the name and the
attachment point. Second, if one program is attached to multiple
places, encoding a proper name which can imply all these
attachments becomes difficult.
This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
Given a pid and fd, if the <pid, fd> is associated with a
tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
. prog_id
. tracepoint name, or
. k[ret]probe funcname + offset or kernel addr, or
. u[ret]probe filename + offset
to the userspace.
The user can use "bpftool prog" to find more information about
bpf program itself with prog_id.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While reviewing the verifier code, I recently noticed that the
following two program variants in relation to tail calls can be
loaded.
Variant 1:
# bpftool p d x i 15
0: (15) if r1 == 0x0 goto pc+3
1: (18) r2 = map[id:5]
3: (05) goto pc+2
4: (18) r2 = map[id:6]
6: (b7) r3 = 7
7: (35) if r3 >= 0xa0 goto pc+2
8: (54) (u32) r3 &= (u32) 255
9: (85) call bpf_tail_call#12
10: (b7) r0 = 1
11: (95) exit
# bpftool m s i 5
5: prog_array flags 0x0
key 4B value 4B max_entries 4 memlock 4096B
# bpftool m s i 6
6: prog_array flags 0x0
key 4B value 4B max_entries 160 memlock 4096B
Variant 2:
# bpftool p d x i 20
0: (15) if r1 == 0x0 goto pc+3
1: (18) r2 = map[id:8]
3: (05) goto pc+2
4: (18) r2 = map[id:7]
6: (b7) r3 = 7
7: (35) if r3 >= 0x4 goto pc+2
8: (54) (u32) r3 &= (u32) 3
9: (85) call bpf_tail_call#12
10: (b7) r0 = 1
11: (95) exit
# bpftool m s i 8
8: prog_array flags 0x0
key 4B value 4B max_entries 160 memlock 4096B
# bpftool m s i 7
7: prog_array flags 0x0
key 4B value 4B max_entries 4 memlock 4096B
In both cases the index masking inserted by the verifier in order
to control out of bounds speculation from a CPU via b2157399cc
("bpf: prevent out-of-bounds speculation") seems to be incorrect
in what it is enforcing. In the 1st variant, the mask is applied
from the map with the significantly larger number of entries where
we would allow to a certain degree out of bounds speculation for
the smaller map, and in the 2nd variant where the mask is applied
from the map with the smaller number of entries, we get buggy
behavior since we truncate the index of the larger map.
The original intent from commit b2157399cc is to reject such
occasions where two or more different tail call maps are used
in the same tail call helper invocation. However, the check on
the BPF_MAP_PTR_POISON is never hit since we never poisoned the
saved pointer in the first place! We do this explicitly for map
lookups but in case of tail calls we basically used the tail
call map in insn_aux_data that was processed in the most recent
path which the verifier walked. Thus any prior path that stored
a pointer in insn_aux_data at the helper location was always
overridden.
Fix it by moving the map pointer poison logic into a small helper
that covers both BPF helpers with the same logic. After that in
fixup_bpf_calls() the poison check is then hit for tail calls
and the program rejected. Latter only happens in unprivileged
case since this is the *only* occasion where a rewrite needs to
happen, and where such rewrite is specific to the map (max_entries,
index_mask). In the privileged case the rewrite is generic for
the insn->imm / insn->code update so multiple maps from different
paths can be handled just fine since all the remaining logic
happens in the instruction processing itself. This is similar
to the case of map lookups: in case there is a collision of
maps in fixup_bpf_calls() we must skip the inlined rewrite since
this will turn the generic instruction sequence into a non-
generic one. Thus the patch_call_imm will simply update the
insn->imm location where the bpf_map_lookup_elem() will later
take care of the dispatch. Given we need this 'poison' state
as a check, the information of whether a map is an unpriv_array
gets lost, so enforcing it prior to that needs an additional
state. In general this check is needed since there are some
complex and tail call intensive BPF programs out there where
LLVM tends to generate such code occasionally. We therefore
convert the map_ptr rather into map_state to store all this
w/o extra memory overhead, and the bit whether one of the maps
involved in the collision was from an unpriv_array thus needs
to be retained as well there.
Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds the End.BPF action to the LWT seg6local infrastructure.
This action works like any other seg6local End action, meaning that an IPv6
header with SRH is needed, whose DA has to be equal to the SID of the
action. It will also advance the SRH to the next segment, the BPF program
does not have to take care of this.
Since the BPF program may not be a source of instability in the kernel, it
is important to ensure that the integrity of the packet is maintained
before yielding it back to the IPv6 layer. The hook hence keeps track if
the SRH has been altered through the helpers, and re-validates its
content if needed with seg6_validate_srh. The state kept for validation is
stored in a per-CPU buffer. The BPF program is not allowed to directly
write into the packet, and only some fields of the SRH can be altered
through the helper bpf_lwt_seg6_store_bytes.
Performances profiling has shown that the SRH re-validation does not induce
a significant overhead. If the altered SRH is deemed as invalid, the packet
is dropped.
This validation is also done before executing any action through
bpf_lwt_seg6_action, and will not be performed again if the SRH is not
modified after calling the action.
The BPF program may return 3 types of return codes:
- BPF_OK: the End.BPF action will look up the next destination through
seg6_lookup_nexthop.
- BPF_REDIRECT: if an action has been executed through the
bpf_lwt_seg6_action helper, the BPF program should return this
value, as the skb's destination is already set and the default
lookup should not be performed.
- BPF_DROP : the packet will be dropped.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This adds new two new fields to struct bpf_prog_info. For
multi-function programs, these fields can be used to pass
a list of the JITed image lengths of each function for a
given program to userspace using the bpf system call with
the BPF_OBJ_GET_INFO_BY_FD command.
This can be used by userspace applications like bpftool
to split up the contiguous JITed dump, also obtained via
the system call, into more relatable chunks corresponding
to each function.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, for multi-function programs, we cannot get the JITed
instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD
command. Because of this, userspace tools such as bpftool fail
to identify a multi-function program as being JITed or not.
With the JIT enabled and the test program running, this can be
verified as follows:
# cat /proc/sys/net/core/bpf_jit_enable
1
Before applying this patch:
# bpftool prog list
1: kprobe name foo tag b811aab41a39ad3d gpl
loaded_at 2018-05-16T11:43:38+0530 uid 0
xlated 216B not jited memlock 65536B
...
# bpftool prog dump jited id 1
no instructions returned
After applying this patch:
# bpftool prog list
1: kprobe name foo tag b811aab41a39ad3d gpl
loaded_at 2018-05-16T12:13:01+0530 uid 0
xlated 216B jited 308B memlock 65536B
...
# bpftool prog dump jited id 1
0: nop
4: nop
8: mflr r0
c: std r0,16(r1)
10: stdu r1,-112(r1)
14: std r31,104(r1)
18: addi r31,r1,48
1c: li r3,10
...
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This adds new two new fields to struct bpf_prog_info. For
multi-function programs, these fields can be used to pass
a list of kernel symbol addresses for all functions in a
given program to userspace using the bpf system call with
the BPF_OBJ_GET_INFO_BY_FD command.
When bpf_jit_kallsyms is enabled, we can get the address
of the corresponding kernel symbol for a callee function
and resolve the symbol's name. The address is determined
by adding the value of the call instruction's imm field
to __bpf_call_base. This offset gets assigned to the imm
field by the verifier.
For some architectures, such as powerpc64, the imm field
is not large enough to hold this offset.
We resolve this by:
[1] Assigning the subprog id to the imm field of a call
instruction in the verifier instead of the offset of
the callee's symbol's address from __bpf_call_base.
[2] Determining the address of a callee's corresponding
symbol by using the imm field as an index for the
list of kernel symbol addresses now available from
the program info.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The imm field of a bpf instruction is a signed 32-bit integer.
For JITed bpf-to-bpf function calls, it holds the offset of the
start address of the callee's JITed image from __bpf_call_base.
For some architectures, such as powerpc64, this offset may be
as large as 64 bits and cannot be accomodated in the imm field
without truncation.
We resolve this by:
[1] Additionally using the auxiliary data of each function to
keep a list of start addresses of the JITed images for all
functions determined by the verifier.
[2] Retaining the subprog id inside the off field of the call
instructions and using it to index into the list mentioned
above and lookup the callee's address.
To make sure that the existing JIT compilers continue to work
without requiring changes, we keep the imm field as it is.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sparse warning:
kernel/bpf/btf.c:1985:34: warning: Variable length array is used.
This patch directly uses ARRAY_SIZE().
Fixes: f80442a4cd ("bpf: btf: Change how section is supported in btf_header")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id"
could cause confusion because the "id" of "btf_id" means the BPF obj id
given to the BTF object while
"btf_key_id" and "btf_value_id" means the BTF type id within
that BTF object.
To make it clear, btf_key_id and btf_value_id are
renamed to btf_key_type_id and btf_value_type_id.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch does the followings:
1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k. We can
raise it later.
2. Remove the BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID. They are
currently encoded at the highest bit of a u32.
It is because the current use case does not require supporting
parent type (i.e type_id referring to a type in another BTF file).
It also does not support referring to a string in ELF.
The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced
by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK which are
defined in btf.c instead of uapi/linux/btf.h.
3. Limit the BTF_INFO_KIND from 5 bits to 4 bits which is enough.
There is unused bits headroom if we ever needed it later.
4. The root bit in BTF_INFO is also removed because it is not
used in the current use case.
5. Remove BTF_INT_VARARGS since func type is not supported now.
The BTF_INT_ENCODING is limited to 4 bits instead of 8 bits.
The above can be added back later because the verifier
ensures the unused bits are zeros.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Instead of ingoring the array->index_type field. Enforce that
it must be a BTF_KIND_INT in size 1/2/4/8 bytes.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There are currently unused section descriptions in the btf_header. Those
sections are here to support future BTF use cases. For example, the
func section (func_off) is to support function signature (e.g. the BPF
prog function signature).
Instead of spelling out all potential sections up-front in the btf_header.
This patch makes changes to btf_header such that extending it (e.g. adding
a section) is possible later. The unused ones can be removed for now and
they can be added back later.
This patch:
1. adds a hdr_len to the btf_header. It will allow adding
sections (and other info like parent_label and parent_name)
later. The check is similar to the existing bpf_attr.
If a user passes in a longer hdr_len, the kernel
ensures the extra tailing bytes are 0.
2. allows the section order in the BTF object to be
different from its sec_off order in btf_header.
3. each sec_off is followed by a sec_len. It must not have gap or
overlapping among sections.
The string section is ensured to be at the end due to the 4 bytes
alignment requirement of the type section.
The above changes will allow enough flexibility to
add new sections (and other info) to the btf_header later.
This patch also removes an unnecessary !err check
at the end of btf_parse().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch exposes check_uarg_tail_zero() which will
be reused by a later BTF patch. Its name is changed to
bpf_check_uarg_tail_zero().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.
TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.
The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.
Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge speculative store buffer bypass fixes from Thomas Gleixner:
- rework of the SPEC_CTRL MSR management to accomodate the new fancy
SSBD (Speculative Store Bypass Disable) bit handling.
- the CPU bug and sysfs infrastructure for the exciting new Speculative
Store Bypass 'feature'.
- support for disabling SSB via LS_CFG MSR on AMD CPUs including
Hyperthread synchronization on ZEN.
- PRCTL support for dynamic runtime control of SSB
- SECCOMP integration to automatically disable SSB for sandboxed
processes with a filter flag for opt-out.
- KVM integration to allow guests fiddling with SSBD including the new
software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on
AMD.
- BPF protection against SSB
.. this is just the core and x86 side, other architecture support will
come separately.
* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
bpf: Prevent memory disambiguation attack
x86/bugs: Rename SSBD_NO to SSB_NO
KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
x86/bugs: Rework spec_ctrl base and mask logic
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
x86/speculation: Rework speculative_store_bypass_update()
x86/speculation: Add virtualized speculative store bypass disable support
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Handle HT correctly on AMD
x86/cpufeatures: Add FEATURE_ZEN
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
KVM: SVM: Move spec control call after restore of GS
x86/cpu: Make alternative_msr_write work for 32-bit code
x86/bugs: Fix the parameters alignment and missing void
x86/bugs: Make cpu_show_common() static
...
Currently sk_msg programs only have access to the raw data. However,
it is often useful when building policies to have the policies specific
to the socket endpoint. This allows using the socket tuple as input
into filters, etc.
This patch adds ctx access to the sock fields.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Clean up SPDX-License-Identifier and removing licensing leftovers.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Recently during testing, I ran into the following panic:
[ 207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
[ 207.901637] Modules linked in: binfmt_misc [...]
[ 207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G W 4.17.0-rc3+ #7
[ 207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
[ 207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
[ 207.992603] lr : 0xffff000000bdb754
[ 207.996080] sp : ffff000013703ca0
[ 207.999384] x29: ffff000013703ca0 x28: 0000000000000001
[ 208.004688] x27: 0000000000000001 x26: 0000000000000000
[ 208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
[ 208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
[ 208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
[ 208.025903] x19: ffff000009578000 x18: 0000000000000a03
[ 208.031206] x17: 0000000000000000 x16: 0000000000000000
[ 208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
[ 208.041813] x13: 0000000000000000 x12: 0000000000000000
[ 208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
[ 208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
[ 208.057723] x7 : 000000000000000a x6 : 00280c6160000000
[ 208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
[ 208.068329] x3 : 000000000008647a x2 : 19868179b1484500
[ 208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
[ 208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
[ 208.086235] Call trace:
[ 208.088672] bpf_skb_load_helper_8_no_cache+0x34/0xc0
[ 208.093713] 0xffff000000bdb754
[ 208.096845] bpf_test_run+0x78/0xf8
[ 208.100324] bpf_prog_test_run_skb+0x148/0x230
[ 208.104758] sys_bpf+0x314/0x1198
[ 208.108064] el0_svc_naked+0x30/0x34
[ 208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
[ 208.117717] ---[ end trace 263cb8a59b5bf29f ]---
The program itself which caused this had a long jump over the whole
instruction sequence where all of the inner instructions required
heavy expansions into multiple BPF instructions. Additionally, I also
had BPF hardening enabled which requires once more rewrites of all
constant values in order to blind them. Each time we rewrite insns,
bpf_adj_branches() would need to potentially adjust branch targets
which cross the patchlet boundary to accommodate for the additional
delta. Eventually that lead to the case where the target offset could
not fit into insn->off's upper 0x7fff limit anymore where then offset
wraps around becoming negative (in s16 universe), or vice versa
depending on the jump direction.
Therefore it becomes necessary to detect and reject any such occasions
in a generic way for native eBPF and cBPF to eBPF migrations. For
the latter we can simply check bounds in the bpf_convert_filter()'s
BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
of subsequent hardening) is a bit more complex in that we need to
detect such truncations before hitting the bpf_prog_realloc(). Thus
the latter is split into an extra pass to probe problematic offsets
on the original program in order to fail early. With that in place
and carefully tested I no longer hit the panic and the rewrites are
rejected properly. The above example panic I've seen on bpf-next,
though the issue itself is generic in that a guard against this issue
in bpf seems more appropriate in this case.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the sockmap design BPF programs (SK_SKB_STREAM_PARSER,
SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap
map type and when a sock is added to the map the programs are used by
the socket. However, sockmap updates from both userspace and BPF
programs can happen concurrently with the attach and detach of these
programs.
To resolve this we use the bpf_prog_inc_not_zero and a READ_ONCE()
primitive to ensure the program pointer is not refeched and
possibly NULL'd before the refcnt increment. This happens inside
a RCU critical section so although the pointer reference in the map
object may be NULL (by a concurrent detach operation) the reference
from READ_ONCE will not be free'd until after grace period. This
ensures the object returned by READ_ONCE() is valid through the
RCU criticl section and safe to use as long as we "know" it may
be free'd shortly.
Daniel spotted a case in the sock update API where instead of using
the READ_ONCE() program reference we used the pointer from the
original map, stab->bpf_{verdict|parse|txmsg}. The problem with this
is the logic checks the object returned from the READ_ONCE() is not
NULL and then tries to reference the object again but using the
above map pointer, which may have already been NULL'd by a parallel
detach operation. If this happened bpf_porg_inc_not_zero could
dereference a NULL pointer.
Fix this by using variable returned by READ_ONCE() that is checked
for NULL.
Fixes: 2f857d0460 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If the user were to only attach one of the parse or verdict programs
then it is possible a subsequent sockmap update could incorrectly
decrement the refcnt on the program. This happens because in the
rollback logic, after an error, we have to decrement the program
reference count when its been incremented. However, we only increment
the program reference count if the user has both a verdict and a
parse program. The reason for this is because, at least at the
moment, both are required for any one to be meaningful. The problem
fixed here is in the rollback path we decrement the program refcnt
even if only one existing. But we never incremented the refcnt in
the first place creating an imbalance.
This patch fixes the error path to handle this case.
Fixes: 2f857d0460 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
`e' is being freed twice.
Fix this by removing one of the kfree() calls.
Addresses-Coverity-ID: 1468983 ("Double free")
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There is a potential execution path in which variable err is
returned without being properly initialized previously.
Fix this by initializing variable err to 0.
Addresses-Coverity-ID: 1468964 ("Uninitialized scalar variable")
Fixes: e5cd3abcb3 ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-05-17
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Provide a new BPF helper for doing a FIB and neighbor lookup
in the kernel tables from an XDP or tc BPF program. The helper
provides a fast-path for forwarding packets. The API supports
IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
implemented in this initial work, from David (Ahern).
2) Just a tiny diff but huge feature enabled for nfp driver by
extending the BPF offload beyond a pure host processing offload.
Offloaded XDP programs are allowed to set the RX queue index and
thus opening the door for defining a fully programmable RSS/n-tuple
filter replacement. Once BPF decided on a queue already, the device
data-path will skip the conventional RSS processing completely,
from Jakub.
3) The original sockmap implementation was array based similar to
devmap. However unlike devmap where an ifindex has a 1:1 mapping
into the map there are use cases with sockets that need to be
referenced using longer keys. Hence, sockhash map is added reusing
as much of the sockmap code as possible, from John.
4) Introduce BTF ID. The ID is allocatd through an IDR similar as
with BPF maps and progs. It also makes BTF accessible to user
space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
through BPF_OBJ_GET_INFO_BY_FD, from Martin.
5) Enable BPF stackmap with build_id also in NMI context. Due to the
up_read() of current->mm->mmap_sem build_id cannot be parsed.
This work defers the up_read() via a per-cpu irq_work so that
at least limited support can be enabled, from Song.
6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
JIT conversion as well as implementation of an optimized 32/64 bit
immediate load in the arm64 JIT that allows to reduce the number of
emitted instructions; in case of tested real-world programs they
were shrinking by three percent, from Daniel.
7) Add ifindex parameter to the libbpf loader in order to enable
BPF offload support. Right now only iproute2 can load offloaded
BPF and this will also enable libbpf for direct integration into
other applications, from David (Beckett).
8) Convert the plain text documentation under Documentation/bpf/ into
RST format since this is the appropriate standard the kernel is
moving to for all documentation. Also add an overview README.rst,
from Jesper.
9) Add __printf verification attribute to the bpf_verifier_vlog()
helper. Though it uses va_list we can still allow gcc to check
the format string, from Mathieu.
10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.
11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.
12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in hashmap then causing overflows
in htab->elem_size. Reject bogus attr->key_size early in the
sock_hash_alloc(), from Yonghong.
13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.
14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.
15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.
16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When an error happens in the update sockmap element logic also pass
the err up to the user.
Fixes: e5cd3abcb3 ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
syzbot reported a kernel warning below:
WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
panic+0x22f/0x4de kernel/panic.c:184
__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
RSP: 0018:ffff8801d907fc58 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8801aeecb280 RCX: ffffffff8185ebd7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffe1
RBP: ffff8801d907fc58 R08: ffff8801adb5e1c0 R09: ffffed0035a84700
R10: ffffed0035a84700 R11: ffff8801ad423803 R12: ffff8801aeecb280
R13: 00000000fffffff4 R14: ffff8801ad891a00 R15: 00000000014200c0
__do_kmalloc mm/slab.c:3713 [inline]
__kmalloc+0x25/0x760 mm/slab.c:3727
kmalloc include/linux/slab.h:517 [inline]
map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
__do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
__se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
__x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The test case is against sock hashmap with a key size 0xffffffe1.
Such a large key size will cause the below code in function
sock_hash_alloc() overflowing and produces a smaller elem_size,
hence map creation will be successful.
htab->elem_size = sizeof(struct htab_elem) +
round_up(htab->map.key_size, 8);
Later, when map_get_next_key is called and kernel tries
to allocate the key unsuccessfully, it will issue
the above warning.
Similar to hashtab, ensure the key size is at most
MAX_BPF_STACK for a successful map creation.
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap which also uses four bytes keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example users may want to use the 5-tuple
of the socket as the lookup key.
To support this add hash support.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper codes to
work for both sockmap bpf map types that are backed by an array, the
currently supported type, and the new hash backed bpf map type
sockhash.
Most the fallout comes from three changes,
- Pushing bpf programs into an independent structure so we
can use it from the htab struct in the next patch.
- Generalizing helpers to use void *key instead of the hardcoded
u32.
- Instead of passing map/key through the metadata we now do
the lookup inline. This avoids storing the key in the metadata
which will be useful when keys can be longer than 4 bytes. We
rename the sk pointers to sk_redir at this point as well to
avoid any confusion between the current sk pointer and the
redirect pointer sk_redir.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, we cannot parse build_id in nmi context because of
up_read(¤t->mm->mmap_sem), this makes stackmap with build_id
less useful. This patch enables parsing build_id in nmi by putting
the up_read() call in irq_work. To avoid memory allocation in nmi
context, we use per cpu variable for the irq_work. As a result, only
one irq_work per cpu is allowed. If the irq_work is in-use, we
fallback to only report ips.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The bpf syscall and selftests conflicts were trivial
overlapping changes.
The r8169 change involved moving the added mdelay from 'net' into a
different function.
A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts. I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".
Signed-off-by: David S. Miller <davem@davemloft.net>
It's fairly easy for offloaded XDP programs to select the RX queue
packets go to. We need a way of expressing this in the software.
Allow write to the rx_queue_index field of struct xdp_md for
device-bound programs.
Skip convert_ctx_access callback entirely for offloads.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's
info.info is directly filled with the BTF binary data. It is
not extensible. In this case, we want to add BTF ID.
This patch adds "struct bpf_btf_info" which has the BTF ID as
one of its member. The BTF binary data itself is exposed through
the "btf" and "btf_size" members.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch gives an ID to each loaded BTF. The ID is allocated by
the idr like the existing prog-id and map-id.
The bpf_put(map->btf) is moved to __bpf_map_put() so that the
userspace can stop seeing the BTF ID ASAP when the last BTF
refcnt is gone.
It also makes BTF accessible from userspace through the
1. new BPF_BTF_GET_FD_BY_ID command. It is limited to CAP_SYS_ADMIN
which is inline with the BPF_BTF_LOAD cmd and the existing
BPF_[MAP|PROG]_GET_FD_BY_ID cmd.
2. new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info"
Once the BTF ID handler is accessible from userspace, freeing a BTF
object has to go through a rcu period. The BPF_BTF_GET_FD_BY_ID cmd
can then be done under a rcu_read_lock() instead of taking
spin_lock.
[Note: A similar rcu usage can be done to the existing
bpf_prog_get_fd_by_id() in a follow up patch]
When processing the BPF_BTF_GET_FD_BY_ID cmd,
refcount_inc_not_zero() is needed because the BTF object
could be already in the rcu dead row . btf_get() is
removed since its usage is currently limited to btf.c
alone. refcount_inc() is used directly instead.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If CONFIG_REFCOUNT_FULL=y, refcount_inc() WARN when refcount is 0.
When creating a new btf, the initial btf->refcnt is 0 and
triggered the following:
[ 34.855452] refcount_t: increment on 0; use-after-free.
[ 34.856252] WARNING: CPU: 6 PID: 1857 at lib/refcount.c:153 refcount_inc+0x26/0x30
....
[ 34.868809] Call Trace:
[ 34.869168] btf_new_fd+0x1af6/0x24d0
[ 34.869645] ? btf_type_seq_show+0x200/0x200
[ 34.870212] ? lock_acquire+0x3b0/0x3b0
[ 34.870726] ? security_capable+0x54/0x90
[ 34.871247] __x64_sys_bpf+0x1b2/0x310
[ 34.871761] ? __ia32_sys_bpf+0x310/0x310
[ 34.872285] ? bad_area_access_error+0x310/0x310
[ 34.872894] do_syscall_64+0x95/0x3f0
This patch uses refcount_set() instead.
Reported-by: Yonghong Song <yhs@fb.com>
Tested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'. Thanks to Daniel for the merge resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
If bpf_map_precharge_memlock() did not fail, then we set err to zero.
However, any subsequent failure from either alloc_percpu() or the
bpf_map_area_alloc() will return ERR_PTR(0) which in find_and_alloc_map()
will cause NULL pointer deref.
In devmap we have the convention that we return -EINVAL on page count
overflow, so keep the same logic here and just set err to -ENOMEM
after successful bpf_map_precharge_memlock().
Fixes: fbfc504a24 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Björn Töpel <bjorn.topel@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Comments in the verifier refer to free_bpf_prog_info() which
seems to have never existed in tree. Replace it with
free_used_maps().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Offloads may find host map pointers more useful than map fds.
Map pointers can be used to identify the map, while fds are
only valid within the context of loading process.
Jump to skip_full_check on error in case verifier log overflow
has to be handled (replace_map_fd_with_map_ptr() prints to the
log, driver prep may do that too in the future).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
bpf_event_output() is useful for offloads to add events to BPF
event rings, export it. Note that export is placed near the stub
since tracing is optional and kernel/bpf/core.c is always going
to be built.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
For asynchronous events originating from the device, like perf event
output, we need to be able to make sure that objects being referred
to by the FW message are valid on the host. FW events can get queued
and reordered. Even if we had a FW message "barrier" we should still
protect ourselves from bogus FW output.
Add a reverse-mapping hash table and record in it all raw map pointers
FW may refer to. Only record neutral maps, i.e. perf event arrays.
These are currently the only objects FW can refer to. Use RCU protection
on the read side, update side is under RTNL.
Since program vs map destruction order is slightly painful for offload
simply take an extra reference on all the recorded maps to make sure
they don't disappear.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
BPF_MAP_TYPE_PERF_EVENT_ARRAY is special as far as offload goes.
The map only holds glue to perf ring, not actual data. Allow
non-offloaded perf event arrays to be used in offloaded programs.
Offload driver can extract the events from HW and put them in
the map for user space to retrieve.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There are quite a few code snippet like the following in verifier:
subprog_start = 0;
if (env->subprog_cnt == cur_subprog + 1)
subprog_end = insn_cnt;
else
subprog_end = env->subprog_info[cur_subprog + 1].start;
The reason is there is no marker in subprog_info array to tell the end of
it.
We could resolve this issue by introducing a faked "ending" subprog.
The special "ending" subprog is with "insn_cnt" as start offset, so it is
serving as the end mark whenever we iterate over all subprogs.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
It is better to centre all subprog information fields into one structure.
This structure could later serve as function node in call graph.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, verifier treat main prog and subprog differently. All subprogs
detected are kept in env->subprog_starts while main prog is not kept there.
Instead, main prog is implicitly defined as the prog start at 0.
There is actually no difference between main prog and subprog, it is better
to unify them, and register all progs detected into env->subprog_starts.
This could also help simplifying some code logic.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit 9ef09e35e5 ("bpf: fix possible spectre-v1 in find_and_alloc_map()")
converted find_and_alloc_map() over to use array_index_nospec() to sanitize
map type that user space passes on map creation, and this patch does an
analogous conversion for progs in find_prog_type() as it's also passed from
user space when loading progs as attr->prog_type.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:
* fdfaf64e75 ("x86: bpf_jit: support negative offsets")
* 35607b02db ("sparc: bpf_jit: fix loads from negative offsets")
* e0ee9c1215 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
* 07aee94394 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
* 6d59b7dbf7 ("bpf, s390x: do not reload skb pointers in non-skb context")
* 87338c8e2c ("bpf, ppc64: do not reload skb pointers in non-skb context")
For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.
In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.
The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:
test_bpf tcpdump port 22 tcpdump complex
x64 - before 15 21 10 14 19 18
- after 7 10 10 7 10 15
arm64 - before 40 91 92 40 91 151
- after 51 64 73 51 62 113
For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It's possible for userspace to control attr->map_type. Sanitize it when
using it as an array index to prevent an out-of-bounds value being used
under speculation.
Found by smatch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The xskmap is yet another BPF map, very much inspired by
dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
adds AF_XDP sockets into the map, and by using the bpf_redirect_map
helper, an XDP program can redirect XDP frames to an AF_XDP socket.
Note that a socket that is bound to certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than what
the socket is bound to, the frame will not be received on the socket.
A socket can reside in multiple maps.
v3: Fixed race and simplified code.
v2: Removed one indirection in map lookup.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a redirect failure happens we release the buffers in-flight
without calling a sk_mem_uncharge(), the uncharge is called before
dropping the sock lock for the redirecte, however we missed updating
the ring start index. When no apply actions are in progress this
is OK because we uncharge the entire buffer before the redirect.
But, when we have apply logic running its possible that only a
portion of the buffer is being redirected. In this case we only
do memory accounting for the buffer slice being redirected and
expect to be able to loop over the BPF program again and/or if
a sock is closed uncharge the memory at sock destruct time.
With an invalid start index however the program logic looks at
the start pointer index, checks the length, and when seeing the
length is zero (from the initial release and failure to update
the pointer) aborts without uncharging/releasing the remaining
memory.
The fix for this is simply to update the start index. To avoid
fixing this error in two locations we do a small refactor and
remove one case where it is open-coded. Then fix it in the
single function.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When an error occurs during a redirect we have two cases that need
to be handled (i) we have a cork'ed buffer (ii) we have a normal
sendmsg buffer.
In the cork'ed buffer case we don't currently support recovering from
errors in a redirect action. So the buffer is released and the error
should _not_ be pushed back to the caller of sendmsg/sendpage. The
rationale here is the user will get an error that relates to old
data that may have been sent by some arbitrary thread on that sock.
Instead we simple consume the data and tell the user that the data
has been consumed. We may add proper error recovery in the future.
However, this patch fixes a bug where the bytes outstanding counter
sg_size was not zeroed. This could result in a case where if the user
has both a cork'ed action and apply action in progress we may
incorrectly call into the BPF program when the user expected an
old verdict to be applied via the apply action. I don't have a use
case where using apply and cork at the same time is valid but we
never explicitly reject it because it should work fine. This patch
ensures the sg_size is zeroed so we don't have this case.
In the normal sendmsg buffer case (no cork data) we also do not
zero sg_size. Again this can confuse the apply logic when the logic
calls into the BPF program when the BPF programmer expected the old
verdict to remain. So ensure we set sg_size to zero here as well. And
additionally to keep the psock state in-sync with the sk_msg_buff
release all the memory as well. Previously we did this before
returning to the user but this left a gap where psock and sk_msg_buff
states were out of sync which seems fragile. No additional overhead
is taken here except for a call to check the length and realize its
already been freed. This is in the error path as well so in my
opinion lets have robust code over optimized error paths.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When the call to do_tcp_sendpage() fails to send the complete block
requested we either retry if only a partial send was completed or
abort if we receive a error less than or equal to zero. Before
returning though we must update the scatterlist length/offset to
account for any partial send completed.
Before this patch we did this at the end of the retry loop, but
this was buggy when used while applying a verdict to fewer bytes
than in the scatterlist. When the scatterlist length was being set
we forgot to account for the apply logic reducing the size variable.
So the result was we chopped off some bytes in the scatterlist without
doing proper cleanup on them. This results in a WARNING when the
sock is tore down because the bytes have previously been charged to
the socket but are never uncharged.
The simple fix is to simply do the accounting inside the retry loop
subtracting from the absolute scatterlist values rather than trying
to accumulate the totals and subtract at the end.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
tracepoints to bpf core were added as a way to provide introspection
to bpf programs and maps, but after some time it became clear that
this approach is inadequate, so prog_id, map_id and corresponding
get_next_id, get_fd_by_id, get_info_by_fd, prog_query APIs were
introduced and fully adopted by bpftool and other applications.
The tracepoints in bpf core started to rot and causing syzbot warnings:
WARNING: CPU: 0 PID: 3008 at kernel/trace/trace_event_perf.c:274
Kernel panic - not syncing: panic_on_warn set ...
perf_trace_bpf_map_keyval+0x260/0xbd0 include/trace/events/bpf.h:228
trace_bpf_map_update_elem include/trace/events/bpf.h:274 [inline]
map_update_elem kernel/bpf/syscall.c:597 [inline]
SYSC_bpf kernel/bpf/syscall.c:1478 [inline]
Hence this patch deletes tracepoints in bpf core.
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Reported-by: syzbot <bot+a9dbb3c3e64b62536a4bc5ee7bbd4ca627566188@syzkaller.appspotmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When helpers like bpf_get_stack returns an int value
and later on used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
56: (c7) r8 s>>= 32
57: R8=inv(id=0)
With this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
56: (c7) r8 s>>= 32
57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With better range of "R8", later on when "R8" is added to other register,
e.g., a map pointer or scalar-value register, the better register
range can be derived and verifier failure may be avoided.
In our later example,
......
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0)
return 0;
ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
......
Without improving ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by verifier.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In verifier function adjust_scalar_min_max_vals,
when src_known is false and the opcode is BPF_LSH/BPF_RSH,
early return will happen in the function. So remove
the branch in handling BPF_LSH/BPF_RSH when src_known is false.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The special property of return values for helpers bpf_get_stack
and bpf_probe_read_str are captured in verifier.
Both helpers return a negative error code or
a length, which is equal to or smaller than the buffer
size argument. This additional information in the
verifier can avoid the condition such as "retval > bufsize"
in the bpf program. For example, for the code blow,
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0 || usize > max_len)
return 0;
The verifier may have the following errors:
52: (85) call bpf_get_stack#65
R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R9_w=inv800 R10=fp0,call_-1
53: (bf) r8 = r0
54: (bf) r1 = r8
55: (67) r1 <<= 32
56: (bf) r2 = r1
57: (77) r2 >>= 32
58: (25) if r2 > 0x31f goto pc+33
R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
umax_value=18446744069414584320,
var_off=(0x0; 0xffffffff00000000))
R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R8=inv(id=0) R9=inv800 R10=fp0,call_-1
59: (1f) r9 -= r8
60: (c7) r1 s>>= 32
61: (bf) r2 = r7
62: (0f) r2 += r1
math between map_value pointer and register with unbounded
min value is not allowed
The failure is due to llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for condition while later on "r1"
is used for map_ptr operation. The verifier is not able to track such
inst sequence effectively.
Without the "usize > max_len" condition, there is no llvm optimization
and the below generated code passed verifier:
52: (85) call bpf_get_stack#65
R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R9_w=inv800 R10=fp0,call_-1
53: (b7) r1 = 0
54: (bf) r8 = r0
55: (67) r8 <<= 32
56: (c7) r8 s>>= 32
57: (6d) if r1 s> r8 goto pc+24
R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
R1=inv0 R6=ctx(id=0,off=0,imm=0)
R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
R10=fp0,call_-1
58: (bf) r2 = r7
59: (0f) r2 += r8
60: (1f) r9 -= r8
61: (bf) r1 = r6
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.
This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch didn't incur functionality change. The function prototype
got changed so that the same function can be reused later.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-04-27
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add extensive BPF helper description into include/uapi/linux/bpf.h
and a new script bpf_helpers_doc.py which allows for generating a
man page out of it. Thus, every helper in BPF now comes with proper
function signature, detailed description and return code explanation,
from Quentin.
2) Migrate the BPF collect metadata tunnel tests from BPF samples over
to the BPF selftests and further extend them with v6 vxlan, geneve
and ipip tests, simplify the ipip tests, improve documentation and
convert to bpf_ntoh*() / bpf_hton*() api, from William.
3) Currently, helpers that expect ARG_PTR_TO_MAP_{KEY,VALUE} can only
access stack and packet memory. Extend this to allow such helpers
to also use map values, which enabled use cases where value from
a first lookup can be directly used as a key for a second lookup,
from Paul.
4) Add a new helper bpf_skb_get_xfrm_state() for tc BPF programs in
order to retrieve XFRM state information containing SPI, peer
address and reqid values, from Eyal.
5) Various optimizations in nfp driver's BPF JIT in order to turn ADD
and SUB instructions with negative immediate into the opposite
operation with a positive immediate such that nfp can better fit
small immediates into instructions. Savings in instruction count
up to 4% have been observed, from Jakub.
6) Add the BPF prog's gpl_compatible flag to struct bpf_prog_info
and add support for dumping this through bpftool, from Jiri.
7) Move the BPF sockmap samples over into BPF selftests instead since
sockmap was rather a series of tests than sample anyway and this way
this can be run from automated bots, from John.
8) Follow-up fix for bpf_adjust_tail() helper in order to make it work
with generic XDP, from Nikita.
9) Some follow-up cleanups to BTF, namely, removing unused defines from
BTF uapi header and renaming 'name' struct btf_* members into name_off
to make it more clear they are offsets into string section, from Martin.
10) Remove test_sock_addr from TEST_GEN_PROGS in BPF selftests since
not run directly but invoked from test_sock_addr.sh, from Yonghong.
11) Remove redundant ret assignment in sample BPF loader, from Wang.
12) Add couple of missing files to BPF selftest's gitignore, from Anders.
There are two trivial merge conflicts while pulling:
1) Remove samples/sockmap/Makefile since all sockmap tests have been
moved to selftests.
2) Add both hunks from tools/testing/selftests/bpf/.gitignore to the
file since git should ignore all of them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding gpl_compatible flag to struct bpf_prog_info
so it can be dumped via bpf_prog_get_info_by_fd and
displayed via bpftool progs dump.
Alexei noticed 4-byte hole in struct bpf_prog_info,
so we put the u32 flags field in there, and we can
keep adding bit fields in there without breaking
user space.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE can only
access stack and packet memory. Allow these helpers to directly access
map values by passing registers of type PTR_TO_MAP_VALUE.
This change removes the need for an extra copy to the stack when using a
map value to perform a second map lookup, as in the following:
struct bpf_map_def SEC("maps") infobyreq = {
.type = BPF_MAP_TYPE_HASHMAP,
.key_size = sizeof(struct request *),
.value_size = sizeof(struct info_t),
.max_entries = 1024,
};
struct bpf_map_def SEC("maps") counts = {
.type = BPF_MAP_TYPE_HASHMAP,
.key_size = sizeof(struct info_t),
.value_size = sizeof(u64),
.max_entries = 1024,
};
SEC("kprobe/blk_account_io_start")
int bpf_blk_account_io_start(struct pt_regs *ctx)
{
struct info_t *info = bpf_map_lookup_elem(&infobyreq, &ctx->di);
u64 *count = bpf_map_lookup_elem(&counts, info);
(*count)++;
}
Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In the case where the socket memory boundary is hit the redirect
path returns an ENOMEM error. However, before checking for this
condition the redirect scatterlist buffer is setup with a valid
page and length. This is never unwound so when the buffers are
released latter in the error path we do a put_page() and clear
the scatterlist fields. But, because the initial error happens
before completing the scatterlist buffer we end up with both the
original buffer and the redirect buffer pointing to the same page
resulting in duplicate put_page() calls.
To fix this simply move the initial configuration of the redirect
scatterlist buffer below the sock memory check.
Found this while running TCP_STREAM test with netperf using Cilium.
Fixes: fa246693a1 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In the recvmsg handler we need to add a wait event to support the
blocking use cases. Without this we return zero and may confuse
user applications. In the wait event any data received on the
sk either via sk_receive_queue or the psock ingress list will
wake up the sock.
Fixes: fa246693a1 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this BPF programs may not be in-use which is not
what the user would expect.
This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.
Fixes: 3d9e952697 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch cleans up btf.h in uapi:
1) Rename "name" to "name_off" to better reflect it is an offset to the
string section instead of a char array.
2) Remove unused value BTF_FLAGS_COMPR and BTF_MAGIC_SWAP
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Remove dead code that bails on `attr->value_size > KMALLOC_MAX_SIZE` - the
previous check already bails on `attr->value_size != 4`.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds pretty print support to the basic arraymap.
Support for other bpf maps can be added later.
This patch adds new attrs to the BPF_MAP_CREATE command to allow
specifying the btf_fd, btf_key_id and btf_value_id. The
BPF_MAP_CREATE can then associate the btf to the map if
the creating map supports BTF.
A BTF supported map needs to implement two new map ops,
map_seq_show_elem() and map_check_btf(). This patch has
implemented these new map ops for the basic arraymap.
It also adds file_operations, bpffs_map_fops, to the pinned
map such that the pinned map can be opened and read.
After that, the user has an intuitive way to do
"cat bpffs/pathto/a-pinned-map" instead of getting
an error.
bpffs_map_fops should not be extended further to support
other operations. Other operations (e.g. write/key-lookup...)
should be realized by the userspace tools (e.g. bpftool) through
the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
Follow up patches will allow the userspace to obtain
the BTF from a map-fd.
Here is a sample output when reading a pinned arraymap
with the following map's value:
struct map_value {
int count_a;
int count_b;
};
cat /sys/fs/bpf/pinned_array_map:
0: {1,2}
1: {3,4}
2: {5,6}
...
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds BPF_OBJ_GET_INFO_BY_FD support to BTF fd.
The original BTF data, which was used to create the BTF fd during
the earlier BPF_BTF_LOAD call, will be returned.
The userspace is expected to allocate buffer
to info.info and the buffer size is set to info.info_len before
calling BPF_OBJ_GET_INFO_BY_FD.
The original BTF data is copied to the userspace buffer (info.info).
Only upto the user's specified info.info_len will be copied.
The original BTF data size is set to info.info_len. The userspace
needs to check if it is bigger than its allocated buffer size.
If it is, the userspace should realloc with the kernel-returned
info.info_len and call the BPF_OBJ_GET_INFO_BY_FD again.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds a BPF_BTF_LOAD command which
1) loads and verifies the BTF (implemented in earlier patches)
2) returns a BTF fd to userspace. In the next patch, the
BTF fd can be specified during BPF_MAP_CREATE.
It currently limits to CAP_SYS_ADMIN.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds pretty print capability for data with BTF type info.
The current usage is to allow pretty print for a BPF map.
The next few patches will allow a read() on a pinned map with BTF
type info for its key and value.
This patch uses the seq_printf() infra.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch checks a few things of struct's members:
1) It has a valid size (e.g. a "const void" is invalid)
2) A member's size (+ its member's offset) does not exceed
the containing struct's size.
3) The member's offset satisfies the alignment requirement
The above can only be done after the needs_resolve member's type
is resolved. Hence, the above is done together in
btf_struct_resolve().
Each possible member's type (e.g. int, enum, modifier...) implements
the check_member() ops which will be called from btf_struct_resolve().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
After collecting all btf_type in the first pass in an earlier patch,
the second pass (in this patch) can validate the reference types
(e.g. the referring type does exist and it does not refer to itself).
While checking the reference type, it also gathers other information (e.g.
the size of an array). This info will be useful in checking the
struct's members in a later patch. They will also be useful in doing
pretty print later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch introduces BPF type Format (BTF).
BTF (BPF Type Format) is the meta data format which describes
the data types of BPF program/map. Hence, it basically focus
on the C programming language which the modern BPF is primary
using. The first use case is to provide a generic pretty print
capability for a BPF map.
BTF has its root from CTF (Compact C-Type format). To simplify
the handling of BTF data, BTF removes the differences between
small and big type/struct-member. Hence, BTF consistently uses u32
instead of supporting both "one u16" and "two u32 (+padding)" in
describing type and struct-member.
It also raises the number of types (and functions) limit
from 0x7fff to 0x7fffffff.
Due to the above changes, the format is not compatible to CTF.
Hence, BTF starts with a new BTF_MAGIC and version number.
This patch does the first verification pass to the BTF. The first
pass checks:
1. meta-data size (e.g. It does not go beyond the total btf's size)
2. name_offset is valid
3. Each BTF_KIND (e.g. int, enum, struct....) does its
own check of its meta-data.
Some other checks, like checking a struct's member is referring
to a valid type, can only be done in the second pass. The second
verification pass will be implemented in the next patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.
When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".
This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.
It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The generic xdp_frame format, was inspired by the cpumap own internal
xdp_pkt format. It is now time to convert it over to the generic
xdp_frame format. The cpumap needs one extra field dev_rx.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it have queued XDP frame structure to leverage.
V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
V6: Remove comment that id will be added later (Req by Alex Duyck)
V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
======================================================
WARNING: possible circular locking dependency detected
4.16.0-rc7+ #3 Not tainted
------------------------------------------------------
syz-executor7/24531 is trying to acquire lock:
(bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
but task is already holding lock:
(&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_sem){++++}:
__might_fault+0x13a/0x1d0 mm/memory.c:4571
_copy_to_user+0x2c/0xc0 lib/usercopy.c:25
copy_to_user include/linux/uaccess.h:155 [inline]
bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
_perf_ioctl kernel/events/core.c:4750 [inline]
perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
SYSC_ioctl fs/ioctl.c:701 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> #0 (bpf_event_mutex){+.+.}:
lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
__mutex_lock_common kernel/locking/mutex.c:756 [inline]
__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
_free_event+0xbdb/0x10f0 kernel/events/core.c:4116
put_event+0x24/0x30 kernel/events/core.c:4204
perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
remove_vma+0xb4/0x1b0 mm/mmap.c:172
remove_vma_list mm/mmap.c:2490 [inline]
do_munmap+0x82a/0xdf0 mm/mmap.c:2731
mmap_region+0x59e/0x15a0 mm/mmap.c:1646
do_mmap+0x6c0/0xe00 mm/mmap.c:1483
do_mmap_pgoff include/linux/mm.h:2223 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
*** DEADLOCK ***
======================================================
The bug is introduced by Commit f371b304f1 ("bpf/tracing: allow
user space to query prog array on the same tp") where copy_to_user,
which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then subsequent
perf_event_detach_bpf_prog needs bpf_event_mutex lock.
Such a senario caused a deadlock.
As suggested by Daniel, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.
Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There will be a build warning -Wunused-function if CONFIG_CGROUP_BPF
isn't defined, since the only user is inside #ifdef CONFIG_CGROUP_BPF:
kernel/bpf/syscall.c:1229:12: warning: ‘bpf_prog_attach_check_attach_type’
defined but not used [-Wunused-function]
static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Current code moves function bpf_prog_attach_check_attach_type inside
ifdef CONFIG_CGROUP_BPF.
Fixes: 5e43f899b0 ("bpf: Check attach type at prog load time")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
It is possible to have multiple ULP tcp_release call paths in flight
if a sock is closed and simultaneously being removed from the sockmap
control path. The result would be setting the sk_prot to the saved
values on the first iteration and then on the second iteration setting
the value to NULL.
This patch resolves this by ensuring we only reset the sk_prot pointer
if we have a valid saved state to set.
Fixes: 4f738adba3 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If a socket with pending cork data is closed we do not return the
memory to the socket until the garbage collector free's the psock
structure. The garbage collector though can run after the sock has
completed its close operation. If this ordering happens the sock code
will through a WARN_ON because there is still outstanding memory
accounted to the sock.
To resolve this ensure we return memory to the sock when a socket
is closed.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Fixes: 91843d540a ("bpf: sockmap, add msg_cork_bytes() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
"Post-hooks" are hooks that are called right before returning from
sys_bind. At this time IP and port are already allocated and no further
changes to `struct sock` can happen before returning from sys_bind but
BPF program has a chance to inspect the socket and change sys_bind
result.
Specifically it can e.g. inspect what port was allocated and if it
doesn't satisfy some policy, BPF program can force sys_bind to fail and
return EPERM to user.
Another example of usage is recording the IP:port pair to some map to
use it in later calls to sys_connect. E.g. if some TCP server inside
cgroup was bound to some IP:port_n, it can be recorded to a map. And
later when some TCP client inside same cgroup is trying to connect to
127.0.0.1:port_n, BPF hook for sys_connect can override the destination
and connect application to IP:port_n instead of 127.0.0.1:port_n. That
helps forcing all applications inside a cgroup to use desired IP and not
break those applications if they e.g. use localhost to communicate
between each other.
== Implementation details ==
Post-hooks are implemented as two new attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
Separate attach types for IPv4 and IPv6 are introduced to avoid access
to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
`inet6_bind()` since those fields might not make sense in such cases.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
== The problem ==
See description of the problem in the initial patch of this patch set.
== The solution ==
The patch provides much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connecttion from desired IP.
It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.
Local end of connection can be bound to desired IP using newly
introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
and doesn't support binding to port, i.e. leverages
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
significantly;
* there is no use-case for port.
As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.
Support is added for IPv4 and IPv6, for TCP and UDP.
IPv4 and IPv6 have separate attach types for same reason as sys_bind
hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
when user passes sockaddr_in since it'd be out-of-bound.
== Implementation notes ==
The patch introduces new field in `struct proto`: `pre_connect` that is
a pointer to a function with same signature as `connect` but is called
before it. The reason is in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.
bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
of value of `bind_address_no_port` socket field.
bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
and __inet6_bind() since all call-sites, where bpf_bind() is called,
already hold socket lock.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
== The problem ==
There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured. Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.
Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.
== The solution ==
The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).
It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.
The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.
== Implementation notes ==
[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.
[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".
There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
== The problem ==
There are use-cases when a program of some type can be attached to
multiple attach points and those attach points must have different
permissions to access context or to call helpers.
E.g. context structure may have fields for both IPv4 and IPv6 but it
doesn't make sense to read from / write to IPv6 field when attach point
is somewhere in IPv4 stack.
Same applies to BPF-helpers: it may make sense to call some helper from
some attach point, but not from other for same prog type.
== The solution ==
Introduce `expected_attach_type` field in in `struct bpf_attr` for
`BPF_PROG_LOAD` command. If scenario described in "The problem" section
is the case for some prog type, the field will be checked twice:
1) At load time prog type is checked to see if attach type for it must
be known to validate program permissions correctly. Prog will be
rejected with EINVAL if it's the case and `expected_attach_type` is
not specified or has invalid value.
2) At attach time `attach_type` is compared with `expected_attach_type`,
if prog type requires to have one, and, if they differ, attach will
be rejected with EINVAL.
The `expected_attach_type` is now available as part of `struct bpf_prog`
in both `bpf_verifier_ops->is_valid_access()` and
`bpf_verifier_ops->get_func_proto()` () and can be used to check context
accesses and calls to helpers correspondingly.
Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
Daniel Borkmann <daniel@iogearbox.net> here:
https://marc.info/?l=linux-netdev&m=152107378717201&w=2
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized in
sg_init_table() and it is verified in sg api while navigating. We hit
BUG_ON when magic check is failed.
In functions sg_tcp_sendpage and sg_tcp_sendmsg, the struct containing
the scatterlist is already zeroed out. So to avoid extra memset, we
use sg_init_marker() to initialize sg_magic.
Fixed following things:
- In bpf_tcp_sendpage: initialize sg using sg_init_marker
- In bpf_tcp_sendmsg: Replace sg_init_table with sg_init_marker
- In bpf_tcp_push: Replace memset with sg_init_table where consumed
sg entry needs to be re-initialized.
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add support for the BPF_F_INGRESS flag in skb redirect helper. To
do this convert skb into a scatterlist and push into ingress queue.
This is the same logic that is used in the sk_msg redirect helper
so it should feel familiar.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add support for the BPF_F_INGRESS flag in sk_msg redirect helper.
To do this add a scatterlist ring for receiving socks to check
before calling into regular recvmsg call path. Additionally, because
the poll wakeup logic only checked the skb recv queue we need to
add a hook in TCP stack (similar to write side) so that we have
a way to wake up polling socks when a scatterlist is redirected
to that sock.
After this all that is needed is for the redirect helper to
push the scatterlist into the psock receive queue.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
kernel internal arguments of the tracepoints in their raw form.
>From bpf program point of view the access to the arguments look like:
struct bpf_raw_tracepoint_args {
__u64 args[0];
};
int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
{
// program can read args[N] where N depends on tracepoint
// and statically verified at program load+attach time
}
kprobe+bpf infrastructure allows programs access function arguments.
This feature allows programs access raw tracepoint arguments.
Similar to proposed 'dynamic ftrace events' there are no abi guarantees
to what the tracepoints arguments are and what their meaning is.
The program needs to type cast args properly and use bpf_probe_read()
helper to access struct fields when argument is a pointer.
For every tracepoint __bpf_trace_##call function is prepared.
In assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
0xffffffff81132080 <+0>: mov %ecx,%ecx
0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3>
where
TRACE_EVENT(xdp_exception,
TP_PROTO(const struct net_device *dev,
const struct bpf_prog *xdp, u32 act),
The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
and in total this approach adds 7k bytes to .text.
This approach gives the lowest possible overhead
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since tracepoint+bpf are used at speeds of 1M+ events per second
this is valuable optimization.
The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
that returns anon_inode FD of 'bpf-raw-tracepoint' object.
The user space looks like:
// load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
prog_fd = bpf_prog_load(...);
// receive anon_inode fd for given bpf_raw_tracepoint with prog attached
raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
Ctrl-C of tracing daemon or cmdline tool that uses this feature
will automatically detach bpf program, unload it and
unregister tracepoint probe.
On the kernel side the __bpf_raw_tp_map section of pointers to
tracepoint definition and to __bpf_trace_*() probe function is used
to find a tracepoint with "xdp_exception" name and
corresponding __bpf_trace_xdp_exception() probe function
which are passed to tracepoint_probe_register() to connect probe
with tracepoint.
Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
tracepoint mechanisms. perf_event_open() can be used in parallel
on the same tracepoint.
Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
Each with its own bpf program. The kernel will execute
all tracepoint probes and all attached bpf programs.
In the future bpf_raw_tracepoints can be extended with
query/introspection logic.
__bpf_raw_tp_map section logic was contributed by Steven Rostedt
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Generally we do a preload before doing idr allocation. This also help
improve the allocation success rate in memory pressure.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The BTF (BPF Type Format) verifier needs to reuse the current
BPF verifier log. Hence, it requires the following changes:
(1) Expose log_write() in verifier.c for other users.
Its name is renamed to bpf_verifier_vlog().
(2) The BTF verifier also needs to check
'log->level && log->ubuf && !bpf_verifier_log_full(log);'
independently outside of the current log_write(). It is
because the BTF verifier will do one-check before
making multiple calls to btf_verifier_vlog to log
the details of a type.
Hence, this check is also re-factored to a new function
bpf_verifier_log_needed(). Since it is re-factored,
we can check it before va_start() in the current
bpf_verifier_log_write() and verbose().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We use print_bpf_insn in user space (bpftool and soon perf),
so it'd be nice to keep it generic and strip it off the kernel
struct bpf_verifier_env argument.
This argument can be safely removed, because its users can
use the struct bpf_insn_cbs::private_data to pass it.
By changing the argument type we can no longer have clean
'verbose' alias to 'bpf_verifier_log_write' in verifier.c.
Instead we're adding the 'verbose' cb_print callback and
removing the alias.
This way we have new cb_print callback in place, and all
the 'verbose(env, ...) calls in verifier.c will cleanly
cast to 'verbose(void *, ...)' so no other change is
needed.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current check statement in BPF syscall will do a capability check
for CAP_SYS_ADMIN before checking sysctl_unprivileged_bpf_disabled. This
code path will trigger unnecessary security hooks on capability checking
and cause false alarms on unprivileged process trying to get CAP_SYS_ADMIN
access. This can be resolved by simply switch the order of the statement
and CAP_SYS_ADMIN is not required anyway if unprivileged bpf syscall is
allowed.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
Similar to previous sockmap usages when a sock is added to a
sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
program type attached then the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in sendmsg case and page/offset in sendpage case.
BPF_PROG_TYPE_SK_MSG Semantics/API:
BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
case and in the sendpage case leaves the data untouched. Both cases
return -EACESS to the user. Returning SK_PASS will allow the msg to
be sent.
In the sendmsg case data is copied into kernel space buffers before
running the BPF program. The kernel space buffers are stored in a
scatterlist object where each element is a kernel memory buffer.
Some effort is made to coalesce data from the sendmsg call here.
For example a sendmsg call with many one byte iov entries will
likely be pushed into a single entry. The BPF program is run with
data pointers (start/end) pointing to the first sg element.
In the sendpage case data is not copied. We opt not to copy the
data by default here, because the BPF infrastructure does not
know what bytes will be needed nor when they will be needed. So
copying all bytes may be wasteful. Because of this the initial
start/end data pointers are (0,0). Meaning no data can be read or
written. This avoids reading data that may be modified by the
user. A new helper is added later in this series if reading and
writing the data is needed. The helper call will do a copy by
default so that the page is exclusively owned by the BPF call.
The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
in the sendmsg() case and the entire page/offset in the sendpage case.
This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. Again a helper is added later in the series if
a verdict needs to apply to multiple system calls and/or only
a subpart of the currently being processed message.
The helper msg_redirect_map() can be used to select the socket to
send the data on. This is used similar to existing redirect use
cases. This allows policy to redirect msgs.
Pseudo code simple example:
The basic logic to attach a program to a socket is as follows,
// load the programs
bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
&obj, &msg_prog);
// lookup the sockmap
bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
// get fd for sockmap
map_fd_msg = bpf_map__fd(bpf_map_msg);
// attach program to sockmap
bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
Adding sockets to the map is done in the normal way,
// Add a socket 'fd' to sockmap at location 'i'
bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.
For a complete example see BPF selftests or sockmap samples.
Implementation notes:
It seemed the simplest, to me at least, to use a refcnt to ensure
psock is not lost across the sendmsg copy into the sg, the bpf program
running on the data in sg_data, and the final pass to the TCP stack.
Some performance testing may show a better method to do this and avoid
the refcnt cost, but for now use the simpler method.
Another item that will come after basic support is in place is
supporting MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other features TBD include ZEROCOPY and the
TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
shortly.
Future work could improve size limits on the scatterlist rings used
here. Currently, we use MAX_SKB_FRAGS simply because this was being
used already in the TLS case. Future work could extend the kernel sk
APIs to tune this depending on workload. This is a trade-off
between memory usage and throughput performance.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The sockmap refcnt up until now has been wrapped in the
sk_callback_lock(). So its not actually needed any locking of its
own. The counter itself tracks the lifetime of the psock object.
Sockets in a sockmap have a lifetime that is independent of the
map they are part of. This is possible because a single socket may
be in multiple maps. When this happens we can only release the
psock data associated with the socket when the refcnt reaches
zero. There are three possible delete sock reference decrement
paths first through the normal sockmap process, the user deletes
the socket from the map. Second the map is removed and all sockets
in the map are removed, delete path is similar to case 1. The third
case is an asyncronous socket event such as a closing the socket. The
last case handles removing sockets that are no longer available.
For completeness, although inc does not pose any problems in this
patch series, the inc case only happens when a psock is added to a
map.
Next we plan to add another socket prog type to handle policy and
monitoring on the TX path. When we do this however we will need to
keep a reference count open across the sendmsg/sendpage call and
holding the sk_callback_lock() here (on every send) seems less than
ideal, also it may sleep in cases where we hit memory pressure.
Instead of dealing with these issues in some clever way simply make
the reference counting a refcnt_t type and do proper atomic ops.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, bpf stackmap store address for each entry in the call trace.
To map these addresses to user space files, it is necessary to maintain
the mapping from these virtual address to symbols in the binary. Usually,
the user space profiler (such as perf) has to scan /proc/pid/maps at the
beginning of profiling, and monitor mmap2() calls afterwards. Given the
cost of maintaining the address map, this solution is not practical for
system wide profiling that is always on.
This patch tries to solve this problem with a variation of stackmap. This
variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
addresses, the variation stores ELF file build_id + offset.
Build ID is a 20-byte unique identifier for ELF files. The following
command shows the Build ID of /bin/bash:
[user@]$ readelf -n /bin/bash
...
Build ID: XXXXXXXXXX
...
With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
for each entry in the call trace, and translate it into the following
struct:
struct bpf_stack_build_id_offset {
__s32 status;
unsigned char build_id[BPF_BUILD_ID_SIZE];
union {
__u64 offset;
__u64 ip;
};
};
The search of build_id is limited to the first page of the file, and this
page should be in page cache. Otherwise, we fallback to store ip for this
entry (ip field in struct bpf_stack_build_id_offset). This requires the
build_id to be stored in the first page. A quick survey of binary and
dynamic library files in a few different systems shows that almost all
binary and dynamic library files have build_id in the first page.
Build_id is only meaningful for user stack. If a kernel stack is added to
a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fallback to
only store ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id
lookup failed for some reason, it will also fallback to store ip.
User space can access struct bpf_stack_build_id_offset with bpf
syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
maintain mapping from build id to binary files. This mostly static
mapping is much easier to maintain than per process address maps.
Note: Stackmap with build_id only works in non-nmi context at this time.
This is because we need to take mm->mmap_sem for find_vma(). If this
changes, we would like to allow build_id lookup in nmi context.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>