tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of version 2 of the GNU General Public
|
|
|
|
* License as published by the Free Software Foundation.
|
|
|
|
*/
|
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/types.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/bpf.h>
|
|
|
|
#include <linux/filter.h>
|
|
|
|
#include <linux/uaccess.h>
|
2015-03-26 02:49:22 +07:00
|
|
|
#include <linux/ctype.h>
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
#include "trace.h"
|
|
|
|
|
|
|
|
/**
|
|
|
|
* trace_call_bpf - invoke BPF program
|
|
|
|
* @prog: BPF program
|
|
|
|
* @ctx: opaque context pointer
|
|
|
|
*
|
|
|
|
* kprobe handlers execute BPF programs via this helper.
|
|
|
|
* Can be used from static tracepoints in the future.
|
|
|
|
*
|
|
|
|
* Return: BPF programs always return an integer which is interpreted by
|
|
|
|
* kprobe handler as:
|
|
|
|
* 0 - return from kprobe (event is filtered out)
|
|
|
|
* 1 - store kprobe event into ring buffer
|
|
|
|
* Other values are reserved and currently alias to 1
|
|
|
|
*/
|
|
|
|
unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
|
|
|
|
{
|
|
|
|
unsigned int ret;
|
|
|
|
|
|
|
|
if (in_nmi()) /* not supported yet */
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
|
|
|
|
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
|
|
|
|
/*
|
|
|
|
* since some bpf program is already running on this cpu,
|
|
|
|
* don't call into another bpf program (same or different)
|
|
|
|
* and don't send kprobe event into ring-buffer,
|
|
|
|
* so return zero here
|
|
|
|
*/
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
ret = BPF_PROG_RUN(prog, ctx);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
out:
|
|
|
|
__this_cpu_dec(bpf_prog_active);
|
|
|
|
preempt_enable();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(trace_call_bpf);
|
|
|
|
|
|
|
|
static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
|
|
|
|
{
|
|
|
|
void *dst = (void *) (long) r1;
|
2016-04-13 05:10:52 +07:00
|
|
|
int ret, size = (int) r2;
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
void *unsafe_ptr = (void *) (long) r3;
|
|
|
|
|
2016-04-13 05:10:52 +07:00
|
|
|
ret = probe_kernel_read(dst, unsafe_ptr, size);
|
|
|
|
if (unlikely(ret < 0))
|
|
|
|
memset(dst, 0, size);
|
|
|
|
|
|
|
|
return ret;
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static const struct bpf_func_proto bpf_probe_read_proto = {
|
|
|
|
.func = bpf_probe_read,
|
|
|
|
.gpl_only = true,
|
|
|
|
.ret_type = RET_INTEGER,
|
2016-04-13 05:10:52 +07:00
|
|
|
.arg1_type = ARG_PTR_TO_RAW_STACK,
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
.arg2_type = ARG_CONST_STACK_SIZE,
|
|
|
|
.arg3_type = ARG_ANYTHING,
|
|
|
|
};
|
|
|
|
|
2016-07-25 19:54:46 +07:00
|
|
|
static u64 bpf_probe_write_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
|
|
|
|
{
|
|
|
|
void *unsafe_ptr = (void *) (long) r1;
|
|
|
|
void *src = (void *) (long) r2;
|
|
|
|
int size = (int) r3;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure we're in user context which is safe for the helper to
|
|
|
|
* run. This helper has no business in a kthread.
|
|
|
|
*
|
|
|
|
* access_ok() should prevent writing to non-user memory, but in
|
|
|
|
* some situations (nommu, temporary switch, etc) access_ok() does
|
|
|
|
* not provide enough validation, hence the check on KERNEL_DS.
|
|
|
|
*/
|
|
|
|
|
|
|
|
if (unlikely(in_interrupt() ||
|
|
|
|
current->flags & (PF_KTHREAD | PF_EXITING)))
|
|
|
|
return -EPERM;
|
|
|
|
if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
|
|
|
|
return -EPERM;
|
|
|
|
if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
return probe_kernel_write(unsafe_ptr, src, size);
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct bpf_func_proto bpf_probe_write_user_proto = {
|
|
|
|
.func = bpf_probe_write_user,
|
|
|
|
.gpl_only = true,
|
|
|
|
.ret_type = RET_INTEGER,
|
|
|
|
.arg1_type = ARG_ANYTHING,
|
|
|
|
.arg2_type = ARG_PTR_TO_STACK,
|
|
|
|
.arg3_type = ARG_CONST_STACK_SIZE,
|
|
|
|
};
|
|
|
|
|
|
|
|
static const struct bpf_func_proto *bpf_get_probe_write_proto(void)
|
|
|
|
{
|
|
|
|
pr_warn_ratelimited("%s[%d] is installing a program with bpf_probe_write_user helper that may corrupt user memory!",
|
|
|
|
current->comm, task_pid_nr(current));
|
|
|
|
|
|
|
|
return &bpf_probe_write_user_proto;
|
|
|
|
}
|
|
|
|
|
2015-03-26 02:49:22 +07:00
|
|
|
/*
|
|
|
|
* limited trace_printk()
|
2015-08-29 05:56:23 +07:00
|
|
|
* only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed
|
2015-03-26 02:49:22 +07:00
|
|
|
*/
|
|
|
|
static u64 bpf_trace_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
|
|
|
|
{
|
|
|
|
char *fmt = (char *) (long) r1;
|
2015-08-29 05:56:23 +07:00
|
|
|
bool str_seen = false;
|
2015-03-26 02:49:22 +07:00
|
|
|
int mod[3] = {};
|
|
|
|
int fmt_cnt = 0;
|
2015-08-29 05:56:23 +07:00
|
|
|
u64 unsafe_addr;
|
|
|
|
char buf[64];
|
2015-03-26 02:49:22 +07:00
|
|
|
int i;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bpf_check()->check_func_arg()->check_stack_boundary()
|
|
|
|
* guarantees that fmt points to bpf program stack,
|
|
|
|
* fmt_size bytes of it were initialized and fmt_size > 0
|
|
|
|
*/
|
|
|
|
if (fmt[--fmt_size] != 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* check format string for allowed specifiers */
|
|
|
|
for (i = 0; i < fmt_size; i++) {
|
|
|
|
if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i]))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (fmt[i] != '%')
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (fmt_cnt >= 3)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
|
|
|
|
i++;
|
|
|
|
if (fmt[i] == 'l') {
|
|
|
|
mod[fmt_cnt]++;
|
|
|
|
i++;
|
2015-08-29 05:56:23 +07:00
|
|
|
} else if (fmt[i] == 'p' || fmt[i] == 's') {
|
2015-03-26 02:49:22 +07:00
|
|
|
mod[fmt_cnt]++;
|
|
|
|
i++;
|
|
|
|
if (!isspace(fmt[i]) && !ispunct(fmt[i]) && fmt[i] != 0)
|
|
|
|
return -EINVAL;
|
|
|
|
fmt_cnt++;
|
2015-08-29 05:56:23 +07:00
|
|
|
if (fmt[i - 1] == 's') {
|
|
|
|
if (str_seen)
|
|
|
|
/* allow only one '%s' per fmt string */
|
|
|
|
return -EINVAL;
|
|
|
|
str_seen = true;
|
|
|
|
|
|
|
|
switch (fmt_cnt) {
|
|
|
|
case 1:
|
|
|
|
unsafe_addr = r3;
|
|
|
|
r3 = (long) buf;
|
|
|
|
break;
|
|
|
|
case 2:
|
|
|
|
unsafe_addr = r4;
|
|
|
|
r4 = (long) buf;
|
|
|
|
break;
|
|
|
|
case 3:
|
|
|
|
unsafe_addr = r5;
|
|
|
|
r5 = (long) buf;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
buf[0] = 0;
|
|
|
|
strncpy_from_unsafe(buf,
|
|
|
|
(void *) (long) unsafe_addr,
|
|
|
|
sizeof(buf));
|
|
|
|
}
|
2015-03-26 02:49:22 +07:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fmt[i] == 'l') {
|
|
|
|
mod[fmt_cnt]++;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
|
|
|
|
return -EINVAL;
|
|
|
|
fmt_cnt++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return __trace_printk(1/* fake ip will not be printed */, fmt,
|
|
|
|
mod[0] == 2 ? r3 : mod[0] == 1 ? (long) r3 : (u32) r3,
|
|
|
|
mod[1] == 2 ? r4 : mod[1] == 1 ? (long) r4 : (u32) r4,
|
|
|
|
mod[2] == 2 ? r5 : mod[2] == 1 ? (long) r5 : (u32) r5);
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct bpf_func_proto bpf_trace_printk_proto = {
|
|
|
|
.func = bpf_trace_printk,
|
|
|
|
.gpl_only = true,
|
|
|
|
.ret_type = RET_INTEGER,
|
|
|
|
.arg1_type = ARG_PTR_TO_STACK,
|
|
|
|
.arg2_type = ARG_CONST_STACK_SIZE,
|
|
|
|
};
|
|
|
|
|
2015-06-13 09:39:13 +07:00
|
|
|
const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* this program might be calling bpf_trace_printk,
|
|
|
|
* so allocate per-cpu printk buffers
|
|
|
|
*/
|
|
|
|
trace_printk_init_buffers();
|
|
|
|
|
|
|
|
return &bpf_trace_printk_proto;
|
|
|
|
}
|
|
|
|
|
2016-06-28 17:18:25 +07:00
|
|
|
static u64 bpf_perf_event_read(u64 r1, u64 flags, u64 r3, u64 r4, u64 r5)
|
2015-08-06 14:02:35 +07:00
|
|
|
{
|
|
|
|
struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
|
|
|
|
struct bpf_array *array = container_of(map, struct bpf_array, map);
|
2016-06-28 17:18:25 +07:00
|
|
|
unsigned int cpu = smp_processor_id();
|
|
|
|
u64 index = flags & BPF_F_INDEX_MASK;
|
bpf, maps: flush own entries on perf map release
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.
Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c6517 ("bpf: fix clearing on persistent program
array maps").
To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 03:47:14 +07:00
|
|
|
struct bpf_event_entry *ee;
|
2015-08-06 14:02:35 +07:00
|
|
|
struct perf_event *event;
|
|
|
|
|
2016-06-28 17:18:25 +07:00
|
|
|
if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
|
|
|
|
return -EINVAL;
|
|
|
|
if (index == BPF_F_CURRENT_CPU)
|
|
|
|
index = cpu;
|
2015-08-06 14:02:35 +07:00
|
|
|
if (unlikely(index >= array->map.max_entries))
|
|
|
|
return -E2BIG;
|
|
|
|
|
bpf, maps: flush own entries on perf map release
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.
Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c6517 ("bpf: fix clearing on persistent program
array maps").
To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 03:47:14 +07:00
|
|
|
ee = READ_ONCE(array->ptrs[index]);
|
2016-06-28 17:18:23 +07:00
|
|
|
if (!ee)
|
2015-08-06 14:02:35 +07:00
|
|
|
return -ENOENT;
|
|
|
|
|
bpf, maps: flush own entries on perf map release
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.
Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c6517 ("bpf: fix clearing on persistent program
array maps").
To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 03:47:14 +07:00
|
|
|
event = ee->event;
|
2016-06-28 17:18:23 +07:00
|
|
|
if (unlikely(event->attr.type != PERF_TYPE_HARDWARE &&
|
|
|
|
event->attr.type != PERF_TYPE_RAW))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2015-10-23 07:10:14 +07:00
|
|
|
/* make sure event is local and doesn't have pmu::count */
|
2016-06-28 17:18:25 +07:00
|
|
|
if (unlikely(event->oncpu != cpu || event->pmu->count))
|
2015-10-23 07:10:14 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2015-08-06 14:02:35 +07:00
|
|
|
/*
|
|
|
|
* we don't know if the function is run successfully by the
|
|
|
|
* return value. It can be judged in other places, such as
|
|
|
|
* eBPF programs.
|
|
|
|
*/
|
|
|
|
return perf_event_read_local(event);
|
|
|
|
}
|
|
|
|
|
2015-10-23 07:10:14 +07:00
|
|
|
static const struct bpf_func_proto bpf_perf_event_read_proto = {
|
2015-08-06 14:02:35 +07:00
|
|
|
.func = bpf_perf_event_read,
|
2015-10-24 04:58:19 +07:00
|
|
|
.gpl_only = true,
|
2015-08-06 14:02:35 +07:00
|
|
|
.ret_type = RET_INTEGER,
|
|
|
|
.arg1_type = ARG_CONST_MAP_PTR,
|
|
|
|
.arg2_type = ARG_ANYTHING,
|
|
|
|
};
|
|
|
|
|
2016-07-14 23:08:04 +07:00
|
|
|
static __always_inline u64
|
|
|
|
__bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
|
|
|
|
u64 flags, struct perf_raw_record *raw)
|
2015-10-21 10:02:34 +07:00
|
|
|
{
|
|
|
|
struct bpf_array *array = container_of(map, struct bpf_array, map);
|
2016-06-28 17:18:24 +07:00
|
|
|
unsigned int cpu = smp_processor_id();
|
2016-04-19 02:01:23 +07:00
|
|
|
u64 index = flags & BPF_F_INDEX_MASK;
|
2015-10-21 10:02:34 +07:00
|
|
|
struct perf_sample_data sample_data;
|
bpf, maps: flush own entries on perf map release
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.
Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c6517 ("bpf: fix clearing on persistent program
array maps").
To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 03:47:14 +07:00
|
|
|
struct bpf_event_entry *ee;
|
2015-10-21 10:02:34 +07:00
|
|
|
struct perf_event *event;
|
|
|
|
|
2016-04-19 02:01:23 +07:00
|
|
|
if (index == BPF_F_CURRENT_CPU)
|
2016-06-28 17:18:24 +07:00
|
|
|
index = cpu;
|
2015-10-21 10:02:34 +07:00
|
|
|
if (unlikely(index >= array->map.max_entries))
|
|
|
|
return -E2BIG;
|
|
|
|
|
bpf, maps: flush own entries on perf map release
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.
Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c6517 ("bpf: fix clearing on persistent program
array maps").
To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 03:47:14 +07:00
|
|
|
ee = READ_ONCE(array->ptrs[index]);
|
2016-06-28 17:18:23 +07:00
|
|
|
if (!ee)
|
2015-10-21 10:02:34 +07:00
|
|
|
return -ENOENT;
|
|
|
|
|
bpf, maps: flush own entries on perf map release
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.
Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c6517 ("bpf: fix clearing on persistent program
array maps").
To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 03:47:14 +07:00
|
|
|
event = ee->event;
|
2015-10-21 10:02:34 +07:00
|
|
|
if (unlikely(event->attr.type != PERF_TYPE_SOFTWARE ||
|
|
|
|
event->attr.config != PERF_COUNT_SW_BPF_OUTPUT))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2016-06-28 17:18:24 +07:00
|
|
|
if (unlikely(event->oncpu != cpu))
|
2015-10-21 10:02:34 +07:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
perf_sample_data_init(&sample_data, 0, 0);
|
2016-07-14 23:08:04 +07:00
|
|
|
sample_data.raw = raw;
|
2015-10-21 10:02:34 +07:00
|
|
|
perf_event_output(event, &sample_data, regs);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-07-14 23:08:04 +07:00
|
|
|
static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
|
|
|
|
{
|
|
|
|
struct pt_regs *regs = (struct pt_regs *)(long) r1;
|
|
|
|
struct bpf_map *map = (struct bpf_map *)(long) r2;
|
|
|
|
void *data = (void *)(long) r4;
|
|
|
|
struct perf_raw_record raw = {
|
|
|
|
.frag = {
|
|
|
|
.size = size,
|
|
|
|
.data = data,
|
|
|
|
},
|
|
|
|
};
|
|
|
|
|
|
|
|
if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return __bpf_perf_event_output(regs, map, flags, &raw);
|
|
|
|
}
|
|
|
|
|
2015-10-21 10:02:34 +07:00
|
|
|
static const struct bpf_func_proto bpf_perf_event_output_proto = {
|
|
|
|
.func = bpf_perf_event_output,
|
2015-10-24 04:58:19 +07:00
|
|
|
.gpl_only = true,
|
2015-10-21 10:02:34 +07:00
|
|
|
.ret_type = RET_INTEGER,
|
|
|
|
.arg1_type = ARG_PTR_TO_CTX,
|
|
|
|
.arg2_type = ARG_CONST_MAP_PTR,
|
|
|
|
.arg3_type = ARG_ANYTHING,
|
|
|
|
.arg4_type = ARG_PTR_TO_STACK,
|
|
|
|
.arg5_type = ARG_CONST_STACK_SIZE,
|
|
|
|
};
|
|
|
|
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 02:01:24 +07:00
|
|
|
static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
|
|
|
|
|
2016-07-14 23:08:05 +07:00
|
|
|
u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
|
|
|
|
void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy)
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 02:01:24 +07:00
|
|
|
{
|
|
|
|
struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs);
|
2016-07-14 23:08:05 +07:00
|
|
|
struct perf_raw_frag frag = {
|
|
|
|
.copy = ctx_copy,
|
|
|
|
.size = ctx_size,
|
|
|
|
.data = ctx,
|
|
|
|
};
|
|
|
|
struct perf_raw_record raw = {
|
|
|
|
.frag = {
|
2016-07-19 05:50:58 +07:00
|
|
|
{
|
|
|
|
.next = ctx_size ? &frag : NULL,
|
|
|
|
},
|
2016-07-14 23:08:05 +07:00
|
|
|
.size = meta_size,
|
|
|
|
.data = meta,
|
|
|
|
},
|
|
|
|
};
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 02:01:24 +07:00
|
|
|
|
|
|
|
perf_fetch_caller_regs(regs);
|
|
|
|
|
2016-07-14 23:08:05 +07:00
|
|
|
return __bpf_perf_event_output(regs, map, flags, &raw);
|
bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19 02:01:24 +07:00
|
|
|
}
|
|
|
|
|
2016-07-07 12:38:36 +07:00
|
|
|
static u64 bpf_get_current_task(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
|
|
|
|
{
|
|
|
|
return (long) current;
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct bpf_func_proto bpf_get_current_task_proto = {
|
|
|
|
.func = bpf_get_current_task,
|
|
|
|
.gpl_only = true,
|
|
|
|
.ret_type = RET_INTEGER,
|
|
|
|
};
|
|
|
|
|
2016-04-07 08:43:26 +07:00
|
|
|
static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
{
|
|
|
|
switch (func_id) {
|
|
|
|
case BPF_FUNC_map_lookup_elem:
|
|
|
|
return &bpf_map_lookup_elem_proto;
|
|
|
|
case BPF_FUNC_map_update_elem:
|
|
|
|
return &bpf_map_update_elem_proto;
|
|
|
|
case BPF_FUNC_map_delete_elem:
|
|
|
|
return &bpf_map_delete_elem_proto;
|
|
|
|
case BPF_FUNC_probe_read:
|
|
|
|
return &bpf_probe_read_proto;
|
2015-03-26 02:49:21 +07:00
|
|
|
case BPF_FUNC_ktime_get_ns:
|
|
|
|
return &bpf_ktime_get_ns_proto;
|
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 06:59:03 +07:00
|
|
|
case BPF_FUNC_tail_call:
|
|
|
|
return &bpf_tail_call_proto;
|
2015-06-13 09:39:12 +07:00
|
|
|
case BPF_FUNC_get_current_pid_tgid:
|
|
|
|
return &bpf_get_current_pid_tgid_proto;
|
2016-07-07 12:38:36 +07:00
|
|
|
case BPF_FUNC_get_current_task:
|
|
|
|
return &bpf_get_current_task_proto;
|
2015-06-13 09:39:12 +07:00
|
|
|
case BPF_FUNC_get_current_uid_gid:
|
|
|
|
return &bpf_get_current_uid_gid_proto;
|
|
|
|
case BPF_FUNC_get_current_comm:
|
|
|
|
return &bpf_get_current_comm_proto;
|
2015-03-26 02:49:22 +07:00
|
|
|
case BPF_FUNC_trace_printk:
|
2015-06-13 09:39:13 +07:00
|
|
|
return bpf_get_trace_printk_proto();
|
2015-06-13 09:39:14 +07:00
|
|
|
case BPF_FUNC_get_smp_processor_id:
|
|
|
|
return &bpf_get_smp_processor_id_proto;
|
2015-08-06 14:02:35 +07:00
|
|
|
case BPF_FUNC_perf_event_read:
|
|
|
|
return &bpf_perf_event_read_proto;
|
2016-07-25 19:54:46 +07:00
|
|
|
case BPF_FUNC_probe_write_user:
|
|
|
|
return bpf_get_probe_write_proto();
|
2016-04-07 08:43:26 +07:00
|
|
|
default:
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
|
|
|
|
{
|
|
|
|
switch (func_id) {
|
2015-10-21 10:02:34 +07:00
|
|
|
case BPF_FUNC_perf_event_output:
|
|
|
|
return &bpf_perf_event_output_proto;
|
2016-02-18 10:58:58 +07:00
|
|
|
case BPF_FUNC_get_stackid:
|
|
|
|
return &bpf_get_stackid_proto;
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
default:
|
2016-04-07 08:43:26 +07:00
|
|
|
return tracing_func_proto(func_id);
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* bpf+kprobe programs can access fields of 'struct pt_regs' */
|
2016-06-16 08:25:38 +07:00
|
|
|
static bool kprobe_prog_is_valid_access(int off, int size, enum bpf_access_type type,
|
|
|
|
enum bpf_reg_type *reg_type)
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
{
|
|
|
|
if (off < 0 || off >= sizeof(struct pt_regs))
|
|
|
|
return false;
|
|
|
|
if (type != BPF_READ)
|
|
|
|
return false;
|
|
|
|
if (off % size != 0)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2015-12-12 00:35:59 +07:00
|
|
|
static const struct bpf_verifier_ops kprobe_prog_ops = {
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
.get_func_proto = kprobe_prog_func_proto,
|
|
|
|
.is_valid_access = kprobe_prog_is_valid_access,
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct bpf_prog_type_list kprobe_tl = {
|
|
|
|
.ops = &kprobe_prog_ops,
|
|
|
|
.type = BPF_PROG_TYPE_KPROBE,
|
|
|
|
};
|
|
|
|
|
2016-04-07 08:43:27 +07:00
|
|
|
static u64 bpf_perf_event_output_tp(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* r1 points to perf tracepoint buffer where first 8 bytes are hidden
|
|
|
|
* from bpf program and contain a pointer to 'struct pt_regs'. Fetch it
|
|
|
|
* from there and call the same bpf_perf_event_output() helper
|
|
|
|
*/
|
2016-04-17 03:29:33 +07:00
|
|
|
u64 ctx = *(long *)(uintptr_t)r1;
|
2016-04-07 08:43:27 +07:00
|
|
|
|
|
|
|
return bpf_perf_event_output(ctx, r2, index, r4, size);
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
|
|
|
|
.func = bpf_perf_event_output_tp,
|
|
|
|
.gpl_only = true,
|
|
|
|
.ret_type = RET_INTEGER,
|
|
|
|
.arg1_type = ARG_PTR_TO_CTX,
|
|
|
|
.arg2_type = ARG_CONST_MAP_PTR,
|
|
|
|
.arg3_type = ARG_ANYTHING,
|
|
|
|
.arg4_type = ARG_PTR_TO_STACK,
|
|
|
|
.arg5_type = ARG_CONST_STACK_SIZE,
|
|
|
|
};
|
|
|
|
|
|
|
|
static u64 bpf_get_stackid_tp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
|
|
|
|
{
|
2016-04-17 03:29:33 +07:00
|
|
|
u64 ctx = *(long *)(uintptr_t)r1;
|
2016-04-07 08:43:27 +07:00
|
|
|
|
|
|
|
return bpf_get_stackid(ctx, r2, r3, r4, r5);
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
|
|
|
|
.func = bpf_get_stackid_tp,
|
|
|
|
.gpl_only = true,
|
|
|
|
.ret_type = RET_INTEGER,
|
|
|
|
.arg1_type = ARG_PTR_TO_CTX,
|
|
|
|
.arg2_type = ARG_CONST_MAP_PTR,
|
|
|
|
.arg3_type = ARG_ANYTHING,
|
|
|
|
};
|
|
|
|
|
2016-04-07 08:43:26 +07:00
|
|
|
static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
|
|
|
|
{
|
|
|
|
switch (func_id) {
|
|
|
|
case BPF_FUNC_perf_event_output:
|
2016-04-07 08:43:27 +07:00
|
|
|
return &bpf_perf_event_output_proto_tp;
|
2016-04-07 08:43:26 +07:00
|
|
|
case BPF_FUNC_get_stackid:
|
2016-04-07 08:43:27 +07:00
|
|
|
return &bpf_get_stackid_proto_tp;
|
2016-04-07 08:43:26 +07:00
|
|
|
default:
|
|
|
|
return tracing_func_proto(func_id);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-16 08:25:38 +07:00
|
|
|
static bool tp_prog_is_valid_access(int off, int size, enum bpf_access_type type,
|
|
|
|
enum bpf_reg_type *reg_type)
|
2016-04-07 08:43:26 +07:00
|
|
|
{
|
|
|
|
if (off < sizeof(void *) || off >= PERF_MAX_TRACE_SIZE)
|
|
|
|
return false;
|
|
|
|
if (type != BPF_READ)
|
|
|
|
return false;
|
|
|
|
if (off % size != 0)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct bpf_verifier_ops tracepoint_prog_ops = {
|
|
|
|
.get_func_proto = tp_prog_func_proto,
|
|
|
|
.is_valid_access = tp_prog_is_valid_access,
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct bpf_prog_type_list tracepoint_tl = {
|
|
|
|
.ops = &tracepoint_prog_ops,
|
|
|
|
.type = BPF_PROG_TYPE_TRACEPOINT,
|
|
|
|
};
|
|
|
|
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
static int __init register_kprobe_prog_ops(void)
|
|
|
|
{
|
|
|
|
bpf_register_prog_type(&kprobe_tl);
|
2016-04-07 08:43:26 +07:00
|
|
|
bpf_register_prog_type(&tracepoint_tl);
|
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-26 02:49:20 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
late_initcall(register_kprobe_prog_ops);
|