2017-10-06 20:31:47 +07:00
#ifndef __PERF_MMAP_H
#define __PERF_MMAP_H 1

#include <linux/compiler.h>
#include <linux/refcount.h>
#include <linux/types.h>
tools, perf: add and use optimized ring_buffer_{read_head, write_tail} helpers
Currently, on x86-64, perf uses LFENCE and MFENCE (rmb() and mb(),
respectively) when processing events from the perf ring buffer. This
is unnecessarily expensive: lighter-weight barriers suffice, which
matters because this is a critical fast path in perf.
According to Peter rmb()/mb() were added back then via a94d342b9cb0
("tools/perf: Add required memory barriers") at a time where kernel
still supported chips that needed it, but nowadays support for these
has been ditched completely, therefore we can fix them up as well.
While for x86-64, replacing rmb() and mb() with smp_*() variants would
result in just a compiler barrier for the former and LOCK + ADD for
the latter (__sync_synchronize() uses slower MFENCE by the way), Peter
suggested we can use smp_{load_acquire,store_release}() instead for
architectures where their implementation does not resolve to the slower smp_mb().
Thus, e.g. in x86-64 we would be able to avoid CPU barrier entirely due
to TSO. For architectures where the latter needs to use smp_mb() e.g.
on arm, we stick to cheaper smp_rmb() variant for fetching the head.
This work adds helpers ring_buffer_read_head() and ring_buffer_write_tail()
for tools infrastructure that either switches to smp_load_acquire() for
architectures where it is cheaper or uses READ_ONCE() + smp_rmb() barrier
for those where it's not in order to fetch the data_head from the perf
control page, and it uses smp_store_release() to write the data_tail.
Latter is smp_mb() + WRITE_ONCE() combination or a cheaper variant if
architecture allows for it. Those that rely on smp_rmb() and smp_mb() can
further improve performance in a follow up step by implementing the two
under tools/arch/*/include/asm/barrier.h such that they don't have to
fallback to rmb() and mb() in tools/include/asm/barrier.h.
Switch perf to use ring_buffer_read_head() and ring_buffer_write_tail()
so it can make use of the optimizations. Later, we convert libbpf as
well to use the same helpers.
Side note [0]: the topic has been raised of whether one could simply use
the C11 gcc builtins [1] for the smp_load_acquire() and smp_store_release()
instead:
__atomic_load_n(ptr, __ATOMIC_ACQUIRE);
__atomic_store_n(ptr, val, __ATOMIC_RELEASE);
The kernel and (presumably) the tooling shipped along with it have a
minimum requirement of being able to build with gcc-4.6, and the latter
does not have the C11 builtins. While the C11 memory model generally does
not align with the kernel's, the C11 load-acquire and store-release alone
/could/ suffice. The issue is that how the compiler implements load-acquire
and store-release is implementation dependent, and the mapping of every
supported compiler must be compatible with the kernel's implementation.
Whether they match thus needs to be verified and tracked on a case by case
basis (unless an architecture also uses them on the kernel side). The
implementations of smp_load_acquire() and smp_store_release() in this patch
have been adapted from the kernel side ones so that a concrete and
compatible mapping is in place.
[0] http://patchwork.ozlabs.org/patch/985422/
[1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 20:51:02 +07:00
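The head/tail protocol described in the commit message above can be sketched as follows. This is only an illustration under stated assumptions: it uses the gcc `__atomic` builtins quoted in the side note, which the patch deliberately does not rely on, and `struct demo_ctrl_page` and the `demo_*` functions are hypothetical names, not the tools' `ring_buffer_read_head()`/`ring_buffer_write_tail()` implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the data_head/data_tail pair of the perf control page. */
struct demo_ctrl_page {
	uint64_t data_head;	/* advanced by the producer (the kernel, in perf's case) */
	uint64_t data_tail;	/* advanced by the consumer (the tool) */
};

/* Load-acquire: reads of ring data issued after this load cannot move before it. */
static inline uint64_t demo_read_head(struct demo_ctrl_page *pg)
{
	return __atomic_load_n(&pg->data_head, __ATOMIC_ACQUIRE);
}

/* Store-release: reads of ring data issued before this store complete before the tail moves. */
static inline void demo_write_tail(struct demo_ctrl_page *pg, uint64_t tail)
{
	__atomic_store_n(&pg->data_tail, tail, __ATOMIC_RELEASE);
}

/* Consume everything between tail and head; returns the number of bytes consumed. */
static uint64_t demo_consume(struct demo_ctrl_page *pg)
{
	uint64_t head = demo_read_head(pg);
	uint64_t tail = pg->data_tail;	/* only the consumer writes the tail */

	assert(head >= tail);
	/* ... the events in [tail, head) would be processed here ... */
	demo_write_tail(pg, head);
	return head - tail;
}
```

The acquire on the head pairs with the producer's publication of new data, and the release on the tail tells the producer that the consumed space may be reused.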
#include <linux/ring_buffer.h>
2017-10-06 20:31:47 +07:00
#include <stdbool.h>
2019-08-31 00:45:20 +07:00
#include <pthread.h> // for cpu_set_t
2018-11-06 16:03:35 +07:00
#ifdef HAVE_AIO_SUPPORT
#include <aio.h>
#endif
2017-10-06 20:31:47 +07:00
#include "auxtrace.h"
#include "event.h"
2018-11-06 16:04:58 +07:00
struct aiocb;
2017-10-06 20:31:47 +07:00
/**
 * struct perf_mmap - perf's ring buffer mmap details
 *
 * @refcnt - e.g. code using PERF_EVENT_IOC_SET_OUTPUT to share this
 */
struct perf_mmap {
	void *base;
	int mask;
	int fd;
2018-08-17 18:45:55 +07:00
	int cpu;
2017-10-06 20:31:47 +07:00
	refcount_t refcnt;
	u64 prev;
2018-03-06 22:36:01 +07:00
	u64 start;
	u64 end;
2018-03-06 22:36:00 +07:00
	bool overwrite;
2017-10-06 20:31:47 +07:00
	struct auxtrace_mmap auxtrace_mmap;
	char event_copy[PERF_SAMPLE_MAX_SIZE] __aligned(8);
2018-11-06 16:03:35 +07:00
#ifdef HAVE_AIO_SUPPORT
	struct {
2018-11-06 16:07:19 +07:00
		void **data;
		struct aiocb *cblocks;
		struct aiocb **aiocb;
2018-11-06 16:04:58 +07:00
		int nr_cblocks;
2018-11-06 16:03:35 +07:00
	} aio;
#endif
2019-01-23 00:47:43 +07:00
	cpu_set_t affinity_mask;
perf record: Implement --mmap-flush=<number> option
Implement a --mmap-flush option that specifies the minimal number of bytes
extracted from the mmaped kernel buffer to store into a trace. The
default option value is 1 byte, which means that every time the trace
writing thread finds new data in the mmaped buffer, the data is extracted,
possibly compressed, and written to a trace.
$ tools/perf/perf record --mmap-flush 1024 -e cycles -- matrix.gcc
$ tools/perf/perf record --aio --mmap-flush 1K -e cycles -- matrix.gcc
The option is independent from -z setting, doesn't vary with compression
level and can serve two purposes.
The first purpose is to increase the compression ratio of the trace data.
Larger data chunks are compressed more effectively, so the option
allows specifying the size of the data chunk to compress. Also, in some
cases executing more write syscalls with smaller data sizes can take longer
than executing fewer write syscalls with bigger data sizes due to syscall
overhead, so extracting bigger data chunks as specified by the option value
could additionally decrease runtime overhead.
The second purpose is to avoid a self-monitoring live-lock issue in system
wide (-a) profiling mode. Profiling in system wide mode with compression
(-a -z) additionally puts the tool's own data into the kernel buffers along
with the data from the monitored processes. If the performance data rate and
volume from the monitored processes is high, then the trace streaming and
compression activity in the tool is also high. High tool process
activity can lead to a subtle live-lock effect, where compressing a single
new byte from one mmaped kernel buffer leads to the generation of the
next single byte in some mmaped buffer. The perf tool process then ends up
in endless self monitoring.
The implemented synch parameter is the means to force data to be moved
independently of the specified flush threshold value. Regardless of the
provided flush value, the tool needs the capability to unconditionally
drain the memory buffers, at least at the end of the collection.
Committer testing:
Running with the default value, i.e. as soon as there is something to
read go on consuming, we first write the synthesized events, small
chunks of about 128 bytes:
# perf trace -m 2048 --call-graph dwarf -e write -- perf record
<SNIP>
101.142 ( 0.004 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x210db60, count: 120) = 120
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
process_synthesized_event (/home/acme/bin/perf)
perf_tool__process_synth_event (inlined)
perf_event__synthesize_mmap_events (/home/acme/bin/perf)
Then we move to reading the mmap buffers consuming the events put there
by the kernel perf infrastructure:
107.561 ( 0.005 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02000, count: 336) = 336
__libc_write (/usr/lib64/libpthread-2.28.so)
ion (/home/acme/bin/perf)
record__write (inlined)
record__pushfn (/home/acme/bin/perf)
perf_mmap__push (/home/acme/bin/perf)
record__mmap_read_evlist (inlined)
record__mmap_read_all (inlined)
__cmd_record (inlined)
cmd_record (/home/acme/bin/perf)
12919.953 ( 0.136 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc83150, count: 184984) = 184984
<SNIP same backtrace as in the 107.561 timestamp>
12920.094 ( 0.155 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befc02150, count: 261816) = 261816
<SNIP same backtrace as in the 107.561 timestamp>
12920.253 ( 0.093 ms): perf/25821 write(fd: 3</root/perf.data>, buf: 0x7f1befb81120, count: 170832) = 170832
<SNIP same backtrace as in the 107.561 timestamp>
If we limit it to write only when more than 16MB are available for
reading, the threshold is throttled to a quarter of the --mmap-pages
setting of 'perf record', which by default amounts to 528384 bytes, as
found out using 'record -v':
mmap flush: 132096
mmap size 528384B
With that in place all the writes coming from
record__mmap_read_evlist(), i.e. from the mmap buffers setup by the
kernel perf infrastructure were at least 132096 bytes long.
Trying with a bigger mmap size:
perf trace -e write perf record -v -m 2048 --mmap-flush 16M
74982.928 ( 2.471 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff94a6cc000, count: 3580888) = 3580888
74985.406 ( 2.353 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff949ecb000, count: 3453256) = 3453256
74987.764 ( 2.629 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9496ca000, count: 3859232) = 3859232
74990.399 ( 2.341 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff948ec9000, count: 3769032) = 3769032
74992.744 ( 2.064 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9486c8000, count: 3310520) = 3310520
74994.814 ( 2.619 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff947ec7000, count: 4194688) = 4194688
74997.439 ( 2.787 ms): perf/26500 write(fd: 3</root/perf.data>, buf: 0x7ff9476c6000, count: 4029760) = 4029760
Was again limited to a quarter of the mmap size:
mmap flush: 2098176
mmap size 8392704B
A warning about that would be good to have but can be added later,
something like:
"max flush is a quarter of the mmap size, if wanting to bump the mmap
flush further, bump the mmap size as well using -m/--mmap-pages"
Also rename the 'sync' parameters to 'synch' to keep tools/perf building
with older glibcs:
cc1: warnings being treated as errors
builtin-record.c: In function 'record__mmap_read_evlist':
builtin-record.c:775: warning: declaration of 'sync' shadows a global declaration
/usr/include/unistd.h:933: warning: shadowed declaration is here
builtin-record.c: In function 'record__mmap_read_all':
builtin-record.c:856: warning: declaration of 'sync' shadows a global declaration
/usr/include/unistd.h:933: warning: shadowed declaration is here
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/f6600d72-ecfa-2eb7-7e51-f6954547d500@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-03-19 00:40:26 +07:00
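The flush behaviour described in the commit message above can be sketched as follows. This is a minimal illustration, assuming only what the message states: the threshold is capped at a quarter of the mmap size, and the synch parameter forces a drain regardless of the threshold. The `demo_*` helpers are hypothetical names and are not taken from the perf sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cap the requested flush threshold at a quarter of the mmap size. */
static uint64_t demo_clamp_flush(uint64_t requested, size_t mmap_len)
{
	uint64_t max_flush = mmap_len / 4;

	assert(mmap_len > 0);
	return requested > max_flush ? max_flush : requested;
}

/*
 * Decide whether to move data out of the buffer: drain when at least
 * 'flush' bytes are available, or unconditionally when 'synch' is set
 * (e.g. at the end of the collection).
 */
static bool demo_should_drain(uint64_t avail, uint64_t flush, bool synch)
{
	if (avail == 0)
		return false;	/* nothing to move */
	return synch || avail >= flush;
}
```

With the numbers from the committer testing, a 16M request against a 528384-byte mmap clamps to 132096, and against an 8392704-byte mmap it clamps to 2098176.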
	u64 flush;
2019-03-19 00:42:19 +07:00
	void *data;
	int comp_level;
2017-10-06 20:31:47 +07:00
};

/*
 * State machine of bkw_mmap_state:
 *
 *                     .________________(forbid)_____________.
 *                     |                                     V
 * NOTREADY --(0)--> RUNNING --(1)--> DATA_PENDING --(2)--> EMPTY
 *                     ^  ^              |   ^               |
 *                     |  |__(forbid)____/   |___(forbid)___/|
 *                     |                                     |
 *                      \_________________(3)_______________/
 *
 * NOTREADY     : Backward ring buffers are not ready
 * RUNNING      : Backward ring buffers are recording
 * DATA_PENDING : We are required to collect data from backward ring buffers
 * EMPTY        : We have collected data from backward ring buffers.
 *
 * (0): Setup backward ring buffer
 * (1): Pause ring buffers for reading
 * (2): Read from ring buffers
 * (3): Resume ring buffers for recording
 */
enum bkw_mmap_state {
	BKW_MMAP_NOTREADY,
	BKW_MMAP_RUNNING,
	BKW_MMAP_DATA_PENDING,
	BKW_MMAP_EMPTY,
};
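The transitions in the state machine comment above can be checked with a small validator. This is a sketch under stated assumptions: the `demo_*` enum and function are hypothetical and mirror only the arrows in the diagram, with (3) read as EMPTY back to RUNNING; they are not perf's own transition code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of enum bkw_mmap_state, renamed to avoid clashing with it. */
enum demo_bkw_state {
	DEMO_NOTREADY,
	DEMO_RUNNING,
	DEMO_DATA_PENDING,
	DEMO_EMPTY,
	DEMO_NR_STATES,
};

/* Return true only for the transitions labelled (0)..(3) in the diagram. */
static bool demo_bkw_transition_ok(enum demo_bkw_state from,
				   enum demo_bkw_state to)
{
	assert(from < DEMO_NR_STATES && to < DEMO_NR_STATES);

	switch (from) {
	case DEMO_NOTREADY:	/* (0): setup backward ring buffer */
		return to == DEMO_RUNNING;
	case DEMO_RUNNING:	/* (1): pause ring buffers for reading */
		return to == DEMO_DATA_PENDING;
	case DEMO_DATA_PENDING:	/* (2): read from ring buffers */
		return to == DEMO_EMPTY;
	case DEMO_EMPTY:	/* (3): resume ring buffers for recording */
		return to == DEMO_RUNNING;
	default:
		return false;
	}
}
```

Every other pair, including the three arrows marked "forbid", is rejected.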
struct mmap_params {
2019-03-19 00:42:19 +07:00
	int prot, mask, nr_cblocks, affinity, flush, comp_level;
2017-10-06 20:31:47 +07:00
	struct auxtrace_mmap_params auxtrace_mp;
};
2018-08-17 18:45:55 +07:00
int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int cpu);
2017-10-06 20:31:47 +07:00
void perf_mmap__munmap(struct perf_mmap *map);

void perf_mmap__get(struct perf_mmap *map);
void perf_mmap__put(struct perf_mmap *map);
2018-03-06 22:36:05 +07:00
void perf_mmap__consume(struct perf_mmap *map);
2017-10-06 20:31:47 +07:00
static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
{
2018-10-19 20:51:02 +07:00
	return ring_buffer_read_head(mm->base);
2017-10-06 20:31:47 +07:00
}

static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
{
2018-10-19 20:51:02 +07:00
	ring_buffer_write_tail(md->base, tail);
2017-10-06 20:31:47 +07:00
}
2017-12-03 09:00:41 +07:00
union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
2017-10-06 20:31:47 +07:00
2018-03-06 22:36:06 +07:00
union perf_event *perf_mmap__read_event(struct perf_mmap *map);
2018-01-19 04:26:23 +07:00
2018-03-06 22:36:02 +07:00
int perf_mmap__push(struct perf_mmap *md, void *to,
2018-09-13 19:54:06 +07:00
		    int push(struct perf_mmap *map, void *to, void *buf, size_t size));
2017-10-06 20:46:01 +07:00
2017-10-06 20:31:47 +07:00
size_t perf_mmap__mmap_len(struct perf_mmap *map);
2018-03-06 22:36:07 +07:00
int perf_mmap__read_init(struct perf_mmap *md);
2018-01-19 04:26:22 +07:00
void perf_mmap__read_done(struct perf_mmap *map);
2017-10-06 20:31:47 +07:00
#endif /*__PERF_MMAP_H */