linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-27 10:15:10 +07:00

Wang Nan 9ecda41acb perf/core: Add ::write_backward attribute to perf event

This patch introduces a 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. Once it is set, the
corresponding ring buffer is written from end to beginning. This feature
is designed to support reading from an overwritable ring buffer.

A ring buffer can be created by mapping a perf event fd. The kernel puts
event records into the ring buffer, and user tooling such as perf
fetches them from the address returned by mmap(). To prevent races
between kernel and tooling, they communicate with each other through
'head' and 'tail' pointers. The kernel maintains the 'head' pointer and
points it to the next free area (the tail of the last record). Tooling
maintains the 'tail' pointer and points it to the tail of the last
consumed record (a record that has already been fetched). The kernel
determines the available space in a ring buffer from these two pointers
to avoid overwriting unfetched records.

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Unlike a normal ring buffer, tooling is unable to maintain the 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffer, the kernel overwrites old records unconditionally and works like
a flight recorder. This feature would be useful if reading from an
overwritable ring buffer were as easy as reading from a normal ring
buffer. However, there's an obscure problem.

The following figure demonstrates a full overwritable ring buffer. In
this figure, the 'head' pointer points to the end of the last record,
and a long record 'E' is pending. For a normal ring buffer, a 'tail'
pointer would have pointed to position (X), so the kernel knows there's
no more space in the ring buffer. However, for an overwritable ring
buffer, the kernel ignores the 'tail' pointer.

   (X)                              head
    .                                |
    .                                V
    +------+-------+----------+------+---+
    |A....A|B.....B|C........C|D....D|   |
    +------+-------+----------+------+---+

Record 'A' is overwritten by event 'E':

      head
       |
       V
    +--+---+-------+----------+------+---+
    |.E|..A|B.....B|C........C|D....D|E..|
    +--+---+-------+----------+------+---+

Now tooling decides to read from this ring buffer. However, neither of
the two natural positions, 'head' and the start of this ring buffer,
points to the head of a record. Even though the full ring buffer can be
accessed by tooling, it is unable to find a position to start decoding.

The first attempt to solve this problem, AFAIK, can be found in [1]. It
makes the kernel maintain the 'tail' pointer: it updates it when the
ring buffer is half full. However, this approach introduces overhead to
the fast path. Test results show a 1% overhead [2]. In addition, this
method utilizes no more than 50% of the records.

Another attempt can be found in [3], which allows putting the size of an
event at the end of each record. This approach allows tooling to find
records in a backward manner from the 'head' pointer by reading the size
of a record from its tail. However, because of the alignment
requirement, it needs 8 bytes to record the size of a record, which is a
huge waste. Its performance is also not good, because more data needs to
be written. This approach also introduces some extra branch instructions
to the fast path.

'write_backward' is a better solution to this problem.

The following figure demonstrates the state of the overwritable ring
buffer when 'write_backward' is set, before overwriting:

       head
        |
        V
    +---+------+----------+-------+------+
    |   |D....D|C........C|B.....B|A....A|
    +---+------+----------+-------+------+

and after overwriting:

                                     head
                                      |
                                      V
    +---+------+----------+-------+---+--+
    |..E|D....D|C........C|B.....B|A..|E.|
    +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
From this record, tooling can iterate over the full ring buffer and
fetch records one by one.

The only limitation that needs to be considered is back-to-back reading.
Due to the non-deterministic behavior of user programs, it is impossible
to ensure the ring buffer stays stable during reading. Consider an
extreme situation: tooling is scheduled out after reading record 'D',
then a burst of events comes and eats up the whole ring buffer (one or
multiple rounds). When the tooling process comes back, reading after 'D'
is now incorrect.

To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).

By carefully verifying the 'header' pointer, a reader can avoid pausing
the ring buffer. For example:

    /* A union of all possible events */
    union perf_event event;

    p = head = perf_mmap__read_head();
    while (true) {
        /* copy header of next event */
        fetch(&event.header, p, sizeof(event.header));

        /* read 'head' pointer */
        head = perf_mmap__read_head();

        /* check overwritten: is the header good? */
        if (!verify(sizeof(event.header), p, head))
            break;

        /* copy the whole event */
        fetch(&event, p, event.header.size);

        /* read 'head' pointer again */
        head = perf_mmap__read_head();

        /* is the whole event good? */
        if (!verify(event.header.size, p, head))
            break;
        p += event.header.size;
    }

However, the overhead is high because:

 a) In-place decoding is not safe.
    Copying-verifying-decoding is required.
 b) Fetching the 'head' pointer requires additional synchronization.

(From Alexei Starovoitov:

Even when this trick works, pause is needed for more than stability of
reading. When we collect the events into overwrite buffer we're waiting
for some other trigger (like all cpu utilization spike or just one cpu
running and all others are idle) and when it happens the buffer has
valuable info from the past.
At this point new events are no longer interesting and buffer should be
paused, events read and unpaused until next trigger comes.)

This patch utilizes event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead to
fast path, original perf_event_output() becomes __perf_event_output()
and marked '__always_inline'. In theory, there's no extra overhead
introduced to fast path.

Performance testing:

Calling 3000000 times of 'close(-1)', use gettimeofday() to check
duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.

Testing environment:

  CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Kernel : v4.5.0

                 MEAN         STDVAR
  BASE         800214.950    2853.083
  PRE1        2253846.700    9997.014
  PRE2        2257495.540    8516.293
  POST        2250896.100    8933.921

Where 'BASE' is pure performance without capturing. 'PRE1' is test
result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
patch. 'POST' is test result after this patch. See [4] for the detailed
experimental setup.

Considering the stdvar, this patch doesn't introduce performance
overhead to the fast path.
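
As a rough cross-check of the conclusion above, the PRE2 and POST rows can be compared directly. The sketch below is only illustrative: the type and function names (bench_row, mean_delta_ns, relative_change_pct, within_noise) are hypothetical, and treating "difference smaller than one standard deviation" as noise is a crude rule of thumb assumed here, not a statistical test used by the patch author.

```c
#include <assert.h>
#include <stdbool.h>

/* One row of the benchmark table: mean and standard deviation in ns. */
struct bench_row {
    double mean_ns;
    double stdvar_ns;
};

/* Values copied from the table: before and after this patch. */
static const struct bench_row PRE2 = { 2257495.540, 8516.293 };
static const struct bench_row POST = { 2250896.100, 8933.921 };

/* Difference of the means; negative means 'after' was faster. */
static double mean_delta_ns(struct bench_row before, struct bench_row after)
{
    return after.mean_ns - before.mean_ns;
}

/* The same difference expressed as a percentage of the 'before' mean. */
static double relative_change_pct(struct bench_row before,
                                  struct bench_row after)
{
    assert(before.mean_ns > 0.0);
    return 100.0 * mean_delta_ns(before, after) / before.mean_ns;
}

/*
 * Assumed rule of thumb: the change is treated as noise when its
 * magnitude does not exceed the smaller of the two standard deviations.
 */
static bool within_noise(struct bench_row before, struct bench_row after)
{
    double delta = mean_delta_ns(before, after);
    double noise = before.stdvar_ns < after.stdvar_ns ? before.stdvar_ns
                                                      : after.stdvar_ns;

    if (delta < 0.0)
        delta = -delta;
    return delta <= noise;
}
```

With the table values, mean_delta_ns(PRE2, POST) is about -6599 ns (roughly -0.29%), which is smaller in magnitude than either standard deviation, so within_noise(PRE2, POST) holds.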
[1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
[4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: <acme@kernel.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
[ Fixed the changelog some more. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2016-04-23 14:12:39 +02:00
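
The backward-writing scheme described in the changelog above can be illustrated with a small user-space simulation. This is a minimal sketch under stated assumptions, not the kernel's implementation: the names (ring, rec_header, rb_write_backward, rb_read_newest_first), the 64-byte buffer and the 4-byte record header are all hypothetical simplifications of the real perf ring buffer and perf_event_header. It shows why a reader can always start decoding at 'head' and where it has to stop once the oldest record has been partially overwritten.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RB_SIZE 64u                 /* buffer size, must be a power of two */
#define RB_MASK (RB_SIZE - 1u)

/* Simplified record header (hypothetical, not perf_event_header). */
struct rec_header {
    uint16_t size;                  /* total record size, header included */
    uint16_t id;                    /* identifies the record in this demo */
};

struct ring {
    uint8_t  data[RB_SIZE];
    uint64_t head;                  /* free-running; masked on every access */
};

static void rb_init(struct ring *rb)
{
    memset(rb, 0, sizeof(*rb));
}

/* Write a record backward: move 'head' down first, then store the record. */
static void rb_write_backward(struct ring *rb, uint16_t id, uint16_t size)
{
    struct rec_header h = { size, id };
    const uint8_t *src = (const uint8_t *)&h;

    assert(size >= 8 && size % 8 == 0 && size <= RB_SIZE);
    rb->head -= size;               /* unsigned wraparound is well defined */
    for (size_t i = 0; i < size; i++)
        rb->data[(rb->head + i) & RB_MASK] =
            i < sizeof(h) ? src[i] : (uint8_t)id;   /* header, then payload */
}

/*
 * Walk from 'head' (the newest record) toward older records. Stop at
 * never-written space or when a record would reach past one full lap,
 * which means its tail has been overwritten. Returns the number of ids.
 */
static size_t rb_read_newest_first(const struct ring *rb,
                                   uint16_t *ids, size_t max_ids)
{
    uint64_t pos = rb->head;
    size_t consumed = 0, n = 0;

    while (n < max_ids && consumed + sizeof(struct rec_header) <= RB_SIZE) {
        struct rec_header h;
        uint8_t *dst = (uint8_t *)&h;

        for (size_t i = 0; i < sizeof(h); i++)
            dst[i] = rb->data[(pos + i) & RB_MASK];
        if (h.size < sizeof(h))             /* zeroed, never-written area */
            break;
        if (consumed + h.size > RB_SIZE)    /* partially overwritten record */
            break;
        ids[n++] = h.id;
        pos += h.size;
        consumed += h.size;
    }
    return n;
}
```

Writing records 'A' (24 bytes), 'B' (16), 'C' (8) and 'D' (8) and then reading yields D, C, B, A. After a further 24-byte record 'E' wraps around and clobbers the tail of 'A', reading from 'head' yields E, D, C, B and stops at the remnant of 'A', mirroring the last figure in the changelog. The sketch ignores concurrent writers, which is the case the pause ioctl discussed above is meant to handle.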
..
bpf	bpf: add missing map_flags to bpf_map_show_fdinfo	2016-03-25 11:36:41 -04:00
configs	kconfig: add xenconfig defconfig helper	2015-06-16 11:04:29 +01:00
debug	mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings	2016-02-22 08:51:37 +01:00
events	perf/core: Add ::write_backward attribute to perf event	2016-04-23 14:12:39 +02:00
gcov	gcov: use within_module() helper.	2015-12-04 22:46:25 +01:00
irq	kernel/...: convert pr_warning to pr_warn	2016-03-22 15:36:02 -07:00
livepatch	livepatch/module: remove livepatch module notifier	2016-03-17 09:45:10 +01:00
locking	locking/lockdep: Fix print_collision() unused warning	2016-04-04 11:41:34 +02:00
power	Power management and ACPI material for v4.6-rc1, part 2	2016-03-24 22:59:58 -07:00
printk	printk: add clear_idx symbol to vmcoreinfo	2016-03-17 15:09:34 -07:00
rcu	kernel: add kcov code coverage	2016-03-22 15:36:02 -07:00
sched	locking/atomic, sched: Unexport fetch_or()	2016-03-29 11:52:11 +02:00
time	timers/nohz: Convert tick dependency mask to atomic_t	2016-03-29 11:52:11 +02:00
trace	Linux 4.6-rc3	2016-04-13 08:57:03 +02:00
.gitignore	certs: add .gitignore to stop git nagging about x509_certificate_list	2015-10-21 15:18:35 +01:00
acct.c
async.c	async: export current_is_async()	2015-11-19 17:51:48 +01:00
audit_fsnotify.c	wrappers for ->i_mutex access	2016-01-22 18:04:28 -05:00
audit_tree.c	audit: audit_tree_match can be boolean	2015-11-04 08:23:51 -05:00
audit_watch.c	Merge branch 'stable-4.6' of git://git.infradead.org/users/pcmoore/audit	2016-03-19 17:52:49 -07:00
audit.c	Merge branch 'stable-4.6' of git://git.infradead.org/users/pcmoore/audit	2016-03-19 17:52:49 -07:00
audit.h	security: Make inode argument of inode_getsecid non-const	2015-12-24 11:09:39 -05:00
auditfilter.c	audit: Fix typo in comment	2016-02-08 11:25:39 -05:00
auditsc.c	auditsc: for seccomp events, log syscall compat state using in_compat_syscall	2016-03-22 15:36:02 -07:00
backtracetest.c
bounds.c
capability.c
cgroup_freezer.c	cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends	2015-12-03 10:24:08 -05:00
cgroup_pids.c	cgroup_pids: fix a typo.	2015-12-14 14:54:37 -05:00
cgroup.c	Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	2016-03-21 10:05:13 -07:00
compat.c	compat: cleanup coding in compat_get_bitmap() and compat_put_bitmap()	2015-06-04 23:57:18 +02:00
configs.c
context_tracking.c	context_tracking: Switch to new static_branch API	2015-11-24 09:56:43 +01:00
cpu_pm.c	kernel/cpu_pm: fix cpu_cluster_pm_exit comment	2015-09-03 02:42:20 +02:00
cpu.c	cpu/hotplug: Document states better	2016-03-12 20:57:38 +01:00
cpuset.c	Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	2016-03-21 10:05:13 -07:00
crash_dump.c
cred.c	kmemcg: account certain kmem allocations to memcg	2016-01-14 16:00:49 -08:00
delayacct.c	kmemcg: account certain kmem allocations to memcg	2016-01-14 16:00:49 -08:00
dma.c
elfcore.c
exec_domain.c
exit.c	oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space	2016-03-25 16:37:42 -07:00
extable.c	kernel/extable.c: remove duplicated include	2015-09-10 13:29:01 -07:00
fork.c	kernel: add kcov code coverage	2016-03-22 15:36:02 -07:00
freezer.c
futex_compat.c	ptrace: use fsuid, fsgid, effective creds for fs access checks	2016-01-20 17:09:18 -08:00
futex.c	futex: Replace barrier() in unqueue_me() with READ_ONCE()	2016-03-08 17:04:02 +01:00
groups.c
hung_task.c	kernel/hung_task.c: use timeout diff when timeout is updated	2016-03-22 15:36:02 -07:00
irq_work.c	treewide: Remove old email address	2015-11-23 09:44:58 +01:00
jump_label.c	treewide: Remove old email address	2015-11-23 09:44:58 +01:00
kallsyms.c	kallsyms: add support for relative offsets in kallsyms address table	2016-03-15 16:55:16 -07:00
kcmp.c	ptrace: use fsuid, fsgid, effective creds for fs access checks	2016-01-20 17:09:18 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks	locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS	2015-05-12 09:46:00 +02:00
Kconfig.preempt
kcov.c	kernel: add kcov code coverage	2016-03-22 15:36:02 -07:00
kexec_core.c	kexec: Set IORESOURCE_SYSTEM_RAM for System RAM	2016-01-30 09:49:57 +01:00
kexec_file.c	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security	2016-03-17 11:33:45 -07:00
kexec_internal.h	kexec: move some memembers and definitions within the scope of CONFIG_KEXEC_FILE	2016-01-20 17:09:18 -08:00
kexec.c	kexec: set KEXEC_TYPE_CRASH before sanity_check_segment_list()	2016-01-20 17:09:18 -08:00
kmod.c	kmod: don't run async usermode helper as a child of kworker thread	2015-10-23 17:55:10 +09:00
kprobes.c	perf/x86/hw_breakpoints: Disallow kernel breakpoints unless kprobe-safe	2015-08-04 10:16:54 +02:00
ksysfs.c	rcu: Remove TINY_RCU bloat from pointless boot parameters	2015-12-07 16:59:37 -08:00
kthread.c	kernel/kthread.c:kthread_create_on_node(): clarify documentation	2015-09-04 16:54:41 -07:00
latencytop.c	sched/debug: Make schedstats a runtime tunable that is disabled by default	2016-02-09 11:54:23 +01:00
Makefile	kernel: add kcov code coverage	2016-03-22 15:36:02 -07:00
membarrier.c	sys_membarrier(): system-wide memory barrier (generic, x86)	2015-09-11 15:21:34 -07:00
memremap.c	memremap: add MEMREMAP_WC flag	2016-03-22 15:36:02 -07:00
module_signing.c	X.509: Make algo identifiers text instead of enum	2016-03-03 21:49:27 +00:00
module-internal.h
module.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching	2016-03-17 21:46:32 -07:00
notifier.c	Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-09-01 08:40:25 -07:00
nsproxy.c	cgroup: introduce cgroup namespaces	2016-02-16 13:04:58 -05:00
padata.c
panic.c	panic: change nmi_panic from macro to function	2016-03-22 15:36:02 -07:00
params.c	Nothing exciting, minor tweaks and cleanups.	2015-11-09 15:53:39 -08:00
pid_namespace.c
pid.c	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-01-31 15:44:04 -08:00
profile.c	profile: hide unused functions when !CONFIG_PROC_FS	2016-03-22 15:36:02 -07:00
ptrace.c	ptrace: change __ptrace_unlink() to clear ->ptrace under ->siglock	2016-03-22 15:36:02 -07:00
range.c
reboot.c	kexec: split kexec_load syscall from kexec core code	2015-09-10 13:29:01 -07:00
relay.c	wrappers for ->i_mutex access	2016-01-22 18:04:28 -05:00
resource.c	/proc/iomem: only expose physical resource addresses to privileged users	2016-04-14 12:56:09 -07:00
seccomp.c	seccomp: check in_compat_syscall, not is_compat_task, in strict mode	2016-03-22 15:36:02 -07:00
signal.c	kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE	2016-03-22 15:36:02 -07:00
smp.c	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-03-15 13:50:29 -07:00
smpboot.c	cpu/hotplug: Unpark smpboot threads from the state machine	2016-03-01 20:36:56 +01:00
smpboot.h	cpu/hotplug: Create hotplug threads	2016-03-01 20:36:56 +01:00
softirq.c	arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections	2016-03-25 16:37:42 -07:00
stacktrace.c
stop_machine.c	kernel/stop_machine.c: remove CONFIG_SMP dependencies	2016-01-16 11:17:24 -08:00
sys_ni.c	vfs: add copy_file_range syscall and vfs helper	2015-12-01 14:00:53 -05:00
sys.c	timer: convert timer_slack_ns from unsigned long to u64	2016-03-17 15:09:34 -07:00
sysctl_binary.c	fs/coredump: prevent fsuid=0 dumps into user-controlled directories	2016-03-22 15:36:02 -07:00
sysctl.c	mm: scale kswapd watermarks in proportion to memory	2016-03-17 15:09:34 -07:00
task_work.c	task_work: remove fifo ordering guarantee	2015-09-05 13:46:58 -07:00
taskstats.c
test_kprobes.c
torture.c	torture: Consolidate cond_resched_rcu_qs() into stutter_wait()	2015-10-06 11:25:01 -07:00
tracepoint.c	kernel/...: convert pr_warning to pr_warn	2016-03-22 15:36:02 -07:00
tsacct.c	time, acct: Drop irq save & restore from __acct_update_integrals()	2016-02-29 09:53:09 +01:00
uid16.c
up.c
user_namespace.c	kernel/*: switch to memdup_user_nul()	2016-01-04 10:27:55 -05:00
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c
watchdog.c	watchdog: don't run proc_watchdog_update if new value is same as old	2016-03-17 15:09:34 -07:00
workqueue_internal.h	sched/core: Get rid of 'cpu' argument in wq_worker_sleeping()	2016-03-02 10:28:47 -05:00
workqueue.c	Merge branch 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	2016-03-18 20:05:39 -07:00