ring-buffer: Have nested events still record running time stamp
Up until now, if an event is interrupted by an interrupt while it is being
recorded, and that interrupt records events of its own, the time of those
nested events will all be the same. This is because events only record the
delta of the time since the previous event (or the beginning of a page), and
updating the time keeping for nested events is extremely racy. After years of
thinking about this and several failed attempts, I finally have a solution
to this puzzle.

The problem is that you need to atomically calculate the delta, update the
time stamp you made the delta from, and then record it into the buffer, all
while at any time an interrupt can come in and do the same thing. This is
easy to solve with heavy weight atomics, but that would be detrimental to
the performance of the ring buffer. The current state of affairs sacrificed
the time deltas of nested events for performance.
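
To make the race concrete, here's a sketch of the naive scheme (illustrative
only, not the ring buffer's actual code) and the window where a nested event
corrupts the delta:

	ts = clock();
	delta = ts - last_stamp;	/* read the previous stamp */
	/*
	 * <-- An interrupt here records its own events and updates
	 *     last_stamp. The delta just computed is now measured
	 *     against a stamp that no longer belongs to the event
	 *     directly before this one in the buffer.
	 */
	last_stamp = ts;		/* write the stamp back */
	event->time_delta = delta;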

The reason the previous attempts at solving this puzzle failed was that I
was trying to completely avoid slow atomic operations like cmpxchg. I
finally came to the conclusion that always avoiding cmpxchg is not possible,
which is why those previous attempts always failed. But it is possible to
pick one path (the most common case) and avoid cmpxchg in that path, which
is the "fast path". The most common case is that an event will not be
interrupted and will not have other events added to it. An event can detect
if it has interrupted another event, and for those cases we can make it the
slow path and use the heavy operations like cmpxchg.

One more player was added to the game that made this possible, and that is
the "absolute timestamp" (by Tom Zanussi) that allows us to inject a full 59
bit time stamp. (Of course this breaks if a machine is running for more than
18 years without a reboot!).
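
For the record, the 18 year figure falls straight out of the width of the
stamp, assuming nanosecond resolution:

	2^59 ns = 576,460,752,303,423,488 ns
	        ~ 5.76 * 10^8 seconds
	        ~ 18.3 years  (at ~3.156 * 10^7 seconds per year)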

There are barrier() placements around out of paranoia, even when they
are not needed because of other atomic functions nearby. But those
should not hurt; where they are not needed, they are basically a nop.

Note, this also makes the race window much smaller, which means there
are fewer slow paths to slow down the performance.

The basic idea is that there are two main paths taken.

 1) Not being interrupted between time stamps and reserving buffer space.
    In this case, the time stamps taken are true to the location in the
    buffer.

 2) Was interrupted by another event between taking the time stamps and
    reserving buffer space.

The objective is to know what the delta is from the last reserved location
in the buffer.

Since it is possible to detect, before reserving data, that an event is
interrupting another event, space is added to the length to be reserved so
that a full time stamp can be injected along with the event being reserved.
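
A minimal sketch of that length adjustment (the constant name here is
illustrative, not necessarily what ring_buffer.c calls it):

	if (abs)	/* this event interrupted a time update */
		len += RB_LEN_TIME_STAMP;	/* room for a full time stamp */
	write = local_add_return(len, write_tail);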

When an event is not interrupted, the write stamp is always the time of the
last event written to the buffer.

In path 1, there are two sub paths we care about:

 a) The event did not interrupt another event.
 b) The event interrupted another event.

In case a, as the write stamp was read and known to be correct, the delta
between the current time stamp and the write stamp is the delta between the
current event and the previously recorded event.
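
For example, with illustrative numbers:

	write_stamp = 1000	/* time of the previous event */
	ts          = 1250	/* clock() as read by this event */
	delta       = ts - write_stamp = 250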

In case b, extra space was reserved just to put the full time stamp into
the buffer. This is done because, as stated, in this path the time stamp
taken is known to match the location in the buffer.

In path 2, there are also two sub paths we care about:

 a) The event was not interrupted by another event between reserving space
    on the buffer and re-reading the write stamp.
 b) The event was interrupted by another event.

In case a, the write stamp is that of the last event that interrupted this
event between taking the time stamps and reserving. As no event came in
after re-reading the write stamp, that stamp is known to be the time of the
event directly before this event, and the delta can be the difference
between the new time stamp and the write stamp.

In case b, one or more events came in between reserving the event and
re-reading the write stamp. Since this event's buffer reservation is in
between other events on this path, there's no way to know what the delta
is. But because an event interrupted this event after it started, it's fine
to just give a zero delta and take the same time stamp as the events that
happened within the event being recorded.
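
An illustrative timeline of case b (nesting grows to the right):

	A: reads the stamps and takes ts
	     irq --> B reserves, writes, updates write_stamp
	A: reserves its space  (w != tail, so A is on path 2)
	     irq --> C reserves space after A
	A: re-reads write_tail, sees C moved it: delta = 0, and A
	   shares the time stamp of the events nested within it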

Here's the implementation of the design of this solution:

 All this is per cpu, and only needs to worry about nested events (not
 parallel events).

The players:

 write_tail: The index in the buffer where new events can be written to.
     It is incremented via local_add() to reserve space for a new event.

 before_stamp: A time stamp set by all events before reserving space.

 write_stamp: A time stamp updated by an event after it has successfully
     reserved space.

	/* Save the current position of write */
 [A]	w = local_read(write_tail);
	barrier();
	/* Read both before and write stamps before touching anything */
	before = local_read(before_stamp);
	after = local_read(write_stamp);
	barrier();

	/*
	 * If before and after are the same, then this event is not
	 * interrupting a time update. If it is, then reserve space for adding
	 * a full time stamp (this can turn into a time extend which is
	 * just an extended time delta but fill up the extra space).
	 */
	if (after != before)
		abs = true;

	ts = clock();

	/* Now update the before_stamp (everyone does this!) */
 [B]	local_set(before_stamp, ts);

	/* Now reserve space on the buffer */
 [C]	write = local_add_return(len, write_tail);

	/* Set tail to be where this event's data is */
	tail = write - len;

 	if (w == tail) {

		/* Nothing interrupted this between A and C */
 [D]		local_set(write_stamp, ts);
		barrier();
 [E]		save_before = local_read(before_stamp);

 		if (!abs) {
			/* This did not interrupt a time update */
			delta = ts - after;
		} else {
			delta = ts; /* The full time stamp will be in use */
		}
		if (ts != save_before) {
			/* slow path - Was interrupted between C and E */
			/* The update to write_stamp could have overwritten the update to
			 * it by the interrupting event, but before and after should be
			 * the same for all completed top events */
			after = local_read(write_stamp);
			if (save_before > after)
				local_cmpxchg(write_stamp, after, save_before);
		}
	} else {
		/* slow path - Interrupted between A and C */

		after = local_read(write_stamp);
		temp_ts = clock();
		barrier();
 [F]		if (write == local_read(write_tail) && after < temp_ts) {
			/* This was not interrupted between C and F
			 * The last write_stamp is still valid for the previous event
			 * in the buffer. */
			delta = temp_ts - after;
			/* OK to keep this new time stamp */
			ts = temp_ts;
		} else {
			/* Interrupted between C and F
			 * Well, there's no use to try to know what the time stamp
			 * is for the previous event. Just set delta to zero and
			 * be the same time as that event that interrupted us before
			 * the reservation of the buffer. */

			delta = 0;
		}
		/* No need to use full timestamps here */
		abs = false;
	}
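
As a sanity check of the fast path logic in isolation, here's a minimal
user-space sketch (plain variables stand in for the local_* ops, and it is
single-threaded, so the slow paths never trigger; all names here are
illustrative):

	#include <stdio.h>
	#include <stdint.h>

	static uint64_t write_tail, before_stamp, write_stamp;
	static uint64_t fake_clock = 1000;

	static uint64_t clock_read(void) { return fake_clock += 100; }

	static uint64_t reserve_event(uint64_t len)
	{
		uint64_t w, before, after, ts, write, tail, delta;
		int abs = 0;

		w = write_tail;			/* [A] */
		before = before_stamp;
		after = write_stamp;
		if (after != before)
			abs = 1;

		ts = clock_read();
		before_stamp = ts;		/* [B] */
		write = (write_tail += len);	/* [C] */
		tail = write - len;

		if (w == tail) {		/* nothing between A and C */
			write_stamp = ts;	/* [D] */
			delta = abs ? ts : ts - after;
		} else {
			delta = 0;		/* slow path; unreachable here */
		}
		return delta;
	}

	int main(void)
	{
		reserve_event(16);	/* first event: delta measured from 0 */
		/* second event: prints "delta = 100", the clock step */
		printf("delta = %llu\n",
		       (unsigned long long)reserve_event(16));
		return 0;
	}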

Link: https://lkml.kernel.org/r/20200625094454.732790f7@oasis.local.home
Link: https://lore.kernel.org/r/20200627010041.517736087@goodmis.org
Link: http://lkml.kernel.org/r/20200629025258.957440797@goodmis.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>