linux_dsm_epyc7002/kernel
Steven Rostedt b6345879cc tracing: Do not record user stack trace from NMI context
A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.

Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:

  # echo 1 > /debug/tracing/options/userstacktrace
  # echo 1 > /debug/tracing/events/lock/enable
  # while :; do ls > /dev/null ; done

Enabling the function graph tracer and looking at what was happening
I finally noticed that all cashes happened just after an NMI.

 1)               |    copy_user_handle_tail() {
 1)               |      bad_area_nosemaphore() {
 1)               |        __bad_area_nosemaphore() {
 1)               |          no_context() {
 1)               |            fixup_exception() {
 1)   0.319 us    |              search_exception_tables();
 1)   0.873 us    |            }
[...]
 1)   0.314 us    |  __rcu_read_unlock();
 1)   0.325 us    |    native_apic_mem_write();
 1)   0.943 us    |  }
 1)   0.304 us    |  rcu_nmi_exit();
[...]
 1)   0.479 us    |  find_vma();
 1)               |  bad_area() {
 1)               |    __bad_area() {

After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read the regs->ip to see where the NMI happened. In which I
found out it was here:

ffffffff8135b660 <page_fault>:
ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>

What was happening is that the NMI would happen at the place that a page
fault occurred. It would call rcu_read_lock() which was traced by
the lock events, and the user_stack_trace would run. This would trigger
a page fault inside the NMI. I do not see where the CR2 register is
saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.

The reason the while loop of ls helped trigger the bug, was that
each execution of ls would cause lots of pages to be faulted in, and
increase the chances of the race happening.

The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.

Cc: stable@kernel.org
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-12 20:31:49 -05:00
..
gcov microblaze: Enable GCOV_PROFILE_ALL 2009-09-21 14:29:21 +02:00
irq genirq: Convert irq_desc.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
power PM / Hibernate: Fix preallocating of memory 2010-02-26 20:39:13 +01:00
time Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-03-01 08:48:25 -08:00
trace tracing: Do not record user stack trace from NMI context 2010-03-12 20:31:49 -05:00
.gitignore
acct.c bsdacct: fix uid/gid misreporting 2009-12-15 08:53:10 -08:00
async.c
audit_tree.c fix more leaks in audit_tree.c tag_chunk() 2009-12-19 09:27:43 -08:00
audit_watch.c Audit: reorganize struct audit_watch to save 8 bytes 2009-09-24 03:50:25 -04:00
audit.c Audit: send signal info if selinux is disabled 2009-09-24 03:50:26 -04:00
audit.h Fix rule eviction order for AUDIT_DIR 2009-06-24 00:02:38 -04:00
auditfilter.c
auditsc.c Sanitize f_flags helpers 2009-12-22 12:27:34 -05:00
backtracetest.c
bounds.c kbuild: move bounds.h to include/generated 2009-12-12 13:08:14 +01:00
capability.c remove CONFIG_SECURITY_FILE_CAPABILITIES compile option 2009-11-24 15:06:47 +11:00
cgroup_freezer.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
cgroup.c sched, cgroups: Fix module export 2010-02-25 12:02:13 +01:00
compat.c
configs.c
cpu.c sched: Correct printk whitespace in warning from cpu down task check 2010-01-28 06:59:55 +01:00
cpuset.c sched: Fix balance vs hotplug race 2009-12-06 21:10:56 +01:00
cred-internals.h
cred.c kernel/cred.c: use kmem_cache_free 2010-02-03 10:21:57 +11:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
exec_domain.c
exit.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
extable.c
fork.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
freezer.c sched: fix nr_uninterruptible accounting of frozen tasks really 2009-07-18 14:19:53 +02:00
futex_compat.c futex: Protect pid lookup in compat code with RCU 2009-12-09 14:22:14 +01:00
futex.c futex: Handle futex value corruption gracefully 2010-02-03 15:13:22 +01:00
groups.c
hrtimer.c hrtimers: Convert to raw_spinlocks 2009-12-14 23:55:34 +01:00
hung_task.c softlockup: Fix hung_task_check_count sysctl 2009-11-27 06:21:57 +01:00
hw_breakpoint.c perf: Make bp_len type to u64 generic across the arch 2010-02-04 01:07:12 +01:00
itimer.c itimers: Fix racy writes to cpu_itimer fields 2009-11-18 16:32:12 +01:00
kallsyms.c hw-breakpoints: Fix broken hw-breakpoint sample module 2009-11-10 11:23:29 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
Kconfig.preempt
kexec.c Merge git://git.infradead.org/~dwmw2/mtd-2.6.33 2010-01-24 10:31:34 -08:00
kfifo.c kfifo: Don't use integer as NULL pointer 2010-02-16 15:11:08 -08:00
kgdb.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-04 16:07:41 -08:00
kmod.c kmod: fix resource leak in call_usermodehelper_pipe() 2010-01-11 09:34:04 -08:00
kprobes.c kprobes: Add mcount to the kprobes blacklist 2010-02-15 05:45:49 +01:00
ksysfs.c sched: Remove USER_SCHED 2010-01-21 13:40:18 +01:00
kthread.c kthread, sched: Remove reference to kthread_create_on_cpu 2010-02-09 11:47:39 +01:00
latencytop.c
lockdep_internals.h lockdep: BFS cleanup 2009-07-24 10:53:29 +02:00
lockdep_proc.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
lockdep_states.h
lockdep.c rcu: Make lockdep_rcu_dereference() message less alarmist 2010-02-26 08:20:46 +01:00
Makefile padata: Generic parallelization/serialization interface 2010-01-06 19:47:10 +11:00
module.c modules: Skip empty sections when exporting section notes 2010-01-06 01:11:29 -08:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h locking: Implement new raw_spinlock 2009-12-14 23:55:32 +01:00
mutex.c mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
mutex.h
notifier.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c
padata.c padata: Generic parallelization/serialization interface 2010-01-06 19:47:10 +11:00
panic.c kmsg_dump: Dump on crash_kexec as well 2009-12-31 19:45:04 +00:00
params.c tree-wide: convert open calls to remove spaces to skip_spaces() lib function 2009-12-15 08:53:32 -08:00
perf_event.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 10:20:25 -08:00
pid_namespace.c pidns: deny CLONE_PARENT|CLONE_NEWPID combination 2009-09-24 07:21:04 -07:00
pid.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
pm_qos_params.c pm_qos: clean up racy global "name" variable 2009-10-14 15:31:10 +02:00
posix-cpu-timers.c posix-cpu-timers: optimize and document timer_create callback 2009-11-18 12:36:05 +01:00
posix-timers.c posix-timers.c: Don't export local functions 2010-02-05 14:54:10 +01:00
printk.c Merge git://git.infradead.org/~dwmw2/mtd-2.6.33 2010-01-24 10:31:34 -08:00
profile.c kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file 2009-09-20 20:15:40 +02:00
ptrace.c ptrace: Fix ptrace_regset() comments and diagnose errors specifically 2010-02-23 13:45:26 -08:00
rcupdate.c rcu: Export rcu_scheduler_active 2010-02-26 08:20:46 +01:00
rcutiny.c rcu: Eliminate unneeded function wrapping 2009-11-22 18:58:16 +01:00
rcutorture.c rcu: Fix rcutorture mod_timer argument to delay one jiffy 2010-02-25 10:35:00 +01:00
rcutree_plugin.h rcu: Fix accelerated GPs for last non-dynticked CPU 2010-02-27 09:53:53 +01:00
rcutree_trace.c rcu: Stop overflowing signed integers 2010-02-25 10:34:57 +01:00
rcutree.c rcu: Fix accelerated grace periods for last non-dynticked CPU 2010-02-27 09:53:52 +01:00
rcutree.h rcu: Fix accelerated grace periods for last non-dynticked CPU 2010-02-27 09:53:52 +01:00
relay.c const: constify remaining pipe_buf_operations 2009-12-16 07:20:05 -08:00
res_counter.c memcg: some modification to softlimit under hierarchical memory reclaim. 2009-10-01 16:11:13 -07:00
resource.c Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 10:38:45 -08:00
rtmutex_common.h
rtmutex-debug.c sched: Convert pi_lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutes: Convert rtmutex.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex.h
rwsem.c
sched_clock.c sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 2009-12-15 09:04:36 +01:00
sched_cpupri.c sched: Use for_each_bit 2010-02-02 06:58:27 +01:00
sched_cpupri.h sched: Convert cpupri lock to raw_spinlock 2009-12-14 23:55:33 +01:00
sched_debug.c sched: Convert rq->lock to raw_spinlock 2009-12-14 23:55:33 +01:00
sched_fair.c sched: Fix SCHED_MC regression caused by change in sched cpu_power 2010-02-26 15:45:13 +01:00
sched_features.h sched: Discard some old bits 2009-12-09 10:03:07 +01:00
sched_idletask.c sched: Remove the sched_class load_balance methods 2010-01-21 13:40:09 +01:00
sched_rt.c sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu] 2010-02-04 09:57:32 +01:00
sched_stats.h
sched.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 10:31:01 -08:00
seccomp.c
semaphore.c
signal.c kernel/signal.c: fix kernel information leak with print-fatal-signals=1 2010-01-11 09:34:05 -08:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 2009-12-08 07:38:50 -08:00
slow-work.h SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
smp.c generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data 2010-01-18 09:02:59 +01:00
softirq.c hrtimer, softirq: Fix hrtimer->softirq trampoline 2010-02-03 18:17:40 +01:00
softlockup.c softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume 2010-02-01 08:22:32 +01:00
spinlock.c locking: Cleanup the name space completely 2009-12-14 23:55:33 +01:00
srcu.c rcu: Introduce lockdep-based checking to RCU read-side primitives 2010-02-25 09:40:59 +01:00
stacktrace.c
stop_machine.c
sys_ni.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-12-08 07:55:01 -08:00
sys.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 10:31:01 -08:00
sysctl_binary.c SYSCTL: Print binary sysctl warnings (nearly) only once 2009-12-23 21:00:20 +01:00
sysctl_check.c ipv4 05/05: add sysctl to accept packets with local source addresses 2009-12-03 12:14:38 -08:00
sysctl.c sparc: Support show_unhandled_signals. 2010-03-01 00:02:23 -08:00
taskstats.c const: struct nla_policy 2010-02-18 14:30:18 -08:00
test_kprobes.c
time.c Revert "time: Remove xtime_cache" 2009-12-22 14:10:37 -08:00
timeconst.pl
timer.c perf: Fix perf_event_do_pending() fallback callsite 2010-01-21 13:40:39 +01:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user_namespace.c
user-return-notifier.c core: Clean up user return notifers use of per_cpu 2009-12-02 10:22:59 +01:00
user.c sched: Remove USER_SCHED 2010-01-21 13:40:18 +01:00
utsname_sysctl.c sysctl kernel: Remove binary sysctl logic 2009-11-12 02:04:55 -08:00
utsname.c
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2009-12-10 09:35:44 -08:00