Commit 135e8c9250 (mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git)
The origin of the issue I've seen is related to a missing memory barrier
between the check for task->state and the check for task->on_rq.

The task being woken up is already awake from a schedule() and is doing
the following:

    do {
        schedule()
        set_current_state(TASK_(UN)INTERRUPTIBLE);
    } while (!cond);

The waker actually gets stuck doing the following in try_to_wake_up():

    while (p->on_cpu)
        cpu_relax();

Analysis:

The instance I've seen involves the following race:

    CPU1                                  CPU2

    while () {
      if (cond)
        break;
      do {
        schedule();
        set_current_state(TASK_UN..)
      } while (!cond);
                                          wakeup_routine()
                                            spin_lock_irqsave(wait_lock)
      raw_spin_lock_irqsave(wait_lock)      wake_up_process()
    }                                       try_to_wake_up()
    set_current_state(TASK_RUNNING);        ..
    list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set the
current state to TASK_RUNNING the following occurs:

    CPU3
    wakeup_routine()
    raw_spin_lock_irqsave(wait_lock)
    if (!list_empty)
      wake_up_process()
      try_to_wake_up()
      raw_spin_lock_irqsave(p->pi_lock)
      ..
      if (p->on_rq && ttwu_wakeup())
      ..
      while (p->on_cpu)
        cpu_relax()
      ..

CPU3 tries to wake up the task on CPU1 again, since it finds it on the
wait_queue; CPU1 is spinning on the wait_lock, but immediately after
CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1; it is TASK_UNINTERRUPTIBLE and the
task is spinning on the wait_lock. Interestingly, since p->on_rq is
checked under pi_lock, I've noticed that try_to_wake_up() finds p->on_rq
to be 0. This was the most confusing bit of the analysis, but p->on_rq
is changed under the runqueue lock, rq_lock, so the p->on_rq check is
not reliable without this fix IMHO. The race is visible (based on the
analysis) only when ttwu_queue() does a remote wakeup via
ttwu_queue_remote(), in which case the p->on_rq change is not done
under the pi_lock.

The result is that after a while the entire system locks up on
raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads, with available memory tweaked very low and running the memory
stress-ng mmapfork test. It usually takes a long time to reproduce.
I am trying to work on a test case that can reproduce the issue faster,
but that's work in progress. I am still testing the changes on my
system in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well. Ben
helped catch the missing barrier, Nick caught every missing bit in my
theory.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many architectures do
  not have a full barrier in switch_to() so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
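As the analysis above says, the missing piece is a memory barrier between the waker's check of task->state and its check of task->on_rq (a read barrier, smp_rmb() in kernel terms), pairing with the ordering of the corresponding stores on the wakee's side. The sketch below is a minimal userspace analogue of that pairing using C11 atomics; it illustrates the barrier pattern only and is not the kernel code. The thread names, the two flag variables, and the fences standing in for the kernel's release/rmb primitives are assumptions introduced for the example.

```c
/* race_sketch.c - illustrative only, not kernel code.
 * Models the ordering the fix relies on: the "wakee" side publishes
 * on_rq = 1 and only later sets state = UNINTERRUPTIBLE; a "waker" that
 * observes the newer state value must also observe on_rq == 1, which
 * requires a load-load barrier between its two reads.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RUNNING         0
#define UNINTERRUPTIBLE 1

static atomic_int state = RUNNING; /* stands in for p->state */
static atomic_int on_rq = 0;       /* stands in for p->on_rq */

/* Wakee side: gets enqueued (on_rq = 1), runs, then goes back to sleep. */
static void *wakee(void *arg)
{
    (void)arg;
    atomic_store_explicit(&on_rq, 1, memory_order_relaxed);
    /* Stands in for the release ordering (unlock/lock chain) that, in the
     * kernel, orders the on_rq store before the later state store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&state, UNINTERRUPTIBLE, memory_order_relaxed);
    return NULL;
}

/* Waker side: the analogue of the two checks in try_to_wake_up(). */
static void *waker(void *arg)
{
    (void)arg;
    if (atomic_load_explicit(&state, memory_order_relaxed) == UNINTERRUPTIBLE) {
        /* Analogue of the added read barrier: order the state load before
         * the on_rq load, so a reader that saw the newer state value
         * cannot then read a stale on_rq. */
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load_explicit(&on_rq, memory_order_relaxed) == 0)
            puts("stale on_rq == 0 -- should not happen with the fences");
        else
            puts("on_rq == 1 observed -- wakeup can proceed");
    } else {
        puts("task not sleeping yet");
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, wakee, NULL);
    pthread_create(&b, NULL, waker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

Compile with `cc -std=c11 -pthread race_sketch.c`. With the two fences removed, a weakly ordered machine may let the waker see the new state value but a stale on_rq == 0, which corresponds to the false p->on_rq == 0 observation described in the analysis; x86 does not reorder loads with loads, so the window only exists on weakly ordered hardware.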
auto_group.c
auto_group.h
clock.c
completion.c
core.c
cpuacct.c
cpuacct.h
cpudeadline.c
cpudeadline.h
cpufreq_schedutil.c
cpufreq.c
cpupri.c
cpupri.h
cputime.c
deadline.c
debug.c
fair.c
features.h
idle_task.c
idle.c
loadavg.c
Makefile
rt.c
sched.h
stats.c
stats.h
stop_task.c
swait.c
wait.c