linux_dsm_epyc7002/kernel
Thomas Gleixner 206b92353c cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
Tianyu reported a crash in a CPU hotplug teardown callback when booting a
kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
parameter.

It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
forever when a bringup callback fails. Unfortunately this issue was
not recognized when the CPU hotplug code was reworked, so the shortcoming
just stayed in place.

When a bringup callback fails, the CPU hotplug code rolls back the
operation and takes the CPU offline.

The 'nosmt' command line argument uses a bringup failure to abort the
bringup of SMT sibling CPUs. This partial bringup is required due to the
MCE misdesign on Intel CPUs.

With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
teardown of a CPU including the synchronizations in various facilities like
RCU, NOHZ and others.

As a consequence the teardown callbacks which must be executed on the
outgoing CPU within stop machine with interrupts disabled are executed on
the control CPU in interrupt enabled and preemptible context causing the
kernel to crash and burn. The pre state machine code has a different
failure mode which is more subtle and results in a less obvious use after
free crash because the control side frees resources which are still in use
by the undead CPU.

But this is not an x86 only problem. Any architecture which supports the
SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
likely to be triggered because in 99.99999% of the cases all bringup
callbacks succeed.

The easy solution of making HOTPLUG_CPU mandatory for SMP does not work on
all architectures as the following architectures have either no hotplug
support at all or not all subarchitectures support it:

 alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).

Crashing the kernel in such a situation is not an acceptable state
either.

Implement a minimal rollback variant by limiting the teardown to the point
where all regular teardown callbacks have been invoked and leave the CPU in
the 'dead' idle state. This has the following consequences:

 - the CPU is brought down to the point where the stop_machine takedown
   would happen.

 - the CPU stays there forever and is idle

 - The CPU is cleared in the CPU active mask, but not in the CPU online
   mask which is a legit state.

 - Interrupts are not forced away from the CPU

 - All facilities which only look at online mask would still see it, but
   that is the case during normal hotplug/unplug operations as well. It's
   just a (way) longer time frame.
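
The consequences listed above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not the code in
kernel/cpu.c: the sim_* names, the reduced state ladder and the boolean
mask bits are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced state ladder (not the kernel's enum). */
enum sim_state {
	SIM_OFFLINE = 0,
	SIM_BRINGUP_CPU,
	SIM_AP_IDLE,	/* 'dead' idle: leaving it downwards models the takedown */
	SIM_AP_ACTIVE,
	SIM_ONLINE,
};

struct sim_cpu {
	enum sim_state state;
	bool online;	/* models the bit in the CPU online mask */
	bool active;	/* models the bit in the CPU active mask */
};

/* Lowest state a rollback may reach in this model. */
enum sim_state sim_rollback_floor(bool hotplug_enabled)
{
	return hotplug_enabled ? SIM_OFFLINE : SIM_AP_IDLE;
}

/* Walk the states downwards, undoing each one, but never below 'floor'. */
void sim_rollback(struct sim_cpu *cpu, enum sim_state floor)
{
	assert(floor <= cpu->state);

	while (cpu->state > floor) {
		if (cpu->state == SIM_AP_ACTIVE)
			cpu->active = false;	/* regular teardown callback */
		if (cpu->state == SIM_AP_IDLE)
			cpu->online = false;	/* models the full takedown */
		cpu->state--;
	}
}

/*
 * Bring the CPU up towards SIM_ONLINE. 'fail_at' models a bringup
 * callback which fails when that state is about to be entered.
 * Returns 0 on success and -1 after a rollback.
 */
int sim_cpu_up(struct sim_cpu *cpu, enum sim_state fail_at,
	       bool hotplug_enabled)
{
	while (cpu->state < SIM_ONLINE) {
		enum sim_state next = cpu->state + 1;

		if (next == fail_at) {
			sim_rollback(cpu, sim_rollback_floor(hotplug_enabled));
			return -1;
		}
		cpu->state = next;
		if (next == SIM_AP_IDLE)
			cpu->online = true;
		if (next == SIM_AP_ACTIVE)
			cpu->active = true;
	}
	return 0;
}
```

In this model a failure at SIM_ONLINE with hotplug_enabled == false leaves
the CPU in SIM_AP_IDLE with the online bit set and the active bit cleared,
while hotplug_enabled == true rolls it all the way back to SIM_OFFLINE.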

This will expose issues which haven't been exposed before, or only seldom,
because the normally transient state of being non active but online is now
a permanent state. Testing already exposed an issue vs. work queues where
the vmstat code schedules work on the almost dead CPU, which ends up in an
unbound workqueue and triggers 'preemptible context' warnings. This is not
a problem of this change, it merely exposes an already existing issue.
Still this is better than crashing fully without a chance to debug it.

This is mainly intended as a workaround for those architectures which do not
support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.

Fixes: 2e1a3483ce ("cpu/hotplug: Split out the state walk into functions")
Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Michael Kelley <michael.h.kelley@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
2019-03-28 13:34:58 +01:00
..
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-11 08:54:01 -07:00
cgroup Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 14:08:19 -07:00
configs kvm_config: add CONFIG_VIRTIO_MENU 2018-10-24 20:55:56 -04:00
debug kdb: use bool for binary state indicators 2018-12-30 08:31:52 +00:00
dma memblock: drop memblock_alloc_*_nopanic() variants 2019-03-12 10:04:02 -07:00
events perf/core improvements and fixes: 2019-03-22 22:50:41 +01:00
gcov kernel/gcov/gcc_3_4.c: use struct_size() in kzalloc() 2019-03-07 18:32:02 -08:00
irq genirq: Mark expected switch case fall-through 2019-03-23 12:32:01 +01:00
livepatch Merge branch 'for-5.1/atomic-replace' into for-linus 2019-03-05 15:56:59 +01:00
locking locking/lockdep: Only call init_rcu_head() after RCU has been initialized 2019-03-09 14:15:51 +01:00
power treewide: add checks for the return value of memblock_alloc*() 2019-03-12 10:04:02 -07:00
printk fbdev changes for v5.1: 2019-03-15 14:22:59 -07:00
rcu Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-03-06 07:59:36 -08:00
sched Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-03-24 11:42:10 -07:00
time time/jiffies: Make refined_jiffies static 2019-03-22 13:38:26 +01:00
trace for-5.1/block-post-20190315 2019-03-16 12:36:39 -07:00
.gitignore kernel/configs: use .incbin directive to embed config_data.gz 2019-03-07 18:32:02 -08:00
acct.c
async.c async: Add support for queueing on specific NUMA node 2019-01-31 14:20:54 +01:00
audit_fsnotify.c audit: add syscall information to CONFIG_CHANGE records 2019-01-18 17:53:29 -05:00
audit_tree.c audit: hand taken context to audit_kill_trees for syscall logging 2019-01-14 18:01:05 -05:00
audit_watch.c audit: add syscall information to CONFIG_CHANGE records 2019-01-18 17:53:29 -05:00
audit.c audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL 2019-02-03 17:49:35 -05:00
audit.h audit: hide auditsc_get_stamp and audit_serial prototypes 2019-02-07 21:44:27 -05:00
auditfilter.c audit: mark expected switch fall-through 2019-02-12 20:17:13 -05:00
auditsc.c audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL 2019-02-03 17:49:35 -05:00
backtracetest.c
bounds.c kbuild: fix kernel/bounds.c 'W=1' warning 2018-10-31 08:54:14 -07:00
capability.c LSM: add SafeSetID module that gates setid calls 2019-01-25 11:22:43 -08:00
compat.c time: make adjtime compat handling available for 32 bit 2019-02-07 00:13:27 +01:00
configs.c kernel/configs: use .incbin directive to embed config_data.gz 2019-03-07 18:32:02 -08:00
context_tracking.c
cpu_pm.c
cpu.c cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n 2019-03-28 13:34:58 +01:00
crash_core.c kexec: export PG_offline to VMCOREINFO 2019-03-05 21:07:14 -08:00
crash_dump.c
cred.c SELinux: Remove cred security blob poisoning 2019-01-08 13:18:44 -08:00
delayacct.c delayacct: track delays from thrashing cache pages 2018-10-26 16:26:32 -07:00
dma.c
elfcore.c
exec_domain.c
exit.c Merge branch 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2019-03-07 10:11:41 -08:00
extable.c
fail_function.c kernel/fail_function.c: remove meaningless null pointer check before debugfs_remove_recursive 2018-10-31 08:54:12 -07:00
fork.c 5.1 Merge Window Pull Request 2019-03-09 15:53:03 -08:00
freezer.c PM / reboot: Eliminate race between reboot and suspend 2018-08-06 12:35:20 +02:00
futex.c futex: Ensure that futex address is aligned in handle_futex_death() 2019-03-22 13:05:26 +01:00
groups.c
hung_task.c kernel/hung_task.c: Use continuously blocked time when reporting. 2019-03-07 18:31:59 -08:00
iomem.c
irq_work.c
jump_label.c jump_label: move 'asm goto' support test to Kconfig 2019-01-06 09:46:51 +09:00
kallsyms.c bpf: Add module name [bpf] to ksymbols for bpf programs 2019-01-21 17:38:56 -03:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks bpf: introduce bpf_spin_lock 2019-02-01 20:55:38 +01:00
Kconfig.preempt kconfig: warn no new line at end of file 2018-12-15 17:44:35 +09:00
kcov.c kcov: convert kcov.refcount to refcount_t 2019-03-07 18:32:02 -08:00
kexec_core.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
kexec_file.c kexec_file: kexec_walk_memblock() only walks a dedicated region at kdump 2018-12-06 14:38:50 +00:00
kexec_internal.h
kexec.c kexec: add call to LSM hook in original kexec_load syscall 2018-07-16 12:31:57 -07:00
kmod.c
kprobes.c kprobes: Search non-suffixed symbol in blacklist 2019-02-13 08:16:40 +01:00
ksysfs.c
kthread.c Merge branch 'akpm' (patches from Andrew) 2019-03-06 10:31:36 -08:00
latencytop.c
Makefile kernel/configs: use .incbin directive to embed config_data.gz 2019-03-07 18:32:02 -08:00
memremap.c mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm 2018-12-28 12:11:52 -08:00
module_signing.c modsign: use all trusted keys to verify module signature 2018-11-07 14:41:41 +01:00
module-internal.h
module.c dynamic_debug: add static inline stub for ddebug_add_module 2019-03-07 18:32:00 -08:00
notifier.c
nsproxy.c
padata.c padata: clean an indentation issue, remove extraneous space 2018-11-16 14:11:04 +08:00
panic.c kernel/panic.c: taint: fix debugfs_simple_attr.cocci warnings 2019-03-07 18:31:59 -08:00
params.c
pid_namespace.c signal: Use group_send_sig_info to kill all processes in a pid namespace 2018-09-16 16:08:25 +02:00
pid.c Fix failure path in alloc_pid() 2018-12-28 12:42:30 -08:00
profile.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
ptrace.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
range.c
reboot.c kernel/reboot.c: export pm_power_off_prepare 2018-09-11 16:13:24 +01:00
relay.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 13:27:20 -07:00
resource.c device-dax for 5.1 2019-03-16 13:05:32 -07:00
rseq.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
seccomp.c Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2019-03-07 11:44:01 -08:00
signal.c pidfd patches for v5.1-rc1 2019-03-16 13:47:14 -07:00
smp.c cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM 2019-01-30 19:27:00 +01:00
smpboot.c
smpboot.h
softirq.c softirq: Don't skip softirq execution when softirq thread is parking 2019-02-10 21:51:39 +01:00
stackleak.c stackleak: Mark stackleak_track_stack() as notrace 2018-12-05 19:31:44 -08:00
stacktrace.c
stop_machine.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-08-13 11:25:07 -07:00
sys_ni.c pidfd patches for v5.1-rc1 2019-03-16 13:47:14 -07:00
sys.c Merge branch 'akpm' (patches from Andrew) 2019-03-07 19:25:37 -08:00
sysctl_binary.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
sysctl.c kernel/sysctl.c: define minmax conv functions in terms of non-minmax versions 2019-03-12 10:04:00 -07:00
task_work.c
taskstats.c
test_kprobes.c
torture.c Merge branches 'doc.2019.01.26a', 'fixes.2019.01.26a', 'sil.2019.01.26a', 'spdx.2019.02.09a', 'srcu.2019.01.26a' and 'torture.2019.01.26a' into HEAD 2019-02-09 08:47:52 -08:00
tracepoint.c tracing: Replace synchronize_sched() and call_rcu_sched() 2018-11-27 09:21:41 -08:00
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c umh: add exit routine for UMH process 2019-01-11 18:05:40 -08:00
up.c smp,cpumask: introduce on_each_cpu_cond_mask 2018-10-09 16:51:11 +02:00
user_namespace.c userns: also map extents in the reverse map to kernel IDs 2018-11-07 23:51:16 -06:00
user-return-notifier.c
user.c userns: use irqsave variant of refcount_dec_and_lock() 2018-08-22 10:52:47 -07:00
utsname_sysctl.c sys: don't hold uts_sem while accessing userspace memory 2018-08-11 02:05:53 -05:00
utsname.c
watchdog_hld.c watchdog: Mark watchdog touch functions as notrace 2018-08-30 12:56:40 +02:00
watchdog.c watchdog/core: Make variables static 2019-03-22 13:40:17 +01:00
workqueue_internal.h psi: fix aggregation idle shut-off 2019-02-01 15:46:23 -08:00
workqueue.c workqueue: Only unregister a registered lockdep key 2019-03-21 12:00:18 +01:00