linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 11:18:45 +07:00

Daniel Axtens f2dd80ecca powerpc/powernv: Panic on unhandled Machine Check

All unrecovered machine check errors on PowerNV should cause an immediate panic. There are 2 reasons that this is the right policy: it's not safe to continue, and we're already trying to reboot.

Firstly, if we go through the recovery process and do not successfully recover, we can't be sure about the state of the machine, and it is not safe to recover and proceed.

Linux knows about the following sources of Machine Check Errors:
 - Uncorrectable Errors (UE)
 - Effective to Real Address Translation (ERAT)
 - Segment Lookaside Buffer (SLB)
 - Translation Lookaside Buffer (TLB)
 - Unknown/Unrecognised

In the SLB, TLB and ERAT cases, we can further categorise these as parity errors, multihit errors or unknown/unrecognised.

We can handle SLB errors by flushing and reloading the SLB. We can handle TLB and ERAT multihit errors by flushing the TLB. (It appears we may not handle TLB and ERAT parity errors: I will investigate further and send a followup patch if appropriate.)

This leaves us with uncorrectable errors. Uncorrectable errors are usually the result of ECC memory detecting an error that it cannot correct, but they also crop up in the context of PCI cards failing during DMA writes, and during CAPI error events.

There are several types of UE, and there are 3 places a UE can occur: Skiboot, the kernel, and userspace. For Skiboot errors, we have the facility to make some recoverable. For userspace, we can simply kill (SIGBUS) the affected process. We have no meaningful way to deal with UEs in kernel space or in unrecoverable sections of Skiboot.

Currently, these unrecovered UEs fall through to machine_check_exception() in traps.c, which calls die(), which OOPSes and sends SIGBUS to the process. This sometimes allows us to stumble onwards. For example we've seen UEs kill the kernel eehd and khugepaged. However, the process killed could have held a lock, or it could have been a more important process, etc: we can no longer make any assertions about the state of the machine. Similarly if we see a UE in Skiboot (and again we've seen this happen), we're not in a position where we can make any assertions about the state of the machine.

Likewise, for unknown or unrecognised errors, we're not able to say anything about the state of the machine.

Therefore, if we have an unrecovered MCE, the most appropriate thing to do is to panic.

The second reason is that since `e784b6499d` ("powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable machine check errors."), we attempt a special OPAL reboot on an unhandled MCE. This is so the hardware can record error data for later debugging.

The comments in that commit assert that we are heading down the panic path anyway. At the moment this is not always true. With UEs in kernel space, for instance, they are marked as recoverable by the hardware, so if the attempt to reboot failed (e.g. old Skiboot), we wouldn't panic() but would simply die() and OOPS. It doesn't make sense to be staggering on if we've just tried to reboot: we should panic().

Explicitly panic() on unrecovered MCEs on PowerNV. Update the comments appropriately.

This fixes some hangs following EEH events on cxlflash setups.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

2015-10-09 08:07:19 +11:00
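The policy described in the commit message above can be sketched as a small decision function. This is a user-space illustration only, not the code in opal.c: every type, enumerator and function name here (`mce_event`, `mce_decide`, `MCE_ACT_PANIC` and so on) is hypothetical and was invented for this sketch. It assumes that TLB and ERAT parity errors are treated as unrecovered, since the message says they may not be handled, and that any SLB error is handled by a flush and reload.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; these are not kernel types. */
enum mce_source { MCE_SRC_UE, MCE_SRC_ERAT, MCE_SRC_SLB, MCE_SRC_TLB, MCE_SRC_UNKNOWN };
enum mce_kind   { MCE_KIND_PARITY, MCE_KIND_MULTIHIT, MCE_KIND_UNKNOWN };
enum ue_origin  { UE_NOT_APPLICABLE, UE_SKIBOOT_RECOVERABLE,
                  UE_SKIBOOT_UNRECOVERABLE, UE_KERNEL, UE_USERSPACE };
enum mce_action { MCE_ACT_FLUSH_SLB, MCE_ACT_FLUSH_TLB,
                  MCE_ACT_FIRMWARE_RECOVERY, MCE_ACT_SIGBUS, MCE_ACT_PANIC };

struct mce_event {
    enum mce_source source;  /* where the machine check came from */
    enum mce_kind   kind;    /* sub-category for SLB/TLB/ERAT errors */
    enum ue_origin  origin;  /* only meaningful for uncorrectable errors */
};

/* Map an event to the action the commit message describes for it. */
enum mce_action mce_decide(const struct mce_event *evt)
{
    assert(evt != NULL);

    switch (evt->source) {
    case MCE_SRC_SLB:
        /* SLB errors: flush and reload the SLB. */
        return MCE_ACT_FLUSH_SLB;
    case MCE_SRC_TLB:
    case MCE_SRC_ERAT:
        /* Only multihit errors are described as handled (by a TLB flush). */
        if (evt->kind == MCE_KIND_MULTIHIT)
            return MCE_ACT_FLUSH_TLB;
        /* Assumption: parity and unknown kinds count as unrecovered. */
        return MCE_ACT_PANIC;
    case MCE_SRC_UE:
        switch (evt->origin) {
        case UE_SKIBOOT_RECOVERABLE:
            return MCE_ACT_FIRMWARE_RECOVERY;
        case UE_USERSPACE:
            /* Kill the affected process with SIGBUS. */
            return MCE_ACT_SIGBUS;
        default:
            /* Kernel space or unrecoverable Skiboot: nothing safe to do. */
            return MCE_ACT_PANIC;
        }
    default:
        /* Unknown or unrecognised source: machine state is not trustworthy. */
        return MCE_ACT_PANIC;
    }
}
```

Under these assumptions, a UE with `origin` set to `UE_KERNEL` maps to `MCE_ACT_PANIC` rather than to an OOPS-and-continue path, which is the behavioural change the commit argues for.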
eeh-powernv.c	powerpc updates for 4.3	2015-09-03 16:41:38 -07:00
idle.c	powerpc/powernv: pnv_init_idle_states() should only run on powernv	2015-06-15 16:45:12 +10:00
Kconfig	powerpc/powernv: Add opal-prd channel	2015-06-05 08:32:21 +10:00
Makefile	powerpc/powernv: Add opal-prd channel	2015-06-05 08:32:21 +10:00
opal-async.c	powerpc/powernv: Reorder OPAL subsystem initialisation	2015-05-22 15:14:37 +10:00
opal-dump.c	powernv/opal-dump: Convert to irq domain	2015-05-22 15:14:38 +10:00
opal-elog.c	powerpc/powernv: Fix opal-elog interrupt handler	2015-07-06 20:24:36 +10:00
opal-flash.c	powerpc/powernv: Add interfaces for flash device access	2015-04-11 20:49:21 +10:00
opal-hmi.c	powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable HMI.	2015-08-06 15:10:19 +10:00
opal-irqchip.c	genirq/irqdomain: Allow irq domain aliasing	2015-07-30 00:14:36 +02:00
opal-lpc.c	powerpc/powernv: Properly fix LPC debugfs endianness	2014-10-31 17:09:04 +11:00
opal-memory-errors.c	powerpc/powernv: Reorder OPAL subsystem initialisation	2015-05-22 15:14:37 +10:00
opal-msglog.c	powerpc/powernv: Fix reading of OPAL msglog	2014-06-11 17:03:36 +10:00
opal-nvram.c	powerpc/powernv: Add pstore support on powernv	2015-03-23 14:06:10 +11:00
opal-power.c	powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform	2015-07-16 13:34:36 +10:00
opal-prd.c	powerpc/powernv: Fix vma page prot flags in opal-prd driver	2015-07-06 12:06:42 +10:00
opal-rtc.c	rtc/tpo: Driver to support rtc and wakeup on PowerNV platform	2014-11-17 18:04:01 +11:00
opal-sensor.c	powerpc/powernv: Reorder OPAL subsystem initialisation	2015-05-22 15:14:37 +10:00
opal-sysparam.c	powerpc/powernv: convert OPAL codes returned by sysparam calls	2015-06-04 22:27:56 +10:00
opal-tracepoints.c	powerpc: Replace __get_cpu_var uses	2014-11-03 12:12:32 +11:00
opal-wrappers.S	powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states	2015-08-20 18:19:07 +10:00
opal-xscom.c	powerpc/powernv: Switch powernv drivers to use machine_xxx_initcall()	2014-07-28 14:11:26 +10:00
opal.c	powerpc/powernv: Panic on unhandled Machine Check	2015-10-09 08:07:19 +11:00
pci-ioda.c	powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=	2015-09-07 20:14:18 +10:00
pci-p5ioc2.c	vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API	2015-06-11 15:16:52 +10:00
pci.c	powerpc/MSI: Fix race condition in tearing down MSI interrupts	2015-09-10 17:27:08 +10:00
pci.h	powerpc/powernv: move dma_get_required_mask from pnv_phb to pci_controller_ops	2015-08-18 19:32:11 +10:00
powernv.h	powerpc/powernv: move dma_get_required_mask from pnv_phb to pci_controller_ops	2015-08-18 19:32:11 +10:00
rng.c	powerpc: Use hardware RNG for arch_get_random_seed_* not arch_get_random_*	2015-07-23 19:52:03 +10:00
setup.c	powerpc/powernv: Reset HILE before kexec_sequence()	2015-08-20 18:19:09 +10:00
smp.c	powerpc updates for 4.1	2015-04-16 13:53:32 -05:00
subcore-asm.S	powerpc/powernv: Add support for POWER8 split core on powernv	2014-05-28 13:35:37 +10:00
subcore.c	powerpc: Add an inline function to update POWER8 HID0	2015-08-14 15:58:28 +10:00
subcore.h	powernv/powerpc: Add winkle support for offline cpus	2014-12-15 10:46:41 +11:00