mirror of
https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2024-12-25 12:52:56 +07:00
3c0797925f
On Intel platforms machine check exceptions are always broadcast to all CPUs. This patch makes the machine check handler synchronize all these machine checks, elect a Monarch to handle the event, collect the worst event from all CPUs, and then process it first.

This has some advantages:

- When there is a truly data corrupting error the system panics as quickly as possible. This improves containment of corrupted data and makes sure the corrupted data never hits stable storage.
- The panics are synchronized and do not reenter the panic code on multiple CPUs (which currently does not handle this well).
- All the errors are reported. Currently it often happens that another CPU does the panic first, but reports useless information (empty machine check) because the real error happened on another CPU which came in later. This is a big advantage on Nehalem, where the 8 threads per CPU often lead to the wrong CPU winning the race and dumping useless information on a machine check. The problem also occurs in a less severe form on older CPUs.
- The system can detect when no CPUs detected a machine check and shut down the system. This can happen when one CPU is so badly hung that it cannot process a machine check anymore, or when some external agent wants to stop the system by asserting the machine check pin. This follows Intel hardware recommendations.
- This matches the error model recommended by the CPU designers.
- The events can be output in true severity order.
- When a panic happens on another CPU, the handler enables interrupts so that it is actually able to process the stop IPI.

The code is extremely careful to handle timeouts while waiting for other CPUs. It can't rely on the normal timing mechanisms (jiffies, ktime_get) because of its asynchronous/lockless nature, so it uses its own timeouts based on ndelay() and a "SPINUNIT".

The timeout is configurable. By default it waits for up to one second for the other CPUs. The waiting can also be disabled.

From some informal testing, AMD systems do not seem to broadcast machine checks, so right now it is always disabled by default on non-Intel CPUs and on very old Intel systems.

Includes fixes from Ying Huang
Fixed a "ecception" in a comment (H.Seto)
Moved global_nwo reset later based on suggestion from H.Seto
v2: Avoid duplicate messages

[ Impact: feature, fixes long standing problems. ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
84 lines
3.2 KiB
Plaintext
Configurable sysfs parameters for the x86-64 machine check code.

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
(often with panic), corrected ones cause a machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents in a bank. The exact meaning
of the banks and subevents is CPU specific.

mcelog knows how to decode them.

When you see the "Machine check errors logged" message in the system
log, mcelog should be run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cron job.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).
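
As an illustration of the directory layout described above, the following
Python sketch builds the path of one entry for a given CPU. The helper name
mce_entry_path is hypothetical and only assembles a string; it does not
check that the path exists on the running system.

```python
from pathlib import PurePosixPath

# Root directory taken from the text above.
MCE_SYSFS_ROOT = PurePosixPath("/sys/devices/system/machinecheck")


def mce_entry_path(cpu: int, entry: str) -> PurePosixPath:
    """Return the sysfs path of `entry` for CPU number `cpu`.

    Hypothetical helper for illustration; it performs no filesystem access.
    """
    if cpu < 0:
        raise ValueError("CPU number must be non-negative")
    return MCE_SYSFS_ROOT / f"machinecheck{cpu}" / entry
```

For example, mce_entry_path(0, "tolerant") names the tolerant entry in the
directory of CPU 0.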

The directory contains some configurable entries:

Entries:

bankNctl
(N bank number)
        64bit Hex bitmask enabling/disabling specific subevents for bank N.
        When a bit in the bitmask is zero then the respective
        subevent will not be reported.
        By default all events are enabled.
        Note that the BIOS maintains another mask to disable specific events
        per bank. This is not visible here.
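
The bitmask semantics of bankNctl can be sketched in Python as follows. The
function names are hypothetical, and the exact text format the kernel uses
when printing or accepting the mask is an assumption here (plain hexadecimal,
with an optional 0x prefix accepted on input).

```python
MASK_BITS = 64  # bankNctl is described as a 64bit mask


def parse_mask(text: str) -> int:
    """Parse a hexadecimal bankNctl value into an integer mask."""
    value = int(text.strip(), 16)  # base 16 also accepts a "0x" prefix
    if not 0 <= value < (1 << MASK_BITS):
        raise ValueError("mask does not fit in 64 bits")
    return value


def subevent_enabled(mask: int, bit: int) -> bool:
    """A set bit means the subevent is reported, a zero bit means it is not."""
    return bool((mask >> bit) & 1)


def with_subevent(mask: int, bit: int, enabled: bool) -> int:
    """Return a copy of `mask` with subevent `bit` enabled or disabled."""
    if not 0 <= bit < MASK_BITS:
        raise ValueError("bit index out of range")
    return mask | (1 << bit) if enabled else mask & ~(1 << bit)


def format_mask(mask: int) -> str:
    """Format a mask as lowercase hexadecimal (assumed output format)."""
    return f"{mask:x}"
```

Starting from the all-ones default, with_subevent(mask, 3, False) yields a
mask in which only subevent 3 is suppressed.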

The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
        How often to poll for corrected machine check errors, in seconds
        (Note output is hexadecimal). Default 5 minutes. When the poller
        finds MCEs it triggers an exponential speedup (poll more often) on
        the polling interval. When the poller stops finding MCEs, it
        triggers an exponential backoff (poll less often) on the polling
        interval. The check_interval variable is both the initial and
        maximum polling interval. 0 means no polling for corrected machine
        check errors (but some corrected errors might still be reported
        in other ways).
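
The adaptive polling described for check_interval can be sketched as below.
This is not the kernel's code: the factor of two for speedup and backoff and
the min_interval floor are assumptions chosen for illustration. The only
properties taken from the text are that the interval shrinks while errors
are found, grows again when they stop, never exceeds check_interval, and
that 0 disables polling.

```python
def parse_check_interval(text: str) -> int:
    """Decode the hexadecimal value read from check_interval into seconds."""
    return int(text.strip(), 16)


def next_interval(current: float, found_errors: bool,
                  check_interval: float, min_interval: float = 0.01) -> float:
    """Return the next polling interval in seconds.

    `min_interval` and the factor of two are illustrative assumptions.
    """
    if check_interval == 0:
        return 0  # polling for corrected errors is disabled
    if found_errors:
        # Exponential speedup: poll more often, down to an assumed floor.
        return max(min_interval, current / 2)
    # Exponential backoff: poll less often, capped at check_interval.
    return min(check_interval, current * 2)
```

With the default of 5 minutes, a hexadecimal reading of 12c decodes to 300
seconds; one poll that finds errors would shorten the next interval to 150
seconds in this sketch, and a quiet poll would restore it to 300.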

tolerant
        Tolerance level. When a machine check exception occurs for a non
        corrected machine check the kernel can take different actions.
        Since machine check exceptions can happen any time it is sometimes
        risky for the kernel to kill a process because it defies
        normal kernel locking rules. The tolerance level configures
        how hard the kernel tries to recover even at some risk of
        deadlock. Higher tolerant values trade potentially better uptime
        against the risk of a crash or even corruption (for tolerant >= 3).

        0: always panic on uncorrected errors, log corrected errors
        1: panic or SIGBUS on uncorrected errors, log corrected errors
        2: SIGBUS or log uncorrected errors, log corrected errors
        3: never panic or SIGBUS, log all errors (for testing only)

        Default: 1

        Note this only makes a difference if the CPU allows recovery
        from a machine check exception. Current x86 CPUs generally do not.

trigger
        Program to run when a machine check event is detected.
        This is an alternative to running mcelog regularly from cron
        and allows events to be detected faster.

monarch_timeout
        How long to wait for the other CPUs to machine check too on an
        exception. 0 to disable waiting for other CPUs.
        Unit: us
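
The effect of monarch_timeout can be pictured as a bounded spin wait. The
Python sketch below is a simulation under stated assumptions, not the
kernel's handler: the names wait_for_cpus, all_arrived and delay_ns are
hypothetical, and the 100 ns step size SPINUNIT_NS is an assumed value. It
shows a wait that consumes a fixed time budget in small steps instead of
relying on a clock, and that skips waiting entirely when the timeout is 0.

```python
from typing import Callable

SPINUNIT_NS = 100  # assumed step size in nanoseconds


def wait_for_cpus(all_arrived: Callable[[], bool],
                  monarch_timeout_us: int,
                  delay_ns: Callable[[int], None] = lambda ns: None) -> bool:
    """Spin until `all_arrived()` is true or the time budget is used up.

    Returns True if all CPUs arrived, False otherwise. `delay_ns` stands in
    for a busy-wait primitive and does nothing by default.
    """
    if monarch_timeout_us == 0:
        # Waiting disabled: check once and carry on without spinning.
        return all_arrived()
    budget_ns = monarch_timeout_us * 1000  # the entry's unit is microseconds
    while not all_arrived():
        if budget_ns <= 0:
            return False  # timed out waiting for the other CPUs
        delay_ns(SPINUNIT_NS)
        budget_ns -= SPINUNIT_NS
    return True
```

Counting down a budget in fixed steps keeps the loop independent of any
timer that might itself be unusable while the exception is being handled.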

TBD document entries for AMD threshold interrupt configuration

For more details about the x86 machine check architecture
see the Intel and AMD architecture manuals from their developer websites.

For more details about the architecture see
http://one.firstfloor.org/~andi/mce.pdf