linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-25 20:15:13 +07:00

Author	SHA1	Message	Date
Yazen Ghannam	71a84402b9	x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models AMD family 17h Models 10h-2Fh may report a high number of L1 BTB MCA errors under certain conditions. The errors are benign and can safely be ignored. However, the high error rate may cause the MCA threshold counter to overflow causing a high rate of thresholding interrupts. In addition, users may see the errors reported through the AMD MCE decoder module, even with the interrupt disabled, due to MCA polling. Clear the "Counter Present" bit in the Instruction Fetch bank's MCA_MISC0 register. This will prevent enabling MCA thresholding on this bank which will prevent the high interrupt rate due to this error. Define an AMD-specific function to filter these errors from the MCE event pool so that they don't get reported during early boot. Rename filter function in EDAC/mce_amd to avoid a naming conflict, while at it. [ bp: Move function prototype to the internal header and massage/cleanup, fix typos. ] Reported-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "clemej@gmail.com" <clemej@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Shirish S <Shirish.S@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Cc: <stable@vger.kernel.org> # 5.0.x: `c95b323dcd`: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models Cc: <stable@vger.kernel.org> # 5.0.x: `30aa3d26ed`: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk Cc: <stable@vger.kernel.org> # 5.0.x: `9308fd4074`: x86/MCE: Group AMD function prototypes in <asm/mce.h> Cc: <stable@vger.kernel.org> # 5.0.x Link: https://lkml.kernel.org/r/20190325163410.171021-2-Yazen.Ghannam@amd.com	2019-04-23 18:16:07 +02:00
Yazen Ghannam	a0bcd3c0b8	EDAC/mce_amd: Decode MCA_STATUS in bit definition order Sort the MCA_STATUS bits in decode output to follow how they are defined in the register. The order is as follows (bit: decode): 62: Over, 61: UC, 59: MiscV, 58: AddrV, 57: PCC, 55: TCC, 53: SyndV, 46: CECC, 45: UECC, 44: Deferred, 43: Poison, 40: Scrub. [ bp: Massage a bit. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86@kernel.org Link: https://lkml.kernel.org/r/20190212212417.107049-2-Yazen.Ghannam@amd.com	2019-02-15 14:36:31 +01:00
Yazen Ghannam	3f4da372ec	EDAC/mce_amd: Decode MCA_STATUS[Scrub] bit Previous AMD systems have had a bit in MCA_STATUS to indicate that an error was detected on a scrub operation. However, this bit was defined differently within different banks and families/models. Starting with Family 17h, MCA_STATUS[40] is either Reserved/Read-as-Zero or defined as "Scrub", for all MCA banks and CPU models. Therefore, this bit can be defined as the "Scrub" bit. Define MCA_STATUS[40] as "Scrub" and decode it in the AMD MCE decoding module for Family 17h and newer systems. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190212212417.107049-1-Yazen.Ghannam@amd.com	2019-02-15 14:25:58 +01:00
Yazen Ghannam	1c1522d32a	EDAC, mce_amd: Print ExtErrorCode and description on a single line Save a log line by printing the extended error code and the description on a single line. This is similar to how errors are printed in other subsystems, e.g. "#, description". If we don't have a valid description then only the number/code is printed. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Link: https://lkml.kernel.org/r/20190201225534.8177-6-Yazen.Ghannam@amd.com	2019-02-04 19:29:13 +01:00
Yazen Ghannam	e03447ee71	EDAC, mce_amd: Match error descriptions to latest documentation Update the error descriptions to match the latest documentation for easier searching. In some cases the changes are small and in other cases the changes may be total rewording of the description. No functional changes. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Link: https://lkml.kernel.org/r/20190201225534.8177-5-Yazen.Ghannam@amd.com	2019-02-03 13:16:50 +01:00
Yazen Ghannam	8a5dd2cd2f	x86/MCE/AMD, EDAC/mce_amd: Add new error descriptions for some SMCA bank types Some SMCA bank types on future systems will report new error types even though the bank type is not treated as a new version. These new error types will be reported by bits that are reserved in past systems. Add the new error descriptions to the lists in edac_mce_amd. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Shirish S <Shirish.S@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190201225534.8177-4-Yazen.Ghannam@amd.com	2019-02-03 13:05:16 +01:00
Yazen Ghannam	3ad7e748c1	x86/MCE/AMD, EDAC/mce_amd: Add new McaTypes for CS, PSP, and SMU units The existing CS, PSP, and SMU SMCA bank types will see new versions (as indicated by their McaTypes) in future SMCA systems. Add the new (HWID, MCATYPE) tuples for these new versions. Reuse the same names as the older versions, since they are logically the same to the user. SMCA systems won't mix and match IP blocks with different McaType versions in the same system, so there isn't a need to distinguish them. The MCA_IPID register is saved when logging an MCA error, and that can be used to triage the error. Also, add the new error descriptions to edac_mce_amd. Some error types (positions in the list) are overloaded compared to the previous McaTypes. Therefore, just create new lists of the error descriptions to keep things simple even if some of the error descriptions are the same between versions. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Shirish S <Shirish.S@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190201225534.8177-3-Yazen.Ghannam@amd.com	2019-02-03 13:01:57 +01:00
Yazen Ghannam	cbfa447edd	x86/MCE/AMD, EDAC/mce_amd: Add new MP5, NBIO, and PCIE SMCA bank types Add the (HWID, MCATYPE) tuples and names for the new MP5, NBIO, and PCIE SMCA bank types. Also, add their respective error descriptions to the MCE decoding module edac_mce_amd. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Shirish S <Shirish.S@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190201225534.8177-2-Yazen.Ghannam@amd.com	2019-02-03 13:01:44 +01:00
Pu Wen	c4a3e94641	EDAC, amd64: Add Hygon Dhyana support Add support for Hygon Dhyana CPU to EDAC. Signed-off-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: mchehab@kernel.org Cc: tglx@linutronix.de Cc: mingo@redhat.com Cc: hpa@zytor.com Cc: thomas.lendacky@amd.com Cc: linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/9d71061301177822bc55b3bfd44f91057458d886.1537533369.git.puwen@hygon.cn	2018-09-27 18:38:26 +02:00
Yazen Ghannam	68627a697c	x86/mce/AMD, EDAC/mce_amd: Enumerate Reserved SMCA bank type Currently, bank 4 is reserved on Fam17h, so we chose not to initialize bank 4 in the smca_banks array. This means that when we check if a bank is initialized, like during boot or resume, we will see that bank 4 is not initialized and try to initialize it. This will cause a call trace, when resuming from suspend, due to rdmsr_on_cpu() calls in the init path. The rdmsr_on_cpu() calls issue an IPI but we're running with interrupts disabled. This triggers: WARNING: CPU: 0 PID: 11523 at kernel/smp.c:291 smp_call_function_single+0xdc/0xe0 ... Reserved banks will be read-as-zero, so their MCA_IPID register will be zero. So, like the smca_banks array, the threshold_banks array will not have an entry for a reserved bank since all its MCA_MISC* registers will be zero. Enumerate a "Reserved" bank type that matches on a HWID_MCATYPE of 0,0. Use the "Reserved" type when checking if a bank is reserved. It's possible that other bank numbers may be reserved on future systems. Don't try to find the block address on reserved banks. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # 4.14.x Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20180221101900.10326-7-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-02-21 17:00:54 +01:00
Borislav Petkov	398443471f	EDAC, mce_amd: Get rid of local var in amd_filter_mce() ... and use the macro for that. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de>	2017-08-21 17:59:38 +02:00
Borislav Petkov	f3c0891c2f	EDAC, mce_amd: Get rid of most struct cpuinfo_x86 uses struct mce.cpuid contains CPUID(1).EAX which contains family, model and stepping and thus has enough information for our purposes. Thus get rid of some external dependencies which are not really needed. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de>	2017-08-21 17:54:57 +02:00
Borislav Petkov	4ab1784b48	EDAC, mce_amd: Rename decode_smca_errors() to decode_smca_error() Singular fits better because it decodes a single error. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de>	2017-08-21 17:44:09 +02:00
Yazen Ghannam	fbe63acf62	EDAC, mce_amd: Use cpu_to_node() to find the node ID Using the homegrown amd_get_nb_id() to find a node ID on AMD was fine while the L3 to node mapping was 1:1. And Zen topology broke this. So let's start slowly moving away from it and use the topology interfaces instead. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1490041614-90057-2-git-send-email-Yazen.Ghannam@amd.com [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de>	2017-07-17 07:01:08 +02:00
Yazen Ghannam	bdf1bf1744	EDAC, mce_amd: Fix typo in SMCA error description Fix typo in "poison consumption" error description. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1497286703-62853-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2017-06-12 19:03:55 +02:00
Linus Torvalds	60c906bab1	Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Ingo Molnar: "The main changes in this cycle were: - Assign notifier chain priorities for all RAS related handlers to make the ordering explicit (Borislav Petkov) - Improve the AMD MCA banks sysfs output (Yazen Ghannam) - Various cleanups and restructuring of the x86 RAS code (Borislav Petkov)" * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority x86/ras: Get rid of mce_process_work() EDAC/mce/amd: Dump TSC value EDAC/mce/amd: Unexport amd_decode_mce() x86/ras/amd/inj: Change dependency x86/ras: Flip the TSC-adding logic x86/ras/amd: Make sysfs names of banks more user-friendly x86/ras/therm_throt: Do not log a fake MCE for thermal events x86/ras/inject: Make it depend on X86_LOCAL_APIC=y	2017-02-20 12:47:44 -08:00
Yazen Ghannam	75bf2f6478	EDAC, mce_amd: Print IPID and Syndrome on a separate line Currently, the IPID and Syndrome are printed on the same line as the Address. There are cases when we can have a valid Syndrome but not a valid Address. For example, the MCA_SYND register can be used to hold more detailed error info that the hardware folks can use. It's not just DRAM ECC syndromes. There are some error types that aren't related to memory that may have valid syndromes, like some errors related to links in the Data Fabric, etc. In these cases, the IPID and Syndrome are not printed at the same log level as the rest of the stanza, so users won't see them on the console. Console: [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2 Dmesg: [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b , Syndrome: 0x000000010b404000, IPID: 0x0001002e00000002 [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2 Print the IPID first and on a new line. The IPID should always be printed on SMCA systems. The Syndrome will then be printed with the IPID and at the same log level when valid: [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b [Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000010b404000 [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2 Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1487192182-2474-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2017-02-16 15:39:32 +01:00
Yazen Ghannam	67d7fd306e	EDAC, mce_amd: Give more context to deferred error message Users may not be familiar with the concept of deferred errors. There is no action for users to take on this type of error, so give more context in the error message to make this more clear. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1485297149-13733-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2017-01-28 13:03:29 +01:00
Borislav Petkov	9026cc82b6	x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority Assign all notifiers on the MCE decode chain a priority so that they get called in the correct order. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-01-24 09:14:57 +01:00
Borislav Petkov	0bceab677d	EDAC/mce/amd: Dump TSC value Dump the TSC value of the time when the MCE got logged. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-8-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-01-24 09:14:56 +01:00
Borislav Petkov	1fbcd90903	EDAC/mce/amd: Unexport amd_decode_mce() It is not used outside of the driver anymore. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-7-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-01-24 09:14:55 +01:00
Yazen Ghannam	a6c14dce85	EDAC, mce_amd: Don't report poison bit on Fam15h, bank 4 MCA_STATUS[43] has been defined as "Poison" or "Reserved" for every bank since Fam15h except for Fam15h, bank 4 in which case it's defined as part of the McaStatSubCache bitfield. Filter out that case. Reported-by: Dean Liberty <Dean.Liberty@amd.com> Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1479478222-19896-1-git-send-email-Yazen.Ghannam@amd.com [ Split an almost unparseable ternary conditional, add a comment. ] Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-28 17:50:12 +01:00
Borislav Petkov	627bc29ed9	Merge tip:ras/core to pick up dependent changes tip:ras/core contains the respective Fam17h x86 RAS bits which amd64_edac is going to use. So merge it into the EDAC branch. Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-23 21:13:40 +01:00
Yazen Ghannam	5c332202f8	EDAC, mce_amd: Rename nb_bus_decoder to dram_ecc_decoder nb_bus_decoder() is only used for DRAM ECC errors so rename it so that the name is more generic and descriptive. Also, call it for DRAM ECC errors on SMCA systems. [ Boris: rename it to real function name with a verb in it. ] Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1479423463-8536-4-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2016-11-21 09:43:15 +01:00
Borislav Petkov	c09a8c40e0	x86/RAS: Hide SMCA bank names Add accessor functions and hide the smca_names array. Also, add a sanity-check to bank HWID assignment in get_smca_bank_info(). Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20161104152317.5r276t35df53qk76@pd.tnic Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-11-08 17:10:15 +01:00
Borislav Petkov	a9a1c0ee04	x86/RAS: Rename smca_bank_names to smca_names Make it differ more from struct smca_bank_name for better readability. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20161103125556.15482-3-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-11-08 17:10:14 +01:00
Borislav Petkov	1ce9cd7f9f	x86/RAS: Simplify SMCA HWID descriptor struct Call it simply smca_hwid and call local variables "hwid". More readable. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20161103125556.15482-2-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-11-08 17:10:14 +01:00
Yazen Ghannam	a884675b87	x86/MCE/AMD, EDAC: Handle reserved bank 4 on Fam17h properly Bank 4 is reserved on family 0x17 and shouldn't generate any MCE records. However, broken hardware and software is not something unheard of so warn about bank 4 errors. They shouldn't be coming from bank 4 naturally but users can still use mce_amd_inj to simulate errors from it for testing purposes. Also, avoid special handling in the injector mce_amd_inj like it is being done on the older families. [ bp: Rewrite commit message and merge into one patch. Use boot_cpu_data. ] Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Link: http://lkml.kernel.org/r/1473384591-5323-1-git-send-email-Yazen.Ghannam@amd.com Link: http://lkml.kernel.org/r/1473384591-5323-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-09-13 15:23:14 +02:00
Yazen Ghannam	4b711f92c9	x86/mce, EDAC/mce_amd: Print MCA_SYND and MCA_IPID during MCE on SMCA systems The MCA_SYND and MCA_IPID registers contain valuable information and should be included in MCE output. The MCA_SYND register contains syndrome and other error information, and the MCA_IPID register will uniquely identify the MCA bank's type without having to rely on system software. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472680624-34221-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-09-13 15:23:13 +02:00
Yazen Ghannam	5896820e0a	x86/mce/AMD, EDAC/mce_amd: Define and use tables for known SMCA IP types Scalable MCA defines a number of IP types. An MCA bank on an SMCA system is defined as one of these IP types. A bank's type is uniquely identified by the combination of the HWID and MCATYPE values read from its MCA_IPID register. Add the required tables in order to be able to lookup error descriptions based on a bank's type and the error's extended error code. [ bp: Align comments, simplify a bit. ] Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472741832-1690-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-09-13 15:23:10 +02:00
Yazen Ghannam	856095b179	EDAC/mce_amd: Use SMCA prefix for error descriptions arrays The error descriptions defined for Fam17h can be reused for other SMCA systems, so their names should reflect this. Change f17h prefix to smca for error descriptions. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472673994-12235-4-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-09-13 15:23:09 +02:00
Yazen Ghannam	c019b951e1	EDAC/mce_amd: Add missing SMCA error descriptions Add missing SMCA error descriptions to the error descriptions arrays. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472673994-12235-3-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-09-13 15:23:09 +02:00
Yazen Ghannam	b300e87300	EDAC/mce_amd: Print syndrome register value on SMCA systems Print SyndV bit status and print the raw value of the MCA_SYND register. Further decoding of the syndrome from struct mce.synd can be done in other places where appropriate, e.g. DRAM ECC. Boris: make the error stanza more compact by putting the error address and syndrome on the same line: [Hardware Error]: Corrected error, no action required. [Hardware Error]: CPU:2 (17:0:0) MC4_STATUS[-|CE|-|PCC|AddrV|-|-|SyndV|CECC]: 0x96204100001e0117 [Hardware Error]: Error Addr: 0x000000007f4c52e3, Syndrome: 0x0000000000000000 [Hardware Error]: Invalid IP block specified. [Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: RD Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1467633035-32080-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-09-13 15:23:07 +02:00
Yazen Ghannam	a348ed83d9	EDAC, mce_amd: Detect SMCA using X86_FEATURE_SMCA Use X86_FEATURE_SMCA when detecting if SMCA is available instead of directly using CPUID 0x80000007_EBX. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1462971509-3856-7-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-05-12 09:08:23 +02:00
Aravind Gopalakrishnan	be0aec23bf	x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errors For Scalable MCA enabled processors, errors are listed per IP block. And since it is not required for an IP to map to a particular bank, we need to use HWID and McaType values from the MCx_IPID register to figure out which IP a given bank represents. We also have a new bit (TCC) in the MCx_STATUS register to indicate Task context is corrupt. Add logic here to decode errors from all known IP blocks for Fam17h Model 00-0fh and to print TCC errors. [ Minor fixups. ] Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1457021458-2522-3-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-03-08 11:48:14 +01:00
Aravind Gopalakrishnan	99e1dfb7d2	EDAC, mce_amd: Don't emit 'CE' for Deferred error Currently, when decoding an MCE, we display 'CE' for a Deferred error, like this: [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|Deferred|-|UECC]: 0xdc04b00095080813 When the 'UC' bit in the MCx_STATUS register is clear, the error status is either a Corrected error or Deferred error as determined by the 'Deferred' bit. So do not print 'CE' on a deferred error. Refer to AMD Error Scope Hierarchy table in a newer BKDG (example: 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"). Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1436788382-6463-1-git-send-email-aravind.gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2015-07-14 06:32:53 +02:00
Borislav Petkov	50872ccd87	EDAC, MCE, AMD: Correct formatting of decoded text Write out MCx_ADDR into the more humanly readable "MCx Error Address" and remove double colon in the output. Cc: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de>	2014-11-25 13:09:49 +01:00
Aravind Gopalakrishnan	bc4febe93c	EDAC, MCE, AMD: Add decoding table for MC6 xec Extended error code meanings are tabulated for other banks. Extend that tradition for MC6 too. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1415122868-10969-1-git-send-email-aravind.gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2014-11-04 18:49:20 +01:00
Aravind Gopalakrishnan	eba4bfb34d	EDAC, MCE, AMD: Add MCE decoding for F15h M60h Add decoding logic for new Fam15h model 60h. Tested using mce_amd_inj module and works fine. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1405098795-4678-1-git-send-email-Aravind.Gopalakrishnan@amd.com [ Boris: simplify a bit. ] Signed-off-by: Borislav Petkov <bp@suse.de>	2014-07-14 16:58:19 +02:00
Borislav Petkov	c5c0903b2c	EDAC, MCE, AMD: Remove leftover unused mask `295d8cda26` ("EDAC, MCE, AMD: Drop local coreid reporting") removed the code snippet which used that mask but forgot to drop the mask itself. Do that now. Signed-off-by: Borislav Petkov <bp@suse.de>	2014-05-08 20:37:07 +02:00
Borislav Petkov	fd0f5ffff8	MCE, AMD: Fix decoding module loading on unsupported hw We want to still be able to issue some error information on systems for which there is no decoding support (think older distro kernels here, for example). Therefore, we allow module registration but skip the per-family bank-specific decoders and issue the general information only, i.e.: [ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error. [ 46.822846] [Hardware Error]: CPU:0 (15:30:0) MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f [ 46.822858] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out) with the hope that it still contains helpful useful bits. Suggested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1392659391-2411-1-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>	2014-02-24 10:25:47 +01:00
Aravind Gopalakrishnan	aad19e5176	EDAC, MCE, AMD: Add an MCE signature for new Fam15h models Add a new error signature for Family 15h, models 30h-3fh. Patch has been tested on Fam15h using mce_amd_inj facility and has been verified to work correctly. Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> [ cleanup commit message and error string ] Signed-off-by: Borislav Petkov <bp@suse.de>	2013-06-08 10:17:03 +02:00
Borislav Petkov	0f08669e86	EDAC, MCE, AMD: Remove unneeded exports Initially, those strings describing different parts of an MCE message were shared with amd64_edac and were therefore exported to modules. However, all except pp_msgs are used only in one place right now so hide them and make them static. No functionality change. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Borislav Petkov <bp@alien8.de>	2013-01-22 22:40:03 +01:00
Jacob Shin	980eec8b20	EDAC, MCE, AMD: Add MCE decoding support for Family 16h Add MCE decoding logic for AMD Family 16h processors. Boris: - drop unneeded uu_msgs export - exit early in cat_mc1_mce and save us an indentation level Signed-off-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: Borislav Petkov <bp@alien8.de>	2013-01-22 22:39:58 +01:00
Jacob Shin	4a73d3de63	EDAC, MCE, AMD: Make MC2 decoding per-family Currently only AMD Family 15h processors have special handling for MC2 errors. Since upcoming Family 16h will also need unique handling, let's make MC2 handling part of amd_decoder_ops. Signed-off-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: Borislav Petkov <bp@alien8.de>	2013-01-22 22:39:54 +01:00
Borislav Petkov	d5c6770d4c	MCE, AMD: Dump error status Dump error status after decoding the error which describes the error disposition. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2012-11-28 11:56:30 +01:00
Borislav Petkov	d824c7718b	MCE, AMD: Report decoded error type first Instead of starting with the error details, report the decoded, readable error type first. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2012-11-28 11:56:17 +01:00
Borislav Petkov	f89f8388cd	MCE, AMD: Dump CPU f/m/s triple with the error It is very useful to have the family/model/stepping with the reported error so dump it. This saves us asking the bug reporter about it. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2012-11-28 11:55:57 +01:00
Borislav Petkov	f05c41a9c6	MCE, AMD: Remove functional unit references Having the functional unit names in each bank decode is only misleading as this code supports multiple families and there's no guarantee the mapping between FUs and MCE banks will stay the same. And also, knowing the functional unit name doesn't help much since you end up looking at the respective BKDG anyway. So drop all FU references and use the MC bank numbers instead. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2012-11-28 11:55:44 +01:00
Borislav Petkov	ec3e82d6dc	MCE, AMD: Drop too granulary family model checks MCA details seldom change inbetween the models of a family so don't be too conservative and enable decoding on everything starting from K8 onwards. Minor adjustments can come in later but most importantly, we have some decoding infrastructure in place for upcoming models by default. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2012-04-04 15:50:11 +02:00
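
Commit a0bcd3c0b8 above lists the MCA_STATUS bits the decoder prints, and commit 99e1dfb7d2 explains that an error with UC clear is either corrected or deferred depending on the Deferred bit. The following is a minimal Python sketch of that decoding, not the kernel's C implementation. The names `MCA_STATUS_BITS`, `decode_mca_status` and `error_class` are hypothetical, and the sketch ignores the family-specific exceptions noted in the log (Scrub is only defined from Family 17h on, and bit 43 is not Poison on Fam15h bank 4).

```python
# Bit positions and labels as listed in commit a0bcd3c0b8, highest bit first.
MCA_STATUS_BITS = [
    (62, "Over"), (61, "UC"), (59, "MiscV"), (58, "AddrV"), (57, "PCC"),
    (55, "TCC"), (53, "SyndV"), (46, "CECC"), (45, "UECC"),
    (44, "Deferred"), (43, "Poison"), (40, "Scrub"),
]


def decode_mca_status(status):
    """Return the labels of all listed bits that are set, in register order."""
    return [name for bit, name in MCA_STATUS_BITS if (status >> bit) & 1]


def error_class(status):
    """Classify an error following the rule quoted in commit 99e1dfb7d2.

    UC set                  -> "UE" (uncorrected)
    UC clear, Deferred set  -> "Deferred"
    UC clear, Deferred clear -> "CE" (corrected)
    """
    flags = decode_mca_status(status)
    if "UC" in flags:
        return "UE"
    if "Deferred" in flags:
        return "Deferred"
    return "CE"


if __name__ == "__main__":
    # Raw MCA_STATUS values quoted in the commit messages above.
    for value in (0xD82000000002080B, 0xDC04B00095080813):
        print(hex(value), error_class(value), "|".join(decode_mca_status(value)))
```

Run against the two raw values quoted in the log, the sketch reports the first as a corrected error and the second as deferred, which is the distinction commit 99e1dfb7d2 introduced.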

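
Commits 5896820e0a and 68627a697c above describe how an SMCA bank's type is identified by the (HWID, MCATYPE) pair read from its MCA_IPID register, and how a pair of 0,0 marks a reserved bank. A hedged Python sketch of that lookup key follows. The field positions (HWID in bits 43:32, McaType in bits 63:48) are an assumption not stated in this log, and the function names are hypothetical.

```python
def smca_bank_type(ipid):
    """Split a raw MCA_IPID value into an (hwid, mcatype) tuple.

    Assumed layout (not given in the commit log):
      bits 43:32 -> HWID, bits 63:48 -> McaType.
    """
    hwid = (ipid >> 32) & 0xFFF
    mcatype = (ipid >> 48) & 0xFFFF
    return hwid, mcatype


def is_reserved_bank(ipid):
    """A reserved bank reads as zero, so its tuple is (0, 0) per commit 68627a697c."""
    return smca_bank_type(ipid) == (0, 0)


if __name__ == "__main__":
    # IPID value quoted in commit 75bf2f6478.
    hwid, mcatype = smca_bank_type(0x0001002E00000002)
    print("HWID=%#x McaType=%#x reserved=%s"
          % (hwid, mcatype, is_reserved_bank(0x0001002E00000002)))
```

Keeping the pair as a single tuple makes it usable directly as a dictionary key for per-type error description tables, which is the kind of table-driven lookup commit 5896820e0a describes.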

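
Commit f3c0891c2f above notes that struct mce.cpuid holds CPUID(1).EAX, which is enough to recover family, model and stepping, and commit f89f8388cd adds that triple to the log output. The sketch below shows the standard x86 decoding of that value in Python. The function names are hypothetical, and the two sample signatures in the usage section were constructed for illustration to match the "(17:1:0)" and "(15:30:0)" triples seen in the log; they are not taken from the commits.

```python
def cpuid_fms(eax):
    """Decode CPUID(1).EAX into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF

    # The extended family field is added only when the base family is 0xF.
    family = base_family
    if base_family == 0xF:
        family += (eax >> 20) & 0xFF

    # The extended model supplies the upper nibble of the model number.
    # Using "family >= 6" here is equivalent to the base-family check
    # for the AMD families discussed in this log.
    model = base_model
    if family >= 0x6:
        model |= ((eax >> 16) & 0xF) << 4

    return family, model, stepping


def format_fms(eax):
    """Format the triple in hex, in the style of the log lines quoted above."""
    return "(%x:%x:%x)" % cpuid_fms(eax)


if __name__ == "__main__":
    # Constructed example signatures (assumptions, see lead-in).
    print(format_fms(0x00800F10))
    print(format_fms(0x00630F00))
```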
89 Commits