Add a pci config read wrapper for signaling pci config space access
errors instead of them being visible only on a debug build. This is
important on amd64_edac since it uses all those pci config register
values to access the DRAM/DIMM configuration of the nodes.
In addition, the wrapper makes a _lot_ (look at the diffstat!) of
error handling code superfluous and improves much of the overall code
readability by removing error handling details out of the way.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Unify almost identical code into one function and remove NUMA-specific
usage (specifically cpumask_of_node()) in favor of generic topology
methods.
Remove unused defines, while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
cpumask_t -> struct cpumask, and don't put one on the stack. (Note: this
is actually on the stack unless CONFIG_CPUMASK_OFFSTACK=y).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Do not shift the TOP_MEM and TOP_MEM2 values by 23 but rather save the
whole 64-bit value read from the MSR. Although the TOP_MEM/TOP_MEM2 bits
are only a subset of the 64bit register, the values are correct since
the remaining bits are Read-As-Zero and no shifting is needed.
Also, cleanup DRAM base/limit debug output.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Make debug info formulations about the DRAM and DCT configuration of the
machine more human readable.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
In amd64_edac_init(void) in amd64_edac.c, cache_k8_northbridges() is
called before pci_register_driver. If it fails, should exit with err
directly.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
On Fam10h and above, F1x[1, 0][7C:40] are DRAM Base/Limit registers
which specify the destination node of a DRAM address. Those address
boundaries are being extracted into ->dram_base[] and ->dram_limit[].
Correct the extraction masks to match the respective address bits.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Different processor families support a different number of chip selects.
Handle this in a family-dependent way with the proper values assigned at
init time (see amd64_set_dct_base_and_mask).
Remove _DCSM_COUNT defines since they're used at one place and originate
from public documentation.
CC: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This allows the errors to be further decoded and mapped to csrows.
Tested with ECC debug dimms and an Rev F cpu based system.
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The check when DRAM interleaving is enabled should be done against the
pvt->dram_IntlvSel field and not against the ->dram_limit.
Simplify first loop and fixup printk formatting while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The pvt->dram_IntlvEn saves the 3 "Interleave Enable" bits already
right-shifted by 8 so the check in find_mc_by_sys_addr() by shifting the
values to the left 8 bits is wrong.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
K8 DRAM base and limit addresses from F1x40 +8*i and F1x44 + 8*i, where
i in (0..7) are both bits 39-24 and therefore the shifting should be
done by 24 and not by 8.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Allocate memory statically for 8-node machines max for simplicity
instead of relying on MAX_NUMNODES which is 0 on !CONFIG_NUMA builds.
Spotted by Jan Beulich.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The old code was using smp_call_function_many which skips the current
cpu if it is in the supplied cpumask. Switch to the rdmsr_on_cpus()
interface which takes care of that.
In addition, add get_cpus_on_this_dct_cpumask helper which computes a
cpumask of all the cores on a node and thus on a DCT.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Simplify the procedure by checking if there is any DIMM in each channel.
This patch will fix the bugs such as when there is no DIMMs under
certain node, two DIMMs in the same channel, and only one DIMM in each
channel of the node.
Borislav: minor fixups
Signed-off-by: Wan Wei <wanwei@mail.dawning.com.cn>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Simplify code flow and make sure return value is always valid since
further driver init depends on it. Carve out long warning string and
make code more readable. Shorten some names, while at it.
There should be no functional change resulting from this patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This is the MCE error code from the MCi_STATUS banks, bits [15:0] which
describe what type of error was encountered: GART TLB, Memory or Bus
error. The semantics of those bits are identical across all MCE banks so
decode those separately, irrespectively of MCE type.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The MCi_STATUS registers have most field definitions in common so decode
them in the general path. Do not pass ecc_type along and compute it in
__amd64_decode_bus_error instead.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Move NB decoder along with required defines to EDAC MCE core. Add
registration routines for further decoding of the MCE info in the AMD64
EDAC module.
CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This is in preparation of adding AMD-specific MCE decoding functionality
to the EDAC core. The error decoding macros originate from the AMD64
EDAC driver albeit in a simplified and cleaned up version here.
While at it, add macros to generate the error description strings and
use them in the error type decoders directly which removes a bunch of
code and makes the decoding functions much more readable. Also, fix
strings and shorten macro names.
Remove superfluous htlink_msgs.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
On the good path of BIOS enabled ECC and no override, the value returned
is 1 by omission and thus is deemed failing by the probe-function.
Allow proper module initialization by clearing the retval explicitly.
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
amd64_check_ecc_enabled() returns non-zero status when ECC
checking/correcting is disabled and this fails further loading of the
driver even when 'ecc_enable_override' boot param is used.
Fix that by clearing return status in that case.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Checking whether the machine is using ECC enabled DRAM is done through
testing the DimmEccEn bit in the DRAM Cfg Low register (F2x[1,0]90). Do
that instead of testing all bits from the DimmEccEn upwards.
Also, remove mci->edac_cap assignment and use value returned from
amd64_determine_edac_cap().
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Also, link into Kbuild by adding Kconfig and Makefile entries.
Borislav:
- Kconfig/Makefile splitting
- use zero-sized arrays for the sysfs attrs if not enabled
- rename sysfs attrs to more conform values
- shorten CONFIG_ names
- make multiple structure members assignment vertically aligned
- fix/cleanup comments
- fix function return value patterns
- fix err labels
- fix a memleak bug caught by Ingo
- remove the NUMA dependency and use num_k8_northbrides for initializing
a driver instance per NB.
- do not copy the pvt contents into the mci struct in
amd64_init_2nd_stage() and save it in the mci->pvt_info void ptr
instead.
- cleanup debug calls
- simplify amd64_setup_pci_device()
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
- convert to the new {rd|wr}msr_on_cpus interfaces.
- convert pvt->old_mcgctl to a bitmask thus saving some bytes
- fix/cleanup comments
- fix function return value patterns
- add a proper bugfix found by Doug to amd64_check_ecc_enabled where we
missed checking for the ECC enabled bit in NB CFG.
- cleanup debug calls
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
- compute dct_sel_base_off in f10_match_to_this_node() correctly since
it cannot be assumed that the Reserved bits are zero and they have to be
masked out instead.
- cleanup, remove StinkyIdentifiers, simplify logic
- fix function return value patterns
- cleanup debug calls
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
- fix a wrong negation in f10_determine_base_addr_offset()
- fix a wrong mask in f10_determine_base_addr_offset() which should
select DctSelBaseAddr[31:11] and not [31:16] as it was before
- remove StinkyIdentifiers, trivially simplify code.
- fix/cleanup comments
- fix function return value patterns
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
Fail f10_early_channel_count() if error encountered while reading a NB
register since those cached register contents are accessed afterwards.
- fix/cleanup comments
- fix function return value patterns
- cleanup debug calls
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>