Make the mcelog call a notifier which lands in the injector module and
does the injection. This allows for mce-inject to be a normal kernel
module now.
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20170613162835.30750-5-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reuse mce_amd_inj's debugfs interface so that mce-inject can
benefit from it too. The old functionality is still preserved under
CONFIG_X86_MCELOG_LEGACY.
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20170613162835.30750-4-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the amd_threshold_interrupt() handler, we loop through every possible
block in each bank and rediscover the block's address and if it's valid,
e.g. valid, counter present and not locked.
However, we already have the address saved in the threshold blocks list
for each CPU and bank. The list only contains blocks that have passed
all the valid checks.
Besides the redundancy, there's also a smp_call_function* in
get_block_address() which causes a warning when servicing the interrupt:
WARNING: CPU: 0 PID: 0 at kernel/smp.c:281 smp_call_function_single+0xdd/0xf0
...
Call Trace:
<IRQ>
rdmsr_safe_on_cpu()
get_block_address.isra.2()
amd_threshold_interrupt()
smp_threshold_interrupt()
threshold_interrupt()
because we do get called in an interrupt handler *with* interrupts
disabled, which can result in a deadlock.
Drop the redundant valid checks and move the overflow check, logging and
block reset into a separate function.
Check the first block then iterate over the rest. This procedure is
needed since the first block is used as the head of the list.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170613162835.30750-3-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The value of MCA_STATUS is used as the MSR when clearing MCA_STATUS.
This may cause the following warning:
unchecked MSR access error: WRMSR to 0x11b (tried to write 0x0000000000000000)
Call Trace:
<IRQ>
smp_threshold_interrupt()
threshold_interrupt()
Use msr_stat instead which has the MSR address.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Fixes: 37d43acfd7 ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers")
Link: http://lkml.kernel.org/r/20170613162835.30750-2-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
During early boot, load_ucode_intel_ap() uses __load_ucode_intel()
to obtain a pointer to the relevant microcode patch (embedded in the
initrd), and stores this value in 'intel_ucode_patch' to speed up the
microcode patch application for subsequent CPUs.
On resuming from suspend-to-RAM, however, load_ucode_ap() calls
load_ucode_intel_ap() for each non-boot-CPU. By then the initramfs is
long gone so the pointer stored in 'intel_ucode_patch' no longer points to
a valid microcode patch.
Clear that pointer so that we effectively fall back to the CPU hotplug
notifier callbacks to update the microcode.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
[ Edit and massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.10..
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170607095819.9754-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A SoC variant of Geode GX1, notably NSC branded SC1100, seems to
report an inverted Device ID in its DIR0 configuration register,
specifically 0xb instead of the expected 0x4.
Catch this presumably quirky version so it's properly recognized
as GX1 and has its cache switched to write-back mode, which provides
a significant performance boost in most workloads.
SC1100's datasheet "Geode™ SC1100 Information Appliance On a Chip",
states in section 1.1.7.1 "Device ID" that device identification
values are specified in SC1100's device errata. These, however,
seem to not have been publicly released.
Wading through a number of boot logs and /proc/cpuinfo dumps found on
pastebin and blogs, this patch should mostly be relevant for a number
of now admittedly aging Soekris NET4801 and PC Engines WRAP devices,
the latter being the platform this issue was discovered on.
Performance impact was verified using "openssl speed", with
write-back caching scaling throughput between -3% and +41%.
Signed-off-by: Christian Sünkenberg <christian.suenkenberg@student.kit.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1496596719.26725.14.camel@student.kit.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With CONFIG_DEBUG_PREEMPT enabled, I get:
BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
caller is debug_smp_processor_id
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2+ #2
Call Trace:
dump_stack
check_preemption_disabled
debug_smp_processor_id
save_microcode_in_initrd_amd
? microcode_init
save_microcode_in_initrd
...
because, well, it says it above, we're using smp_processor_id() in
preemptible code.
But passing the CPU number is not really needed. It is only used to
determine whether we're on the BSP, and, if so, to save the microcode
patch for early loading.
[ We don't absolutely need to do it on the BSP but we do that
customarily there. ]
Instead, convert that function parameter to a boolean which denotes
whether the patch should be saved or not, thereby avoiding the use of
smp_processor_id() in preemptible code.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170528200414.31305-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
mtrr_save_state() is invoked from native_cpu_up() which is in the context
of a CPU hotplug operation and therefor calling get_online_cpus() is
pointless.
While this works in the current get_online_cpus() implementation it
prevents from converting the hotplug locking to percpu rwsems.
Remove it.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081547.651378834@linutronix.de
Scalable MCA systems have a new MCA_CONFIG register that we use to
configure each bank. We currently use this when we set up thresholding.
However, this is logically separate.
Group all SMCA-related initialization into a single function.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1493147772-2721-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We have support for the new SMCA MCA_DE{STAT,ADDR} registers in Linux.
So we've used these registers in place of MCA_{STATUS,ADDR} on SMCA
systems.
However, the guidance for current SMCA implementations of is to continue
using MCA_{STATUS,ADDR} and to use MCA_DE{STAT,ADDR} only if a Deferred
error was not found in the former registers. If we logged a Deferred
error in MCA_STATUS then we should also clear MCA_DESTAT. This also
means we shouldn't clear MCA_CONFIG[LogDeferredInMcaStat].
Rework __log_error() to only log an error and add helpers for the
different error types being logged from the corresponding interrupt
handlers.
Boris: carve out common functionality into a _log_error_bank(). Cleanup
comments, check MCi_STATUS bits before reading MSRs. Streamline flow.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1493147772-2721-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The refcount_t type and corresponding API should be used instead
of atomic_t when the variable is used as a reference counter. This
allows to avoid accidental refcounter overflows that might lead to
use-after-free situations.
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1492695536-5947-1-git-send-email-elena.reshetova@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Export the function which checks whether an MCE is a memory error to
other users so that we can reuse the logic. Drop the boot_cpu_data use,
while at it, as mce.cpuvendor already has the CPU vendor in there.
Integrate a piece from a patch from Vishal Verma
<vishal.l.verma@intel.com> to export it for modules (nfit).
The main reason we're exporting it is that the nfit handler
nfit_handle_mce() needs to detect a memory error properly before doing
its recovery actions.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170519093915.15413-2-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- two boot crash fixes
- unwinder fixes
- kexec related kernel direct mappings enhancements/fixes
- more Clang support quirks
- minor cleanups
- Documentation fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel_rdt: Fix a typo in Documentation
x86/build: Don't add -maccumulate-outgoing-args w/o compiler support
x86/boot/32: Fix UP boot on Quark and possibly other platforms
x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init()
x86/kexec/64: Use gbpages for identity mappings if available
x86/mm: Add support for gbpages to kernel_ident_mapping_init()
x86/boot: Declare error() as noreturn
x86/mm/kaslr: Use the _ASM_MUL macro for multiplication to work around Clang incompatibility
x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()
x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()
x86/microcode/AMD: Remove redundant NULL check on mc
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJZFWTzAAoJELDendYovxMv24cIAJ3U2OZ64d7WTKD37AT2O6nF
6R3j+zJ6apoKX4zHvhWUOHZ6jpTASTnaisiIskVc52JcgAK0f8ZYTg5nhyWPceAD
Icf+JuXrI6uplD97qsjt7X9FbxUsRZninfsznoBkK6P8Cw8ZWlWIWIl6e3CrVwBD
geyKcbsKkVG8+bMjWvmQd94CFq5r8Ivup0sCECumx0lqw3RhxdhQvUix9eBULEoG
h/XAuPbMupdjU6phgqG4rvUjWd/R+9mIIDG1oY9Kpx4Kpn/7bHtoYZ//Qzs8bmuP
5ORujOedshdyAZqLGxQuQzo+/4E9gX3qVbaS6fPf1Ab+ra0k/iWtetUITZ0v2AQ=
=gWpG
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.12b-rc0c-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"This contains two fixes for booting under Xen introduced during this
merge window and two fixes for older problems, where one is just much
more probable due to another merge window change"
* tag 'for-linus-4.12b-rc0c-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: adjust early dom0 p2m handling to xen hypervisor behavior
x86/amd: don't set X86_BUG_SYSRET_SS_ATTRS when running under Xen
xen/x86: Do not call xen_init_time_ops() until shared_info is initialized
x86/xen: fix xsave capability setting
When running as Xen pv guest X86_BUG_SYSRET_SS_ATTRS must not be set
on AMD cpus.
This bug/feature bit is kind of special as it will be used very early
when switching threads. Setting the bit and clearing it a little bit
later leaves a critical window where things can go wrong. This time
window has enlarged a little bit by using setup_clear_cpu_cap() instead
of the hypervisor's set_cpu_features callback. It seems this larger
window now makes it rather easy to hit the problem.
The proper solution is to never set the bit in case of Xen.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
set_memory_* functions have moved to set_memory.h. Switch to this
explicitly.
Link: http://lkml.kernel.org/r/1488920133-27229-6-git-send-email-labbott@redhat.com
Signed-off-by: Laura Abbott <labbott@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the big set of new char/misc driver drivers and features for
4.12-rc1.
There's lots of new drivers added this time around, new firmware drivers
from Google, more auxdisplay drivers, extcon drivers, fpga drivers, and
a bunch of other driver updates. Nothing major, except if you happen to
have the hardware for these drivers, and then you will be happy :)
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWQvAgg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yknsACgzkAeyz16Z97J3UTaeejbR7nKUCAAoKY4WEHY
8O9f9pr9gj8GMBwxeZQa
=OIfB
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big set of new char/misc driver drivers and features for
4.12-rc1.
There's lots of new drivers added this time around, new firmware
drivers from Google, more auxdisplay drivers, extcon drivers, fpga
drivers, and a bunch of other driver updates. Nothing major, except if
you happen to have the hardware for these drivers, and then you will
be happy :)
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
firmware: google memconsole: Fix return value check in platform_memconsole_init()
firmware: Google VPD: Fix return value check in vpd_platform_init()
goldfish_pipe: fix build warning about using too much stack.
goldfish_pipe: An implementation of more parallel pipe
fpga fr br: update supported version numbers
fpga: region: release FPGA region reference in error path
fpga altera-hps2fpga: disable/unprepare clock on error in alt_fpga_bridge_probe()
mei: drop the TODO from samples
firmware: Google VPD sysfs driver
firmware: Google VPD: import lib_vpd source files
misc: lkdtm: Add volatile to intentional NULL pointer reference
eeprom: idt_89hpesx: Add OF device ID table
misc: ds1682: Add OF device ID table
misc: tsl2550: Add OF device ID table
w1: Remove unneeded use of assert() and remove w1_log.h
w1: Use kernel common min() implementation
uio_mf624: Align memory regions to page size and set correct offsets
uio_mf624: Refactor memory info initialization
uio: Allow handling of non page-aligned memory regions
hangcheck-timer: Fix typo in comment
...
There is no user of x86_hyper->set_cpu_features() any more. Remove it.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
There is no need to set the same capabilities for each cpu
individually. This can be done for all cpus in platform initialization.
Cc: Alok Kataria <akataria@vmware.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Alok Kataria <akataria@vmware.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
All code to support Xen PV will get under this new option. For the
beginning, check for it in the common code.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
As a preparation to splitting the code we need to untangle it:
x86_hyper_xen -> x86_hyper_xen_hvm and x86_hyper_xen_pv
xen_platform() -> xen_platform_hvm() and xen_platform_pv()
xen_cpu_up_prepare() -> xen_cpu_up_prepare_pv() and xen_cpu_up_prepare_hvm()
xen_cpu_dead() -> xen_cpu_dead_pv() and xen_cpu_dead_pv_hvm()
Add two parameters to xen_cpuhp_setup() to pass proper cpu_up_prepare and
cpu_dead hooks. xen_set_cpu_features() is now PV-only so the redundant
xen_pv_domain() check can be dropped.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Pull x86 mm updates from Ingo Molnar:
"The main x86 MM changes in this cycle were:
- continued native kernel PCID support preparation patches to the TLB
flushing code (Andy Lutomirski)
- various fixes related to 32-bit compat syscall returning address
over 4Gb in applications, launched from 64-bit binaries - motivated
by C/R frameworks such as Virtuozzo. (Dmitry Safonov)
- continued Intel 5-level paging enablement: in particular the
conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)
- x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)
- ... plus misc updates, fixes and cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
x86/mm: Fix flush_tlb_page() on Xen
x86/mm: Make flush_tlb_mm_range() more predictable
x86/mm: Remove flush_tlb() and flush_tlb_current_task()
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
x86/mm/64: Fix crash in remove_pagetable()
Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
x86/boot/e820: Remove a redundant self assignment
x86/mm: Fix dump pagetables for 4 levels of page tables
x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
x86/espfix: Add support for 5-level paging
x86/kasan: Extend KASAN to support 5-level paging
x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
x86/paravirt: Add 5-level support to the paravirt code
x86/mm: Define virtual memory map for 5-level paging
x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
x86/boot: Detect 5-level paging support
x86/mm/numa: Remove numa_nodemask_from_meminfo()
...
Pull x86 cpu updates from Ingo Molnar:
"The biggest changes are an extension of the Intel RDT code to extend
it with Intel Memory Bandwidth Allocation CPU support: MBA allows
bandwidth allocation between cores, while CBM (already upstream)
allows CPU cache partitioning.
There's also misc smaller fixes and updates"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/intel_rdt: Return error for incorrect resource names in schemata
x86/intel_rdt: Trim whitespace while parsing schemata input
x86/intel_rdt: Fix padding when resource is enabled via mount
x86/intel_rdt: Get rid of anon union
x86/cpu: Keep model defines sorted by model number
x86/intel_rdt/mba: Add schemata file support for MBA
x86/intel_rdt: Make schemata file parsers resource specific
x86/intel_rdt/mba: Add info directory files for Memory Bandwidth Allocation
x86/intel_rdt: Make information files resource specific
x86/intel_rdt/mba: Add primary support for Memory Bandwidth Allocation (MBA)
x86/intel_rdt/mba: Memory bandwith allocation feature detect
x86/intel_rdt: Add resource specific msr update function
x86/intel_rdt: Move CBM specific data into a struct
x86/intel_rdt: Cleanup namespace to support multiple resource types
Documentation, x86: Intel Memory bandwidth allocation
x86/intel_rdt: Organize code properly
x86/intel_rdt: Init padding only if a device exists
x86/intel_rdt: Add cpus_list rdtgroup file
x86/intel_rdt: Cleanup kernel-doc
x86/intel_rdt: Update schemata read to show data in tabular format
...
Pull x86 boot updates from Ingo Molnar:
"The biggest changes in this cycle were:
- reworking of the e820 code: separate in-kernel and boot-ABI data
structures and apply a whole range of cleanups to the kernel side.
No change in functionality.
- enable KASLR by default: it's used by all major distros and it's
out of the experimental stage as well.
- ... misc fixes and cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
x86/KASLR: Fix kexec kernel boot crash when KASLR randomization fails
x86/reboot: Turn off KVM when halting a CPU
x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
x86: Enable KASLR by default
boot/param: Move next_arg() function to lib/cmdline.c for later reuse
x86/boot: Fix Sparse warning by including required header file
x86/boot/64: Rename start_cpu()
x86/xen: Update e820 table handling to the new core x86 E820 code
x86/boot: Fix pr_debug() API braindamage
xen, x86/headers: Add <linux/device.h> dependency to <asm/xen/page.h>
x86/boot/e820: Simplify e820__update_table()
x86/boot/e820: Separate the E820 ABI structures from the in-kernel structures
x86/boot/e820: Fix and clean up e820_type switch() statements
x86/boot/e820: Rename the remaining E820 APIs to the e820__*() prefix
x86/boot/e820: Remove unnecessary #include's
x86/boot/e820: Rename e820_mark_nosave_regions() to e820__register_nosave_regions()
x86/boot/e820: Rename e820_reserve_resources*() to e820__reserve_resources*()
x86/boot/e820: Use bool in query APIs
x86/boot/e820: Document e820__reserve_setup_data()
x86/boot/e820: Clean up __e820__update_table() et al
...
Pull RAS updates from Ingo Molnar:
"The main changes in this cycle were:
- add the 'Corrected Errors Collector' kernel feature which collect
and monitor correctable errors statistics and will preemptively
(soft-)offline physical pages that have a suspiciously high error
count.
- handle MCE errors during kexec() more gracefully
- factor out and deprecate the /dev/mcelog driver
- ... plus misc fixes and cleanpus"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Check MCi_STATUS[MISCV] for usable addr on Intel only
ACPI/APEI: Use setup_deferrable_timer()
x86/mce: Update notifier priority check
x86/mce: Enable PPIN for Knights Landing/Mill
x86/mce: Do not register notifiers with invalid prio
x86/mce: Factor out and deprecate the /dev/mcelog driver
RAS: Add a Corrected Errors Collector
x86/mce: Rename mce_log to mce_log_buffer
x86/mce: Rename mce_log()'s argument
x86/mce: Init some CPU features early
x86/mce: Handle broadcasted MCE gracefully with kexec
Pul x86/process updates from Ingo Molnar:
"The main change in this cycle was to add the ARCH_[GET|SET]_CPUID
prctl() ABI extension to control the availability of the CPUID
instruction, analogously to the existing PR_GET|SET_TSC ABI that
controls RDTSC.
Motivation: the 'rr' user-space record-and-replay execution debugger
would like to trap and emulate the CPUID instruction - which
instruction is normally unprivileged.
Trapping CPUID is possible on IvyBridge and later Intel CPUs - expose
this hardware capability"
* 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/syscalls/32: Ignore arch_prctl for other architectures
um/arch_prctl: Fix fallout from x86 arch_prctl() rework
x86/arch_prctl: Add ARCH_[GET|SET]_CPUID
x86/cpufeature: Detect CPUID faulting support
x86/syscalls/32: Wire up arch_prctl on x86-32
x86/arch_prctl: Add do_arch_prctl_common()
x86/arch_prctl/64: Rename do_arch_prctl() to do_arch_prctl_64()
x86/arch_prctl/64: Use SYSCALL_DEFINE2 to define sys_arch_prctl()
x86/arch_prctl: Rename 'code' argument to 'option'
x86/msr: Rename MISC_FEATURE_ENABLES to MISC_FEATURES_ENABLES
x86/process: Optimize TIF_NOTSC switch
x86/process: Correct and optimize TIF_BLOCKSTEP switch
x86/process: Optimize TIF checks in __switch_to_xtra()
When schemata parses the resource names it does not return an error if it
detects incorrect resource names and fails quietly.
This happens because for_each_enabled_rdt_resource(r) leaves "r" pointing
beyond the end of the rdt_resources_all[] array, and the check for !r->name
results in an out of bounds access.
Split the resource parsing part into a helper function to avoid the issue.
[ tglx: Made it readable by splitting the parser loop out into a function ]
Reported-by: Prakhya, Sai Praneeth <sai.praneeth.prakhya@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Tested-by: Prakhya, Sai Praneeth <sai.praneeth.prakhya@intel.com>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: ravi.v.shankar@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1492645804-17465-4-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
mce_usable_address() does a bunch of basic sanity checks to verify
whether the address reported with the error is usable for further
processing. However, we do check MCi_STATUS[MISCV] and that is not
needed on AMD as that bit says that there's additional information about
the logged error in the MCi_MISCj banks.
But we don't need that to know whether the address is usable - we only
need to know whether the physical address is valid - i.e., ADDRV.
On Intel the MISCV bit is needed to perform additional checks to determine
whether the reported address is a physical one, etc.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170418183924.6agjkebilwqj26or@pd.tnic
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The NFIT MCE handler callback (for handling media errors on NVDIMMs)
takes a mutex to add the location of a memory error to a list. But since
the notifier call chain for machine checks (x86_mce_decoder_chain) is
atomic, we get a lockdep splat like:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
in_atomic(): 1, irqs_disabled(): 0, pid: 4, name: kworker/0:0
[..]
Call Trace:
dump_stack
___might_sleep
__might_sleep
mutex_lock_nested
? __lock_acquire
nfit_handle_mce
notifier_call_chain
atomic_notifier_call_chain
? atomic_notifier_call_chain
mce_gen_pool_process
Convert the notifier to a blocking one which gets to run only in process
context.
Boris: remove the notifier call in atomic context in print_mce(). For
now, let's print the MCE on the atomic path so that we can make sure
they go out and get logged at least.
Fixes: 6839a6d96f ("nfit: do an ARS scrub on hitting a latent media error")
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170411224457.24777-1-vishal.l.verma@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Update the check which enforces the registration of MCE decoder notifier
callbacks with valid priority only, to include mcelog's priority.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lkp@01.org
Link: http://lkml.kernel.org/r/20170418073820.i6kl5tggcntwlisa@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The files in the info directory for MBA are as follows:
num_closids
The maximum number of CLOSids available for MBA
min_bandwidth
The minimum memory bandwidth percentage value
bandwidth_gran
The granularity of the bandwidth control in percent for the
particular CPU SKU. Intermediate values entered are rounded off
to the previous control step available. Available bandwidth
control steps are minimum_bandwidth + N * bandwidth_gran.
delay_linear
When set, the OS writes a linear percentage based value to the
control MSRs ranging from minimum_bandwidth to 100 percent.
This value is informational and has no influence on the values
written to the schemata files. The values written to the
schemata are always bandwidth percentage that is requested.
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1491611637-20417-7-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Memory bandwidth allocation requires different information than cache
allocation.
To avoid a lump of data in struct rdt_resource, move all cache related
information into a seperate structure and add that to struct rdt_resource.
Sanitize the data types while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: vikas.shivappa@intel.com
The schemata lock is released before freeing the resource's temporary
tmp_cbms allocation. That's racy versus another write which allocates and
uses new temporary storage, resulting in memory leaks, freeing in use
memory, double a free or any combination of those.
Move the unlock after the release code.
Fixes: 60ec2440c6 ("x86/intel_rdt: Add schemata file")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Shaohua Li <shli@fb.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170411071446.15241-1-jolsa@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There's a conflict between ongoing level-5 paging support and
the E820 rewrite. Since the E820 rewrite is essentially ready,
merge it into x86/mm to reduce tree conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The E820 rework in WIP.x86/boot has gone through a couple of weeks
of exposure in -tip, merge it in a wider fashion.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The resource control filesystem provides only a bitmask based cpus file for
assigning CPUs to a resource group. That's cumbersome with large cpumasks
and non-intuitive when modifying the file from the command line.
Range based cpu lists are commonly used along with bitmask based cpu files
in various subsystems throughout the kernel.
Add 'cpus_list' file which is CPU range based.
# cd /sys/fs/resctrl/
# echo 1-10 > krava/cpus_list
# cat krava/cpus_list
1-10
# cat krava/cpus
0007fe
# cat cpus
fffff9
# cat cpus_list
0,3-23
[ tglx: Massaged changelog and replaced "bitmask lists" by "CPU ranges" ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20170410145232.GF25354@krava
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When auto EOI is not enabled; issue an explicit EOI for hyper-v
interrupts.
Fixes: 6c248aad81 ("Drivers: hv: Base autoeoi enablement based on hypervisor hints")
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The schemata file can have multiple lines and it is cumbersome to update
all lines.
Remove code that requires that the user provides values for every resource
(in the right order). If the user provides values for just a few
resources, update them and leave the rest unchanged.
Side benefit: we now check which values were updated and only send IPIs to
cpus that actually have updates.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: vikas.shivappa@intel.com
Cc: h.peter.anvin@intel.com
Link: http://lkml.kernel.org/r/1491255857-17213-3-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull RAS fix from Thomas Gleixner:
"Prevent dmesg from being spammed when MCE logging is active"
* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Don't print MCEs when mcelog is active
MCA bank 3 is reserved on systems pre-Fam17h, so it didn't have a name.
However, MCA bank 3 is defined on Fam17h systems and can be accessed
using legacy MSRs. Without a name we get a stack trace on Fam17h systems
when trying to register sysfs files for bank 3 on kernels that don't
recognize Scalable MCA.
Call MCA bank 3 "decode_unit" since this is what it represents on
Fam17h. This will allow kernels without SMCA support to see this bank on
Fam17h+ and prevent the stack trace. This will not affect older systems
since this bank is reserved on them, i.e. it'll be ignored.
Tested on AMD Fam15h and Fam17h systems.
WARNING: CPU: 26 PID: 1 at lib/kobject.c:210 kobject_add_internal
kobject: (ffff88085bb256c0): attempted to be registered with empty name!
...
Call Trace:
kobject_add_internal
kobject_add
kobject_create_and_add
threshold_create_device
threshold_init_device
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1490102285-3659-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is just a defensive precaution: do not register notifiers with a
priority which would disrupt the error handling in the notifiers with
prio higher than MCE_PRIO_EDAC.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170327093304.10683-7-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move all code relating to /dev/mcelog to a separate source file.
/dev/mcelog driver can now operate from the machine check notifier with
lowest prio.
Signed-off-by: Tony Luck <tony.luck@intel.com>
[ Move the mce_helper and trigger functionality behind CONFIG_X86_MCELOG_LEGACY. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170327093304.10683-6-bp@alien8.de
[ Renamed CONFIG_X86_MCELOG to CONFIG_X86_MCELOG_LEGACY. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce a simple data structure for collecting correctable errors
along with accessors. More detailed description in the code itself.
The error decoding is done with the decoding chain now and
mce_first_notifier() gets to see the error first and the CEC decides
whether to log it and then the rest of the chain doesn't hear about it -
basically the main reason for the CE collector - or to continue running
the notifiers.
When the CEC hits the action threshold, it will try to soft-offine the
page containing the ECC and then the whole decoding chain gets to see
the error.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170327093304.10683-5-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is confusing when staring at "struct mce_log mcelog" and then there's
also a function called mce_log(). So call the buffer what it is.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170327093304.10683-4-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We call it everywhere "struct mce *m". Adjust that here too to avoid
confusion.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170327093304.10683-3-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since:
cd9c57cad3 ("x86/MCE: Dump MCE to dmesg if no consumers")
all MCEs are printed even when mcelog is running. Fix the regression to
not print to dmesg when mcelog is running as it is a consumer too.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.10..
Fixes: cd9c57cad3 ("x86/MCE: Dump MCE to dmesg if no consumers")
Link: http://lkml.kernel.org/r/20170327093304.10683-2-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Xen imposes special requirements on the GDT. Rather than using a
global variable for the pgprot, just use an explicit special case
for Xen -- this makes it clearer what's going on. It also debloats
64-bit kernels very slightly.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e9ea96abbfd6a8c87753849171bb5987ecfeb523.1490218061.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__pa() cannot be used on percpu pointers because they may be
virtually mapped. Use per_cpu_ptr_to_phys() instead.
This fixes a boot crash on a some 32-bit configurations. I assume
this is related to which allocation strategy is chosen by the percpu
core.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 69218e4799 x86: ("Remap GDT tables in the fixmap section")
Link: http://lkml.kernel.org/r/22e0069c29fba31998f193201e359eebfdac4960.1490218061.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Intel supports faulting on the CPUID instruction beginning with Ivy Bridge.
When enabled, the processor will fault on attempts to execute the CPUID
instruction with CPL>0. Exposing this feature to userspace will allow a
ptracer to trap and emulate the CPUID instruction.
When supported, this feature is controlled by toggling bit 0 of
MSR_MISC_FEATURES_ENABLES. It is documented in detail in Section 2.3.2 of
https://bugzilla.kernel.org/attachment.cgi?id=243991
Implement a new pair of arch_prctls, available on both x86-32 and x86-64.
ARCH_GET_CPUID: Returns the current CPUID state, either 0 if CPUID faulting
is enabled (and thus the CPUID instruction is not available) or 1 if
CPUID faulting is not enabled.
ARCH_SET_CPUID: Set the CPUID state to the second argument. If
cpuid_enabled is 0 CPUID faulting will be activated, otherwise it will
be deactivated. Returns ENODEV if CPUID faulting is not supported on
this system.
The state of the CPUID faulting flag is propagated across forks, but reset
upon exec.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Robert O'Callahan <robert@ocallahan.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: user-mode-linux-user@lists.sourceforge.net
Cc: David Matlack <dmatlack@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/r/20170320081628.18952-9-khuey@kylehuey.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Intel supports faulting on the CPUID instruction beginning with Ivy Bridge.
When enabled, the processor will fault on attempts to execute the CPUID
instruction with CPL>0. This will allow a ptracer to emulate the CPUID
instruction.
Bit 31 of MSR_PLATFORM_INFO advertises support for this feature. It is
documented in detail in Section 2.3.2 of
https://bugzilla.kernel.org/attachment.cgi?id=243991
Detect support for this feature and expose it as X86_FEATURE_CPUID_FAULT.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Robert O'Callahan <robert@ocallahan.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: user-mode-linux-user@lists.sourceforge.net
Cc: David Matlack <dmatlack@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/r/20170320081628.18952-8-khuey@kylehuey.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This matches the only public Intel documentation of this MSR, in the
"Virtualization Technology FlexMigration Application Note"
(preserved at https://bugzilla.kernel.org/attachment.cgi?id=243991)
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Robert O'Callahan <robert@ocallahan.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: user-mode-linux-user@lists.sourceforge.net
Cc: David Matlack <dmatlack@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/r/20170320081628.18952-2-khuey@kylehuey.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When the MCA banks in __mcheck_cpu_init_generic() are polled for leftover
errors logged during boot or from the previous boot, its required to have
CPU features detected sufficiently so that the reading out and handling of
those early errors is done correctly.
If those features are not available, the decoding may miss some information
and get incomplete errors logged. For example, on SMCA systems the MCA_IPID
and MCA_SYND registers are not logged and MCA_ADDR is not masked
appropriately.
To cure that, do a subset of the basic feature detection early while the
rest happens in its usual place in __mcheck_cpu_init_vendor().
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1489599055-20756-1-git-send-email-Yazen.Ghannam@amd.com
[ Massage commit message and simplify. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
mc is a pointer to the static u8 array amd_ucode_patch and
therefore can never be null, so the check is redundant. Remove it.
Detected by CoverityScan, CID#1372871 ("Logically Dead Code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: kernel-janitors@vger.kernel.org
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/20170315171010.17536-1-colin.king@canonical.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch makes the GDT remapped pages read-only, to prevent accidental
(or intentional) corruption of this key data structure.
This change is done only on 64-bit, because 32-bit needs it to be writable
for TSS switches.
The native_load_tr_desc function was adapted to correctly handle a
read-only GDT. The LTR instruction always writes to the GDT TSS entry.
This generates a page fault if the GDT is read-only. This change checks
if the current GDT is a remap and swap GDTs as needed. This function was
tested by booting multiple machines and checking hibernation works
properly.
KVM SVM and VMX were adapted to use the writeable GDT. On VMX, the
per-cpu variable was removed for functions to fetch the original GDT.
Instead of reloading the previous GDT, VMX will reload the fixmap GDT as
expected. For testing, VMs were started and restored on multiple
configurations.
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Luis R . Rodriguez <mcgrof@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: kernel-hardening@lists.openwall.com
Cc: kvm@vger.kernel.org
Cc: lguest@lists.ozlabs.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-pm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: zijun_hu <zijun_hu@htc.com>
Link: http://lkml.kernel.org/r/20170314170508.100882-3-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Each processor holds a GDT in its per-cpu structure. The sgdt
instruction gives the base address of the current GDT. This address can
be used to bypass KASLR memory randomization. With another bug, an
attacker could target other per-cpu structures or deduce the base of
the main memory section (PAGE_OFFSET).
This patch relocates the GDT table for each processor inside the
fixmap section. The space is reserved based on number of supported
processors.
For consistency, the remapping is done by default on 32 and 64-bit.
Each processor switches to its remapped GDT at the end of
initialization. For hibernation, the main processor returns with the
original GDT and switches back to the remapping at completion.
This patch was tested on both architectures. Hibernation and KVM were
both tested specially for their usage of the GDT.
Thanks to Boris Ostrovsky <boris.ostrovsky@oracle.com> for testing and
recommending changes for Xen support.
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Luis R . Rodriguez <mcgrof@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: kernel-hardening@lists.openwall.com
Cc: kvm@vger.kernel.org
Cc: lguest@lists.ozlabs.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-pm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: zijun_hu <zijun_hu@htc.com>
Link: http://lkml.kernel.org/r/20170314170508.100882-2-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The rdtgroup_kn_unlock waits for the last user to release and put its
node. But it's calling kernfs_put on the node which calls the
rdtgroup_kn_unlock, which might not be the group's directory node, but
another group's file node.
This race could be easily reproduced by running 2 instances
of following script:
mount -t resctrl resctrl /sys/fs/resctrl/
pushd /sys/fs/resctrl/
mkdir krava
echo "krava" > krava/schemata
rmdir krava
popd
umount /sys/fs/resctrl
It triggers the slub debug error message with following command
line config: slub_debug=,kernfs_node_cache.
Call kernfs_put on the group's node to fix it.
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Shaohua Li <shli@fb.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1489501253-20248-1-git-send-email-jolsa@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When we are about to kexec a crash kernel and right then and there a
broadcasted MCE fires while we're still in the first kernel and while
the other CPUs remain in a holding pattern, the #MC handler of the
first kernel will timeout and then panic due to never completing MCE
synchronization.
Handle this in a similar way as to when the CPUs are offlined when that
broadcasted MCE happens.
[ Boris: rewrote commit message and comments. ]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: kexec@lists.infradead.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1487857012-9059-1-git-send-email-xlpang@redhat.com
Link: http://lkml.kernel.org/r/20170313095019.19351-1-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove the wp_works_ok member of struct cpuinfo_x86. It's an
optimization back from Linux v0.99 times where we had no fixup support
yet and did the CR0.WP test via special code in the page fault handler.
The < 0 test was an optimization to not do the special casing for each
NULL ptr access violation but just for the first one doing the WP test.
Today it serves no real purpose as the test no longer needs special code
in the page fault handler and the only call side -- mem_init() -- calls
it just once, anyway. However, Xen pre-initializes it to 1, to skip the
test.
Doing the test again for Xen should be no issue at all, as even the
commit introducing skipping the test (commit d560bc6157 ("x86, xen:
Suppress WP test on Xen")) mentioned it being ban aid only. And, in
fact, testing the patch on Xen showed nothing breaks.
The pre-fixup times are long gone and with the removal of the fallback
handling code in commit a5c2a893db ("x86, 386 removal: Remove
CONFIG_X86_WP_WORKS_OK") the kernel requires a working CR0.WP anyway.
So just get rid of the "optimization" and do the test unconditionally.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Arnd Hannemann <hannemann@nets.rwth-aachen.de>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/1486933932-585-3-git-send-email-minipli@googlemail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fixes from Ingo Molnar:
"Misc fixes and minor updates all over the place:
- an SGI/UV fix
- a defconfig update
- a build warning fix
- move the boot_params file to the arch location in debugfs
- a pkeys fix
- selftests fix
- boot message fixes
- sparse fixes
- a resume warning fix
- ioapic hotplug fixes
- reboot quirks
... plus various minor cleanups"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build/x86_64_defconfig: Enable CONFIG_R8169
x86/reboot/quirks: Add ASUS EeeBook X205TA/W reboot quirk
x86/hpet: Prevent might sleep splat on resume
x86/boot: Correct setup_header.start_sys name
x86/purgatory: Fix sparse warning, symbol not declared
x86/purgatory: Make functions and variables static
x86/events: Remove last remnants of old filenames
x86/pkeys: Check against max pkey to avoid overflows
x86/ioapic: Split IOAPIC hot-removal into two steps
x86/PCI: Implement pcibios_release_device to release IRQ from IOAPIC
x86/intel_rdt: Remove duplicate inclusion of linux/cpu.h
x86/vmware: Remove duplicate inclusion of asm/timer.h
x86/hyperv: Hide unused label
x86/reboot/quirks: Add ASUS EeeBook X205TA reboot quirk
x86/platform/uv/BAU: Fix HUB errors by remove initial write to sw-ack register
x86/selftests: Add clobbers for int80 on x86_64
x86/apic: Simplify enable_IR_x2apic(), remove try_to_enable_IR()
x86/apic: Fix a warning message in logical CPU IDs allocation
x86/kdebugfs: Move boot params hierarchy under (debugfs)/x86/
Pull scheduler fixes from Ingo Molnar:
"A fix for KVM's scheduler clock which (erroneously) was always marked
unstable, a fix for RT/DL load balancing, plus latency fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
sched/core: Fix pick_next_task() for RT,DL
sched/fair: Make select_idle_cpu() more aggressive
Wanpeng Li reported that since the following commit:
acb04058de ("sched/clock: Fix hotplug crash")
... KVM always runs with unstable sched-clock even though KVM's
kvm_clock _is_ stable.
The problem is that we've tied clear_sched_clock_stable() to the TSC
state, and overlooked that sched_clock() is a paravirt function.
Solve this by doing two things:
- tie the sched_clock() stable state more clearly to the TSC stable
state for the normal (!paravirt) case.
- only call clear_sched_clock_stable() when we mark TSC unstable
when we use native_sched_clock().
The first means we can actually run with stable sched_clock in more
situations then before, which is good. And since commit:
12907fbb1a ("sched/clock, clocksource: Add optional cs::mark_unstable() method")
... this should be reliable. Since any detection of TSC fail now results
in marking the TSC unstable.
Reported-by: Wanpeng Li <kernellwp@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: acb04058de ("sched/clock: Fix hotplug crash")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.
Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.
Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Two documentation updates, plus a debugging annotation fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/crash: Update the stale comment in reserve_crashkernel()
x86/irq, trace: Add __irq_entry annotation to x86's platform IRQ handlers
Documentation, x86, resctrl: Recommend locking for resctrlfs
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:
git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'
This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.
(Michal Hocko provided most of the kerneldoc comment.)
Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the big char/misc driver patchset for 4.11-rc1.
Lots of different driver subsystems updated here. Rework for the hyperv
subsystem to handle new platforms better, mei and w1 and extcon driver
updates, as well as a number of other "minor" driver updates. Full
details are in the shortlog below.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWK2iRQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynhFACguVE+/ixj5u5bT5DXQaZNai/6zIAAmgMWwd/t
YTD2cwsJsGbTT1fY3SUe
=CiSI
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big char/misc driver patchset for 4.11-rc1.
Lots of different driver subsystems updated here: rework for the
hyperv subsystem to handle new platforms better, mei and w1 and extcon
driver updates, as well as a number of other "minor" driver updates.
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (169 commits)
goldfish: Sanitize the broken interrupt handler
x86/platform/goldfish: Prevent unconditional loading
vmbus: replace modulus operation with subtraction
vmbus: constify parameters where possible
vmbus: expose hv_begin/end_read
vmbus: remove conditional locking of vmbus_write
vmbus: add direct isr callback mode
vmbus: change to per channel tasklet
vmbus: put related per-cpu variable together
vmbus: callback is in softirq not workqueue
binder: Add support for file-descriptor arrays
binder: Add support for scatter-gather
binder: Add extra size to allocator
binder: Refactor binder_transact()
binder: Support multiple /dev instances
binder: Deal with contexts in debugfs
binder: Support multiple context managers
binder: Split flat_binder_object
auxdisplay: ht16k33: remove private workqueue
auxdisplay: ht16k33: rework input device initialization
...
Pull x86 microcode updates from Ingo Molnar:
"The main changes are further simplification and unification of the
code between the AMD and Intel microcode loaders, plus other
simplifications - by Borislav Petkov"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Remove struct cont_desc.eq_id
x86/microcode/AMD: Remove AP scanning optimization
x86/microcode/AMD: Simplify saving from initrd
x86/microcode/AMD: Unify load_ucode_amd_ap()
x86/microcode/AMD: Check patch level only on the BSP
x86/microcode: Remove local vendor variable
x86/microcode/AMD: Use find_microcode_in_initrd()
x86/microcode/AMD: Get rid of global this_equiv_id
x86/microcode: Decrease CPUID use
x86/microcode/AMD: Rework container parsing
x86/microcode/AMD: Extend the container struct
x86/microcode/AMD: Shorten function parameter's name
x86/microcode/AMD: Clean up find_equiv_id()
x86/microcode: Convert to bare minimum MSR accessors
x86/MSR: Carve out bare minimum accessors
Pull x86 fpu updates from Ingo Molnar:
"The main changes relate to fixes between (lack of) CPUID and FPU
detection that should only affect old or weird CPUs, by Andy
Lutomirski"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Fix the "Giving up, no FPU found" test
x86/fpu: Fix CPUID-less FPU detection
x86/fpu: Fix "x86/fpu: Legacy x87 FPU detected" message
x86/cpu: Re-apply forced caps every time CPU caps are re-read
x86/cpu: Factor out application of forced CPU caps
x86/cpu: Add X86_FEATURE_CPUID
x86/fpu/xstate: Move XSAVES state init to a function
Pull x86 cpufeature updates from Ingo Molnar:
"The main changes in this cycle were related to enable ring-3
MONITOR/MWAIT instructions support on supported CPUs, by Grzegorz
Andrejczuk and Piotr Luc"
* 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpufeature: Move RING3MWAIT feature to avoid conflicts
x86/cpufeature: Enable RING3MWAIT for Knights Mill
x86/cpufeature: Enable RING3MWAIT for Knights Landing
x86/cpufeature: Add RING3MWAIT to CPU features
x86/elf: Add HWCAP2 to expose ring 3 MONITOR/MWAIT
x86/msr: Add MSR_MISC_FEATURE_ENABLES and RING3MWAIT bit
x86/cpufeature: Add AVX512_VPOPCNTDQ feature
Pull scheduler updates from Ingo Molnar:
"The main changes in this (fairly busy) cycle were:
- There was a class of scheduler bugs related to forgetting to update
the rq-clock timestamp which can cause weird and hard to debug
problems, so there's a new debug facility for this: which uncovered
a whole lot of bugs which convinced us that we want to keep the
debug facility.
(Peter Zijlstra, Matt Fleming)
- Various cputime related updates: eliminate cputime and use u64
nanoseconds directly, simplify and improve the arch interfaces,
implement delayed accounting more widely, etc. - (Frederic
Weisbecker)
- Move code around for better structure plus cleanups (Ingo Molnar)
- Move IO schedule accounting deeper into the scheduler plus related
changes to improve the situation (Tejun Heo)
- ... plus a round of sched/rt and sched/deadline fixes, plus other
fixes, updats and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
sched/core: Remove unlikely() annotation from sched_move_task()
sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
sched/topology: Split out scheduler topology code from core.c into topology.c
sched/core: Remove unnecessary #include headers
sched/rq_clock: Consolidate the ordering of the rq_clock methods
delayacct: Include <uapi/linux/taskstats.h>
sched/core: Clean up comments
sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
sched/clock: Add dummy clear_sched_clock_stable() stub function
sched/cputime: Remove generic asm headers
sched/cputime: Remove unused nsec_to_cputime()
s390, sched/cputime: Remove unused cputime definitions
powerpc, sched/cputime: Remove unused cputime definitions
s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
ia64, sched/cputime: Remove unused cputime definitions
ia64: Convert vtime to use nsec units directly
ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
sched/cputime: Remove jiffies based cputime
sched/cputime, vtime: Return nsecs instead of cputime_t to account
sched/cputime: Complete nsec conversion of tick based accounting
...
Pull RAS updates from Ingo Molnar:
"The main changes in this cycle were:
- Assign notifier chain priorities for all RAS related handlers to
make the ordering explicit (Borislav Petkov)
- Improve the AMD MCA banks sysfs output (Yazen Ghannam)
- Various cleanups and restructuring of the x86 RAS code (Borislav
Petkov)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
x86/ras: Get rid of mce_process_work()
EDAC/mce/amd: Dump TSC value
EDAC/mce/amd: Unexport amd_decode_mce()
x86/ras/amd/inj: Change dependency
x86/ras: Flip the TSC-adding logic
x86/ras/amd: Make sysfs names of banks more user-friendly
x86/ras/therm_throt: Do not log a fake MCE for thermal events
x86/ras/inject: Make it depend on X86_LOCAL_APIC=y
After:
a33d331761 ("x86/CPU/AMD: Fix Bulldozer topology")
our SMT scheduling topology for Fam17h systems is broken, because
the ThreadId is included in the ApicId when SMT is enabled.
So, without further decoding cpu_core_id is unique for each thread
rather than the same for threads on the same core. This didn't affect
systems with SMT disabled. Make cpu_core_id be what it is defined to be.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.9
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170205105022.8705-2-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
a33d331761 ("x86/CPU/AMD: Fix Bulldozer topology")
restored the initial approach we had with the Fam15h topology of
enumerating CU (Compute Unit) threads as cores. And this is still
correct - they're beefier than HT threads but still have some
shared functionality.
Our current approach has a problem with the Mad Max Steam game, for
example. Yves Dionne reported a certain "choppiness" while playing on
v4.9.5.
That problem stems most likely from the fact that the CU threads share
resources within one CU and when we schedule to a thread of a different
compute unit, this incurs latency due to migrating the working set to a
different CU through the caches.
When the thread siblings mask mirrors that aspect of the CUs and
threads, the scheduler pays attention to it and tries to schedule within
one CU first. Which takes care of the latency, of course.
Reported-by: Yves Dionne <yves.dionne@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.9
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20170205105022.8705-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Enable ring 3 MONITOR/MWAIT for Intel Xeon Phi codenamed Knights Mill. We
can't guarantee that this (KNM) will be the last CPU model that needs this
hack. But, we do recognize that this is far from optimal, and there is an
effort to ensure we don't keep doing extending this hack forever.
Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Cc: Piotr.Luc@intel.com
Cc: dave.hansen@linux.intel.com
Link: http://lkml.kernel.org/r/1484918557-15481-6-git-send-email-grzegorz.andrejczuk@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Enable ring 3 MONITOR/MWAIT for Intel Xeon Phi x200 codenamed Knights
Landing.
Presence of this feature cannot be detected automatically (by reading any
other MSR) therefore it is required to explicitly check for the family and
model of the CPU before attempting to enable it.
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Piotr.Luc@intel.com
Cc: dave.hansen@linux.intel.com
Link: http://lkml.kernel.org/r/1484918557-15481-5-git-send-email-grzegorz.andrejczuk@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Introduce ELF_HWCAP2 variable for x86 and reserve its bit 0 to expose the
ring 3 MONITOR/MWAIT.
HWCAP variables contain bitmasks which can be used by userspace
applications to detect which instruction sets are supported by CPU. On x86
architecture information about CPU capabilities can be checked via CPUID
instructions, unfortunately presence of ring 3 MONITOR/MWAIT feature cannot
be checked this way. ELF_HWCAP cannot be used as well, because on x86 it is
set to CPUID[1].EDX which means that all bits are reserved there.
HWCAP2 approach was chosen because it reuses existing solution present
in other architectures, so only minor modifications are required to the
kernel and userspace applications. When ELF_HWCAP2 is defined
kernel maps it to AT_HWCAP2 during the start of the application.
This way the ring 3 MONITOR/MWAIT feature can be detected using getauxval()
API in a simple and fast manner. ELF_HWCAP2 type is u32 to be consistent
with x86 ELF_HWCAP type.
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Piotr.Luc@intel.com
Cc: dave.hansen@linux.intel.com
Link: http://lkml.kernel.org/r/1484918557-15481-3-git-send-email-grzegorz.andrejczuk@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Erik reported that on a preproduction hardware a CMCI storm triggers the
BUG_ON in add_timer_on(). The reason is that the per CPU MCE timer is
started by the CMCI logic before the MCE CPU hotplug callback starts the
timer with add_timer_on(). So the timer is already queued which triggers
the BUG.
Using add_timer_on() is pretty pointless in this code because the timer is
strictlty per CPU, initialized as pinned and all operations which arm the
timer happen on the CPU to which the timer belongs.
Simplify the whole machinery by using mod_timer() instead of add_timer_on()
which avoids the problem because mod_timer() can handle already queued
timers. Use __start_timer() everywhere so the earliest armed expiry time is
preserved.
Reported-by: Erik Veijola <erik.veijola@intel.com>
Tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701310936080.3457@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
So there's a number of constants that start with "E820" but which
are not types - these create a confusing mixture when seen together
with 'enum e820_type' values:
E820MAP
E820NR
E820_X_MAX
E820MAX
To better differentiate the 'enum e820_type' values prefix them
with E820_TYPE_.
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have these three related functions:
extern void e820_add_region(u64 start, u64 size, int type);
extern u64 e820_update_range(u64 start, u64 size, unsigned old_type, unsigned new_type);
extern u64 e820_remove_range(u64 start, u64 size, unsigned old_type, int checktype);
But it's not clear from the naming that they are 3 operations based around the
same 'memory range' concept. Rename them to better signal this, and move
the prototypes next to each other:
extern void e820__range_add (u64 start, u64 size, int type);
extern u64 e820__range_update(u64 start, u64 size, unsigned old_type, unsigned new_type);
extern u64 e820__range_remove(u64 start, u64 size, unsigned old_type, int checktype);
Note that this improved organization of the functions shows another problem that was easy
to miss before: sometimes the E820 entry type is 'int', sometimes 'unsigned int' - but this
will be fixed in a separate patch.
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
update_e820() should have 'e820' as a prefix as most of the other E820
functions have - but it's also a bit unclear about its purpose, as
it's unclear what is updated - the whole table, or an entry?
Also, the name does not express that it's a trivial wrapper
around sanitize_e820_table() that also prints out the resulting
table.
So rename it to e820__update_table_print(). This also makes it
harmonize with the e820__update_table_firmware() function which
has a very similar purpose.
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In line with asm/e820/types.h, move the e820 API declarations to
asm/e820/api.h and update all usage sites.
This is just a mechanical, obviously correct move & replace patch,
there will be subsequent changes to clean up the code and to make
better use of the new header organization.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Calling get_cpu_cap() will reset a bunch of CPU features. This will
cause the system to lose track of force-set and force-cleared
features in the words that are reset until the end of CPU
initialization. This can cause X86_FEATURE_FPU, for example, to
change back and forth during boot and potentially confuse CPU setup.
To minimize the chance of confusion, re-apply forced caps every time
get_cpu_cap() is called.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Whitehead <tedheadster@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/c817eb373d2c67c2c81413a70fc9b845fa34a37e.1484705016.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are multiple call sites that apply forced CPU caps. Factor
them into a helper.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Whitehead <tedheadster@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/623ff7555488122143e4417de09b18be2085ad06.1484705016.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a synthetic CPUID flag denoting whether the CPU sports the CPUID
instruction or not. This will come useful later when accomodating
CPUID-less CPUs.
Signed-off-by: Borislav Petkov <bp@suse.de>
[ Slightly prettified. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Whitehead <tedheadster@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/dcb355adae3ab812c79397056a61c212f1a0c7cc.1484705016.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Assign all notifiers on the MCE decode chain a priority so that they get
called in the correct order.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make mce_gen_pool_process() the workqueue function directly and save us
an indirection.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-9-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the TSC value to the MCE record only when the MCE being logged is
precise, i.e., it is logged as an exception or an MCE-related interrupt.
So it doesn't look particularly easy to do without touching/changing a
bunch of places. That's why I'm trying tricks first.
For example, the mce-apei.c case I'm addressing by setting ->tsc only
for errors of panic severity. The idea there is, that, panic errors will
have raised an #MC and not polled.
And then instead of propagating a flag to mce_setup(), it seems
easier/less code to set ->tsc depending on the call sites, i.e.,
are we polling or are we preparing an MCE record in an exception
handler/thresholding interrupt.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-5-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, we append the MCA_IPID[InstanceId] to the bank name to create
the sysfs filename. The InstanceId field uniquely identifies a bank
instance but it doesn't look very nice for most banks.
Replace the InstanceId with a simpler, ascending (0, 1, ..) value.
Only use this in the sysfs name when there is more than 1 instance.
Otherwise, just use the bank's name as the sysfs name.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1484322741-41884-3-git-send-email-Yazen.Ghannam@amd.com
Link: http://lkml.kernel.org/r/20170123183514.13356-4-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And
besides we have other means for dealing with thermal events which are
much more suitable.
So let's kill the MCE logging part.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170105213846.GA12024@gmail.com
Link: http://lkml.kernel.org/r/20170123183514.13356-3-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
... and get rid of the annoying:
arch/x86/kernel/cpu/mcheck/mce-inject.c:97:13: warning: ‘mce_irq_ipi’ defined but not used [-Wunused-function]
when doing randconfig builds.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-2-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The equivalence ID was needed outside of the container scanning logic
but now, after this has been cleaned up, not anymore. Now, cont_desc.mc
is used to denote whether the container we're looking at has the proper
microcode patch for this CPU or not.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-17-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The idea was to not scan the microcode blob on each AP (Application
Processor) during boot and thus save us some milliseconds. However, on
architectures where the microcode engine is shared between threads, this
doesn't work. Here's why:
The microcode on CPU0, i.e., the first thread, gets updated. The second
thread, i.e., CPU1, i.e., the first AP walks into load_ucode_amd_ap(),
sees that there's no container cached and goes and scans for the proper
blob.
It finds it and as a last step of apply_microcode_early_amd(), it tries
to apply the patch but that core has already the updated microcode
revision which it has received through CPU0's update. So it returns
false and we do desc->size = -1 to prevent other APs from scanning.
However, the next AP, CPU2, has a different microcode engine which
hasn't been updated yet. The desc->size == -1 test prevents it from
scanning the blob anew and we fail to update it.
The fix is much more straight-forward than it looks: the BSP
(BootStrapping Processor), i.e., CPU0, caches the microcode patch
in amd_ucode_patch. We use that on the AP and try to apply it.
In the 99.9999% of cases where we have homogeneous cores - *not*
mixed-steppings - the application will be successful and we're good to
go.
In the remaining small set of systems, we will simply rescan the blob
and find (or not, if none present) the proper patch and apply it then.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-16-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No need to use the previously stashed info in the container - simply go
ahead and parse the initrd once more. It simplifies and streamlines the
code a whole lot.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-15-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use a version for both bitness by adding a helper which does the actual
container finding and parsing which can be used on any CPU - BSP or AP.
Streamlines the paths more.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-14-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Check final patch levels for AMD only on the BSP. This way, we decide
early and only once whether to continue loading or to leave the loader
disabled on such systems.
Simplify a lot.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-13-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the generic helper instead of semi-open-coding the procedure.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-11-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We have a container which we update/prepare each time before applying a
microcode patch instead of using a global.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-10-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Get CPUID(1).EAX value once per CPU and propagate value into the callers
instead of conveniently calling it every time.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-9-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It was pretty clumsy before and the whole work of parsing the microcode
containers was spread around the functions wrongly.
Clean it up so that there's a main scan_containers() function which
iterates over the microcode blob and picks apart the containers glued
together. For each container, it calls a parse_container() helper which
concentrates on one container only: sanity-checking, parsing, counting
microcode patches in there, etc.
It makes much more sense now and it is actually very readable. Oh, and
we luvz a diffstat removing more crap than adding.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-8-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make it into a container descriptor which is being passed around and
stores important info like the matching container and the patch for the
current CPU. Make it static too.
Later patches will use this and thus get rid of a double container
parsing.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-7-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The whole driver calls this "mc", do that here too.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-6-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No need to have it marked "inline" - let gcc decide. Also, shorten the
argument name and simplify while-test.
While at it, make it into a proper for-loop and simplify it even more,
as tglx suggests.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-5-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This was meant to save us the scanning of the microcode containter in
the initrd since the first AP had already done that but it can also hurt
us:
Imagine a single hyperthreaded CPU (Intel(R) Atom(TM) CPU N270, for
example) which updates the microcode on the BSP but since the microcode
engine is shared between the two threads, the update on CPU1 doesn't
happen because it has already happened on CPU0 and we don't find a newer
microcode revision on CPU1.
Which doesn't set the intel_ucode_patch pointer and at initrd
jettisoning time we don't save the microcode patch for later
application.
Now, when we suspend to RAM, the loaded microcode gets cleared so we
need to reload but there's no patch saved in the cache.
Removing the optimization fixes this issue and all is fine and dandy.
Fixes: 06b8534cb7 ("x86/microcode: Rework microcode loading")
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170120202955.4091-2-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As part of the effort to separate out architecture specific code,
extract hypervisor version information in an architecture specific
file.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to separate out architecture specific code,
consolidate all Hyper-V specific clocksource code to an architecture
specific code.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mike reported that he could trigger the WARN_ON_ONCE() in
set_sched_clock_stable() using hotplug.
This exposed a fundamental problem with the interface, we should never
mark the TSC stable if we ever find it to be unstable. Therefore
set_sched_clock_stable() is a broken interface.
The reason it existed is that not having it is a pain, it means all
relevant architecture code needs to call clear_sched_clock_stable()
where appropriate.
Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK ia64
and parisc are trivial in that they never called
set_sched_clock_stable(), so add an unconditional call to
clear_sched_clock_stable() to them.
For x86 the story is a lot more involved, and what this patch tries to
do is ensure we preserve the status quo. So even is Cyrix or Transmeta
have usable TSC they never called set_sched_clock_stable() so they now
get an explicit mark unstable.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9881b024b7 ("sched/clock: Delay switching sched_clock to stable")
Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As part of the effort to separate out architecture specific code, move the
hypercall page setup to an architecture specific file.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In generic_load_microcode(), curr_mc_size is the size of the last
allocated buffer and since we have this performance "optimization"
there to vmalloc a new buffer only when the current one is bigger,
curr_mc_size ends up becoming the size of the biggest buffer we've seen
so far.
However, we end up saving the microcode patch which matches our CPU
and its size is not curr_mc_size but the respective mc_size during the
iteration while we're staring at it.
So save that mc_size into a separate variable and use it to store the
previously found microcode buffer.
Without this fix, we could get oops like this:
BUG: unable to handle kernel paging request at ffffc9000e30f000
IP: __memcpy+0x12/0x20
...
Call Trace:
? kmemdup+0x43/0x60
__alloc_microcode_buf+0x44/0x70
save_microcode_patch+0xd4/0x150
generic_load_microcode+0x1b8/0x260
request_microcode_user+0x15/0x20
microcode_write+0x91/0x100
__vfs_write+0x34/0x120
vfs_write+0xc1/0x130
SyS_write+0x56/0xc0
do_syscall_64+0x6c/0x160
entry_SYSCALL64_slow_path+0x25/0x25
Fixes: 06b8534cb7 ("x86/microcode: Rework microcode loading")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/4f33cbfd-44f2-9bed-3b66-7446cd14256f@ce.jp.nec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We allocate struct ucode_patch here. @size is the size of microcode data
and used for kmemdup() later in this function.
Fixes: 06b8534cb7 ("x86/microcode: Rework microcode loading")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/7a730dc9-ac17-35c4-fe76-dfc94e5ecd95@ce.jp.nec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since on Intel we're required to do CPUID(1) first, before reading
the microcode revision MSR, let's add a special helper which does the
required steps so that we don't forget to do them next time, when we
want to read the microcode revision.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20170109114147.5082-4-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Intel supplies the microcode revision value in MSR 0x8b
(IA32_BIOS_SIGN_ID) after CPUID(1) has been executed. Execute it each
time before reading that MSR.
It used to do sync_core() which did do CPUID but
c198b121b1 ("x86/asm: Rewrite sync_core() to use IRET-to-self")
changed the sync_core() implementation so we better make the microcode
loading case explicit, as the SDM documents it.
Reported-and-tested-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20170109114147.5082-3-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The following commit:
8196dab4fc ("x86/cpu: Get rid of compute_unit_id")
... broke the initial strategy for Bulldozer-based cores' topology,
where we consider each thread of a compute unit a standalone core
and not a HT or SMT thread.
Revert to the firmware-supplied core_id numbering and do not make
them thread siblings as we don't consider them for such even if they
technically are, more or less.
Reported-and-tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # v4.6+
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8196dab4fc ("x86/cpu: Get rid of compute_unit_id")
Link: http://lkml.kernel.org/r/20170105092638.5247-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A negative number can be specified in the cmdline which will be used as
setup_clear_cpu_cap() argument. With that we can clear/set some bit in
memory predceeding boot_cpu_data/cpu_caps_cleared which may cause kernel
to misbehave. This patch adds lower bound check to setup_disablecpuid().
Boris Petkov reproduced a crash:
[ 1.234575] BUG: unable to handle kernel paging request at ffffffff858bd540
[ 1.236535] IP: memcpy_erms+0x6/0x10
Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andi.kleen@intel.com
Cc: bp@alien8.de
Cc: dave.hansen@linux.intel.com
Cc: luto@kernel.org
Cc: slaoub@gmail.com
Fixes: ac72e7888a ("x86: add generic clearcpuid=... option")
Link: http://lkml.kernel.org/r/1482933340-11857-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If mce_device_init() fails then the mce device pointer is NULL and the
AMD mce code happily dereferences it.
Add a sanity check.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>