Commit Graph

32489 Commits

Author SHA1 Message Date
Linus Torvalds
ab2486a9ee Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Thomas Gleixner:
 "A small set of updates for the FPU code:

   - Make the no387/nofxsr command line options useful by restricting
     them to 32bit and actually clearing all dependencies to prevent
     random crashes and malfunction.

   - Simplify and cleanup the kernel_fpu_*() helpers"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Inline fpu__xstate_clear_all_cpu_caps()
  x86/fpu: Make 'no387' and 'nofxsr' command line options useful
  x86/fpu: Remove the fpu__save() export
  x86/fpu: Simplify kernel_fpu_begin()
  x86/fpu: Simplify kernel_fpu_end()
2019-07-08 11:45:26 -07:00
Linus Torvalds
0d37dde706 Merge branch 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vsyscall updates from Thomas Gleixner:
 "Further hardening of the legacy vsyscall by providing support for
  execute only mode and switching the default to it.

  This prevents a certain class of attacks which rely on the vsyscall
  page being accessible at a fixed address in the canonical kernel
  address space"

* 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/x86: Add a test for process_vm_readv() on the vsyscall page
  x86/vsyscall: Add __ro_after_init to global variables
  x86/vsyscall: Change the default vsyscall mode to xonly
  selftests/x86/vsyscall: Verify that vsyscall=none blocks execution
  x86/vsyscall: Document odd SIGSEGV error code for vsyscalls
  x86/vsyscall: Show something useful on a read fault
  x86/vsyscall: Add a new vsyscall=xonly mode
  Documentation/admin: Remove the vsyscall=native documentation
2019-07-08 11:42:09 -07:00
Linus Torvalds
0902d5011c Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x96 apic updates from Thomas Gleixner:
 "Updates for the x86 APIC interrupt handling and APIC timer:

   - Fix a long standing issue with spurious interrupts which was caused
     by the big vector management rework a few years ago. Robert Hodaszi
     provided finally enough debug data and an excellent initial failure
     analysis which allowed to understand the underlying issues.

     This contains a change to the core interrupt management code which
     is required to handle this correctly for the APIC/IO_APIC. The core
     changes are NOOPs for most architectures except ARM64. ARM64 is not
     impacted by the change as confirmed by Marc Zyngier.

   - Newer systems allow to disable the PIT clock for power saving
     causing panic in the timer interrupt delivery check of the IO/APIC
     when the HPET timer is not enabled either. While the clock could be
     turned on this would cause an endless whack a mole game to chase
     the proper register in each affected chipset.

     These systems provide the relevant frequencies for TSC, CPU and the
     local APIC timer via CPUID and/or MSRs, which allows to avoid the
     PIT/HPET based calibration. As the calibration code is the only
     usage of the legacy timers on modern systems and is skipped anyway
     when the frequencies are known already, there is no point in
     setting up the PIT and actually checking for the interrupt delivery
     via IO/APIC.

     To achieve this on a wide variety of platforms, the CPUID/MSR based
     frequency readout has been made more robust, which also allowed to
     remove quite some workarounds which turned out to be not longer
     required. Thanks to Daniel Drake for analysis, patches and
     verification"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Seperate unused system vectors from spurious entry again
  x86/irq: Handle spurious interrupt after shutdown gracefully
  x86/ioapic: Implement irq_get_irqchip_state() callback
  genirq: Add optional hardware synchronization for shutdown
  genirq: Fix misleading synchronize_irq() documentation
  genirq: Delay deactivation in free_irq()
  x86/timer: Skip PIT initialization on modern chipsets
  x86/apic: Use non-atomic operations when possible
  x86/apic: Make apic_bsp_setup() static
  x86/tsc: Set LAPIC timer period to crystal clock frequency
  x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period'
  x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency
2019-07-08 11:22:57 -07:00
Linus Torvalds
927ba67a63 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The timer and timekeeping departement delivers:

  Core:

   - The consolidation of the VDSO code into a generic library including
     the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
     route through the relevant maintainer trees and should end up in
     5.4.

     This gets rid of the unnecessary different copies of the same code
     and brings all architectures on the same level of VDSO
     functionality.

   - Make the NTP user space interface more robust by restricting the
     TAI offset to prevent undefined behaviour. Includes a selftest.

   - Validate user input in the compat settimeofday() syscall to catch
     invalid values which would be turned into valid values by a
     multiplication overflow

   - Consolidate the time accessors

   - Small fixes, improvements and cleanups all over the place

  Drivers:

   - Support for the NXP system counter, TI davinci timer

   - Move the Microsoft HyperV clocksource/events code into the
     drivers/clocksource directory so it can be shared between x86 and
     ARM64.

   - Overhaul of the Tegra driver

   - Delay timer support for IXP4xx

   - Small fixes, improvements and cleanups as usual"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  time: Validate user input in compat_settimeofday()
  timer: Document TIMER_PINNED
  clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
  clocksource/drivers: Make Hyper-V clocksource ISA agnostic
  MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
  hrtimer: Use a bullet for the returns bullet list
  arm64: vdso: Fix compilation with clang older than 8
  arm64: compat: Fix __arch_get_hw_counter() implementation
  arm64: Fix __arch_get_hw_counter() implementation
  lib/vdso: Make delta calculation work correctly
  MAINTAINERS: Add entry for the generic VDSO library
  arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
  arm64: vdso: Remove unnecessary asm-offsets.c definitions
  vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
  clocksource/drivers/davinci: Add support for clocksource
  clocksource/drivers/davinci: Add support for clockevents
  clocksource/drivers/tegra: Set up maximum-ticks limit properly
  clocksource/drivers/tegra: Cycles can't be 0
  clocksource/drivers/tegra: Restore base address before cleanup
  clocksource/drivers/tegra: Add verbose definition for 1MHz constant
  ...
2019-07-08 11:06:29 -07:00
Linus Torvalds
e0e86b111b Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP/hotplug updates from Thomas Gleixner:
 "A small set of updates for SMP and CPU hotplug:

   - Abort disabling secondary CPUs in the freezer when a wakeup is
     pending instead of evaluating it only after all CPUs have been
     offlined.

   - Remove the shared annotation for the strict per CPU cfd_data in the
     smp function call core code.

   - Remove the return values of smp_call_function() and on_each_cpu()
     as they are unconditionally 0. Fixup the few callers which actually
     bothered to check the return value"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Remove smp_call_function() and on_each_cpu() return values
  smp: Do not mark call_function_data as shared
  cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending
  cpu/hotplug: Fix notify_cpu_starting() reference in bringup_wait_for_ap()
2019-07-08 10:39:56 -07:00
Linus Torvalds
dfd437a257 arm64 updates for 5.3:
- arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}
 
 - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
   manage the permissions of executable vmalloc regions more strictly
 
 - Slight performance improvement by keeping softirqs enabled while
   touching the FPSIMD/SVE state (kernel_neon_begin/end)
 
 - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new XAFLAG
   and AXFLAG instructions for floating point comparison flags
   manipulation) and FRINT (rounding floating point numbers to integers)
 
 - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
   BROKEN due to some bugs (now fixed)
 
 - Improve parking of stopped CPUs and implement an arm64-specific
   panic_smp_self_stop() to avoid warning on not being able to stop
   secondary CPUs during panic
 
 - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
   platforms
 
 - perf: DDR performance monitor support for iMX8QXP
 
 - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
   cope with a system cache info not exposed via the CPUID registers
 
 - Avoid warning on hardware cache line size greater than
   ARCH_DMA_MINALIGN if the system is fully coherent
 
 - arm64 do_page_fault() and hugetlb cleanups
 
 - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)
 
 - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the 'arm_boot_flags'
   introduced in 5.1)
 
 - CONFIG_RANDOMIZE_BASE now enabled in defconfig
 
 - Allow the selection of ARM64_MODULE_PLTS, currently only done via
   RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
   over into the vmalloc area
 
 - Make ZONE_DMA32 configurable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl0eHqcACgkQa9axLQDI
 XvFyNA/+L+bnkz8m3ncydlqqfXomQn4eJJVQ8Uksb0knJz+1+3CUxxbO4ry4jXZN
 fMkbggYrDPRKpDbsUl0lsRipj7jW9bqan+N37c3SWqCkgb6HqDaHViwxdx6Ec/Uk
 gHudozDSPh/8c7hxGcSyt/CFyuW6b+8eYIQU5rtIgz8aVY2BypBvS/7YtYCbIkx0
 w4CFleRTK1zXD5mJQhrc6jyDx659sVkrAvdhf6YIymOY8nBTv40vwdNo3beJMYp8
 Po/+0Ixu+VkHUNtmYYZQgP/AGH96xiTcRnUqd172JdtRPpCLqnLqwFokXeVIlUKT
 KZFMDPzK+756Ayn4z4huEePPAOGlHbJje8JVNnFyreKhVVcCotW7YPY/oJR10bnc
 eo7yD+DxABTn+93G2yP436bNVa8qO1UqjOBfInWBtnNFJfANIkZweij/MQ6MjaTA
 o7KtviHnZFClefMPoiI7HDzwL8XSmsBDbeQ04s2Wxku1Y2xUHLx4iLmadwLQ1ZPb
 lZMTZP3N/T1554MoURVA1afCjAwiqU3bt1xDUGjbBVjLfSPBAn/25IacsG9Li9AF
 7Rp1M9VhrfLftjFFkB2HwpbhRASOxaOSx+EI3kzEfCtM2O9I1WHgP3rvCdc3l0HU
 tbK0/IggQicNgz7GSZ8xDlWPwwSadXYGLys+xlMZEYd3pDIOiFc=
 =0TDT
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}

 - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
   manage the permissions of executable vmalloc regions more strictly

 - Slight performance improvement by keeping softirqs enabled while
   touching the FPSIMD/SVE state (kernel_neon_begin/end)

 - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
   XAFLAG and AXFLAG instructions for floating point comparison flags
   manipulation) and FRINT (rounding floating point numbers to integers)

 - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
   BROKEN due to some bugs (now fixed)

 - Improve parking of stopped CPUs and implement an arm64-specific
   panic_smp_self_stop() to avoid warning on not being able to stop
   secondary CPUs during panic

 - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
   platforms

 - perf: DDR performance monitor support for iMX8QXP

 - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
   cope with a system cache info not exposed via the CPUID registers

 - Avoid warning on hardware cache line size greater than
   ARCH_DMA_MINALIGN if the system is fully coherent

 - arm64 do_page_fault() and hugetlb cleanups

 - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)

 - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
   'arm_boot_flags' introduced in 5.1)

 - CONFIG_RANDOMIZE_BASE now enabled in defconfig

 - Allow the selection of ARM64_MODULE_PLTS, currently only done via
   RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
   over into the vmalloc area

 - Make ZONE_DMA32 configurable

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
  perf: arm_spe: Enable ACPI/Platform automatic module loading
  arm_pmu: acpi: spe: Add initial MADT/SPE probing
  ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
  ACPI/PPTT: Modify node flag detection to find last IDENTICAL
  x86/entry: Simplify _TIF_SYSCALL_EMU handling
  arm64: rename dump_instr as dump_kernel_instr
  arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
  arm64: Implement panic_smp_self_stop()
  arm64: Improve parking of stopped CPUs
  arm64: Expose FRINT capabilities to userspace
  arm64: Expose ARMv8.5 CondM capability to userspace
  arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
  arm64: ARM64_MODULES_PLTS must depend on MODULES
  arm64: bpf: do not allocate executable memory
  arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
  arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
  arm64: module: create module allocations without exec permissions
  arm64: Allow user selection of ARM64_MODULE_PLTS
  acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
  arm64: Allow selecting Pseudo-NMI again
  ...
2019-07-08 09:54:55 -07:00
Sebastian Andrzej Siewior
7891bc0ab7 x86/fpu: Inline fpu__xstate_clear_all_cpu_caps()
All fpu__xstate_clear_all_cpu_caps() does is to invoke one simple
function since commit

  73e3a7d2a7 ("x86/fpu: Remove the explicit clearing of XSAVE dependent features")

so invoke that function directly and remove the wrapper.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704060743.rvew4yrjd6n33uzx@linutronix.de
2019-07-07 12:01:47 +02:00
Sebastian Andrzej Siewior
9838e3bff0 x86/fpu: Make 'no387' and 'nofxsr' command line options useful
The command line option `no387' is designed to disable the FPU
entirely. This only 'works' with CONFIG_MATH_EMULATION enabled.

But on 64bit this cannot work because user space expects SSE to work which
required basic FPU support. MATH_EMULATION does not help because SSE is not
emulated.

The command line option `nofxsr' should also be limited to 32bit because
FXSR is part of the required flags on 64bit so turning it off is not
possible.

Clearing X86_FEATURE_FPU without emulation enabled will not work anyway and
hang in fpu__init_system_early_generic() before the console is enabled.

Setting additioal dependencies, ensures that the CPU still boots on a
modern CPU. Otherwise, dropping FPU will leave FXSR enabled causing the
kernel to crash early in fpu__init_system_mxcsr().

With XSAVE support it will crash in fpu__init_cpu_xstate(). The problem is
that xsetbv() with XMM set and SSE cleared is not allowed.  That means
XSAVE has to be disabled. The XSAVE support is disabled in
fpu__init_system_xstate_size_legacy() but it is too late. It can be
removed, it has been added in commit

  1f999ab5a1 ("x86, xsave: Disable xsave in i387 emulation mode")

to use `no387' on a CPU with XSAVE support.

All this happens before console output.

After hat, the next possible crash is in RAID6 detect code because MMX
remained enabled. With a 3DNOW enabled config it will explode in memcpy()
for instance due to kernel_fpu_begin() but this is unconditionally enabled.

This is enough to boot a Debian Wheezy on a 32bit qemu "host" CPU which
supports everything up to XSAVES, AVX2 without 3DNOW. Later, Debian
increased the minimum requirements to i686 which means it does not boot
userland atleast due to CMOV.

After masking the additional features it still keeps SSE4A and 3DNOW*
enabled (if present on the host) but those are unused in the kernel.

Restrict `no387' and `nofxsr' otions to 32bit only. Add dependencies for
FPU, FXSR to additionaly mask CMOV, MMX, XSAVE if FXSR or FPU is cleared.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190703083247.57kjrmlxkai3vpw3@linutronix.de
2019-07-07 12:01:46 +02:00
Linus Torvalds
9fdb86c8cf x86 bugfix patches and one compilation fix for ARM.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdH7H6AAoJEL/70l94x66D3XcH/0If8IAHn746Y+2QoasbmW0m
 lACzkzNZYzKWUgzJ9+r4eNe0tq7wUzU546Jl2GgOjIcQGnRCPAJMGodGTEq5AGo2
 XWGHXkKLa9w3bSCRi9Ov7d1CGO9oDk6PtkTP4xT0oVyEtsgPdKWdEz2dDe8BnK7T
 BynVcz3JZOm+vE4N1GusjkdN6hbVFBdTZNSvN9uE3iNoUBUoe98Ctv2HiSca9tDF
 oPJpSWLSThC/cRXn1EX7uXiUTZ2bTaS+mCdwmR6QS5VBEHR+ssWzHq4ru/U0p/vO
 7L9AZoyCTqW5pOMuYZFnfntnbu54RUnfc7A0jyqzM2SFeBt8h9hpsVXoWmtnL5k=
 =qzdP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86 bugfix patches and one compilation fix for ARM"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: arm64/sve: Fix vq_present() macro to yield a bool
  KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC
  KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS
  KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM
  KVM: x86: degrade WARN to pr_warn_ratelimited
2019-07-05 19:13:24 -07:00
Linus Torvalds
550d1f5bda This includes three fixes:
- Fixes a deadlock from a previous fix to keep module loading
    and function tracing text modifications from stepping on each other.
    (this has a few patches to help document the issue in comments)
 
  - Fix a crash when the snapshot buffer gets out of sync with the
    main ring buffer.
 
  - Fix a memory leak when reading the memory logs
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXRzBCBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnDaAP9qTFBOFtgIGCT5wVP8xjQeESxh1b8R
 tbaT7/U2oPpeiwEAvp1mYo5UYcc8KauBqVaLSLJVN4pv07xiZF5Qgh9C1QE=
 =m2IT
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "This includes three fixes:

   - Fix a deadlock from a previous fix to keep module loading and
     function tracing text modifications from stepping on each other
     (this has a few patches to help document the issue in comments)

   - Fix a crash when the snapshot buffer gets out of sync with the main
     ring buffer

   - Fix a memory leak when reading the memory logs"

* tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace/x86: Anotate text_mutex split between ftrace_arch_code_modify_post_process() and ftrace_arch_code_modify_prepare()
  tracing/snapshot: Resize spare buffer if size changed
  tracing: Fix memory leak in tracing_err_log_open()
  ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare()
  ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()
2019-07-04 10:26:17 +09:00
Michael Kelley
dd2cb34861 clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
Continue consolidating Hyper-V clock and timer code into an ISA
independent Hyper-V clocksource driver.

Move the existing clocksource code under drivers/hv and arch/x86 to the new
clocksource driver while separating out the ISA dependencies. Update
Hyper-V initialization to call initialization and cleanup routines since
the Hyper-V synthetic clock is not independently enumerated in ACPI.

Update Hyper-V clocksource users in KVM and VDSO to get definitions from
the new include file.

No behavior is changed and no new functionality is added.

Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "bp@alien8.de" <bp@alien8.de>
Cc: "will.deacon@arm.com" <will.deacon@arm.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: "sashal@kernel.org" <sashal@kernel.org>
Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
Cc: "arnd@arndb.de" <arnd@arndb.de>
Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
Cc: "paul.burton@mips.com" <paul.burton@mips.com>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "salyzyn@android.com" <salyzyn@android.com>
Cc: "pcc@google.com" <pcc@google.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>
Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
Cc: "huw@codeweavers.com" <huw@codeweavers.com>
Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com
2019-07-03 11:00:59 +02:00
Michael Kelley
fd1fea6834 clocksource/drivers: Make Hyper-V clocksource ISA agnostic
Hyper-V clock/timer code and data structures are currently mixed
in with other code in the ISA independent drivers/hv directory as
well as the ISA dependent Hyper-V code under arch/x86.

Consolidate this code and data structures into a Hyper-V clocksource driver
to better follow the Linux model. In doing so, separate out the ISA
dependent portions so the new clocksource driver works for x86 and for the
in-process Hyper-V on ARM64 code.

To start, move the existing clockevents code to create the new clocksource
driver. Update the VMbus driver to call initialization and cleanup routines
since the Hyper-V synthetic timers are not independently enumerated in
ACPI.

No behavior is changed and no new functionality is added.

Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "bp@alien8.de" <bp@alien8.de>
Cc: "will.deacon@arm.com" <will.deacon@arm.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: "sashal@kernel.org" <sashal@kernel.org>
Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
Cc: "arnd@arndb.de" <arnd@arndb.de>
Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
Cc: "paul.burton@mips.com" <paul.burton@mips.com>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "salyzyn@android.com" <salyzyn@android.com>
Cc: "pcc@google.com" <pcc@google.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>
Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
Cc: "huw@codeweavers.com" <huw@codeweavers.com>
Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Link: https://lkml.kernel.org/r/1561955054-1838-2-git-send-email-mikelley@microsoft.com
2019-07-03 11:00:59 +02:00
Thomas Gleixner
3419240495 Merge branch 'timers/vdso' into timers/core
so the hyper-v clocksource update can be applied.
2019-07-03 10:50:21 +02:00
Thomas Gleixner
f8a8fe61fe x86/irq: Seperate unused system vectors from spurious entry again
Quite some time ago the interrupt entry stubs for unused vectors in the
system vector range got removed and directly mapped to the spurious
interrupt vector entry point.

Sounds reasonable, but it's subtly broken. The spurious interrupt vector
entry point pushes vector number 0xFF on the stack which makes the whole
logic in __smp_spurious_interrupt() pointless.

As a consequence any spurious interrupt which comes from a vector != 0xFF
is treated as a real spurious interrupt (vector 0xFF) and not
acknowledged. That subsequently stalls all interrupt vectors of equal and
lower priority, which brings the system to a grinding halt.

This can happen because even on 64-bit the system vector space is not
guaranteed to be fully populated. A full compile time handling of the
unused vectors is not possible because quite some of them are conditonally
populated at runtime.

Bring the entry stubs back, which wastes 160 bytes if all stubs are unused,
but gains the proper handling back. There is no point to selectively spare
some of the stubs which are known at compile time as the required code in
the IDT management would be way larger and convoluted.

Do not route the spurious entries through common_interrupt and do_IRQ() as
the original code did. Route it to smp_spurious_interrupt() which evaluates
the vector number and acts accordingly now that the real vector numbers are
handed in.

Fixup the pr_warn so the actual spurious vector (0xff) is clearly
distiguished from the other vectors and also note for the vectored case
whether it was pending in the ISR or not.

 "Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen."
 "Spurious interrupt vector 0xed on CPU#1. Acked."
 "Spurious interrupt vector 0xee on CPU#1. Not pending!."

Fixes: 2414e021ac ("x86: Avoid building unused IRQ entry stubs")
Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jan Beulich <jbeulich@suse.com>
Link: https://lkml.kernel.org/r/20190628111440.550568228@linutronix.de
2019-07-03 10:12:31 +02:00
Thomas Gleixner
b7107a67f0 x86/irq: Handle spurious interrupt after shutdown gracefully
Since the rework of the vector management, warnings about spurious
interrupts have been reported. Robert provided some more information and
did an initial analysis. The following situation leads to these warnings:

   CPU 0                  CPU 1               IO_APIC

                                              interrupt is raised
                                              sent to CPU1
			  Unable to handle
			  immediately
			  (interrupts off,
			   deep idle delay)
   mask()
   ...
   free()
     shutdown()
     synchronize_irq()
     clear_vector()
                          do_IRQ()
                            -> vector is clear

Before the rework the vector entries of legacy interrupts were statically
assigned and occupied precious vector space while most of them were
unused. Due to that the above situation was handled silently because the
vector was handled and the core handler of the assigned interrupt
descriptor noticed that it is shut down and returned.

While this has been usually observed with legacy interrupts, this situation
is not limited to them. Any other interrupt source, e.g. MSI, can cause the
same issue.

After adding proper synchronization for level triggered interrupts, this
can only happen for edge triggered interrupts where the IO-APIC obviously
cannot provide information about interrupts in flight.

While the spurious warning is actually harmless in this case it worries
users and driver developers.

Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead
of VECTOR_UNUSED when the vector is freed up.

If that above late handling happens the spurious detector will not complain
and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on
that line will trigger the spurious warning as before.

Fixes: 464d12309e ("x86/vector: Switch IOAPIC to global reservation mode")
Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>-
Tested-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.459647741@linutronix.de
2019-07-03 10:12:30 +02:00
Thomas Gleixner
dfe0cf8b51 x86/ioapic: Implement irq_get_irqchip_state() callback
When an interrupt is shut down in free_irq() there might be an inflight
interrupt pending in the IO-APIC remote IRR which is not yet serviced. That
means the interrupt has been sent to the target CPUs local APIC, but the
target CPU is in a state which delays the servicing.

So free_irq() would proceed to free resources and to clear the vector
because synchronize_hardirq() does not see an interrupt handler in
progress.

That can trigger a spurious interrupt warning, which is harmless and just
confuses users, but it also can leave the remote IRR in a stale state
because once the handler is invoked the interrupt resources might be freed
already and therefore acknowledgement is not possible anymore.

Implement the irq_get_irqchip_state() callback for the IO-APIC irq chip. The
callback is invoked from free_irq() via __synchronize_hardirq(). Check the
remote IRR bit of the interrupt and return 'in flight' if it is set and the
interrupt is configured in level mode. For edge mode the remote IRR has no
meaning.

As this is only meaningful for level triggered interrupts this won't cure
the potential spurious interrupt warning for edge triggered interrupts, but
the edge trigger case does not result in stale hardware state. This has to
be addressed at the vector/interrupt entry level seperately.

Fixes: 464d12309e ("x86/vector: Switch IOAPIC to global reservation mode")
Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.370295517@linutronix.de
2019-07-03 10:12:30 +02:00
Jiri Kosina
074376ac0e ftrace/x86: Anotate text_mutex split between ftrace_arch_code_modify_post_process() and ftrace_arch_code_modify_prepare()
ftrace_arch_code_modify_prepare() is acquiring text_mutex, while the
corresponding release is happening in ftrace_arch_code_modify_post_process().

This has already been documented in the code, but let's also make the fact
that this is intentional clear to the semantic analysis tools such as sparse.

Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1906292321170.27227@cbobk.fhfr.pm

Fixes: 39611265ed ("ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare()")
Fixes: d5b844a2cf ("ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-02 15:41:35 -04:00
Wanpeng Li
bb34e690e9 KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC
Thomas reported that:

 | Background:
 |
 |    In preparation of supporting IPI shorthands I changed the CPU offline
 |    code to software disable the local APIC instead of just masking it.
 |    That's done by clearing the APIC_SPIV_APIC_ENABLED bit in the APIC_SPIV
 |    register.
 |
 | Failure:
 |
 |    When the CPU comes back online the startup code triggers occasionally
 |    the warning in apic_pending_intr_clear(). That complains that the IRRs
 |    are not empty.
 |
 |    The offending vector is the local APIC timer vector who's IRR bit is set
 |    and stays set.
 |
 | It took me quite some time to reproduce the issue locally, but now I can
 | see what happens.
 |
 | It requires apicv_enabled=0, i.e. full apic emulation. With apicv_enabled=1
 | (and hardware support) it behaves correctly.
 |
 | Here is the series of events:
 |
 |     Guest CPU
 |
 |     goes down
 |
 |       native_cpu_disable()
 |
 | 			apic_soft_disable();
 |
 |     play_dead()
 |
 |     ....
 |
 |     startup()
 |
 |       if (apic_enabled())
 |         apic_pending_intr_clear()	<- Not taken
 |
 |      enable APIC
 |
 |         apic_pending_intr_clear()	<- Triggers warning because IRR is stale
 |
 | When this happens then the deadline timer or the regular APIC timer -
 | happens with both, has fired shortly before the APIC is disabled, but the
 | interrupt was not serviced because the guest CPU was in an interrupt
 | disabled region at that point.
 |
 | The state of the timer vector ISR/IRR bits:
 |
 |     	     	       	        ISR     IRR
 | before apic_soft_disable()    0	      1
 | after apic_soft_disable()     0	      1
 |
 | On startup		      		 0	      1
 |
 | Now one would assume that the IRR is cleared after the INIT reset, but this
 | happens only on CPU0.
 |
 | Why?
 |
 | Because our CPU0 hotplug is just for testing to make sure nothing breaks
 | and goes through an NMI wakeup vehicle because INIT would send it through
 | the boots-trap code which is not really working if that CPU was not
 | physically unplugged.
 |
 | Now looking at a real world APIC the situation in that case is:
 |
 |     	     	       	      	ISR     IRR
 | before apic_soft_disable()    0	      1
 | after apic_soft_disable()     0	      1
 |
 | On startup		      		 0	      0
 |
 | Why?
 |
 | Once the dying CPU reenables interrupts the pending interrupt gets
 | delivered as a spurious interupt and then the state is clear.
 |
 | While that CPU0 hotplug test case is surely an esoteric issue, the APIC
 | emulation is still wrong, Even if the play_dead() code would not enable
 | interrupts then the pending IRR bit would turn into an ISR .. interrupt
 | when the APIC is reenabled on startup.

From SDM 10.4.7.2 Local APIC State After It Has Been Software Disabled
* Pending interrupts in the IRR and ISR registers are held and require
  masking or handling by the CPU.

In Thomas's testing, hardware cpu will not respect soft disable LAPIC
when IRR has already been set or APICv posted-interrupt is in flight,
so we can skip soft disable APIC checking when clearing IRR and set ISR,
continue to respect soft disable APIC when attempting to set IRR.

Reported-by: Rong Chen <rong.a.chen@intel.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 19:02:46 +02:00
Liran Alon
323d73a8ec KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS
Currently KVM_STATE_NESTED_EVMCS is used to signal that eVMCS
capability is enabled on vCPU.
As indicated by vmx->nested.enlightened_vmcs_enabled.

This is quite bizarre as userspace VMM should make sure to expose
same vCPU with same CPUID values in both source and destination.
In case vCPU is exposed with eVMCS support on CPUID, it is also
expected to enable KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability.
Therefore, KVM_STATE_NESTED_EVMCS is redundant.

KVM_STATE_NESTED_EVMCS is currently used on restore path
(vmx_set_nested_state()) only to enable eVMCS capability in KVM
and to signal need_vmcs12_sync such that on next VMEntry to guest
nested_sync_from_vmcs12() will be called to sync vmcs12 content
into eVMCS in guest memory.
However, because restore nested-state is rare enough, we could
have just modified vmx_set_nested_state() to always signal
need_vmcs12_sync.

From all the above, it seems that we could have just removed
the usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve
backwards migration compatibility, we cannot do that.
(vmx_get_nested_state() needs to signal flag when migrating from
new kernel to old kernel).

Returning KVM_STATE_NESTED_EVMCS when just vCPU have eVMCS enabled
have a bad side-effect of userspace VMM having to send nested-state
from source to destination as part of migration stream. Even if
guest have never used eVMCS as it doesn't even run a nested
hypervisor workload. This requires destination userspace VMM and
KVM to support setting nested-state. Which make it more difficult
to migrate from new host to older host.
To avoid this, change KVM_STATE_NESTED_EVMCS to signal eVMCS is
not only enabled but also active. i.e. Guest have made some
eVMCS active via an enlightened VMEntry. i.e. vmcs12 is copied
from eVMCS and therefore should be restored into eVMCS resident
in memory (by copy_vmcs12_to_enlightened()).

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 19:02:45 +02:00
Liran Alon
65b712f156 KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM
As comment in code specifies, SMM temporarily disables VMX so we cannot
be in guest mode, nor can VMLAUNCH/VMRESUME be pending.

However, code currently assumes that these are the only flags that can be
set on kvm_state->flags. This is not true as KVM_STATE_NESTED_EVMCS
can also be set on this field to signal that eVMCS should be enabled.

Therefore, fix code to check for guest-mode and pending VMLAUNCH/VMRESUME
explicitly.

Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 19:02:44 +02:00
Paolo Bonzini
3f16a5c318 KVM: x86: degrade WARN to pr_warn_ratelimited
This warning can be triggered easily by userspace, so it should certainly not
cause a panic if panic_on_warn is set.

Reported-by: syzbot+c03f30b4f4c46bdf8575@syzkaller.appspotmail.com
Suggested-by: Alexander Potapenko <glider@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 19:02:44 +02:00
Linus Torvalds
728254541e Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes all over the place:

   - might_sleep() atomicity fix in the microcode loader

   - resctrl boundary condition fix

   - APIC arithmethics bug fix for frequencies >= 4.2 GHz

   - three 5-level paging crash fixes

   - two speculation fixes

   - a perf/stacktrace fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Fall back to using frame pointers for generated code
  perf/x86: Always store regs->ip in perf_callchain_kernel()
  x86/speculation: Allow guests to use SSBD even if host does not
  x86/mm: Handle physical-virtual alignment mismatch in phys_p4d_init()
  x86/boot/64: Add missing fixup_pointer() for next_early_pgt access
  x86/boot/64: Fix crash if kernel image crosses page table boundary
  x86/apic: Fix integer overflow on 10 bit left shift of cpu_khz
  x86/resctrl: Prevent possible overrun during bitmap operations
  x86/microcode: Fix the microcode load on CPU hotplug for real
2019-06-29 19:42:30 +08:00
Linus Torvalds
57103eb7c6 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixes, most of them related to bugs perf fuzzing found in the
  x86 code"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/regs: Use PERF_REG_EXTENDED_MASK
  perf/x86: Remove pmu->pebs_no_xmm_regs
  perf/x86: Clean up PEBS_XMM_REGS
  perf/x86/regs: Check reserved bits
  perf/x86: Disable extended registers for non-supported PMUs
  perf/ioctl: Add check for the sample_period value
  perf/core: Fix perf_sample_regs_user() mm check
2019-06-29 19:39:17 +08:00
Linus Torvalds
a7211bc9f3 Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
 "Four fixes:
   - fix a kexec crash on arm64
   - fix a reboot crash on some Android platforms
   - future-proof the code for upcoming ACPI 6.2 changes
   - fix a build warning on x86"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efibc: Replace variable set function in notifier call
  x86/efi: fix a -Wtype-limits compilation warning
  efi/bgrt: Drop BGRT status field reserved bits check
  efi/memreserve: deal with memreserve entries in unmapped memory
2019-06-29 19:32:09 +08:00
Thomas Gleixner
c8c4076723 x86/timer: Skip PIT initialization on modern chipsets
Recent Intel chipsets including Skylake and ApolloLake have a special
ITSSPRC register which allows the 8254 PIT to be gated.  When gated, the
8254 registers can still be programmed as normal, but there are no IRQ0
timer interrupts.

Some products such as the Connex L1430 and exone go Rugged E11 use this
register to ship with the PIT gated by default. This causes Linux to fail
to boot:

  Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with
  apic=debug and send a report.

The panic happens before the framebuffer is initialized, so to the user, it
appears as an early boot hang on a black screen.

Affected products typically have a BIOS option that can be used to enable
the 8254 and make Linux work (Chipset -> South Cluster Configuration ->
Miscellaneous Configuration -> 8254 Clock Gating), however it would be best
to make Linux support the no-8254 case.

Modern sytems allow to discover the TSC and local APIC timer frequencies,
so the calibration against the PIT is not required. These systems have
always running timers and the local APIC timer works also in deep power
states.

So the setup of the PIT including the IO-APIC timer interrupt delivery
checks are a pointless exercise.

Skip the PIT setup and the IO-APIC timer interrupt checks on these systems,
which avoids the panic caused by non ticking PITs and also speeds up the
boot process.

Thanks to Daniel for providing the changelog, initial analysis of the
problem and testing against a variety of machines.

Reported-by: Daniel Drake <drake@endlessm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: linux@endlessm.com
Cc: rafael.j.wysocki@intel.com
Cc: hdegoede@redhat.com
Link: https://lkml.kernel.org/r/20190628072307.24678-1-drake@endlessm.com
2019-06-29 11:35:35 +02:00
Steven Rostedt (VMware)
39611265ed ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare()
Taking the text_mutex in ftrace_arch_code_modify_prepare() is to fix a
race against module loading and live kernel patching that might try to
change the text permissions while ftrace has it as read/write. This
really needs to be documented in the code. Add a comment that does such.

Link: http://lkml.kernel.org/r/20190627211819.5a591f52@gandalf.local.home

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-28 14:21:25 -04:00
Petr Mladek
d5b844a2cf ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()
The commit 9f255b632b ("module: Fix livepatch/ftrace module text
permissions race") causes a possible deadlock between register_kprobe()
and ftrace_run_update_code() when ftrace is using stop_machine().

The existing dependency chain (in reverse order) is:

-> #1 (text_mutex){+.+.}:
       validate_chain.isra.21+0xb32/0xd70
       __lock_acquire+0x4b8/0x928
       lock_acquire+0x102/0x230
       __mutex_lock+0x88/0x908
       mutex_lock_nested+0x32/0x40
       register_kprobe+0x254/0x658
       init_kprobes+0x11a/0x168
       do_one_initcall+0x70/0x318
       kernel_init_freeable+0x456/0x508
       kernel_init+0x22/0x150
       ret_from_fork+0x30/0x34
       kernel_thread_starter+0x0/0xc

-> #0 (cpu_hotplug_lock.rw_sem){++++}:
       check_prev_add+0x90c/0xde0
       validate_chain.isra.21+0xb32/0xd70
       __lock_acquire+0x4b8/0x928
       lock_acquire+0x102/0x230
       cpus_read_lock+0x62/0xd0
       stop_machine+0x2e/0x60
       arch_ftrace_update_code+0x2e/0x40
       ftrace_run_update_code+0x40/0xa0
       ftrace_startup+0xb2/0x168
       register_ftrace_function+0x64/0x88
       klp_patch_object+0x1a2/0x290
       klp_enable_patch+0x554/0x980
       do_one_initcall+0x70/0x318
       do_init_module+0x6e/0x250
       load_module+0x1782/0x1990
       __s390x_sys_finit_module+0xaa/0xf0
       system_call+0xd8/0x2d0

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(text_mutex);
                               lock(cpu_hotplug_lock.rw_sem);
                               lock(text_mutex);
  lock(cpu_hotplug_lock.rw_sem);

It is similar problem that has been solved by the commit 2d1e38f566
("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
To be on the safe side, text_mutex must become a low level lock taken
after cpu_hotplug_lock.rw_sem.

This can't be achieved easily with the current ftrace design.
For example, arm calls set_all_modules_text_rw() already in
ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
This functions is called:

  + outside stop_machine() from ftrace_run_update_code()
  + without stop_machine() from ftrace_module_enable()

Fortunately, the problematic fix is needed only on x86_64. It is
the only architecture that calls set_all_modules_text_rw()
in ftrace path and supports livepatching at the same time.

Therefore it is enough to move text_mutex handling from the generic
kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:

   ftrace_arch_code_modify_prepare()
   ftrace_arch_code_modify_post_process()

This patch basically reverts the ftrace part of the problematic
commit 9f255b632b ("module: Fix livepatch/ftrace module
text permissions race"). And provides x86_64 specific-fix.

Some refactoring of the ftrace code will be needed when livepatching
is implemented for arm or nds32. These architectures call
set_all_modules_text_rw() and use stop_machine() at the same time.

Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com

Fixes: 9f255b632b ("module: Fix livepatch/ftrace module text permissions race")
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
[
  As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
  ftrace_run_update_code() as it is a void function.
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-28 14:20:25 -04:00
Josh Poimboeuf
ae6a45a086 x86/unwind/orc: Fall back to using frame pointers for generated code
The ORC unwinder can't unwind through BPF JIT generated code because
there are no ORC entries associated with the code.

If an ORC entry isn't available, try to fall back to frame pointers.  If
BPF and other generated code always do frame pointer setup (even with
CONFIG_FRAME_POINTERS=n) then this will allow ORC to unwind through most
generated code despite there being no corresponding ORC entries.

Fixes: d15d356887 ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kairui Song <kasong@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/b6f69208ddff4343d56b7bfac1fc7cfcd62689e8.1561595111.git.jpoimboe@redhat.com
2019-06-28 00:11:21 +02:00
Song Liu
83f44ae0f8 perf/x86: Always store regs->ip in perf_callchain_kernel()
The stacktrace_map_raw_tp BPF selftest is failing because the RIP saved by
perf_arch_fetch_caller_regs() isn't getting saved by perf_callchain_kernel().

This was broken by the following commit:

  d15d356887 ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")

With that change, when starting with non-HW regs, the unwinder starts
with the current stack frame and unwinds until it passes up the frame
which called perf_arch_fetch_caller_regs().  So regs->ip needs to be
saved deliberately.

Fixes: d15d356887 ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kairui Song <kasong@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/3975a298fa52b506fea32666d8ff6a13467eee6d.1561595111.git.jpoimboe@redhat.com
2019-06-28 00:11:20 +02:00
Andy Lutomirski
441cedab2d x86/vsyscall: Add __ro_after_init to global variables
The vDSO is only configurable by command-line options, so make its
global variables __ro_after_init.  This seems highly unlikely to
ever stop an exploit, but it's nicer anyway.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/a386925835e49d319e70c4d7404b1f6c3c2e3702.1561610354.git.luto@kernel.org
2019-06-28 00:04:40 +02:00
Andy Lutomirski
625b7b7f79 x86/vsyscall: Change the default vsyscall mode to xonly
The use case for full emulation over xonly is very esoteric, e.g. magic
instrumentation tools.

Change the default to the safer xonly mode.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/30539f8072d2376b9c9efcc07e6ed0d6bf20e882.1561610354.git.luto@kernel.org
2019-06-28 00:04:39 +02:00
Andy Lutomirski
e0a446ce39 x86/vsyscall: Document odd SIGSEGV error code for vsyscalls
Even if vsyscall=none, user page faults on the vsyscall page are reported
as though the PROT bit in the error code was set.  Add a comment explaining
why this is probably okay and display the value in the test case.

While at it, explain why the behavior is correct with respect to PKRU.

Modify also the selftest to print the odd error code so that there is a
way to demonstrate the odd behaviour.

If anyone really cares about more accurate emulation, the behaviour could
be changed. But that needs a real good justification.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/75c91855fd850649ace162eec5495a1354221aaa.1561610354.git.luto@kernel.org
2019-06-28 00:04:39 +02:00
Andy Lutomirski
918ce32509 x86/vsyscall: Show something useful on a read fault
Just segfaulting the application when it tries to read the vsyscall page in
xonly mode is not helpful for those who need to debug it.

Emit a hint.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jann Horn <jannh@google.com>
Link: https://lkml.kernel.org/r/8016afffe0eab497be32017ad7f6f7030dc3ba66.1561610354.git.luto@kernel.org
2019-06-28 00:04:39 +02:00
Andy Lutomirski
bd49e16e33 x86/vsyscall: Add a new vsyscall=xonly mode
With vsyscall emulation on, a readable vsyscall page is still exposed that
contains syscall instructions that validly implement the vsyscalls.

This is required because certain dynamic binary instrumentation tools
attempt to read the call targets of call instructions in the instrumented
code.  If the instrumented code uses vsyscalls, then the vsyscall page needs
to contain readable code.

Unfortunately, leaving readable memory at a deterministic address can be
used to help various ASLR bypasses, so some hardening value can be gained
by disallowing vsyscall reads.

Given how rarely the vsyscall page needs to be readable, add a mechanism to
make the vsyscall page be execute only.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/d17655777c21bc09a7af1bbcf74e6f2b69a51152.1561610354.git.luto@kernel.org
2019-06-28 00:04:38 +02:00
Sudeep Holla
b07d7d5c7b x86/entry: Simplify _TIF_SYSCALL_EMU handling
The usage of emulated and _TIF_SYSCALL_EMU flags in syscall_trace_enter
is more complicated than required.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-06-27 10:14:06 +01:00
Alejandro Jimenez
c1f7fec1eb x86/speculation: Allow guests to use SSBD even if host does not
The bits set in x86_spec_ctrl_mask are used to calculate the guest's value
of SPEC_CTRL that is written to the MSR before VMENTRY, and control which
mitigations the guest can enable.  In the case of SSBD, unless the host has
enabled SSBD always on mode (by passing "spec_store_bypass_disable=on" in
the kernel parameters), the SSBD bit is not set in the mask and the guest
can not properly enable the SSBD always on mitigation mode.

This has been confirmed by running the SSBD PoC on a guest using the SSBD
always on mitigation mode (booted with kernel parameter
"spec_store_bypass_disable=on"), and verifying that the guest is vulnerable
unless the host is also using SSBD always on mode. In addition, the guest
OS incorrectly reports the SSB vulnerability as mitigated.

Always set the SSBD bit in x86_spec_ctrl_mask when the host CPU supports
it, allowing the guest to use SSBD whether or not the host has chosen to
enable the mitigation in any of its modes.

Fixes: be6fcb5478 ("x86/bugs: Rework spec_ctrl base and mask logic")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: bp@alien8.de
Cc: rkrcmar@redhat.com
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1560187210-11054-1-git-send-email-alejandro.j.jimenez@oracle.com
2019-06-26 16:38:36 +02:00
Thomas Gleixner
9d90b93bf3 lib/vdso: Make delta calculation work correctly
The x86 vdso implementation on which the generic vdso library is based on
has subtle (unfortunately undocumented) twists:

 1) The code assumes that the clocksource mask is U64_MAX which means that
    no bits are masked. Which is true for any valid x86 VDSO clocksource.
    Stupidly it still did the mask operation for no reason and at the wrong
    place right after reading the clocksource.

 2) It contains a sanity check to catch the case where slightly
    unsynchronized TSC values can be observed which would cause the delta
    calculation to make a huge jump. It therefore checks whether the
    current TSC value is larger than the value on which the current
    conversion is based on. If it's not larger the base value is used to
    prevent time jumps.

#1 Is not only stupid for the X86 case because it does the masking for no
reason it is also completely wrong for clocksources with a smaller mask
which can legitimately wrap around during a conversion period. The core
timekeeping code does it correct by applying the mask after the delta
calculation:

	(now - base) & mask

#2 is equally broken for clocksources which have smaller masks and can wrap
around during a conversion period because there the now > base check is
just wrong and causes stale time stamps and time going backwards issues.

Unbreak it by:

  1) Removing the mask operation from the clocksource read which makes the
     fallback detection work for all clocksources

  2) Replacing the conditional delta calculation with a overrideable inline
     function.

#2 could reuse clocksource_delta() from the timekeeping code but that
results in a significant performance hit for the x86 VSDO. The timekeeping
core code must have the non optimized version as it has to operate
correctly with clocksources which have smaller masks as well to handle the
case where TSC is discarded as timekeeper clocksource and replaced by HPET
or pmtimer. For the VDSO there is no replacement clocksource. If TSC is
unusable the syscall is enforced which does the right thing.

To accommodate to the needs of various architectures provide an
override-able inline function which defaults to the regular delta
calculation with masking:

	(now - base) & mask

Override it for x86 with the non-masking and checking version.

This unbreaks the ARM64 syscall fallback operation, allows to use
clocksources with arbitrary width and preserves the performance
optimization for x86.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: catalin.marinas@arm.com
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux@armlinux.org.uk
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: paul.burton@mips.com
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: salyzyn@android.com
Cc: pcc@google.com
Cc: shuah@kernel.org
Cc: 0x7f454c46@gmail.com
Cc: linux@rasmusvillemoes.dk
Cc: huw@codeweavers.com
Cc: sthotton@marvell.com
Cc: andre.przywara@arm.com
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906261159230.32342@nanos.tec.linutronix.de
2019-06-26 14:26:53 +02:00
Kirill A. Shutemov
432c833218 x86/mm: Handle physical-virtual alignment mismatch in phys_p4d_init()
Kyle has reported occasional crashes when booting a kernel in 5-level
paging mode with KASLR enabled:

  WARNING: CPU: 0 PID: 0 at arch/x86/mm/init_64.c:87 phys_p4d_init+0x1d4/0x1ea
  RIP: 0010:phys_p4d_init+0x1d4/0x1ea
  Call Trace:
   __kernel_physical_mapping_init+0x10a/0x35c
   kernel_physical_mapping_init+0xe/0x10
   init_memory_mapping+0x1aa/0x3b0
   init_range_memory_mapping+0xc8/0x116
   init_mem_mapping+0x225/0x2eb
   setup_arch+0x6ff/0xcf5
   start_kernel+0x64/0x53b
   ? copy_bootdata+0x1f/0xce
   x86_64_start_reservations+0x24/0x26
   x86_64_start_kernel+0x8a/0x8d
   secondary_startup_64+0xb6/0xc0

which causes later:

  BUG: unable to handle page fault for address: ff484d019580eff8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  BAD
  Oops: 0000 [#1] SMP NOPTI
  RIP: 0010:fill_pud+0x13/0x130
  Call Trace:
   set_pte_vaddr_p4d+0x2e/0x50
   set_pte_vaddr+0x6f/0xb0
   __native_set_fixmap+0x28/0x40
   native_set_fixmap+0x39/0x70
   register_lapic_address+0x49/0xb6
   early_acpi_boot_init+0xa5/0xde
   setup_arch+0x944/0xcf5
   start_kernel+0x64/0x53b

Kyle bisected the issue to commit b569c18434 ("x86/mm/KASLR: Reduce
randomization granularity for 5-level paging to 1GB")

Before this commit PAGE_OFFSET was always aligned to P4D_SIZE when booting
5-level paging mode. But now only PUD_SIZE alignment is guaranteed.

In the case I was able to reproduce the following vaddr/paddr values were
observed in phys_p4d_init():

Iteration     vaddr			paddr
   1 	      0xff4228027fe00000 	0x033fe00000
   2	      0xff42287f40000000	0x8000000000

'vaddr' in both cases belongs to the same p4d entry.

But due to the original assumption that PAGE_OFFSET is aligned to P4D_SIZE
this overlap cannot be handled correctly. The code assumes strictly aligned
entries and unconditionally increments the index into the P4D table, which
creates false duplicate entries. Once the index reaches the end, the last
entry in the page table is missing.

Aside of that the 'paddr >= paddr_end' condition can evaluate wrong which
causes an P4D entry to be cleared incorrectly.

Change the loop in phys_p4d_init() to walk purely based on virtual
addresses like __kernel_physical_mapping_init() does. This makes it work
correctly with unaligned virtual addresses.

Fixes: b569c18434 ("x86/mm/KASLR: Reduce randomization granularity for 5-level paging to 1GB")
Reported-by: Kyle Pelton <kyle.d.pelton@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Pelton <kyle.d.pelton@intel.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190624123150.920-1-kirill.shutemov@linux.intel.com
2019-06-26 07:25:09 +02:00
Kirill A. Shutemov
c1887159eb x86/boot/64: Add missing fixup_pointer() for next_early_pgt access
__startup_64() uses fixup_pointer() to access global variables in a
position-independent fashion. Access to next_early_pgt was wrapped into the
helper, but one instance in the 5-level paging branch was missed.

GCC generates a R_X86_64_PC32 PC-relative relocation for the access which
doesn't trigger the issue, but Clang emmits a R_X86_64_32S which leads to
an invalid memory access and system reboot.

Fixes: 187e91fe5e ("x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Potapenko <glider@google.com>
Link: https://lkml.kernel.org/r/20190620112422.29264-1-kirill.shutemov@linux.intel.com
2019-06-26 07:25:09 +02:00
Kirill A. Shutemov
81c7ed296d x86/boot/64: Fix crash if kernel image crosses page table boundary
A kernel which boots in 5-level paging mode crashes in a small percentage
of cases if KASLR is enabled.

This issue was tracked down to the case when the kernel image unpacks in a
way that it crosses an 1G boundary. The crash is caused by an overrun of
the PMD page table in __startup_64() and corruption of P4D page table
allocated next to it. This particular issue is not visible with 4-level
paging as P4D page tables are not used.

But the P4D and the PUD calculation have similar problems.

The PMD index calculation is wrong due to operator precedence, which fails
to confine the PMDs in the PMD array on wrap around.

The P4D calculation for 5-level paging and the PUD calculation calculate
the first index correctly, but then blindly increment it which causes the
same issue when a kernel image is located across a 512G and for 5-level
paging across a 46T boundary.

This wrap around mishandling was introduced when these parts moved from
assembly to C.

Restore it to the correct behaviour.

Fixes: c88d71508e ("x86/boot/64: Rewrite startup_64() in C")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190620112345.28833-1-kirill.shutemov@linux.intel.com
2019-06-26 07:25:09 +02:00
Kan Liang
cd6b984f6d perf/x86: Remove pmu->pebs_no_xmm_regs
We don't need pmu->pebs_no_xmm_regs anymore, the capabilities
PERF_PMU_CAP_EXTENDED_REGS can be used to check if XMM registers
collection is supported.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/1559081314-9714-4-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:25 +02:00
Kan Liang
dce86ac75d perf/x86: Clean up PEBS_XMM_REGS
Use generic macro PERF_REG_EXTENDED_MASK to replace PEBS_XMM_REGS to
avoid duplication.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/1559081314-9714-3-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:24 +02:00
Kan Liang
90d424915a perf/x86/regs: Check reserved bits
The perf fuzzer triggers a warning which map to:

        if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
                return 0;

The bits between XMM registers and generic registers are reserved.
But perf_reg_validate() doesn't check these bits.

Add PERF_REG_X86_RESERVED for reserved bits on X86.
Check the reserved bits in perf_reg_validate().

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 878068ea27 ("perf/x86: Support outputting XMM registers")
Link: https://lkml.kernel.org/r/1559081314-9714-2-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:24 +02:00
Kan Liang
e321d02db8 perf/x86: Disable extended registers for non-supported PMUs
The perf fuzzer caused Skylake machine to crash:

[ 9680.085831] Call Trace:
[ 9680.088301]  <IRQ>
[ 9680.090363]  perf_output_sample_regs+0x43/0xa0
[ 9680.094928]  perf_output_sample+0x3aa/0x7a0
[ 9680.099181]  perf_event_output_forward+0x53/0x80
[ 9680.103917]  __perf_event_overflow+0x52/0xf0
[ 9680.108266]  ? perf_trace_run_bpf_submit+0xc0/0xc0
[ 9680.113108]  perf_swevent_hrtimer+0xe2/0x150
[ 9680.117475]  ? check_preempt_wakeup+0x181/0x230
[ 9680.122091]  ? check_preempt_curr+0x62/0x90
[ 9680.126361]  ? ttwu_do_wakeup+0x19/0x140
[ 9680.130355]  ? try_to_wake_up+0x54/0x460
[ 9680.134366]  ? reweight_entity+0x15b/0x1a0
[ 9680.138559]  ? __queue_work+0x103/0x3f0
[ 9680.142472]  ? update_dl_rq_load_avg+0x1cd/0x270
[ 9680.147194]  ? timerqueue_del+0x1e/0x40
[ 9680.151092]  ? __remove_hrtimer+0x35/0x70
[ 9680.155191]  __hrtimer_run_queues+0x100/0x280
[ 9680.159658]  hrtimer_interrupt+0x100/0x220
[ 9680.163835]  smp_apic_timer_interrupt+0x6a/0x140
[ 9680.168555]  apic_timer_interrupt+0xf/0x20
[ 9680.172756]  </IRQ>

The XMM registers can only be collected by PEBS hardware events on the
platforms with PEBS baseline support, e.g. Icelake, not software/probe
events.

Add capabilities flag PERF_PMU_CAP_EXTENDED_REGS to indicate the PMU
which support extended registers. For X86, the extended registers are
XMM registers.

Add has_extended_regs() to check if extended registers are applied.

The generic code define the mask of extended registers as 0 if arch
headers haven't overridden it.

Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 878068ea27 ("perf/x86: Support outputting XMM registers")
Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:23 +02:00
Andy Lutomirski
ecf9db3d1f x86/vdso: Give the [ph]vclock_page declarations real types
Clean up the vDSO code a bit by giving pvclock_page and hvclock_page
their actual types instead of u8[PAGE_SIZE].  This shouldn't
materially affect the generated code.

Heavily based on a patch from Linus.

[ tglx: Adapted to the unified VDSO code ]

Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/6920c5188f8658001af1fc56fd35b815706d300c.1561241273.git.luto@kernel.org
2019-06-24 01:21:31 +02:00
Nadav Amit
caa759323c smp: Remove smp_call_function() and on_each_cpu() return values
The return value is fixed. Remove it and amend the callers.

[ tglx: Fixup arm/bL_switcher and powerpc/rtas ]

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20190613064813.8102-2-namit@vmware.com
2019-06-23 14:26:26 +02:00
Nadav Amit
dde3626f81 x86/apic: Use non-atomic operations when possible
Using __clear_bit() and __cpumask_clear_cpu() is more efficient than using
their atomic counterparts.

Use them when atomicity is not needed, such as when manipulating bitmasks
that are on the stack.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20190613064813.8102-10-namit@vmware.com
2019-06-23 14:07:23 +02:00
Vincenzo Frascino
22ca962288 x86/vdso: Add clock_gettime64() entry point
Linux 5.1 gained the new clock_gettime64() syscall to address the Y2038
problem on 32bit systems. The x86 VDSO is missing support for this variant
of clock_gettime().

Update the x86 specific vDSO library accordingly so it exposes the new time
getter.

[ tglx: Massaged changelog ]

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Huw Davies <huw@codeweavers.com>
Cc: Shijith Thotton <sthotton@marvell.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Link: https://lkml.kernel.org/r/20190621095252.32307-25-vincenzo.frascino@arm.com
2019-06-22 21:21:10 +02:00
Vincenzo Frascino
f66501dc53 x86/vdso: Add clock_getres() entry point
The generic vDSO library provides an implementation of clock_getres()
that can be leveraged by each architecture.

Add the clock_getres() VDSO entry point on x86.

[ tglx: Massaged changelog and cleaned up the function signature formatting ]

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Huw Davies <huw@codeweavers.com>
Cc: Shijith Thotton <sthotton@marvell.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Link: https://lkml.kernel.org/r/20190621095252.32307-24-vincenzo.frascino@arm.com
2019-06-22 21:21:10 +02:00
Vincenzo Frascino
7ac8707479 x86/vdso: Switch to generic vDSO implementation
The x86 vDSO library requires some adaptations to take advantage of the
newly introduced generic vDSO library.

Introduce the following changes:
 - Modification of vdso.c to be compliant with the common vdso datapage
 - Use of lib/vdso for gettimeofday

[ tglx: Massaged changelog and cleaned up the function signature formatting ]

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Huw Davies <huw@codeweavers.com>
Cc: Shijith Thotton <sthotton@marvell.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Link: https://lkml.kernel.org/r/20190621095252.32307-23-vincenzo.frascino@arm.com
2019-06-22 21:21:10 +02:00