Some events can provide a guest with information about other guests or the
host (e.g. L3 cache stats); providing the capability to restrict access
to a "safe" set of events would limit the potential for the PMU to be used
in any side channel attacks. This change introduces a new VM ioctl that
sets an event filter. If the guest attempts to program a counter for
any blacklisted or non-whitelisted event, the kernel counter won't be
created, so any RDPMC/RDMSR will show 0 instances of that event.
Signed-off-by: Eric Hankland <ehankland@google.com>
[Lots of changes. All remaining bugs are probably mine. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.
The reason is that the static key which controls the pinning is marked RO
after init. The kvm_intel module contains a CR4 write which requires to
update the static key entry list. That obviously does not work when the key
is in a RO section.
With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
uses the paravirt indirection and the actual write function is built in.
As the key is intended to be immutable after init, move
native_write_cr0/4() out of line.
While at it consolidate the update of the cr4 shadow variable and store the
value right away when the pinning is initialized on a booting CPU. No point
in reading it back 20 instructions later. This allows to confine the static
key and the pinning variable to cpu/common and allows to mark them static.
Fixes: 8dbec27a24 ("x86/asm: Pin sensitive CR0 bits")
Fixes: 873d50d58f ("x86/asm: Pin sensitive CR4 bits")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Xi Ruoyao <xry111@mengyan1223.wang>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907102140340.1758@nanos.tec.linutronix.de
clang points out that the computation of LOWMEM_PAGES causes a signed
integer overflow on 32-bit x86:
arch/x86/kernel/head32.c:83:20: error: signed shift result (0x100000000) requires 34 bits to represent, but 'int' only has 32 bits [-Werror,-Wshift-overflow]
(PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT);
^~~~~~~~~~~~
arch/x86/include/asm/pgtable_32.h:109:27: note: expanded from macro 'LOWMEM_PAGES'
#define LOWMEM_PAGES ((((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT))
~^ ~~
arch/x86/include/asm/pgtable_32.h:98:34: note: expanded from macro 'PAGE_TABLE_SIZE'
#define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
Use the _ULL() macro to make it a 64-bit constant.
Fixes: 1e620f9b23 ("x86/boot/32: Convert the 32-bit pgtable setup code from assembly to C")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190710130522.1802800-1-arnd@arndb.de
- A fair pile of RST conversions, many from Mauro. These create more
than the usual number of simple but annoying merge conflicts with other
trees, unfortunately. He has a lot more of these waiting on the wings
that, I think, will go to you directly later on.
- A new document on how to use merges and rebases in kernel repos, and one
on Spectre vulnerabilities.
- Various improvements to the build system, including automatic markup of
function() references because some people, for reasons I will never
understand, were of the opinion that :c:func:``function()`` is
unattractive and not fun to type.
- We now recommend using sphinx 1.7, but still support back to 1.4.
- Lots of smaller improvements, warning fixes, typo fixes, etc.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl0krAEPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Yg98H/AuLqO9LpOgUjF4LhyjxGPdzJkY9RExSJ7km
gznyreLCZgFaJR+AY6YDsd4Jw6OJlPbu1YM/Qo3C3WrZVFVhgL/s2ebvBgCo50A8
raAFd8jTf4/mGCHnAqRotAPQ3mETJUk315B66lBJ6Oc+YdpRhwXWq8ZW2bJxInFF
3HDvoFgMf0KhLuMHUkkL0u3fxH1iA+KvDu8diPbJYFjOdOWENz/CV8wqdVkXRSEW
DJxIq89h/7d+hIG3d1I7Nw+gibGsAdjSjKv4eRKauZs4Aoxd1Gpl62z0JNk6aT3m
dtq4joLdwScydonXROD/Twn2jsu4xYTrPwVzChomElMowW/ZBBY=
=D0eO
-----END PGP SIGNATURE-----
Merge tag 'docs-5.3' of git://git.lwn.net/linux
Pull Documentation updates from Jonathan Corbet:
"It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro. These create more
than the usual number of simple but annoying merge conflicts with
other trees, unfortunately. He has a lot more of these waiting on
the wings that, I think, will go to you directly later on.
- A new document on how to use merges and rebases in kernel repos,
and one on Spectre vulnerabilities.
- Various improvements to the build system, including automatic
markup of function() references because some people, for reasons I
will never understand, were of the opinion that
:c:func:``function()`` is unattractive and not fun to type.
- We now recommend using sphinx 1.7, but still support back to 1.4.
- Lots of smaller improvements, warning fixes, typo fixes, etc"
* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
docs: automarkup.py: ignore exceptions when seeking for xrefs
docs: Move binderfs to admin-guide
Disable Sphinx SmartyPants in HTML output
doc: RCU callback locks need only _bh, not necessarily _irq
docs: format kernel-parameters -- as code
Doc : doc-guide : Fix a typo
platform: x86: get rid of a non-existent document
Add the RCU docs to the core-api manual
Documentation: RCU: Add TOC tree hooks
Documentation: RCU: Rename txt files to rst
Documentation: RCU: Convert RCU UP systems to reST
Documentation: RCU: Convert RCU linked list to reST
Documentation: RCU: Convert RCU basic concepts to reST
docs: filesystems: Remove uneeded .rst extension on toctables
scripts/sphinx-pre-install: fix out-of-tree build
docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
Documentation: PGP: update for newer HW devices
Documentation: Add section about CPU vulnerabilities for Spectre
Documentation: platform: Delete x86-laptop-drivers.txt
docs: Note that :c:func: should no longer be used
...
Pull x865 kdump updates from Thomas Gleixner:
"Yet more kexec/kdump updates:
- Properly support kexec when AMD's memory encryption (SME) is
enabled
- Pass reserved e820 ranges to the kexec kernel so both PCI and SME
can work"
* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/proc/vmcore: Enable dumping of encrypted memory when SEV was active
x86/kexec: Set the C-bit in the identity map page table when SEV is active
x86/kexec: Do not map kexec area as decrypted when SEV is active
x86/crash: Add e820 reserved ranges to kdump kernel's e820 table
x86/mm: Rework ioremap resource mapping determination
x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED
x86/mm: Create a workarea in the kernel for SME early encryption
x86/mm: Identify the end of the kernel area to be reserved
Pull x86 boot updates from Thomas Gleixner:
"Assorted updates to kexec/kdump:
- Proper kexec support for 4/5-level paging and jumping from a
5-level to a 4-level paging kernel.
- Make the EFI support for kexec/kdump more robust
- Enforce that the GDT is properly aligned instead of getting the
alignment by chance"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kdump/64: Restrict kdump kernel reservation to <64TB
x86/kexec/64: Prevent kexec from 5-level paging to a 4-level only kernel
x86/boot: Add xloadflags bits to check for 5-level paging support
x86/boot: Make the GDT 8-byte aligned
x86/kexec: Add the ACPI NVS region to the ident map
x86/boot: Call get_rsdp_addr() after console_init()
Revert "x86/boot: Disable RSDP parsing temporarily"
x86/boot: Use efi_setup_data for searching RSDP on kexec-ed kernels
x86/kexec: Add the EFI system tables and ACPI tables to the ident map
Spectre v1 isn't only about array bounds checks. It can affect any
conditional checks. The kernel entry code interrupt, exception, and NMI
handlers all have conditional swapgs checks. Those may be problematic in
the context of Spectre v1, as kernel code can speculatively run with a user
GS.
For example:
if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg
mov (%reg), %reg1
When coming from user space, the CPU can speculatively skip the swapgs, and
then do a speculative percpu load using the user GS value. So the user can
speculatively force a read of any kernel value. If a gadget exists which
uses the percpu value as an address in another load/store, then the
contents of the kernel value may become visible via an L1 side channel
attack.
A similar attack exists when coming from kernel space. The CPU can
speculatively do the swapgs, causing the user GS to get used for the rest
of the speculative window.
The mitigation is similar to a traditional Spectre v1 mitigation, except:
a) index masking isn't possible; because the index (percpu offset)
isn't user-controlled; and
b) an lfence is needed in both the "from user" swapgs path and the
"from kernel" non-swapgs path (because of the two attacks described
above).
The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a
CR3 write when PTI is enabled. Since CR3 writes are serializing, the
lfences can be skipped in those cases.
On the other hand, the kernel entry swapgs paths don't depend on PTI.
To avoid unnecessary lfences for the user entry case, create two separate
features for alternative patching:
X86_FEATURE_FENCE_SWAPGS_USER
X86_FEATURE_FENCE_SWAPGS_KERNEL
Use these features in entry code to patch in lfences where needed.
The features aren't enabled yet, so there's no functional change.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
The mutex mm->context->lock for init_mm is not initialized for init_mm.
This wasn't a problem because it remained unused. This changed however
since commit
4fc19708b1 ("x86/alternatives: Initialize temporary mm for patching")
Initialize the mutex for init_mm.
Fixes: 4fc19708b1 ("x86/alternatives: Initialize temporary mm for patching")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20190701173354.2pe62hhliok2afea@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull force_sig() argument change from Eric Biederman:
"A source of error over the years has been that force_sig has taken a
task parameter when it is only safe to use force_sig with the current
task.
The force_sig function is built for delivering synchronous signals
such as SIGSEGV where the userspace application caused a synchronous
fault (such as a page fault) and the kernel responded with a signal.
Because the name force_sig does not make this clear, and because the
force_sig takes a task parameter the function force_sig has been
abused for sending other kinds of signals over the years. Slowly those
have been fixed when the oopses have been tracked down.
This set of changes fixes the remaining abusers of force_sig and
carefully rips out the task parameter from force_sig and friends
making this kind of error almost impossible in the future"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
signal: Remove the signal number and task parameters from force_sig_info
signal: Factor force_sig_info_to_task out of force_sig_info
signal: Generate the siginfo in force_sig
signal: Move the computation of force into send_signal and correct it.
signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
signal: Remove the task parameter from force_sig_fault
signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
signal: Explicitly call force_sig_fault on current
signal/unicore32: Remove tsk parameter from __do_user_fault
signal/arm: Remove tsk parameter from __do_user_fault
signal/arm: Remove tsk parameter from ptrace_break
signal/nds32: Remove tsk parameter from send_sigtrap
signal/riscv: Remove tsk parameter from do_trap
signal/sh: Remove tsk parameter from force_sig_info_fault
signal/um: Remove task parameter from send_sigtrap
signal/x86: Remove task parameter from send_sigtrap
signal: Remove task parameter from force_sig_mceerr
signal: Remove task parameter from force_sig
signal: Remove task parameter from force_sigsegv
...
Pull x86 topology updates from Ingo Molnar:
"Implement multi-die topology support on Intel CPUs and expose the die
topology to user-space tooling, by Len Brown, Kan Liang and Zhang Rui.
These changes should have no effect on the kernel's existing
understanding of topologies, i.e. there should be no behavioral impact
on cache, NUMA, scheduler, perf and other topologies and overall
system performance"
* 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support
perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support
hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages
thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages
perf/x86/intel/cstate: Support multi-die/package
perf/x86/intel/rapl: Support multi-die/package
perf/x86/intel/uncore: Support multi-die/package
topology: Create core_cpus and die_cpus sysfs attributes
topology: Create package_cpus sysfs attribute
hwmon/coretemp: Support multi-die/package
powercap/intel_rapl: Update RAPL domain name and debug messages
thermal/x86_pkg_temp_thermal: Support multi-die/package
powercap/intel_rapl: Support multi-die/package
powercap/intel_rapl: Simplify rapl_find_package()
x86/topology: Define topology_logical_die_id()
x86/topology: Define topology_die_id()
cpu/topology: Export die_id
x86/topology: Create topology_max_die_per_package()
x86/topology: Add CPUID.1F multi-die/package support
Pull x86 platform updayes from Ingo Molnar:
"Most of the commits add ACRN hypervisor guest support, plus two
cleanups"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/jailhouse: Mark jailhouse_x2apic_available() as __init
x86/platform/geode: Drop <linux/gpio.h> includes
x86/acrn: Use HYPERVISOR_CALLBACK_VECTOR for ACRN guest upcall vector
x86: Add support for Linux guests on an ACRN hypervisor
x86/Kconfig: Add new X86_HV_CALLBACK_VECTOR config symbol
Pull x86 paravirt updates from Ingo Molnar:
"A handful of paravirt patching code enhancements to make it more
robust against patching failures, and related cleanups and not so
related cleanups - by Thomas Gleixner and myself"
* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/paravirt: Rename paravirt_patch_site::instrtype to paravirt_patch_site::type
x86/paravirt: Standardize 'insn_buff' variable names
x86/paravirt: Match paravirt patchlet field definition ordering to initialization ordering
x86/paravirt: Replace the paravirt patch asm magic
x86/paravirt: Unify the 32/64 bit paravirt patching code
x86/paravirt: Detect over-sized patching bugs in paravirt_patch_call()
x86/paravirt: Detect over-sized patching bugs in paravirt_patch_insns()
x86/paravirt: Remove bogus extern declarations
Pull x86 asm updates from Ingo Molnar:
"Most of the changes relate to Peter Zijlstra's cleanup of ptregs
handling, in particular the i386 part is now much simplified and
standardized - no more partial ptregs stack frames via the esp/ss
oddity. This simplifies ftrace, kprobes, the unwinder, ptrace, kdump
and kgdb.
There's also a CR4 hardening enhancements by Kees Cook, to make the
generic platform functions such as native_write_cr4() less useful as
ROP gadgets that disable SMEP/SMAP. Also protect the WP bit of CR0
against similar attacks.
The rest is smaller cleanups/fixes"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Add int3_emulate_call() selftest
x86/stackframe/32: Allow int3_emulate_push()
x86/stackframe/32: Provide consistent pt_regs
x86/stackframe, x86/ftrace: Add pt_regs frame annotations
x86/stackframe, x86/kprobes: Fix frame pointer annotations
x86/stackframe: Move ENCODE_FRAME_POINTER to asm/frame.h
x86/entry/32: Clean up return from interrupt preemption path
x86/asm: Pin sensitive CR0 bits
x86/asm: Pin sensitive CR4 bits
Documentation/x86: Fix path to entry_32.S
x86/asm: Remove unused TASK_TI_flags from asm-offsets.c
Pull locking updates from Ingo Molnar:
"The main changes in this cycle are:
- rwsem scalability improvements, phase #2, by Waiman Long, which are
rather impressive:
"On a 2-socket 40-core 80-thread Skylake system with 40 reader
and writer locking threads, the min/mean/max locking operations
done in a 5-second testing window before the patchset were:
40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255
After the patchset, they became:
40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"
There's a lot of changes to the locking implementation that makes
it similar to qrwlock, including owner handoff for more fair
locking.
Another microbenchmark shows how across the spectrum the
improvements are:
"With a locking microbenchmark running on 5.1 based kernel, the
total locking rates (in kops/s) on a 2-socket Skylake system
with equal numbers of readers and writers (mixed) before and
after this patchset were:
# of Threads Before Patch After Patch
------------ ------------ -----------
2 2,618 4,193
4 1,202 3,726
8 802 3,622
16 729 3,359
32 319 2,826
64 102 2,744"
The changes are extensive and the patch-set has been through
several iterations addressing various locking workloads. There
might be more regressions, but unless they are pathological I
believe we want to use this new implementation as the baseline
going forward.
- jump-label optimizations by Daniel Bristot de Oliveira: the primary
motivation was to remove IPI disturbance of isolated RT-workload
CPUs, which resulted in the implementation of batched jump-label
updates. Beyond the improvement of the real-time characteristics
kernel, in one test this patchset improved static key update
overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
as well.
- atomic64_t cross-arch type cleanups by Mark Rutland: over the last
~10 years of atomic64_t existence the various types used by the
APIs only had to be self-consistent within each architecture -
which means they became wildly inconsistent across architectures.
Mark puts and end to this by reworking all the atomic64
implementations to use 's64' as the base type for atomic64_t, and
to ensure that this type is consistently used for parameters and
return values in the API, avoiding further problems in this area.
- A large set of small improvements to lockdep by Yuyang Du: type
cleanups, output cleanups, function return type and othr cleanups
all around the place.
- A set of percpu ops cleanups and fixes by Peter Zijlstra.
- Misc other changes - please see the Git log for more details"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
locking/lockdep: increase size of counters for lockdep statistics
locking/atomics: Use sed(1) instead of non-standard head(1) option
locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
x86/jump_label: Make tp_vec_nr static
x86/percpu: Optimize raw_cpu_xchg()
x86/percpu, sched/fair: Avoid local_clock()
x86/percpu, x86/irq: Relax {set,get}_irq_regs()
x86/percpu: Relax smp_processor_id()
x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
locking/rwsem: Guard against making count negative
locking/rwsem: Adaptive disabling of reader optimistic spinning
locking/rwsem: Enable time-based spinning on reader-owned rwsem
locking/rwsem: Make rwsem->owner an atomic_long_t
locking/rwsem: Enable readers spinning on writer
locking/rwsem: Clarify usage of owner's nonspinaable bit
locking/rwsem: Wake up almost all readers in wait queue
locking/rwsem: More optimal RT task handling of null owner
locking/rwsem: Always release wait_lock before waking up tasks
locking/rwsem: Implement lock handoff to prevent lock starvation
locking/rwsem: Make rwsem_spin_on_owner() return owner state
...
Break out parts of mshyperv.h that are ISA independent into a
separate file in include/asm-generic. This move facilitates
ARM64 code reusing these definitions and avoids code
duplication. No functionality or behavior is changed.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Pull x86 pti updates from Thomas Gleixner:
"The speculative paranoia departement delivers a few more plugs for
possible (probably theoretical) spectre/mds leaks"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tls: Fix possible spectre-v1 in do_get_thread_area()
x86/ptrace: Fix possible spectre-v1 in ptrace_get_debugreg()
x86/speculation/mds: Eliminate leaks by trace_hardirqs_on()
Pull x86 timer updates from Thomas Gleixner:
"A rather large series consolidating the HPET code, which was triggered
by the attempt to bolt HPET NMI watchdog support on to the existing
maze with the usual duct tape and super glue approach.
This mainly removes two separate partially redundant storage layers
and consolidates them into a single one which provides a consistent
view of the different HPET channels and their usage and allows to
integrate HPET NMI watchdog support (if it turns out to be feasible)
in a non intrusive way"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
x86/hpet: Use channel for legacy clockevent storage
x86/hpet: Use common init for legacy clockevent
x86/hpet: Carve out shareable parts of init_one_hpet_msi_clockevent()
x86/hpet: Consolidate clockevent functions
x86/hpet: Wrap legacy clockevent in hpet_channel
x86/hpet: Use cached info instead of extra flags
x86/hpet: Move clockevents into channels
x86/hpet: Rename variables to prepare for switching to channels
x86/hpet: Add function to select a /dev/hpet channel
x86/hpet: Add mode information to struct hpet_channel
x86/hpet: Use cached channel data
x86/hpet: Introduce struct hpet_base and struct hpet_channel
x86/hpet: Coding style cleanup
x86/hpet: Clean up comments
x86/hpet: Make naming consistent
x86/hpet: Remove not required includes
x86/hpet: Decapitalize and rename EVT_TO_HPET_DEV
x86/hpet: Simplify counter validation
x86/hpet: Separate counter check out of clocksource register code
x86/hpet: Shuffle code around for readability sake
...
Pull x86 CPU feature updates from Thomas Gleixner:
"Updates for x86 CPU features:
- Support for UMWAIT/UMONITOR, which allows to use MWAIT and MONITOR
instructions in user space to save power e.g. in HPC workloads
which spin wait on synchronization points.
The maximum time a MWAIT can halt in userspace is controlled by the
kernel and can be adjusted by the sysadmin.
- Speed up the MTRR handling code on CPUs which support cache
self-snooping correctly.
On those CPUs the wbinvd() invocations can be omitted which speeds
up the MTRR setup by a factor of 50.
- Support for the new x86 vendor Zhaoxin who develops processors
based on the VIA Centaur technology.
- Prevent 'cat /proc/cpuinfo' from affecting isolated NOHZ_FULL CPUs
by sending IPIs to retrieve the CPU frequency and use the cached
values instead.
- The addition and late revert of the FSGSBASE support. The revert
was required as it turned out that the code still has hard to
diagnose issues. Yet another engineering trainwreck...
- Small fixes, cleanups, improvements and the usual new Intel CPU
family/model addons"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
x86/fsgsbase: Revert FSGSBASE support
selftests/x86/fsgsbase: Fix some test case bugs
x86/entry/64: Fix and clean up paranoid_exit
x86/entry/64: Don't compile ignore_sysret if 32-bit emulation is enabled
selftests/x86: Test SYSCALL and SYSENTER manually with TF set
x86/mtrr: Skip cache flushes on CPUs with cache self-snooping
x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
Documentation/ABI: Document umwait control sysfs interfaces
x86/umwait: Add sysfs interface to control umwait maximum time
x86/umwait: Add sysfs interface to control umwait C0.2 state
x86/umwait: Initialize umwait control values
x86/cpufeatures: Enumerate user wait instructions
x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs
x86/acpi/cstate: Add Zhaoxin processors support for cache flush policy in C3
ACPI, x86: Add Zhaoxin processors support for NONSTOP TSC
x86/cpu: Create Zhaoxin processors architecture support file
x86/cpu: Split Tremont based Atoms from the rest
Documentation/x86/64: Add documentation for GS/FS addressing mode
x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit
...
Pull x86 FPU updates from Thomas Gleixner:
"A small set of updates for the FPU code:
- Make the no387/nofxsr command line options useful by restricting
them to 32bit and actually clearing all dependencies to prevent
random crashes and malfunction.
- Simplify and cleanup the kernel_fpu_*() helpers"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Inline fpu__xstate_clear_all_cpu_caps()
x86/fpu: Make 'no387' and 'nofxsr' command line options useful
x86/fpu: Remove the fpu__save() export
x86/fpu: Simplify kernel_fpu_begin()
x86/fpu: Simplify kernel_fpu_end()
Pull x86 vsyscall updates from Thomas Gleixner:
"Further hardening of the legacy vsyscall by providing support for
execute only mode and switching the default to it.
This prevents a certain class of attacks which rely on the vsyscall
page being accessible at a fixed address in the canonical kernel
address space"
* 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/x86: Add a test for process_vm_readv() on the vsyscall page
x86/vsyscall: Add __ro_after_init to global variables
x86/vsyscall: Change the default vsyscall mode to xonly
selftests/x86/vsyscall: Verify that vsyscall=none blocks execution
x86/vsyscall: Document odd SIGSEGV error code for vsyscalls
x86/vsyscall: Show something useful on a read fault
x86/vsyscall: Add a new vsyscall=xonly mode
Documentation/admin: Remove the vsyscall=native documentation
Pull x96 apic updates from Thomas Gleixner:
"Updates for the x86 APIC interrupt handling and APIC timer:
- Fix a long standing issue with spurious interrupts which was caused
by the big vector management rework a few years ago. Robert Hodaszi
provided finally enough debug data and an excellent initial failure
analysis which allowed to understand the underlying issues.
This contains a change to the core interrupt management code which
is required to handle this correctly for the APIC/IO_APIC. The core
changes are NOOPs for most architectures except ARM64. ARM64 is not
impacted by the change as confirmed by Marc Zyngier.
- Newer systems allow to disable the PIT clock for power saving
causing panic in the timer interrupt delivery check of the IO/APIC
when the HPET timer is not enabled either. While the clock could be
turned on this would cause an endless whack a mole game to chase
the proper register in each affected chipset.
These systems provide the relevant frequencies for TSC, CPU and the
local APIC timer via CPUID and/or MSRs, which allows to avoid the
PIT/HPET based calibration. As the calibration code is the only
usage of the legacy timers on modern systems and is skipped anyway
when the frequencies are known already, there is no point in
setting up the PIT and actually checking for the interrupt delivery
via IO/APIC.
To achieve this on a wide variety of platforms, the CPUID/MSR based
frequency readout has been made more robust, which also allowed to
remove quite some workarounds which turned out to be not longer
required. Thanks to Daniel Drake for analysis, patches and
verification"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Seperate unused system vectors from spurious entry again
x86/irq: Handle spurious interrupt after shutdown gracefully
x86/ioapic: Implement irq_get_irqchip_state() callback
genirq: Add optional hardware synchronization for shutdown
genirq: Fix misleading synchronize_irq() documentation
genirq: Delay deactivation in free_irq()
x86/timer: Skip PIT initialization on modern chipsets
x86/apic: Use non-atomic operations when possible
x86/apic: Make apic_bsp_setup() static
x86/tsc: Set LAPIC timer period to crystal clock frequency
x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period'
x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency
Pull timer updates from Thomas Gleixner:
"The timer and timekeeping departement delivers:
Core:
- The consolidation of the VDSO code into a generic library including
the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
route through the relevant maintainer trees and should end up in
5.4.
This gets rid of the unnecessary different copies of the same code
and brings all architectures on the same level of VDSO
functionality.
- Make the NTP user space interface more robust by restricting the
TAI offset to prevent undefined behaviour. Includes a selftest.
- Validate user input in the compat settimeofday() syscall to catch
invalid values which would be turned into valid values by a
multiplication overflow
- Consolidate the time accessors
- Small fixes, improvements and cleanups all over the place
Drivers:
- Support for the NXP system counter, TI davinci timer
- Move the Microsoft HyperV clocksource/events code into the
drivers/clocksource directory so it can be shared between x86 and
ARM64.
- Overhaul of the Tegra driver
- Delay timer support for IXP4xx
- Small fixes, improvements and cleanups as usual"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
time: Validate user input in compat_settimeofday()
timer: Document TIMER_PINNED
clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
clocksource/drivers: Make Hyper-V clocksource ISA agnostic
MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
hrtimer: Use a bullet for the returns bullet list
arm64: vdso: Fix compilation with clang older than 8
arm64: compat: Fix __arch_get_hw_counter() implementation
arm64: Fix __arch_get_hw_counter() implementation
lib/vdso: Make delta calculation work correctly
MAINTAINERS: Add entry for the generic VDSO library
arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
arm64: vdso: Remove unnecessary asm-offsets.c definitions
vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
clocksource/drivers/davinci: Add support for clocksource
clocksource/drivers/davinci: Add support for clockevents
clocksource/drivers/tegra: Set up maximum-ticks limit properly
clocksource/drivers/tegra: Cycles can't be 0
clocksource/drivers/tegra: Restore base address before cleanup
clocksource/drivers/tegra: Add verbose definition for 1MHz constant
...
All fpu__xstate_clear_all_cpu_caps() does is to invoke one simple
function since commit
73e3a7d2a7 ("x86/fpu: Remove the explicit clearing of XSAVE dependent features")
so invoke that function directly and remove the wrapper.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704060743.rvew4yrjd6n33uzx@linutronix.de
KVM does not have 100% coverage of VMX consistency checks, i.e. some
checks that cause VM-Fail may only be detected by hardware during a
nested VM-Entry. In such a case, KVM must restore L1's state to the
pre-VM-Enter state as L2's state has already been loaded into KVM's
software model.
L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*. But
when EPT is disabled, the associated fields hold KVM's shadow values,
not L1's "real" values. Fortunately, when EPT is disabled the PDPTRs
come from memory, i.e. are not cached in the VMCS. Which leaves CR3
as the sole anomaly.
A previously applied workaround to handle CR3 was to force nested early
checks if EPT is disabled:
commit 2b27924bb1 ("KVM: nVMX: always use early vmcs check when EPT
is disabled")
Forcing nested early checks is undesirable as doing so adds hundreds of
cycles to every nested VM-Entry. Rather than take this performance hit,
handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested
VM-Entry when EPT is disabled *and* nested early checks are disabled.
By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will
naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3.
These shenanigans work because nested_vmx_restore_host_state() does a
full kvm_mmu_reset_context(), i.e. unloads the current MMU, which
guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3
prior to re-entering L1.
vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via:
nested_vmx_restore_host_state() ->
kvm_mmu_reset_context() ->
kvm_mmu_unload() ->
kvm_mmu_free_roots()
kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank
on 'root_hpa == INVALID_PAGE' unless the implementation of
kvm_mmu_reset_context() is changed.
On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a
successful entry) via:
vcpu_enter_guest() ->
kvm_mmu_reload() ->
kvm_mmu_load() ->
kvm_mmu_load_cr3() ->
vmx_set_cr3()
Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled
as a "late" VM-Fail should never happen win that case (KVM WARNs), and
the conditional write avoids the need to restore the correct GUEST_CR3
when nested_vmx_check_vmentry_hw() fails.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The FSGSBASE series turned out to have serious bugs and there is still an
open issue which is not fully understood yet.
The confidence in those changes has become close to zero especially as the
test cases which have been shipped with that series were obviously never
run before sending the final series out to LKML.
./fsgsbase_64 >/dev/null
Segmentation fault
As the merge window is close, the only sane decision is to revert FSGSBASE
support. The revert is necessary as this branch has been merged into
perf/core already and rebasing all of that a few days before the merge
window is not the most brilliant idea.
I could definitely slap myself for not noticing the test case fail when
merging that series, but TBH my expectations weren't that low back
then. Won't happen again.
Revert the following commits:
539bca535d ("x86/entry/64: Fix and clean up paranoid_exit")
2c7b5ac5d5 ("Documentation/x86/64: Add documentation for GS/FS addressing mode")
f987c955c7 ("x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2")
2032f1f96e ("x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit")
5bf0cab60e ("x86/entry/64: Document GSBASE handling in the paranoid path")
708078f657 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
79e1932fa3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
1d07316b13 ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
f60a83df45 ("x86/process/64: Use FSGSBASE instructions on thread copy and ptrace")
1ab5f3f7fe ("x86/process/64: Use FSBSBASE in switch_to() if available")
a86b462513 ("x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions")
8b71340d70 ("x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions")
b64ed19b93 ("x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Hyper-V clock/timer code and data structures are currently mixed
in with other code in the ISA independent drivers/hv directory as
well as the ISA dependent Hyper-V code under arch/x86.
Consolidate this code and data structures into a Hyper-V clocksource driver
to better follow the Linux model. In doing so, separate out the ISA
dependent portions so the new clocksource driver works for x86 and for the
in-process Hyper-V on ARM64 code.
To start, move the existing clockevents code to create the new clocksource
driver. Update the VMbus driver to call initialization and cleanup routines
since the Hyper-V synthetic timers are not independently enumerated in
ACPI.
No behavior is changed and no new functionality is added.
Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "bp@alien8.de" <bp@alien8.de>
Cc: "will.deacon@arm.com" <will.deacon@arm.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: "sashal@kernel.org" <sashal@kernel.org>
Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
Cc: "arnd@arndb.de" <arnd@arndb.de>
Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
Cc: "paul.burton@mips.com" <paul.burton@mips.com>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "salyzyn@android.com" <salyzyn@android.com>
Cc: "pcc@google.com" <pcc@google.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>
Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
Cc: "huw@codeweavers.com" <huw@codeweavers.com>
Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Link: https://lkml.kernel.org/r/1561955054-1838-2-git-send-email-mikelley@microsoft.com
Quite some time ago the interrupt entry stubs for unused vectors in the
system vector range got removed and directly mapped to the spurious
interrupt vector entry point.
Sounds reasonable, but it's subtly broken. The spurious interrupt vector
entry point pushes vector number 0xFF on the stack which makes the whole
logic in __smp_spurious_interrupt() pointless.
As a consequence any spurious interrupt which comes from a vector != 0xFF
is treated as a real spurious interrupt (vector 0xFF) and not
acknowledged. That subsequently stalls all interrupt vectors of equal and
lower priority, which brings the system to a grinding halt.
This can happen because even on 64-bit the system vector space is not
guaranteed to be fully populated. A full compile time handling of the
unused vectors is not possible because quite some of them are conditonally
populated at runtime.
Bring the entry stubs back, which wastes 160 bytes if all stubs are unused,
but gains the proper handling back. There is no point to selectively spare
some of the stubs which are known at compile time as the required code in
the IDT management would be way larger and convoluted.
Do not route the spurious entries through common_interrupt and do_IRQ() as
the original code did. Route it to smp_spurious_interrupt() which evaluates
the vector number and acts accordingly now that the real vector numbers are
handed in.
Fixup the pr_warn so the actual spurious vector (0xff) is clearly
distiguished from the other vectors and also note for the vectored case
whether it was pending in the ISR or not.
"Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen."
"Spurious interrupt vector 0xed on CPU#1. Acked."
"Spurious interrupt vector 0xee on CPU#1. Not pending!."
Fixes: 2414e021ac ("x86: Avoid building unused IRQ entry stubs")
Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jan Beulich <jbeulich@suse.com>
Link: https://lkml.kernel.org/r/20190628111440.550568228@linutronix.de
Since the rework of the vector management, warnings about spurious
interrupts have been reported. Robert provided some more information and
did an initial analysis. The following situation leads to these warnings:
CPU 0 CPU 1 IO_APIC
interrupt is raised
sent to CPU1
Unable to handle
immediately
(interrupts off,
deep idle delay)
mask()
...
free()
shutdown()
synchronize_irq()
clear_vector()
do_IRQ()
-> vector is clear
Before the rework the vector entries of legacy interrupts were statically
assigned and occupied precious vector space while most of them were
unused. Due to that the above situation was handled silently because the
vector was handled and the core handler of the assigned interrupt
descriptor noticed that it is shut down and returned.
While this has been usually observed with legacy interrupts, this situation
is not limited to them. Any other interrupt source, e.g. MSI, can cause the
same issue.
After adding proper synchronization for level triggered interrupts, this
can only happen for edge triggered interrupts where the IO-APIC obviously
cannot provide information about interrupts in flight.
While the spurious warning is actually harmless in this case it worries
users and driver developers.
Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead
of VECTOR_UNUSED when the vector is freed up.
If that above late handling happens the spurious detector will not complain
and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on
that line will trigger the spurious warning as before.
Fixes: 464d12309e ("x86/vector: Switch IOAPIC to global reservation mode")
Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>-
Tested-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.459647741@linutronix.de
When sending a call-function IPI-many to vCPUs, yield if any of
the IPI target vCPUs was preempted, we just select the first
preempted target vCPU which we found since the state of target
vCPUs can change underneath and to avoid race conditions.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Doing the indirection through macros for the regs accessors just
makes them harder to read, so implement the helpers directly.
Note that only the helpers actually used are implemented now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Pull perf fixes from Ingo Molnar:
"Various fixes, most of them related to bugs perf fuzzing found in the
x86 code"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/regs: Use PERF_REG_EXTENDED_MASK
perf/x86: Remove pmu->pebs_no_xmm_regs
perf/x86: Clean up PEBS_XMM_REGS
perf/x86/regs: Check reserved bits
perf/x86: Disable extended registers for non-supported PMUs
perf/ioctl: Add check for the sample_period value
perf/core: Fix perf_sample_regs_user() mm check
Recent Intel chipsets including Skylake and ApolloLake have a special
ITSSPRC register which allows the 8254 PIT to be gated. When gated, the
8254 registers can still be programmed as normal, but there are no IRQ0
timer interrupts.
Some products such as the Connex L1430 and exone go Rugged E11 use this
register to ship with the PIT gated by default. This causes Linux to fail
to boot:
Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with
apic=debug and send a report.
The panic happens before the framebuffer is initialized, so to the user, it
appears as an early boot hang on a black screen.
Affected products typically have a BIOS option that can be used to enable
the 8254 and make Linux work (Chipset -> South Cluster Configuration ->
Miscellaneous Configuration -> 8254 Clock Gating), however it would be best
to make Linux support the no-8254 case.
Modern sytems allow to discover the TSC and local APIC timer frequencies,
so the calibration against the PIT is not required. These systems have
always running timers and the local APIC timer works also in deep power
states.
So the setup of the PIT including the IO-APIC timer interrupt delivery
checks are a pointless exercise.
Skip the PIT setup and the IO-APIC timer interrupt checks on these systems,
which avoids the panic caused by non ticking PITs and also speeds up the
boot process.
Thanks to Daniel for providing the changelog, initial analysis of the
problem and testing against a variety of machines.
Reported-by: Daniel Drake <drake@endlessm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: linux@endlessm.com
Cc: rafael.j.wysocki@intel.com
Cc: hdegoede@redhat.com
Link: https://lkml.kernel.org/r/20190628072307.24678-1-drake@endlessm.com
The current kernel supports 5-level paging mode, and supports dynamically
choosing the paging mode during bootup depending on the kernel image,
hardware and kernel parameter settings. This flexibility brings several
issues to kexec/kdump:
1) Dynamic switching between paging modes requires support in the target
kernel. This means kexec from a 5-level paging kernel into a kernel
which does not support mode switching is not possible. So the loader
needs to be able to analyze the supported paging modes of the kexec
target kernel.
2) If running on a 5-level paging kernel and the kexec target kernel is a
4-level paging kernel, the target immage cannot be loaded above the 64TB
address space limit. But the kexec loader searches for a load area from
top to bottom which would eventually put the target kernel above 64TB
when the machine has large enough RAM size. So the loader needs to be
able to analyze the paging mode of the target kernel to load it at a
suitable spot in the address space.
Solution:
Add two bits XLF_5LEVEL and XLF_5LEVEL_ENABLED:
- Bit XLF_5LEVEL indicates whether 5-level paging mode switching support
is available. (Issue #1)
- Bit XLF_5LEVEL_ENABLED indicates whether the kernel was compiled with
full 5-level paging support (CONFIG_X86_5LEVEL=y). (Issue #2)
The loader will use these bits to verify whether the target kernel is
suitable to be kexec'ed to from a 5-level paging kernel and to determine
the constraints of the target kernel load address.
The flags will be used by the kernel kexec subsystem and the userspace
kexec tools.
[ tglx: Massaged changelog ]
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: dyoung@redhat.com
Link: https://lkml.kernel.org/r/20190524073810.24298-2-bhe@redhat.com
Instead of allocating yet another data structure, move the clock event data
into the channel structure. This allows further consolidation of the
reservation code and the reuse of the cached boot config to replace the
extra flags in the clockevent data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Link: https://lkml.kernel.org/r/20190623132436.185851116@linutronix.de
Just segfaulting the application when it tries to read the vsyscall page in
xonly mode is not helpful for those who need to debug it.
Emit a hint.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jann Horn <jannh@google.com>
Link: https://lkml.kernel.org/r/8016afffe0eab497be32017ad7f6f7030dc3ba66.1561610354.git.luto@kernel.org
The x86 vdso implementation on which the generic vdso library is based on
has subtle (unfortunately undocumented) twists:
1) The code assumes that the clocksource mask is U64_MAX which means that
no bits are masked. Which is true for any valid x86 VDSO clocksource.
Stupidly it still did the mask operation for no reason and at the wrong
place right after reading the clocksource.
2) It contains a sanity check to catch the case where slightly
unsynchronized TSC values can be observed which would cause the delta
calculation to make a huge jump. It therefore checks whether the
current TSC value is larger than the value on which the current
conversion is based on. If it's not larger the base value is used to
prevent time jumps.
#1 Is not only stupid for the X86 case because it does the masking for no
reason it is also completely wrong for clocksources with a smaller mask
which can legitimately wrap around during a conversion period. The core
timekeeping code does it correct by applying the mask after the delta
calculation:
(now - base) & mask
#2 is equally broken for clocksources which have smaller masks and can wrap
around during a conversion period because there the now > base check is
just wrong and causes stale time stamps and time going backwards issues.
Unbreak it by:
1) Removing the mask operation from the clocksource read which makes the
fallback detection work for all clocksources
2) Replacing the conditional delta calculation with a overrideable inline
function.
#2 could reuse clocksource_delta() from the timekeeping code but that
results in a significant performance hit for the x86 VSDO. The timekeeping
core code must have the non optimized version as it has to operate
correctly with clocksources which have smaller masks as well to handle the
case where TSC is discarded as timekeeper clocksource and replaced by HPET
or pmtimer. For the VDSO there is no replacement clocksource. If TSC is
unusable the syscall is enforced which does the right thing.
To accommodate to the needs of various architectures provide an
override-able inline function which defaults to the regular delta
calculation with masking:
(now - base) & mask
Override it for x86 with the non-masking and checking version.
This unbreaks the ARM64 syscall fallback operation, allows to use
clocksources with arbitrary width and preserves the performance
optimization for x86.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: catalin.marinas@arm.com
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux@armlinux.org.uk
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: paul.burton@mips.com
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: salyzyn@android.com
Cc: pcc@google.com
Cc: shuah@kernel.org
Cc: 0x7f454c46@gmail.com
Cc: linux@rasmusvillemoes.dk
Cc: huw@codeweavers.com
Cc: sthotton@marvell.com
Cc: andre.przywara@arm.com
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906261159230.32342@nanos.tec.linutronix.de
Now that x86_32 has an unconditional gap on the kernel stack frame,
the int3_emulate_push() thing will work without further changes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently pt_regs on x86_32 has an oddity in that kernel regs
(!user_mode(regs)) are short two entries (esp/ss). This means that any
code trying to use them (typically: regs->sp) needs to jump through
some unfortunate hoops.
Change the entry code to fix this up and create a full pt_regs frame.
This then simplifies various trampolines in ftrace and kprobes, the
stack unwinder, ptrace, kdump and kgdb.
Much thanks to Josh for help with the cleanups!
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for wider use, move the ENCODE_FRAME_POINTER macros to
a common header and provide inline asm versions.
These macros are used to encode a pt_regs frame for the unwinder; see
unwind_frame.c:decode_frame_pointer().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
umwait or tpause allows the processor to enter a light-weight
power/performance optimized state (C0.1 state) or an improved
power/performance optimized state (C0.2 state) for a period specified by
the instruction or until the system time limit or until a store to the
monitored address range in umwait.
IA32_UMWAIT_CONTROL MSR register allows the OS to enable/disable C0.2 on
the processor and to set the maximum time the processor can reside in C0.1
or C0.2.
By default C0.2 is enabled so the user wait instructions can enter the
C0.2 state to save more power with slower wakeup time.
Andy Lutomirski proposed to set the maximum umwait time to 100000 cycles by
default. A quote from Andy:
"What I want to avoid is the case where it works dramatically differently
on NO_HZ_FULL systems as compared to everything else. Also, UMWAIT may
behave a bit differently if the max timeout is hit, and I'd like that
path to get exercised widely by making it happen even on default
configs."
A sysfs interface to adjust the time and the C0.2 enablement is provided in
a follow up change.
[ tglx: Renamed MSR_IA32_UMWAIT_CONTROL_MAX_TIME to
MSR_IA32_UMWAIT_CONTROL_TIME_MASK because the constant is used as
mask throughout the code.
Massaged comments and changelog ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov" <bp@alien8.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Link: https://lkml.kernel.org/r/1560994438-235698-3-git-send-email-fenghua.yu@intel.com
umonitor, umwait, and tpause are a set of user wait instructions.
umonitor arms address monitoring hardware using an address. The
address range is determined by using CPUID.0x5. A store to
an address within the specified address range triggers the
monitoring hardware to wake up the processor waiting in umwait.
umwait instructs the processor to enter an implementation-dependent
optimized state while monitoring a range of addresses. The optimized
state may be either a light-weight power/performance optimized state
(C0.1 state) or an improved power/performance optimized state
(C0.2 state).
tpause instructs the processor to enter an implementation-dependent
optimized state C0.1 or C0.2 state and wake up when time-stamp counter
reaches specified timeout.
The three instructions may be executed at any privilege level.
The instructions provide power saving method while waiting in
user space. Additionally, they can allow a sibling hyperthread to
make faster progress while this thread is waiting. One example of an
application usage of umwait is when waiting for input data from another
application, such as a user level multi-threaded packet processing
engine.
Availability of the user wait instructions is indicated by the presence
of the CPUID feature flag WAITPKG CPUID.0x07.0x0:ECX[5].
Detailed information on the instructions and CPUID feature WAITPKG flag
can be found in the latest Intel Architecture Instruction Set Extensions
and Future Features Programming Reference and Intel 64 and IA-32
Architectures Software Developer's Manual.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov" <bp@alien8.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Link: https://lkml.kernel.org/r/1560994438-235698-2-git-send-email-fenghua.yu@intel.com
Clean up the vDSO code a bit by giving pvclock_page and hvclock_page
their actual types instead of u8[PAGE_SIZE]. This shouldn't
materially affect the generated code.
Heavily based on a patch from Linus.
[ tglx: Adapted to the unified VDSO code ]
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/6920c5188f8658001af1fc56fd35b815706d300c.1561241273.git.luto@kernel.org
The generic vDSO library provides an implementation of clock_getres()
that can be leveraged by each architecture.
Add the clock_getres() VDSO entry point on x86.
[ tglx: Massaged changelog and cleaned up the function signature formatting ]
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Huw Davies <huw@codeweavers.com>
Cc: Shijith Thotton <sthotton@marvell.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Link: https://lkml.kernel.org/r/20190621095252.32307-24-vincenzo.frascino@arm.com
The x86 vDSO library requires some adaptations to take advantage of the
newly introduced generic vDSO library.
Introduce the following changes:
- Modification of vdso.c to be compliant with the common vdso datapage
- Use of lib/vdso for gettimeofday
[ tglx: Massaged changelog and cleaned up the function signature formatting ]
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Huw Davies <huw@codeweavers.com>
Cc: Shijith Thotton <sthotton@marvell.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Link: https://lkml.kernel.org/r/20190621095252.32307-23-vincenzo.frascino@arm.com
With sensitive CR4 bits pinned now, it's possible that the WP bit for
CR0 might become a target as well.
Following the same reasoning for the CR4 pinning, pin CR0's WP
bit. Contrary to the cpu feature dependend CR4 pinning this can be done
with a constant value.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: https://lkml.kernel.org/r/20190618045503.39105-4-keescook@chromium.org
Several recent exploits have used direct calls to the native_write_cr4()
function to disable SMEP and SMAP before then continuing their exploits
using userspace memory access.
Direct calls of this form can be mitigate by pinning bits of CR4 so that
they cannot be changed through a common function. This is not intended to
be a general ROP protection (which would require CFI to defend against
properly), but rather a way to avoid trivial direct function calling (or
CFI bypasses via a matching function prototype) as seen in:
https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308)
The goals of this change:
- Pin specific bits (SMEP, SMAP, and UMIP) when writing CR4.
- Avoid setting the bits too early (they must become pinned only after
CPU feature detection and selection has finished).
- Pinning mask needs to be read-only during normal runtime.
- Pinning needs to be checked after write to validate the cr4 state
Using __ro_after_init on the mask is done so it can't be first disabled
with a malicious write.
Since these bits are global state (once established by the boot CPU and
kernel boot parameters), they are safe to write to secondary CPUs before
those CPUs have finished feature detection. As such, the bits are set at
the first cr4 write, so that cr4 write bugs can be detected (instead of
silently papered over). This uses a few bytes less storage of a location we
don't have: read-only per-CPU data.
A check is performed after the register write because an attack could just
skip directly to the register write. Such a direct jump is possible because
of how this function may be built by the compiler (especially due to the
removal of frame pointers) where it doesn't add a stack frame (function
exit may only be a retq without pops) which is sufficient for trivial
exploitation like in the timer overwrites mentioned above).
The asm argument constraints gain the "+" modifier to convince the compiler
that it shouldn't make ordering assumptions about the arguments or memory,
and treat them as changed.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: https://lkml.kernel.org/r/20190618045503.39105-3-keescook@chromium.org
Add x86 architecture support for new Zhaoxin processors.
Carve out initialization code needed by Zhaoxin processors into
a separate compilation unit.
To identify Zhaoxin CPU, add a new vendor type X86_VENDOR_ZHAOXIN
for system recognition.
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "hpa@zytor.com" <hpa@zytor.com>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "rjw@rjwysocki.net" <rjw@rjwysocki.net>
Cc: "lenb@kernel.org" <lenb@kernel.org>
Cc: David Wang <DavidWang@zhaoxin.com>
Cc: "Cooper Yan(BJ-RD)" <CooperYan@zhaoxin.com>
Cc: "Qiyuan Wang(BJ-RD)" <QiyuanWang@zhaoxin.com>
Cc: "Herry Yang(BJ-RD)" <HerryYang@zhaoxin.com>
Link: https://lkml.kernel.org/r/01042674b2f741b2aed1f797359bdffb@zhaoxin.com
Split Tremont based Atoms from the rest to keep logical grouping.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20190617115537.33309-1-andriy.shevchenko@linux.intel.com
The kernel needs to explicitly enable FSGSBASE. So, the application needs
to know if it can safely use these instructions. Just looking at the CPUID
bit is not enough because it may be running in a kernel that does not
enable the instructions.
One way for the application would be to just try and catch the SIGILL.
But that is difficult to do in libraries which may not want to overwrite
the signal handlers of the main application.
Enumerate the enabled FSGSBASE capability in bit 1 of AT_HWCAP2 in the ELF
aux vector. AT_HWCAP2 is already used by PPC for similar purposes.
The application can access it open coded or by using the getauxval()
function in newer versions of glibc.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lkml.kernel.org/r/1557309753-24073-18-git-send-email-chang.seok.bae@intel.com
GSBASE is used to find per-CPU data in the kernel. But when GSBASE is
unknown, the per-CPU base can be found from the per_cpu_offset table with a
CPU NR. The CPU NR is extracted from the limit field of the CPUNODE entry
in GDT, or by the RDPID instruction. This is a prerequisite for using
FSGSBASE in the low level entry code.
Also, add the GAS-compatible RDPID macro as binutils 2.21 do not support
it. Support is added in version 2.27.
[ tglx: Massaged changelog ]
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/1557309753-24073-12-git-send-email-chang.seok.bae@intel.com
Add cpu feature conditional FSGSBASE access to the relevant helper
functions. That allows to accelerate certain FS/GS base operations in
subsequent changes.
Note, that while possible, the user space entry/exit GSBASE operations are
not going to use the new FSGSBASE instructions. The reason is that it would
require additional storage for the user space value which adds more
complexity to the low level code and experiments have shown marginal
benefit. This may be revisited later but for now the SWAPGS based handling
in the entry code is preserved except for the paranoid entry/exit code.
To preserve the SWAPGS entry mechanism introduce __[rd|wr]gsbase_inactive()
helpers. Note, for Xen PV, paravirt hooks can be added later as they might
allow a very efficient but different implementation.
[ tglx: Massaged changelog ]
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lkml.kernel.org/r/1557309753-24073-7-git-send-email-chang.seok.bae@intel.com
[ luto: Rename the variables from FS and GS to FSBASE and GSBASE and
make <asm/fsgsbase.h> safe to include on 32-bit kernels. ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lkml.kernel.org/r/1557309753-24073-6-git-send-email-chang.seok.bae@intel.com
Another round of SPDX updates for 5.2-rc6
Here is what I am guessing is going to be the last "big" SPDX update for
5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that
were "easy" to determine by pattern matching. The ones after this are
going to be a bit more difficult and the people on the spdx list will be
discussing them on a case-by-case basis now.
Another 5000+ files are fixed up, so our overall totals are:
Files checked: 64545
Files with SPDX: 45529
Compared to the 5.1 kernel which was:
Files checked: 63848
Files with SPDX: 22576
This is a huge improvement.
Also, we deleted another 20000 lines of boilerplate license crud, always
nice to see in a diffstat.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQyQYA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymnGQCghETUBotn1p3hTjY56VEs6dGzpHMAnRT0m+lv
kbsjBGEJpLbMRB2krnaU
=RMcT
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull still more SPDX updates from Greg KH:
"Another round of SPDX updates for 5.2-rc6
Here is what I am guessing is going to be the last "big" SPDX update
for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates
that were "easy" to determine by pattern matching. The ones after this
are going to be a bit more difficult and the people on the spdx list
will be discussing them on a case-by-case basis now.
Another 5000+ files are fixed up, so our overall totals are:
Files checked: 64545
Files with SPDX: 45529
Compared to the 5.1 kernel which was:
Files checked: 63848
Files with SPDX: 22576
This is a huge improvement.
Also, we deleted another 20000 lines of boilerplate license crud,
always nice to see in a diffstat"
* tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
...
This cleanly handles arches who do not yet define clone3.
clone3() was initially placed under __ARCH_WANT_SYS_CLONE under the
assumption that this would cleanly handle all architectures. It does
not.
Architectures such as nios2 or h8300 simply take the asm-generic syscall
definitions and generate their syscall table from it. Since they don't
define __ARCH_WANT_SYS_CLONE the build would fail complaining about
sys_clone3 missing. The reason this doesn't happen for legacy clone is
that nios2 and h8300 provide assembly stubs for sys_clone. This seems to
be done for architectural reasons.
The build failures for nios2 and h8300 were caught int -next luckily.
The solution is to define __ARCH_WANT_SYS_CLONE3 that architectures can
add. Additionally, we need a cond_syscall(clone3) for architectures such
as nios2 or h8300 that generate their syscall table in the way I
explained above.
Fixes: 8f3220a806 ("arch: wire-up clone3() syscall")
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
for nested state save/restore.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJdC7NHAAoJEL/70l94x66DHm0H/R8L80sWe1OJbHHK8caPpwm2
mPt6JNcG/ysbG/uoMuVsdRAjZsg9l8JZB9xfA2m/ZPQQThjSG/WX0rU+gWMMI3X8
8ZbN4BCFoiNpOzOkhmStwzMWnvovKvMfhFW0BAI3HLUfM9A+XyVvNM/JbLOvEMRk
WB2SxYRc38ZvIbi8eXgsoFrVyLFB2Fj/0jps4FbKnkjkl37PTDehYLWQ1pt9KsWS
2KdGoXm7/18ottqf0DPfLe0hiiiDuK3akKz7WQBMsAJHi4Fm5j39NuseeRdlablk
uE4vM/sVaLn4xwM9JfrsBl9TzZ2qHsOTRlMQG4iNWjEAuPKa45lt0Jo7OBs6DSY=
=Lzxe
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Fixes for ARM and x86, plus selftest patches and nicer structs for
nested state save/restore"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: nVMX: reorganize initial steps of vmx_set_nested_state
KVM: arm/arm64: Fix emulated ptimer irq injection
tests: kvm: Check for a kernel warning
kvm: tests: Sort tests in the Makefile alphabetically
KVM: x86/mmu: Allocate PAE root array when using SVM's 32-bit NPT
KVM: x86: Modify struct kvm_nested_state to have explicit fields for data
KVM: fix typo in documentation
KVM: nVMX: use correct clean fields when copying from eVMCS
KVM: arm/arm64: vgic: Fix kvm_device leak in vgic_its_destroy
KVM: arm64: Filter out invalid core register IDs in KVM_GET_REG_LIST
KVM: arm64: Implement vq_present() as a macro
AVX512 BFLOAT16 instructions support 16-bit BFLOAT16 floating-point
format (BF16) for deep learning optimization.
BF16 is a short version of 32-bit single-precision floating-point
format (FP32) and has several advantages over 16-bit half-precision
floating-point format (FP16). BF16 keeps FP32 accumulation after
multiplication without loss of precision, offers more than enough
range for deep learning training tasks, and doesn't need to handle
hardware exception.
AVX512 BFLOAT16 instructions are enumerated in CPUID.7.1:EAX[bit 5]
AVX512_BF16.
CPUID.7.1:EAX contains only feature bits. Reuse the currently empty
word 12 as a pure features word to hold the feature bits including
AVX512_BF16.
Detailed information of the CPUID bit and AVX512 BFLOAT16 instructions
can be found in the latest Intel Architecture Instruction Set Extensions
and Future Features Programming Reference.
[ bp: Check CPUID(7) subleaf validity before accessing subleaf 1. ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <namit@vmware.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: Robert Hoo <robert.hu@linux.intel.com>
Cc: "Sean J Christopherson" <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: x86 <x86@kernel.org>
Link: https://lkml.kernel.org/r/1560794416-217638-3-git-send-email-fenghua.yu@intel.com
It's a waste for the four X86_FEATURE_CQM_* feature bits to occupy two
whole feature bits words. To better utilize feature words, re-define
word 11 to host scattered features and move the four X86_FEATURE_CQM_*
features into Linux defined word 11. More scattered features can be
added in word 11 in the future.
Rename leaf 11 in cpuid_leafs to CPUID_LNX_4 to reflect it's a
Linux-defined leaf.
Rename leaf 12 as CPUID_DUMMY which will be replaced by a meaningful
name in the next patch when CPUID.7.1:EAX occupies world 12.
Maximum number of RMID and cache occupancy scale are retrieved from
CPUID.0xf.1 after scattered CQM features are enumerated. Carve out the
code into a separate function.
KVM doesn't support resctrl now. So it's safe to move the
X86_FEATURE_CQM_* features to scattered features word 11 for KVM.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Babu Moger <babu.moger@amd.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: "Sean J Christopherson" <sean.j.christopherson@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: kvm ML <kvm@vger.kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: x86 <x86@kernel.org>
Link: https://lkml.kernel.org/r/1560794416-217638-2-git-send-email-fenghua.yu@intel.com
The memory occupied by the kernel is reserved using memblock_reserve()
in setup_arch(). Currently, the area is from symbols _text to __bss_stop.
Everything after __bss_stop must be specifically reserved otherwise it
is discarded. This is not clearly documented.
Add a new symbol, __end_of_kernel_reserve, that more readily identifies
what is reserved, along with comments that indicate what is reserved,
what is discarded and what needs to be done to prevent a section from
being discarded.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Lianbo Jiang <lijiang@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sinan Kaya <okaya@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "x86@kernel.org" <x86@kernel.org>
Link: https://lkml.kernel.org/r/7db7da45b435f8477f25e66f292631ff766a844c.1560969363.git.thomas.lendacky@amd.com
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this work is licensed under the terms of the gnu gpl version 2 see
the copying file in the top level directory
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 35 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
of VMX nested state data in a struct.
In order to avoid changing the ioctl values of
KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
sizeof(struct kvm_nested_state). This is done by defining the data
struct as "data.vmx[0]". It was the most elegant way I found to
preserve struct size while still keeping struct readable and easy to
maintain. It does have a misfortunate side-effect that now it has to be
accessed as "data.vmx[0]" rather than just "data.vmx".
Because we are already modifying these structs, I also modified the
following:
* Define the "format" field values as macros.
* Rename vmcs_pa to vmcs12_pa for better readability.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
[Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo]
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per commit 1b6269db3f ("KVM: VMX: Handle NMIs before enabling
interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
to "make sure we handle NMI on the current cpu, and that we don't
service maskable interrupts before non-maskable ones". The other
exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
have similar requirements, and are located there to avoid extra VMREADs
since VMX bins hardware exceptions and NMIs into a single exit reason.
Clean up the code and eliminate the vaguely named complete_atomic_exit()
by moving the interrupts-disabled exception and NMI handling into the
existing handle_external_intrs() callback, and rename the callback to
a more appropriate name. Rename VMexit handlers throughout so that the
atomic and non-atomic counterparts have similar names.
In addition to improving code readability, this also ensures the NMI
handler is run with the host's debug registers loaded in the unlikely
event that the user is debugging NMIs. Accuracy of the last_guest_tsc
field is also improved when handling NMIs (and #MCs) as the handler
will run after updating said field.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Naming cleanups. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make it available to AMD hosts as well, just in case someone is trying
to use an Intel processor's CPUID setup.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an MSRs which allows the guest to disable
host polling (specifically the cpuidle-haltpoll,
when performing polling in the guest, disables
host side polling).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since raw_cpu_xchg() doesn't need to be IRQ-safe, like
this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg
loop.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nadav reported that since the this_cpu_*() ops got asm-volatile
constraints on, code generation suffered for do_IRQ(), but since this
is all with IRQs disabled we can use __this_cpu_*().
smp_x86_platform_ipi 234 222 -12,+0
smp_kvm_posted_intr_ipi 74 66 -8,+0
smp_kvm_posted_intr_wakeup_ipi 86 78 -8,+0
smp_apic_timer_interrupt 292 284 -8,+0
smp_kvm_posted_intr_nested_ipi 74 66 -8,+0
do_IRQ 195 187 -8,+0
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent probing at the Linux Kernel Memory Model uncovered a
'surprise'. Strongly ordered architectures where the atomic RmW
primitive implies full memory ordering and
smp_mb__{before,after}_atomic() are a simple barrier() (such as x86)
fail for:
*x = 1;
atomic_inc(u);
smp_mb__after_atomic();
r0 = *y;
Because, while the atomic_inc() implies memory order, it
(surprisingly) does not provide a compiler barrier. This then allows
the compiler to re-order like so:
atomic_inc(u);
*x = 1;
smp_mb__after_atomic();
r0 = *y;
Which the CPU is then allowed to re-order (under TSO rules) like:
atomic_inc(u);
r0 = *y;
*x = 1;
And this very much was not intended. Therefore strengthen the atomic
RmW ops to include a compiler barrier.
NOTE: atomic_{or,and,xor} and the bitops already had the compiler
barrier.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the jump label of a static key is transformed via the arch
specific function:
void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
The new approach (batch mode) uses two arch functions, the first has the
same arguments of the arch_jump_label_transform(), and is the function:
bool arch_jump_label_transform_queue(struct jump_entry *entry,
enum jump_label_type type)
Rather than transforming the code, it adds the jump_entry in a queue of
entries to be updated. This functions returns true in the case of a
successful enqueue of an entry. If it returns false, the caller must to
apply the queue and then try to queue again, for instance, because the
queue is full.
This function expects the caller to sort the entries by the address before
enqueueuing then. This is already done by the arch independent code, though.
After queuing all jump_entries, the function:
void arch_jump_label_transform_apply(void)
Applies the changes in the queue.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/57b4caa654bad7e3b066301c9a9ae233dea065b5.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the patch of an address is done in three steps:
-- Pseudo-code #1 - Current implementation ---
1) add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)
2) update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)
3) replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---
When a static key has more than one entry, these steps are called once for
each entry. The number of IPIs then is linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n).
This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
which the thread that is enabling the key is _not_ running.
With this patch, rather than receiving a single patch to be processed, a vector
of patches is passed, enabling the rewrite of the pseudo-code #1 in this
way:
-- Pseudo-code #2 - This patch ---
1) for each patch in the vector:
add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)
2) for each patch in the vector:
update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)
3) for each patch in the vector:
replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch ---
Doing the update in this way, the number of IPI becomes O(3) with regard
to the number of keys, which is O(1).
The batch mode is done with the function text_poke_bp_batch(), that receives
two arguments: a vector of "struct text_to_poke", and the number of entries
in the vector.
The vector must be sorted by the addr field of the text_to_poke structure,
enabling the binary search of a handler in the poke_int3_handler function
(a fast path).
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/ca506ed52584c80f64de23f6f55ca288e5d079de.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"The accumulated fixes from this and last week:
- Fix vmalloc TLB flush and map range calculations which lead to
stale TLBs, spurious faults and other hard to diagnose issues.
- Use fault_in_pages_writable() for prefaulting the user stack in the
FPU code as it's less fragile than the current solution
- Use the PF_KTHREAD flag when checking for a kernel thread instead
of current->mm as the latter can give the wrong answer due to
use_mm()
- Compute the vmemmap size correctly for KASLR and 5-Level paging.
Otherwise this can end up with a way too small vmemmap area.
- Make KASAN and 5-level paging work again by making sure that all
invalid bits are masked out when computing the P4D offset. This
worked before but got broken recently when the LDT remap area was
moved.
- Prevent a NULL pointer dereference in the resource control code
which can be triggered with certain mount options when the
requested resource is not available.
- Enforce ordering of microcode loading vs. perf initialization on
secondary CPUs. Otherwise perf tries to access a non-existing MSR
as the boot CPU marked it as available.
- Don't stop the resource control group walk early otherwise the
control bitmaps are not updated correctly and become inconsistent.
- Unbreak kgdb by returning 0 on success from
kgdb_arch_set_breakpoint() instead of an error code.
- Add more Icelake CPU model defines so depending changes can be
queued in other trees"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
x86/kasan: Fix boot with 5-level paging and KASAN
x86/fpu: Don't use current->mm to check for a kthread
x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
x86/resctrl: Don't stop walking closids when a locksetup group is found
x86/fpu: Update kernel's FPU state before using for the fsave header
x86/mm/KASLR: Compute the size of the vmemmap section properly
x86/fpu: Use fault_in_pages_writeable() for pre-faulting
x86/CPU: Add more Icelake model numbers
mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
mm/vmalloc: Fix calculation of direct map addr range
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlz8fAYeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG1asH/3ySguxqtqL1MCBa
4/SZ37PHeWKMerfX6ZyJdgEqK3B+PWlmuLiOMNK5h2bPLzeQQQAmHU/mfKmpXqgB
dHwUbG9yNnyUtTfsfRqAnCA6vpuw9Yb1oIzTCVQrgJLSWD0j7scBBvmzYqguOkto
ThwigLUq3AILr8EfR4rh+GM+5Dn9OTEFAxwil9fPHQo7QoczwZxpURhScT6Co9TB
DqLA3fvXbBvLs/CZy/S5vKM9hKzC+p39ApFTURvFPrelUVnythAM0dPDJg3pIn5u
g+/+gDxDFa+7ANxvxO2ng1sJPDqJMeY/xmjJYlYyLpA33B7zLNk2vDHhAP06VTtr
XCMhQ9s=
=cb80
-----END PGP SIGNATURE-----
Merge tag 'v5.2-rc4' into mauro
We need to pick up post-rc1 changes to various document files so they don't
get lost in Mauro's massive RST conversion push.
Add the CPUID enumeration for Intel's de-feature bits to accommodate
passing these de-features through to kvm guests.
These de-features are (from SDM vol 1, section 8.1.8):
- X86_FEATURE_FDP_EXCPTN_ONLY: If CPUID.(EAX=07H,ECX=0H):EBX[bit 6] = 1, the
data pointer (FDP) is updated only for the x87 non-control instructions that
incur unmasked x87 exceptions.
- X86_FEATURE_ZERO_FCS_FDS: If CPUID.(EAX=07H,ECX=0H):EBX[bit 13] = 1, the
processor deprecates FCS and FDS; it saves each as 0000H.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: marcorr@google.com
Cc: Peter Feiner <pfeiner@google.com>
Cc: pshier@google.com
Cc: Robert Hoo <robert.hu@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190605220252.103406-1-aaronlewis@google.com
current->mm can be non-NULL if a kthread calls use_mm(). Check for
PF_KTHREAD instead to decide when to store user mode FP state.
Fixes: 2722146eb7 ("x86/fpu: Remove fpu->initialized")
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190604175411.GA27477@lst.de
Add the CPUID model number of Ice Lake Neural Network Processor for Deep
Learning Inference (ICL-NNPI) to the Intel family list. Ice Lake NNPI uses
model number 0x9D and this will be documented in a future version of Intel
Software Development Manual.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: platform-driver-x86@vger.kernel.org
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linux PM <linux-pm@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190606012419.13250-1-rajneesh.bhardwaj@linux.intel.com
Use the HYPERVISOR_CALLBACK_VECTOR to notify an ACRN guest.
Co-developed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1559108037-18813-4-git-send-email-yakui.zhao@intel.com
ACRN is an open-source hypervisor maintained by The Linux Foundation. It
is built for embedded IOT with small footprint and real-time features.
Add ACRN guest support so that it allows Linux to be booted under the
ACRN hypervisor. This adds only the barebones implementation.
[ bp: Massage commit message and help text. ]
Co-developed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1559108037-18813-3-git-send-email-yakui.zhao@intel.com
Add a special Kconfig symbol X86_HV_CALLBACK_VECTOR so that the guests
using the hypervisor interrupt callback counter can select and thus
enable that counter. Select it when xen or hyperv support is enabled. No
functional changes.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: linux-hyperv@vger.kernel.org
Cc: Nicolai Stange <nstange@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: xen-devel@lists.xenproject.org
Link: https://lkml.kernel.org/r/1559108037-18813-2-git-send-email-yakui.zhao@intel.com
Mostly due to x86 and acpi conversion, several documentation
links are still pointing to the old file. Fix them.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Reviewed-by: Wolfram Sang <wsa@the-dreams.de>
Reviewed-by: Sven Van Asbroeck <TheSven73@gmail.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Add the CPUID model numbers of Icelake (ICL) desktop and server
processors to the Intel family list.
[ Qiuxu: Sort the macros by model number. ]
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com>
Cc: rui.zhang@intel.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190603134122.13853-1-kan.liang@linux.intel.com
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation version 2 of the license
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 315 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this file is released under the gplv2 see the file copying for more
details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 3 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081035.872590698@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms and conditions of the gnu general public license
version 2 as published by the free software foundation this program
is distributed in the hope it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 51 franklin st fifth floor boston ma 02110
1301 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 111 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.567572064@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 59 temple place suite 330 boston ma 02111
1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 136 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.384967451@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms and conditions of the gnu general public license
version 2 as published by the free software foundation this program
is distributed in the hope it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 59 temple place suite 330 boston ma 02111
1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 33 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000435.254582722@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms and conditions of the gnu general public license
version 2 as published by the free software foundation this program
is distributed in the hope it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 263 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
MSR IA32_MISC_ENABLE bit 18, according to SDM:
| When this bit is set to 0, the MONITOR feature flag is not set (CPUID.01H:ECX[bit 3] = 0).
| This indicates that MONITOR/MWAIT are not supported.
|
| Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 0.
|
| When this bit is set to 1 (default), MONITOR/MWAIT are supported (CPUID.01H:ECX[bit 3] = 1).
The CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR bit,
CPUID.01H:ECX[bit 3] is a better guard than kvm_mwait_in_guest().
kvm_mwait_in_guest() affects the behavior of MONITOR/MWAIT, not its
guest visibility.
This patch implements toggling of the CPUID bit based on guest writes
to the MSR.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Fixes for backwards compatibility - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow guest reads CORE cstate when exposing host CPU power management capabilities
to the guest. PKG cstate is restricted to avoid a guest to get the whole package
information in multi-tenant scenario.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
1. Using X86_FEATURE_ARCH_CAPABILITIES to enumerate the existence of
MSR_IA32_ARCH_CAPABILITIES to avoid using rdmsrl_safe().
2. Since kvm_get_arch_capabilities() is only used in this file, making
it static.
Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a wrapper to invoke kvm_arch_check_processor_compat() so that the
boilerplate ugliness of checking virtualization support on all CPUs is
hidden from the arch specific code. x86's implementation in particular
is quite heinous, as it unnecessarily propagates the out-param pattern
into kvm_x86_ops.
While the x86 specific issue could be resolved solely by changing
kvm_x86_ops, make the change for all architectures as returning a value
directly is prettier and technically more robust, e.g. s390 doesn't set
the out param, which could lead to subtle breakage in the (highly
unlikely) scenario where the out-param was not pre-initialized by the
caller.
Opportunistically annotate svm_check_processor_compat() with __init.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0
Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>