Commit Graph

15791 Commits

Author SHA1 Message Date
Ard Biesheuvel
9cd437ac0e efi/x86: Make fw_vendor, config_table and runtime sysfs nodes x86 specific
There is some code that exposes physical addresses of certain parts of
the EFI firmware implementation via sysfs nodes. These nodes are only
used on x86, and are of dubious value to begin with, so let's move
their handling into the x86 arch code.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-02-23 21:59:42 +01:00
Ard Biesheuvel
0a67361dcd efi/x86: Remove runtime table address from kexec EFI setup data
Since commit 33b85447fa ("efi/x86: Drop two near identical versions
of efi_runtime_init()"), we no longer map the EFI runtime services table
before calling SetVirtualAddressMap(), which means we don't need the 1:1
mapped physical address of this table, and so there is no point in passing
the address via EFI setup data on kexec boot.

Note that the kexec tools will still look for this address in sysfs, so
we still need to provide it.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-02-23 21:59:42 +01:00
Arvind Sankar
67a6af7ad1 x86/boot: Remove KEEP_SEGMENTS support
Commit a24e785111 ("i386: paravirt boot sequence") added this flag for
use by paravirtualized environments such as Xen. However, Xen never made
use of this flag [1], and it was only ever used by lguest [2].

Commit ecda85e702 ("x86/lguest: Remove lguest support") removed
lguest, so KEEP_SEGMENTS has lost its last user.

[1] https://lore.kernel.org/lkml/4D4B097C.5050405@goop.org
[2] https://www.mail-archive.com/lguest@lists.ozlabs.org/msg00469.html

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20200202171353.3736319-2-nivedita@alum.mit.edu
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-02-22 23:37:37 +01:00
Linus Torvalds
1a2a76c268 A set of fixes for X86:
- Ensure that the PIT is set up when the local APIC is disable or
    configured in legacy mode. This is caused by an ordering issue
    introduced in the recent changes which skip PIT initialization when the
    TSC and APIC frequencies are already known.
 
  - Handle malformed SRAT tables during early ACPI parsing which caused an
    infinite loop anda boot hang.
 
  - Fix a long standing race in the affinity setting code which affects PCI
    devices with non-maskable MSI interrupts. The problem is caused by the
    non-atomic writes of the MSI address (destination APIC id) and data
    (vector) fields which the device uses to construct the MSI message. The
    non-atomic writes are mandated by PCI.
 
    If both fields change and the device raises an interrupt after writing
    address and before writing data, then the MSI block constructs a
    inconsistent message which causes interrupts to be lost and subsequent
    malfunction of the device.
 
    The fix is to redirect the interrupt to the new vector on the current
    CPU first and then switch it over to the new target CPU. This allows to
    observe an eventually raised interrupt in the transitional stage (old
    CPU, new vector) to be observed in the APIC IRR and retriggered on the
    new target CPU and the new vector. The potential spurious interrupts
    caused by this are harmless and can in the worst case expose a buggy
    driver (all handlers have to be able to deal with spurious interrupts as
    they can and do happen for various reasons).
 
  - Add the missing suspend/resume mechanism for the HYPERV hypercall page
    which prevents resume hibernation on HYPERV guests. This change got
    lost before the merge window.
 
  - Mask the IOAPIC before disabling the local APIC to prevent potentially
    stale IOAPIC remote IRR bits which cause stale interrupt lines after
    resume.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5AEJwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWY2D/47ur9gsVQGryKzneVAr0SCsq4Un11e
 uifX4ldu4gCEBRTYhpgcpiFKeLvY/QJ6uOD+gQUHyy/s+lCf6yzE6UhXEqSCtcT7
 LkSxD8jAFf6KhMA6iqYBfyxUsPMXBetLjjHWsyc/kf15O/vbYm7qf05timmNZkDS
 S7C+yr3KRqRjLR7G7t4twlgC9aLcNUQihUdsH2qyTvjnlkYHJLDa0/Js7bFYYKVx
 9GdUDLvPFB1mZ76g012De4R3kJsWitiyLlQ38DP5VysKulnszUCdiXlgCEFrgxvQ
 OQhLafQzOAzvxQmP+1alODR0dmJZA8k0zsDeeTB/vTpRvv6+Pe2qUswLSpauBzuq
 TpDsrv8/5pwZh28+91f/Unk+tH8NaVNtGe/Uf+ePxIkn1nbqL84o4NHGplM6R97d
 HAWdZQZ1cGRLf6YRRJ+57oM/5xE3vBbF1Wn0+QDTFwdsk2vcxuQ4eB3M/8E1V7Zk
 upp8ty50bZ5+rxQ8XTq/eb8epSRnfLoBYpi4ux6MIOWRdmKDl40cDeZCzA2kNP7m
 qY1haaRN3ksqvhzc0Yf6cL+CgvC4ur8gRHezfOqmBzVoaLyVEFIVjgjR/ojf0bq8
 /v+L9D5+IdIv4jEZruRRs0gOXNDzoBbvf0qKGaO0tUTWiDsv7c5AGixp8aozniHS
 HXsv1lIpRuC7WQ==
 =WxKD
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of fixes for X86:

   - Ensure that the PIT is set up when the local APIC is disable or
     configured in legacy mode. This is caused by an ordering issue
     introduced in the recent changes which skip PIT initialization when
     the TSC and APIC frequencies are already known.

   - Handle malformed SRAT tables during early ACPI parsing which caused
     an infinite loop anda boot hang.

   - Fix a long standing race in the affinity setting code which affects
     PCI devices with non-maskable MSI interrupts. The problem is caused
     by the non-atomic writes of the MSI address (destination APIC id)
     and data (vector) fields which the device uses to construct the MSI
     message. The non-atomic writes are mandated by PCI.

     If both fields change and the device raises an interrupt after
     writing address and before writing data, then the MSI block
     constructs a inconsistent message which causes interrupts to be
     lost and subsequent malfunction of the device.

     The fix is to redirect the interrupt to the new vector on the
     current CPU first and then switch it over to the new target CPU.
     This allows to observe an eventually raised interrupt in the
     transitional stage (old CPU, new vector) to be observed in the APIC
     IRR and retriggered on the new target CPU and the new vector.

     The potential spurious interrupts caused by this are harmless and
     can in the worst case expose a buggy driver (all handlers have to
     be able to deal with spurious interrupts as they can and do happen
     for various reasons).

   - Add the missing suspend/resume mechanism for the HYPERV hypercall
     page which prevents resume hibernation on HYPERV guests. This
     change got lost before the merge window.

   - Mask the IOAPIC before disabling the local APIC to prevent
     potentially stale IOAPIC remote IRR bits which cause stale
     interrupt lines after resume"

* tag 'x86-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic: Mask IOAPIC entries when disabling the local APIC
  x86/hyperv: Suspend/resume the hypercall page for hibernation
  x86/apic/msi: Plug non-maskable MSI affinity race
  x86/boot: Handle malformed SRAT tables during early ACPI parsing
  x86/timer: Don't skip PIT setup when APIC is disabled or in legacy mode
2020-02-09 12:11:12 -08:00
Linus Torvalds
c9d35ee049 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs file system parameter updates from Al Viro:
 "Saner fs_parser.c guts and data structures. The system-wide registry
  of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
  the horror switch() in fs_parse() that would have to grow another case
  every time something got added to that system-wide registry.

  New syntax types can be added by filesystems easily now, and their
  namespace is that of functions - not of system-wide enum members. IOW,
  they can be shared or kept private and if some turn out to be widely
  useful, we can make them common library helpers, etc., without having
  to do anything whatsoever to fs_parse() itself.

  And we already get that kind of requests - the thing that finally
  pushed me into doing that was "oh, and let's add one for timeouts -
  things like 15s or 2h". If some filesystem really wants that, let them
  do it. Without somebody having to play gatekeeper for the variants
  blessed by direct support in fs_parse(), TYVM.

  Quite a bit of boilerplate is gone. And IMO the data structures make a
  lot more sense now. -200LoC, while we are at it"

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
  tmpfs: switch to use of invalfc()
  cgroup1: switch to use of errorfc() et.al.
  procfs: switch to use of invalfc()
  hugetlbfs: switch to use of invalfc()
  cramfs: switch to use of errofc() et.al.
  gfs2: switch to use of errorfc() et.al.
  fuse: switch to use errorfc() et.al.
  ceph: use errorfc() and friends instead of spelling the prefix out
  prefix-handling analogues of errorf() and friends
  turn fs_param_is_... into functions
  fs_parse: handle optional arguments sanely
  fs_parse: fold fs_parameter_desc/fs_parameter_spec
  fs_parser: remove fs_parameter_description name field
  add prefix to fs_context->log
  ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
  new primitive: __fs_parse()
  switch rbd and libceph to p_log-based primitives
  struct p_log, variants of warnf() et.al. taking that one instead
  teach logfc() to handle prefices, give it saner calling conventions
  get rid of cg_invalf()
  ...
2020-02-08 13:26:41 -08:00
Al Viro
d7167b1499 fs_parse: fold fs_parameter_desc/fs_parameter_spec
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Eric Sandeen
96cafb9ccb fs_parser: remove fs_parameter_description name field
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:36 -05:00
Tony W Wang-oc
0f378d73d4 x86/apic: Mask IOAPIC entries when disabling the local APIC
When a system suspends, the local APIC is disabled in the suspend sequence,
but the IOAPIC is left in the current state. This means unmasked interrupt
lines stay unmasked. This is usually the case for IOAPIC pin 9 to which the
ACPI interrupt is connected.

That means that in suspended state the IOAPIC can respond to an external
interrupt, e.g. the wakeup via keyboard/RTC/ACPI, but the interrupt message
cannot be handled by the disabled local APIC. As a consequence the Remote
IRR bit is set, but the local APIC does not send an EOI to acknowledge
it. This causes the affected interrupt line to become stale and the stale
Remote IRR bit will cause a hang when __synchronize_hardirq() is invoked
for that interrupt line.

To prevent this, mask all IOAPIC entries before disabling the local
APIC. The resume code already has the unmask operation inside.

[ tglx: Massaged changelog ]

Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1579076539-7267-1-git-send-email-TonyWWang-oc@zhaoxin.com
2020-02-07 15:32:16 +01:00
Linus Torvalds
90568ecf56 s390:
* fix register corruption
 * ENOTSUPP/EOPNOTSUPP mixed
 * reset cleanups/fixes
 * selftests
 
 x86:
 * Bug fixes and cleanups
 * AMD support for APIC virtualization even in combination with
   in-kernel PIT or IOAPIC.
 
 MIPS:
 * Compilation fix.
 
 Generic:
 * Fix refcount overflow for zero page.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJeOuf7AAoJEL/70l94x66DOBQH/j1W9lUpbDgr9aWbrZT+O/yP
 FWzUDrRlCZCjV1FQKbGPa4YLeDRTG5n+RIQTjmCGRqiMqeoELSJ1+iK99e97nG/u
 L28zf/90Nf0R+wwHL4AOFeploTYfG4WP8EVnlr3CG2UCJrNjxN1KU7yRZoWmWa2d
 ckLJ8ouwNvx6VZd233LVmT38EP4352d1LyqIL8/+oXDIyAcRJLFQu1gRCwagsh3w
 1v1czowFpWnRQ/z9zU7YD+PA4v85/Ge8sVVHlpi1X5NgV/khk4U6B0crAw6M+la+
 mTmpz9g56oAh9m9NUdtv4zDCz1EWGH0v8+ZkAajUKtrM0ftJMn57P6p8PH4VVlE=
 =5+Wl
 -----END PGP SIGNATURE-----

Merge tag 'kvm-5.6-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "s390:
   - fix register corruption
   - ENOTSUPP/EOPNOTSUPP mixed
   - reset cleanups/fixes
   - selftests

  x86:
   - Bug fixes and cleanups
   - AMD support for APIC virtualization even in combination with
     in-kernel PIT or IOAPIC.

  MIPS:
   - Compilation fix.

  Generic:
   - Fix refcount overflow for zero page"

* tag 'kvm-5.6-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
  KVM: vmx: delete meaningless vmx_decache_cr0_guest_bits() declaration
  KVM: x86: Mark CR4.UMIP as reserved based on associated CPUID bit
  x86: vmxfeatures: rename features for consistency with KVM and manual
  KVM: SVM: relax conditions for allowing MSR_IA32_SPEC_CTRL accesses
  KVM: x86: Fix perfctr WRMSR for running counters
  x86/kvm/hyper-v: don't allow to turn on unsupported VMX controls for nested guests
  x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()
  kvm: mmu: Separate generating and setting mmio ptes
  kvm: mmu: Replace unsigned with unsigned int for PTE access
  KVM: nVMX: Remove stale comment from nested_vmx_load_cr3()
  KVM: MIPS: Fold comparecount_func() into comparecount_wakeup()
  KVM: MIPS: Fix a build error due to referencing not-yet-defined function
  x86/kvm: do not setup pv tlb flush when not paravirtualized
  KVM: fix overflow of zero page refcount with ksm running
  KVM: x86: Take a u64 when checking for a valid dr7 value
  KVM: x86: use raw clock values consistently
  KVM: x86: reorganize pvclock_gtod_data members
  KVM: nVMX: delete meaningless nested_vmx_run() declaration
  KVM: SVM: allow AVIC without split irqchip
  kvm: ioapic: Lazy update IOAPIC EOI
  ...
2020-02-06 09:07:45 -08:00
Thadeu Lima de Souza Cascardo
64b38bd190 x86/kvm: do not setup pv tlb flush when not paravirtualized
kvm_setup_pv_tlb_flush will waste memory and print a misguiding message
when KVM paravirtualization is not available.

Intel SDM says that the when cpuid is used with EAX higher than the
maximum supported value for basic of extended function, the data for the
highest supported basic function will be returned.

So, in some systems, kvm_arch_para_features will return bogus data,
causing kvm_setup_pv_tlb_flush to detect support for pv tlb flush.

Testing for kvm_para_available will work as it checks for the hypervisor
signature.

Besides, when the "nopv" command line parameter is used, it should not
continue as well, as kvm_guest_init will no be called in that case.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:28:07 +01:00
Alexey Dobriyan
97a32539b9 proc: convert everything to "struct proc_ops"
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

	llseek		=> proc_lseek
	unlocked_ioctl	=> proc_ioctl

	xxx		=> proc_xxx

	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:26 +00:00
Thomas Gleixner
6f1a4891a5 x86/apic/msi: Plug non-maskable MSI affinity race
Evan tracked down a subtle race between the update of the MSI message and
the device raising an interrupt internally on PCI devices which do not
support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space.

   - Write address low 32bits
   - Write address high 32bits (If supported by device)
   - Write data

When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But for MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes then a MSI
message is sent built from half updated state.

On x86 this can lead to spurious interrupts on the wrong interrupt
vector when the affinity setting changes both address and data. As a
consequence the device interrupt can be lost causing the device to
become stuck or malfunctioning.

Evan tried to handle that by disabling MSI accross an MSI message
update. That's not feasible because disabling MSI has issues on its own:

 If MSI is disabled the PCI device is routing an interrupt to the legacy
 INTx mechanism. The INTx delivery can be disabled, but the disablement is
 not working on all devices.

 Some devices lose interrupts when both MSI and INTx delivery are disabled.

Another way to solve this would be to enforce the allocation of the same
vector on all CPUs in the system for this kind of screwed devices. That
could be done, but it would bring back the vector space exhaustion problems
which got solved a few years ago.

Fortunately the high address (if supported by the device) is only relevant
when X2APIC is enabled which implies interrupt remapping. In the interrupt
remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.

That makes it possible to solve it with a two step update:

  1) Target the MSI msg to the new vector on the current target CPU

  2) Target the MSI msg to the new vector on the new target CPU

In both cases writing the MSI message is only changing a single 32bit word
which prevents the issue of inconsistency.

After writing the final destination it is necessary to check whether the
device issued an interrupt while the intermediate state #1 (new vector,
current CPU) was in effect.

This is possible because the affinity change is always happening on the
current target CPU. The code runs with interrupts disabled, so the
interrupt can be detected by checking the IRR of the local APIC. If the
vector is pending in the IRR then the interrupt is retriggered on the new
target CPU by sending an IPI for the associated vector on the target CPU.

This can cause spurious interrupts on both the local and the new target
CPU.

 1) If the new vector is not in use on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then interrupt entry code will
    ignore that spurious interrupt. The vector is marked so that the
    'No irq handler for vector' warning is supressed once.

 2) If the new vector is in use already on the local CPU then the IRR check
    might see an pending interrupt from the device which is using this
    vector. The IPI to the new target CPU will then invoke the handler of
    the device, which got the affinity change, even if that device did not
    issue an interrupt

 3) If the new vector is in use already on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then the handler of the device which
    uses that vector on the local CPU will be invoked.

expose issues in device driver interrupt handlers which are not prepared to
handle a spurious interrupt correctly. This not a regression, it's just
exposing something which was already broken as spurious interrupts can
happen for a lot of reasons and all driver handlers need to be able to deal
with them.

Reported-by: Evan Green <evgreen@chromium.org>
Debugged-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
2020-02-01 09:31:47 +01:00
Linus Torvalds
b70a2d6b29 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes:

   - three fixes and a cleanup for the resctrl code

   - a HyperV fix

   - a fix to /proc/kcore contents in live debugging sessions

   - a fix for the x86 decoder opcode map"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/decoder: Add TEST opcode to Group3-2
  x86/resctrl: Clean up unused function parameter in mkdir path
  x86/resctrl: Fix a deadlock due to inaccurate reference
  x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup
  x86/resctrl: Fix use-after-free when deleting resource groups
  x86/hyper-v: Add "polling" bit to hv_synic_sint
  x86/crash: Define arch_crash_save_vmcoreinfo() if CONFIG_CRASH_CORE=y
2020-01-31 11:05:33 -08:00
Linus Torvalds
ccaaaf6fe5 MPX requires recompiling applications, which requires compiler support.
Unfortunately, GCC 9.1 is expected to be be released without support for
 MPX.  This means that there was only a relatively small window where
 folks could have ever used MPX.  It failed to gain wide adoption in the
 industry, and Linux was the only mainstream OS to ever support it widely.
 
 Support for the feature may also disappear on future processors.
 
 This set completes the process that we started during the 5.4 merge window.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJeK1/pAAoJEGg1lTBwyZKwgC8QAIiVn1d7A9Uj/WpnpgfCChCZ
 9XiV6Ak999qD9fbAcrgNfPjieaD4mtokocSRVJuRgJu5iLnIJCINlozLPe4yVl7P
 7zebnxkLq0CIA8d56bEUoFlC0J+oWYlDVQePZzNQsSk5KHVGXVLpF6U4vDVzZeQy
 cprgvdeY+ehB7G6IIo0MWTg5ylKYAsOAyVvK8NIGpKY2k6/YqCnsptnsVE7bvlHy
 TrEOiUWLv+hh0bMkZdP1PwKQKEuMO/IZly0HtviFbMN7T4TB1spfg7ELoBucEq3T
 s4EVbYRe+nIE4tuEAveaX3CgxJek8cY5MlticskdaKSEACBwabdOF55qsZy0u+WA
 PYC4iUIXfbOH8OgieKWtGX4IuSkRYdQ2nP4BOpe4ZX4+zvU7zOCIyVSKRrwkX8cc
 ADtWI5FAtB36KCgUuWnHGHNZpOxPTbTLBuBataFY4Q2uBNJEBJpscZ5H9ObtyGFU
 ZjlzqFnM0nFNDKEI1EEtv9jLzgZTU1RQ46s7EFeSeEQ2/s9wJ3+s5sBlVbljsmus
 o658bLOEaRWC/aF15dgmEXW9GAO6uifNdmbzGnRn7oEMYyFQPTWbZvi1zGz58QaG
 Y6WTtigVtsSrHS4wpYd+p+n1W06VnB6J3BpBM4G1VQv1Vm0dNd1tUOfkqOzPjg7c
 33Itmsz2LaW1mb67GlgZ
 =g4cC
 -----END PGP SIGNATURE-----

Merge tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx

Pull x86 MPX removal from Dave Hansen:
 "MPX requires recompiling applications, which requires compiler
  support. Unfortunately, GCC 9.1 is expected to be be released without
  support for MPX. This means that there was only a relatively small
  window where folks could have ever used MPX. It failed to gain wide
  adoption in the industry, and Linux was the only mainstream OS to ever
  support it widely.

  Support for the feature may also disappear on future processors.

  This set completes the process that we started during the 5.4 merge
  window when the MPX prctl()s were removed. XSAVE support is left in
  place, which allows MPX-using KVM guests to continue to function"

* tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx:
  x86/mpx: remove MPX from arch/x86
  mm: remove arch_bprm_mm_init() hook
  x86/mpx: remove bounds exception code
  x86/mpx: remove build infrastructure
  x86/alternatives: add missing insn.h include
2020-01-30 16:11:50 -08:00
Linus Torvalds
ca9b5b6283 TTY/Serial driver updates for 5.6-rc1
Here are the big set of tty and serial driver updates for 5.6-rc1
 
 Included in here are:
 	- dummy_con cleanups (touches lots of arch code)
 	- sysrq logic cleanups (touches lots of serial drivers)
 	- samsung driver fixes (wasn't really being built)
 	- conmakeshash move to tty subdir out of scripts
 	- lots of small tty/serial driver updates
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXjFRBg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yn2VACgkge7vTeUNeZFc+6F4NWphAQ5tCQAoK/MMbU6
 0O8ef7PjFwCU4s227UTv
 =6m40
 -----END PGP SIGNATURE-----

Merge tag 'tty-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver updates from Greg KH:
 "Here are the big set of tty and serial driver updates for 5.6-rc1

  Included in here are:
   - dummy_con cleanups (touches lots of arch code)
   - sysrq logic cleanups (touches lots of serial drivers)
   - samsung driver fixes (wasn't really being built)
   - conmakeshash move to tty subdir out of scripts
   - lots of small tty/serial driver updates

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'tty-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits)
  tty: n_hdlc: Use flexible-array member and struct_size() helper
  tty: baudrate: SPARC supports few more baud rates
  tty: baudrate: Synchronise baud_table[] and baud_bits[]
  tty: serial: meson_uart: Add support for kernel debugger
  serial: imx: fix a race condition in receive path
  serial: 8250_bcm2835aux: Document struct bcm2835aux_data
  serial: 8250_bcm2835aux: Use generic remapping code
  serial: 8250_bcm2835aux: Allocate uart_8250_port on stack
  serial: 8250_bcm2835aux: Suppress register_port error on -EPROBE_DEFER
  serial: 8250_bcm2835aux: Suppress clk_get error on -EPROBE_DEFER
  serial: 8250_bcm2835aux: Fix line mismatch on driver unbind
  serial_core: Remove unused member in uart_port
  vt: Correct comment documenting do_take_over_console()
  vt: Delete comment referencing non-existent unbind_con_driver()
  arch/xtensa/setup: Drop dummy_con initialization
  arch/x86/setup: Drop dummy_con initialization
  arch/unicore32/setup: Drop dummy_con initialization
  arch/sparc/setup: Drop dummy_con initialization
  arch/sh/setup: Drop dummy_con initialization
  arch/s390/setup: Drop dummy_con initialization
  ...
2020-01-29 10:13:27 -08:00
Thomas Gleixner
979923871f x86/timer: Don't skip PIT setup when APIC is disabled or in legacy mode
Tony reported a boot regression caused by the recent workaround for systems
which have a disabled (clock gate off) PIT.

On his machine the kernel fails to initialize the PIT because
apic_needs_pit() does not take into account whether the local APIC
interrupt delivery mode will actually allow to setup and use the local
APIC timer. This should be easy to reproduce with acpi=off on the
command line which also disables HPET.

Due to the way the PIT/HPET and APIC setup ordering works (APIC setup can
require working PIT/HPET) the information is not available at the point
where apic_needs_pit() makes this decision.

To address this, split out the interrupt mode selection from
apic_intr_mode_init(), invoke the selection before making the decision
whether PIT is required or not, and add the missing checks into
apic_needs_pit().

Fixes: c8c4076723 ("x86/timer: Skip PIT initialization on modern chipsets")
Reported-by: Anthony Buckley <tony.buckley000@gmail.com>
Tested-by: Anthony Buckley <tony.buckley000@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Daniel Drake <drake@endlessm.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206125
Link: https://lore.kernel.org/r/87sgk6tmk2.fsf@nanos.tec.linutronix.de
2020-01-29 12:50:12 +01:00
Linus Torvalds
511fdb7844 Merge branch 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mtrr updates from Ingo Molnar:
 "Two changes: restrict /proc/mtrr to CAP_SYS_ADMIN, plus a cleanup"

* 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mtrr: Require CAP_SYS_ADMIN for all access
  x86/mtrr: Get rid of mtrr_seq_show() forward declaration
2020-01-28 13:06:05 -08:00
Linus Torvalds
4d6245ce8c Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
 "Three changes: fix a race that can result in FPU corruption, plus two
  cleanups"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Deactivate FPU state after failure during state load
  x86/fpu/xstate: Make xfeature_is_supervisor()/xfeature_is_user() return bool
  x86/fpu/xstate: Fix small issues
2020-01-28 13:04:38 -08:00
Linus Torvalds
c0275ae758 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu-features updates from Ingo Molnar:
 "The biggest change in this cycle was a large series from Sean
  Christopherson to clean up the handling of VMX features. This both
  fixes bugs/inconsistencies and makes the code more coherent and
  future-proof.

  There are also two cleanups and a minor TSX syslog messages
  enhancement"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/cpu: Remove redundant cpu_detect_cache_sizes() call
  x86/cpu: Print "VMX disabled" error message iff KVM is enabled
  KVM: VMX: Allow KVM_INTEL when building for Centaur and/or Zhaoxin CPUs
  perf/x86: Provide stubs of KVM helpers for non-Intel CPUs
  KVM: VMX: Use VMX_FEATURE_* flags to define VMCS control bits
  KVM: VMX: Check for full VMX support when verifying CPU compatibility
  KVM: VMX: Use VMX feature flag to query BIOS enabling
  KVM: VMX: Drop initialization of IA32_FEAT_CTL MSR
  x86/cpufeatures: Add flag to track whether MSR IA32_FEAT_CTL is configured
  x86/cpu: Set synthetic VMX cpufeatures during init_ia32_feat_ctl()
  x86/cpu: Print VMX flags in /proc/cpuinfo using VMX_FEATURES_*
  x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs
  x86/vmx: Introduce VMX_FEATURES_*
  x86/cpu: Clear VMX feature flag if VMX is not fully enabled
  x86/zhaoxin: Use common IA32_FEAT_CTL MSR initialization
  x86/centaur: Use common IA32_FEAT_CTL MSR initialization
  x86/mce: WARN once if IA32_FEAT_CTL MSR is left unlocked
  x86/intel: Initialize IA32_FEAT_CTL MSR at boot
  tools/x86: Sync msr-index.h from kernel sources
  selftests, kvm: Replace manual MSR defs with common msr-index.h
  ...
2020-01-28 12:46:42 -08:00
Linus Torvalds
f6170f0afb Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Ingo Molnar:
 "Misc changes:

   - Enhance #GP fault printouts by distinguishing between canonical and
     non-canonical address faults, and also add KASAN fault decoding.

   - Fix/enhance the x86 NMI handler by putting the duration check into
     a direct function call instead of an irq_work which we know to be
     broken in some cases.

   - Clean up do_general_protection() a bit"

* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Remove irq_work from the long duration NMI handler
  x86/traps: Cleanup do_general_protection()
  x86/kasan: Print original address on #GP
  x86/dumpstack: Introduce die_addr() for die() with #GP fault address
  x86/traps: Print address on #GP
  x86/insn-eval: Add support for 64-bit kernel mode
2020-01-28 12:28:06 -08:00
Linus Torvalds
6da49d1abd Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups all around the map"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/CPU/AMD: Remove amd_get_topology_early()
  x86/tsc: Remove redundant assignment
  x86/crash: Use resource_size()
  x86/cpu: Add a missing prototype for arch_smt_update()
  x86/nospec: Remove unused RSB_FILL_LOOPS
  x86/vdso: Provide missing include file
  x86/Kconfig: Correct spelling and punctuation
  Documentation/x86/boot: Fix typo
  x86/boot: Fix a comment's incorrect file reference
  x86/process: Remove set but not used variables prev and next
  x86/Kconfig: Fix Kconfig indentation
2020-01-28 12:11:23 -08:00
Linus Torvalds
4244057c3d Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Ingo Molnar:
 "The main change in this tree is the extension of the resctrl procfs
  ABI with a new file that helps tooling to navigate from tasks back to
  resctrl groups: /proc/{pid}/cpu_resctrl_groups.

  Also fix static key usage for certain feature combinations and
  simplify the task exit resctrl case"

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Add task resctrl information display
  x86/resctrl: Check monitoring static key in the MBM overflow handler
  x86/resctrl: Do not reconfigure exiting tasks
2020-01-28 12:00:29 -08:00
Linus Torvalds
6b90e71a47 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot update from Ingo Molnar:
 "Two minor changes: fix an atypical binutils combination build bug, and
  also fix a VRAM size check for simplefb"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sysfb: Fix check for bad VRAM size
  x86/boot: Discard .eh_frame sections
2020-01-28 11:54:05 -08:00
Linus Torvalds
bcc8aff6af Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
 "Misc updates:

   - Remove last remaining calls to exception_enter/exception_exit() and
     simplify the entry code some more.

   - Remove force_iret()

   - Add support for "Fast Short Rep Mov", which is available starting
     with Ice Lake Intel CPUs - and make the x86 assembly version of
     memmove() use REP MOV for all sizes when FSRM is available.

   - Micro-optimize/simplify the 32-bit boot code a bit.

   - Use a more future-proof SYSRET instruction mnemonic"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Simplify calculation of output address
  x86/entry/64: Add instruction suffix to SYSRET
  x86: Remove force_iret()
  x86/cpufeatures: Add support for fast short REP; MOVSB
  x86/context-tracking: Remove exception_enter/exit() from KVM_PV_REASON_PAGE_NOT_PRESENT async page fault
  x86/context-tracking: Remove exception_enter/exit() from do_page_fault()
2020-01-28 11:08:13 -08:00
Linus Torvalds
435dd727a4 Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic fix from Ingo Molnar:
 "A single commit that simplifies the code and gets rid of a compiler
  warning"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic/uv: Avoid unused variable warning
2020-01-28 11:06:49 -08:00
Ingo Molnar
6bd3357b61 Merge branches 'x86/hyperv', 'x86/kdump' and 'x86/misc' into x86/urgent, to pick up single-commit branches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-28 19:08:52 +01:00
Linus Torvalds
c0e809e244 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Ftrace is one of the last W^X violators (after this only KLP is
     left). These patches move it over to the generic text_poke()
     interface and thereby get rid of this oddity. This requires a
     surprising amount of surgery, by Peter Zijlstra.

   - x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
     count certain types of events that have a special, quirky hw ABI
     (by Kim Phillips)

   - kprobes fixes by Masami Hiramatsu

  Lots of tooling updates as well, the following subcommands were
  updated: annotate/report/top, c2c, clang, record, report/top TUI,
  sched timehist, tests; plus updates were done to the gtk ui, libperf,
  headers and the parser"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  perf/x86/amd: Add support for Large Increment per Cycle Events
  perf/x86/amd: Constrain Large Increment per Cycle events
  perf/x86/intel/rapl: Add Comet Lake support
  tracing: Initialize ret in syscall_enter_define_fields()
  perf header: Use last modification time for timestamp
  perf c2c: Fix return type for histogram sorting comparision functions
  perf beauty sockaddr: Fix augmented syscall format warning
  perf/ui/gtk: Fix gtk2 build
  perf ui gtk: Add missing zalloc object
  perf tools: Use %define api.pure full instead of %pure-parser
  libperf: Setup initial evlist::all_cpus value
  perf report: Fix no libunwind compiled warning break s390 issue
  perf tools: Support --prefix/--prefix-strip
  perf report: Clarify in help that --children is default
  tools build: Fix test-clang.cpp with Clang 8+
  perf clang: Fix build with Clang 9
  kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
  tools lib: Fix builds when glibc contains strlcpy()
  perf report/top: Make 'e' visible in the help and make it toggle showing callchains
  perf report/top: Do not offer annotation for symbols without samples
  ...
2020-01-28 09:44:15 -08:00
Linus Torvalds
634cd4b6af Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Cleanup of the GOP [graphics output] handling code in the EFI stub

   - Complete refactoring of the mixed mode handling in the x86 EFI stub

   - Overhaul of the x86 EFI boot/runtime code

   - Increase robustness for mixed mode code

   - Add the ability to disable DMA at the root port level in the EFI
     stub

   - Get rid of RWX mappings in the EFI memory map and page tables,
     where possible

   - Move the support code for the old EFI memory mapping style into its
     only user, the SGI UV1+ support code.

   - plus misc fixes, updates, smaller cleanups.

  ... and due to interactions with the RWX changes, another round of PAT
  cleanups make a guest appearance via the EFI tree - with no side
  effects intended"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  efi/x86: Disable instrumentation in the EFI runtime handling code
  efi/libstub/x86: Fix EFI server boot failure
  efi/x86: Disallow efi=old_map in mixed mode
  x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
  efi/x86: avoid KASAN false positives when accessing the 1: 1 mapping
  efi: Fix handling of multiple efi_fake_mem= entries
  efi: Fix efi_memmap_alloc() leaks
  efi: Add tracking for dynamically allocated memmaps
  efi: Add a flags parameter to efi_memory_map
  efi: Fix comment for efi_mem_type() wrt absent physical addresses
  efi/arm: Defer probe of PCIe backed efifb on DT systems
  efi/x86: Limit EFI old memory map to SGI UV machines
  efi/x86: Avoid RWX mappings for all of DRAM
  efi/x86: Don't map the entire kernel text RW for mixed mode
  x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
  efi/libstub/x86: Fix unused-variable warning
  efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode
  efi/libstub/x86: Use const attribute for efi_is_64bit()
  efi: Allow disabling PCI busmastering on bridges during boot
  efi/x86: Allow translating 64-bit arguments for mixed mode calls
  ...
2020-01-28 09:03:40 -08:00
Linus Torvalds
8b561778f2 Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
 "The main changes are to move the ORC unwind table sorting from early
  init to build-time - this speeds up booting.

  No change in functionality intended"

* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Fix !CONFIG_MODULES build warning
  x86/unwind/orc: Remove boot-time ORC unwind tables sorting
  scripts/sorttable: Implement build-time ORC unwind table sorting
  scripts/sorttable: Rename 'sortextable' to 'sorttable'
  scripts/sortextable: Refactor the do_func() function
  scripts/sortextable: Remove dead code
  scripts/sortextable: Clean up the code to meet the kernel coding style better
  scripts/sortextable: Rewrite error/success handling
2020-01-28 08:38:25 -08:00
Linus Torvalds
9f2a43019e Merge branch 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull header cleanup from Ingo Molnar:
 "This is a treewide cleanup, mostly (but not exclusively) with x86
  impact, which breaks implicit dependencies on the asm/realtime.h
  header and finally removes it from asm/acpi.h"

* 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ACPI/sleep: Move acpi_get_wakeup_address() into sleep.c, remove <asm/realmode.h> from <asm/acpi.h>
  ACPI/sleep: Convert acpi_wakeup_address into a function
  x86/ACPI/sleep: Remove an unnecessary include of asm/realmode.h
  ASoC: Intel: Skylake: Explicitly include linux/io.h for virt_to_phys()
  vmw_balloon: Explicitly include linux/io.h for virt_to_phys()
  virt: vbox: Explicitly include linux/io.h to pick up various defs
  efi/capsule-loader: Explicitly include linux/io.h for page_to_phys()
  perf/x86/intel: Explicitly include asm/io.h to use virt_to_phys()
  x86/kprobes: Explicitly include vmalloc.h for set_vm_flush_reset_perms()
  x86/ftrace: Explicitly include vmalloc.h for set_vm_flush_reset_perms()
  x86/boot: Explicitly include realmode.h to handle RM reservations
  x86/efi: Explicitly include realmode.h to handle RM trampoline quirk
  x86/platform/intel/quark: Explicitly include linux/io.h for virt_to_phys()
  x86/setup: Enhance the comments
  x86/setup: Clean up the header portion of setup.c
2020-01-28 08:20:54 -08:00
Linus Torvalds
b0be0eff1a The performance deterioration departement provides a few non-scary fixes
and improvements:
 
  - Update the cached HLE state when the TSX state is changed via the new
    control register. This ensures feature bit consistency.
 
  - Exclude the new Zhaoxin CPUs from Spectre V2 and SWAPGS vulnerabilities.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vc5ITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRujEACeWTy4k+eCprV2tu4O/e15VaGaIdRp
 Kdv02kZ1U/3upSubRYW8XsBqesCdbSzfO78fE7yQYZmTbOxczSE1t2XH5Cj7c3ig
 3D4gzXaH2XKf+j9cXceba/3cpRsMU8fAhQYr+Immi/dG2ORs8i2C4G0cxKthqvbP
 R6qdS+Pg9Mo1a2c2nLRGMDyNbNdaYXiJEiWSGNlc6Den/vog+jR4T2WV5a5IjeKL
 yjjukoLoeScJ/3gYK0BQkaGNY3BeU1H5ECtbVM3SBFkp2a9OVGhr1bSxwO1U+T8C
 fdzwJZU/p+JTb69kx91Tz8+wd2zLtV1wxbdZY7nyUUfGd80q0eU4sQZQyY13mONw
 0LRvXrfHjIsRPSiRY3iyHYq9I0UZmjG7oInqRLimLrJmjrPtALhu/mO5TjZbxjzM
 FSVx5fWbZYUgnrQpB+hjqiBcrc3sjjrDwU6DufbSq/AhTBKQ/o0mC+S63OhJ0H7C
 Bzp1m0OniDuNRtYD0LZm0ktAkLpCTy5lgGBQ6ffNqs87ivFi0GTErqml5GrUEKQ5
 mq7h56mf00jKPBYOVoYs9MuWJ18JVJuVNipdlM+P7027XDRpl46l9g87YuMhIol/
 tQsDHB4Dv/1QTZY3K3MuNLFZaQJ4/DI9LA2B1br1PdHbZSDd81NflT6FKE11rqWC
 T+DCpmj+smRqVw==
 =dOgd
 -----END PGP SIGNATURE-----

Merge tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 pti updates from Thomas Gleixner:
 "The performance deterioration departement provides a few non-scary
  fixes and improvements:

   - Update the cached HLE state when the TSX state is changed via the
     new control register. This ensures feature bit consistency.

   - Exclude the new Zhaoxin CPUs from Spectre V2 and SWAPGS
     vulnerabilities"

* tag 'x86-pti-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation/swapgs: Exclude Zhaoxin CPUs from SWAPGS vulnerability
  x86/speculation/spectre_v2: Exclude Zhaoxin CPUs from SPECTRE_V2
  x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR
2020-01-27 17:28:52 -08:00
Linus Torvalds
e279160f49 The timekeeping and timers departement provides:
- Time namespace support:
 
     If a container migrates from one host to another then it expects that
     clocks based on MONOTONIC and BOOTTIME are not subject to
     disruption. Due to different boot time and non-suspended runtime these
     clocks can differ significantly on two hosts, in the worst case time
     goes backwards which is a violation of the POSIX requirements.
 
     The time namespace addresses this problem. It allows to set offsets for
     clock MONOTONIC and BOOTTIME once after creation and before tasks are
     associated with the namespace. These offsets are taken into account by
     timers and timekeeping including the VDSO.
 
     Offsets for wall clock based clocks (REALTIME/TAI) are not provided by
     this mechanism. While in theory possible, the overhead and code
     complexity would be immense and not justified by the esoteric potential
     use cases which were discussed at Plumbers '18.
 
     The overhead for tasks in the root namespace (host time offsets = 0) is
     in the noise and great effort was made to ensure that especially in the
     VDSO. If time namespace is disabled in the kernel configuration the
     code is compiled out.
 
     Kudos to Andrei Vagin and Dmitry Sofanov who implemented this feature
     and kept on for more than a year addressing review comments, finding
     better solutions. A pleasant experience.
 
   - Overhaul of the alarmtimer device dependency handling to ensure that
     the init/suspend/resume ordering is correct.
 
   - A new clocksource/event driver for Microchip PIT64
 
   - Suspend/resume support for the Hyper-V clocksource
 
   - The usual pile of fixes, updates and improvements mostly in the
     driver code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vbTcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXT2D/96iJ3G9Snn2khEQP3XS2rYmtDGw7NO
 m1n96falwWeGe6zreU80R2Jge5nLxQtNhRoMPLLee1GpHwRC6lvqEqgdZ4LMBrD2
 JqV7Gzg8Urmdh+hpDsyTCpeEWEzoMKxiFOX8PxwctqUhM4szEe5iQg2YQsg85Jw2
 vG6M93N2xwDILh4rhEMbKjo+5ZmYn7c1RQvpGOSmpKOj940W/N7H2HBsFhdaJ1Kw
 FW5pFv1211PaU5RV2YNb2dMeeMTT1N3e2VN4Dkadoxp47pb+725gNHEBEjmV9poG
 Lp4IhzGAPnj8zVD88icQZSTaK3gUHMClxprJ0Pf84WEtiH7SeGu8BPYyu77+oNDe
 yzcctDJNyCWXkzmaP/fe/HLc0TStbvNAJ5Tagp4BC75gzebeb4/n8RtRT0fKeDYL
 pxpDPKDAPU7p1JSjxiWAtshqjBycWNY3Z49bA7/VhKBhnv8BDyBPGlYd7/4xrbGr
 RK7DQNXJwaJaiNJ7p5PiaFxGzNyB0B9sThD/slSlEInIKb4h9YzWr0TV+NB62VnB
 sDcN+tpLbRPz5/5cHGGfxR0+zKWpfyai8pzbmmaXEaKssjRYwyvcac5EZdgbWpbK
 k7CqAjoWLA2P+tGeePNJOf5JYK6Vmdyh4clmuwM0zOiRJ9NlWUyMf3z7QYILs4RO
 UAI+6opYlZEPAw==
 =x3qT
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "The timekeeping and timers departement provides:

   - Time namespace support:

     If a container migrates from one host to another then it expects
     that clocks based on MONOTONIC and BOOTTIME are not subject to
     disruption. Due to different boot time and non-suspended runtime
     these clocks can differ significantly on two hosts, in the worst
     case time goes backwards which is a violation of the POSIX
     requirements.

     The time namespace addresses this problem. It allows to set offsets
     for clock MONOTONIC and BOOTTIME once after creation and before
     tasks are associated with the namespace. These offsets are taken
     into account by timers and timekeeping including the VDSO.

     Offsets for wall clock based clocks (REALTIME/TAI) are not provided
     by this mechanism. While in theory possible, the overhead and code
     complexity would be immense and not justified by the esoteric
     potential use cases which were discussed at Plumbers '18.

     The overhead for tasks in the root namespace (ie where host time
     offsets = 0) is in the noise and great effort was made to ensure
     that especially in the VDSO. If time namespace is disabled in the
     kernel configuration the code is compiled out.

     Kudos to Andrei Vagin and Dmitry Sofanov who implemented this
     feature and kept on for more than a year addressing review
     comments, finding better solutions. A pleasant experience.

   - Overhaul of the alarmtimer device dependency handling to ensure
     that the init/suspend/resume ordering is correct.

   - A new clocksource/event driver for Microchip PIT64

   - Suspend/resume support for the Hyper-V clocksource

   - The usual pile of fixes, updates and improvements mostly in the
     driver code"

* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
  alarmtimer: Use wakeup source from alarmtimer platform device
  alarmtimer: Make alarmtimer platform device child of RTC device
  alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
  hrtimer: Add missing sparse annotation for __run_timer()
  lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
  MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
  clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
  clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
  clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
  clocksource/drivers/exynos_mct: Rename Exynos to lowercase
  clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
  clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
  clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
  clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
  clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
  clocksource/drivers/bcm2835_timer: Fix memory leak of timer
  clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
  clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
  clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
  ...
2020-01-27 16:47:05 -08:00
Linus Torvalds
6a1000bd27 ioremap changes for 5.6
- remove ioremap_nocache given that is is equivalent to
    ioremap everywhere
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl4vKHwLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMPGBAAuVNUZaZfWYHpiVP2oRcUQUguFiD3NTbknsyzV2oH
 J9P0GfeENSKwE9OOhZ7XIjnCZAJwQgTK/ppQY5yiQ/KAtYyyXjXEJ6jqqjiTDInr
 +3+I3t/LhkgrK7tMrb7ylTGa/d7KhaciljnOXC8+b75iddvM9I1z2pbHDbppZMS9
 wT4RXL/cFtRb85AfOyPLybcka3f5P2gGvQz38qyimhJYEzHDXZu9VO1Bd20f8+Xf
 eLBKX0o6yWMhcaPLma8tm0M0zaXHEfLHUKLSOkiOk+eHTWBZ3b/w5nsOQZYZ7uQp
 25yaClbameAn7k5dHajduLGEJv//ZjLRWcN3HJWJ5vzO111aHhswpE7JgTZJSVWI
 ggCVkytD3ESXapvswmACSeCIDMmiJMzvn6JvwuSMVB7a6e5mcqTuGo/FN+DrBF/R
 IP+/gY/T7zIIOaljhQVkiEIIwiD/akYo0V9fheHTBnqcKEDTHV4WjKbeF6aCwcO+
 b8inHyXZSKSMG//UlDuN84/KH/o1l62oKaB1uDIYrrL8JVyjAxctWt3GOt5KgSFq
 wVz1lMw4kIvWtC/Sy2H4oB+RtODLp6yJDqmvmPkeJwKDUcd/1JKf0KsZ8j3FpGei
 /rEkBEss0KBKyFAgBSRO2jIpdj2epgcBcsdB/r5mlhcn8L77AS6mHbA173kY4pQ/
 Kdg=
 =TUCJ
 -----END PGP SIGNATURE-----

Merge tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap

Pull ioremap updates from Christoph Hellwig:
 "Remove the ioremap_nocache API (plus wrappers) that are always
  identical to ioremap"

* tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap:
  remove ioremap_nocache and devm_ioremap_nocache
  MIPS: define ioremap_nocache to ioremap
2020-01-27 13:03:00 -08:00
Linus Torvalds
30f5a75640 Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:

 - Misc fixes to the MCE code all over the place, by Jan H. Schönherr.

 - Initial support for AMD F19h and other cleanups to amd64_edac, by
   Yazen Ghannam.

 - Other small cleanups.

* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/mce_amd: Make fam_ops static global
  EDAC/amd64: Drop some family checks for newer systems
  EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh
  x86/amd_nb: Add Family 19h PCI IDs
  EDAC/mce_amd: Always load on SMCA systems
  x86/MCE/AMD, EDAC/mce_amd: Add new Load Store unit McaType
  x86/mce: Fix use of uninitialized MCE message string
  x86/mce: Fix mce=nobootlog
  x86/mce: Take action on UCNA/Deferred errors again
  x86/mce: Remove mce_inject_log() in favor of mce_log()
  x86/mce: Pass MCE message to mce_panic() on failed kernel recovery
  x86/mce/therm_throt: Mark throttle_active_work() as __maybe_unused
2020-01-27 09:19:35 -08:00
Borislav Petkov
3c749b81ee x86/CPU/AMD: Remove amd_get_topology_early()
... and fold its function body into its single call site.

No functional changes:

  # arch/x86/kernel/cpu/amd.o:

   text    data     bss     dec     hex filename
   5994     385       1    6380    18ec amd.o.before
   5994     385       1    6380    18ec amd.o.after

md5:
   99ec6daa095b502297884e949c520f90  amd.o.before.asm
   99ec6daa095b502297884e949c520f90  amd.o.after.asm

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200123165811.5288-1-bp@alien8.de
2020-01-25 12:59:46 +01:00
Dave Hansen
45fc24e89b x86/mpx: remove MPX from arch/x86
From: Dave Hansen <dave.hansen@linux.intel.com>

MPX is being removed from the kernel due to a lack of support
in the toolchain going forward (gcc).

This removes all the remaining (dead at this point) MPX handling
code remaining in the tree.  The only remaining code is the XSAVE
support for MPX state which is currently needd for KVM to handle
VMs which might use MPX.

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
2020-01-23 10:41:20 -08:00
Dave Hansen
aa9ccb7b47 x86/mpx: remove bounds exception code
From: Dave Hansen <dave.hansen@linux.intel.com>

MPX is being removed from the kernel due to a lack of support
in the toolchain going forward (gcc).

Remove the other user-visible ABI: signal handling.  This code
should basically have been inactive after the prctl()s were
removed, but there may be some small ABI remnants from this code.
Remove it.

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
2020-01-23 10:41:15 -08:00
Dave Hansen
3a1255396b x86/alternatives: add missing insn.h include
From: Dave Hansen <dave.hansen@linux.intel.com>

While testing my MPX removal series, Borislav noted compilation
failure with an allnoconfig build.

Turned out to be a missing include of insn.h in alternative.c.
With MPX, it got it implicitly from:

	asm/mmu_context.h -> asm/mpx.h -> asm/insn.h

Fixes: c3d6324f84 ("x86/alternatives: Teach text_poke_bp() to emulate instructions")
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
2020-01-23 10:41:13 -08:00
Mateusz Nosek
4144fddbd3 x86/tsc: Remove redundant assignment
Previously, the assignment to the local variable 'now' took place
before the for loop. The loop is unconditional so it will be entered
at least once. The variable 'now' is reassigned in the loop and is not
used before reassigning. Therefore, the assignment before the loop is
unnecessary and can be removed.

No code changed:

  # arch/x86/kernel/tsc_sync.o:

   text    data     bss     dec     hex filename
   3569     198      44    3811     ee3 tsc_sync.o.before
   3569     198      44    3811     ee3 tsc_sync.o.after

md5:
   36216de29b208edbcd34fed9fe7f7b69  tsc_sync.o.before.asm
   36216de29b208edbcd34fed9fe7f7b69  tsc_sync.o.after.asm

 [ bp: Massage commit message. ]

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200118171143.25178-1-mateusznosek0@gmail.com
2020-01-22 13:52:42 +01:00
Xiaochen Shen
32ada3b9e0 x86/resctrl: Clean up unused function parameter in mkdir path
Commit

  334b0f4e9b ("x86/resctrl: Fix a deadlock due to inaccurate reference")

changed the argument to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock()
within mkdir_rdt_prepare(). That change resulted in an unused function
parameter to mkdir_rdt_prepare().

Clean up the unused function parameter in mkdir_rdt_prepare() and its
callers rdtgroup_mkdir_mon() and rdtgroup_mkdir_ctrl_mon().

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1578500886-21771-5-git-send-email-xiaochen.shen@intel.com
2020-01-20 17:00:41 +01:00
Xiaochen Shen
334b0f4e9b x86/resctrl: Fix a deadlock due to inaccurate reference
There is a race condition which results in a deadlock when rmdir and
mkdir execute concurrently:

$ ls /sys/fs/resctrl/c1/mon_groups/m1/
cpus  cpus_list  mon_data  tasks

Thread 1: rmdir /sys/fs/resctrl/c1
Thread 2: mkdir /sys/fs/resctrl/c1/mon_groups/m1

3 locks held by mkdir/48649:
 #0:  (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50
 #1:  (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c13b>] filename_create+0x7b/0x170
 #2:  (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70

4 locks held by rmdir/48652:
 #0:  (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50
 #1:  (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c3cf>] do_rmdir+0x13f/0x1e0
 #2:  (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffb4c86d5d>] vfs_rmdir+0x4d/0x120
 #3:  (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70

Thread 1 is deleting control group "c1". Holding rdtgroup_mutex,
kernfs_remove() removes all kernfs nodes under directory "c1"
recursively, then waits for sub kernfs node "mon_groups" to drop active
reference.

Thread 2 is trying to create a subdirectory "m1" in the "mon_groups"
directory. The wrapper kernfs_iop_mkdir() takes an active reference to
the "mon_groups" directory but the code drops the active reference to
the parent directory "c1" instead.

As a result, Thread 1 is blocked on waiting for active reference to drop
and never release rdtgroup_mutex, while Thread 2 is also blocked on
trying to get rdtgroup_mutex.

Thread 1 (rdtgroup_rmdir)   Thread 2 (rdtgroup_mkdir)
(rmdir /sys/fs/resctrl/c1)  (mkdir /sys/fs/resctrl/c1/mon_groups/m1)
-------------------------   -------------------------
                            kernfs_iop_mkdir
                              /*
                               * kn: "m1", parent_kn: "mon_groups",
                               * prgrp_kn: parent_kn->parent: "c1",
                               *
                               * "mon_groups", parent_kn->active++: 1
                               */
                              kernfs_get_active(parent_kn)
kernfs_iop_rmdir
  /* "c1", kn->active++ */
  kernfs_get_active(kn)

  rdtgroup_kn_lock_live
    atomic_inc(&rdtgrp->waitcount)
    /* "c1", kn->active-- */
    kernfs_break_active_protection(kn)
    mutex_lock

  rdtgroup_rmdir_ctrl
    free_all_child_rdtgrp
      sentry->flags = RDT_DELETED

    rdtgroup_ctrl_remove
      rdtgrp->flags = RDT_DELETED
      kernfs_get(kn)
      kernfs_remove(rdtgrp->kn)
        __kernfs_remove
          /* "mon_groups", sub_kn */
          atomic_add(KN_DEACTIVATED_BIAS, &sub_kn->active)
          kernfs_drain(sub_kn)
            /*
             * sub_kn->active == KN_DEACTIVATED_BIAS + 1,
             * waiting on sub_kn->active to drop, but it
             * never drops in Thread 2 which is blocked
             * on getting rdtgroup_mutex.
             */
Thread 1 hangs here ---->
            wait_event(sub_kn->active == KN_DEACTIVATED_BIAS)
            ...
                              rdtgroup_mkdir
                                rdtgroup_mkdir_mon(parent_kn, prgrp_kn)
                                  mkdir_rdt_prepare(parent_kn, prgrp_kn)
                                    rdtgroup_kn_lock_live(prgrp_kn)
                                      atomic_inc(&rdtgrp->waitcount)
                                      /*
                                       * "c1", prgrp_kn->active--
                                       *
                                       * The active reference on "c1" is
                                       * dropped, but not matching the
                                       * actual active reference taken
                                       * on "mon_groups", thus causing
                                       * Thread 1 to wait forever while
                                       * holding rdtgroup_mutex.
                                       */
                                      kernfs_break_active_protection(
                                                               prgrp_kn)
                                      /*
                                       * Trying to get rdtgroup_mutex
                                       * which is held by Thread 1.
                                       */
Thread 2 hangs here ---->             mutex_lock
                                      ...

The problem is that the creation of a subdirectory in the "mon_groups"
directory incorrectly releases the active protection of its parent
directory instead of itself before it starts waiting for rdtgroup_mutex.
This is triggered by the rdtgroup_mkdir() flow calling
rdtgroup_kn_lock_live()/rdtgroup_kn_unlock() with kernfs node of the
parent control group ("c1") as argument. It should be called with kernfs
node "mon_groups" instead. What is currently missing is that the
kn->priv of "mon_groups" is NULL instead of pointing to the rdtgrp.

Fix it by pointing kn->priv to rdtgrp when "mon_groups" is created. Then
it could be passed to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock()
instead. And then it operates on the same rdtgroup structure but handles
the active reference of kernfs node "mon_groups" to prevent deadlock.
The same changes are also made to the "mon_data" directories.

This results in some unused function parameters that will be cleaned up
in follow-up patch as the focus here is on the fix only in support of
backporting efforts.

Fixes: c7d9aac613 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-4-git-send-email-xiaochen.shen@intel.com
2020-01-20 16:57:53 +01:00
Xiaochen Shen
074fadee59 x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup
There is a race condition in the following scenario which results in an
use-after-free issue when reading a monitoring file and deleting the
parent ctrl_mon group concurrently:

Thread 1 calls atomic_inc() to take refcount of rdtgrp and then calls
kernfs_break_active_protection() to drop the active reference of kernfs
node in rdtgroup_kn_lock_live().

In Thread 2, kernfs_remove() is a blocking routine. It waits on all sub
kernfs nodes to drop the active reference when removing all subtree
kernfs nodes recursively. Thread 2 could block on kernfs_remove() until
Thread 1 calls kernfs_break_active_protection(). Only after
kernfs_remove() completes the refcount of rdtgrp could be trusted.

Before Thread 1 calls atomic_inc() and kernfs_break_active_protection(),
Thread 2 could call kfree() when the refcount of rdtgrp (sentry) is 0
instead of 1 due to the race.

In Thread 1, in rdtgroup_kn_unlock(), referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.

Thread 1 (rdtgroup_mondata_show)  Thread 2 (rdtgroup_rmdir)
--------------------------------  -------------------------
rdtgroup_kn_lock_live
  /*
   * kn active protection until
   * kernfs_break_active_protection(kn)
   */
  rdtgrp = kernfs_to_rdtgroup(kn)
                                  rdtgroup_kn_lock_live
                                    atomic_inc(&rdtgrp->waitcount)
                                    mutex_lock
                                  rdtgroup_rmdir_ctrl
                                    free_all_child_rdtgrp
                                      /*
                                       * sentry->waitcount should be 1
                                       * but is 0 now due to the race.
                                       */
                                      kfree(sentry)*[1]
  /*
   * Only after kernfs_remove()
   * completes, the refcount of
   * rdtgrp could be trusted.
   */
  atomic_inc(&rdtgrp->waitcount)
  /* kn->active-- */
  kernfs_break_active_protection(kn)
                                    rdtgroup_ctrl_remove
                                      rdtgrp->flags = RDT_DELETED
                                      /*
                                       * Blocking routine, wait for
                                       * all sub kernfs nodes to drop
                                       * active reference in
                                       * kernfs_break_active_protection.
                                       */
                                      kernfs_remove(rdtgrp->kn)
                                  rdtgroup_kn_unlock
                                    mutex_unlock
                                    atomic_dec_and_test(
                                                &rdtgrp->waitcount)
                                    && (flags & RDT_DELETED)
                                      kernfs_unbreak_active_protection(kn)
                                      kfree(rdtgrp)
  mutex_lock
mon_event_read
rdtgroup_kn_unlock
  mutex_unlock
  /*
   * Use-after-free: refer to earlier rdtgrp
   * memory which was freed in [1].
   */
  atomic_dec_and_test(&rdtgrp->waitcount)
  && (flags & RDT_DELETED)
    /* kn->active++ */
    kernfs_unbreak_active_protection(kn)
    kfree(rdtgrp)

Fix it by moving free_all_child_rdtgrp() to after kernfs_remove() in
rdtgroup_rmdir_ctrl() to ensure it has the accurate refcount of rdtgrp.

Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-3-git-send-email-xiaochen.shen@intel.com
2020-01-20 16:56:11 +01:00
Xiaochen Shen
b8511ccc75 x86/resctrl: Fix use-after-free when deleting resource groups
A resource group (rdtgrp) contains a reference count (rdtgrp->waitcount)
that indicates how many waiters expect this rdtgrp to exist. Waiters
could be waiting on rdtgroup_mutex or some work sitting on a task's
workqueue for when the task returns from kernel mode or exits.

The deletion of a rdtgrp is intended to have two phases:

  (1) while holding rdtgroup_mutex the necessary cleanup is done and
  rdtgrp->flags is set to RDT_DELETED,

  (2) after releasing the rdtgroup_mutex, the rdtgrp structure is freed
  only if there are no waiters and its flag is set to RDT_DELETED. Upon
  gaining access to rdtgroup_mutex or rdtgrp, a waiter is required to check
  for the RDT_DELETED flag.

When unmounting the resctrl file system or deleting ctrl_mon groups,
all of the subdirectories are removed and the data structure of rdtgrp
is forcibly freed without checking rdtgrp->waitcount. If at this point
there was a waiter on rdtgrp then a use-after-free issue occurs when the
waiter starts running and accesses the rdtgrp structure it was waiting
on.

See kfree() calls in [1], [2] and [3] in these two call paths in
following scenarios:
(1) rdt_kill_sb() -> rmdir_all_sub() -> free_all_child_rdtgrp()
(2) rdtgroup_rmdir() -> rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp()

There are several scenarios that result in use-after-free issue in
following:

Scenario 1:
-----------
In Thread 1, rdtgroup_tasks_write() adds a task_work callback
move_myself(). If move_myself() is scheduled to execute after Thread 2
rdt_kill_sb() is finished, referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.

Thread 1 (rdtgroup_tasks_write)        Thread 2 (rdt_kill_sb)
-------------------------------        ----------------------
rdtgroup_kn_lock_live
  atomic_inc(&rdtgrp->waitcount)
  mutex_lock
rdtgroup_move_task
  __rdtgroup_move_task
    /*
     * Take an extra refcount, so rdtgrp cannot be freed
     * before the call back move_myself has been invoked
     */
    atomic_inc(&rdtgrp->waitcount)
    /* Callback move_myself will be scheduled for later */
    task_work_add(move_myself)
rdtgroup_kn_unlock
  mutex_unlock
  atomic_dec_and_test(&rdtgrp->waitcount)
  && (flags & RDT_DELETED)
                                       mutex_lock
                                       rmdir_all_sub
                                         /*
                                          * sentry and rdtgrp are freed
                                          * without checking refcount
                                          */
                                         free_all_child_rdtgrp
                                           kfree(sentry)*[1]
                                         kfree(rdtgrp)*[2]
                                       mutex_unlock
/*
 * Callback is scheduled to execute
 * after rdt_kill_sb is finished
 */
move_myself
  /*
   * Use-after-free: refer to earlier rdtgrp
   * memory which was freed in [1] or [2].
   */
  atomic_dec_and_test(&rdtgrp->waitcount)
  && (flags & RDT_DELETED)
    kfree(rdtgrp)

Scenario 2:
-----------
In Thread 1, rdtgroup_tasks_write() adds a task_work callback
move_myself(). If move_myself() is scheduled to execute after Thread 2
rdtgroup_rmdir() is finished, referring to earlier rdtgrp memory
(rdtgrp->waitcount) which was already freed in Thread 2 results in
use-after-free issue.

Thread 1 (rdtgroup_tasks_write)        Thread 2 (rdtgroup_rmdir)
-------------------------------        -------------------------
rdtgroup_kn_lock_live
  atomic_inc(&rdtgrp->waitcount)
  mutex_lock
rdtgroup_move_task
  __rdtgroup_move_task
    /*
     * Take an extra refcount, so rdtgrp cannot be freed
     * before the call back move_myself has been invoked
     */
    atomic_inc(&rdtgrp->waitcount)
    /* Callback move_myself will be scheduled for later */
    task_work_add(move_myself)
rdtgroup_kn_unlock
  mutex_unlock
  atomic_dec_and_test(&rdtgrp->waitcount)
  && (flags & RDT_DELETED)
                                       rdtgroup_kn_lock_live
                                         atomic_inc(&rdtgrp->waitcount)
                                         mutex_lock
                                       rdtgroup_rmdir_ctrl
                                         free_all_child_rdtgrp
                                           /*
                                            * sentry is freed without
                                            * checking refcount
                                            */
                                           kfree(sentry)*[3]
                                         rdtgroup_ctrl_remove
                                           rdtgrp->flags = RDT_DELETED
                                       rdtgroup_kn_unlock
                                         mutex_unlock
                                         atomic_dec_and_test(
                                                     &rdtgrp->waitcount)
                                         && (flags & RDT_DELETED)
                                           kfree(rdtgrp)
/*
 * Callback is scheduled to execute
 * after rdt_kill_sb is finished
 */
move_myself
  /*
   * Use-after-free: refer to earlier rdtgrp
   * memory which was freed in [3].
   */
  atomic_dec_and_test(&rdtgrp->waitcount)
  && (flags & RDT_DELETED)
    kfree(rdtgrp)

If CONFIG_DEBUG_SLAB=y, Slab corruption on kmalloc-2k can be observed
like following. Note that "0x6b" is POISON_FREE after kfree(). The
corrupted bits "0x6a", "0x64" at offset 0x424 correspond to
waitcount member of struct rdtgroup which was freed:

  Slab corruption (Not tainted): kmalloc-2k start=ffff9504c5b0d000, len=2048
  420: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkjkkkkkkkkkkk
  Single bit error detected. Probably bad RAM.
  Run memtest86+ or a similar memory test tool.
  Next obj: start=ffff9504c5b0d800, len=2048
  000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
  010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

  Slab corruption (Not tainted): kmalloc-2k start=ffff9504c58ab800, len=2048
  420: 6b 6b 6b 6b 64 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkdkkkkkkkkkkk
  Prev obj: start=ffff9504c58ab000, len=2048
  000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
  010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Fix this by taking reference count (waitcount) of rdtgrp into account in
the two call paths that currently do not do so. Instead of always
freeing the resource group it will only be freed if there are no waiters
on it. If there are waiters, the resource group will have its flags set
to RDT_DELETED.

It will be left to the waiter to free the resource group when it starts
running and finding that it was the last waiter and the resource group
has been removed (rdtgrp->flags & RDT_DELETED) since. (1) rdt_kill_sb()
-> rmdir_all_sub() -> free_all_child_rdtgrp() (2) rdtgroup_rmdir() ->
rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp()

Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1578500886-21771-2-git-send-email-xiaochen.shen@intel.com
2020-01-20 16:45:43 +01:00
Tony W Wang-oc
283bab9809 x86/cpu: Remove redundant cpu_detect_cache_sizes() call
Both functions call init_intel_cacheinfo() which computes L2 and L3 cache
sizes from CPUID(4). But then they also call cpu_detect_cache_sizes() a
bit later which computes ->x86_tlbsize and L2 size from CPUID(80000006).

However, the latter call is not needed because

 - on these CPUs, CPUID(80000006).EBX for ->x86_tlbsize is reserved

 - CPUID(80000006).ECX for the L2 size has the same result as CPUID(4)

Therefore, remove the latter call to simplify the code.

 [ bp: Rewrite commit message. ]

Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1579075257-6985-1-git-send-email-TonyWWang-oc@zhaoxin.com
2020-01-20 16:32:35 +01:00
Chen Yu
e79f15a459 x86/resctrl: Add task resctrl information display
Monitoring tools that want to find out which resctrl control and monitor
groups a task belongs to must currently read the "tasks" file in every
group until they locate the process ID.

Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this
information:

1)   res:
     mon:

resctrl is not available.

2)   res:/
     mon:

Task is part of the root resctrl control group, and it is not associated
to any monitor group.

3)  res:/
    mon:mon0

Task is part of the root resctrl control group and monitor group mon0.

4)  res:group0
    mon:

Task is part of resctrl control group group0, and it is not associated
to any monitor group.

5) res:group0
   mon:mon1

Task is part of resctrl control group group0 and monitor group mon1.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jinshi Chen <jinshi.chen@intel.com>
Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com
2020-01-20 16:19:10 +01:00
Arvind Sankar
dacc909233 x86/sysfb: Fix check for bad VRAM size
When checking whether the reported lfb_size makes sense, the height
* stride result is page-aligned before seeing whether it exceeds the
reported size.

This doesn't work if height * stride is not an exact number of pages.
For example, as reported in the kernel bugzilla below, an 800x600x32 EFI
framebuffer gets skipped because of this.

Move the PAGE_ALIGN to after the check vs size.

Reported-by: Christopher Head <chead@chead.ca>
Tested-by: Christopher Head <chead@chead.ca>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206051
Link: https://lkml.kernel.org/r/20200107230410.2291947-1-nivedita@alum.mit.edu
2020-01-20 10:57:53 +01:00
Ingo Molnar
cb6c82df68 Linux 5.5-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4k7i8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGvk0IAKRenVOdiudY77SQ
 VZjsteyrYTTQtPPv494ToIRjR0XQ+gYp8vyWzXTUC5Nm9Y9U3VzDqUPUjWszrSXE
 6mU+tzcMc9qwuUxnIFn8zfg64ygw+37sn/w3xqeH4QmF9Z5Wl3EX3SdXTs7jp3RS
 VxiztkUNI5ZBV2GDtla5K/9qLPqCQnUYXIiyi5lAtBtiitZDVXFp7dy7hMgEiaEO
 +78K5Kh3xlt5ndDsBFOlwIb2Oof3KL7bBXntdbSBc/bjol6IRvAgln48HWCv59G2
 jzAp2tj2KobX9GRAEPj+v4TQZEW0SXDNDi8MgQsM+3DYVCTmANsv57CBKRuf01+F
 nB1kAys=
 =zSnJ
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-20 08:43:44 +01:00
Ard Biesheuvel
1f299fad1e efi/x86: Limit EFI old memory map to SGI UV machines
We carry a quirk in the x86 EFI code to switch back to an older
method of mapping the EFI runtime services memory regions, because
it was deemed risky at the time to implement a new method without
providing a fallback to the old method in case problems arose.

Such problems did arise, but they appear to be limited to SGI UV1
machines, and so these are the only ones for which the fallback gets
enabled automatically (via a DMI quirk). The fallback can be enabled
manually as well, by passing efi=old_map, but there is very little
evidence that suggests that this is something that is being relied
upon in the field.

Given that UV1 support is not enabled by default by the distros
(Ubuntu, Fedora), there is no point in carrying this fallback code
all the time if there are no other users. So let's move it into the
UV support code, and document that efi=old_map now requires this
support code to be enabled.

Note that efi=old_map has been used in the past on other SGI UV
machines to work around kernel regressions in production, so we
keep the option to enable it by hand, but only if the kernel was
built with UV support.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200113172245.27925-8-ardb@kernel.org
2020-01-20 08:13:01 +01:00
Ingo Molnar
a786810cc8 Linux 5.5-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4k7i8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGvk0IAKRenVOdiudY77SQ
 VZjsteyrYTTQtPPv494ToIRjR0XQ+gYp8vyWzXTUC5Nm9Y9U3VzDqUPUjWszrSXE
 6mU+tzcMc9qwuUxnIFn8zfg64ygw+37sn/w3xqeH4QmF9Z5Wl3EX3SdXTs7jp3RS
 VxiztkUNI5ZBV2GDtla5K/9qLPqCQnUYXIiyi5lAtBtiitZDVXFp7dy7hMgEiaEO
 +78K5Kh3xlt5ndDsBFOlwIb2Oof3KL7bBXntdbSBc/bjol6IRvAgln48HWCv59G2
 jzAp2tj2KobX9GRAEPj+v4TQZEW0SXDNDi8MgQsM+3DYVCTmANsv57CBKRuf01+F
 nB1kAys=
 =zSnJ
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc7' into efi/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-20 08:05:16 +01:00
Linus Torvalds
0cc2682d8b Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes:

   - a resctrl fix for uninitialized objects found by debugobjects

   - a resctrl memory leak fix

   - fix the unintended re-enabling of the of SME and SEV CPU flags if
     memory encryption was disabled at bootup via the MSR space"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/CPU/AMD: Ensure clearing of SME/SEV features is maintained
  x86/resctrl: Fix potential memory leak
  x86/resctrl: Fix an imbalance in domain_remove_cpu()
2020-01-18 13:02:12 -08:00