Commit Graph

1804 Commits

Author SHA1 Message Date
Sean Christopherson
90d2f60f41 KVM: x86: Use KVM cpu caps to track UMIP emulation
Set UMIP in kvm_cpu_caps when it is emulated by VMX, even though the
bit will effectively be dropped by do_host_cpuid().  This allows
checking for UMIP emulation via kvm_cpu_caps instead of a dedicated
kvm_x86_ops callback.

No functional change intended.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:58:28 +01:00
Sean Christopherson
c10398b6d0 KVM: x86: Use KVM cpu caps to mark CR4.LA57 as not-reserved
Add accessor(s) for KVM cpu caps and use said accessor to detect
hardware support for LA57 instead of manually querying CPUID.

Note, the explicit conversion to bool via '!!' in kvm_cpu_cap_has() is
technically unnecessary, but it gives people a warm fuzzy feeling.

No functional change intended.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:58:27 +01:00
Sean Christopherson
cfc481810c KVM: x86: Calculate the supported xcr0 mask at load time
Add a new global variable, supported_xcr0, to track which xcr0 bits can
be exposed to the guest instead of calculating the mask on every call.
The supported bits are constant for a given instance of KVM.

This paves the way toward eliminating the ->mpx_supported() call in
kvm_mpx_supported(), e.g. eliminates multiple retpolines in VMX's nested
VM-Enter path, and eventually toward eliminating ->mpx_supported()
altogether.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:58:09 +01:00
Sean Christopherson
06add254c7 KVM: x86: Shrink the usercopy region of the emulation context
Shuffle a few operand structs to the end of struct x86_emulate_ctxt and
update the cache creation to whitelist only the region of the emulation
context that is expected to be copied to/from user memory, e.g. the
instruction operands, registers, and fetch/io/mem caches.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:53 +01:00
Sean Christopherson
2f728d66e8 KVM: x86: Move kvm_emulate.h into KVM's private directory
Now that the emulation context is dynamically allocated and not embedded
in struct kvm_vcpu, move its header, kvm_emulate.h, out of the public
asm directory and into KVM's private x86 directory.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:52 +01:00
Sean Christopherson
c9b8b07cde KVM: x86: Dynamically allocate per-vCPU emulation context
Allocate the emulation context instead of embedding it in struct
kvm_vcpu_arch.

Dynamic allocation provides several benefits:

  - Shrinks the size x86 vcpus by ~2.5k bytes, dropping them back below
    the PAGE_ALLOC_COSTLY_ORDER threshold.
  - Allows for dropping the include of kvm_emulate.h from asm/kvm_host.h
    and moving kvm_emulate.h into KVM's private directory.
  - Allows a reducing KVM's attack surface by shrinking the amount of
    vCPU data that is exposed to usercopy.
  - Allows a future patch to disable the emulator entirely, which may or
    may not be a realistic endeavor.

Mark the entire struct as valid for usercopy to maintain existing
behavior with respect to hardened usercopy.  Future patches can shrink
the usercopy range to cover only what is necessary.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:52 +01:00
Sean Christopherson
21f1b8f29e KVM: x86: Explicitly pass an exception struct to check_intercept
Explicitly pass an exception struct when checking for intercept from
the emulator, which eliminates the last reference to arch.emulate_ctxt
in vendor specific code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:50 +01:00
Sean Christopherson
2e3bb4d886 KVM: x86: Refactor I/O emulation helpers to provide vcpu-only variant
Add variants of the I/O helpers that take a vCPU instead of an emulation
context.  This will eventually allow KVM to limit use of the emulation
context to the full emulation path.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:49 +01:00
Sean Christopherson
abbed4fa94 KVM: x86: Fix warning due to implicit truncation on 32-bit KVM
Explicitly cast the integer literal to an unsigned long when stuffing a
non-canonical value into the host virtual address during private memslot
deletion.  The explicit cast fixes a warning that gets promoted to an
error when running with KVM's newfangled -Werror setting.

  arch/x86/kvm/x86.c:9739:9: error: large integer implicitly truncated
  to unsigned type [-Werror=overflow]

Fixes: a3e967c0b87d3 ("KVM: Terminate memslot walks via used_slots"
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:48 +01:00
Sean Christopherson
d8dd54e063 KVM: x86/mmu: Rename kvm_mmu->get_cr3() to ->get_guest_pgd()
Rename kvm_mmu->get_cr3() to call out that it is retrieving a guest
value, as opposed to kvm_mmu->set_cr3(), which sets a host value, and to
note that it will return something other than CR3 when nested EPT is in
use.  Hopefully the new name will also make it more obvious that L1's
nested_cr3 is returned in SVM's nested NPT case.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:46 +01:00
Sean Christopherson
a1c77abb8d KVM: nVMX: Properly handle userspace interrupt window request
Return true for vmx_interrupt_allowed() if the vCPU is in L2 and L1 has
external interrupt exiting enabled.  IRQs are never blocked in hardware
if the CPU is in the guest (L2 from L1's perspective) when IRQs trigger
VM-Exit.

The new check percolates up to kvm_vcpu_ready_for_interrupt_injection()
and thus vcpu_run(), and so KVM will exit to userspace if userspace has
requested an interrupt window (to inject an IRQ into L1).

Remove the @external_intr param from vmx_check_nested_events(), which is
actually an indicator that userspace wants an interrupt window, e.g.
it's named @req_int_win further up the stack.  Injecting a VM-Exit into
L1 to try and bounce out to L0 userspace is all kinds of broken and is
no longer necessary.

Remove the hack in nested_vmx_vmexit() that attempted to workaround the
breakage in vmx_check_nested_events() by only filling interrupt info if
there's an actual interrupt pending.  The hack actually made things
worse because it caused KVM to _never_ fill interrupt info when the
LAPIC resides in userspace (kvm_cpu_has_interrupt() queries
interrupt.injected, which is always cleared by prepare_vmcs12() before
reaching the hack in nested_vmx_vmexit()).

Fixes: 6550c4df7e ("KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"")
Cc: stable@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:40 +01:00
Wanpeng Li
b34de572a8 KVM: X86: trigger kvmclock sync request just once on VM creation
In the progress of vCPUs creation, it queues a kvmclock sync worker to the global
workqueue before each vCPU creation completes. The workqueue subsystem guarantees
not to queue the already queued work; however, we can make the logic more clear by
making just one leader to trigger this kvmclock sync request, and also save on
cacheline bouncing caused by test_and_set_bit.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:40 +01:00
Wanpeng Li
4abaffce4d KVM: LAPIC: Recalculate apic map in batch
In the vCPU reset and set APIC_BASE MSR path, the apic map will be recalculated
several times, each time it will consume 10+ us observed by ftrace in my
non-overcommit environment since the expensive memory allocate/mutex/rcu etc
operations. This patch optimizes it by recaluating apic map in batch, I hope
this can benefit the serverless scenario which can frequently create/destroy
VMs.

Before patch:

kvm_lapic_reset  ~27us

After patch:

kvm_lapic_reset  ~14us

Observed by ftrace, improve ~48%.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:39 +01:00
Jay Zhou
3c9bd4006b KVM: x86: enable dirty log gradually in small chunks
It could take kvm->mmu_lock for an extended period of time when
enabling dirty log for the first time. The main cost is to clear
all the D-bits of last level SPTEs. This situation can benefit from
manual dirty log protect as well, which can reduce the mmu_lock
time taken. The sequence is like this:

1. Initialize all the bits of the dirty bitmap to 1 when enabling
   dirty log for the first time
2. Only write protect the huge pages
3. KVM_GET_DIRTY_LOG returns the dirty bitmap info
4. KVM_CLEAR_DIRTY_LOG will clear D-bit for each of the leaf level
   SPTEs gradually in small chunks

Under the Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz environment,
I did some tests with a 128G windows VM and counted the time taken
of memory_global_dirty_log_start, here is the numbers:

VM Size        Before    After optimization
128G           460ms     10ms

Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:37 +01:00
Sean Christopherson
562b6b089d KVM: x86: Consolidate VM allocation and free for VMX and SVM
Move the VM allocation and free code to common x86 as the logic is
more or less identical across SVM and VMX.

Note, although hyperv.hv_pa_pg is part of the common kvm->arch, it's
(currently) only allocated by VMX VMs.  But, since kfree() plays nice
when passed a NULL pointer, the superfluous call for SVM is harmless
and avoids future churn if SVM gains support for HyperV's direct TLB
flush.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
[Make vm_size a field instead of a function. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:33 +01:00
Sean Christopherson
b3594ffbf9 KVM: x86/mmu: Move kvm_arch_flush_remote_tlbs_memslot() to mmu.c
Move kvm_arch_flush_remote_tlbs_memslot() from x86.c to mmu.c in
preparation for calling kvm_flush_remote_tlbs_with_address() instead of
kvm_flush_remote_tlbs().  The with_address() variant is statically
defined in mmu.c, arguably kvm_arch_flush_remote_tlbs_memslot() belongs
in mmu.c anyways, and defining kvm_arch_flush_remote_tlbs_memslot() in
mmu.c will allow the compiler to inline said function when a future
patch consolidates open coded variants of the function.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:28 +01:00
Sean Christopherson
0577d1abe7 KVM: Terminate memslot walks via used_slots
Refactor memslot handling to treat the number of used slots as the de
facto size of the memslot array, e.g. return NULL from id_to_memslot()
when an invalid index is provided instead of relying on npages==0 to
detect an invalid memslot.  Rework the sorting and walking of memslots
in advance of dynamically sizing memslots to aid bisection and debug,
e.g. with luck, a bug in the refactoring will bisect here and/or hit a
WARN instead of randomly corrupting memory.

Alternatively, a global null/invalid memslot could be returned, i.e. so
callers of id_to_memslot() don't have to explicitly check for a NULL
memslot, but that approach runs the risk of introducing difficult-to-
debug issues, e.g. if the global null slot is modified.  Constifying
the return from id_to_memslot() to combat such issues is possible, but
would require a massive refactoring of arch specific code and would
still be susceptible to casting shenanigans.

Add function comments to update_memslots() and search_memslots() to
explicitly (and loudly) state how memslots are sorted.

Opportunistically stuff @hva with a non-canonical value when deleting a
private memslot on x86 to detect bogus usage of the freed slot.

No functional change intended.

Tested-by: Christoffer Dall <christoffer.dall@arm.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:26 +01:00
Sean Christopherson
0dff084607 KVM: Provide common implementation for generic dirty log functions
Move the implementations of KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG
for CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT into common KVM code.
The arch specific implemenations are extremely similar, differing
only in whether the dirty log needs to be sync'd from hardware (x86)
and how the TLBs are flushed.  Add new arch hooks to handle sync
and TLB flush; the sync will also be used for non-generic dirty log
support in a future patch (s390).

The ulterior motive for providing a common implementation is to
eliminate the dependency between arch and common code with respect to
the memslot referenced by the dirty log, i.e. to make it obvious in the
code that the validity of the memslot is guaranteed, as a future patch
will rework memslot handling such that id_to_memslot() can return NULL.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:24 +01:00
Sean Christopherson
e96c81ee89 KVM: Simplify kvm_free_memslot() and all its descendents
Now that all callers of kvm_free_memslot() pass NULL for @dont, remove
the param from the top-level routine and all arch's implementations.

No functional change intended.

Tested-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:22 +01:00
Sean Christopherson
21198846de KVM: x86: Free arrays for old memslot when moving memslot's base gfn
Explicitly free the metadata arrays (stored in slot->arch) in the old
memslot structure when moving the memslot's base gfn is committed.  This
eliminates x86's dependency on kvm_free_memslot() being called when a
memslot move is committed, and paves the way for removing the funky code
in kvm_free_memslot() that conditionally frees structures based on its
@dont param.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:21 +01:00
Sean Christopherson
9d4c197c0e KVM: Drop "const" attribute from old memslot in commit_memory_region()
Drop the "const" attribute from @old in kvm_arch_commit_memory_region()
to allow arch specific code to free arch specific resources in the old
memslot without having to cast away the attribute.  Freeing resources in
kvm_arch_commit_memory_region() paves the way for simplifying
kvm_free_memslot() by eliminating the last usage of its @dont param.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:20 +01:00
Sean Christopherson
414de7abbf KVM: Drop kvm_arch_create_memslot()
Remove kvm_arch_create_memslot() now that all arch implementations are
effectively nops.  Removing kvm_arch_create_memslot() eliminates the
possibility for arch specific code to allocate memory prior to setting
a memslot, which sets the stage for simplifying kvm_free_memslot().

Cc: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:17 +01:00
Sean Christopherson
0dab98b7ad KVM: x86: Allocate memslot resources during prepare_memory_region()
Allocate the various metadata structures associated with a new memslot
during kvm_arch_prepare_memory_region(), which paves the way for
removing kvm_arch_create_memslot() altogether.  Moving x86's memory
allocation only changes the order of kernel memory allocations between
x86 and common KVM code.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:16 +01:00
Sean Christopherson
edd4fa37ba KVM: x86: Allocate new rmap and large page tracking when moving memslot
Reallocate a rmap array and recalcuate large page compatibility when
moving an existing memslot to correctly handle the alignment properties
of the new memslot.  The number of rmap entries required at each level
is dependent on the alignment of the memslot's base gfn with respect to
that level, e.g. moving a large-page aligned memslot so that it becomes
unaligned will increase the number of rmap entries needed at the now
unaligned level.

Not updating the rmap array is the most obvious bug, as KVM accesses
garbage data beyond the end of the rmap.  KVM interprets the bad data as
pointers, leading to non-canonical #GPs, unexpected #PFs, etc...

  general protection fault: 0000 [#1] SMP
  CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:rmap_get_first+0x37/0x50 [kvm]
  Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3
  RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246
  RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012
  RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570
  RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8
  R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000
  R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8
  FS:  00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0
  Call Trace:
   kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm]
   __kvm_set_memory_region.part.64+0x559/0x960 [kvm]
   kvm_set_memory_region+0x45/0x60 [kvm]
   kvm_vm_ioctl+0x30f/0x920 [kvm]
   do_vfs_ioctl+0xa1/0x620
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x4c/0x170
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f7ed9911f47
  Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48
  RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47
  RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004
  RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700
  R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000
  R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000
  Modules linked in: kvm_intel kvm irqbypass
  ---[ end trace 0c5f570b3358ca89 ]---

The disallow_lpage tracking is more subtle.  Failure to update results
in KVM creating large pages when it shouldn't, either due to stale data
or again due to indexing beyond the end of the metadata arrays, which
can lead to memory corruption and/or leaking data to guest/userspace.

Note, the arrays for the old memslot are freed by the unconditional call
to kvm_free_memslot() in __kvm_set_memory_region().

Fixes: 05da45583d ("KVM: MMU: large page support")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:13 +01:00
Sean Christopherson
744e699c7e KVM: x86: Move gpa_val and gpa_available into the emulator context
Move the GPA tracking into the emulator context now that the context is
guaranteed to be initialized via __init_emulate_ctxt() prior to
dereferencing gpa_{available,val}, i.e. now that seeing a stale
gpa_available will also trigger a WARN due to an invalid context.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:12 +01:00
Sean Christopherson
92daa48b34 KVM: x86: Add EMULTYPE_PF when emulation is triggered by a page fault
Add a new emulation type flag to explicitly mark emulation related to a
page fault.  Move the propation of the GPA into the emulator from the
page fault handler into x86_emulate_instruction, using EMULTYPE_PF as an
indicator that cr2 is valid.  Similarly, don't propagate cr2 into the
exception.address when it's *not* valid.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:12 +01:00
Miaohe Lin
e080e538e6 KVM: x86: eliminate some unreachable code
These code are unreachable, remove them.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 17:57:09 +01:00
Oliver Upton
5ef8acbdd6 KVM: nVMX: Emulate MTF when performing instruction emulation
Since commit 5f3d45e7f2 ("kvm/x86: add support for
MONITOR_TRAP_FLAG"), KVM has allowed an L1 guest to use the monitor trap
flag processor-based execution control for its L2 guest. KVM simply
forwards any MTF VM-exits to the L1 guest, which works for normal
instruction execution.

However, when KVM needs to emulate an instruction on the behalf of an L2
guest, the monitor trap flag is not emulated. Add the necessary logic to
kvm_skip_emulated_instruction() to synthesize an MTF VM-exit to L1 upon
instruction emulation for L2.

Fixes: 5f3d45e7f2 ("kvm/x86: add support for MONITOR_TRAP_FLAG")
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-23 09:36:23 +01:00
Paolo Bonzini
9446e6fce0 KVM: x86: fix WARN_ON check of an unsigned less than zero
The check cpu->hv_clock.system_time < 0 is redundant since system_time
is a u64 and hence can never be less than zero.  But what was actually
meant is to check that the result is positive, since kernel_ns and
v->kvm->arch.kvmclock_offset are both s64.

Reported-by: Colin King <colin.king@canonical.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Addresses-Coverity: ("Macro compares unsigned to 0")
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-12 20:09:46 +01:00
Sean Christopherson
7a02674d15 KVM: x86/mmu: Avoid retpoline on ->page_fault() with TDP
Wrap calls to ->page_fault() with a small shim to directly invoke the
TDP fault handler when the kernel is using retpolines and TDP is being
used.  Single out the TDP fault handler and annotate the TDP path as
likely to coerce the compiler into preferring it over the indirect
function call.

Rename tdp_page_fault() to kvm_tdp_page_fault(), as it's exposed outside
of mmu.c to allow inlining the shim.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-12 20:09:42 +01:00
Miaohe Lin
20796447a1 KVM: x86: remove duplicated KVM_REQ_EVENT request
The KVM_REQ_EVENT request is already made in kvm_set_rflags(). We should
not make it again.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-12 20:09:40 +01:00
Oliver Upton
a06230b62b KVM: x86: Deliver exception payload on KVM_GET_VCPU_EVENTS
KVM allows the deferral of exception payloads when a vCPU is in guest
mode to allow the L1 hypervisor to intercept certain events (#PF, #DB)
before register state has been modified. However, this behavior is
incompatible with the KVM_{GET,SET}_VCPU_EVENTS ABI, as userspace
expects register state to have been immediately modified. Userspace may
opt-in for the payload deferral behavior with the
KVM_CAP_EXCEPTION_PAYLOAD per-VM capability. As such,
kvm_multiple_exception() will immediately manipulate guest registers if
the capability hasn't been requested.

Since the deferral is only necessary if a userspace ioctl were to be
serviced at the same as a payload bearing exception is recognized, this
behavior can be relaxed. Instead, opportunistically defer the payload
from kvm_multiple_exception() and deliver the payload before completing
a KVM_GET_VCPU_EVENTS ioctl.

Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-12 12:34:10 +01:00
Oliver Upton
307f1cfa26 KVM: x86: Mask off reserved bit from #DB exception payload
KVM defines the #DB payload as compatible with the 'pending debug
exceptions' field under VMX, not DR6. Mask off bit 12 when applying the
payload to DR6, as it is reserved on DR6 but not the 'pending debug
exceptions' field.

Fixes: f10c729ff9 ("kvm: vmx: Defer setting of DR6 until #DB delivery")
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-12 12:34:09 +01:00
Sean Christopherson
d76c7fbc01 KVM: x86: Mark CR4.UMIP as reserved based on associated CPUID bit
Re-add code to mark CR4.UMIP as reserved if UMIP is not supported by the
host.  The UMIP handling was unintentionally dropped during a recent
refactoring.

Not flagging CR4.UMIP allows the guest to set its CR4.UMIP regardless of
host support or userspace desires.  On CPUs with UMIP support, including
emulated UMIP, this allows the guest to enable UMIP against the wishes
of the userspace VMM.  On CPUs without any form of UMIP, this results in
a failed VM-Enter due to invalid guest state.

Fixes: 345599f9a2 ("KVM: x86: Add macro to ensure reserved cr4 bits checks stay in sync")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 16:30:19 +01:00
Paolo Bonzini
8171cd6880 KVM: x86: use raw clock values consistently
Commit 53fafdbb8b ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") changed kvmclock to use tkr_raw instead of tkr_mono.  However,
the default kvmclock_offset for the VM was still based on the monotonic
clock and, if the raw clock drifted enough from the monotonic clock,
this could cause a negative system_time to be written to the guest's
struct pvclock.  RHEL5 does not like it and (if it boots fast enough to
observe a negative time value) it hangs.

There is another thing to be careful about: getboottime64 returns the
host boot time with tkr_mono frequency, and subtracting the tkr_raw-based
kvmclock value will cause the wallclock to be off if tkr_raw drifts
from tkr_mono.  To avoid this, compute the wallclock delta from the
current time instead of being clever and using getboottime64.

Fixes: 53fafdbb8b ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:17:45 +01:00
Paolo Bonzini
917f9475c0 KVM: x86: reorganize pvclock_gtod_data members
We will need a copy of tk->offs_boot in the next patch.  Store it and
cleanup the struct: instead of storing tk->tkr_xxx.base with the tk->offs_boot
included, store the raw value in struct pvclock_clock and sum it in
do_monotonic_raw and do_realtime.   tk->tkr_xxx.xtime_nsec also moves
to struct pvclock_clock.

While at it, fix a (usually harmless) typo in do_monotonic_raw, which
was using gtod->clock.shift instead of gtod->raw_clock.shift.

Fixes: 53fafdbb8b ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:17:45 +01:00
Suravee Suthikulpanit
f4fdc0a2ed kvm: x86: hyperv: Use APICv update request interface
Since disabling APICv has to be done for all vcpus on AMD-based
system, adopt the newly introduced kvm_request_apicv_update()
interface, and introduce a new APICV_INHIBIT_REASON_HYPERV.

Also, remove the kvm_vcpu_deactivate_apicv() since no longer used.

Cc: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:17:43 +01:00
Suravee Suthikulpanit
2de9d0ccd0 kvm: x86: Introduce x86 ops hook for pre-update APICv
AMD SVM AVIC needs to update APIC backing page mapping before changing
APICv mode. Introduce struct kvm_x86_ops.pre_update_apicv_exec_ctrl
function hook to be called prior KVM APICv update request to each vcpu.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:17:42 +01:00
Suravee Suthikulpanit
ef8efd7a15 kvm: x86: Introduce APICv x86 ops for checking APIC inhibit reasons
Inibit reason bits are used to determine if APICv deactivation is
applicable for a particular hardware virtualization architecture.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:17:42 +01:00
Suravee Suthikulpanit
24bbf74c0c kvm: x86: Add APICv (de)activate request trace points
Add trace points when sending request to (de)activate APICv.

Suggested-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:17:41 +01:00
Suravee Suthikulpanit
8df14af42f kvm: x86: Add support for dynamic APICv activation
Certain runtime conditions require APICv to be temporary deactivated
during runtime.  The current implementation only support run-time
deactivation of APICv when Hyper-V SynIC is enabled, which is not
temporary.

In addition, for AMD, when APICv is (de)activated at runtime,
all vcpus in the VM have to operate in the same mode.  Thus the
requesting vcpu must notify the others.

So, introduce the following:
 * A new KVM_REQ_APICV_UPDATE request bit
 * Interfaces to request all vcpus to update APICv status
 * A new interface to update APICV-related parameters for each vcpu

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:17:41 +01:00
Suravee Suthikulpanit
4e19c36f2d kvm: x86: Introduce APICv inhibit reason bits
There are several reasons in which a VM needs to deactivate APICv
e.g. disable APICv via parameter during module loading, or when
enable Hyper-V SynIC support. Additional inhibit reasons will be
introduced later on when dynamic APICv is supported,

Introduce KVM APICv inhibit reason bits along with a new variable,
apicv_inhibit_reasons, to help keep track of APICv state for each VM,

Initially, the APICV_INHIBIT_REASON_DISABLE bit is used to indicate
the case where APICv is disabled during KVM module load.
(e.g. insmod kvm_amd avic=0 or insmod kvm_intel enable_apicv=0).

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[Do not use get_enable_apicv; consider irqchip_split in svm.c. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05 15:17:40 +01:00
Linus Torvalds
e813e65038 ARM: Cleanups and corner case fixes
PPC: Bugfixes
 
 x86:
 * Support for mapping DAX areas with large nested page table entries.
 * Cleanups and bugfixes here too.  A particularly important one is
 a fix for FPU load when the thread has TIF_NEED_FPU_LOAD.  There is
 also a race condition which could be used in guest userspace to exploit
 the guest kernel, for which the embargo expired today.
 * Fast path for IPI delivery vmexits, shaving about 200 clock cycles
 from IPI latency.
 * Protect against "Spectre-v1/L1TF" (bring data in the cache via
 speculative out of bound accesses, use L1TF on the sibling hyperthread
 to read it), which unfortunately is an even bigger whack-a-mole game
 than SpectreV1.
 
 Sean continues his mission to rewrite KVM.  In addition to a sizable
 number of x86 patches, this time he contributed a pretty large refactoring
 of vCPU creation that affects all architectures but should not have any
 visible effect.
 
 s390 will come next week together with some more x86 patches.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJeMxtCAAoJEL/70l94x66DQxIIAJv9hMmXLQHGFnUMskjGErR6
 DCLSC0YRdRMwE50CerblyJtGsMwGsPyHZwvZxoAceKJ9w0Yay9cyaoJ87ItBgHoY
 ce0HrqIUYqRSJ/F8WH2lSzkzMBr839rcmqw8p1tt4D5DIsYnxHGWwRaaP+5M/1KQ
 YKFu3Hea4L00U339iIuDkuA+xgz92LIbsn38svv5fxHhPAyWza0rDEYHNgzMKuoF
 IakLf5+RrBFAh6ZuhYWQQ44uxjb+uQa9pVmcqYzzTd5t1g4PV5uXtlJKesHoAvik
 Eba8IEUJn+HgQJjhp3YxQYuLeWOwRF3bwOiZ578MlJ4OPfYXMtbdlqCQANHOcGk=
 =H/q1
 -----END PGP SIGNATURE-----

Merge tag 'kvm-5.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "This is the first batch of KVM changes.

  ARM:
   - cleanups and corner case fixes.

  PPC:
   - Bugfixes

  x86:
   - Support for mapping DAX areas with large nested page table entries.

   - Cleanups and bugfixes here too. A particularly important one is a
     fix for FPU load when the thread has TIF_NEED_FPU_LOAD. There is
     also a race condition which could be used in guest userspace to
     exploit the guest kernel, for which the embargo expired today.

   - Fast path for IPI delivery vmexits, shaving about 200 clock cycles
     from IPI latency.

   - Protect against "Spectre-v1/L1TF" (bring data in the cache via
     speculative out of bound accesses, use L1TF on the sibling
     hyperthread to read it), which unfortunately is an even bigger
     whack-a-mole game than SpectreV1.

  Sean continues his mission to rewrite KVM. In addition to a sizable
  number of x86 patches, this time he contributed a pretty large
  refactoring of vCPU creation that affects all architectures but should
  not have any visible effect.

  s390 will come next week together with some more x86 patches"

* tag 'kvm-5.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  x86/KVM: Clean up host's steal time structure
  x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed
  x86/kvm: Cache gfn to pfn translation
  x86/kvm: Introduce kvm_(un)map_gfn()
  x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit
  KVM: PPC: Book3S PR: Fix -Werror=return-type build failure
  KVM: PPC: Book3S HV: Release lock on page-out failure path
  KVM: arm64: Treat emulated TVAL TimerValue as a signed 32-bit integer
  KVM: arm64: pmu: Only handle supported event counters
  KVM: arm64: pmu: Fix chained SW_INCR counters
  KVM: arm64: pmu: Don't mark a counter as chained if the odd one is disabled
  KVM: arm64: pmu: Don't increment SW_INCR if PMCR.E is unset
  KVM: x86: Use a typedef for fastop functions
  KVM: X86: Add 'else' to unify fastop and execute call path
  KVM: x86: inline memslot_valid_for_gpte
  KVM: x86/mmu: Use huge pages for DAX-backed files
  KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte()
  KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
  KVM: x86/mmu: Zap any compound page when collapsing sptes
  KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)
  ...
2020-01-31 09:30:41 -08:00
Paolo Bonzini
4cbc418a44 Merge branch 'cve-2019-3016' into kvm-next-5.6
From Boris Ostrovsky:

The KVM hypervisor may provide a guest with ability to defer remote TLB
flush when the remote VCPU is not running. When this feature is used,
the TLB flush will happen only when the remote VPCU is scheduled to run
again. This will avoid unnecessary (and expensive) IPIs.

Under certain circumstances, when a guest initiates such deferred action,
the hypervisor may miss the request. It is also possible that the guest
may mistakenly assume that it has already marked remote VCPU as needing
a flush when in fact that request had already been processed by the
hypervisor. In both cases this will result in an invalid translation
being present in a vCPU, potentially allowing accesses to memory locations
in that guest's address space that should not be accessible.

Note that only intra-guest memory is vulnerable.

The five patches address both of these problems:
1. The first patch makes sure the hypervisor doesn't accidentally clear
a guest's remote flush request
2. The rest of the patches prevent the race between hypervisor
acknowledging a remote flush request and guest issuing a new one.

Conflicts:
	arch/x86/kvm/x86.c [move from kvm_arch_vcpu_free to kvm_arch_vcpu_destroy]
2020-01-30 18:47:59 +01:00
Boris Ostrovsky
a6bd811f12 x86/KVM: Clean up host's steal time structure
Now that we are mapping kvm_steal_time from the guest directly we
don't need keep a copy of it in kvm_vcpu_arch.st. The same is true
for the stime field.

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-30 18:45:55 +01:00
Boris Ostrovsky
b043138246 x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed
There is a potential race in record_steal_time() between setting
host-local vcpu->arch.st.steal.preempted to zero (i.e. clearing
KVM_VCPU_PREEMPTED) and propagating this value to the guest with
kvm_write_guest_cached(). Between those two events the guest may
still see KVM_VCPU_PREEMPTED in its copy of kvm_steal_time, set
KVM_VCPU_FLUSH_TLB and assume that hypervisor will do the right
thing. Which it won't.

Instad of copying, we should map kvm_steal_time and that will
guarantee atomicity of accesses to @preempted.

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-30 18:45:55 +01:00
Boris Ostrovsky
917248144d x86/kvm: Cache gfn to pfn translation
__kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
* relatively expensive
* in certain cases (such as when done from atomic context) cannot be called

Stashing gfn-to-pfn mapping should help with both cases.

This is part of CVE-2019-3016.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-30 18:45:55 +01:00
Boris Ostrovsky
8c6de56a42 x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit
kvm_steal_time_set_preempted() may accidentally clear KVM_VCPU_FLUSH_TLB
bit if it is called more than once while VCPU is preempted.

This is part of CVE-2019-3016.

(This bug was also independently discovered by Jim Mattson
<jmattson@google.com>)

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-30 18:45:54 +01:00
Peng Hao
4d6d07aee8 kvm/x86: export kvm_vector_hashing_enabled() is unnecessary
kvm_vector_hashing_enabled() is just called in kvm.ko module.

Signed-off-by: Peng Hao <richard.peng@oppo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27 19:59:58 +01:00
Krish Sadhukhan
b91991bf6b KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests
According to section "Checks on Guest Control Registers, Debug Registers, and
and MSRs" in Intel SDM vol 3C, the following checks are performed on vmentry
of nested guests:

    If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7
    field must be 0.

In KVM, GUEST_DR7 is set prior to the vmcs02 VM-entry by kvm_set_dr() and the
latter synthesizes a #GP if any bit in the high dword in the former is set.
Hence this field needs to be checked in software.

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27 19:59:55 +01:00