If the "virtualize APIC accesses" VM-execution control is set in the
VMCS, the APIC virtualization hardware is triggered when a page walk
in VMX non-root mode terminates at a PTE wherein the address of the 4k
page frame matches the APIC-access address specified in the VMCS. On
hardware, the APIC-access address may be any valid 4k-aligned physical
address.
KVM's nVMX implementation enforces the additional constraint that the
APIC-access address specified in the vmcs12 must be backed by
a "struct page" in L1. If not, L0 will simply clear the "virtualize
APIC accesses" VM-execution control in the vmcs02.
The problem with this approach is that the L1 guest has arranged the
vmcs12 EPT tables--or shadow page tables, if the "enable EPT"
VM-execution control is clear in the vmcs12--so that the L2 guest
physical address(es)--or L2 guest linear address(es)--that reference
the L2 APIC map to the APIC-access address specified in the
vmcs12. Without the "virtualize APIC accesses" VM-execution control in
the vmcs02, the APIC accesses in the L2 guest will directly access the
APIC-access page in L1.
When there is no mapping whatsoever for the APIC-access address in L1,
the L2 VM just loses the intended APIC virtualization. However, when
the APIC-access address is mapped to an MMIO region in L1, the L2
guest gets direct access to the L1 MMIO device. For example, if the
APIC-access address specified in the vmcs12 is 0xfee00000, then L2
gets direct access to L1's APIC.
Since this vmcs12 configuration is something that KVM cannot
faithfully emulate, the appropriate response is to exit to userspace
with KVM_INTERNAL_ERROR_EMULATION.
Fixes: fe3ef05c75 ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
Reported-by: Dan Cross <dcross@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After resetting the vCPU, the kvmclock MSR keeps the previous value but it is
not enabled. This can be confusing, so fix it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use BUG_ON instead of a if condition followed by BUG.
Generated by: scripts/coccinelle/misc/bugon.cocci
Fixes: 4b526de50e ("KVM: x86: Check kvm_rebooting in kvm_spurious_fault()")
CC: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
INTEL_PMC_MAX_GENERIC is currently 32, which exceeds the 18
contiguous MSR indices reserved by Intel for event selectors.
Since some machines actually have MSRs past the reserved range,
filtering them against x86_pmu.num_counters_gp may have false
positives. Cut the list to 18 entries to avoid this.
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jim Mattson <jamttson@google.com>
Fixes: e2ada66ec4 ("kvm: x86: Add Intel PMU MSRs to msrs_to_save[]", 2019-08-21)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
INTEL_PMC_MAX_GENERIC is currently 32, which exceeds the 18 contiguous
MSR indices reserved by Intel for event selectors. Since some machines
actually have MSRs past the reserved range, these may survive the
filtering of msrs_to_save array and would be rejected by KVM_GET/SET_MSR.
To avoid this, cut the list to whatever CPUID reports for the host's
architectural PMU.
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Fixes: e2ada66ec4 ("kvm: x86: Add Intel PMU MSRs to msrs_to_save[]", 2019-08-21)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The largepages debugfs entry is incremented/decremented as shadow
pages are created or destroyed. Clearing it will result in an
underflow, which is harmless to KVM but ugly (and could be
misinterpreted by tools that use debugfs information), so make
this particular statistic read-only.
Cc: kvm-ppc@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported by syzkaller:
WARNING: CPU: 0 PID: 6544 at /home/kernel/data/kvm/arch/x86/kvm//vmx/vmx.c:4689 handle_desc+0x37/0x40 [kvm_intel]
CPU: 0 PID: 6544 Comm: a.out Tainted: G OE 5.3.0-rc4+ #4
RIP: 0010:handle_desc+0x37/0x40 [kvm_intel]
Call Trace:
vmx_handle_exit+0xbe/0x6b0 [kvm_intel]
vcpu_enter_guest+0x4dc/0x18d0 [kvm]
kvm_arch_vcpu_ioctl_run+0x407/0x660 [kvm]
kvm_vcpu_ioctl+0x3ad/0x690 [kvm]
do_vfs_ioctl+0xa2/0x690
ksys_ioctl+0x6d/0x80
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x74/0x720
entry_SYSCALL_64_after_hwframe+0x49/0xbe
When CR4.UMIP is set, guest should have UMIP cpuid flag. Current
kvm set_sregs function doesn't have such check when userspace inputs
sregs values. SECONDARY_EXEC_DESC is enabled on writes to CR4.UMIP
in vmx_set_cr4 though guest doesn't have UMIP cpuid flag. The testcast
triggers handle_desc warning when executing ltr instruction since
guest architectural CR4 doesn't set UMIP. This patch fixes it by
adding valid CR4 and CPUID combination checking in __set_sregs.
syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=138efb99600000
Reported-by: syzbot+0f1819555fbdce992df9@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly check kvm_rebooting in kvm_spurious_fault() prior to invoking
BUG(), as opposed to assuming the caller has already done so. Letting
kvm_spurious_fault() be called "directly" will allow VMX to better
optimize its low level assembly flows.
As a happy side effect, kvm_spurious_fault() no longer needs to be
marked as a dead end since it doesn't unconditionally BUG().
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that the fast invalidate mechanism has been reintroduced, restore
the performance tweaks for fast invalidation that existed prior to its
removal.
Paraphrasing the original changelog (commit 5ff0568374 was itself a
partial revert):
Don't force reloading the remote mmu when zapping an obsolete page, as
a MMU_RELOAD request has already been issued by kvm_mmu_zap_all_fast()
immediately after incrementing mmu_valid_gen, i.e. after marking pages
obsolete.
This reverts commit 5ff0568374.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
UMWAIT and TPAUSE instructions use 32bit IA32_UMWAIT_CONTROL at MSR index
E1H to determines the maximum time in TSC-quanta that the processor can
reside in either C0.1 or C0.2.
This patch emulates MSR IA32_UMWAIT_CONTROL in guest and differentiate
IA32_UMWAIT_CONTROL between host and guest. The variable
mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value,
so this patch uses it to avoid frequently rdmsr of IA32_UMWAIT_CONTROL.
Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX's EPT misconfig flow to handle fast-MMIO path falls back to decoding
the instruction to determine the instruction length when running as a
guest (Hyper-V doesn't fill VMCS.VM_EXIT_INSTRUCTION_LEN because it's
technically not defined for EPT misconfigs). Rather than implement the
slow skip in VMX's generic skip_emulated_instruction(),
handle_ept_misconfig() directly calls kvm_emulate_instruction() with
EMULTYPE_SKIP, which intentionally doesn't do single-step detection, and
so handle_ept_misconfig() misses a single-step #DB.
Rework the EPT misconfig fallback case to route it through
kvm_skip_emulated_instruction() so that single-step #DBs and interrupt
shadow updates are handled automatically. I.e. make VMX's slow skip
logic match SVM's and have the SVM flow not intentionally avoid the
shadow update.
Alternatively, the handle_ept_misconfig() could manually handle single-
step detection, but that results in EMULTYPE_SKIP having split logic for
the interrupt shadow vs. single-step #DBs, and split emulator logic is
largely what led to this mess in the first place.
Modifying SVM to mirror VMX flow isn't really an option as SVM's case
isn't limited to a specific exit reason, i.e. handling the slow skip in
skip_emulated_instruction() is mandatory for all intents and purposes.
Drop VMX's skip_emulated_instruction() wrapper since it can now fail,
and instead WARN if it fails unexpectedly, e.g. if exit_reason somehow
becomes corrupted.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: d391f12070 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Deferring emulation failure handling (in some cases) to the caller of
x86_emulate_instruction() has proven fragile, e.g. multiple instances of
KVM not setting run->exit_reason on EMULATE_FAIL, largely due to it
being difficult to discern what emulation types can return what result,
and which combination of types and results are handled where.
Now that x86_emulate_instruction() always handles emulation failure,
i.e. EMULATION_FAIL is only referenced in callers, remove the
emulation_result enums entirely. Per KVM's existing exit handling
conventions, return '0' and '1' for "exit to userspace" and "resume
guest" respectively. Doing so cleans up many callers, e.g. they can
return kvm_emulate_instruction() directly instead of having to interpret
its result.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Request triple fault in kvm_inject_realmode_interrupt() instead of
returning EMULATE_FAIL and deferring to the caller. All existing
callers request triple fault and it's highly unlikely Real Mode is
going to acquire new features. While this consolidates a small amount
of code, the real goal is to remove the last reference to EMULATE_FAIL.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Consolidate the reporting of emulation failure into kvm_task_switch()
so that it can return EMULATE_USER_EXIT. This helps pave the way for
removing EMULATE_FAIL altogether.
This also fixes a theoretical bug where task switch interception could
suppress an EMULATE_USER_EXIT return.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Kill a few birds with one stone by forcing an exit to userspace on skip
emulation failure. This removes a reference to EMULATE_FAIL, fixes a
bug in handle_ept_misconfig() where it would exit to userspace without
setting run->exit_reason, and fixes a theoretical bug in SVM's
task_switch_interception() where it would overwrite run->exit_reason on
a return of EMULATE_USER_EXIT.
Note, this technically doesn't fully fix task_switch_interception()
as it now incorrectly handles EMULATE_FAIL, but in practice there is no
bug as EMULATE_FAIL will never be returned for EMULTYPE_SKIP.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Immediately inject a #UD and return EMULATE done if emulation fails when
handling an intercepted #UD. This helps pave the way for removing
EMULATE_FAIL altogether.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an explicit emulation type for forced #UD emulation and use it to
detect that KVM should unconditionally inject a #UD instead of falling
into its standard emulation failure handling.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Immediately inject a #GP when VMware emulation fails and return
EMULATE_DONE instead of propagating EMULATE_FAIL up the stack. This
helps pave the way for removing EMULATE_FAIL altogether.
Rename EMULTYPE_VMWARE to EMULTYPE_VMWARE_GP to document that the x86
emulator is called to handle VMware #GP interception, e.g. why a #GP
is injected on emulation failure for EMULTYPE_VMWARE_GP.
Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type. The "no #UD on fail"
is used only in the VMWare case and is obsoleted by having the emulator
itself reinject #GP.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return the single-step emulation result directly instead of via an out
param. Presumably at some point in the past kvm_vcpu_do_singlestep()
could be called with *r==EMULATE_USER_EXIT, but that is no longer the
case, i.e. all callers are happy to overwrite their own return variable.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When handling emulation failure, return the emulation result directly
instead of capturing it in a local variable. Future patches will move
additional cases into handle_emulation_failure(), clean up the cruft
before so there isn't an ugly mix of setting a local variable and
returning directly.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the stat.mmio_exits update into x86_emulate_instruction(). This is
both a bug fix, e.g. the current update flows will incorrectly increment
mmio_exits on emulation failure, and a preparatory change to set the
stage for eliminating EMULATE_DONE and company.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zapping collapsible sptes, a.k.a. 4k sptes that can be promoted into a
large page, is only necessary when changing only the dirty logging flag
of a memory region. If the memslot is also being moved, then all sptes
for the memslot are zapped when it is invalidated. When a memslot is
being created, it is impossible for there to be existing dirty mappings,
e.g. KVM can have MMIO sptes, but not present, and thus dirty, sptes.
Note, the comment and logic are shamelessly borrowed from MIPS's version
of kvm_arch_commit_memory_region().
Fixes: 3ea3b7fa9a ("kvm: mmu: lazy collapse small sptes into large sptes")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It was discovered that after commit 65efa61dc0 ("selftests: kvm: provide
common function to enable eVMCS") hyperv_cpuid selftest is failing on AMD.
The reason is that the commit changed _vcpu_ioctl() to vcpu_ioctl() in the
test and this one can't fail.
Instead of fixing the test is seems to make more sense to not announce
KVM_CAP_HYPERV_ENLIGHTENED_VMCS support if it is definitely missing
(on svm and in case kvm_intel.nested=0).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V direct tlb flush function should be enabled for
guest that only uses Hyper-V hypercall. User space
hypervisor(e.g, Qemu) can disable KVM identification in
CPUID and just exposes Hyper-V identification to make
sure the precondition. Add new KVM capability KVM_CAP_
HYPERV_DIRECT_TLBFLUSH for user space to enable Hyper-V
direct tlb function and this function is default to be
disabled in KVM.
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These MSRs should be enumerated by KVM_GET_MSR_INDEX_LIST, so that
userspace knows that these MSRs may be part of the vCPU state.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Eric Hankland <ehankland@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* ARM: ITS translation cache; support for 512 vCPUs, various cleanups
and bugfixes
* PPC: various minor fixes and preparation
* x86: bugfixes all over the place (posted interrupts, SVM, emulation
corner cases, blocked INIT), some IPI optimizations
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJdf7fdAAoJEL/70l94x66DJzkIAKDcuWXJB4Qtoto6yUvPiHZm
LYkY/Dn1zulb/DhzrBoXFey/jZXwl9kxMYkVTefnrAl0fRwFGX+G1UYnQrtAL6Gr
ifdTYdy3kZhXCnnp99QAantWDswJHo1THwbmHrlmkxS4MdisEaTHwgjaHrDRZ4/d
FAEwW2isSonP3YJfTtsKFFjL9k2D4iMnwZ/R2B7UOaWvgnerZ1GLmOkilvnzGGEV
IQ89IIkWlkKd4SKgq8RkDKlfW5JrLrSdTK2Uf0DvAxV+J0EFkEaR+WlLsqumra0z
Eg3KwNScfQj0DyT0TzurcOxObcQPoMNSFYXLRbUu1+i0CGgm90XpF1IosiuihgU=
=w6I3
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"s390:
- ioctl hardening
- selftests
ARM:
- ITS translation cache
- support for 512 vCPUs
- various cleanups and bugfixes
PPC:
- various minor fixes and preparation
x86:
- bugfixes all over the place (posted interrupts, SVM, emulation
corner cases, blocked INIT)
- some IPI optimizations"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (75 commits)
KVM: X86: Use IPI shorthands in kvm guest when support
KVM: x86: Fix INIT signal handling in various CPU states
KVM: VMX: Introduce exit reason for receiving INIT signal on guest-mode
KVM: VMX: Stop the preemption timer during vCPU reset
KVM: LAPIC: Micro optimize IPI latency
kvm: Nested KVM MMUs need PAE root too
KVM: x86: set ctxt->have_exception in x86_decode_insn()
KVM: x86: always stop emulation on page fault
KVM: nVMX: trace nested VM-Enter failures detected by H/W
KVM: nVMX: add tracepoint for failed nested VM-Enter
x86: KVM: svm: Fix a check in nested_svm_vmrun()
KVM: x86: Return to userspace with internal error on unexpected exit reason
KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM code
KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers
doc: kvm: Fix return description of KVM_SET_MSRS
KVM: X86: Tune PLE Window tracepoint
KVM: VMX: Change ple_window type to unsigned int
KVM: X86: Remove tailing newline for tracepoints
KVM: X86: Trace vcpu_id for vmexit
KVM: x86: Manually calculate reserved bits when loading PDPTRS
...
Emulation of VMPTRST can incorrectly inject a page fault
when passed an operand that points to an MMIO address.
The page fault will use uninitialized kernel stack memory
as the CR2 and error code.
The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR
exit to userspace; however, it is not an easy fix, so for now just ensure
that the error code and CR2 are zero.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Cc: stable@vger.kernel.org
[add comment]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86_emulate_instruction() takes into account ctxt->have_exception flag
during instruction decoding, but in practice this flag is never set in
x86_decode_insn().
Fixes: 6ea6e84309 ("KVM: x86: inject exceptions produced by x86_decode_insn")
Cc: stable@vger.kernel.org
Cc: Denis Lunev <den@virtuozzo.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
inject_emulated_exception() returns true if and only if nested page
fault happens. However, page fault can come from guest page tables
walk, either nested or not nested. In both cases we should stop an
attempt to read under RIP and give guest to step over its own page
fault handler.
This is also visible when an emulated instruction causes a #GP fault
and the VMware backdoor is enabled. To handle the VMware backdoor,
KVM intercepts #GP faults; with only the next patch applied,
x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL
instead of EMULATE_DONE. EMULATE_FAIL causes handle_exception_nmi()
(or gp_interception() for SVM) to re-inject the original #GP because it
thinks emulation failed due to a non-VMware opcode. This patch prevents
the issue as x86_emulate_instruction() will return EMULATE_DONE after
injecting the #GP.
Fixes: 6ea6e84309 ("KVM: x86: inject exceptions produced by x86_decode_insn")
Cc: stable@vger.kernel.org
Cc: Denis Lunev <den@virtuozzo.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Debugging a failed VM-Enter is often like searching for a needle in a
haystack, e.g. there are over 80 consistency checks that funnel into
the "invalid control field" error code. One way to expedite debug is
to run the buggy code as an L1 guest under KVM (and pray that the
failing check is detected by KVM). However, extracting useful debug
information out of L0 KVM requires attaching a debugger to KVM and/or
modifying the source, e.g. to log which check is failing.
Make life a little less painful for VMM developers and add a tracepoint
for failed VM-Enter consistency checks. Ideally the tracepoint would
capture both what check failed and precisely why it failed, but logging
why a checked failed is difficult to do in a generic tracepoint without
resorting to invasive techniques, e.g. generating a custom string on
failure. That being said, for the vast majority of VM-Enter failures
the most difficult step is figuring out exactly what to look at, e.g.
figuring out which bit was incorrectly set in a control field is usually
not too painful once the guilty field as been identified.
To reach a happy medium between precision and ease of use, simply log
the code that detected a failed check, using a macro to execute the
check and log the trace event on failure. This approach enables tracing
arbitrary code, e.g. it's not limited to function calls or specific
formats of checks, and the changes to the existing code are minimally
invasive. A macro with a two-character name is desirable as usage of
the macro doesn't result in overly long lines or confusing alignment,
while still retaining some amount of readability. I.e. a one-character
name is a little too terse, and a three-character name results in the
contents being passed to the macro aligning with an indented line when
the macro is used an in if-statement, e.g.:
if (VCC(nested_vmx_check_long_line_one(...) &&
nested_vmx_check_long_line_two(...)))
return -EINVAL;
And that is the story of how the CC(), a.k.a. Consistency Check, macro
got its name.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move RDMSR and WRMSR emulation into common x86 code to consolidate
nearly identical SVM and VMX code.
Note, consolidating RDMSR introduces an extra indirect call, i.e.
retpoline, due to reaching {svm,vmx}_get_msr() via kvm_x86_ops, but a
guest kernel likely has bigger problems if increasing the latency of
RDMSR VM-Exits by ~70 cycles has a measurable impact on overall VM
performance. E.g. the only recurring RDMSR VM-Exits (after booting) on
my system running Linux 5.2 in the guest are for MSR_IA32_TSC_ADJUST via
arch_cpu_idle_enter().
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor the top-level MSR accessors to take/return the index and value
directly instead of requiring the caller to dump them into a msr_data
struct.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The PLE window tracepoint triggers even if the window is not changed,
and the wording can be a bit confusing too. One example line:
kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096)
It easily let people think of "the window now is 4096 which is
shrinked", but the truth is the value actually didn't change (4096).
Let's only dump this message if the value really changed, and we make
the message even simpler like:
kvm_ple_window: vcpu 4 old 4096 new 8192 (growed)
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Manually generate the PDPTR reserved bit mask when explicitly loading
PDPTRs. The reserved bits that are being tracked by the MMU reflect the
current paging mode, which is unlikely to be PAE paging in the vast
majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation,
__set_sregs(), etc... This can cause KVM to incorrectly signal a bad
PDPTR, or more likely, miss a reserved bit check and subsequently fail
a VM-Enter due to a bad VMCS.GUEST_PDPTR.
Add a one off helper to generate the reserved bits instead of sharing
code across the MMU's calculations and the PDPTR emulation. The PDPTR
reserved bits are basically set in stone, and pushing a helper into
the MMU's calculation adds unnecessary complexity without improving
readability.
Oppurtunistically fix/update the comment for load_pdptrs().
Note, the buggy commit also introduced a deliberate functional change,
"Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was
effectively (and correctly) reverted by commit cd9ae5fe47 ("KVM: x86:
Fix page-tables reserved bits"). A bit of SDM archaeology shows that
the SDM from late 2008 had a bug (likely a copy+paste error) where it
listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved
for 2mb entries. I.e. the SDM contradicted itself, and bits 6:5 are and
always have been reserved.
Fixes: 20c466b561 ("KVM: Use rsvd_bits_mask in load_pdptrs()")
Cc: stable@vger.kernel.org
Cc: Nadav Amit <nadav.amit@gmail.com>
Reported-by: Doug Reiland <doug.reiland@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't advance RIP or inject a single-step #DB if emulation signals a
fault. This logic applies to all state updates that are conditional on
clean retirement of the emulation instruction, e.g. updating RFLAGS was
previously handled by commit 38827dbd3f ("KVM: x86: Do not update
EFLAGS on faulting emulation").
Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with
ctxt->_eip until emulation "retires" anyways. Skipping #DB injection
fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to
invalid state with EFLAGS.TF=1 would loop indefinitely due to emulation
overwriting the #UD with #DB and thus restarting the bad SYSCALL over
and over.
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: 663f4c61b8 ("KVM: x86: handle singlestep during emulation")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Remove a few stale checks for non-NULL ops now that the ops in question
are implemented by both VMX and SVM.
Note, this is **not** stable material, the Fixes tags are there purely
to show when a particular op was first supported by both VMX and SVM.
Fixes: 74f169090b ("kvm/svm: Setup MCG_CAP on AMD properly")
Fixes: b31c114b82 ("KVM: X86: Provide a capability to disable PAUSE intercepts")
Fixes: 411b44ba80 ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "access" to "mmio_access" to match the other MMIO cache members
and to make it more obvious that it's tracking the access permissions
for the MMIO cache.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To avoid hardcoding xsetbv length to '3' we need to support decoding it in
the emulator.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When doing x86_emulate_instruction(EMULTYPE_SKIP) interrupt shadow has to
be cleared if and only if the skipping is successful.
There are two immediate issues:
- In SVM skip_emulated_instruction() we are not zapping interrupt shadow
in case kvm_emulate_instruction(EMULTYPE_SKIP) is used to advance RIP
(!nrpip_save).
- In VMX handle_ept_misconfig() when running as a nested hypervisor we
(static_cpu_has(X86_FEATURE_HYPERVISOR) case) forget to clear interrupt
shadow.
Note that we intentionally don't handle the case when the skipped
instruction is supposed to prolong the interrupt shadow ("MOV/POP SS") as
skip-emulation of those instructions should not happen under normal
circumstances.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory,
fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP).
Currently, we only do printk(KERN_DEBUG) when this happens and this
is not ideal. Propagate the error up the stack.
On VMX, skip_emulated_instruction() doesn't fail, we have two call
sites calling it explicitly: handle_exception_nmi() and
handle_task_switch(), we can just ignore the result.
On SVM, we also have two explicit call sites:
svm_queue_exception() and it seems we don't need to do anything there as
we check if RIP was advanced or not. In task_switch_interception(),
however, we are better off not proceeding to kvm_task_switch() in case
skip_emulated_instruction() failed.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to AMD bits, set the Intel bits from the vendor-independent
feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care
about the vendor and they should be set on AMD processors as well.
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After commit d73eb57b80 (KVM: Boost vCPUs that are delivering interrupts), a
five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
in the VMs after stress testing:
INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
Call Trace:
flush_tlb_mm_range+0x68/0x140
tlb_flush_mmu.part.75+0x37/0xe0
tlb_finish_mmu+0x55/0x60
zap_page_range+0x142/0x190
SyS_madvise+0x3cd/0x9c0
system_call_fastpath+0x1c/0x21
swait_active() sustains to be true before finish_swait() is called in
kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
by kvm_vcpu_on_spin() loop greatly increases the probability condition
kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
VMCS.
This patch fixes it by checking conservatively a subset of events.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 11752adb (locking/pvqspinlock: Implement hybrid PV queued/unfair locks)
introduces hybrid PV queued/unfair locks
- queued mode (no starvation)
- unfair mode (good performance on not heavily contended lock)
The lock waiter goes into the unfair mode especially in VMs with over-commit
vCPUs since increaing over-commitment increase the likehood that the queue
head vCPU may have been preempted and not actively spinning.
However, reschedule queue head vCPU timely to acquire the lock still can get
better performance than just depending on lock stealing in over-subscribe
scenario.
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 23520 25040 6%
2VM 8000 13600 70%
3VM 3100 5400 74%
The lock holder vCPU yields to the queue head vCPU when unlock, to boost queue
head vCPU which is involuntary preemption or the one which is voluntary halt
due to fail to acquire the lock after a short spin in the guest.
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After reverting commit 240c35a378 (kvm: x86: Use task structs fpu field
for user), struct kvm_vcpu is 19456 bytes on my server, PAGE_ALLOC_COSTLY_ORDER(3)
is the order at which allocations are deemed costly to service. In serveless
scenario, one host can service hundreds/thoudands firecracker/kata-container
instances, howerver, new instance will fail to launch after memory is too
fragmented to allocate kvm_vcpu struct on host, this was observed in some
cloud provider product environments.
This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my
Skylake server.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The idea before commit 240c35a37 (which has just been reverted)
was that we have the following FPU states:
userspace (QEMU) guest
---------------------------------------------------------------------------
processor vcpu->arch.guest_fpu
>>> KVM_RUN: kvm_load_guest_fpu
vcpu->arch.user_fpu processor
>>> preempt out
vcpu->arch.user_fpu current->thread.fpu
>>> preempt in
vcpu->arch.user_fpu processor
>>> back to userspace
>>> kvm_put_guest_fpu
processor vcpu->arch.guest_fpu
---------------------------------------------------------------------------
With the new lazy model we want to get the state back to the processor
when schedule in from current->thread.fpu.
Reported-by: Thomas Lambertz <mail@thomaslambertz.de>
Reported-by: anthony <antdev66@gmail.com>
Tested-by: anthony <antdev66@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Lambertz <mail@thomaslambertz.de>
Cc: anthony <antdev66@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 5f409e20b (x86/fpu: Defer FPU state load until return to userspace)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Add a comment in front of the warning. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 240c35a378
("kvm: x86: Use task structs fpu field for user", 2018-11-06).
The commit is broken and causes QEMU's FPU state to be destroyed
when KVM_RUN is preempted.
Fixes: 240c35a378 ("kvm: x86: Use task structs fpu field for user")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside. There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits. This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.
In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.
The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially. Without patch
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 42916 49.43% 39.30% 0.47us 106.09us 0.71us ( +- 1.09% )
While with patch:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 6871 9.29% 2.96% 0.44us 57.88us 0.72us ( +- 4.02% )
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 61abdbe0bc ("kvm: x86: make lapic hrtimer pinned") pinned the
lapic timer to avoid to wait until the next kvm exit for the guest to
see KVM_REQ_PENDING_TIMER set. There is another solution to give a kick
after setting the KVM_REQ_PENDING_TIMER bit, make lapic timer unpinned
will be used in follow up patches.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are some pr_debug in TSC code, which may have
been no use, so remove them as Paolo suggested.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>