KVM does not handle the case where EIP wraps around the 32-bit address
space (that is, outside long mode). Handling the wraparound is needed both
in vmx.c and in emulate.c. SVM with NRIPS is okay, but it can still print
an error to dmesg due to integer overflow.
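As a rough sketch of the required handling (the helper name and call site
here are hypothetical; the real changes go into the RIP-advance paths of
vmx.c and emulate.c):

  static void advance_rip(struct kvm_vcpu *vcpu, unsigned int insn_len)
  {
          unsigned long rip = kvm_rip_read(vcpu) + insn_len;

          /* Outside long mode, EIP wraps around the 32-bit address space. */
          if (!is_64_bit_mode(vcpu))
                  rip = (u32)rip;

          kvm_rip_write(vcpu, rip);
  }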
Reported-by: Nick Peterson <everdox@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an argument to interrupt_allowed and nmi_allowed, to check whether
interrupt injection is blocked. Use the hook to handle the case where
an interrupt arrives between check_nested_events() and the injection
logic. Drop the retry of check_nested_events() that hack-a-fixed the
same condition.
Blocking injection is also a bit of a hack, e.g. KVM should do exiting
and non-exiting interrupt processing in a single pass, but it's a more
precise hack. The old comment is also misleading, e.g. KVM_REQ_EVENT is
purely an optimization, setting it on every run loop (which KVM doesn't
do) should not affect functionality, only performance.
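As a sketch of the resulting hook, using the VMX flavor as an example (the
parameter name and the vmx_interrupt_blocked() helper are assumptions based
on the description above):

  static bool vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
  {
          /*
           * An IRQ that arrived after check_nested_events() must not be
           * injected into L2 if it is supposed to cause a VM-Exit to L1,
           * so report it as blocked for injection purposes only.
           */
          if (for_injection && is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
                  return false;

          return !vmx_interrupt_blocked(vcpu);
  }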
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-13-sean.j.christopherson@intel.com>
[Extend to SVM, add SMI and NMI. Even though NMI and SMI cannot come
asynchronously right now, making the fix generic is easy and removes a
special case. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Report interrupts as allowed when the vCPU is in L2 and L2 is being run with
exit-on-interrupts enabled and EFLAGS.IF=1 (either on the host or on the guest
according to VINTR). Interrupts are always unblocked from L1's perspective
in this case.
While moving nested_exit_on_intr to svm.h, use INTERCEPT_INTR properly instead
of assuming it's zero (which it is of course).
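A sketch of the helper after the move, assuming the nested state still keeps
the intercepts in a u64 bitmap:

  static inline bool nested_exit_on_intr(struct vcpu_svm *svm)
  {
          return svm->nested.intercept & (1ULL << INTERCEPT_INTR);
  }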
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the architectural (non-KVM specific) interrupt/NMI/SMI blocking checks
to a separate helper so that they can be used in a future patch by
svm_check_nested_events().
No functional change intended.
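A minimal sketch of the split for the NMI case (the architectural checks
shown are not exhaustive; the wrapper keeps any KVM-specific conditions):

  bool svm_nmi_blocked(struct kvm_vcpu *vcpu)
  {
          struct vcpu_svm *svm = to_svm(vcpu);

          /* Architectural conditions only: GIF and NMI-blocked-until-IRET. */
          if (!gif_set(svm))
                  return true;

          return !!(vcpu->arch.hflags & HF_NMI_MASK);
  }

  static bool svm_nmi_allowed(struct kvm_vcpu *vcpu)
  {
          /* KVM-specific checks stay in the existing hook. */
          return !svm_nmi_blocked(vcpu);
  }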
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unlike VMX, SVM allows a hypervisor to take an SMI vmexit without having
any special SMM-monitor enablement sequence. Therefore, it has to be
handled like interrupts and NMIs. Check for an unblocked SMI in
svm_check_nested_events() so that pending SMIs are correctly prioritized
over IRQs and NMIs when the latter events will trigger VM-Exit.
Note that there is no need to test explicitly for SMI vmexits, because
guests always run outside SMM and therefore can never get an SMI while
SMIs are blocked.
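A hedged sketch of the new check, placed ahead of the NMI and IRQ checks
inside svm_check_nested_events(); nested_exit_on_smi() and the vmexit helper
are assumed names, and block_nested_events reflects a pending nested VMRUN or
event reinjection:

  if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
          if (block_nested_events)
                  return -EBUSY;
          if (!nested_exit_on_smi(svm))
                  return 0;
          /* Synthesize an SVM_EXIT_SMI vmexit to L1. */
          nested_svm_vmexit_smi(svm);
          return 0;
  }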
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Report NMIs as allowed when the vCPU is in L2 and L2 is being run with
Exit-on-NMI enabled, as NMIs are always unblocked from L1's perspective
in this case.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not hardcode is_smm so that all the architectural conditions for
blocking SMIs are listed in a single place. Well, in two places because
this introduces some code duplication between Intel and AMD.
This ensures that nested SVM obeys GIF in kvm_vcpu_has_events.
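On the SVM side this roughly amounts to the following sketch (comments
paraphrase the APM):

  static bool svm_smi_allowed(struct kvm_vcpu *vcpu)
  {
          struct vcpu_svm *svm = to_svm(vcpu);

          /* Per APM vol. 2, SMIs are held pending while GIF=0. */
          if (!gif_set(svm))
                  return false;

          /* SMIs are also blocked, architecturally, while already in SMM. */
          return !is_smm(vcpu);
  }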
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return an actual bool for kvm_x86_ops' {interrupt,nmi}_allowed() hooks to
better reflect the return semantics, and to avoid creating an even
bigger mess when the related VMX code is refactored in upcoming patches.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Migrate nested guest NMI intercept processing to the new check_nested_events.
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20200414201107.22952-2-cavery@redhat.com>
[Reorder clauses as NMIs have higher priority than IRQs; inject
immediate vmexit as is now done for IRQ vmexits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We want to inject vmexits immediately from svm_check_nested_events,
so that the interrupt/NMI window requests happen in inject_pending_event
right after it returns.
This however has the same issue as in vmx_check_nested_events, so
introduce a nested_run_pending flag with the exact same purpose
of delaying vmexit injection until after the vmentry has completed.
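A sketch of the resulting shape of svm_check_nested_events() (helper names
approximate):

  static int svm_check_nested_events(struct kvm_vcpu *vcpu)
  {
          struct vcpu_svm *svm = to_svm(vcpu);
          bool block_nested_events = kvm_event_needs_reinjection(vcpu) ||
                                     svm->nested.nested_run_pending;

          if (kvm_cpu_has_interrupt(vcpu) && nested_exit_on_intr(svm)) {
                  if (block_nested_events)
                          return -EBUSY;
                  /* Inject the INTR vmexit to L1 right away. */
                  nested_svm_intr(svm);
                  return 0;
          }

          return 0;
  }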
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the
different handling of DR6 on intercepted #DB exceptions on Intel and AMD.
On Intel, #DB exceptions transmit the DR6 value via the exit qualification
field of the VMCS, and the exit qualification only contains the description
of the precise event that caused a vmexit.
On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception
was to be injected into the guest. This has two effects when guest debugging
is in use:
* the guest DR6 is clobbered
* the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather
than just the last one that happened (the testcase in the next patch covers
this issue).
This patch fixes both issues by emulating, so to speak, the Intel behavior
on AMD processors. The important observation is that (after the previous
patches) the VMCB value of DR6 is only ever observable from the guest if
KVM_DEBUGREG_WONT_EXIT is set. Therefore we can actually set vmcb->save.dr6
to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it
will be if guest debugging is enabled.
Therefore it is possible to enter the guest with an all-zero DR6,
reconstruct the #DB payload from the DR6 we get at exit time, and let
kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6.
Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT
is set, but this is harmless.
This may not be the most optimized way to deal with this, but it is
simple and, being confined within SVM code, it gets rid of the set_dr6
callback and kvm_update_dr6.
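The core of the idea, as a fragment of the #DB intercept path when guest
debugging is not in use (constants and placement are approximate):

  /*
   * vmcb->save.dr6 held the "no event" value at VM entry, so any bits
   * found set at exit time describe only the #DB that just happened.
   */
  u32 payload = svm->vmcb->save.dr6 & ~DR6_FIXED_1;

  kvm_queue_exception_p(&svm->vcpu, DB_VECTOR, payload);
  /* kvm_deliver_exception_payload() folds the new bits into vcpu->arch.dr6. */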
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the
second argument. Ensure that the VMCB value is synchronized to
vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so
that the current value of DR6 is always available in vcpu->arch.dr6.
The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The corresponding code was added for VMX in commit 42dbaa5a05
("KVM: x86: Virtualize debug registers", 2008-12-15) but never for AMD.
Fix this.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clean up some of the patching of kvm_x86_ops by moving the ops related to
nested virtualization into a separate struct.
As a result, these ops will always be non-NULL on VMX. This is not a problem:
* check_nested_events is only called if is_guest_mode(vcpu) returns true
* get_nested_state treats VMXOFF state the same as nested being disabled
* set_nested_state fails if you attempt to set nested state while
nesting is disabled
* nested_enable_evmcs could already be called on a CPU without VMX enabled
in CPUID.
* nested_get_evmcs_version was fixed in the previous patch
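A sketch of the new container (the exact member list may differ):

  struct kvm_x86_nested_ops {
          int (*check_events)(struct kvm_vcpu *vcpu);
          int (*get_state)(struct kvm_vcpu *vcpu,
                           struct kvm_nested_state __user *user_kvm_nested_state,
                           unsigned int user_data_size);
          int (*set_state)(struct kvm_vcpu *vcpu,
                           struct kvm_nested_state __user *user_kvm_nested_state,
                           struct kvm_nested_state *kvm_state);
          int (*enable_evmcs)(struct kvm_vcpu *vcpu, uint16_t *vmcs_version);
          uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu);
  };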
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a nested page fault is taken from an address that does not have
a memslot associated to it, kvm_mmu_do_page_fault returns RET_PF_EMULATE
(via mmu_set_spte) and kvm_mmu_page_fault then invokes svm_need_emulation_on_page_fault.
The default answer there is to return false, but in this case this just
causes the page fault to be retried ad libitum. Since this is not a
fast path, and the only other case where it is taken is an erratum,
just stick a kvm_vcpu_gfn_to_memslot check in there to detect the
common case where the erratum is not happening.
This fixes an infinite loop in the new set_memory_region_test.
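The check boils down to something like this sketch near the top of
svm_need_emulation_on_page_fault(); how the faulting address (fault_gpa here)
is obtained is an assumption of the sketch:

  /*
   * No memslot means this cannot be the erratum; let emulation proceed so
   * the fault is reported instead of being retried forever.
   */
  if (!kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(fault_gpa)))
          return true;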
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In cloud environments, IPIs and the APIC timer are observed to be the main
sources of MSR-write vmexits; optimize virtual IPI latency more aggressively
by injecting the target IPI as soon as possible.
Running the kvm-unit-tests/vmexit.flat IPI test on an SKX server, with the
adaptive advance lapic timer and adaptive halt-polling disabled to avoid
interference, this patch gives another 7% improvement:
  w/o fastpath   -> x86.c fastpath:  4238 -> 3543  16.4%
  x86.c fastpath -> vmx.c fastpath:  3543 -> 3293   7%
  w/o fastpath   -> vmx.c fastpath:  4238 -> 3293  22.3%
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200410174703.1138-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use do_machine_check instead of a raw "int $0x12" to pass the MCE to the host,
the same approach VMX uses.
On a related note, there is no reason to limit the use of do_machine_check
to 64-bit targets, as is currently done for VMX. MCE handling works
for both target families.
The patch is only compile tested, for both 64-bit and 32-bit targets;
someone should test the passing of the exception by injecting some MCEs
into the guest.
For a future non-RFC patch, kvm_machine_check should be moved to an
appropriate header file.
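A sketch of the wrapper, modeled on the VMX one but without the 64-bit-only
guard:

  static void kvm_machine_check(void)
  {
  #if defined(CONFIG_X86_MCE)
          struct pt_regs regs = {
                  .cs    = 3,             /* fake ring 3, as VMX does */
                  .flags = X86_EFLAGS_IF,
          };

          do_machine_check(&regs, 0);
  #endif
  }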
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20200411153627.3474710-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add KVM_REQ_TLB_FLUSH_CURRENT to allow optimized TLB flushing of VMX's
EPTP/VPID contexts[*] from the KVM MMU and/or in a deferred manner, e.g.
to flush L2's context during nested VM-Enter.
Convert KVM_REQ_TLB_FLUSH to KVM_REQ_TLB_FLUSH_CURRENT in flows where
the flush is directly associated with vCPU-scoped instruction emulation,
i.e. MOV CR3 and INVPCID.
Add a comment in vmx_vcpu_load_vmcs() above its KVM_REQ_TLB_FLUSH to
make it clear that it deliberately requests a flush of all contexts.
Service any pending flush request on nested VM-Exit as it's possible a
nested VM-Exit could occur after requesting a flush for L2. Add the
same logic for nested VM-Enter even though it's _extremely_ unlikely for a
flush to be pending on nested VM-Enter; it is theoretically possible
(in the future) due to RSM (SMM) emulation.
[*] Intel also has an Address Space Identifier (ASID) concept, e.g.
EPTP+VPID+PCID == ASID, it's just not documented in the SDM because
the rules of invalidation are different based on which piece of the
ASID is being changed, i.e. whether the EPTP, VPID, or PCID context
must be invalidated.
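In vcpu_enter_guest() the two requests are then serviced along these lines
(a sketch; the flush helpers are named after the corresponding hooks):

  if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
          kvm_vcpu_flush_tlb_all(vcpu);           /* all contexts/ASIDs */

  if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
          kvm_vcpu_flush_tlb_current(vcpu);       /* only the active EPTP/VPID context */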
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-25-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename ->tlb_flush() to ->tlb_flush_all() in preparation for adding a
new hook to flush only the current ASID/context.
Opportunistically replace the comment in vmx_flush_tlb() that explains
why it flushes all EPTP/VPID contexts with a comment explaining why it
unconditionally uses INVEPT when EPT is enabled. I.e. rely on the "all"
part of the name to clarify why it does global INVEPT/INVVPID.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-23-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a comment in svm_flush_tlb() to document why it flushes only the
current ASID, even when it is invoked when flushing remote TLBs.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-22-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use svm_flush_tlb() directly for kvm_x86_ops->tlb_flush_guest() now that
the @invalidate_gpa param to ->tlb_flush() is gone, i.e. the wrapper for
->tlb_flush_guest() is no longer necessary.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-18-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop @invalidate_gpa from ->tlb_flush() and kvm_vcpu_flush_tlb() now
that all callers pass %true for said param, or ignore the param (SVM has
an internal call to svm_flush_tlb() in svm_flush_tlb_guest that somewhat
arbitrarily passes %false).
Remove __vmx_flush_tlb() as it is no longer used.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-17-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a dedicated hook to handle flushing TLB entries on behalf of the
guest, i.e. for a paravirtualized TLB flush, and use it directly instead
of bouncing through kvm_vcpu_flush_tlb().
For VMX, change the effective implementation to never do
INVEPT and flush only the current context, i.e. to always flush via
INVVPID(SINGLE_CONTEXT). The INVEPT performed by __vmx_flush_tlb() when
@invalidate_gpa=false and enable_vpid=0 is unnecessary, as it will only
flush guest-physical mappings; linear and combined mappings are flushed
by VM-Enter when VPID is disabled, and changes in the guest page tables
do not affect guest-physical mappings.
When EPT and VPID are enabled, doing INVVPID is not required (by Intel's
architecture) to invalidate guest-physical mappings, i.e. TLB entries
that cache guest-physical mappings can live across INVVPID as the
mappings are associated with an EPTP, not a VPID. The intent of
@invalidate_gpa is to inform vmx_flush_tlb() that it must "invalidate
gpa mappings", i.e. do INVEPT and not simply INVVPID. Other than nested
VPID handling, which now calls vpid_sync_context() directly, the only
scenario where KVM can safely do INVVPID instead of INVEPT (when EPT is
enabled) is if KVM is flushing TLB entries from the guest's perspective,
i.e. is only required to invalidate linear mappings.
For SVM, flushing TLB entries from the guest's perspective can be done
by flushing the current ASID, as changes to the guest's page tables are
associated only with the current ASID.
Adding a dedicated ->tlb_flush_guest() paves the way toward removing
@invalidate_gpa, which is a potentially dangerous control flag as its
meaning is not exactly crystal clear, even for those who are familiar
with the subtleties of what mappings Intel CPUs are/aren't allowed to
keep across various invalidation scenarios.
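On the VMX side the new hook then reduces to roughly this sketch, while SVM
simply wires ->tlb_flush_guest() to flushing the current ASID:

  static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
  {
          /*
           * Only linear and combined mappings need to go: a single-context
           * INVVPID is sufficient, and guest-physical (EPT) mappings are
           * unaffected by guest page table changes, so no INVEPT is issued.
           */
          vpid_sync_context(to_vmx(vcpu)->vpid);
  }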
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-15-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function returns no value.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 199cd1d7b5 ("KVM: SVM: Split svm_vcpu_run inline assembly to separate file")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20200409114926.1407442-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm_vcpu_run does not change stack or frame pointer anymore.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20200414113612.104501-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Manipulate IF around vmload/vmsave to remove the confusing usage of
local_irq_enable where interrupts are actually disabled via GIF.
And stuff the RSB immediately without waiting for a RET to avoid
Spectre-v2 attacks.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The compiler (GCC) does not like the situation where an inline assembly
block clobbers all available machine registers in the middle of the
function. This situation can be found in function svm_vcpu_run in file
kvm/svm.c and results in many register spills and fills to/from the stack
frame.
This patch fixes the issue with the same approach as was done for
VMX some time ago. The big inline assembly is moved to a separate
assembly .S file, taking into account all ABI requirements.
There are two main benefits of the above approach:
* elimination of several register spills and fills to/from stack
frame, and consequently smaller function .text size. The binary size
of svm_vcpu_run is lowered from 2019 to 1626 bytes.
* more efficient access to the register save array. Currently, the
register save array is accessed as:
7b00: 48 8b 98 28 02 00 00 mov 0x228(%rax),%rbx
7b07: 48 8b 88 18 02 00 00 mov 0x218(%rax),%rcx
7b0e: 48 8b 90 20 02 00 00 mov 0x220(%rax),%rdx
whereas when passing a pointer to the register array as an argument to the function, one gets:
12: 48 8b 48 08 mov 0x8(%rax),%rcx
16: 48 8b 50 10 mov 0x10(%rax),%rdx
1a: 48 8b 58 18 mov 0x18(%rax),%rbx
As a result, the total size, considering that the new function size is 229
bytes, gets lowered by 164 bytes.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the SEV specific parts of svm.c into the new sev.c file.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Message-Id: <20200324094154.32352-5-joro@8bytes.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the AVIC related functions from svm.c to the new avic.c file.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Message-Id: <20200324094154.32352-4-joro@8bytes.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split out the code for the nested SVM implementation and move it to a
separate file.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Message-Id: <20200324094154.32352-3-joro@8bytes.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move svm.c and pmu_amd.c into their own arch/x86/kvm/svm/
subdirectory.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Message-Id: <20200324094154.32352-2-joro@8bytes.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>