linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-18 01:36:55 +07:00

Author	SHA1	Message	Date
Vitaly Kuznetsov	11e349143e	x86/kvm/nVMX: fix VMCLEAR when Enlightened VMCS is in use When Enlightened VMCS is in use, it is valid to do VMCLEAR and, according to TLFS, this should "transition an enlightened VMCS from the active to the non-active state". It is, however, wrong to assume that it is only valid to do VMCLEAR for the eVMCS which is currently active on the vCPU performing VMCLEAR. Currently, the logic in handle_vmclear() is broken: in case, there is no active eVMCS on the vCPU doing VMCLEAR we treat the argument as a 'normal' VMCS and kvm_vcpu_write_guest() to the 'launch_state' field irreversibly corrupts the memory area. So, in case the VMCLEAR argument is not the current active eVMCS on the vCPU, how can we know if the area it is pointing to is a normal or an enlightened VMCS? Thanks to the bug in Hyper-V (see commit `72aeb60c52` ("KVM: nVMX: Verify eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we can not, the revision can't be used to distinguish between them. So let's assume it is always enlightened in case enlightened vmentry is enabled in the assist page. Also, check if vmx->nested.enlightened_vmcs_enabled to minimize the impact for 'unenlightened' workloads. Fixes: `b8bbab928f` ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-07-02 18:56:00 +02:00
Vitaly Kuznetsov	a21a39c206	x86/KVM/nVMX: don't use clean fields data on enlightened VMLAUNCH Apparently, Windows doesn't maintain clean fields data after it does VMCLEAR for an enlightened VMCS so we can only use it on VMRESUME. The issue went unnoticed because currently we do nested_release_evmcs() in handle_vmclear() and the consecutive enlightened VMPTRLD invalidates clean fields when a new eVMCS is mapped but we're going to change the logic. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-07-02 18:56:00 +02:00
Paolo Bonzini	95c5c7c77c	KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST This allows userspace to know which MSRs are supported by the hypervisor. Unfortunately userspace must resort to tricks for everything except MSR_IA32_VMX_VMFUNC (which was just added in the previous patch). One possibility is to use the feature control MSR, which is tied to nested VMX as well and is present on all KVM versions that support feature MSRs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-07-02 17:36:18 +02:00
Paolo Bonzini	e8a70bd4e9	KVM: nVMX: allow setting the VMFUNC controls MSR Allow userspace to set a custom value for the VMFUNC controls MSR, as long as the capabilities it advertises do not exceed those of the host. Fixes: `27c42a1bb` ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor", 2017-08-03) Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-07-02 17:36:12 +02:00
Paolo Bonzini	6defc59184	KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRS Some secondary controls are automatically enabled/disabled based on the CPUID values that are set for the guest. However, they are still available at a global level and therefore should be present when KVM_GET_MSRS is sent to /dev/kvm. Fixes: `1389309c81` ("KVM: nVMX: expose VMX capabilities for nested hypervisors to userspace", 2018-02-26) Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-07-02 17:35:57 +02:00
Saar Amar	a251fb90ab	KVM: x86: Fix apic dangling pointer in vcpu The function kvm_create_lapic() attempts to allocate the apic structure and sets a pointer to it in the virtual processor structure. However, if get_zeroed_page() failed, the function frees the apic chunk, but forgets to set the pointer in the vcpu to NULL. It's not a security issue since there isn't a use of that pointer if kvm_create_lapic() returns error, but it's more accurate that way. Signed-off-by: Saar Amar <saaramar@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-20 14:23:17 +02:00
Wanpeng Li	4d763b168e	KVM: VMX: check CPUID before allowing read/write of IA32_XSS Raise #GP when guest read/write IA32_XSS, but the CPUID bits say that it shouldn't exist. Fixes: `203000993d` (kvm: vmx: add MSR logic for XSAVES) Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Reported-by: Tao Xu <tao3.xu@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-20 14:21:51 +02:00
Paolo Bonzini	eceb9973d9	KVM: nVMX: shadow pin based execution controls The VMX_PREEMPTION_TIMER flag may be toggled frequently, though not very frequently. Since it does not affect KVM's dirty logic, e.g. the preemption timer value is loaded from vmcs12 even if vmcs12 is "clean", there is no need to mark vmcs12 dirty when L1 writes pin controls, and shadowing the field achieves that. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 17:10:50 +02:00
Sean Christopherson	804939ea20	KVM: VMX: Leave preemption timer running when it's disabled VMWRITEs to the major VMCS controls, pin controls included, are deceptively expensive. CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles. As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than doubles the latency of the next VM-Entry (and again when/if the flag is toggled back). In a non-nested scenario, running a "standard" guest with the preemption timer enabled, toggling the timer flag is uncommon but not rare, e.g. roughly 1 in 10 entries. Disabling the preemption timer can change these numbers due to its use for "immediate exits", even when explicitly disabled by userspace. Nested virtualization in particular is painful, as the timer flag is set for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's pin controls to clear the flag since its the timer's final state isn't known until vmx_vcpu_run(). I.e. the majority of nested VM-Enters end up unnecessarily writing pin controls twice. Rather than toggle the timer flag in pin controls, set the timer value itself to the largest allowed value to put it into a "soft disabled" state, and ignore any spurious preemption timer exits. Sadly, the timer is a 32-bit value and so theoretically it can fire before the head death of the universe, i.e. spurious exits are possible. But because KVM does not save the timer value on VM-Exit and because the timer runs at a slower rate than the TSC, the maximuma timer value is still sufficiently large for KVM's purposes. E.g. on a modern CPU with a timer that runs at 1/32 the frequency of a 2.4ghz constant-rate TSC, the timer will fire after ~55 seconds of uninterrupted guest execution. In other words, spurious VM-Exits are effectively only possible if the host is completely tickless on the logical CPU, the guest is not using the preemption timer, and the guest is not generating VM-Exits for any other reason. To be safe from bad/weird hardware, disable the preemption timer if its maximum delay is less than ten seconds. Ten seconds is mostly arbitrary and was selected in no small part because it's a nice round number. For simplicity and paranoia, fall back to __kvm_request_immediate_exit() if the preemption timer is disabled by KVM or userspace. Previously KVM continued to use the preemption timer to force immediate exits even when the timer was disabled by userspace. Now that KVM leaves the timer running instead of truly disabling it, allow userspace to kill it entirely in the unlikely event the timer (or KVM) malfunctions. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 17:10:46 +02:00
Sean Christopherson	9d99cc49a4	KVM: VMX: Drop hv_timer_armed from 'struct loaded_vmcs' ... now that it is fully redundant with the pin controls shadow. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:46 +02:00
Sean Christopherson	469debdb8b	KVM: nVMX: Preset *DT exiting in vmcs02 when emulating UMIP KVM dynamically toggles SECONDARY_EXEC_DESC to intercept (a subset of) instructions that are subject to User-Mode Instruction Prevention, i.e. VMCS.SECONDARY_EXEC_DESC == CR4.UMIP when emulating UMIP. Preset the VMCS control when preparing vmcs02 to avoid unnecessarily VMWRITEs, e.g. KVM will clear VMCS.SECONDARY_EXEC_DESC in prepare_vmcs02_early() and then set it in vmx_set_cr4(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:45 +02:00
Sean Christopherson	de0286b788	KVM: nVMX: Preserve last USE_MSR_BITMAPS when preparing vmcs02 KVM dynamically toggles the CPU_BASED_USE_MSR_BITMAPS execution control for nested guests based on whether or not both L0 and L1 want to pass through the same MSRs to L2. Preserve the last used value from vmcs02 so as to avoid multiple VMWRITEs to (re)set/(re)clear the bit on nested VM-Entry. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:45 +02:00
Sean Christopherson	3af80fec6e	KVM: VMX: Explicitly initialize controls shadow at VMCS allocation Or: Don't re-initialize vmcs02's controls on every nested VM-Entry. VMWRITEs to the major VMCS controls are deceptively expensive. Intel CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles. Zero out the controls' shadow copies during VMCS allocation and use the optimized setter when "initializing" controls. While this technically affects both non-nested and nested virtualization, nested virtualization is the primary beneficiary as avoid VMWRITEs when prepare vmcs02 allows hardware to optimizie away consistency checks. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:44 +02:00
Sean Christopherson	ae81d08993	KVM: nVMX: Don't reset VMCS controls shadow on VMCS switch ... now that the shadow copies are per-VMCS. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:44 +02:00
Sean Christopherson	09e226cf07	KVM: nVMX: Shadow VMCS controls on a per-VMCS basis ... to pave the way for not preserving the shadow copies across switches between vmcs01 and vmcs02, and eventually to avoid VMWRITEs to vmcs02 when the desired value is unchanged across nested VM-Enters. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:43 +02:00
Sean Christopherson	fe7f895dae	KVM: VMX: Shadow VMCS secondary execution controls Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:42 +02:00
Sean Christopherson	2183f5645a	KVM: VMX: Shadow VMCS primary execution controls Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02, and more importantly can eliminate costly VMWRITEs to controls when preparing vmcs02. Shadowing exec controls also saves a VMREAD when opening virtual INTR/NMI windows, yay... Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:42 +02:00
Sean Christopherson	c5f2c76643	KVM: VMX: Shadow VMCS pin controls Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Shadowing pin controls also allows a future patch to remove the per-VMCS 'hv_timer_armed' flag, as the shadow copy is a superset of said flag. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:41 +02:00
Sean Christopherson	70f932ecdf	KVM: VMX: Add builder macros for shadowing controls ... to pave the way for shadowing all (five) major VMCS control fields without massive amounts of error prone copy+paste+modify. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:40 +02:00
Sean Christopherson	c075c3e49d	KVM: nVMX: Use adjusted pin controls for vmcs02 KVM provides a module parameter to allow disabling virtual NMI support to simplify testing (hardware without virtual NMI support is hard to come by but it does have users). When preparing vmcs02, use the accessor for pin controls to ensure that the module param is respected for nested guests. Opportunistically swap the order of applying L0's and L1's pin controls to better align with other controls and to prepare for a future patche that will ignore L1's, but not L0's, preemption timer flag. Fixes: `d02fcf5077` ("kvm: vmx: Allow disabling virtual NMI support") Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:40 +02:00
Sean Christopherson	c7554efc83	KVM: nVMX: Copy PDPTRs to/from vmcs12 only when necessary Per Intel's SDM: ... the logical processor uses PAE paging if CR0.PG=1, CR4.PAE=1 and IA32_EFER.LME=0. A VM entry to a guest that uses PAE paging loads the PDPTEs into internal, non-architectural registers based on the setting of the "enable EPT" VM-execution control. and: [GUEST_PDPTR] values are saved into the four PDPTE fields as follows: - If the "enable EPT" VM-execution control is 0 or the logical processor was not using PAE paging at the time of the VM exit, the values saved are undefined. In other words, if EPT is disabled or the guest isn't using PAE paging, then the PDPTRS aren't consumed by hardware on VM-Entry and are loaded with junk on VM-Exit. From a nesting perspective, all of the above hold true, i.e. KVM can effectively ignore the VMCS PDPTRs. E.g. KVM already loads the PDPTRs from memory when nested EPT is disabled (see nested_vmx_load_cr3()). Because KVM intercepts setting CR4.PAE, there is no danger of consuming a stale value or crushing L1's VMWRITEs regardless of whether L1 intercepts CR4.PAE. The vmcs12's values are unchanged up until the VM-Exit where L2 sets CR4.PAE, i.e. L0 will see the new PAE state on the subsequent VM-Entry and propagate the PDPTRs from vmcs12 to vmcs02. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:39 +02:00
Paolo Bonzini	bf03d4f933	KVM: x86: introduce is_pae_paging Checking for 32-bit PAE is quite common around code that fiddles with the PDPTRs. Add a function to compress all checks into a single invocation. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:38 +02:00
Sean Christopherson	c27e5b0d13	KVM: nVMX: Don't update GUEST_BNDCFGS if it's clean in HV eVMCS L1 is responsible for dirtying GUEST_GRP1 if it writes GUEST_BNDCFGS. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:38 +02:00
Sean Christopherson	699a1ac214	KVM: nVMX: Update vmcs12 for MSR_IA32_DEBUGCTLMSR when it's written KVM unconditionally intercepts WRMSR to MSR_IA32_DEBUGCTLMSR. In the unlikely event that L1 allows L2 to write L1's MSR_IA32_DEBUGCTLMSR, but but saves L2's value on VM-Exit, update vmcs12 during L2's WRMSR so as to eliminate the need to VMREAD the value from vmcs02 on nested VM-Exit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:37 +02:00
Sean Christopherson	de70d27970	KVM: nVMX: Update vmcs12 for SYSENTER MSRs when they're written For L2, KVM always intercepts WRMSR to SYSENTER MSRs. Update vmcs12 in the WRMSR handler so that they don't need to be (re)read from vmcs02 on every nested VM-Exit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:37 +02:00
Sean Christopherson	142e4be77b	KVM: nVMX: Update vmcs12 for MSR_IA32_CR_PAT when it's written As alluded to by the TODO comment, KVM unconditionally intercepts writes to the PAT MSR. In the unlikely event that L1 allows L2 to write L1's PAT directly but saves L2's PAT on VM-Exit, update vmcs12 when L2 writes the PAT. This eliminates the need to VMREAD the value from vmcs02 on VM-Exit as vmcs12 is already up to date in all situations. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:36 +02:00
Sean Christopherson	a49700b66e	KVM: nVMX: Don't speculatively write APIC-access page address If nested_get_vmcs12_pages() fails to map L1's APIC_ACCESS_ADDR into L2, then it disables SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES in vmcs02. In other words, the APIC_ACCESS_ADDR in vmcs02 is guaranteed to be written with the correct value before being consumed by hardware, drop the unneessary VMWRITE. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:35 +02:00
Sean Christopherson	ca2f5466f8	KVM: nVMX: Don't speculatively write virtual-APIC page address The VIRTUAL_APIC_PAGE_ADDR in vmcs02 is guaranteed to be updated before it is consumed by hardware, either in nested_vmx_enter_non_root_mode() or via the KVM_REQ_GET_VMCS12_PAGES callback. Avoid an extra VMWRITE and only stuff a bad value into vmcs02 when mapping vmcs12's address fails. This also eliminates the need for extra comments to connect the dots between prepare_vmcs02_early() and nested_get_vmcs12_pages(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:35 +02:00
Sean Christopherson	73cb855684	KVM: nVMX: Don't dump VMCS if virtual APIC page can't be mapped ... as a malicious userspace can run a toy guest to generate invalid virtual-APIC page addresses in L1, i.e. flood the kernel log with error messages. Fixes: `690908104e` ("KVM: nVMX: allow tests to use bad virtual-APIC page address") Cc: stable@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:21 +02:00
Sean Christopherson	8ef863e67a	KVM: nVMX: Don't reread VMCS-agnostic state when switching VMCS When switching between vmcs01 and vmcs02, there is no need to update state tracking for values that aren't tied to any particular VMCS as the per-vCPU values are already up-to-date (vmx_switch_vmcs() can only be called when the vCPU is loaded). Avoiding the update eliminates a RDMSR, and potentially a RDPKRU and posted-interrupt update (cmpxchg64() and more). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:47:06 +02:00
Sean Christopherson	13b964a29d	KVM: nVMX: Don't "put" vCPU or host state when switching VMCS When switching between vmcs01 and vmcs02, KVM isn't actually switching between guest and host. If guest state is already loaded (the likely, if not guaranteed, case), keep the guest state loaded and manually swap the loaded_cpu_state pointer after propagating saved host state to the new vmcs0{1,2}. Avoiding the switch between guest and host reduces the latency of switching between vmcs01 and vmcs02 by several hundred cycles, and reduces the roundtrip time of a nested VM by upwards of 1000 cycles. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:55 +02:00
Paolo Bonzini	b464f57e13	KVM: VMX: simplify vmx_prepare_switch_to_{guest,host} vmx->loaded_cpu_state can only be NULL or equal to vmx->loaded_vmcs, so change it to a bool. Because the direction of the bool is now the opposite of vmx->guest_msrs_dirty, change the direction of vmx->guest_msrs_dirty so that they match. Finally, do not imply that MSRs have to be reloaded when vmx->guest_state_loaded is false; instead, set vmx->guest_msrs_ready to false explicitly in vmx_prepare_switch_to_host. Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:54 +02:00
Sean Christopherson	4d6c989284	KVM: nVMX: Don't rewrite GUEST_PML_INDEX during nested VM-Entry Emulation of GUEST_PML_INDEX for a nested VMM is a bit weird. Because L0 flushes the PML on every VM-Exit, the value in vmcs02 at the time of VM-Enter is a constant -1, regardless of what L1 thinks/wants. Fixes: `09abe32002` ("KVM: nVMX: split pieces of prepare_vmcs02() to prepare_vmcs02_early()") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:53 +02:00
Sean Christopherson	c538d57f67	KVM: nVMX: Write ENCLS-exiting bitmap once per vmcs02 KVM doesn't yet support SGX virtualization, i.e. writes a constant value to ENCLS_EXITING_BITMAP so that it can intercept ENCLS and inject a #UD. Fixes: `0b665d3040` ("KVM: vmx: Inject #UD for SGX ENCLS instruction in guest") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:53 +02:00
Sean Christopherson	3b013a2972	KVM: nVMX: Always sync GUEST_BNDCFGS when it comes from vmcs01 If L1 does not set VM_ENTRY_LOAD_BNDCFGS, then L1's BNDCFGS value must be propagated to vmcs02 since KVM always runs with VM_ENTRY_LOAD_BNDCFGS when MPX is supported. Because the value effectively comes from vmcs01, vmcs02 must be updated even if vmcs12 is clean. Fixes: `62cf9bd811` ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS") Cc: stable@vger.kernel.org Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:52 +02:00
Sean Christopherson	d28f4290b5	KVM: VMX: Always signal #GP on WRMSR to MSR_IA32_CR_PAT with bad value The behavior of WRMSR is in no way dependent on whether or not KVM consumes the value. Fixes: `4566654bb9` ("KVM: vmx: Inject #GP on invalid PAT CR") Cc: stable@vger.kernel.org Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:51 +02:00
Paolo Bonzini	b1346ab2af	KVM: nVMX: Rename prepare_vmcs02__full to prepare_vmcs02__rare These function do not prepare the entire state of the vmcs02, only the rarely needed parts. Rename them to make this clearer. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:51 +02:00
Sean Christopherson	7952d769c2	KVM: nVMX: Sync rarely accessed guest fields only when needed Many guest fields are rarely read (or written) by VMMs, i.e. likely aren't accessed between runs of a nested VMCS. Delay pulling rarely accessed guest fields from vmcs02 until they are VMREAD or until vmcs12 is dirtied. The latter case is necessary because nested VM-Entry will consume all manner of fields when vmcs12 is dirty, e.g. for consistency checks. Note, an alternative to synchronizing all guest fields on VMREAD would be to read only the field being accessed, but switching VMCS pointers is expensive and odds are good if one guest field is being accessed then others will soon follow, or that vmcs12 will be dirtied due to a VMWRITE (see above). And the full synchronization results in slightly cleaner code. Note, although GUEST_PDPTRs are relevant only for a 32-bit PAE guest, they are accessed quite frequently for said guests, and a separate patch is in flight to optimize away GUEST_PDTPR synchronziation for non-PAE guests. Skipping rarely accessed guest fields reduces the latency of a nested VM-Exit by ~200 cycles. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:50 +02:00
Sean Christopherson	e2174295b4	KVM: nVMX: Add helpers to identify shadowed VMCS fields So that future optimizations related to shadowed fields don't need to define their own switch statement. Add a BUILD_BUG_ON() to ensure at least one of the types (RW vs RO) is defined when including vmcs_shadow_fields.h (guess who keeps mistyping SHADOW_FIELD_RO as SHADOW_FIELD_R0). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:47 +02:00
Sean Christopherson	3731905ef2	KVM: nVMX: Use descriptive names for VMCS sync functions and flags Nested virtualization involves copying data between many different types of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS. Rename a variety of functions and flags to document both the source and destination of each sync. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:06 +02:00
Sean Christopherson	f4f8316d2a	KVM: nVMX: Lift sync_vmcs12() out of prepare_vmcs12() ... to make it more obvious that sync_vmcs12() is invoked on all nested VM-Exits, e.g. hiding sync_vmcs12() in prepare_vmcs12() makes it appear that guest state is NOT propagated to vmcs12 for a normal VM-Exit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:06 +02:00
Sean Christopherson	1c6f0b47fb	KVM: nVMX: Track vmcs12 offsets for shadowed VMCS fields The vmcs12 fields offsets are constant and known at compile time. Store the associated offset for each shadowed field to avoid the costly lookup in vmcs_field_to_offset() when copying between vmcs12 and the shadow VMCS. Avoiding the costly lookup reduces the latency of copying by ~100 cycles in each direction. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:05 +02:00
Sean Christopherson	b643780562	KVM: nVMX: Intercept VMWRITEs to GUEST_{CS,SS}_AR_BYTES VMMs frequently read the guest's CS and SS AR bytes to detect 64-bit mode and CPL respectively, but effectively never write said fields once the VM is initialized. Intercepting VMWRITEs for the two fields saves ~55 cycles in copy_shadow_to_vmcs12(). Because some Intel CPUs, e.g. Haswell, drop the reserved bits of the guest access rights fields on VMWRITE, exposing the fields to L1 for VMREAD but not VMWRITE leads to inconsistent behavior between L1 and L2. On hardware that drops the bits, L1 will see the stripped down value due to reading the value from hardware, while L2 will see the full original value as stored by KVM. To avoid such an inconsistency, emulate the behavior on all CPUS, but only for intercepted VMWRITEs so as to avoid introducing pointless latency into copy_shadow_to_vmcs12(), e.g. if the emulation were added to vmcs12_write_any(). Since the AR_BYTES emulation is done only for intercepted VMWRITE, if a future patch (re)exposed AR_BYTES for both VMWRITE and VMREAD, then KVM would end up with incosistent behavior on pre-Haswell hardware, e.g. KVM would drop the reserved bits on intercepted VMWRITE, but direct VMWRITE to the shadow VMCS would not drop the bits. Add a WARN in the shadow field initialization to detect any attempt to expose an AR_BYTES field without updating vmcs12_write_any(). Note, emulation of the AR_BYTES reserved bit behavior is based on a patch[1] from Jim Mattson that applied the emulation to all writes to vmcs12 so that live migration across different generations of hardware would not introduce divergent behavior. But given that live migration of nested state has already been enabled, that ship has sailed (not to mention that no sane VMM will be affected by this behavior). [1] https://patchwork.kernel.org/patch/10483321/ Cc: Jim Mattson <jmattson@google.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:05 +02:00
Sean Christopherson	fadcead00c	KVM: nVMX: Intercept VMWRITEs to read-only shadow VMCS fields Allowing L1 to VMWRITE read-only fields is only beneficial in a double nesting scenario, e.g. no sane VMM will VMWRITE VM_EXIT_REASON in normal non-nested operation. Intercepting RO fields means KVM doesn't need to sync them from the shadow VMCS to vmcs12 when running L2. The obvious downside is that L1 will VM-Exit more often when running L3, but it's likely safe to assume most folks would happily sacrifice a bit of L3 performance, which may not even be noticeable in the grande scheme, to improve L2 performance across the board. Not intercepting fields tagged read-only also allows for additional optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO since those fields are rarely written by a VMMs, but read frequently. When utilizing a shadow VMCS with asymmetric R/W and R/O bitmaps, fields that cause VM-Exit on VMWRITE but not VMREAD need to be propagated to the shadow VMCS during VMWRITE emulation, otherwise a subsequence VMREAD from L1 will consume a stale value. Note, KVM currently utilizes asymmetric bitmaps when "VMWRITE any field" is not exposed to L1, but only so that it can reject the VMWRITE, i.e. propagating the VMWRITE to the shadow VMCS is a new requirement, not a bug fix. Eliminating the copying of RO fields reduces the latency of nested VM-Entry (copy_shadow_to_vmcs12()) by ~100 cycles (plus 40-50 cycles if/when the AR_BYTES fields are exposed RO). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:04 +02:00
Sean Christopherson	95b5a48c4f	KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn Per commit `1b6269db3f` ("KVM: VMX: Handle NMIs before enabling interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run() to "make sure we handle NMI on the current cpu, and that we don't service maskable interrupts before non-maskable ones". The other exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC, have similar requirements, and are located there to avoid extra VMREADs since VMX bins hardware exceptions and NMIs into a single exit reason. Clean up the code and eliminate the vaguely named complete_atomic_exit() by moving the interrupts-disabled exception and NMI handling into the existing handle_external_intrs() callback, and rename the callback to a more appropriate name. Rename VMexit handlers throughout so that the atomic and non-atomic counterparts have similar names. In addition to improving code readability, this also ensures the NMI handler is run with the host's debug registers loaded in the unlikely event that the user is debugging NMIs. Accuracy of the last_guest_tsc field is also improved when handling NMIs (and #MCs) as the handler will run after updating said field. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> [Naming cleanups. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:04 +02:00
Sean Christopherson	165072b089	KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code VMX can conditionally call kvm_{before,after}_interrupt() since KVM always uses "ack interrupt on exit" and therefore explicitly handles interrupts as opposed to blindly enabling irqs. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:03 +02:00
Sean Christopherson	2342080cd6	KVM: VMX: Store the host kernel's IDT base in a global variable Although the kernel may use multiple IDTs, KVM should only ever see the "real" IDT, e.g. the early init IDT is long gone by the time KVM runs and the debug stack IDT is only used for small windows of time in very specific flows. Before commit `a547c6db4d` ("KVM: VMX: Enable acknowledge interupt on vmexit"), the kernel's IDT base was consumed by KVM only when setting constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE. Because constant host state is done once per vCPU, there was ostensibly no need to cache the kernel's IDT base. When support for "ack interrupt on exit" was introduced, KVM added a second consumer of the IDT base as handling already-acked interrupts requires directly calling the interrupt handler, i.e. KVM uses the IDT base to find the address of the handler. Because interrupts are a fast path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE. Presumably, the IDT base was cached on a per-vCPU basis simply because the existing code grabbed the IDT base on a per-vCPU (VMCS) basis. Note, all post-boot IDTs use the same handlers for external interrupts, i.e. the "ack interrupt on exit" use of the IDT base would be unaffected even if the cached IDT somehow did not match the current IDT. And as for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the above analysis is wrong then KVM has had a bug since the beginning of time since KVM has effectively been caching the IDT at vCPU creation since commit a8b732ca01c ("[PATCH] kvm: userspace interface"). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:02 +02:00
Sean Christopherson	49def500e5	KVM: VMX: Read cached VM-Exit reason to detect external interrupt Generic x86 code invokes the kvm_x86_ops external interrupt handler on all VM-Exits regardless of the actual exit type. Use the already-cached EXIT_REASON to determine if the VM-Exit was due to an interrupt, thus avoiding an extra VMREAD (to query VM_EXIT_INTR_INFO) for all other types of VM-Exit. In addition to avoiding the extra VMREAD, checking the EXIT_REASON instead of VM_EXIT_INTR_INFO makes it more obvious that vmx_handle_external_intr() is called for all VM-Exits, e.g. someone unfamiliar with the flow might wonder under what condition(s) VM_EXIT_INTR_INFO does not contain a valid interrupt, which is simply not possible since KVM always runs with "ack interrupt on exit". WARN once if VM_EXIT_INTR_INFO doesn't contain a valid interrupt on an EXTERNAL_INTERRUPT VM-Exit, as such a condition would indicate a hardware bug. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:02 +02:00
Paolo Bonzini	2ea7203980	kvm: nVMX: small cleanup in handle_exception The reason for skipping handling of NMI and #MC in handle_exception is the same, namely they are handled earlier by vmx_complete_atomic_exit. Calling the machine check handler (which just returns 1) is misleading, don't do it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:01 +02:00
Sean Christopherson	beb8d93b3e	KVM: VMX: Fix handling of #MC that occurs during VM-Entry A previous fix to prevent KVM from consuming stale VMCS state after a failed VM-Entry inadvertantly blocked KVM's handling of machine checks that occur during VM-Entry. Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways, depending on when the #MC is recognoized. As it pertains to this bug fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to indicate the VM-Entry failed. If a machine-check event occurs during a VM entry, one of the following occurs: - The machine-check event is handled as if it occurred before the VM entry: ... - The machine-check event is handled after VM entry completes: ... - A VM-entry failure occurs as described in Section 26.7. The basic exit reason is 41, for "VM-entry failure due to machine-check event". Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit(). Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY in a sane fashion and also simplifies vmx_complete_atomic_exit() since VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh. Fixes: `b060ca3b2e` ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:45:44 +02:00

1 2 3 4 5 ...

32422 Commits