Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL
with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's
enums are borderline impossible to remember and result in code that is
visually difficult to audit, e.g.
    if (!enable_ept)
            ept_lpage_level = 0;
    else if (cpu_has_vmx_ept_1g_page())
            ept_lpage_level = PT_PDPE_LEVEL;
    else if (cpu_has_vmx_ept_2m_page())
            ept_lpage_level = PT_DIRECTORY_LEVEL;
    else
            ept_lpage_level = PT_PAGE_TABLE_LEVEL;
versus
    if (!enable_ept)
            ept_lpage_level = 0;
    else if (cpu_has_vmx_ept_1g_page())
            ept_lpage_level = PG_LEVEL_1G;
    else if (cpu_has_vmx_ept_2m_page())
            ept_lpage_level = PG_LEVEL_2M;
    else
            ept_lpage_level = PG_LEVEL_4K;
No functional change intended.
Suggested-by: Barret Rhoden <brho@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename PT_MAX_HUGEPAGE_LEVEL to KVM_MAX_HUGEPAGE_LEVEL and make it a
separate define in anticipation of dropping KVM's PT_*_LEVEL enums in
favor of the kernel's PG_LEVEL_* enums.
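For reference, the result is essentially a one-liner (a sketch; at this
point the define still resolves to the old enum):
    #define KVM_MAX_HUGEPAGE_LEVEL  PT_PDPE_LEVEL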
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200428005422.4235-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change the PSE hugepage handling in walk_addr_generic() to fire on any
page level greater than PT_PAGE_TABLE_LEVEL, a.k.a. PG_LEVEL_4K. PSE
paging only has two levels, so "== 2" and "> 1" are functionally the
same, i.e. this is a nop.
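For illustration, the PSE-36 handling in walk_addr_generic() would then
read along these lines (a sketch, with the surrounding walker context
elided):
    if (PTTYPE == 32 && walker->level > PT_PAGE_TABLE_LEVEL &&
        is_cpuid_PSE36())
            gfn += pse36_gfn_delta(pte);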
A future patch will drop KVM's PT_*_LEVEL enums in favor of the kernel's
PG_LEVEL_* enums, at which point "walker->level == PG_LEVEL_2M" is
semantically incorrect (though still functionally ok).
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200428005422.4235-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vcpu->arch.guest_xstate_size lost its only user since commit df1daba7d1
("KVM: x86: support XSAVES usage in the host"), so clean it up.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200429154312.1411-1-xiaoyao.li@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use an enum for passing around the failure code for a failed VM-Enter
that results in VM-Exit to provide a level of indirection from the final
resting place of the failure code, vmcs.EXIT_QUALIFICATION. The exit
qualification field is an unsigned long, e.g. passing around
'u32 exit_qual' throws up red flags as it suggests KVM may be dropping
bits when reporting errors to L1. This is a red herring because the
only defined failure codes are 0, 2, 3, and 4, i.e. they don't come
remotely close to overflowing a u32.
Setting vmcs.EXIT_QUALIFICATION on entry failure is further complicated
by the MSR load list, which returns the (1-based) entry that failed, and
the number of MSRs to load is a 32-bit VMCS field. At first blush, it
would appear that overflowing a u32 is possible, but the number of MSRs
that can be loaded is hardcapped at 4096 (limited by MSR_IA32_VMX_MISC).
In other words, there are two completely disparate types of data that
eventually get stuffed into vmcs.EXIT_QUALIFICATION, neither of which is
an 'unsigned long' in nature. This was presumably the reasoning for
switching to 'u32' when the related code was refactored in commit
ca0bde28f2 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()").
Using an enum for the failure code addresses the technically-possible-
but-will-never-happen scenario where Intel defines a failure code that
doesn't fit in a 32-bit integer. The enum variables and values will
either be automatically sized (gcc 5.4 behavior) or be subjected to some
combination of truncation. The former case will simply work, while the
latter will trigger a compile-time warning unless the compiler is being
particularly unhelpful.
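For reference, the failure codes can be captured along these lines (a
sketch; values per the SDM's entry-failure encodings, names assumed):
    enum vm_entry_failure_code {
            ENTRY_FAIL_DEFAULT              = 0,
            ENTRY_FAIL_PDPTE                = 2,
            ENTRY_FAIL_NMI                  = 3,
            ENTRY_FAIL_VMCS_LINK_PTR        = 4,
    };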
Separating the failure code from the failed MSR entry allows for
disassociating both from vmcs.EXIT_QUALIFICATION, which avoids the
conundrum where KVM has to choose between 'u32 exit_qual' and tracking
values as 'unsigned long' that have no business being tracked as such.
To cement the split, set vmcs12->exit_qualification directly from the
entry error code or failed MSR index instead of bouncing through a local
variable.
Opportunistically rename the variables in load_vmcs12_host_state() and
vmx_set_nested_state() to call out that they're ignored, set exit_reason
on demand on nested VM-Enter failure, and add a comment in
nested_vmx_load_msr() to call out that returning 'i + 1' can't wrap.
No functional change intended.
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200511220529.11402-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Snapshot the TDP level now that it's invariant (SVM) or dependent only
on host capabilities and guest CPUID (VMX). This avoids having to call
kvm_x86_ops.get_tdp_level() when initializing a TDP MMU and/or
calculating the page role, and thus avoids the associated retpoline.
Drop the WARN in vmx_get_tdp_level() as updating CPUID while L2 is
active is legal, if dodgy.
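A minimal sketch of the idea, assuming a 'tdp_level' field in
kvm_vcpu_arch (field name illustrative):
    /* Snapshot once when guest CPUID changes, read the cache elsewhere. */
    vcpu->arch.tdp_level = kvm_x86_ops.get_tdp_level(vcpu);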
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200502043234.12481-11-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Separate the "core" TDP level handling from the nested EPT path to make
it clear that kvm_x86_ops.get_tdp_level() is used if and only if nested
EPT is not in use (kvm_init_shadow_ept_mmu() calculates the level from
the passed in vmcs12->eptp). Add a WARN_ON() to enforce that the
kvm_x86_ops hook is not called for nested EPT.
This sets the stage for snapshotting the non-"nested EPT" TDP page level
during kvm_cpuid_update() to avoid the retpoline associated with
kvm_x86_ops.get_tdp_level() when resetting the MMU, a relatively
frequent operation when running a nested guest.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200502043234.12481-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move CR0 caching into the standard register caching mechanism in order
to take advantage of the availability checks provided by regs_avail.
This avoids multiple VMREADs in the (uncommon) case where kvm_read_cr0()
is called multiple times in a single VM-Exit, and more importantly
eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading
CR0, and squashes the confusing naming discrepancy of "cache_reg" vs.
"decache_cr0_guest_bits".
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200502043234.12481-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move CR4 caching into the standard register caching mechanism in order
to take advantage of the availability checks provided by regs_avail.
This avoids multiple VMREADs and retpolines (when configured) during
nested VMX transitions as kvm_read_cr4_bits() is invoked multiple times
on each transition, e.g. when stuffing CR0 and CR3.
As an added bonus, this eliminates a kvm_x86_ops hook, saves a retpoline
on SVM when reading CR4, and squashes the confusing naming discrepancy
of "cache_reg" vs. "decache_cr4_guest_bits".
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200502043234.12481-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unconditionally check the validity of the incoming CR3 during nested
VM-Enter/VM-Exit to avoid invoking kvm_read_cr3() in the common case
where the guest isn't using PAE paging. If vmcs.GUEST_CR3 hasn't yet
been cached (common case), kvm_read_cr3() will trigger a VMREAD. The
VMREAD (~30 cycles) alone is likely slower than nested_cr3_valid()
(~5 cycles if vcpu->arch.maxphyaddr gets a cache hit), and the poor
exchange only gets worse when retpolines are enabled as the call to
kvm_x86_ops.cache_reg() will incur a retpoline (60+ cycles).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200502043234.12481-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Save L1's TSC offset in 'struct kvm_vcpu_arch' and drop the kvm_x86_ops
hook read_l1_tsc_offset(). This avoids a retpoline (when configured)
when reading L1's effective TSC, which is done at least once on every
VM-Exit.
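Reading L1's effective TSC then avoids the hook entirely, roughly
(sketch):
    u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
    {
            return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
    }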
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200502043234.12481-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS
switch when temporarily loading vmcs02 to synchronize it to vmcs12, i.e.
give copy_vmcs02_to_vmcs12_rare() the same treatment as
vmx_switch_vmcs().
Make vmx_vcpu_load() static now that it's only referenced within vmx.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200506235850.22600-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS
switch when running with spectre_v2_user=on/auto if the switch is
between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02.
The IBPB is intended to prevent one guest from attacking another, which
is unnecessary in the nested case as it's the same guest from KVM's
perspective.
This all but eliminates the overhead observed for nested VMX transitions
when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which
can be significant, e.g. roughly 3x on current systems.
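For illustration, the load path can key the barrier off the previous
VMCS, roughly (a sketch; 'buddy' names the sibling VMCS of the same
guest):
    if (prev != vmx->loaded_vmcs->vmcs) {
            per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
            vmcs_load(vmx->loaded_vmcs->vmcs);

            /* No IBPB when switching between vmcs01/vmcs02 of one guest. */
            if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
                    indirect_branch_prediction_barrier();
    }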
Reported-by: Alexander Graf <graf@amazon.com>
Cc: KarimAllah Raslan <karahmed@amazon.de>
Cc: stable@vger.kernel.org
Fixes: 15d4507152 ("KVM/x86: Add IBPB support")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200501163117.4655-1-sean.j.christopherson@intel.com>
[Invert direction of bool argument. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use vmx_get_intr_info() when grabbing the cached vmcs.INTR_INFO in
handle_exception_nmi() to ensure the cache isn't stale. Bypassing the
caching accessor doesn't cause any known issues as the cache is always
refreshed by handle_exception_nmi_irqoff(), but the whole point of
adding the proper caching mechanism was to avoid such dependencies.
Fixes: 8791585837 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200427171837.22613-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM is not handling the case where EIP wraps around the 32-bit address
space (that is, outside long mode). This is needed both in vmx.c
and in emulate.c. SVM with NRIPS is okay, but it can still print
an error to dmesg due to integer overflow.
Reported-by: Nick Peterson <everdox@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The use of any sort of waitqueue (simple or regular) for
wait/waking vcpus has always been overkill and semantically
wrong. Because this is per-vcpu (which is blocked) there is
only ever a single waiting vcpu, thus no need for any sort of
queue.
As such, make use of the rcuwait primitive, with the following
considerations:
- rcuwait already provides the proper barriers that serialize
concurrent waiter and waker.
- Task wakeup is done in rcu read critical region, with a
stable task pointer.
- Because there is no concurrency among waiters, we need
not worry about rcuwait_wait_event() calls corrupting
the wait->task. As a consequence, this saves the locking
done in swait when modifying the queue. This also applies
to per-vcore wait for powerpc kvm-hv.
The x86 tscdeadline_latency test mentioned in 8577370fb0
("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg,
latency is reduced by around 15-20% with this change.
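The wake side then reduces to a single rcuwait call, roughly (sketch):
    /* Resolves the (single) waiting task under RCU, no queue or lock. */
    rcuwait_wake_up(&vcpu->wait);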
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-mips@vger.kernel.org
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Message-Id: <20200424054837.5138-6-dave@stgolabs.net>
[Avoid extra logic changes. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an argument to interrupt_allowed and nmi_allowed to check whether
interrupt injection is blocked. Use the hook to handle the case where
an interrupt arrives between check_nested_events() and the injection
logic. Drop the retry of check_nested_events() that hack-a-fixed the
same condition.
Blocking injection is also a bit of a hack, e.g. KVM should do exiting
and non-exiting interrupt processing in a single pass, but it's a more
precise hack. The old comment is also misleading, e.g. KVM_REQ_EVENT is
purely an optimization, setting it on every run loop (which KVM doesn't
do) should not affect functionality, only performance.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-13-sean.j.christopherson@intel.com>
[Extend to SVM, add SMI and NMI. Even though NMI and SMI cannot come
asynchronously right now, making the fix generic is easy and removes a
special case. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use vmx_get_rflags() instead of manually reading vmcs.GUEST_RFLAGS when
querying RFLAGS.IF so that multiple checks against interrupt blocking in
a single run loop only require a single VMREAD.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-14-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use vmx_interrupt_blocked() instead of bouncing through
vmx_interrupt_allowed() when handling edge cases in vmx_handle_exit().
The nested_run_pending check in vmx_interrupt_allowed() should never
evaluate true in the VM-Exit path.
Hoist the WARN in handle_invalid_guest_state() up to vmx_handle_exit()
to enforce the above assumption for the !enable_vnmi case, and to detect
any other potential bugs with nested VM-Enter.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-12-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if a pending exception is coincident with an injected exception
before calling check_nested_events() so that the WARN will fire even if
inject_pending_event() bails early because check_nested_events() detects
the conflict. Bailing early isn't problematic (quite the opposite), but
suppressing the WARN is undesirable as it could mask a bug elsewhere in
KVM.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-11-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Short circuit svm_check_nested_events() if an unblocked IRQ/NMI/SMI is
pending and needs to be injected into L2; priority between coincident
events is not dependent on exiting behavior.
Fixes: b518ba9fa6 ("KVM: nSVM: implement check_nested_events for interrupts")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Report interrupts as allowed when the vCPU is in L2 and L2 is being run with
exit-on-interrupts enabled and EFLAGS.IF=1 (either on the host or on the guest
according to VINTR). Interrupts are always unblocked from L1's perspective
in this case.
While moving nested_exit_on_intr to svm.h, use INTERCEPT_INTR properly instead
of assuming it's zero (which it is of course).
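i.e. something along these lines (a sketch of the svm.h helper):
    static inline bool nested_exit_on_intr(struct vcpu_svm *svm)
    {
            return (svm->nested.intercept & (1ULL << INTERCEPT_INTR));
    }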
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check for an unblocked SMI in vmx_check_nested_events() so that pending
SMIs are correctly prioritized over IRQs and NMIs when the latter events
will trigger VM-Exit. This also fixes an issue where an SMI that was
marked pending while processing a nested VM-Enter wouldn't trigger an
immediate exit, i.e. would be incorrectly delayed until L2 happened to
take a VM-Exit.
Fixes: 64d6067057 ("KVM: x86: stubs for SMM support")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Short circuit vmx_check_nested_events() if an unblocked IRQ/NMI is
pending and needs to be injected into L2; priority between coincident
events is not dependent on exiting behavior.
Fixes: b6b8a1451f ("KVM: nVMX: Rework interception of IRQs and NMIs")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the architectural (non-KVM specific) interrupt/NMI/SMI blocking checks
to a separate helper so that they can be used in a future patch by
svm_check_nested_events().
No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the architectural (non-KVM specific) interrupt/NMI blocking checks
to a separate helper so that they can be used in a future patch by
vmx_check_nested_events().
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unlike VMX, SVM allows a hypervisor to take an SMI vmexit without having
any special SMM-monitor enablement sequence. Therefore, it has to be
handled like interrupts and NMIs. Check for an unblocked SMI in
svm_check_nested_events() so that pending SMIs are correctly prioritized
over IRQs and NMIs when the latter events will trigger VM-Exit.
Note that there is no need to test explicitly for SMI vmexits, because
guests always run outside SMM and therefore can never get an SMI while
they are blocked.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Report NMIs as allowed when the vCPU is in L2 and L2 is being run with
Exit-on-NMI enabled, as NMIs are always unblocked from L1's perspective
in this case.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Report NMIs as allowed when the vCPU is in L2 and L2 is being run with
Exit-on-NMI enabled, as NMIs are always unblocked from L1's perspective
in this case.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not hardcode is_smm so that all the architectural conditions for
blocking SMIs are listed in a single place. Well, in two places because
this introduces some code duplication between Intel and AMD.
This ensures that nested SVM obeys GIF in kvm_vcpu_has_events.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return an actual bool for kvm_x86_ops' {interrupt,nmi}_allowed() hook to
better reflect the return semantics, and to avoid creating an even
bigger mess when the related VMX code is refactored in upcoming patches.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Re-request KVM_REQ_EVENT if vcpu_enter_guest() bails after processing
pending requests and an immediate exit was requested. This fixes a bug
where a pending event, e.g. VMX preemption timer, is delayed and/or lost
if the exit was deferred due to something other than a higher priority
_injected_ event, e.g. due to a pending nested VM-Enter. This bug only
affects the !injected case as kvm_x86_ops.cancel_injection() sets
KVM_REQ_EVENT to redo the injection, but that's purely serendipitous
behavior with respect to the deferred event.
Note, emulated preemption timer isn't the only event that can be
affected, it simply happens to be the only event where not re-requesting
KVM_REQ_EVENT is blatantly visible to the guest.
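The fix amounts to re-requesting the event in the bail-out path, roughly
(a sketch of vcpu_enter_guest()'s cancel path):
    if (req_immediate_exit)
            kvm_make_request(KVM_REQ_EVENT, vcpu);
    kvm_x86_ops.cancel_injection(vcpu);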
Fixes: f4124500c2 ("KVM: nVMX: Fully emulate preemption timer")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and
use it to effectively open a window for servicing the expired timer.
Like pending SMIs on VMX, opening a window simply means requesting an
immediate exit.
This fixes a bug where an expired VMX preemption timer (for L2) will be
delayed and/or lost if a pending exception is injected into L2. The
pending exception is rightly prioritized by vmx_check_nested_events()
and injected into L2, with the preemption timer left pending. Because
no window opened, L2 is free to run uninterrupted.
Fixes: f4124500c2 ("KVM: nVMX: Fully emulate preemption timer")
Reported-by: Jim Mattson <jmattson@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com>
[Check it in kvm_vcpu_has_events too, to ensure that the preemption
timer is serviced promptly even if the vCPU is halted and L1 is not
intercepting HLT. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Short circuit vmx_check_nested_events() if an exception is pending and
needs to be injected into L2; priority between coincident events is not
dependent on exiting behavior. This fixes a bug where a single-step #DB
that is not intercepted by L1 is incorrectly dropped due to servicing a
VMX Preemption Timer VM-Exit.
Injected exceptions also need to be blocked if nested VM-Enter is
pending or an exception was already injected, otherwise injecting the
exception could overwrite an existing event injection from L1.
Technically, this scenario should be impossible, i.e. KVM shouldn't
inject its own exception during nested VM-Enter. This will be addressed
in a future patch.
Note, event priority between SMI, NMI and INTR is incorrect for L2, e.g.
SMI should take priority over VM-Exit on NMI/INTR, and NMI that is
injected into L2 should take priority over VM-Exit INTR. This will also
be addressed in a future patch.
Fixes: b6b8a1451f ("KVM: nVMX: Rework interception of IRQs and NMIs")
Reported-by: Jim Mattson <jmattson@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Migrate nested guest NMI intercept processing to the new
check_nested_events().
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20200414201107.22952-2-cavery@redhat.com>
[Reorder clauses as NMIs have higher priority than IRQs; inject
immediate vmexit as is now done for IRQ vmexits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We can immediately leave SVM guest mode in svm_check_nested_events
now that we have the nested_run_pending mechanism. This makes
things easier because we can run the rest of inject_pending_event
with GIF=0, and KVM will naturally end up requesting the next
interrupt window.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to VMX, we need to leave the halted state when performing a vmexit.
Failure to do so will cause a hang after vmexit.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We want to inject vmexits immediately from svm_check_nested_events,
so that the interrupt/NMI window requests happen in inject_pending_event
right after it returns.
This however has the same issue as in vmx_check_nested_events, so
introduce a nested_run_pending flag with the exact same purpose
of delaying vmexit injection after the vmentry.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU
resource isn't. It can be read with XSAVE and written with XRSTOR.
So, if we don't set the guest PKRU value here (kvm_load_guest_xsave_state),
the guest can read the host value.
In case of kvm_load_host_xsave_state, guest with CR4.PKE clear could
potentially use XRSTOR to change the host PKRU value.
While at it, move pkru state save/restore to common code and the
host_pkru field to kvm_vcpu_arch. This will let SVM support protection keys.
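A sketch of the resulting common-code load path (placement and exact
condition per the description above):
    if (static_cpu_has(X86_FEATURE_PKU) &&
        (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
         (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
        vcpu->arch.pkru != vcpu->arch.host_pkru)
            __write_pkru(vcpu->arch.pkru);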
Cc: stable@vger.kernel.org
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge misc fixes from Andrew Morton:
"14 fixes and one selftest to verify the ipc fixes herein"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: limit boost_watermark on small zones
ubsan: disable UBSAN_ALIGNMENT under COMPILE_TEST
mm/vmscan: remove unnecessary argument description of isolate_lru_pages()
epoll: atomically remove wait entry on wake up
kselftests: introduce new epoll60 testcase for catching lost wakeups
percpu: make pcpu_alloc() aware of current gfp context
mm/slub: fix incorrect interpretation of s->offset
scripts/gdb: repair rb_first() and rb_last()
eventpoll: fix missing wakeup for ovflist in ep_poll_callback
arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()
scripts/decodecode: fix trapping instruction formatting
kernel/kcov.c: fix typos in kcov_remote_start documentation
mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
mm, memcg: fix error return value of mem_cgroup_css_alloc()
ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
Commit 64b5bd2704 ("KVM: nSVM: ignore L1 interrupt window
while running L2 with V_INTR_MASKING=1") introduced a WARN_ON
that checks whether AVIC is enabled when trying to set V_IRQ
in the VMCB to enable an interrupt window.
The following warning is triggered because the requesting vcpu
(to deactivate AVIC) does not get to process the APICv update request
for itself until the next #vmexit.
WARNING: CPU: 0 PID: 118232 at arch/x86/kvm/svm/svm.c:1372 enable_irq_window+0x6a/0xa0 [kvm_amd]
RIP: 0010:enable_irq_window+0x6a/0xa0 [kvm_amd]
Call Trace:
kvm_arch_vcpu_ioctl_run+0x6e3/0x1b50 [kvm]
? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
? _copy_to_user+0x26/0x30
? kvm_vm_ioctl+0xb3e/0xd90 [kvm]
? set_next_entity+0x78/0xc0
kvm_vcpu_ioctl+0x236/0x610 [kvm]
ksys_ioctl+0x8a/0xc0
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x58/0x210
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by sending the APICv update request to all other vcpus, and
immediately updating APICv for the requesting vcpu itself.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lkml.org/lkml/2020/5/2/167
Fixes: 64b5bd2704 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
Message-Id: <1588818939-54264-1-git-send-email-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This allows making a request to all other vcpus except the one
specified in the parameter.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When KVM_EXIT_DEBUG is raised for the disabled-breakpoints case (DR7.GD),
DR6 was incorrectly copied from the value in the VM. Instead,
DR6.BD should be set in order to catch this case.
On AMD this does not need any special code because the processor triggers
a #DB exception that is intercepted. However, the testcase would fail
without the previous patch because both DR6.BS and DR6.BD would be set.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the
different handling of DR6 on intercepted #DB exceptions on Intel and AMD.
On Intel, #DB exceptions transmit the DR6 value via the exit qualification
field of the VMCS, and the exit qualification only contains the description
of the precise event that caused a vmexit.
On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception
was to be injected into the guest. This has two effects when guest debugging
is in use:
* the guest DR6 is clobbered
* the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather
than just the last one that happened (the testcase in the next patch covers
this issue).
This patch fixes both issues by emulating, so to speak, the Intel behavior
on AMD processors. The important observation is that (after the previous
patches) the VMCB value of DR6 is only ever observable from the guest if
KVM_DEBUGREG_WONT_EXIT is set. Therefore we can actually set vmcb->save.dr6
to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it
will be if guest debugging is enabled.
Therefore it is possible to enter the guest with an all-zero DR6,
reconstruct the #DB payload from the DR6 we get at exit time, and let
kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6.
Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT
is set, but this is harmless.
This may not be the most optimized way to deal with this, but it is
simple and, being confined within SVM code, it gets rid of the set_dr6
callback and kvm_update_dr6.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the
second argument. Ensure that the VMCB value is synchronized to
vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so
that the current value of DR6 is always available in vcpu->arch.dr6.
The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When trying to lock read-only pages, sev_pin_memory() fails because
FOLL_WRITE is used as the flag for get_user_pages_fast().
Commit 73b0140bf0 ("mm/gup: change GUP fast to use flags rather than a
write 'bool'") updated the get_user_pages_fast() call sites to use
flags, but incorrectly updated the call in sev_pin_memory(). As the
original coding of this call was correct, revert the change made by that
commit.
Fixes: 73b0140bf0 ("mm/gup: change GUP fast to use flags rather than a write 'bool'")
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When single-step triggered with KVM_SET_GUEST_DEBUG, we should fill in the pc
value with current linear RIP rather than the cached singlestep address.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200505205000.188252-3-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
RTM should always be set even with KVM_EXIT_DEBUG on #DB.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200505205000.188252-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Go through kvm_queue_exception_p so that the payload is correctly delivered
through the exit qualification, and add a kvm_update_dr6 call to
kvm_deliver_exception_payload that is needed on AMD.
Reported-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_CAP_SET_GUEST_DEBUG should be supported for x86, however it's not
declared as supported. My wild guess is that userspaces like QEMU are using "#ifdef
KVM_CAP_SET_GUEST_DEBUG" to check for the capability instead, but that could be
wrong because the compilation host may not be the runtime host.
The userspace might still want to keep the old "#ifdef" though to not break the
guest debug on old kernels.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200505154750.126300-1-peterx@redhat.com>
[Do the same for PPC and s390. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using CPUID data can be useful for the processor compatibility
check, but that's it. Using it to compute guest-reserved bits
can have both false positives (such as LA57 and UMIP which we
are already handling) and false negatives: in particular, with
this patch we don't allow anymore a KVM guest to set CR4.PKE
when CR4.PKE is clear on the host.
Fixes: b9dd21e104 ("KVM: x86: simplify handling of PKRU")
Reported-by: Jim Mattson <jmattson@google.com>
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so
that KVM doesn't interpret clobbered RFLAGS as a VM-Fail. Filling the
RSB has always clobbered RFLAGS, its current incarnation just happens to
clear CF and ZF in the process. Relying on the macro to clear CF and
ZF is extremely fragile, e.g. commit 089dd8e531 ("x86/speculation:
Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such
that the ZF flag is always set.
Reported-by: Qian Cai <cai@lca.pw>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Fixes: f2fde6a5bc ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200506035355.2242-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit f458d039db ("kvm: ioapic: Lazy update IOAPIC EOI") introduces
the following infinite loop:
BUG: stack guard page was hit at 000000008f595917 \
(stack is 00000000bdefe5a4..00000000ae2b06f5)
kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI
RIP: 0010:kvm_set_irq+0x51/0x160 [kvm]
Call Trace:
irqfd_resampler_ack+0x32/0x90 [kvm]
kvm_notify_acked_irq+0x62/0xd0 [kvm]
kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm]
ioapic_set_irq+0x20e/0x240 [kvm]
kvm_ioapic_set_irq+0x5c/0x80 [kvm]
kvm_set_irq+0xbb/0x160 [kvm]
? kvm_hv_set_sint+0x20/0x20 [kvm]
irqfd_resampler_ack+0x32/0x90 [kvm]
kvm_notify_acked_irq+0x62/0xd0 [kvm]
kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm]
ioapic_set_irq+0x20e/0x240 [kvm]
kvm_ioapic_set_irq+0x5c/0x80 [kvm]
kvm_set_irq+0xbb/0x160 [kvm]
? kvm_hv_set_sint+0x20/0x20 [kvm]
....
The re-entrancy happens because the irq state is the OR of
the interrupt state and the resamplefd state. That is, we don't
want to show the state as 0 until we've had a chance to set the
resamplefd. But if the interrupt has _not_ gone low then
ioapic_set_irq is invoked again, causing an infinite loop.
This can only happen for a level-triggered interrupt, otherwise
irqfd_inject would immediately set the KVM_USERSPACE_IRQ_SOURCE_ID high
and then low. Fortunately, in the case of level-triggered interrupts the
VMEXIT already happens because TMR is set. Thus, fix the bug by
restricting the lazy invocation of the ack notifier to edge-triggered
interrupts, the only ones that need it.
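The fix boils down to gating the lazy update on the trigger mode,
roughly (sketch):
    /* Lazy EOI update only for edge-triggered interrupts. */
    if (edge && kvm_apicv_activated(ioapic->kvm))
            ioapic_lazy_update_eoi(ioapic, irq);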
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reported-by: borisvk@bstnet.org
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://www.spinics.net/lists/kvm/msg213512.html
Fixes: f458d039db ("kvm: ioapic: Lazy update IOAPIC EOI")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207489
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The corresponding code was added for VMX in commit 42dbaa5a05
("KVM: x86: Virtualize debug registers", 2008-12-15) but never for AMD.
Fix this.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use BUG() in the impossible-to-hit default case when switching on the
scope of INVEPT to squash a warning with clang 11 due to clang treating
the BUG_ON() as conditional.
>> arch/x86/kvm/vmx/nested.c:5246:3: warning: variable 'roots_to_free'
is used uninitialized whenever 'if' condition is false
[-Wsometimes-uninitialized]
BUG_ON(1);
Reported-by: kbuild test robot <lkp@intel.com>
Fixes: ce8fe7b77b ("KVM: nVMX: Free only the affected contexts when emulating INVEPT")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200504153506.28898-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use an unsigned long for 'exit_qual' in nested_vmx_reflect_vmexit(), as
the EXIT_QUALIFICATION field is naturally sized, not a 32-bit field.
The bug is most easily observed by doing VMXON (or any VMX instruction)
in L2 with a negative displacement, in which case dropping the upper
bits on nested VM-Exit results in L1 calculating the wrong virtual
address for the memory operand, e.g. "vmxon -0x8(%rbp)" yields:
Unhandled cpu exception 14 #PF at ip 0000000000400553
rbp=0000000000537000 cr2=0000000100536ff8
Fixes: fbdd502503 ("KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected()")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423001127.13490-1-sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop nested_vmx_l1_wants_exit()'s initialization of intr_info from
vmx_get_intr_info() that was inadvertently introduced along with the
caching mechanism. EXIT_REASON_EXCEPTION_NMI, the only consumer of
intr_info, populates the variable before using it.
Fixes: bb53120d67cd ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200421075328.14458-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clean up some of the patching of kvm_x86_ops, by moving kvm_x86_ops related to
nested virtualization into a separate struct.
As a result, these ops will always be non-NULL on VMX. This is not a problem:
* check_nested_events is only called if is_guest_mode(vcpu) returns true
* get_nested_state treats VMXOFF state the same as nested being disabled
* set_nested_state fails if you attempt to set nested state while
nesting is disabled
* nested_enable_evmcs could already be called on a CPU without VMX enabled
in CPUID.
* nested_get_evmcs_version was fixed in the previous patch
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the next patch nested_get_evmcs_version will be always set in kvm_x86_ops for
VMX, even if nesting is disabled. Therefore, check whether VMX (aka nesting)
is available in the function, the caller will not do the check anymore.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both Intel and AMD now implement it, so there is no need to check if the
callback is implemented.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a nested page fault is taken from an address that does not have
a memslot associated to it, kvm_mmu_do_page_fault returns RET_PF_EMULATE
(via mmu_set_spte) and kvm_mmu_page_fault then invokes svm_need_emulation_on_page_fault.
The default answer there is to return false, but in this case this just
causes the page fault to be retried ad libitum. Since this is not a
fast path, and the only other case where it is taken is an erratum,
just stick a kvm_vcpu_gfn_to_memslot check in there to detect the
common case where the erratum is not happening.
This fixes an infinite loop in the new set_memory_region_test.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In earlier versions of kvm, 'kvm_run' was an independent structure
and was not included in the vcpu structure. At present, 'kvm_run'
is already included in the vcpu structure, so the parameter
'kvm_run' is redundant.
This patch simplifies the function definition, removes the extra
'kvm_run' parameter, and extracts it from the 'kvm_vcpu' structure
if necessary.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Message-Id: <20200416051057.26526-1-tianjia.zhang@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "Canonicalization and Consistency Checks" in APM vol. 2,
the following guest state combination is illegal:
"CR0.CD is zero and CR0.NW is set"
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20200409205035.16830-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Observation in cloud environments shows that IPIs and the timer are the
main causes of MSR-write vmexits; let's optimize virtual IPI latency
more aggressively to inject the target IPI as soon as possible.
Running kvm-unit-tests/vmexit.flat IPI testing on SKX server, disable
adaptive advance lapic timer and adaptive halt-polling to avoid the
interference, this patch can give another 7% improvement.
w/o fastpath   -> x86.c fastpath:   4238 -> 3543  (16.4%)
x86.c fastpath -> vmx.c fastpath:   3543 -> 3293  (7%)
w/o fastpath   -> vmx.c fastpath:   4238 -> 3293  (22.3%)
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200410174703.1138-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark the VM-Fail, VM-Exit on VM-Enter, and #MC on VM-Enter paths as
'unlikely' to improve code generation so that it favors successful
VM-Enter. The performance of successful VM-Enter is far more important,
irrespective of whether or not success is actually likely.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200410174703.1138-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove all references to cr3_target_value[0-3] and replace the fields
in vmcs12 with "dead_space" to preserve the vmcs12 layout. KVM doesn't
support emulating CR3-target values, despite a variety of code that
implies otherwise, as KVM unconditionally reports '0' for the number of
supported CR3-target values.
This technically fixes a bug where KVM would incorrectly allow VMREAD
and VMWRITE to nonexistent fields, i.e. cr3_target_value[0-3]. Per
Intel's SDM, the number of supported CR3-target values reported in
VMX_MISC also enumerates the existence of the associated VMCS fields:
If a future implementation supports more than 4 CR3-target values, they
will be encoded consecutively following the 4 encodings given here.
Alternatively, the "bug" could be fixed by actually advertisting support
for 4 CR3-target values, but that'd likely just enable kvm-unit-tests
given that no one has complained about lack of support for going on ten
years, e.g. KVM, Xen and HyperV don't use CR3-target values.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200416000739.9012-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Create a new function kvm_is_visible_memslot() and use it from
kvm_is_visible_gfn(); use the new function in try_async_pf() too,
to avoid an extra memslot lookup.
Opportunistically squish a multi-line comment into a single-line comment.
Note, the end result, KVM_PFN_NOSLOT, is unchanged.
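For reference, the new helper is essentially (sketch):
    bool kvm_is_visible_memslot(struct kvm_memory_slot *memslot)
    {
            return (memslot && memslot->id < KVM_USER_MEM_SLOTS &&
                    !(memslot->flags & KVM_MEMSLOT_INVALID));
    }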
Cc: Jim Mattson <jmattson@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly set @writable to false in try_async_pf() if the GFN->PFN
translation is short-circuited due to the requested GFN not being
visible to L2.
Leaving @writable ('map_writable' in the callers) uninitialized is ok
in that it's never actually consumed, but one has to track it all the
way through set_spte() being short-circuited by set_mmio_spte() to
understand that the uninitialized variable is benign, and relying on
@writable being ignored is an unnecessary risk. Explicitly setting
@writable also aligns try_async_pf() with __gfn_to_pfn_memslot().
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415214414.10194-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a new "extended register" type, EXIT_INFO_2 (to pair with the
nomenclature in .get_exit_info()), and use it to cache VMX's
vmcs.EXIT_INTR_INFO. Drop a comment in vmx_recover_nmi_blocking() that
is obsoleted by the generic caching mechanism.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a new "extended register" type, EXIT_INFO_1 (to pair with the
nomenclature in .get_exit_info()), and use it to cache VMX's
vmcs.EXIT_QUALIFICATION.
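For illustration, the cached accessor has roughly this shape (sketch):
    static inline unsigned long vmx_get_exit_qual(struct kvm_vcpu *vcpu)
    {
            struct vcpu_vmx *vmx = to_vmx(vcpu);

            if (!kvm_register_is_available(vcpu, VCPU_EXREG_EXIT_INFO_1)) {
                    kvm_register_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_1);
                    vmx->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
            }
            return vmx->exit_qualification;
    }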
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the call to vmx_segment_cache_clear() in vmx_switch_vmcs() now that
the entire register cache is reset when switching the active VMCS, e.g.
vmx_segment_cache_test_set() will reset the segment cache due to
VCPU_EXREG_SEGMENTS being unavailable.
Move vmx_segment_cache_clear() to vmx.c now that it's no longer invoked
by the nested code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reset the per-vCPU available and dirty register masks when switching
between vmcs01 and vmcs02, as the masks track state relative to the
current VMCS. The stale masks don't cause problems in the current code
base because the registers are either unconditionally written on nested
transitions or, in the case of segment registers, have an additional
tracker that is manually reset.
Note, by dropping (previously implicitly, now explicitly) the dirty mask
when switching the active VMCS, KVM is technically losing writes to the
associated fields. But, the only regs that can be dirtied (RIP, RSP and
PDPTRs) are unconditionally written on nested transitions, e.g. explicit
writeback is a waste of cycles, and a WARN_ON would be rather pointless.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invoke ept_save_pdptrs() when restoring L1's host state on a "late"
VM-Fail if and only if PAE paging is enabled. This saves a CALL in the
common case where L1 is a 64-bit host, and avoids incorrectly marking
the PDPTRs as dirty.
WARN if ept_save_pdptrs() is called with PAE disabled now that the
nested usage pre-checks is_pae_paging(). Barring a bug in KVM's MMU,
attempting to read the PDPTRs with PAE disabled is now impossible.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use "vm_exit_reason" for code related to injecting a nested VM-Exit to
VM-Exits to make it clear that nested_vmx_vmexit() expects the full exit
eason, not just the basic exit reason. The basic exit reason (bits 15:0
of vmcs.VM_EXIT_REASON) is colloquially referred to as simply "exit
reason".
Note, other flows, e.g. vmx_handle_exit(), are intentionally left as is.
A future patch will convert vmx->exit_reason to a union + bit-field, and
the exempted flows will interact with the unionized "exit_reason".
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly check only the basic exit reason when emulating an external
interrupt VM-Exit in nested_vmx_vmexit(). Checking the full exit reason
doesn't currently cause problems, but only because the only exit reason
modifier supported by KVM is FAILED_VMENTRY, which is mutually exclusive
with EXTERNAL_INTERRUPT. Future modifiers, e.g. ENCLAVE_MODE, will
coexist with EXTERNAL_INTERRUPT.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Grab the exit reason from the vcpu struct in nested_vmx_reflect_vmexit()
instead of having the exit reason explicitly passed from the caller.
This fixes a discrepancy between VM-Fail and VM-Exit handling, as the
VM-Fail case is already handled by checking vcpu_vmx, e.g. the exit
reason previously passed on the stack is bogus if vmx->fail is set.
Not taking the exit reason on the stack also avoids having to document
that nested_vmx_reflect_vmexit() requires the full exit reason, as
opposed to just the basic exit reason, which is not at all obvious since
the only usages of the full exit reason are for tracing and way down in
prepare_vmcs12() where it's propagated to vmcs12.
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the WARN in nested_vmx_reflect_vmexit() that fires if KVM attempts
to reflect an external interrupt. The WARN is blatantly impossible to
hit now that nested_vmx_l0_wants_exit(), which is called from
nested_vmx_reflect_vmexit(), unconditionally returns true for
EXTERNAL_INTERRUPT.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split the logic that determines whether a nested VM-Exit is reflected
into L1 into "L0 wants" and "L1 wants" to document the core control flow
at a high level. If L0 wants the VM-Exit, e.g. because the exit is due
to a hardware event that isn't passed through to L1, then KVM should
handle the exit in L0 without considering L1's configuration. Then, if
L0 doesn't want the exit, KVM needs to query L1's wants to determine
whether or not L1 "caused" the exit, e.g. by setting an exiting control,
versus the exit occurring due to an L0 setting, e.g. when L0 intercepts
an action that L1 chose to pass-through.
Note, this adds an extra read on vmcs.VM_EXIT_INTR_INFO for exception
VM-Exits.
This will be addressed in a future patch via a VMX-wide enhancement,
rather than pile on another case where vmx->exit_intr_info is
conditionally available.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the tracepoint for nested VM-Exits in preparation of splitting the
reflection logic into L1 wants the exit vs. L0 always handles the exit.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check for VM-Fail on nested VM-Enter in nested_vmx_reflect_vmexit() in
preparation for separating nested_vmx_exit_reflected() into separate "L0
wants the exit" and "L1 wants the exit" helpers.
Explicitly set exit_intr_info and exit_qual to zero instead of reading
them from vmcs02, as they are invalid on VM-Fail (and thankfully ignored
by nested_vmx_vmexit() for nested VM-Fail).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Uninline nested_vmx_reflect_vmexit() in preparation of refactoring
nested_vmx_exit_reflected() to split up the reflection logic into more
consumable chunks, e.g. VM-Fail vs. L1 wants the exit vs. L0 always
handles the exit.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the call to nested_vmx_exit_reflected() from vmx_handle_exit() into
nested_vmx_reflect_vmexit() and change the semantics of the return value
for nested_vmx_reflect_vmexit() to indicate whether or not the exit was
reflected into L1. nested_vmx_exit_reflected() and
nested_vmx_reflect_vmexit() are intrinsically tied together, calling one
without simultaneously calling the other makes little sense.
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The macros VM_STAT and VCPU_STAT are redundantly implemented in multiple
files, each used by a different architecture to initialize the debugfs
entries for statistics. Since they all have the same purpose, they can be
unified in a single common definition in include/linux/kvm_host.h
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20200414155625.20559-1-eesposit@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use do_machine_check instead of INT $12 to pass MCE to the host,
the same approach VMX uses.
On a related note, there is no reason to limit the use of do_machine_check
to 64 bit targets, as is currently done for VMX. MCE handling works
for both target families.
The patch is only compile-tested, for both 64 and 32 bit targets;
someone should test the passing of the exception by injecting
some MCEs into the guest.
For a future non-RFC patch, kvm_machine_check should be moved to some
appropriate header file.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20200411153627.3474710-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename @cr3 to @pgd in vmx_load_mmu_pgd() to reflect that it will be
loaded into vmcs.EPT_POINTER and not vmcs.GUEST_CR3 when EPT is enabled.
Similarly, load guest_cr3 with @pgd if and only if EPT is disabled.
This fixes one of the last, if not _the_ last, cases in KVM where a
variable that is not strictly a cr3 value uses "cr3" instead of "pgd".
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-38-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename functions and variables in kvm_mmu_new_cr3() and related code to
replace "cr3" with "pgd", i.e. continue the work started by commit
727a7e27cf ("KVM: x86: rename set_cr3 callback and related flags to
load_mmu_pgd"). kvm_mmu_new_cr3() and company are not always loading a
new CR3, e.g. when nested EPT is enabled "cr3" is actually an EPTP.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-37-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add logic to handle_invept() to free only those roots that match the
target EPT context when emulating a single-context INVEPT.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-36-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unconditionally skip the TLB flush triggered when reusing a root for a
nested transition as nested_vmx_transition_tlb_flush() ensures the TLB
is flushed when needed, regardless of whether the MMU can reuse a cached
root (or the last root).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-35-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip the MMU sync when reusing a cached root if EPT is enabled or L1
enabled VPID for L2.
If EPT is enabled, guest-physical mappings aren't flushed even if VPID
is disabled, i.e. L1 can't expect stale TLB entries to be flushed if it
has enabled EPT and L0 isn't shadowing PTEs (for L1 or L2) if L1 has
EPT disabled.
If VPID is enabled (and EPT is disabled), then L1 can't expect stale TLB
entries to be flushed (for itself or L2).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-34-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a module param, flush_on_reuse, to override skip_tlb_flush and
skip_mmu_sync when performing a so called "fast cr3 switch", i.e. when
reusing a cached root. The primary motivation for the control is to
provide a fallback mechanism in the event that TLB flushing and/or MMU
sync bugs are exposed/introduced by upcoming changes to stop
unconditionally flushing on nested VMX transitions.
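For reference, the override is a plain module param, roughly (a sketch;
the variable name is assumed):
    static bool __read_mostly force_flush_and_sync_on_reuse;
    module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);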
Suggested-by: Jim Mattson <jmattson@google.com>
Suggested-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-33-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a separate "skip" override for MMU sync, a future change to avoid
TLB flushes on nested VMX transitions may need to sync the MMU even if
the TLB flush is unnecessary.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-32-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle the side effects of a fast CR3 (PGD) switch up a level in
__kvm_mmu_new_cr3(), which is the only caller of fast_cr3_switch().
This consolidates handling all side effects in __kvm_mmu_new_cr3()
(where freeing the current root when KVM can't do a fast switch is
already handled), and ameliorates the pain of adding a second boolean in
a future patch to provide a separate "skip" override for the MMU sync.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-31-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't reload the APIC access page if its control is disabled, e.g. if
the guest is running with x2APIC (likely) or with the local APIC
disabled (unlikely), to avoid unnecessary TLB flushes and VMWRITEs.
Unconditionally reload the APIC access page and flush the TLB when
the guest's virtual APIC transitions to "xAPIC enabled", as any
changes to the APIC access page's mapping will not be recorded while
the guest's virtual APIC is disabled.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-30-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the retrieval of the HPA associated with L1's APIC access page into
VMX code to avoid unnecessarily calling gfn_to_page(), e.g. when the
vCPU is in guest mode (L2). Alternatively, the optimization logic in
VMX could be mirrored into the common x86 code, but that will get ugly
fast when further optimizations are introduced.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-29-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Defer reloading L1's APIC page by logging the need for a reload and
processing it during nested VM-Exit instead of unconditionally reloading
the APIC page on nested VM-Exit. This eliminates a TLB flush on the
majority of VM-Exits as the APIC page rarely needs to be reloaded.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320212833.3507-28-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>