Hyper-V identifies vCPUs by Virtual Processor Index, which can be
queried via HV_X64_MSR_VP_INDEX msr. It is defined by the spec as a
sequential number which can't exceed the maximum number of vCPUs per VM.
APIC ids can be sparse and thus aren't a valid replacement for VP
indices.
Current KVM uses its internal vcpu index as VP_INDEX. However, to make
it predictable and persistent across VM migrations, the userspace has to
control the value of VP_INDEX.
This patch achieves that, by storing vp_index explicitly on vcpu, and
allowing HV_X64_MSR_VP_INDEX to be set from the host side. For
compatibility it's initialized to KVM vcpu index. Also a few variables
are renamed to make clear distinction betweed this Hyper-V vp_index and
KVM vcpu_id (== APIC id). Besides, a new capability,
KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip
attempting msr writes where unsupported, to avoid spamming error logs.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1,
async page faults are delivered to L1 as #PF vmexits; if bit 2 is 0,
kvm_can_do_async_pf returns 0 if in guest mode.
This is similar to what svm.c wanted to do all along, but it is only
enabled for Linux as L1 hypervisor. Foreign hypervisors must never
receive async page faults as vmexits, because they'd probably be very
confused about that.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Add an nested_apf field to vcpu->arch.exception to identify an async page
fault, and constructs the expected vm-exit information fields. Force a
nested VM exit from nested_vmx_check_exception() if the injected #PF is
async page fault.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This patch adds the L1 guest async page fault #PF vmexit handler, such
by L1 similar to ordinary async page fault.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Passed insn parameters to kvm_mmu_page_fault().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This patch removes all arguments except the first in
kvm_x86_ops->queue_exception since they can extract the arguments from
vcpu->arch.exception themselves.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
There is a flaw in the Hyper-V SynIC implementation in KVM: when message
page or event flags page is enabled by setting the corresponding msr,
KVM zeroes it out. This is problematic because on migration the
corresponding MSRs are loaded on the destination, so the content of
those pages is lost.
This went unnoticed so far because the only user of those pages was
in-KVM hyperv synic timers, which could continue working despite that
zeroing.
Newer QEMU uses those pages for Hyper-V VMBus implementation, and
zeroing them breaks the migration.
Besides, in newer QEMU the content of those pages is fully managed by
QEMU, so zeroing them is undesirable even when writing the MSRs from the
guest side.
To support this new scheme, introduce a new capability,
KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic
pages aren't zeroed out in KVM.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The backwards_tsc_observed global introduced in commit 16a9602 is never
reset to false. If a VM happens to be running while the host is suspended
(a common source of the TSC jumping backwards), master clock will never
be enabled again for any VM. In contrast, if no VM is running while the
host is suspended, master clock is unaffected. This is inconsistent and
unnecessarily strict. Let's track the backwards_tsc_observed variable
separately and let each VM start with a clean slate.
Real world impact: My Windows VMs get slower after my laptop undergoes a
suspend/resume cycle. The only way to get the perf back is unloading and
reloading the kvm module.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Enable the Virtual VMLOAD VMSAVE feature. This is done by setting bit 1
at position B8h in the vmcb.
The processor must have nested paging enabled, be in 64-bit mode and
have support for the Virtual VMLOAD VMSAVE feature for the bit to be set
in the vmcb.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Rename the lbr_ctl variable to better reflect the purpose of the field -
provide support for virtualization extensions.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The lbr_ctl variable in the vmcb control area is used to enable or
disable Last Branch Record (LBR) virtualization. However, this is to be
done using only bit 0 of the variable. To correct this and to prepare
for a new feature, change the current usage to work only on a particular
bit.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
kvm_skip_emulated_instruction handles the singlestep debug exception
which is something we almost always want. This commit (specifically
the change in rdmsr_interception) makes the debug.flat KVM unit test
pass on AMD.
Two call sites still call skip_emulated_instruction directly:
* In svm_queue_exception where it's used only for moving the rip forward
* In task_switch_interception which is analogous to handle_task_switch
in VMX
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
kvm_vm_release() did not have slots_lock when calling
kvm_io_bus_unregister_dev() and this went unnoticed until 4a12f95177
("KVM: mark kvm->busses as rcu protected") added dynamic checks.
Luckily, there should be no race at that point:
=============================
WARNING: suspicious RCU usage
4.12.0.kvm+ #0 Not tainted
-----------------------------
./include/linux/kvm_host.h:479 suspicious rcu_dereference_check() usage!
lockdep_rcu_suspicious+0xc5/0x100
kvm_io_bus_unregister_dev+0x173/0x190 [kvm]
kvm_free_pit+0x28/0x80 [kvm]
kvm_arch_sync_events+0x2d/0x30 [kvm]
kvm_put_kvm+0xa7/0x2a0 [kvm]
kvm_vm_release+0x21/0x30 [kvm]
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
vmx_complete_atomic_exit should call kvm_machine_check for any
VM-entry failure due to a machine-check event. Such an exit should be
recognized solely by its basic exit reason (i.e. the low 16 bits of
the VMCS exit reason field). None of the other VMCS exit information
fields contain valid information when the VM-exit is due to "VM-entry
failure due to machine-check event".
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@tencent.com>
[Changed VM_EXIT_INTR_INFO condition to better describe its reason.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
kvm master clock usually has a different frequency than the kernel boot
clock. This is not a problem until the master clock is updated;
update uses the current kernel boot clock to compute new kvm clock,
which erases any kvm clock cycles that might have built up due to
frequency difference over a long period.
KVM_SET_CLOCK is one of places where we can safely update master clock
as the guest-visible clock is going to be shifted anyway.
The problem with current code is that it updates the kvm master clock
after updating the offset. If the master clock was enabled before
calling KVM_SET_CLOCK, then it might have built up a significant delta
from kernel boot clock.
In the worst case, the time set by userspace would be shifted by so much
that it couldn't have been set at any point during KVM_SET_CLOCK.
To fix this, move kvm_gen_update_masterclock() before computing
kvmclock_offset, which means that the master clock and kernel boot clock
will be sufficiently close together.
Another solution would be to replace get_kvmclock_ns() with
"ktime_get_boot_ns() + ka->kvmclock_offset", which is marginally more
accurate, but would break symmetry with KVM_GET_CLOCK.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Inconsistencies result from shadowing only accesses to the full
64-bits of a 64-bit VMCS field, but not shadowing accesses to the high
32-bits of the field. The "high" part of a 64-bit field should be
shadowed whenever the full 64-bit field is shadowed.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow the L1 guest to specify the last page of addressable guest
physical memory for an L2 MSR permission bitmap. Also remove the
vmcs12_read_any() check that should never fail.
Fixes: 3af18d9c5f ("KVM: nVMX: Prepare for using hardware MSR bitmap")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to the SDM, if the "use I/O bitmaps" VM-execution control is
1, bits 11:0 of each I/O-bitmap address must be 0. Neither address
should set any bits beyond the processor's physical-address width.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VMCS launch state is not set to "launched" unless the VMLAUNCH
actually succeeds. VMLAUNCH failure includes VM-exits with bit 31 set.
Note that this change does not address the general problem that a
failure to launch/resume vmcs02 (i.e. vmx->fail) is not handled
correctly.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This exit ended up being reported, but the currently exposed data does not provide
much of a starting point for debugging. In the reported case, the vmexit was
an EPT misconfiguration (MMIO access). Let userspace report ethe exit qualification
and, if relevant, the GPA.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Better machine check handling for HV KVM
- Ability to support guests with threads=2, 4 or 8 on POWER9
- Fix for a race that could cause delayed recognition of signals
- Fix for a bug where POWER9 guests could sleep with interrupts pending.
ARM:
- VCPU request overhaul
- allow timer and PMU to have their interrupt number selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
s390:
- initial machine check forwarding
- migration support for the CMMA page hinting information
- cleanups and fixes
x86:
- nested VMX bugfixes and improvements
- more reliable NMI window detection on AMD
- APIC timer optimizations
Generic:
- VCPU request overhaul + documentation of common code patterns
- kvm_stat improvements
There is a small conflict in arch/s390 due to an arch-wide field rename.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJZW4XTAAoJEL/70l94x66DkhMH/izpk54KI17PtyQ9VYI2sYeZ
BWK6Kl886g3ij4pFi3pECqjDJzWaa3ai+vFfzzpJJ8OkCJT5Rv4LxC5ERltVVmR8
A3T1I/MRktSC0VJLv34daPC2z4Lco/6SPipUpPnL4bE2HATKed4vzoOjQ3tOeGTy
dwi7TFjKwoVDiM7kPPDRnTHqCe5G5n13sZ49dBe9WeJ7ttJauWqoxhlYosCGNPEj
g8ZX8+cvcAhVnz5uFL8roqZ8ygNEQq2mgkU18W8ZZKuiuwR0gdsG0gSBFNTdwIMK
NoreRKMrw0+oLXTIB8SZsoieU6Qi7w3xMAMabe8AJsvYtoersugbOmdxGCr1lsA=
=OD7H
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"PPC:
- Better machine check handling for HV KVM
- Ability to support guests with threads=2, 4 or 8 on POWER9
- Fix for a race that could cause delayed recognition of signals
- Fix for a bug where POWER9 guests could sleep with interrupts pending.
ARM:
- VCPU request overhaul
- allow timer and PMU to have their interrupt number selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
s390:
- initial machine check forwarding
- migration support for the CMMA page hinting information
- cleanups and fixes
x86:
- nested VMX bugfixes and improvements
- more reliable NMI window detection on AMD
- APIC timer optimizations
Generic:
- VCPU request overhaul + documentation of common code patterns
- kvm_stat improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
Update my email address
kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12
kvm: x86: mmu: allow A/D bits to be disabled in an mmu
x86: kvm: mmu: make spte mmio mask more explicit
x86: kvm: mmu: dead code thanks to access tracking
KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code
KVM: PPC: Book3S HV: Close race with testing for signals on guest entry
KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
KVM: x86: remove ignored type attribute
KVM: LAPIC: Fix lapic timer injection delay
KVM: lapic: reorganize restart_apic_timer
KVM: lapic: reorganize start_hv_timer
kvm: nVMX: Check memory operand to INVVPID
KVM: s390: Inject machine check into the nested guest
KVM: s390: Inject machine check into the guest
tools/kvm_stat: add new interactive command 'b'
tools/kvm_stat: add new command line switch '-i'
tools/kvm_stat: fix error on interactive command 'g'
KVM: SVM: suppress unnecessary NMI singlestep on GIF=0 and nested exit
...
It's easier for host applications, such as QEMU, if they can always
access guest MSR_IA32_BNDCFGS in VMCS, even though MPX is disabled in
guest cpuid.
Cc: stable@vger.kernel.org
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86 mm updates from Ingo Molnar:
"The main changes in this cycle were:
- Continued work to add support for 5-level paging provided by future
Intel CPUs. In particular we switch the x86 GUP code to the generic
implementation. (Kirill A. Shutemov)
- Continued work to add PCID CPU support to native kernels as well.
In this round most of the focus is on reworking/refreshing the TLB
flush infrastructure for the upcoming PCID changes. (Andy
Lutomirski)"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
x86/mm: Delete a big outdated comment about TLB flushing
x86/mm: Don't reenter flush_tlb_func_common()
x86/KASLR: Fix detection 32/64 bit bootloaders for 5-level paging
x86/ftrace: Exclude functions in head64.c from function-tracing
x86/mmap, ASLR: Do not treat unlimited-stack tasks as legacy mmap
x86/mm: Remove reset_lazy_tlbstate()
x86/ldt: Simplify the LDT switching logic
x86/boot/64: Put __startup_64() into .head.text
x86/mm: Add support for 5-level paging for KASLR
x86/mm: Make kernel_physical_mapping_init() support 5-level paging
x86/mm: Add sync_global_pgds() for configuration with 5-level paging
x86/boot/64: Add support of additional page table level during early boot
x86/boot/64: Rename init_level4_pgt and early_level4_pgt
x86/boot/64: Rewrite startup_64() in C
x86/boot/compressed: Enable 5-level paging during decompression stage
x86/boot/efi: Define __KERNEL32_CS GDT on 64-bit configurations
x86/boot/efi: Fix __KERNEL_CS definition of GDT entry on 64-bit configurations
x86/boot/efi: Cleanup initialization of GDT entries
x86/asm: Fix comment in return_from_SYSCALL_64()
x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
...
EPT A/D was enabled in the vmcs02 EPTP regardless of the vmcs12's EPTP
value. The problem is that enabling A/D changes the behavior of L2's
x86 page table walks as seen by L1. With A/D enabled, x86 page table
walks are always treated as EPT writes.
Commit ae1e2d1082 ("kvm: nVMX: support EPT accessed/dirty bits",
2017-03-30) tried to work around this problem by clearing the write
bit in the exit qualification for EPT violations triggered by page
walks. However, that fixup introduced the opposite bug: page-table walks
that actually set x86 A/D bits were *missing* the write bit in the exit
qualification.
This patch fixes the problem by disabling EPT A/D in the shadow MMU
when EPT A/D is disabled in vmcs12's EPTP.
Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds the plumbing to disable A/D bits in the MMU based on a new role
bit, ad_disabled. When A/D is disabled, the MMU operates as though A/D
aren't available (i.e., using access tracking faults instead).
To avoid SP -> kvm_mmu_page.role.ad_disabled lookups all over the
place, A/D disablement is now stored in the SPTE. This state is stored
in the SPTE by tweaking the use of SPTE_SPECIAL_MASK for access
tracking. Rather than just setting SPTE_SPECIAL_MASK when an
access-tracking SPTE is non-present, we now always set
SPTE_SPECIAL_MASK for access-tracking SPTEs.
Signed-off-by: Peter Feiner <pfeiner@google.com>
[Use role.ad_disabled even for direct (non-shadow) EPT page tables. Add
documentation and a few MMU_WARN_ONs. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Specify both a mask (i.e., bits to consider) and a value (i.e.,
pattern of bits that indicates a special PTE) for mmio SPTEs. On
Intel, this lets us pack even more information into the
(SPTE_SPECIAL_MASK | EPT_VMX_RWX_MASK) mask we use for access
tracking liberating all (SPTE_SPECIAL_MASK | (non-misconfigured-RWX))
values.
Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The MMU always has hardware A bits or access tracking support, thus
it's unnecessary to handle the scenario where we have neither.
Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The macro insn_fetch marks the 'type' argument as having a specified
alignment. Type attributes can only be applied to structs, unions, or
enums, but insn_fetch is only ever invoked with integral types, so Clang
produces 19 -Wignored-attributes warnings for this source file.
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- vcpu request overhaul
- allow timer and PMU to have their interrupt number
selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAllWCM0VHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDjJ0QAI16x6+trKhH31lTSYekYfqm4hZ2
Fp7IbALW9KNCaY35tZov2Zuh99qGRduxTh7ewqhKpON8kkU+UKj0F7zH22+vfN4m
yas/+uNr8R9VLyvea4ysPsgx8Q8v1Ix9setohHYNZIL9/klVqtaHpYvArHVF/mzq
p2j/NxRS2dlp9r2TtoMRMhA05u6r0wolhUuh+z9v2ipib0gfOBIG24jsqCTEcD9n
5A/cVd+ztYshkrV95h3y9peahwt3zOA4QBGzrQ2K25jp0s54nqhmC7JTNSa8dtar
YGW2MuAMoIFTwCFAlpwCzrwpOJFzF3Q6A8bOxei2fjclzjPMgT1xQxuhOoe4ntFa
lTPxSHalm5W6dFTW90YSo2DBcPe+N7sQkhjR0cCeY3GYsOFhXMLTlOl5Pt1YK1or
+3FAI74tFRKvVmb9mhZeGTvuzhDgRvtf3Qq5rjwlGzKc2BBOEgtMyj/Wgwo4N6Dz
IjOnoRaUGELoBCWoTorMxLpsPBdPVSUxNyJTdAhqZ/ZtT1xqjhFNLZcrVWmOTzDM
1cav+jZkla4sLmJSNDD54aCSvvtPHis0nZn9PRlh12xgOyYiAVx4K++MNuWP0P37
hbh1gbPT+FcoVxPurUsX/pjNlTucPZcBwFytZDQlpwtPBpEFzJiImLYe/PldRb0f
9WQOH1Y1+q14MF+N
=6hNK
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/ARM updates for 4.13
- vcpu request overhaul
- allow timer and PMU to have their interrupt number
selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
Conflicts:
arch/s390/include/asm/kvm_host.h
In preparation for an objtool rewrite which will have broader checks,
whitelist functions and files which cause problems because they do
unusual things with the stack.
These whitelists serve as a TODO list for which functions and files
don't yet have undwarf unwinder coverage. Eventually most of the
whitelists can be removed in favor of manual CFI hint annotations or
objtool improvements.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/7f934a5d707a574bda33ea282e9478e627fb1829.1498659915.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the TSC deadline timer is programmed really close to the deadline or
even in the past, the computation in vmx_set_hv_timer will program the
absolute target tsc value to vmcs preemption timer field w/ delta == 0,
then plays a vmentry and an upcoming vmx preemption timer fire vmexit
dance, the lapic timer injection is delayed due to this duration. Actually
the lapic timer which is emulated by hrtimer can handle this correctly.
This patch fixes it by firing the lapic timer and injecting a timer interrupt
immediately during the next vmentry if the TSC deadline timer is programmed
really close to the deadline or even in the past. This saves ~300 cycles on
the tsc_deadline_timer test of apic.flat.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the code to cancel the hv timer into the caller, just before
it starts the hrtimer. Check availability of the hv timer in
start_hv_timer.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are many cases in which the hv timer must be canceled. Split out
a new function to avoid duplication.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The memory operand fetched for INVVPID is 128 bits. Bits 63:16 are
reserved and must be zero. Otherwise, the instruction fails with
VMfail(Invalid operand to INVEPT/INVVPID). If the INVVPID_TYPE is 0
(individual address invalidation), then bits 127:64 must be in
canonical form, or the instruction fails with VMfail(Invalid operand
to INVEPT/INVVPID).
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
enable_nmi_window is supposed to be a no-op if we know that we'll see
a VM exit by the time the NMI window opens. This commit adds two more
cases:
* We intercept stgi so we don't need to singlestep on GIF=0.
* We emulate nested vmexit so we don't need to singlestep when nested
VM exit is required.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Singlestepping is enabled by setting the TF flag and care must be
taken to not let the guest see (and reuse at an inconvenient time)
the modified rflag value. One such case is event injection, as part
of which flags are pushed on the stack and restored later on iret.
This commit disables singlestepping when we're about to inject an
event and forces an immediate exit for us to re-evaluate the NMI
related state.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These flags are used internally by SVM so it's cleaner to not leak
them to callers of svm_get_rflags. This is similar to how the TF
flag is handled on KVM_GUESTDBG_SINGLESTEP by kvm_get_rflags and
kvm_set_rflags.
Without this change, the flags may propagate from host VMCB to nested
VMCB or vice versa while singlestepping over a nested VM enter/exit,
and then get stuck in inappropriate places.
Example: NMI singlestepping is enabled while running L1 guest. The
instruction to step over is VMRUN and nested vmrun emulation stashes
rflags to hsave->save.rflags. Then if singlestepping is disabled
while still in L2, TF/RF will be cleared from the nested VMCB but the
next nested VM exit will restore them from hsave->save.rflags and
cause an unexpected DB exception.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nested hypervisor should not see singlestep VM exits if singlestepping
was enabled internally by KVM. Windows is particularly sensitive to this
and known to bluescreen on unexpected VM exits.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Just moving the code to a new helper in preparation for following
commits.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TF is handled a bit differently for syscall and sysret, compared
to the other instructions: TF is checked after the instruction completes,
so that the OS can disable #DB at a syscall by adding TF to FMASK.
When the sysret is executed the #DB is taken "as if" the syscall insn
just completed.
KVM emulates syscall so that it can trap 32-bit syscall on Intel processors.
Fix the behavior, otherwise you could get #DB on a user stack which is not
nice. This does not affect Linux guests, as they use an IST or task gate
for #DB.
This fixes CVE-2017-7518.
Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The kernel has several code paths that read CR3. Most of them assume that
CR3 contains the PGD's physical address, whereas some of them awkwardly
use PHYSICAL_PAGE_MASK to mask off low bits.
Add explicit mask macros for CR3 and convert all of the CR3 readers.
This will keep them from breaking when PCID is enabled.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: xen-devel <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/883f8fb121f4616c1c1427ad87350bb2f5ffeca1.1497288170.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
Not tainted 4.12.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gnome-terminal- D 0 1734 1015 0x00000000
Call Trace:
__schedule+0x3cd/0xb30
schedule+0x40/0x90
kvm_async_pf_task_wait+0x1cc/0x270
? __vfs_read+0x37/0x150
? prepare_to_swait+0x22/0x70
do_async_page_fault+0x77/0xb0
? do_async_page_fault+0x77/0xb0
async_page_fault+0x28/0x30
This is triggered by running both win7 and win2016 on L1 KVM simultaneously,
and then gives stress to memory on L1, I can observed this hang on L1 when
at least ~70% swap area is occupied on L0.
This is due to async pf was injected to L2 which should be injected to L1,
L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host
actually), and L1 guest starts accumulating tasks stuck in D state in
kvm_async_pf_task_wait() since missing PAGE_READY async_pfs.
This patch fixes the hang by doing async pf when executing L1 guest.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If "i" is the last element in the vcpu->arch.cpuid_entries[] array, it
potentially can be exploited the vulnerability. this will out-of-bounds
read and write. Luckily, the effect is small:
/* when no next entry is found, the current entry[i] is reselected */
for (j = i + 1; ; j = (j + 1) % nent) {
struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j];
if (ej->function == e->function) {
It reads ej->maxphyaddr, which is user controlled. However...
ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT;
After cpuid_entries there is
int maxphyaddr;
struct x86_emulate_ctxt emulate_ctxt; /* 16-byte aligned */
So we have:
- cpuid_entries at offset 1B50 (6992)
- maxphyaddr at offset 27D0 (6992 + 3200 = 10192)
- padding at 27D4...27DF
- emulate_ctxt at 27E0
And it writes in the padding. Pfew, writing the ops field of emulate_ctxt
would have been much worse.
This patch fixes it by modding the index to avoid the out-of-bounds
access. Worst case, i == j and ej->function == e->function,
the loop can bail out.
Reported-by: Moguofang <moguofang@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Guofang Mo <moguofang@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The guest-linear address field is set for VM exits due to attempts to
execute LMSW with a memory operand and VM exits due to attempts to
execute INS or OUTS for which the relevant segment is usable,
regardless of whether or not EPT is in use.
Fixes: 119a9c01a5 ("KVM: nVMX: pass valid guest linear-address to the L1")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The XSS-exiting bitmap is a VMCS control field that does not change
while the CPU is in non-root mode. Transferring the unchanged value
from vmcs02 to vmcs12 is unnecessary.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Bits 11:2 must be zero and the linear addess in bits 63:12 must be
canonical. Otherwise, WRMSR(BNDCFGS) should raise #GP.
Fixes: 0dd376e709 ("KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The BNDCFGS MSR should only be exposed to the guest if the guest
supports MPX. (cf. the TSC_AUX MSR and RDTSCP.)
Fixes: 0dd376e709 ("KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save")
Change-Id: I3ad7c01bda616715137ceac878f3fa7e66b6b387
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The MSR permission bitmaps are shared by all VMs. However, some VMs
may not be configured to support MPX, even when the host does. If the
host supports VMX and the guest does not, we should intercept accesses
to the BNDCFGS MSR, so that we can synthesize a #GP
fault. Furthermore, if the host does not support MPX and the
"ignore_msrs" kvm kernel parameter is set, then we should intercept
accesses to the BNDCFGS MSR, so that we can skip over the rdmsr/wrmsr
without raising a #GP fault.
Fixes: da8999d318 ("KVM: x86: Intel MPX vmx and msr handle")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
WARNING: CPU: 3 PID: 2840 at arch/x86/kvm/vmx.c:10966 nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
CPU: 3 PID: 2840 Comm: qemu-system-x86 Tainted: G OE 4.12.0-rc3+ #23
RIP: 0010:nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
Call Trace:
? kvm_check_async_pf_completion+0xef/0x120 [kvm]
? rcu_read_lock_sched_held+0x79/0x80
vmx_queue_exception+0x104/0x160 [kvm_intel]
? vmx_queue_exception+0x104/0x160 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x1171/0x1ce0 [kvm]
? kvm_arch_vcpu_load+0x47/0x240 [kvm]
? kvm_arch_vcpu_load+0x62/0x240 [kvm]
kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
? __fget+0xf3/0x210
do_vfs_ioctl+0xa4/0x700
? __fget+0x114/0x210
SyS_ioctl+0x79/0x90
do_syscall_64+0x81/0x220
entry_SYSCALL64_slow_path+0x25/0x25
This is triggered occasionally by running both win7 and win2016 in L2, in
addition, EPT is disabled on both L1 and L2. It can't be reproduced easily.
Commit 0b6ac343fc (KVM: nVMX: Correct handling of exception injection) mentioned
that "KVM wants to inject page-faults which it got to the guest. This function
assumes it is called with the exit reason in vmcs02 being a #PF exception".
Commit e011c663 (KVM: nVMX: Check all exceptions for intercept during delivery to
L2) allows to check all exceptions for intercept during delivery to L2. However,
there is no guarantee the exit reason is exception currently, when there is an
external interrupt occurred on host, maybe a time interrupt for host which should
not be injected to guest, and somewhere queues an exception, then the function
nested_vmx_check_exception() will be called and the vmexit emulation codes will
try to emulate the "Acknowledge interrupt on exit" behavior, the warning is
triggered.
Reusing the exit reason from the L2->L0 vmexit is wrong in this case,
the reason must always be EXCEPTION_NMI when injecting an exception into
L1 as a nested vmexit.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: e011c663b9 ("KVM: nVMX: Check all exceptions for intercept during delivery to L2")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When PCID is enabled, CR3's PCID bits can change during context
switches, so KVM won't be able to treat CR3 as a per-mm constant any
more.
I structured this like the existing CR4 handling. Under ordinary
circumstances (PCID disabled or if the current PCID and the value
that's already in the VMCS match), then we won't do an extra VMCS
write, and we'll never do an extra direct CR3 read. The overhead
should be minimal.
I disallowed using the new helper in non-atomic context because
PCID support will cause CR3 to stop being constant in non-atomic
process context.
(Frankly, it also scares me a bit that KVM ever treated CR3 as
constant, but it looks like it was okay before.)
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>