An interrupt handler that uses the fpu can kill a KVM VM, if it runs
under the following conditions:
- the guest's xcr0 register is loaded on the cpu
- the guest's fpu context is not loaded
- the host is using eagerfpu
Note that the guest's xcr0 register and fpu context are not loaded as
part of the atomic world switch into "guest mode". They are loaded by
KVM while the cpu is still in "host mode".
Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
interrupt handler will look something like this:
if (irq_fpu_usable()) {
kernel_fpu_begin();
[... code that uses the fpu ...]
kernel_fpu_end();
}
As long as the guest's fpu is not loaded and the host is using eager
fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
returns true). The interrupt handler proceeds to use the fpu with
the guest's xcr0 live.
kernel_fpu_begin() saves the current fpu context. If this uses
XSAVE[OPT], it may leave the xsave area in an undesirable state.
According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
xcr0[i] == 0 following an XSAVE.
kernel_fpu_end() restores the fpu context. Now if any bit i in
XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
fault is trapped and SIGSEGV is delivered to the current process.
Only pre-4.2 kernels appear to be vulnerable to this sequence of
events. Commit 653f52c ("kvm,x86: load guest FPU context more eagerly")
from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.
This patch fixes the bug by keeping the host's xcr0 loaded outside
of the interrupts-disabled region where KVM switches into guest mode.
Cc: stable@vger.kernel.org
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David Matlack <dmatlack@google.com>
[Move load after goto cancel_injection. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Non maskable interrupts (NMI) are preferred to interrupts in current
implementation. If a NMI is pending and NMI is blocked by the result
of nmi_allowed(), pending interrupt is not injected and
enable_irq_window() is not executed, even if interrupts injection is
allowed.
In old kernel (e.g. 2.6.32), schedule() is often called in NMI context.
In this case, interrupts are needed to execute iret that intends end
of NMI. The flag of blocking new NMI is not cleared until the guest
execute the iret, and interrupts are blocked by pending NMI. Due to
this, iret can't be invoked in the guest, and the guest is starved
until block is cleared by some events (e.g. canceling injection).
This patch injects pending interrupts, when it's allowed, even if NMI
is blocked. And, If an interrupts is pending after executing
inject_pending_event(), enable_irq_window() is executed regardless of
NMI pending counter.
Cc: stable@vger.kernel.org
Signed-off-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The barrier also orders the write to mode from any reads
to the page tables done and so update the comment.
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
CPUID.7.0.ECX[3]:PKU. X86_FEATURE_OSPKE is software support for pkeys,
enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of
CR4.PKE(bit 22).
This patch disables CPUID:PKU without ept, because pkeys is not yet
implemented for shadow paging.
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Protection keys define a new 4-bit protection key field (PKEY) in bits
62:59 of leaf entries of the page tables, the PKEY is an index to PKRU
register(16 domains), every domain has 2 bits(write disable bit, access
disable bit).
Static logic has been produced in update_pkru_bitmask, dynamic logic need
read pkey from page table entries, get pkru value, and deduce the correct
result.
[ Huaitong: Xiao helps to modify many sections. ]
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
but lots of architecture-specific changes.
* ARM:
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- various optimizations to the vgic save/restore code.
* PPC:
- enabled KVM-VFIO integration ("VFIO device")
- optimizations to speed up IPIs between vcpus
- in-kernel handling of IOMMU hypercalls
- support for dynamic DMA windows (DDW).
* s390:
- provide the floating point registers via sync regs;
- separated instruction vs. data accesses
- dirty log improvements for huge guests
- bugfixes and documentation improvements.
* x86:
- Hyper-V VMBus hypercall userspace exit
- alternative implementation of lowest-priority interrupts using vector
hashing (for better VT-d posted interrupt support)
- fixed guest debugging with nested virtualizations
- improved interrupt tracking in the in-kernel IOAPIC
- generic infrastructure for tracking writes to guest memory---currently
its only use is to speedup the legacy shadow paging (pre-EPT) case, but
in the future it will be used for virtual GPUs as well
- much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJW5r3BAAoJEL/70l94x66D2pMH/jTSWWwdTUJMctrDjPVzKzG0
yOzHW5vSLFoFlwEOY2VpslnXzn5TUVmCAfrdmFNmQcSw6hGb3K/xA/ZX/KLwWhyb
oZpr123ycahga+3q/ht/dFUBCCyWeIVMdsLSFwpobEBzPL0pMgc9joLgdUC6UpWX
tmN0LoCAeS7spC4TTiTTpw3gZ/L+aB0B6CXhOMjldb9q/2CsgaGyoVvKA199nk9o
Ngu7ImDt7l/x1VJX4/6E/17VHuwqAdUrrnbqerB/2oJ5ixsZsHMGzxQ3sHCmvyJx
WG5L00ubB1oAJAs9fBg58Y/MdiWX99XqFhdEfxq4foZEiQuCyxygVvq3JwZTxII=
=OUZZ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"One of the largest releases for KVM... Hardly any generic
changes, but lots of architecture-specific updates.
ARM:
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- various optimizations to the vgic save/restore code.
PPC:
- enabled KVM-VFIO integration ("VFIO device")
- optimizations to speed up IPIs between vcpus
- in-kernel handling of IOMMU hypercalls
- support for dynamic DMA windows (DDW).
s390:
- provide the floating point registers via sync regs;
- separated instruction vs. data accesses
- dirty log improvements for huge guests
- bugfixes and documentation improvements.
x86:
- Hyper-V VMBus hypercall userspace exit
- alternative implementation of lowest-priority interrupts using
vector hashing (for better VT-d posted interrupt support)
- fixed guest debugging with nested virtualizations
- improved interrupt tracking in the in-kernel IOAPIC
- generic infrastructure for tracking writes to guest
memory - currently its only use is to speedup the legacy shadow
paging (pre-EPT) case, but in the future it will be used for
virtual GPUs as well
- much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
KVM: x86: disable MPX if host did not enable MPX XSAVE features
arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
arm64: KVM: vgic-v3: Reset LRs at boot time
arm64: KVM: vgic-v3: Do not save an LR known to be empty
arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
arm64: KVM: vgic-v3: Avoid accessing ICH registers
KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
KVM: arm/arm64: vgic-v2: Reset LRs at boot time
KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
KVM: s390: allocate only one DMA page per VM
KVM: s390: enable STFLE interpretation only if enabled for the guest
KVM: s390: wake up when the VCPU cpu timer expires
KVM: s390: step the VCPU timer while in enabled wait
KVM: s390: protect VCPU cpu timer with a seqcount
KVM: s390: step VCPU cpu timer during kvm_run ioctl
...
channels has offset 0 and correct size now, but that can change.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Discard policy doesn't rely on information from notifiers, so we don't
need to register notifiers unconditionally. We kept correct counts in
case userspace switched between policies during runtime, but that can be
avoided by reseting the state.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- kvm_create_pit had to lock only because it exposed kvm->arch.vpit very
early, but initialization doesn't use kvm->arch.vpit since the last
patch, so we can drop locking.
- kvm_free_pit is only run after there are no users of KVM and therefore
is the sole actor.
- Locking in kvm_vm_ioctl_reinject doesn't do anything, because reinject
is only protected at that place.
- kvm_pit_reset isn't used anywhere and its locking can be dropped if we
hide it.
Removing useless locking allows to see what actually is being protected
by PIT state lock (values accessible from the guest).
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch passes struct kvm_pit into internal PIT functions.
Those functions used to get PIT through kvm->arch.vpit, even though most
of them never used *kvm for other purposes. Another benefit is that we
don't need to set kvm->arch.vpit during initialization.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Register the notifier to receive write track event so that we can update
our shadow page table
It makes kvm_mmu_pte_write() be the callback of the notifier, no function
is changed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Notifier list is introduced so that any node wants to receive the track
event can register to the list
Two APIs are introduced here:
- kvm_page_track_register_notifier(): register the notifier to receive
track event
- kvm_page_track_unregister_notifier(): stop receiving track event by
unregister the notifier
The callback, node->track_write() is called when a write access on the
write tracked page happens
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The array, gfn_track[mode][gfn], is introduced in memory slot for every
guest page, this is the tracking count for the gust page on different
modes. If the page is tracked then the count is increased, the page is
not tracked after the count reaches zero
We use 'unsigned short' as the tracking count which should be enough as
shadow page table only can use 2^14 (2^3 for level, 2^1 for cr4_pae, 2^2
for quadrant, 2^3 for access, 2^1 for nxe, 2^1 for cr0_wp, 2^1 for
smep_andnot_wp, 2^1 for smap_andnot_wp, and 2^1 for smm) at most, there
is enough room for other trackers
Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot() are implemented in this patch, they are
internally used to initialize and reclaim the memory of the array
Currently, only write track mode is supported
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_lpage_info->write_count is used to detect if the large page mapping
for the gfn on the specified level is allowed, rename it to disallow_lpage
to reflect its purpose, also we rename has_wrprotected_page() to
mmu_gfn_lpage_is_disallowed() to make the code more clearer
Later we will extend this mechanism for page tracking: if the gfn is
tracked then large mapping for that gfn on any level is not allowed.
The new name is more straightforward
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 172b2386ed ("KVM: x86: fix missed hardware breakpoints",
2016-02-10) worked around a case where the debug registers are not loaded
correctly on preemption and on the first entry to KVM_RUN.
However, Xiao Guangrong pointed out that the root cause must be that
KVM_DEBUGREG_BP_ENABLED is not being set correctly. This can indeed
happen due to the lazy debug exit mechanism, which does not call
kvm_update_dr7. Fix it by replacing the existing loop (more or less
equivalent to kvm_update_dr0123) with calls to all the kvm_update_dr*
functions.
Cc: stable@vger.kernel.org # 4.1+
Fixes: 172b2386ed
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is the same as before:
kvm_scale_tsc(tgt_tsc_khz)
= tgt_tsc_khz * ratio
= tgt_tsc_khz * user_tsc_khz / tsc_khz (see set_tsc_khz)
= user_tsc_khz (see kvm_guest_time_update)
= vcpu->arch.virtual_tsc_khz (see kvm_set_tsc_khz)
However, computing it through kvm_scale_tsc will make it possible
to include the NTP correction in tgt_tsc_khz.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This refers to the desired (scaled) frequency, which is called
user_tsc_khz in the rest of the file.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Different pieces of code checked for vcpu->arch.apic being (non-)NULL,
or used kvm_vcpu_has_lapic (more optimized) or lapic_in_kernel.
Replace everything with lapic_in_kernel's name and kvm_vcpu_has_lapic's
implementation.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use vector-hashing to deliver lowest-priority interrupts, As an
example, modern Intel CPUs in server platform use this method to
handle lowest-priority interrupts.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is similar to the existing div_frac function, but it returns the
remainder too. Unlike div_frac, it can be used to implement long
division, e.g. (a << 64) / b for 32-bit a and b.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
support of 248 VCPUs.
* ARM: rewrite of the arm64 world switch in C, support for
16-bit VM identifiers. Performance counter virtualization
missed the boat.
* x86: Support for more Hyper-V features (synthetic interrupt
controller), MMU cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWlSKwAAoJEL/70l94x66DY0UIAK5vp4zfQoQOJC4KP4Xgxwdu
kpnK2Boz3/74o1b0y5+eJZoUZCsXCVLtmP5uhmMxUYWDgByFG2X8ZDhPFwB5FYLT
2dN+Lr4tsolgIfRdHZtrT6Svp9SDL039bWTdscnbR6l37/j9FRWvpKdhI3orloFD
/i4CSW2dVIq1/9Xctwu/rtcOEesEx4Cad+6YV3/530eVAXFzE908nXfmqJNZTocY
YCGcmrMVCOu0ng5QM4xSzmmYjKMLUcRs+QzZWkVBzdJtTgwZUr09yj7I2dZ1yj/i
cxYrJy6shSwE74XkXsmvG+au3C5u3vX4tnXjBFErnPJ99oqzHatVnFWNRhj4dLQ=
=PIj1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"PPC changes will come next week.
- s390: Support for runtime instrumentation within guests, support of
248 VCPUs.
- ARM: rewrite of the arm64 world switch in C, support for 16-bit VM
identifiers. Performance counter virtualization missed the boat.
- x86: Support for more Hyper-V features (synthetic interrupt
controller), MMU cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (115 commits)
kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
kvm/x86: Hyper-V SynIC timers tracepoints
kvm/x86: Hyper-V SynIC tracepoints
kvm/x86: Update SynIC timers on guest entry only
kvm/x86: Skip SynIC vector check for QEMU side
kvm/x86: Hyper-V fix SynIC timer disabling condition
kvm/x86: Reorg stimer_expiration() to better control timer restart
kvm/x86: Hyper-V unify stimer_start() and stimer_restart()
kvm/x86: Drop stimer_stop() function
kvm/x86: Hyper-V timers fix incorrect logical operation
KVM: move architecture-dependent requests to arch/
KVM: renumber vcpu->request bits
KVM: document which architecture uses each request bit
KVM: Remove unused KVM_REQ_KICK to save a bit in vcpu->requests
kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock
KVM: s390: implement the RI support of guest
kvm/s390: drop unpaired smp_mb
kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa
KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table()
arm/arm64: KVM: Detect vGIC presence at runtime
...
Consolidate updating the Hyper-V SynIC timers in a
single place: on guest entry in processing KVM_REQ_HV_STIMER
request. This simplifies the overall logic, and makes sure
the most current state of msrs and guest clock is used for
arming the timers (to achieve that, KVM_REQ_HV_STIMER
has to be processed after KVM_REQ_CLOCK_UPDATE).
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the numbers now overlap, it makes sense to enumerate
them in asm/kvm_host.h rather than linux/kvm_host.h. Functions
that refer to architecture-specific requests are also moved
to arch/.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This makes sure the wall clock is updated only after an odd version value
is successfully written to guest memory.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While setting the KVM PIT counters in 'kvm_pit_load_count', if
'hpet_legacy_start' is set, the function disables the timer on
channel[0], instead of the respective index 'channel'. This is
because channels 1-3 are not linked to the HPET. Fix the caller
to only activate the special HPET processing for channel 0.
Reported-by: P J P <pjp@fedoraproject.org>
Fixes: 0185604c2d
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently if userspace restores the pit counters with a count of 0
on channels 1 or 2 and the guest attempts to read the count on those
channels, then KVM will perform a mod of 0 and crash. This will ensure
that 0 values are converted to 65536 as per the spec.
This is CVE-2015-7513.
Signed-off-by: Andy Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per Hyper-V specification (and as required by Hyper-V-aware guests),
SynIC provides 4 per-vCPU timers. Each timer is programmed via a pair
of MSRs, and signals expiration by delivering a special format message
to the configured SynIC message slot and triggering the corresponding
synthetic interrupt.
Note: as implemented by this patch, all periodic timers are "lazy"
(i.e. if the vCPU wasn't scheduled for more than the timer period the
timer events are lost), regardless of the corresponding configuration
MSR. If deemed necessary, the "catch up" mode (the timer period is
shortened until the timer catches up) will be implemented later.
Changes v2:
* Use remainder to calculate periodic timer expiration time
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If we do not do this, it is not properly saved and restored across
migration. Windows notices due to its self-protection mechanisms,
and is very upset about it (blue screen of death).
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A new vcpu exit is introduced to notify the userspace of the
changes in Hyper-V SynIC configuration triggered by guest writing to the
corresponding MSRs.
Changes v4:
* exit into userspace only if guest writes into SynIC MSR's
Changes v3:
* added KVM_EXIT_HYPERV types and structs notes into docs
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SynIC (synthetic interrupt controller) is a lapic extension,
which is controlled via MSRs and maintains for each vCPU
- 16 synthetic interrupt "lines" (SINT's); each can be configured to
trigger a specific interrupt vector optionally with auto-EOI
semantics
- a message page in the guest memory with 16 256-byte per-SINT message
slots
- an event flag page in the guest memory with 16 2048-bit per-SINT
event flag areas
The host triggers a SINT whenever it delivers a new message to the
corresponding slot or flips an event flag bit in the corresponding area.
The guest informs the host that it can try delivering a message by
explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
MSR.
The userspace (qemu) triggers interrupts and receives EOM notifications
via irqfd with resampler; for that, a GSI is allocated for each
configured SINT, and irq_routing api is extended to support GSI-SINT
mapping.
Changes v4:
* added activation of SynIC by vcpu KVM_ENABLE_CAP
* added per SynIC active flag
* added deactivation of APICv upon SynIC activation
Changes v3:
* added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
docs
Changes v2:
* do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
* add Hyper-V SynIC vectors into EOI exit bitmap
* Hyper-V SyniIC SINT msr write logic simplified
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The decision on whether to use hardware APIC virtualization used to be
taken globally, based on the availability of the feature in the CPU
and the value of a module parameter.
However, under certain circumstances we want to control it on per-vcpu
basis. In particular, when the userspace activates HyperV synthetic
interrupt controller (SynIC), APICv has to be disabled as it's
incompatible with SynIC auto-EOI behavior.
To achieve that, introduce 'apicv_active' flag on struct
kvm_vcpu_arch, and kvm_vcpu_deactivate_apicv() function to turn APICv
off. The flag is initialized based on the module parameter and CPU
capability, and consulted whenever an APICv-specific action is
performed.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function to determine if the vector is handled by ioapic used to
rely on the fact that only ioapic-handled vectors were set up to
cause vmexits when virtual apic was in use.
We're going to break this assumption when introducing Hyper-V
synthetic interrupts: they may need to cause vmexits too.
To achieve that, introduce a new bitmap dedicated specifically for
ioapic-handled vectors, and populate EOI exit bitmap from it for now.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before this patch, we incorrectly enter the guest without requesting an
interrupt window if the IRQ chip is split between user space and the
kernel.
Because lapic_in_kernel no longer implies the PIC is in the kernel, this
patch tests pic_in_kernel to determining whether an interrupt window
should be requested when entering the guest.
If the APIC is in the kernel and we request an interrupt window the
guest will return immediately. If the APIC is masked the guest will not
not make forward progress and unmask it, leading to a loop when KVM
reenters and requests again. This patch adds a check to ensure the APIC
is ready to accept an interrupt before requesting a window.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
[Use the other newly introduced functions. - Paolo]
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set KVM_REQ_EVENT when a PIC in user space injects a local interrupt.
Currently a request is only made when neither the PIC nor the APIC is in
the kernel, which is not sufficient in the split IRQ chip case.
This addresses a problem in QEMU where interrupts are delayed until
another path invokes the event loop.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch breaks out a new function kvm_vcpu_ready_for_interrupt_injection.
This routine encapsulates the logic required to determine whether a vcpu
is ready to accept an interrupt injection, which is now required on
multiple paths.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch ensures that dm_request_for_irq_injection and
post_kvm_run_save are in sync, avoiding that an endless ping-pong
between userspace (who correctly notices that IF=0) and
the kernel (who insists that userspace handles its request
for the interrupt window).
To synchronize them, it also adds checks for kvm_arch_interrupt_allowed
and !kvm_event_needs_reinjection. These are always needed, not
just for in-kernel LAPIC.
Signed-off-by: Matt Gingell <gingell@google.com>
[A collage of two patches from Matt. - Paolo]
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- x86: work around two nasty cases where a benign exception occurs while
another is being delivered. The endless stream of exceptions causes an
infinite loop in the processor, which not even NMIs or SMIs can interrupt;
in the virt case, there is no possibility to exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by AMD,
the patches mostly move things from there to common arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and exit
paths when context tracking is enabled. The patches are a few months
old, but we discussed them again at kernel summit. Andy will pick up
from here and, in 4.5, try to remove it from the user entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWRFb0AAoJEL/70l94x66DjjMH/31jr8d119MW0uv2x+03+wRq
6dbJ8tjQ8grvBRExKvLsUVjDmHlhCa1BQl5qjCsyYhX9UeAf4NQOmoEFpq+YTLxh
Ctveyn+yiZWC7qxbQDmauiQ4JCOp+W9ial782iqw5+ouQMajGOffq5WrojCa2ZNF
jI278JgdHJLrKj/uie//WBu3V7MJY5Apc3p4zatnSYFSQ3MA0sxl4r4zIrwOa5qs
23ZeeoqbP4sHh4X5wL/30Y6XFSCHj0qoYHHyAgzLi0PCMvBdt4DrAFUPDG/Rhlv6
o1WB/kcUfcz3DtBX85wfSOMuw0nF6patWhWv07R/3EIbYoz3dKvp9d6ORYgXqlY=
=Um9M
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second batch of kvm updates from Paolo Bonzini:
"Four changes:
- x86: work around two nasty cases where a benign exception occurs
while another is being delivered. The endless stream of exceptions
causes an infinite loop in the processor, which not even NMIs or
SMIs can interrupt; in the virt case, there is no possibility to
exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by
AMD, the patches mostly move things from there to common
arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and
exit paths when context tracking is enabled. The patches are a few
months old, but we discussed them again at kernel summit. Andy
will pick up from here and, in 4.5, try to remove it from the user
entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: x86: rename update_db_bp_intercept to update_bp_intercept
KVM: svm: unconditionally intercept #DB
KVM: x86: work around infinite loop in microcode when #AC is delivered
context_tracking: avoid irq_save/irq_restore on guest entry and exit
context_tracking: remove duplicate enabled check
KVM: VMX: Dump TSC multiplier in dump_vmcs()
KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
KVM: VMX: Enable and initialize VMX TSC scaling
KVM: x86: Use the correct vcpu's TSC rate to compute time scale
KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
KVM: x86: Replace call-back compute_tsc_offset() with a common function
KVM: x86: Replace call-back set_tsc_khz() with a common function
KVM: x86: Add a common TSC scaling function
KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
KVM: x86: Collect information for setting TSC scaling ratio
KVM: x86: declare a few variables as __read_mostly
KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
KVM: PPC: Book3S HV: Don't dynamically split core when already split
...
Because #DB is now intercepted unconditionally, this callback
only operates on #BP for both VMX and SVM.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch makes KVM use virtual_tsc_khz rather than the host TSC rate
as vcpu's TSC rate to compute the time scale if TSC scaling is enabled.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both VMX and SVM scales the host TSC in the same way in call-back
read_l1_tsc(), so this patch moves the scaling logic from call-back
read_l1_tsc() to a common function kvm_read_l1_tsc().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For both VMX and SVM, if the 2nd argument of call-back
adjust_tsc_offset() is the host TSC, then adjust_tsc_offset() will scale
it first. This patch moves this common TSC scaling logic to its caller
adjust_tsc_offset_host() and rename the call-back adjust_tsc_offset() to
adjust_tsc_offset_guest().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>