Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
I am sending this as RFC because the error messages it produces are
very ugly. Because of inlining, the original line is lost. The
alternative is to change vmcs_read/write/checkXX into macros, but
then you need to have a single huge BUILD_BUG_ON or BUILD_BUG_ON_MSG
because multiple BUILD_BUG_ON* with the same __LINE__ are not
supported well.
This was not printing the high parts of several 64-bit fields on
32-bit kernels. Separate from the previous one to make the patches
easier to review.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In theory this should have broken EPT on 32-bit kernels (due to
reading the high part of natural-width field GUEST_CR3). Not sure
if no one noticed or the processor behaves differently from the
documentation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per Hyper-V specification (and as required by Hyper-V-aware guests),
SynIC provides 4 per-vCPU timers. Each timer is programmed via a pair
of MSRs, and signals expiration by delivering a special format message
to the configured SynIC message slot and triggering the corresponding
synthetic interrupt.
Note: as implemented by this patch, all periodic timers are "lazy"
(i.e. if the vCPU wasn't scheduled for more than the timer period the
timer events are lost), regardless of the corresponding configuration
MSR. If deemed necessary, the "catch up" mode (the timer period is
shortened until the timer catches up) will be implemented later.
Changes v2:
* Use remainder to calculate periodic timer expiration time
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SynIC message protocol mandates that the message slot is claimed
by atomically setting message type to something other than HVMSG_NONE.
If another message is to be delivered while the slot is still busy,
message pending flag is asserted to indicate to the guest that the
hypervisor wants to be notified when the slot is released.
To make sure the protocol works regardless of where the message
sources are (kernel or userspace), clear the pending flag on SINT ACK
notification, and let the message sources compete for the slot again.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This helper will be used also in Hyper-V SynIC timers implementation.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This rearrangement places functions declarations together
according to their functionality, so future additions
will be simplier.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This struct is required for Hyper-V SynIC timers implementation inside KVM
and for upcoming Hyper-V VMBus support by userspace(QEMU). So place it into
Hyper-V UAPI header.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This struct is required for Hyper-V SynIC timers implementation inside KVM
and for upcoming Hyper-V VMBus support by userspace(QEMU). So place it into
Hyper-V UAPI header.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This constant is required for Hyper-V SynIC timers MSR's
support by userspace(QEMU).
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As kvm_mmu_get_page() was changed so that every parent pointer would not
get into the sp->parent_ptes chain before the entry pointed to by it was
set properly, we can use the for_each_rmap_spte macro instead of
pte_list_walk().
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Every time kvm_mmu_get_page() is called with a non-NULL parent_pte
argument, link_shadow_page() follows that to set the parent entry so
that the new mapping will point to the returned page table.
Moving parent_pte handling there allows to clean up the code because
parent_pte is passed to kvm_mmu_get_page() just for mark_unsync() and
mmu_page_add_parent_pte().
In addition, the patch avoids calling mark_unsync() for other parents in
the sp->parent_ptes chain than the newly added parent_pte, because they
have been there since before the current page fault handling started.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make kvm_mmu_alloc_page() do just what its name tells to do, and remove
the extra allocation error check and zero-initialization of parent_ptes:
shadow page headers allocated by kmem_cache_zalloc() are always in the
per-VCPU pools.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is
placed right after the call to detect unrelated sptes which must not be
found in the reverse-mapping list.
Move this check in rmap_get_first/next() so that all call sites, not
just the users of the for_each_rmap_spte() macro, will be checked the
same way.
One thing to keep in mind is that kvm_mmu_unlink_parents() also uses
rmap_get_first() to handle parent sptes. The change will not break it
because parent sptes are present, at least until drop_parent_pte()
actually unlinks them, and not mmio-sptes.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
is_rmap_spte(), originally named is_rmap_pte(), was introduced when the
simple reverse mapping was implemented by commit cd4a4e5374
("[PATCH] KVM: MMU: Implement simple reverse mapping"). At that point,
its role was clear and only rmap_add() and rmap_remove() were using it
to select sptes that need to be reverse-mapped.
Independently of that, is_shadow_present_pte() was first introduced by
commit c7addb9020 ("KVM: Allow not-present guest page faults to
bypass kvm") to do bypass_guest_pf optimization, which does not exist
any more.
These two seem to have changed their roles somewhat, and is_rmap_spte()
just calls is_shadow_present_pte() now.
Since using both of them without clear distinction just makes the code
confusing, remove is_rmap_spte().
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_set_spte()'s code is based on the assumption that the emulate
parameter has a valid pointer value if set_spte() returns true and
write_fault is not zero. In other cases, emulate may be NULL, so a
NULL-check is needed.
Stop passing emulate pointer and make mmu_set_spte() return the emulate
value instead to clean up this complex interface. Prefetch functions
can just throw away the return value.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both __mmu_unsync_walk() and mmu_pages_clear_parents() have three line
code which clears a bit in the unsync child bitmap; the former places it
inside a loop block and uses a few goto statements to jump to it.
A new helper function, clear_unsync_child_bit(), makes the code cleaner.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
New struct kvm_rmap_head makes the code type-safe to some extent.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 7a1638ce42 ("nEPT: Redefine EPT-specific link_shadow_page()",
2013-08-05) says:
Since nEPT doesn't support A/D bit, we should not set those bit
when building the shadow page table.
but this is not necessary. Even though nEPT doesn't support A/D
bits, and hence the vmcs12 EPT pointer will never enable them,
we always use them for shadow page tables if available (see
construct_eptp in vmx.c). So we can set the A/D bits freely
in the shadow page table.
This patch hence basically reverts commit 7a1638ce42.
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Poor #AC was so unimportant until a few days ago that we were
not even tracing its name correctly. But now it's all over
the place.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
RDTSCP was never supported for AMD CPUs, which nobody noticed because
Linux does not use it. But exactly the fact that Linux does not
use it makes the implementation very simple; we can freely trash
MSR_TSC_AUX while running the guest.
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If we do not do this, it is not properly saved and restored across
migration. Windows notices due to its self-protection mechanisms,
and is very upset about it (blue screen of death).
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A new vcpu exit is introduced to notify the userspace of the
changes in Hyper-V SynIC configuration triggered by guest writing to the
corresponding MSRs.
Changes v4:
* exit into userspace only if guest writes into SynIC MSR's
Changes v3:
* added KVM_EXIT_HYPERV types and structs notes into docs
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SynIC (synthetic interrupt controller) is a lapic extension,
which is controlled via MSRs and maintains for each vCPU
- 16 synthetic interrupt "lines" (SINT's); each can be configured to
trigger a specific interrupt vector optionally with auto-EOI
semantics
- a message page in the guest memory with 16 256-byte per-SINT message
slots
- an event flag page in the guest memory with 16 2048-bit per-SINT
event flag areas
The host triggers a SINT whenever it delivers a new message to the
corresponding slot or flips an event flag bit in the corresponding area.
The guest informs the host that it can try delivering a message by
explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
MSR.
The userspace (qemu) triggers interrupts and receives EOM notifications
via irqfd with resampler; for that, a GSI is allocated for each
configured SINT, and irq_routing api is extended to support GSI-SINT
mapping.
Changes v4:
* added activation of SynIC by vcpu KVM_ENABLE_CAP
* added per SynIC active flag
* added deactivation of APICv upon SynIC activation
Changes v3:
* added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
docs
Changes v2:
* do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
* add Hyper-V SynIC vectors into EOI exit bitmap
* Hyper-V SyniIC SINT msr write logic simplified
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The decision on whether to use hardware APIC virtualization used to be
taken globally, based on the availability of the feature in the CPU
and the value of a module parameter.
However, under certain circumstances we want to control it on per-vcpu
basis. In particular, when the userspace activates HyperV synthetic
interrupt controller (SynIC), APICv has to be disabled as it's
incompatible with SynIC auto-EOI behavior.
To achieve that, introduce 'apicv_active' flag on struct
kvm_vcpu_arch, and kvm_vcpu_deactivate_apicv() function to turn APICv
off. The flag is initialized based on the module parameter and CPU
capability, and consulted whenever an APICv-specific action is
performed.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function to determine if the vector is handled by ioapic used to
rely on the fact that only ioapic-handled vectors were set up to
cause vmexits when virtual apic was in use.
We're going to break this assumption when introducing Hyper-V
synthetic interrupts: they may need to cause vmexits too.
To achieve that, introduce a new bitmap dedicated specifically for
ioapic-handled vectors, and populate EOI exit bitmap from it for now.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Actually kvm_arch_irq_routing_update() should be
kvm_arch_post_irq_routing_update() as it's called at the end
of irq routing update.
This renaming frees kvm_arch_irq_routing_update function name.
kvm_arch_irq_routing_update() weak function which will be used
to update mappings for arch-specific irq routing entries
(in particular, the upcoming Hyper-V synthetic interrupts).
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch removes the vpid check when emulating nested invvpid
instruction of type all-contexts invalidation. The existing code is
incorrect because:
(1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
Translations Based on VPID", invvpid instruction does not check
vpid in the invvpid descriptor when its type is all-contexts
invalidation.
(2) According to the same document, invvpid of type all-contexts
invalidation does not require there is an active VMCS, so/and
get_vmcs12() in the existing code may result in a NULL-pointer
dereference. In practice, it can crash both KVM itself and L1
hypervisors that use invvpid (e.g. Xen).
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Includes some timer fixes, properly unmapping PTEs, an errata fix, and two
tweaks to the EL2 panic code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWVJ7gAAoJEEtpOizt6ddyD5MH/3M/nhtZTnT6v0RPDvHWJo7s
5BQmITJYPHFkTO14OHWTVLXiGgLws8gPZnWHxC4jjHjpuJnL+/MM551FpCOqDDd7
vweYgVlSqD8ANH5nKbv1PPnzjrqhTVN+yi3ZItXy2pxsfvu63FC6Z43B2axelLvw
XYmHoMZaeWBBw2gHi3djGfju3Yj/2SOe+ozuvAXpxA5+NhSiPHHnMefGy5k3wKnJ
sETwshPdjiMeK4ItfMhveFTDRjl4uh9uQyORfaa5gqG0uePt3EalYynw+gEjZ6RX
Bpc3nLwboIfRIa/WwyoHm+nmLIUYjU8dAgLwUOIbdeG0igpdALdvsB0aBHCgngk=
=+7ED
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM Fixes for v4.4-rc3.
Includes some timer fixes, properly unmapping PTEs, an errata fix, and two
tweaks to the EL2 panic code.
Pull x86 fixes from Thomas Gleixner:
"This update contains:
- MPX updates for handling 32bit processes
- A fix for a long standing bug in 32bit signal frame handling
related to FPU/XSAVE state
- Handle get_xsave_addr() correctly in KVM
- Fix SMAP check under paravirtualization
- Add a comment to the static function trace entry to avoid further
confusion about the difference to dynamic tracing"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Fix SMAP check in PVOPS environments
x86/ftrace: Add comment on static function tracing
x86/fpu: Fix get_xsave_addr() behavior under virtualization
x86/fpu: Fix 32-bit signal frame handling
x86/mpx: Fix 32-bit address space calculation
x86/mpx: Do proper get_user() when running 32-bit binaries on 64-bit kernels
There appears to be no formal statement of what pv_irq_ops.save_fl() is
supposed to return precisely. Native returns the full flags, while lguest and
Xen only return the Interrupt Flag, and both have comments by the
implementations stating that only the Interrupt Flag is looked at. This may
have been true when initially implemented, but no longer is.
To make matters worse, the Xen PVOP leaves the upper bits undefined, making
the BUG_ON() undefined behaviour. Experimentally, this now trips for 32bit PV
guests on Broadwell hardware. The BUG_ON() is consistent for an individual
build, but not consistent for all builds. It has also been a sitting timebomb
since SMAP support was introduced.
Use native_save_fl() instead, which will obtain an accurate view of the AC
flag.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <lguest@lists.ozlabs.org>
Cc: Xen-devel <xen-devel@lists.xen.org>
CC: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1433323874-6927-1-git-send-email-andrew.cooper3@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There was a confusion between update_ftrace_function() and static
function tracing trampoline regarding 3rd parameter (ftrace_ops).
Add a comment for clarification.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1447721004-2551-1-git-send-email-namhyung@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Before this patch, we incorrectly enter the guest without requesting an
interrupt window if the IRQ chip is split between user space and the
kernel.
Because lapic_in_kernel no longer implies the PIC is in the kernel, this
patch tests pic_in_kernel to determining whether an interrupt window
should be requested when entering the guest.
If the APIC is in the kernel and we request an interrupt window the
guest will return immediately. If the APIC is masked the guest will not
not make forward progress and unmask it, leading to a loop when KVM
reenters and requests again. This patch adds a check to ensure the APIC
is ready to accept an interrupt before requesting a window.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
[Use the other newly introduced functions. - Paolo]
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set KVM_REQ_EVENT when a PIC in user space injects a local interrupt.
Currently a request is only made when neither the PIC nor the APIC is in
the kernel, which is not sufficient in the split IRQ chip case.
This addresses a problem in QEMU where interrupts are delayed until
another path invokes the event loop.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch breaks out a new function kvm_vcpu_ready_for_interrupt_injection.
This routine encapsulates the logic required to determine whether a vcpu
is ready to accept an interrupt injection, which is now required on
multiple paths.
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Matt Gingell <gingell@google.com>
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch ensures that dm_request_for_irq_injection and
post_kvm_run_save are in sync, avoiding that an endless ping-pong
between userspace (who correctly notices that IF=0) and
the kernel (who insists that userspace handles its request
for the interrupt window).
To synchronize them, it also adds checks for kvm_arch_interrupt_allowed
and !kvm_event_needs_reinjection. These are always needed, not
just for in-kernel LAPIC.
Signed-off-by: Matt Gingell <gingell@google.com>
[A collage of two patches from Matt. - Paolo]
Fixes: 1c1a9ce973
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull perf updates from Thomas Gleixner:
"Mostly updates to the perf tool plus two fixes to the kernel core code:
- Handle tracepoint filters correctly for inherited events (Peter
Zijlstra)
- Prevent a deadlock in perf_lock_task_context (Paul McKenney)
- Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
Melo)
- Print full source file paths when using 'perf annotate --print-line
--full-paths' (Michael Petlan)
- Fix 'perf probe -d' when just one out of uprobes and kprobes is
enabled (Wang Nan)
- Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)
- Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
the 'perf test' LLVM entries, when running it in-tree, to
.gitignore (Yunlong Song)
- libbpf error reporting improvements, using a strerror interface to
more precisely tell the user about problems with the provided
scriptlet, be it in C or as a ready made object file (Wang Nan)
- Do not be case sensitive when searching for matching 'perf test'
entries (Arnaldo Carvalho de Melo)
- Inform the user about objdump failures in 'perf annotate' (Andi
Kleen)
- Improve the LLVM 'perf test' entry, introduce a new ones for BPF
and kbuild tests to check the environment used by clang to compile
.c scriptlets (Wang Nan)"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
tools include: Add compiler.h to list.h
perf probe: Verify parameters in two functions
perf session: Add missing newlines to some pr_err() calls
perf annotate: Support full source file paths for srcline fix
perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
perf: Fix inherited events vs. tracepoint filters
perf: Disable IRQs across RCU RS CS that acquires scheduler lock
perf test: Do not be case sensitive when searching for matching tests
perf test: Add 'perf test BPF'
perf test: Enhance the LLVM tests: add kbuild test
perf test: Enhance the LLVM test: update basic BPF test program
perf bpf: Improve BPF related error messages
perf tools: Make fetch_kernel_version() publicly available
bpf tools: Add new API bpf_object__get_kversion()
bpf tools: Improve libbpf error reporting
perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
perf annotate: Inform the user about objdump failures in --stdio
perf stat: Make stat options global
perf sched latency: Fix thread pid reuse issue
...
Pull x86 fixes from Thomas Gleixner:
"A couple of fixes and updates related to x86:
- Fix the W+X check regression on XEN
- The real fix for the low identity map trainwreck
- Probe legacy PIC early instead of unconditionally allocating legacy
irqs
- Add cpu verification to long mode entry
- Adjust the cache topology to AMD Fam17H systems
- Let Merrifield use the TSC across S3"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Call verify_cpu() after having entered long mode too
x86/setup: Fix low identity map for >= 2GB kernel range
x86/mm: Skip the hypervisor range when walking PGD
x86/AMD: Fix last level cache topology for AMD Fam17h systems
x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
x86/cpu/intel: Enable X86_FEATURE_NONSTOP_TSC_S3 for Merrifield
MSR_NHM_PLATFORM_INFO has been replaced by...
MSR_PLATFORM_INFO
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- x86: work around two nasty cases where a benign exception occurs while
another is being delivered. The endless stream of exceptions causes an
infinite loop in the processor, which not even NMIs or SMIs can interrupt;
in the virt case, there is no possibility to exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by AMD,
the patches mostly move things from there to common arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and exit
paths when context tracking is enabled. The patches are a few months
old, but we discussed them again at kernel summit. Andy will pick up
from here and, in 4.5, try to remove it from the user entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWRFb0AAoJEL/70l94x66DjjMH/31jr8d119MW0uv2x+03+wRq
6dbJ8tjQ8grvBRExKvLsUVjDmHlhCa1BQl5qjCsyYhX9UeAf4NQOmoEFpq+YTLxh
Ctveyn+yiZWC7qxbQDmauiQ4JCOp+W9ial782iqw5+ouQMajGOffq5WrojCa2ZNF
jI278JgdHJLrKj/uie//WBu3V7MJY5Apc3p4zatnSYFSQ3MA0sxl4r4zIrwOa5qs
23ZeeoqbP4sHh4X5wL/30Y6XFSCHj0qoYHHyAgzLi0PCMvBdt4DrAFUPDG/Rhlv6
o1WB/kcUfcz3DtBX85wfSOMuw0nF6patWhWv07R/3EIbYoz3dKvp9d6ORYgXqlY=
=Um9M
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second batch of kvm updates from Paolo Bonzini:
"Four changes:
- x86: work around two nasty cases where a benign exception occurs
while another is being delivered. The endless stream of exceptions
causes an infinite loop in the processor, which not even NMIs or
SMIs can interrupt; in the virt case, there is no possibility to
exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by
AMD, the patches mostly move things from there to common
arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and
exit paths when context tracking is enabled. The patches are a few
months old, but we discussed them again at kernel summit. Andy
will pick up from here and, in 4.5, try to remove it from the user
entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: x86: rename update_db_bp_intercept to update_bp_intercept
KVM: svm: unconditionally intercept #DB
KVM: x86: work around infinite loop in microcode when #AC is delivered
context_tracking: avoid irq_save/irq_restore on guest entry and exit
context_tracking: remove duplicate enabled check
KVM: VMX: Dump TSC multiplier in dump_vmcs()
KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
KVM: VMX: Enable and initialize VMX TSC scaling
KVM: x86: Use the correct vcpu's TSC rate to compute time scale
KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
KVM: x86: Replace call-back compute_tsc_offset() with a common function
KVM: x86: Replace call-back set_tsc_khz() with a common function
KVM: x86: Add a common TSC scaling function
KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
KVM: x86: Collect information for setting TSC scaling ratio
KVM: x86: declare a few variables as __read_mostly
KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
KVM: PPC: Book3S HV: Don't dynamically split core when already split
...
KVM uses the get_xsave_addr() function in a different fashion from
the native kernel, in that the 'xsave' parameter belongs to guest vcpu,
not the currently running task.
But 'xsave' is replaced with current task's (host) xsave structure, so
get_xsave_addr() will incorrectly return the bad xsave address to KVM.
Fix it so that the passed in 'xsave' address is used - as intended
originally.
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Link: http://lkml.kernel.org/r/1446800423-21622-1-git-send-email-huaitong.han@intel.com
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(This should have gone to LKML originally. Sorry for the extra
noise, folks on the cc.)
Background:
Signal frames on x86 have two formats:
1. For 32-bit executables (whether on a real 32-bit kernel or
under 32-bit emulation on a 64-bit kernel) we have a
'fpregset_t' that includes the "FSAVE" registers.
2. For 64-bit executables (on 64-bit kernels obviously), the
'fpregset_t' is smaller and does not contain the "FSAVE"
state.
When creating the signal frame, we have to be aware of whether
we are running a 32 or 64-bit executable so we create the
correct format signal frame.
Problem:
save_xstate_epilog() uses 'fx_sw_reserved_ia32' whenever it is
called for a 32-bit executable. This is for real 32-bit and
ia32 emulation.
But, fpu__init_prepare_fx_sw_frame() only initializes
'fx_sw_reserved_ia32' when emulation is enabled, *NOT* for real
32-bit kernels.
This leads to really wierd situations where 32-bit programs
lose their extended state when returning from a signal handler.
The kernel copies the uninitialized (zero) 'fx_sw_reserved_ia32'
out to userspace in save_xstate_epilog(). But when returning
from the signal, the kernel errors out in check_for_xstate()
when it does not see FP_XSTATE_MAGIC1 present (because it was
zeroed). This leads to the FPU/XSAVE state being initialized.
For MPX, this leads to the most permissive state and means we
silently lose bounds violations. I think this would also mean
that we could lose *ANY* FPU/SSE/AVX state. I'm not sure why
no one has spotted this bug.
I believe this was broken by:
72a671ced6 ("x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels")
way back in 2012.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@sr71.net
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151111002354.A0799571@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I received a bug report that running 32-bit MPX binaries on
64-bit kernels was broken. I traced it down to this little code
snippet. We were switching our "number of bounds directory
entries" calculation correctly. But, we didn't switch the other
side of the calculation: the virtual space size.
This meant that we were calculating an absurd size for
bd_entry_virt_space() on 32-bit because we used the 64-bit
virt_space.
This was _also_ broken for 32-bit kernels running on 64-bit
hardware since boot_cpu_data.x86_virt_bits=48 even when running
in 32-bit mode.
Correct that and properly handle all 3 possible cases:
1. 32-bit binary on 64-bit kernel
2. 64-bit binary on 64-bit kernel
3. 32-bit binary on 32-bit kernel
This manifested in having bounds tables not properly unmapped.
It "leaked" memory but had no functional impact otherwise.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151111181934.FA7FAC34@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When you call get_user(foo, bar), you effectively do a
copy_from_user(&foo, bar, sizeof(*bar));
Note that the sizeof() is implicit.
When we reach out to userspace to try to zap an entire "bounds
table" we need to go read a "bounds directory entry" in order to
locate the table's address. The size of a "directory entry"
depends on the binary being run and is always the size of a
pointer.
But, when we have a 64-bit kernel and a 32-bit application, the
directory entry is still only 32-bits long, but we fetch it with
a 64-bit pointer which makes get_user() does a 64-bit fetch.
Reading 4 extra bytes isn't harmful, unless we are at the end of
and run off the table. It might also cause the zero page to get
faulted in unnecessarily even if you are not at the end.
Fix it up by doing a special 32-bit get_user() via a cast when
we have 32-bit userspace.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151111181931.3ACF6822@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>