2019-05-29 21:12:40 +07:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0-only */
|
2008-04-17 11:28:09 +07:00
|
|
|
/*
|
|
|
|
*
|
|
|
|
* Copyright IBM Corp. 2007
|
|
|
|
*
|
|
|
|
* Authors: Hollis Blanchard <hollisb@us.ibm.com>
|
|
|
|
*/
|
|
|
|
|
|
|
|
#ifndef __POWERPC_KVM_HOST_H__
|
|
|
|
#define __POWERPC_KVM_HOST_H__
|
|
|
|
|
|
|
|
#include <linux/mutex.h>
|
2009-11-02 19:02:31 +07:00
|
|
|
#include <linux/hrtimer.h>
|
|
|
|
#include <linux/interrupt.h>
|
2008-04-17 11:28:09 +07:00
|
|
|
#include <linux/types.h>
|
|
|
|
#include <linux/kvm_types.h>
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
#include <linux/threads.h>
|
|
|
|
#include <linux/spinlock.h>
|
2010-07-29 19:47:42 +07:00
|
|
|
#include <linux/kvm_para.h>
|
KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests
This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.
Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.
KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.
This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.
Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.
This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.
Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:25:44 +07:00
|
|
|
#include <linux/list.h>
|
|
|
|
#include <linux/atomic.h>
|
2008-04-17 11:28:09 +07:00
|
|
|
#include <asm/kvm_asm.h>
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
#include <asm/processor.h>
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 19:38:05 +07:00
|
|
|
#include <asm/page.h>
|
2012-08-03 18:56:33 +07:00
|
|
|
#include <asm/cacheflush.h>
|
2014-06-02 08:02:59 +07:00
|
|
|
#include <asm/hvcall.h>
|
2017-05-11 18:03:37 +07:00
|
|
|
#include <asm/mce.h>
|
2008-04-17 11:28:09 +07:00
|
|
|
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
#define KVM_MAX_VCPUS NR_CPUS
|
|
|
|
#define KVM_MAX_VCORES NR_CPUS
|
2015-12-09 17:34:07 +07:00
|
|
|
#define KVM_USER_MEM_SLOTS 512
|
2008-04-17 11:28:09 +07:00
|
|
|
|
2016-05-09 23:13:37 +07:00
|
|
|
#include <asm/cputhreads.h>
|
2018-07-26 11:53:54 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
|
|
|
|
#include <asm/kvm_book3s_asm.h> /* for MAX_SMT_THREADS */
|
|
|
|
#define KVM_MAX_VCPU_ID (MAX_SMT_THREADS * KVM_MAX_VCORES)
|
2018-10-08 12:31:03 +07:00
|
|
|
#define KVM_MAX_NESTED_GUESTS KVMPPC_NR_LPIDS
|
2018-07-26 11:53:54 +07:00
|
|
|
|
|
|
|
#else
|
|
|
|
#define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
|
|
|
|
#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
|
2016-05-09 23:13:37 +07:00
|
|
|
|
2016-08-10 08:27:27 +07:00
|
|
|
#define __KVM_HAVE_ARCH_INTC_INITIALIZED
|
|
|
|
|
2016-10-14 07:53:22 +07:00
|
|
|
#define KVM_HALT_POLL_NS_DEFAULT 10000 /* 10 us */
|
2008-05-30 21:05:56 +07:00
|
|
|
|
2013-04-16 22:42:19 +07:00
|
|
|
/* These values are internal and can be increased later */
|
|
|
|
#define KVM_NR_IRQCHIPS 1
|
|
|
|
#define KVM_IRQCHIP_NUM_PINS 256
|
|
|
|
|
2016-01-07 21:05:10 +07:00
|
|
|
/* PPC-specific vcpu->requests bit members */
|
2017-06-04 19:43:51 +07:00
|
|
|
#define KVM_REQ_WATCHDOG KVM_ARCH_REQ(0)
|
|
|
|
#define KVM_REQ_EPR_EXIT KVM_ARCH_REQ(1)
|
2016-01-07 21:05:10 +07:00
|
|
|
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 19:38:05 +07:00
|
|
|
#include <linux/mmu_notifier.h>
|
|
|
|
|
|
|
|
#define KVM_ARCH_WANT_MMU_NOTIFIER
|
|
|
|
|
2012-07-02 15:56:33 +07:00
|
|
|
extern int kvm_unmap_hva_range(struct kvm *kvm,
|
|
|
|
unsigned long start, unsigned long end);
|
2014-09-23 04:54:42 +07:00
|
|
|
extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 19:38:05 +07:00
|
|
|
extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
|
2018-12-06 20:21:10 +07:00
|
|
|
extern int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 19:38:05 +07:00
|
|
|
|
2010-06-30 20:18:46 +07:00
|
|
|
#define HPTEG_CACHE_NUM (1 << 15)
|
|
|
|
#define HPTEG_HASH_BITS_PTE 13
|
2010-07-29 20:04:19 +07:00
|
|
|
#define HPTEG_HASH_BITS_PTE_LONG 12
|
2010-06-30 20:18:46 +07:00
|
|
|
#define HPTEG_HASH_BITS_VPTE 13
|
|
|
|
#define HPTEG_HASH_BITS_VPTE_LONG 5
|
2013-09-20 11:52:44 +07:00
|
|
|
#define HPTEG_HASH_BITS_VPTE_64K 11
|
2010-06-30 20:18:46 +07:00
|
|
|
#define HPTEG_HASH_NUM_PTE (1 << HPTEG_HASH_BITS_PTE)
|
2010-07-29 20:04:19 +07:00
|
|
|
#define HPTEG_HASH_NUM_PTE_LONG (1 << HPTEG_HASH_BITS_PTE_LONG)
|
2010-06-30 20:18:46 +07:00
|
|
|
#define HPTEG_HASH_NUM_VPTE (1 << HPTEG_HASH_BITS_VPTE)
|
|
|
|
#define HPTEG_HASH_NUM_VPTE_LONG (1 << HPTEG_HASH_BITS_VPTE_LONG)
|
2013-09-20 11:52:44 +07:00
|
|
|
#define HPTEG_HASH_NUM_VPTE_64K (1 << HPTEG_HASH_BITS_VPTE_64K)
|
2009-10-30 12:47:04 +07:00
|
|
|
|
2010-07-29 19:47:52 +07:00
|
|
|
/* Physical Address Mask - allowed range of real mode RAM access */
|
|
|
|
#define KVM_PAM 0x0fffffffffffffffULL
|
|
|
|
|
2011-06-29 07:22:05 +07:00
|
|
|
struct lppaca;
|
|
|
|
struct slb_shadow;
|
KVM: PPC: Book3S HV: Make virtual processor area registration more robust
The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU. Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration. If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.
This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used. Each area now has a struct kvmppc_vpa which is
used to manage these updates. There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields. (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)
This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.
Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
the need to initialize vcpu->arch.vpa_update_lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-20 00:46:32 +07:00
|
|
|
struct dtl_entry;
|
2011-06-29 07:22:05 +07:00
|
|
|
|
2013-09-20 11:52:49 +07:00
|
|
|
struct kvmppc_vcpu_book3s;
|
|
|
|
struct kvmppc_book3s_shadow_vcpu;
|
2018-10-08 12:31:04 +07:00
|
|
|
struct kvm_nested_guest;
|
2013-09-20 11:52:49 +07:00
|
|
|
|
2008-04-17 11:28:09 +07:00
|
|
|
struct kvm_vm_stat {
|
2016-08-02 11:03:22 +07:00
|
|
|
ulong remote_tlb_flush;
|
2019-02-19 10:53:45 +07:00
|
|
|
ulong num_2M_pages;
|
|
|
|
ulong num_1G_pages;
|
2008-04-17 11:28:09 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
struct kvm_vcpu_stat {
|
2016-08-02 11:03:22 +07:00
|
|
|
u64 sum_exits;
|
|
|
|
u64 mmio_exits;
|
|
|
|
u64 signal_exits;
|
|
|
|
u64 light_exits;
|
2008-04-17 11:28:09 +07:00
|
|
|
/* Account for special types of light exits: */
|
2016-08-02 11:03:22 +07:00
|
|
|
u64 itlb_real_miss_exits;
|
|
|
|
u64 itlb_virt_miss_exits;
|
|
|
|
u64 dtlb_real_miss_exits;
|
|
|
|
u64 dtlb_virt_miss_exits;
|
|
|
|
u64 syscall_exits;
|
|
|
|
u64 isi_exits;
|
|
|
|
u64 dsi_exits;
|
|
|
|
u64 emulated_inst_exits;
|
|
|
|
u64 dec_exits;
|
|
|
|
u64 ext_intr_exits;
|
2016-08-02 11:03:23 +07:00
|
|
|
u64 halt_poll_success_ns;
|
|
|
|
u64 halt_poll_fail_ns;
|
|
|
|
u64 halt_wait_ns;
|
2016-08-02 11:03:22 +07:00
|
|
|
u64 halt_successful_poll;
|
|
|
|
u64 halt_attempted_poll;
|
2016-08-02 11:03:23 +07:00
|
|
|
u64 halt_successful_wait;
|
2016-08-02 11:03:22 +07:00
|
|
|
u64 halt_poll_invalid;
|
|
|
|
u64 halt_wakeup;
|
|
|
|
u64 dbell_exits;
|
|
|
|
u64 gdbell_exits;
|
|
|
|
u64 ld;
|
|
|
|
u64 st;
|
2010-04-16 05:11:42 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S
|
2016-08-02 11:03:22 +07:00
|
|
|
u64 pf_storage;
|
|
|
|
u64 pf_instruc;
|
|
|
|
u64 sp_storage;
|
|
|
|
u64 sp_instruc;
|
|
|
|
u64 queue_intr;
|
|
|
|
u64 ld_slow;
|
|
|
|
u64 st_slow;
|
2009-10-30 12:47:04 +07:00
|
|
|
#endif
|
2016-08-19 12:35:57 +07:00
|
|
|
u64 pthru_all;
|
|
|
|
u64 pthru_host;
|
|
|
|
u64 pthru_bad_aff;
|
2008-04-17 11:28:09 +07:00
|
|
|
};
|
|
|
|
|
2008-12-03 04:51:57 +07:00
|
|
|
enum kvm_exit_types {
|
|
|
|
MMIO_EXITS,
|
|
|
|
SIGNAL_EXITS,
|
|
|
|
ITLB_REAL_MISS_EXITS,
|
|
|
|
ITLB_VIRT_MISS_EXITS,
|
|
|
|
DTLB_REAL_MISS_EXITS,
|
|
|
|
DTLB_VIRT_MISS_EXITS,
|
|
|
|
SYSCALL_EXITS,
|
|
|
|
ISI_EXITS,
|
|
|
|
DSI_EXITS,
|
|
|
|
EMULATED_INST_EXITS,
|
|
|
|
EMULATED_MTMSRWE_EXITS,
|
|
|
|
EMULATED_WRTEE_EXITS,
|
|
|
|
EMULATED_MTSPR_EXITS,
|
|
|
|
EMULATED_MFSPR_EXITS,
|
|
|
|
EMULATED_MTMSR_EXITS,
|
|
|
|
EMULATED_MFMSR_EXITS,
|
|
|
|
EMULATED_TLBSX_EXITS,
|
|
|
|
EMULATED_TLBWE_EXITS,
|
|
|
|
EMULATED_RFI_EXITS,
|
2011-12-20 22:34:43 +07:00
|
|
|
EMULATED_RFCI_EXITS,
|
2014-08-06 13:38:52 +07:00
|
|
|
EMULATED_RFDI_EXITS,
|
2008-12-03 04:51:57 +07:00
|
|
|
DEC_EXITS,
|
|
|
|
EXT_INTR_EXITS,
|
|
|
|
HALT_WAKEUP,
|
|
|
|
USR_PR_INST,
|
|
|
|
FP_UNAVAIL,
|
|
|
|
DEBUG_EXITS,
|
|
|
|
TIMEINGUEST,
|
2011-12-20 22:34:43 +07:00
|
|
|
DBELL_EXITS,
|
|
|
|
GDBELL_EXITS,
|
2008-12-03 04:51:57 +07:00
|
|
|
__NUMBER_OF_KVM_EXIT_TYPES
|
|
|
|
};
|
|
|
|
|
|
|
|
/* allow access to big endian 32bit upper/lower parts and 64bit var */
|
2008-12-03 04:51:58 +07:00
|
|
|
struct kvmppc_exit_timing {
|
2008-12-03 04:51:57 +07:00
|
|
|
union {
|
|
|
|
u64 tv64;
|
|
|
|
struct {
|
|
|
|
u32 tbu, tbl;
|
|
|
|
} tv32;
|
|
|
|
};
|
|
|
|
};
|
|
|
|
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
struct kvmppc_pginfo {
|
|
|
|
unsigned long pfn;
|
|
|
|
atomic_t refcnt;
|
|
|
|
};
|
|
|
|
|
2017-03-22 11:21:56 +07:00
|
|
|
struct kvmppc_spapr_tce_iommu_table {
|
|
|
|
struct rcu_head rcu;
|
|
|
|
struct list_head next;
|
|
|
|
struct iommu_table *tbl;
|
|
|
|
struct kref kref;
|
|
|
|
};
|
|
|
|
|
2019-03-29 12:43:26 +07:00
|
|
|
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
|
|
|
|
|
2011-06-29 07:22:41 +07:00
|
|
|
struct kvmppc_spapr_tce_table {
|
|
|
|
struct list_head list;
|
|
|
|
struct kvm *kvm;
|
|
|
|
u64 liobn;
|
2016-02-15 08:55:05 +07:00
|
|
|
struct rcu_head rcu;
|
2016-03-01 13:54:38 +07:00
|
|
|
u32 page_shift;
|
2016-03-01 13:54:39 +07:00
|
|
|
u64 offset; /* in pages */
|
2016-03-01 13:54:38 +07:00
|
|
|
u64 size; /* window size in pages */
|
2017-03-22 11:21:56 +07:00
|
|
|
struct list_head iommu_tables;
|
2019-03-29 12:43:26 +07:00
|
|
|
struct mutex alloc_lock;
|
2011-06-29 07:22:41 +07:00
|
|
|
struct page *pages[0];
|
|
|
|
};
|
|
|
|
|
2013-04-18 03:30:26 +07:00
|
|
|
/* XICS components, defined in book3s_xics.c */
|
|
|
|
struct kvmppc_xics;
|
|
|
|
struct kvmppc_icp;
|
2017-04-05 14:54:56 +07:00
|
|
|
extern struct kvm_device_ops kvm_xics_ops;
|
|
|
|
|
|
|
|
/* XIVE components, defined in book3s_xive.c */
|
|
|
|
struct kvmppc_xive;
|
|
|
|
struct kvmppc_xive_vcpu;
|
|
|
|
extern struct kvm_device_ops kvm_xive_ops;
|
2019-04-18 17:39:27 +07:00
|
|
|
extern struct kvm_device_ops kvm_xive_native_ops;
|
2013-04-18 03:30:26 +07:00
|
|
|
|
2016-08-19 12:35:48 +07:00
|
|
|
struct kvmppc_passthru_irqmap;
|
|
|
|
|
2011-12-12 19:27:39 +07:00
|
|
|
/*
|
|
|
|
* The reverse mapping array has one entry for each HPTE,
|
|
|
|
* which stores the guest's view of the second word of the HPTE
|
2011-12-12 19:33:07 +07:00
|
|
|
* (including the guest physical address of the mapping),
|
|
|
|
* plus forward and backward pointers in a doubly-linked ring
|
|
|
|
* of HPTEs that map the same host page. The pointers in this
|
|
|
|
* ring are 32-bit HPTE indexes, to save space.
|
2011-12-12 19:27:39 +07:00
|
|
|
*/
|
|
|
|
struct revmap_entry {
|
|
|
|
unsigned long guest_rpte;
|
2011-12-12 19:33:07 +07:00
|
|
|
unsigned int forw, back;
|
2011-12-12 19:27:39 +07:00
|
|
|
};
|
|
|
|
|
2011-12-12 19:33:07 +07:00
|
|
|
/*
|
2019-08-20 13:13:49 +07:00
|
|
|
* The rmap array of size number of guest pages is allocated for each memslot.
|
|
|
|
* This array is used to store usage specific information about the guest page.
|
|
|
|
* Below are the encodings of the various possible usage types.
|
2011-12-12 19:33:07 +07:00
|
|
|
*/
|
2019-08-20 13:13:49 +07:00
|
|
|
/* Free bits which can be used to define a new usage */
|
|
|
|
#define KVMPPC_RMAP_TYPE_MASK 0xff00000000000000
|
|
|
|
#define KVMPPC_RMAP_NESTED 0xc000000000000000 /* Nested rmap array */
|
|
|
|
#define KVMPPC_RMAP_HPT 0x0100000000000000 /* HPT guest */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* rmap usage definition for a hash page table (hpt) guest:
|
|
|
|
* 0x0000080000000000 Lock bit
|
|
|
|
* 0x0000018000000000 RC bits
|
|
|
|
* 0x0000000100000000 Present bit
|
|
|
|
* 0x00000000ffffffff HPT index bits
|
|
|
|
* The bottom 32 bits are the index in the guest HPT of a HPTE that points to
|
|
|
|
* the page.
|
|
|
|
*/
|
|
|
|
#define KVMPPC_RMAP_LOCK_BIT 43
|
2011-12-15 09:02:02 +07:00
|
|
|
#define KVMPPC_RMAP_RC_SHIFT 32
|
|
|
|
#define KVMPPC_RMAP_REFERENCED (HPTE_R_R << KVMPPC_RMAP_RC_SHIFT)
|
2011-12-12 19:33:07 +07:00
|
|
|
#define KVMPPC_RMAP_PRESENT 0x100000000ul
|
|
|
|
#define KVMPPC_RMAP_INDEX 0xfffffffful
|
|
|
|
|
2012-02-08 11:02:18 +07:00
|
|
|
struct kvm_arch_memory_slot {
|
2013-10-07 23:47:52 +07:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
|
2012-08-01 16:03:28 +07:00
|
|
|
unsigned long *rmap;
|
2013-10-07 23:47:52 +07:00
|
|
|
#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
|
2012-02-08 11:02:18 +07:00
|
|
|
};
|
|
|
|
|
2016-12-20 12:49:00 +07:00
|
|
|
struct kvm_hpt_info {
|
|
|
|
/* Host virtual (linear mapping) address of guest HPT */
|
|
|
|
unsigned long virt;
|
|
|
|
/* Array of reverse mapping entries for each guest HPTE */
|
|
|
|
struct revmap_entry *rev;
|
|
|
|
/* Guest HPT size is 2**(order) bytes */
|
|
|
|
u32 order;
|
|
|
|
/* 1 if HPT allocated with CMA, 0 otherwise */
|
|
|
|
int cma;
|
|
|
|
};
|
|
|
|
|
2016-12-20 12:49:05 +07:00
|
|
|
struct kvm_resize_hpt;
|
|
|
|
|
2019-11-25 10:06:26 +07:00
|
|
|
/* Flag values for kvm_arch.secure_guest */
|
|
|
|
#define KVMPPC_SECURE_INIT_START 0x1 /* H_SVM_INIT_START has been called */
|
|
|
|
#define KVMPPC_SECURE_INIT_DONE 0x2 /* H_SVM_INIT_DONE completed */
|
2020-01-07 09:02:37 +07:00
|
|
|
#define KVMPPC_SECURE_INIT_ABORT 0x4 /* H_SVM_INIT_ABORT issued */
|
2019-11-25 10:06:26 +07:00
|
|
|
|
2008-04-17 11:28:09 +07:00
|
|
|
struct kvm_arch {
|
2011-12-20 22:34:43 +07:00
|
|
|
unsigned int lpid;
|
KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode
This allows userspace to set the desired virtual SMT (simultaneous
multithreading) mode for a VM, that is, the number of VCPUs that
get assigned to each virtual core. Previously, the virtual SMT mode
was fixed to the number of threads per subcore, and if userspace
wanted to have fewer vcpus per vcore, then it would achieve that by
using a sparse CPU numbering. This had the disadvantage that the
vcpu numbers can get quite large, particularly for SMT1 guests on
a POWER8 with 8 threads per core. With this patch, userspace can
set its desired virtual SMT mode and then use contiguous vcpu
numbering.
On POWER8, where the threading mode is "strict", the virtual SMT mode
must be less than or equal to the number of threads per subcore. On
POWER9, which implements a "loose" threading mode, the virtual SMT
mode can be any power of 2 between 1 and 8, even though there is
effectively one thread per subcore, since the threads are independent
and can all be in different partitions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-06 09:24:41 +07:00
|
|
|
unsigned int smt_mode; /* # vcpus per virtual core */
|
KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9
On POWER9, we no longer have the restriction that we had on POWER8
where all threads in a core have to be in the same partition, so
the CPU threads are now independent. However, we still want to be
able to run guests with a virtual SMT topology, if only to allow
migration of guests from POWER8 systems to POWER9.
A guest that has a virtual SMT mode greater than 1 will expect to
be able to use the doorbell facility; it will expect the msgsndp
and msgclrp instructions to work appropriately and to be able to read
sensible values from the TIR (thread identification register) and
DPDES (directed privileged doorbell exception status) special-purpose
registers. However, since each CPU thread is a separate sub-processor
in POWER9, these instructions and registers can only be used within
a single CPU thread.
In order for these instructions to appear to act correctly according
to the guest's virtual SMT mode, we have to trap and emulate them.
We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
register. The emulation is triggered by the hypervisor facility
unavailable interrupt that occurs when the guest uses them.
To cause a doorbell interrupt to occur within the guest, we set the
DPDES register to 1. If the guest has interrupts enabled, the CPU
will generate a doorbell interrupt and clear the DPDES register in
hardware. The DPDES hardware register for the guest is saved in the
vcpu->arch.vcore->dpdes field. Since this gets written by the guest
exit code, other VCPUs wishing to cause a doorbell interrupt don't
write that field directly, but instead set a vcpu->arch.doorbell_request
flag. This is consumed and set to 0 by the guest entry code, which
then sets DPDES to 1.
Emulating reads of the DPDES register is somewhat involved, because
it requires reading the doorbell pending interrupt status of all of the
VCPU threads in the virtual core, and if any of those VCPUs are
running, their doorbell status is only up-to-date in the hardware
DPDES registers of the CPUs where they are running. In order to get
a reasonable approximation of the current doorbell status, we send
those CPUs an IPI, causing an exit from the guest which will update
the vcpu->arch.vcore->dpdes field. We then use that value in
constructing the emulated DPDES register value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-16 13:41:20 +07:00
|
|
|
unsigned int emul_smt_mode; /* emualted SMT mode, on P9 */
|
2013-10-07 23:47:52 +07:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
|
2016-11-18 04:28:51 +07:00
|
|
|
unsigned int tlb_sets;
|
2016-12-20 12:49:00 +07:00
|
|
|
struct kvm_hpt_info hpt;
|
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().
In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-04 12:55:12 +07:00
|
|
|
atomic64_t mmio_update;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
unsigned int host_lpid;
|
|
|
|
unsigned long host_lpcr;
|
|
|
|
unsigned long sdr1;
|
|
|
|
unsigned long host_sdr1;
|
KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests
This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.
Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.
KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.
This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.
Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.
This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.
Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:25:44 +07:00
|
|
|
unsigned long lpcr;
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 19:36:37 +07:00
|
|
|
unsigned long vrma_slb_v;
|
2017-09-13 12:53:48 +07:00
|
|
|
int mmu_ready;
|
KVM: PPC: Book3S HV: Make the guest hash table size configurable
This adds a new ioctl to enable userspace to control the size of the guest
hashed page table (HPT) and to clear it out when resetting the guest.
The KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter
a pointer to a u32 containing the desired order of the HPT (log base 2
of the size in bytes), which is updated on successful return to the
actual order of the HPT which was allocated.
There must be no vcpus running at the time of this ioctl. To enforce
this, we now keep a count of the number of vcpus running in
kvm->arch.vcpus_running.
If the ioctl is called when a HPT has already been allocated, we don't
reallocate the HPT but just clear it out. We first clear the
kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
the kvm->lock mutex, it will prevent any vcpus from starting to run until
we're done, and (b) it means that the first vcpu to run after we're done
will re-establish the VRMA if necessary.
If userspace doesn't call this ioctl before running the first vcpu, the
kernel will allocate a default-sized HPT at that point. We do it then
rather than when creating the VM, as the code did previously, so that
userspace has a chance to do the ioctl if it wants.
When allocating the HPT, we can allocate either from the kernel page
allocator, or from the preallocated pool. If userspace is asking for
a different size from the preallocated HPTs, we first try to allocate
using the kernel page allocator. Then we try to allocate from the
preallocated pool, and then if that fails, we try allocating decreasing
sizes from the kernel page allocator, down to the minimum size allowed
(256kB). Note that the kernel page allocator limits allocations to
1 << CONFIG_FORCE_MAX_ZONEORDER pages, which by default corresponds to
16MB (on 64-bit powerpc, at least).
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix module compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-05-04 09:32:53 +07:00
|
|
|
atomic_t vcpus_running;
|
KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
When we change or remove a HPT (hashed page table) entry, we can do
either a global TLB invalidation (tlbie) that works across the whole
machine, or a local invalidation (tlbiel) that only affects this core.
Currently we do local invalidations if the VM has only one vcpu or if
the guest requests it with the H_LOCAL flag, though the guest Linux
kernel currently doesn't ever use H_LOCAL. Then, to cope with the
possibility that vcpus moving around to different physical cores might
expose stale TLB entries, there is some code in kvmppc_hv_entry to
flush the whole TLB of entries for this VM if either this vcpu is now
running on a different physical core from where it last ran, or if this
physical core last ran a different vcpu.
There are a number of problems on POWER7 with this as it stands:
- The TLB invalidation is done per thread, whereas it only needs to be
done per core, since the TLB is shared between the threads.
- With the possibility of the host paging out guest pages, the use of
H_LOCAL by an SMP guest is dangerous since the guest could possibly
retain and use a stale TLB entry pointing to a page that had been
removed from the guest.
- The TLB invalidations that we do when a vcpu moves from one physical
core to another are unnecessary in the case of an SMP guest that isn't
using H_LOCAL.
- The optimization of using local invalidations rather than global should
apply to guests with one virtual core, not just one vcpu.
(None of this applies on PPC970, since there we always have to
invalidate the whole TLB when entering and leaving the guest, and we
can't support paging out guest memory.)
To fix these problems and simplify the code, we now maintain a simple
cpumask of which cpus need to flush the TLB on entry to the guest.
(This is indexed by cpu, though we only ever use the bits for thread
0 of each core.) Whenever we do a local TLB invalidation, we set the
bits for every cpu except the bit for thread 0 of the core that we're
currently running on. Whenever we enter a guest, we test and clear the
bit for our core, and flush the TLB if it was set.
On initial startup of the VM, and when resetting the HPT, we set all the
bits in the need_tlb_flush cpumask, since any core could potentially have
stale TLB entries from the previous VM to use the same LPID, or the
previous contents of the HPT.
Then, we maintain a count of the number of online virtual cores, and use
that when deciding whether to use a local invalidation rather than the
number of online vcpus. The code to make that decision is extracted out
into a new function, global_invalidates(). For multi-core guests on
POWER7 (i.e. when we are using mmu notifiers), we now never do local
invalidations regardless of the H_LOCAL flag.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-11-22 06:28:08 +07:00
|
|
|
u32 online_vcores;
|
2012-11-20 05:52:49 +07:00
|
|
|
atomic_t hpte_mod_interest;
|
KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
When we change or remove a HPT (hashed page table) entry, we can do
either a global TLB invalidation (tlbie) that works across the whole
machine, or a local invalidation (tlbiel) that only affects this core.
Currently we do local invalidations if the VM has only one vcpu or if
the guest requests it with the H_LOCAL flag, though the guest Linux
kernel currently doesn't ever use H_LOCAL. Then, to cope with the
possibility that vcpus moving around to different physical cores might
expose stale TLB entries, there is some code in kvmppc_hv_entry to
flush the whole TLB of entries for this VM if either this vcpu is now
running on a different physical core from where it last ran, or if this
physical core last ran a different vcpu.
There are a number of problems on POWER7 with this as it stands:
- The TLB invalidation is done per thread, whereas it only needs to be
done per core, since the TLB is shared between the threads.
- With the possibility of the host paging out guest pages, the use of
H_LOCAL by an SMP guest is dangerous since the guest could possibly
retain and use a stale TLB entry pointing to a page that had been
removed from the guest.
- The TLB invalidations that we do when a vcpu moves from one physical
core to another are unnecessary in the case of an SMP guest that isn't
using H_LOCAL.
- The optimization of using local invalidations rather than global should
apply to guests with one virtual core, not just one vcpu.
(None of this applies on PPC970, since there we always have to
invalidate the whole TLB when entering and leaving the guest, and we
can't support paging out guest memory.)
To fix these problems and simplify the code, we now maintain a simple
cpumask of which cpus need to flush the TLB on entry to the guest.
(This is indexed by cpu, though we only ever use the bits for thread
0 of each core.) Whenever we do a local TLB invalidation, we set the
bits for every cpu except the bit for thread 0 of the core that we're
currently running on. Whenever we enter a guest, we test and clear the
bit for our core, and flush the TLB if it was set.
On initial startup of the VM, and when resetting the HPT, we set all the
bits in the need_tlb_flush cpumask, since any core could potentially have
stale TLB entries from the previous VM to use the same LPID, or the
previous contents of the HPT.
Then, we maintain a count of the number of online virtual cores, and use
that when deciding whether to use a local invalidation rather than the
number of online vcpus. The code to make that decision is extracted out
into a new function, global_invalidates(). For multi-core guests on
POWER7 (i.e. when we are using mmu notifiers), we now never do local
invalidations regardless of the H_LOCAL flag.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-11-22 06:28:08 +07:00
|
|
|
cpumask_t need_tlb_flush;
|
KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement
With radix, the guest can do TLB invalidations itself using the tlbie
(global) and tlbiel (local) TLB invalidation instructions. Linux guests
use local TLB invalidations for translations that have only ever been
accessed on one vcpu. However, that doesn't mean that the translations
have only been accessed on one physical cpu (pcpu) since vcpus can move
around from one pcpu to another. Thus a tlbiel might leave behind stale
TLB entries on a pcpu where the vcpu previously ran, and if that task
then moves back to that previous pcpu, it could see those stale TLB
entries and thus access memory incorrectly. The usual symptom of this
is random segfaults in userspace programs in the guest.
To cope with this, we detect when a vcpu is about to start executing on
a thread in a core that is a different core from the last time it
executed. If that is the case, then we mark the core as needing a
TLB flush and then send an interrupt to any thread in the core that is
currently running a vcpu from the same guest. This will get those vcpus
out of the guest, and the first one to re-enter the guest will do the
TLB flush. The reason for interrupting the vcpus executing on the old
core is to cope with the following scenario:
CPU 0 CPU 1 CPU 4
(core 0) (core 0) (core 1)
VCPU 0 runs task X VCPU 1 runs
core 0 TLB gets
entries from task X
VCPU 0 moves to CPU 4
VCPU 0 runs task X
Unmap pages of task X
tlbiel
(still VCPU 1) task X moves to VCPU 1
task X runs
task X sees stale TLB
entries
That is, as soon as the VCPU starts executing on the new core, it
could unmap and tlbiel some page table entries, and then the task
could migrate to one of the VCPUs running on the old core and
potentially see stale TLB entries.
Since the TLB is shared between all the threads in a core, we only
use the bit of kvm->arch.need_tlb_flush corresponding to the first
thread in the core. To ensure that we don't have a window where we
can miss a flush, this moves the clearing of the bit from before the
actual flush to after it. This way, two threads might both do the
flush, but we prevent the situation where one thread can enter the
guest before the flush is finished.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30 17:21:50 +07:00
|
|
|
cpumask_t cpu_in_guest;
|
2017-01-30 17:21:44 +07:00
|
|
|
u8 radix;
|
2017-05-11 18:02:48 +07:00
|
|
|
u8 fwnmi_enabled;
|
2019-08-22 10:48:38 +07:00
|
|
|
u8 secure_guest;
|
KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded mode
This patch allows for a mode on POWER9 hosts where we control all the
threads of a core, much as we do on POWER8. The mode is controlled by
a module parameter on the kvm_hv module, called "indep_threads_mode".
The normal mode on POWER9 is the "independent threads" mode, with
indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any
desired SMT mode) and each thread independently enters and exits from
KVM guests without reference to what other threads in the core are
doing.
If indep_threads_mode is set to N at the point when a VM is started,
KVM will expect every core that the guest runs on to be in single
threaded mode (that is, threads 1, 2 and 3 offline), and will set the
flag that prevents secondary threads from coming online. We can still
use all four threads; the code that implements dynamic micro-threading
on POWER8 will become active in over-commit situations and will allow
up to three other VCPUs to be run on the secondary threads of the core
whenever a VCPU is run.
The reason for wanting this mode is that this will allow us to run HPT
guests on a radix host on a POWER9 machine that does not support
"mixed mode", that is, having some threads in a core be in HPT mode
while other threads are in radix mode. It will also make it possible
to implement a "strict threads" mode in future, if desired.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-16 12:11:57 +07:00
|
|
|
bool threads_indep;
|
2018-10-08 12:31:03 +07:00
|
|
|
bool nested_enable;
|
2017-01-30 17:21:44 +07:00
|
|
|
pgd_t *pgtable;
|
2017-01-30 17:21:42 +07:00
|
|
|
u64 process_table;
|
2015-03-28 10:21:01 +07:00
|
|
|
struct dentry *debugfs_dir;
|
|
|
|
struct dentry *htab_dentry;
|
2018-10-08 12:30:57 +07:00
|
|
|
struct dentry *radix_dentry;
|
2016-12-20 12:49:05 +07:00
|
|
|
struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
|
2013-10-07 23:47:52 +07:00
|
|
|
#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
|
2013-10-07 23:47:51 +07:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
|
KVM: PPC: Book3S PR: Make HPT accesses and updates SMP-safe
This adds a per-VM mutex to provide mutual exclusion between vcpus
for accesses to and updates of the guest hashed page table (HPT).
This also makes the code use single-byte writes to the HPT entry
when updating of the reference (R) and change (C) bits. The reason
for doing this, rather than writing back the whole HPTE, is that on
non-PAPR virtual machines, the guest OS might be writing to the HPTE
concurrently, and writing back the whole HPTE might conflict with
that. Also, real hardware does single-byte writes to update R and C.
The new mutex is taken in kvmppc_mmu_book3s_64_xlate() when reading
the HPT and updating R and/or C, and in the PAPR HPT update hcalls
(H_ENTER, H_REMOVE, etc.). Having the mutex means that we don't need
to use a hypervisor lock bit in the HPT update hcalls, and we don't
need to be careful about the order in which the bytes of the HPTE are
updated by those hcalls.
The other change here is to make emulated TLB invalidations (tlbie)
effective across all vcpus. To do this we call kvmppc_mmu_pte_vflush
for all vcpus in kvmppc_ppc_book3s_64_tlbie().
For 32-bit, this makes the setting of the accessed and dirty bits use
single-byte writes, and makes tlbie invalidate shadow HPTEs for all
vcpus.
With this, PR KVM can successfully run SMP guests.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-09-20 11:52:48 +07:00
|
|
|
struct mutex hpt_mutex;
|
|
|
|
#endif
|
2012-03-16 04:58:34 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S_64
|
|
|
|
struct list_head spapr_tce_tables;
|
2013-04-18 03:30:00 +07:00
|
|
|
struct list_head rtas_tokens;
|
2019-05-29 08:54:00 +07:00
|
|
|
struct mutex rtas_token_lock;
|
2014-06-02 08:02:59 +07:00
|
|
|
DECLARE_BITMAP(enabled_hcalls, MAX_HCALL_OPCODE/4 + 1);
|
2012-03-16 04:58:34 +07:00
|
|
|
#endif
|
2013-04-16 22:42:19 +07:00
|
|
|
#ifdef CONFIG_KVM_MPIC
|
|
|
|
struct openpic *mpic;
|
|
|
|
#endif
|
2013-04-18 03:30:26 +07:00
|
|
|
#ifdef CONFIG_KVM_XICS
|
|
|
|
struct kvmppc_xics *xics;
|
2019-04-18 17:39:42 +07:00
|
|
|
struct kvmppc_xive *xive; /* Current XIVE device in use */
|
|
|
|
struct {
|
|
|
|
struct kvmppc_xive *native;
|
|
|
|
struct kvmppc_xive *xics_on_xive;
|
|
|
|
} xive_devices;
|
2016-08-19 12:35:48 +07:00
|
|
|
struct kvmppc_passthru_irqmap *pimap;
|
2013-04-18 03:30:26 +07:00
|
|
|
#endif
|
2013-10-07 23:48:01 +07:00
|
|
|
struct kvmppc_ops *kvm_ops;
|
2014-07-04 17:52:51 +07:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
|
2019-11-25 10:06:26 +07:00
|
|
|
struct mutex uvmem_lock;
|
|
|
|
struct list_head uvmem_pfns;
|
KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setup
Currently the HV KVM code uses kvm->lock in conjunction with a flag,
kvm->arch.mmu_ready, to synchronize MMU setup and hold off vcpu
execution until the MMU-related data structures are ready. However,
this means that kvm->lock is being taken inside vcpu->mutex, which
is contrary to Documentation/virtual/kvm/locking.txt and results in
lockdep warnings.
To fix this, we add a new mutex, kvm->arch.mmu_setup_lock, which nests
inside the vcpu mutexes, and is taken in the places where kvm->lock
was taken that are related to MMU setup.
Additionally we take the new mutex in the vcpu creation code at the
point where we are creating a new vcore, in order to provide mutual
exclusion with kvmppc_update_lpcr() and ensure that an update to
kvm->arch.lpcr doesn't get missed, which could otherwise lead to a
stale vcore->lpcr value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-23 13:35:34 +07:00
|
|
|
struct mutex mmu_setup_lock; /* nests inside vcpu mutexes */
|
2018-10-08 12:31:03 +07:00
|
|
|
u64 l1_ptcr;
|
|
|
|
int max_nested_lpid;
|
|
|
|
struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];
|
2014-07-04 17:52:51 +07:00
|
|
|
/* This array can grow quite large, keep it at the end */
|
|
|
|
struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
|
|
|
|
#endif
|
2008-04-17 11:28:09 +07:00
|
|
|
};
|
|
|
|
|
2015-03-28 10:21:09 +07:00
|
|
|
#define VCORE_ENTRY_MAP(vc) ((vc)->entry_exit_map & 0xff)
|
|
|
|
#define VCORE_EXIT_MAP(vc) ((vc)->entry_exit_map >> 8)
|
|
|
|
#define VCORE_IS_EXITING(vc) (VCORE_EXIT_MAP(vc) != 0)
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
|
KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip. Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core. With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.
Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.
To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread. In addition we extend the core_info struct to
have information on each subcore. When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.
Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode. It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.
This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap. In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered. The default is 6.
With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch. These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.
It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest. In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path. In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-07-02 17:38:16 +07:00
|
|
|
/* This bit is used when a vcore exit is triggered from outside the vcore */
|
|
|
|
#define VCORE_EXIT_REQ 0x10000
|
|
|
|
|
KVM: PPC: Book3S HV: Make use of unused threads when running guests
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused. This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met. This applies on POWER7, and on POWER8 to guests
with one thread per virtual core. (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)
The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution. If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.
After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.
This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on. In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-06-24 18:18:03 +07:00
|
|
|
/*
|
|
|
|
* Values for vcore_state.
|
|
|
|
* Note that these are arranged such that lower values
|
|
|
|
* (< VCORE_SLEEPING) don't require stolen time accounting
|
|
|
|
* on load/unload, and higher values do.
|
|
|
|
*/
|
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 14:42:46 +07:00
|
|
|
#define VCORE_INACTIVE 0
|
KVM: PPC: Book3S HV: Make use of unused threads when running guests
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused. This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met. This applies on POWER7, and on POWER8 to guests
with one thread per virtual core. (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)
The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution. If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.
After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.
This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on. In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-06-24 18:18:03 +07:00
|
|
|
#define VCORE_PREEMPT 1
|
|
|
|
#define VCORE_PIGGYBACK 2
|
|
|
|
#define VCORE_SLEEPING 3
|
|
|
|
#define VCORE_RUNNING 4
|
|
|
|
#define VCORE_EXITING 5
|
KVM: PPC: Book3S HV: Implement halt polling
This patch introduces new halt polling functionality into the kvm_hv kernel
module. When a vcore is idle it will poll for some period of time before
scheduling itself out.
When all of the runnable vcpus on a vcore have ceded (and thus the vcore is
idle) we schedule ourselves out to allow something else to run. In the
event that we need to wake up very quickly (for example an interrupt
arrives), we are required to wait until we get scheduled again.
Implement halt polling so that when a vcore is idle, and before scheduling
ourselves, we poll for vcpus in the runnable_threads list which have
pending exceptions or which leave the ceded state. If we poll successfully
then we can get back into the guest very quickly without ever scheduling
ourselves, otherwise we schedule ourselves out as before.
There exists generic halt_polling code in virt/kvm_main.c, however on
powerpc the polling conditions are different to the generic case. It would
be nice if we could just implement an arch specific kvm_check_block()
function, but there is still other arch specific things which need to be
done for kvm_hv (for example manipulating vcore states) which means that a
separate implementation is the best option.
Testing of this patch with a TCP round robin test between two guests with
virtio network interfaces has found a decrease in round trip time of ~15us
on average. A performance gain is only seen when going out of and
back into the guest often and quickly, otherwise there is no net benefit
from the polling. The polling interval is adjusted such that when we are
often scheduled out for long periods of time it is reduced, and when we
often poll successfully it is increased. The rate at which the polling
interval increases or decreases, and the maximum polling interval, can
be set through module parameters.
Based on the implementation in the generic kvm module by Wanpeng Li and
Paolo Bonzini, and on direction from Paul Mackerras.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-08-02 11:03:21 +07:00
|
|
|
#define VCORE_POLLING 6
|
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 14:42:46 +07:00
|
|
|
|
KVM: PPC: Book3S HV: Make virtual processor area registration more robust
The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU. Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration. If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.
This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used. Each area now has a struct kvmppc_vpa which is
used to manage these updates. There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields. (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)
This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.
Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
the need to initialize vcpu->arch.vpa_update_lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-20 00:46:32 +07:00
|
|
|
/*
|
|
|
|
* Struct used to manage memory for a virtual processor area
|
|
|
|
* registered by a PAPR guest. There are three types of area
|
|
|
|
* that a guest can register.
|
|
|
|
*/
|
|
|
|
struct kvmppc_vpa {
|
KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map
At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
done by the host to the virtual processor areas (VPAs) and dispatch
trace logs (DTLs) registered by the guest. This is because those
modifications are done either in real mode or in the host kernel
context, and in neither case does the access go through the guest's
HPT, and thus no change (C) bit gets set in the guest's HPT.
However, the changes done by the host do need to be tracked so that
the modified pages get transferred when doing live migration. In
order to track these modifications, this adds a dirty flag to the
struct representing the VPA/DTL areas, and arranges to set the flag
when the VPA/DTL gets modified by the host. Then, when we are
collecting the dirty log, we also check the dirty flags for the
VPA and DTL for each vcpu and set the relevant bit in the dirty log
if necessary. Doing this also means we now need to keep track of
the guest physical address of the VPA/DTL areas.
So as not to lose track of modifications to a VPA/DTL area when it gets
unregistered, or when a new area gets registered in its place, we need
to transfer the dirty state to the rmap chain. This adds code to
kvmppc_unpin_guest_page() to do that if the area was dirty. To simplify
that code, we now require that all VPA, DTL and SLB shadow buffer areas
fit within a single host page. Guests already comply with this
requirement because pHyp requires that these areas not cross a 4k
boundary.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-19 02:51:04 +07:00
|
|
|
unsigned long gpa; /* Current guest phys addr */
|
KVM: PPC: Book3S HV: Make virtual processor area registration more robust
The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU. Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration. If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.
This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used. Each area now has a struct kvmppc_vpa which is
used to manage these updates. There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields. (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)
This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.
Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
the need to initialize vcpu->arch.vpa_update_lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-20 00:46:32 +07:00
|
|
|
void *pinned_addr; /* Address in kernel linear mapping */
|
|
|
|
void *pinned_end; /* End of region */
|
|
|
|
unsigned long next_gpa; /* Guest phys addr for update */
|
|
|
|
unsigned long len; /* Number of bytes required */
|
|
|
|
u8 update_pending; /* 1 => update pinned_addr from next_gpa */
|
KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map
At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
done by the host to the virtual processor areas (VPAs) and dispatch
trace logs (DTLs) registered by the guest. This is because those
modifications are done either in real mode or in the host kernel
context, and in neither case does the access go through the guest's
HPT, and thus no change (C) bit gets set in the guest's HPT.
However, the changes done by the host do need to be tracked so that
the modified pages get transferred when doing live migration. In
order to track these modifications, this adds a dirty flag to the
struct representing the VPA/DTL areas, and arranges to set the flag
when the VPA/DTL gets modified by the host. Then, when we are
collecting the dirty log, we also check the dirty flags for the
VPA and DTL for each vcpu and set the relevant bit in the dirty log
if necessary. Doing this also means we now need to keep track of
the guest physical address of the VPA/DTL areas.
So as not to lose track of modifications to a VPA/DTL area when it gets
unregistered, or when a new area gets registered in its place, we need
to transfer the dirty state to the rmap chain. This adds code to
kvmppc_unpin_guest_page() to do that if the area was dirty. To simplify
that code, we now require that all VPA, DTL and SLB shadow buffer areas
fit within a single host page. Guests already comply with this
requirement because pHyp requires that these areas not cross a 4k
boundary.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-19 02:51:04 +07:00
|
|
|
bool dirty; /* true => area has been modified by kernel */
|
KVM: PPC: Book3S HV: Make virtual processor area registration more robust
The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU. Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration. If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.
This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used. Each area now has a struct kvmppc_vpa which is
used to manage these updates. There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields. (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)
This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.
Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
the need to initialize vcpu->arch.vpa_update_lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-20 00:46:32 +07:00
|
|
|
};
|
|
|
|
|
2009-10-30 12:47:04 +07:00
|
|
|
struct kvmppc_pte {
|
2010-04-20 07:49:46 +07:00
|
|
|
ulong eaddr;
|
2009-10-30 12:47:04 +07:00
|
|
|
u64 vpage;
|
2010-04-20 07:49:46 +07:00
|
|
|
ulong raddr;
|
2010-03-25 03:48:36 +07:00
|
|
|
bool may_read : 1;
|
|
|
|
bool may_write : 1;
|
|
|
|
bool may_execute : 1;
|
2017-03-24 13:49:22 +07:00
|
|
|
unsigned long wimg;
|
2018-10-08 12:31:07 +07:00
|
|
|
unsigned long rc;
|
2013-09-20 11:52:44 +07:00
|
|
|
u8 page_size; /* MMU_PAGE_xxx */
|
2018-10-08 12:31:07 +07:00
|
|
|
u8 page_shift;
|
2009-10-30 12:47:04 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
struct kvmppc_mmu {
|
|
|
|
/* book3s_64 only */
|
|
|
|
void (*slbmte)(struct kvm_vcpu *vcpu, u64 rb, u64 rs);
|
|
|
|
u64 (*slbmfee)(struct kvm_vcpu *vcpu, u64 slb_nr);
|
|
|
|
u64 (*slbmfev)(struct kvm_vcpu *vcpu, u64 slb_nr);
|
2019-02-04 15:06:18 +07:00
|
|
|
int (*slbfee)(struct kvm_vcpu *vcpu, gva_t eaddr, ulong *ret_slb);
|
2009-10-30 12:47:04 +07:00
|
|
|
void (*slbie)(struct kvm_vcpu *vcpu, u64 slb_nr);
|
|
|
|
void (*slbia)(struct kvm_vcpu *vcpu);
|
|
|
|
/* book3s */
|
|
|
|
void (*mtsrin)(struct kvm_vcpu *vcpu, u32 srnum, ulong value);
|
|
|
|
u32 (*mfsrin)(struct kvm_vcpu *vcpu, u32 srnum);
|
KVM: PPC: Book3S PR: Better handling of host-side read-only pages
Currently we request write access to all pages that get mapped into the
guest, even if the guest is only loading from the page. This reduces
the effectiveness of KSM because it means that we unshare every page we
access. Also, we always set the changed (C) bit in the guest HPTE if
it allows writing, even for a guest load.
This fixes both these problems. We pass an 'iswrite' flag to the
mmu.xlate() functions and to kvmppc_mmu_map_page() to indicate whether
the access is a load or a store. The mmu.xlate() functions now only
set C for stores. kvmppc_gfn_to_pfn() now calls gfn_to_pfn_prot()
instead of gfn_to_pfn() so that it can indicate whether we need write
access to the page, and get back a 'writable' flag to indicate whether
the page is writable or not. If that 'writable' flag is clear, we then
make the host HPTE read-only even if the guest HPTE allowed writing.
This means that we can get a protection fault when the guest writes to a
page that it has mapped read-write but which is read-only on the host
side (perhaps due to KSM having merged the page). Thus we now call
kvmppc_handle_pagefault() for protection faults as well as HPTE not found
faults. In kvmppc_handle_pagefault(), if the access was allowed by the
guest HPTE and we thus need to install a new host HPTE, we then need to
remove the old host HPTE if there is one. This is done with a new
function, kvmppc_mmu_unmap_page(), which uses kvmppc_mmu_pte_vflush() to
find and remove the old host HPTE.
Since the memslot-related functions require the KVM SRCU read lock to
be held, this adds srcu_read_lock/unlock pairs around the calls to
kvmppc_handle_pagefault().
Finally, this changes kvmppc_mmu_book3s_32_xlate_pte() to not ignore
guest HPTEs that don't permit access, and to return -EPERM for accesses
that are not permitted by the page protections.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-09-20 11:52:51 +07:00
|
|
|
int (*xlate)(struct kvm_vcpu *vcpu, gva_t eaddr,
|
|
|
|
struct kvmppc_pte *pte, bool data, bool iswrite);
|
2009-10-30 12:47:04 +07:00
|
|
|
void (*tlbie)(struct kvm_vcpu *vcpu, ulong addr, bool large);
|
2010-04-20 07:49:46 +07:00
|
|
|
int (*esid_to_vsid)(struct kvm_vcpu *vcpu, ulong esid, u64 *vsid);
|
2009-10-30 12:47:04 +07:00
|
|
|
u64 (*ea_to_vp)(struct kvm_vcpu *vcpu, gva_t eaddr, bool data);
|
|
|
|
bool (*is_dcbz32)(struct kvm_vcpu *vcpu);
|
|
|
|
};
|
|
|
|
|
2011-06-29 07:17:33 +07:00
|
|
|
struct kvmppc_slb {
|
|
|
|
u64 esid;
|
|
|
|
u64 vsid;
|
|
|
|
u64 orige;
|
|
|
|
u64 origv;
|
|
|
|
bool valid : 1;
|
|
|
|
bool Ks : 1;
|
|
|
|
bool Kp : 1;
|
|
|
|
bool nx : 1;
|
|
|
|
bool large : 1; /* PTEs are 16MB */
|
|
|
|
bool tb : 1; /* 1TB segment */
|
|
|
|
bool class : 1;
|
2013-09-20 11:52:44 +07:00
|
|
|
u8 base_page_size; /* MMU_PAGE_xxx */
|
2009-10-30 12:47:04 +07:00
|
|
|
};
|
|
|
|
|
KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
This reads the timebase at various points in the real-mode guest
entry/exit code and uses that to accumulate total, minimum and
maximum time spent in those parts of the code. Currently these
times are accumulated per vcpu in 5 parts of the code:
* rm_entry - time taken from the start of kvmppc_hv_entry() until
just before entering the guest.
* rm_intr - time from when we take a hypervisor interrupt in the
guest until we either re-enter the guest or decide to exit to the
host. This includes time spent handling hcalls in real mode.
* rm_exit - time from when we decide to exit the guest until the
return from kvmppc_hv_entry().
* guest - time spend in the guest
* cede - time spent napping in real mode due to an H_CEDE hcall
while other threads in the same vcore are active.
These times are exposed in debugfs in a directory per vcpu that
contains a file called "timings". This file contains one line for
each of the 5 timings above, with the name followed by a colon and
4 numbers, which are the count (number of times the code has been
executed), the total time, the minimum time, and the maximum time,
all in nanoseconds.
The overhead of the extra code amounts to about 30ns for an hcall that
is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since
production environments may not wish to incur this overhead, the new
code is conditional on a new config symbol,
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-28 10:21:02 +07:00
|
|
|
/* Struct used to accumulate timing information in HV real mode code */
|
|
|
|
struct kvmhv_tb_accumulator {
|
|
|
|
u64 seqcount; /* used to synchronize access, also count * 2 */
|
|
|
|
u64 tb_total; /* total time in timebase ticks */
|
|
|
|
u64 tb_min; /* min time */
|
|
|
|
u64 tb_max; /* max time */
|
|
|
|
};
|
|
|
|
|
2016-08-19 12:35:48 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S_64
|
|
|
|
struct kvmppc_irq_map {
|
|
|
|
u32 r_hwirq;
|
|
|
|
u32 v_hwirq;
|
|
|
|
struct irq_desc *desc;
|
|
|
|
};
|
|
|
|
|
|
|
|
#define KVMPPC_PIRQ_MAPPED 1024
|
|
|
|
struct kvmppc_passthru_irqmap {
|
|
|
|
int n_mapped;
|
|
|
|
struct kvmppc_irq_map mapped[KVMPPC_PIRQ_MAPPED];
|
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
2012-08-09 04:17:55 +07:00
|
|
|
# ifdef CONFIG_PPC_FSL_BOOK3E
|
|
|
|
#define KVMPPC_BOOKE_IAC_NUM 2
|
|
|
|
#define KVMPPC_BOOKE_DAC_NUM 2
|
|
|
|
# else
|
|
|
|
#define KVMPPC_BOOKE_IAC_NUM 4
|
|
|
|
#define KVMPPC_BOOKE_DAC_NUM 2
|
|
|
|
# endif
|
|
|
|
#define KVMPPC_BOOKE_MAX_IAC 4
|
|
|
|
#define KVMPPC_BOOKE_MAX_DAC 2
|
|
|
|
|
2013-04-12 21:08:46 +07:00
|
|
|
/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
|
|
|
|
#define KVMPPC_EPR_NONE 0 /* EPR not supported */
|
|
|
|
#define KVMPPC_EPR_USER 1 /* exit to userspace to fill EPR */
|
|
|
|
#define KVMPPC_EPR_KERNEL 2 /* in-kernel irqchip */
|
|
|
|
|
2013-04-12 21:08:47 +07:00
|
|
|
#define KVMPPC_IRQ_DEFAULT 0
|
|
|
|
#define KVMPPC_IRQ_MPIC 1
|
2017-04-05 14:54:56 +07:00
|
|
|
#define KVMPPC_IRQ_XICS 2 /* Includes a XIVE option */
|
2019-04-18 17:39:28 +07:00
|
|
|
#define KVMPPC_IRQ_XIVE 3 /* XIVE native exploitation mode */
|
2013-04-12 21:08:47 +07:00
|
|
|
|
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().
In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-04 12:55:12 +07:00
|
|
|
#define MMIO_HPTE_CACHE_SIZE 4
|
|
|
|
|
|
|
|
struct mmio_hpte_cache_entry {
|
|
|
|
unsigned long hpte_v;
|
|
|
|
unsigned long hpte_r;
|
|
|
|
unsigned long rpte;
|
|
|
|
unsigned long pte_index;
|
|
|
|
unsigned long eaddr;
|
|
|
|
unsigned long slb_v;
|
|
|
|
long mmio_update;
|
|
|
|
unsigned int slb_base_pshift;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct mmio_hpte_cache {
|
|
|
|
struct mmio_hpte_cache_entry entry[MMIO_HPTE_CACHE_SIZE];
|
|
|
|
unsigned int index;
|
|
|
|
};
|
|
|
|
|
KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-21 20:12:36 +07:00
|
|
|
#define KVMPPC_VSX_COPY_NONE 0
|
|
|
|
#define KVMPPC_VSX_COPY_WORD 1
|
|
|
|
#define KVMPPC_VSX_COPY_DWORD 2
|
|
|
|
#define KVMPPC_VSX_COPY_DWORD_LOAD_DUMP 3
|
2018-05-21 12:24:20 +07:00
|
|
|
#define KVMPPC_VSX_COPY_WORD_LOAD_DUMP 4
|
KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-21 20:12:36 +07:00
|
|
|
|
2018-05-21 12:24:25 +07:00
|
|
|
#define KVMPPC_VMX_COPY_BYTE 8
|
|
|
|
#define KVMPPC_VMX_COPY_HWORD 9
|
|
|
|
#define KVMPPC_VMX_COPY_WORD 10
|
|
|
|
#define KVMPPC_VMX_COPY_DWORD 11
|
|
|
|
|
2013-04-12 21:08:47 +07:00
|
|
|
struct openpic;
|
|
|
|
|
2017-04-05 14:54:56 +07:00
|
|
|
/* W0 and W1 of a XIVE thread management context */
|
|
|
|
union xive_tma_w01 {
|
|
|
|
struct {
|
|
|
|
u8 nsr;
|
|
|
|
u8 cppr;
|
|
|
|
u8 ipb;
|
|
|
|
u8 lsmfb;
|
|
|
|
u8 ack;
|
|
|
|
u8 inc;
|
|
|
|
u8 age;
|
|
|
|
u8 pipr;
|
|
|
|
};
|
|
|
|
__be64 w01;
|
|
|
|
};
|
|
|
|
|
2008-04-17 11:28:09 +07:00
|
|
|
struct kvm_vcpu_arch {
|
2009-10-30 12:47:04 +07:00
|
|
|
ulong host_stack;
|
2008-04-17 11:28:09 +07:00
|
|
|
u32 host_pid;
|
2010-04-16 05:11:42 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S
|
2011-06-29 07:17:33 +07:00
|
|
|
struct kvmppc_slb slb[64];
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
int slb_max; /* 1 + index of last valid entry in slb[] */
|
2011-06-29 07:17:33 +07:00
|
|
|
int slb_nr; /* total number of entries in SLB */
|
2009-10-30 12:47:04 +07:00
|
|
|
struct kvmppc_mmu mmu;
|
2013-09-20 11:52:49 +07:00
|
|
|
struct kvmppc_vcpu_book3s *book3s;
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_PPC_BOOK3S_32
|
|
|
|
struct kvmppc_book3s_shadow_vcpu *shadow_vcpu;
|
2009-10-30 12:47:04 +07:00
|
|
|
#endif
|
2008-04-17 11:28:09 +07:00
|
|
|
|
2018-05-07 13:20:07 +07:00
|
|
|
struct pt_regs regs;
|
2008-04-17 11:28:09 +07:00
|
|
|
|
2013-10-15 16:43:02 +07:00
|
|
|
struct thread_fp_state fp;
|
2010-01-15 20:49:11 +07:00
|
|
|
|
2011-06-15 06:34:31 +07:00
|
|
|
#ifdef CONFIG_SPE
|
|
|
|
ulong evr[32];
|
|
|
|
ulong spefscr;
|
|
|
|
ulong host_spefscr;
|
|
|
|
u64 acc;
|
|
|
|
#endif
|
2010-01-15 20:49:11 +07:00
|
|
|
#ifdef CONFIG_ALTIVEC
|
2013-10-15 16:43:02 +07:00
|
|
|
struct thread_vr_state vr;
|
2010-01-15 20:49:11 +07:00
|
|
|
#endif
|
|
|
|
|
2011-12-20 22:34:43 +07:00
|
|
|
#ifdef CONFIG_KVM_BOOKE_HV
|
|
|
|
u32 host_mas4;
|
|
|
|
u32 host_mas6;
|
|
|
|
u32 shadow_epcr;
|
|
|
|
u32 shadow_msrp;
|
|
|
|
u32 eplc;
|
|
|
|
u32 epsc;
|
|
|
|
u32 oldpir;
|
|
|
|
#endif
|
|
|
|
|
2012-12-01 20:50:26 +07:00
|
|
|
#if defined(CONFIG_BOOKE)
|
|
|
|
#if defined(CONFIG_KVM_BOOKE_HV) || defined(CONFIG_64BIT)
|
|
|
|
u32 epcr;
|
|
|
|
#endif
|
|
|
|
#endif
|
|
|
|
|
2010-02-19 17:00:27 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S
|
|
|
|
/* For Gekko paired singles */
|
|
|
|
u32 qpr[32];
|
|
|
|
#endif
|
|
|
|
|
2014-04-22 17:26:58 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S
|
2014-01-08 17:25:21 +07:00
|
|
|
ulong tar;
|
2014-04-22 17:26:58 +07:00
|
|
|
#endif
|
2010-01-08 08:58:03 +07:00
|
|
|
|
2010-04-16 05:11:42 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S
|
2009-10-30 12:47:04 +07:00
|
|
|
ulong hflags;
|
2010-01-15 20:49:11 +07:00
|
|
|
ulong guest_owned_ext;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
ulong purr;
|
|
|
|
ulong spurr;
|
2014-01-08 17:25:21 +07:00
|
|
|
ulong ic;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
ulong dscr;
|
|
|
|
ulong amr;
|
|
|
|
ulong uamor;
|
2014-01-08 17:25:21 +07:00
|
|
|
ulong iamr;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
u32 ctrl;
|
KVM: PPC: Book3S HV: Add support for DABRX register on POWER7
The DABRX (DABR extension) register on POWER7 processors provides finer
control over which accesses cause a data breakpoint interrupt. It
contains 3 bits which indicate whether to enable accesses in user,
kernel and hypervisor modes respectively to cause data breakpoint
interrupts, plus one bit that enables both real mode and virtual mode
accesses to cause interrupts. Currently, KVM sets DABRX to allow
both kernel and user accesses to cause interrupts while in the guest.
This adds support for the guest to specify other values for DABRX.
PAPR defines a H_SET_XDABR hcall to allow the guest to set both DABR
and DABRX with one call. This adds a real-mode implementation of
H_SET_XDABR, which shares most of its code with the existing H_SET_DABR
implementation. To support this, we add a per-vcpu field to store the
DABRX value plus code to get and set it via the ONE_REG interface.
For Linux guests to use this new hcall, userspace needs to add
"hcall-xdabr" to the set of strings in the /chosen/hypertas-functions
property in the device tree. If userspace does this and then migrates
the guest to a host where the kernel doesn't include this patch, then
userspace will need to implement H_SET_XDABR by writing the specified
DABR value to the DABR using the ONE_REG interface. In that case, the
old kernel will set DABRX to DABRX_USER | DABRX_KERNEL. That should
still work correctly, at least for Linux guests, since Linux guests
cope with getting data breakpoint interrupts in modes that weren't
requested by just ignoring the interrupt, and Linux guests never set
DABRX_BTI.
The other thing this does is to make H_SET_DABR and H_SET_XDABR work
on POWER8, which has the DAWR and DAWRX instead of DABR/X. Guests that
know about POWER8 should use H_SET_MODE rather than H_SET_[X]DABR, but
guests running in POWER7 compatibility mode will still use H_SET_[X]DABR.
For them, this adds the logic to convert DABR/X values into DAWR/X values
on POWER8.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-08 17:25:29 +07:00
|
|
|
u32 dabrx;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
ulong dabr;
|
2014-01-08 17:25:21 +07:00
|
|
|
ulong dawr;
|
|
|
|
ulong dawrx;
|
|
|
|
ulong ciabr;
|
2013-02-05 01:10:51 +07:00
|
|
|
ulong cfar;
|
2013-09-20 11:52:39 +07:00
|
|
|
ulong ppr;
|
2015-09-02 16:14:48 +07:00
|
|
|
u32 pspb;
|
2014-01-08 17:25:21 +07:00
|
|
|
ulong fscr;
|
2014-04-29 21:48:44 +07:00
|
|
|
ulong shadow_fscr;
|
2014-01-08 17:25:21 +07:00
|
|
|
ulong ebbhr;
|
|
|
|
ulong ebbrr;
|
|
|
|
ulong bescr;
|
|
|
|
ulong csigr;
|
|
|
|
ulong tacr;
|
|
|
|
ulong tcscr;
|
|
|
|
ulong acop;
|
|
|
|
ulong wort;
|
2016-11-18 09:11:42 +07:00
|
|
|
ulong tid;
|
|
|
|
ulong psscr;
|
2017-02-15 10:30:17 +07:00
|
|
|
ulong hfscr;
|
KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu
Currently PR-style KVM keeps the volatile guest register values
(R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
the main kvm_vcpu struct. For 64-bit, the shadow_vcpu exists in two
places, a kmalloc'd struct and in the PACA, and it gets copied back
and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
can't rely on being able to access the kmalloc'd struct.
This changes the code to copy the volatile values into the shadow_vcpu
as one of the last things done before entering the guest. Similarly
the values are copied back out of the shadow_vcpu to the kvm_vcpu
immediately after exiting the guest. We arrange for interrupts to be
still disabled at this point so that we can't get preempted on 64-bit
and end up copying values from the wrong PACA.
This means that the accessor functions in kvm_book3s.h for these
registers are greatly simplified, and are same between PR and HV KVM.
In places where accesses to shadow_vcpu fields are now replaced by
accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.
With this, the time to read the PVR one million times in a loop went
from 567.7ms to 575.5ms (averages of 6 values), an increase of about
1.4% for this worse-case test for guest entries and exits. The
standard deviation of the measurements is about 11ms, so the
difference is only marginally significant statistically.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-09-20 11:52:43 +07:00
|
|
|
ulong shadow_srr1;
|
2009-10-30 12:47:04 +07:00
|
|
|
#endif
|
2011-04-28 05:24:10 +07:00
|
|
|
u32 vrsave; /* also USPRG0 */
|
2008-04-17 11:28:09 +07:00
|
|
|
u32 mmucr;
|
2012-02-16 22:04:54 +07:00
|
|
|
/* shadow_msr is unused for BookE HV */
|
2011-06-15 06:34:29 +07:00
|
|
|
ulong shadow_msr;
|
2008-11-05 22:36:19 +07:00
|
|
|
ulong csrr0;
|
|
|
|
ulong csrr1;
|
|
|
|
ulong dsrr0;
|
|
|
|
ulong dsrr1;
|
2011-04-28 05:24:21 +07:00
|
|
|
ulong mcsrr0;
|
|
|
|
ulong mcsrr1;
|
|
|
|
ulong mcsr;
|
2017-05-22 13:55:16 +07:00
|
|
|
ulong dec;
|
2012-05-21 06:21:23 +07:00
|
|
|
#ifdef CONFIG_BOOKE
|
2008-04-17 11:28:09 +07:00
|
|
|
u32 decar;
|
2012-05-21 06:21:23 +07:00
|
|
|
#endif
|
2014-06-04 18:17:55 +07:00
|
|
|
/* Time base value when we entered the guest */
|
|
|
|
u64 entry_tb;
|
2014-06-05 19:08:02 +07:00
|
|
|
u64 entry_vtb;
|
2014-06-05 19:08:05 +07:00
|
|
|
u64 entry_ic;
|
2008-04-17 11:28:09 +07:00
|
|
|
u32 tcr;
|
2011-11-17 19:39:59 +07:00
|
|
|
ulong tsr; /* we need to perform set/clr_bits() which requires ulong */
|
2009-01-04 05:23:13 +07:00
|
|
|
u32 ivor[64];
|
2008-11-05 22:36:19 +07:00
|
|
|
ulong ivpr;
|
2009-10-30 12:47:04 +07:00
|
|
|
u32 pvr;
|
2008-07-26 01:54:53 +07:00
|
|
|
|
|
|
|
u32 shadow_pid;
|
2011-06-15 06:35:14 +07:00
|
|
|
u32 shadow_pid1;
|
2008-04-17 11:28:09 +07:00
|
|
|
u32 pid;
|
2008-07-26 01:54:53 +07:00
|
|
|
u32 swap_pid;
|
|
|
|
|
2008-04-17 11:28:09 +07:00
|
|
|
u32 ccr0;
|
|
|
|
u32 ccr1;
|
2009-01-04 05:23:07 +07:00
|
|
|
u32 dbsr;
|
2008-04-17 11:28:09 +07:00
|
|
|
|
2014-01-08 17:25:21 +07:00
|
|
|
u64 mmcr[5];
|
KVM: PPC: book3s_hv: Add support for PPC970-family processors
This adds support for running KVM guests in supervisor mode on those
PPC970 processors that have a usable hypervisor mode. Unfortunately,
Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
1), but the YDL PowerStation does have a usable hypervisor mode.
There are several differences between the PPC970 and POWER7 in how
guests are managed. These differences are accommodated using the
CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
bits. Notably, on PPC970:
* The LPCR, LPID or RMOR registers don't exist, and the functions of
those registers are provided by bits in HID4 and one bit in HID0.
* External interrupts can be directed to the hypervisor, but unlike
POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
SRR0/1 not HSRR0/1.
* There is no virtual RMA (VRMA) mode; the guest must use an RMO
(real mode offset) area.
* The TLB entries are not tagged with the LPID, so it is necessary to
flush the whole TLB on partition switch. Furthermore, when switching
partitions we have to ensure that no other CPU is executing the tlbie
or tlbsync instructions in either the old or the new partition,
otherwise undefined behaviour can occur.
* The PMU has 8 counters (PMC registers) rather than 6.
* The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.
* The SLB has 64 entries rather than 32.
* There is no mediated external interrupt facility, so if we switch to
a guest that has a virtual external interrupt pending but the guest
has MSR[EE] = 0, we have to arrange to have an interrupt pending for
it so that we can get control back once it re-enables interrupts. We
do that by sending ourselves an IPI with smp_send_reschedule after
hard-disabling interrupts.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:40:08 +07:00
|
|
|
u32 pmc[8];
|
2014-01-08 17:25:21 +07:00
|
|
|
u32 spmc[2];
|
2013-09-06 10:11:18 +07:00
|
|
|
u64 siar;
|
|
|
|
u64 sdar;
|
2014-01-08 17:25:21 +07:00
|
|
|
u64 sier;
|
2014-01-08 17:25:32 +07:00
|
|
|
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
|
|
|
|
u64 tfhar;
|
|
|
|
u64 texasr;
|
|
|
|
u64 tfiar;
|
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 17:32:01 +07:00
|
|
|
u64 orig_texasr;
|
2014-01-08 17:25:32 +07:00
|
|
|
|
|
|
|
u32 cr_tm;
|
2016-11-07 11:09:58 +07:00
|
|
|
u64 xer_tm;
|
2014-01-08 17:25:32 +07:00
|
|
|
u64 lr_tm;
|
|
|
|
u64 ctr_tm;
|
|
|
|
u64 amr_tm;
|
|
|
|
u64 ppr_tm;
|
|
|
|
u64 dscr_tm;
|
|
|
|
u64 tar_tm;
|
|
|
|
|
|
|
|
ulong gpr_tm[32];
|
|
|
|
|
|
|
|
struct thread_fp_state fp_tm;
|
|
|
|
|
|
|
|
struct thread_vr_state vr_tm;
|
|
|
|
u32 vrsave_tm; /* also USPRG0 */
|
|
|
|
#endif
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
|
2008-12-03 04:51:57 +07:00
|
|
|
#ifdef CONFIG_KVM_EXIT_TIMING
|
2011-03-25 12:02:13 +07:00
|
|
|
struct mutex exit_timing_lock;
|
2008-12-03 04:51:58 +07:00
|
|
|
struct kvmppc_exit_timing timing_exit;
|
|
|
|
struct kvmppc_exit_timing timing_last_enter;
|
2008-12-03 04:51:57 +07:00
|
|
|
u32 last_exit_type;
|
|
|
|
u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
|
|
|
|
u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
|
|
|
|
u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
|
|
|
|
u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
|
|
|
|
u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
|
|
|
|
u64 timing_last_exit;
|
|
|
|
struct dentry *debugfs_exit_timing;
|
|
|
|
#endif
|
|
|
|
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S
|
|
|
|
ulong fault_dar;
|
|
|
|
u32 fault_dsisr;
|
2014-05-05 10:09:44 +07:00
|
|
|
unsigned long intr_msr;
|
2017-01-30 17:21:45 +07:00
|
|
|
ulong fault_gpa; /* guest real address of page fault (POWER9) */
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
#endif
|
|
|
|
|
2010-04-16 05:11:44 +07:00
|
|
|
#ifdef CONFIG_BOOKE
|
2008-11-05 22:36:19 +07:00
|
|
|
ulong fault_dear;
|
|
|
|
ulong fault_esr;
|
2010-02-02 18:44:35 +07:00
|
|
|
ulong queued_dear;
|
|
|
|
ulong queued_esr;
|
2012-08-09 03:38:19 +07:00
|
|
|
spinlock_t wdt_lock;
|
|
|
|
struct timer_list wdt_timer;
|
2011-12-20 22:34:34 +07:00
|
|
|
u32 tlbcfg[4];
|
2013-04-11 07:03:10 +07:00
|
|
|
u32 tlbps[4];
|
2011-12-20 22:34:34 +07:00
|
|
|
u32 mmucfg;
|
2013-04-11 07:03:11 +07:00
|
|
|
u32 eptcfg;
|
2011-12-20 22:34:43 +07:00
|
|
|
u32 epr;
|
2014-07-21 12:53:26 +07:00
|
|
|
u64 sprg9;
|
2014-07-04 15:17:28 +07:00
|
|
|
u32 pwrmgtcr0;
|
2013-02-28 01:13:10 +07:00
|
|
|
u32 crit_save;
|
2013-07-04 13:57:47 +07:00
|
|
|
/* guest debug registers*/
|
2013-07-04 13:57:46 +07:00
|
|
|
struct debug_reg dbg_reg;
|
2010-04-16 05:11:44 +07:00
|
|
|
#endif
|
2008-04-17 11:28:09 +07:00
|
|
|
gpa_t paddr_accessed;
|
2012-03-12 08:26:30 +07:00
|
|
|
gva_t vaddr_accessed;
|
2013-11-18 12:48:54 +07:00
|
|
|
pgd_t *pgdir;
|
2008-04-17 11:28:09 +07:00
|
|
|
|
2018-05-28 08:48:26 +07:00
|
|
|
u16 io_gpr; /* GPR used as IO source/target */
|
2015-02-03 12:36:24 +07:00
|
|
|
u8 mmio_host_swabbed;
|
2010-02-19 17:00:30 +07:00
|
|
|
u8 mmio_sign_extend;
|
KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-21 20:12:36 +07:00
|
|
|
/* conversion between single and double precision */
|
|
|
|
u8 mmio_sp64_extend;
|
|
|
|
/*
|
|
|
|
* Number of simulations for vsx.
|
|
|
|
* If we use 2*8bytes to simulate 1*16bytes,
|
|
|
|
* then the number should be 2 and
|
2018-05-21 12:24:25 +07:00
|
|
|
* mmio_copy_type=KVMPPC_VSX_COPY_DWORD.
|
KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-21 20:12:36 +07:00
|
|
|
* If we use 4*4bytes to simulate 1*16bytes,
|
|
|
|
* the number should be 4 and
|
|
|
|
* mmio_vsx_copy_type=KVMPPC_VSX_COPY_WORD.
|
|
|
|
*/
|
|
|
|
u8 mmio_vsx_copy_nums;
|
|
|
|
u8 mmio_vsx_offset;
|
2018-02-04 03:24:26 +07:00
|
|
|
u8 mmio_vmx_copy_nums;
|
2018-05-21 12:24:26 +07:00
|
|
|
u8 mmio_vmx_offset;
|
2018-05-21 12:24:25 +07:00
|
|
|
u8 mmio_copy_type;
|
2010-03-25 03:48:30 +07:00
|
|
|
u8 osi_needed;
|
|
|
|
u8 osi_enabled;
|
2011-08-08 21:08:55 +07:00
|
|
|
u8 papr_enabled;
|
2012-08-09 03:38:19 +07:00
|
|
|
u8 watchdog_enabled;
|
2011-08-10 18:57:08 +07:00
|
|
|
u8 sane;
|
|
|
|
u8 cpu_type;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
u8 hcall_needed;
|
2013-04-12 21:08:46 +07:00
|
|
|
u8 epr_flags; /* KVMPPC_EPR_xxx */
|
2013-01-05 00:12:48 +07:00
|
|
|
u8 epr_needed;
|
2018-10-08 12:30:48 +07:00
|
|
|
u8 external_oneshot; /* clear external irq after delivery */
|
2008-04-17 11:28:09 +07:00
|
|
|
|
|
|
|
u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
|
|
|
|
|
2009-11-02 19:02:31 +07:00
|
|
|
struct hrtimer dec_timer;
|
2009-10-30 12:47:04 +07:00
|
|
|
u64 dec_jiffies;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
u64 dec_expires;
|
2008-04-17 11:28:09 +07:00
|
|
|
unsigned long pending_exceptions;
|
2011-06-29 07:22:05 +07:00
|
|
|
u8 ceded;
|
|
|
|
u8 prodded;
|
KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9
On POWER9, we no longer have the restriction that we had on POWER8
where all threads in a core have to be in the same partition, so
the CPU threads are now independent. However, we still want to be
able to run guests with a virtual SMT topology, if only to allow
migration of guests from POWER8 systems to POWER9.
A guest that has a virtual SMT mode greater than 1 will expect to
be able to use the doorbell facility; it will expect the msgsndp
and msgclrp instructions to work appropriately and to be able to read
sensible values from the TIR (thread identification register) and
DPDES (directed privileged doorbell exception status) special-purpose
registers. However, since each CPU thread is a separate sub-processor
in POWER9, these instructions and registers can only be used within
a single CPU thread.
In order for these instructions to appear to act correctly according
to the guest's virtual SMT mode, we have to trap and emulate them.
We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
register. The emulation is triggered by the hypervisor facility
unavailable interrupt that occurs when the guest uses them.
To cause a doorbell interrupt to occur within the guest, we set the
DPDES register to 1. If the guest has interrupts enabled, the CPU
will generate a doorbell interrupt and clear the DPDES register in
hardware. The DPDES hardware register for the guest is saved in the
vcpu->arch.vcore->dpdes field. Since this gets written by the guest
exit code, other VCPUs wishing to cause a doorbell interrupt don't
write that field directly, but instead set a vcpu->arch.doorbell_request
flag. This is consumed and set to 0 by the guest entry code, which
then sets DPDES to 1.
Emulating reads of the DPDES register is somewhat involved, because
it requires reading the doorbell pending interrupt status of all of the
VCPU threads in the virtual core, and if any of those VCPUs are
running, their doorbell status is only up-to-date in the hardware
DPDES registers of the CPUs where they are running. In order to get
a reasonable approximation of the current doorbell status, we send
those CPUs an IPI, causing an exit from the guest which will update
the vcpu->arch.vcore->dpdes field. We then use that value in
constructing the emulated DPDES register value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-16 13:41:20 +07:00
|
|
|
u8 doorbell_request;
|
2018-01-12 09:37:13 +07:00
|
|
|
u8 irq_pending; /* Used by XIVE to signal pending guest irqs */
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
u32 last_inst;
|
2011-06-29 07:22:05 +07:00
|
|
|
|
2016-02-19 15:46:39 +07:00
|
|
|
struct swait_queue_head *wqp;
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
struct kvmppc_vcore *vcore;
|
|
|
|
int ret;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
int trap;
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
int state;
|
|
|
|
int ptid;
|
KVM: PPC: Book3S HV: Make use of unused threads when running guests
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused. This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met. This applies on POWER7, and on POWER8 to guests
with one thread per virtual core. (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)
The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution. If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.
After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.
This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on. In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-06-24 18:18:03 +07:00
|
|
|
int thread_cpu;
|
KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement
With radix, the guest can do TLB invalidations itself using the tlbie
(global) and tlbiel (local) TLB invalidation instructions. Linux guests
use local TLB invalidations for translations that have only ever been
accessed on one vcpu. However, that doesn't mean that the translations
have only been accessed on one physical cpu (pcpu) since vcpus can move
around from one pcpu to another. Thus a tlbiel might leave behind stale
TLB entries on a pcpu where the vcpu previously ran, and if that task
then moves back to that previous pcpu, it could see those stale TLB
entries and thus access memory incorrectly. The usual symptom of this
is random segfaults in userspace programs in the guest.
To cope with this, we detect when a vcpu is about to start executing on
a thread in a core that is a different core from the last time it
executed. If that is the case, then we mark the core as needing a
TLB flush and then send an interrupt to any thread in the core that is
currently running a vcpu from the same guest. This will get those vcpus
out of the guest, and the first one to re-enter the guest will do the
TLB flush. The reason for interrupting the vcpus executing on the old
core is to cope with the following scenario:
CPU 0 CPU 1 CPU 4
(core 0) (core 0) (core 1)
VCPU 0 runs task X VCPU 1 runs
core 0 TLB gets
entries from task X
VCPU 0 moves to CPU 4
VCPU 0 runs task X
Unmap pages of task X
tlbiel
(still VCPU 1) task X moves to VCPU 1
task X runs
task X sees stale TLB
entries
That is, as soon as the VCPU starts executing on the new core, it
could unmap and tlbiel some page table entries, and then the task
could migrate to one of the VCPUs running on the old core and
potentially see stale TLB entries.
Since the TLB is shared between all the threads in a core, we only
use the bit of kvm->arch.need_tlb_flush corresponding to the first
thread in the core. To ensure that we don't have a window where we
can miss a flush, this moves the clearing of the bit from before the
actual flush to after it. This way, two threads might both do the
flush, but we prevent the situation where one thread can enter the
guest before the flush is finished.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30 17:21:50 +07:00
|
|
|
int prev_cpu;
|
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 14:42:46 +07:00
|
|
|
bool timer_running;
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
wait_queue_head_t cpu_run;
|
2017-05-11 18:03:37 +07:00
|
|
|
struct machine_check_event mce_evt; /* Valid if trap == 0x200 */
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
|
2010-07-29 19:47:42 +07:00
|
|
|
struct kvm_vcpu_arch_shared *shared;
|
2014-04-24 18:46:24 +07:00
|
|
|
#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
|
|
|
|
bool shared_big_endian;
|
|
|
|
#endif
|
2010-07-29 19:47:53 +07:00
|
|
|
unsigned long magic_page_pa; /* phys addr to map the magic page to */
|
|
|
|
unsigned long magic_page_ea; /* effect. addr to map the magic page to */
|
2014-05-12 06:08:32 +07:00
|
|
|
bool disable_kernel_nx;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
|
2013-04-12 21:08:47 +07:00
|
|
|
int irq_type; /* one of KVM_IRQ_* */
|
|
|
|
int irq_cpu_id;
|
|
|
|
struct openpic *mpic; /* KVM_IRQ_MPIC */
|
2013-04-18 03:30:26 +07:00
|
|
|
#ifdef CONFIG_KVM_XICS
|
|
|
|
struct kvmppc_icp *icp; /* XICS presentation controller */
|
2017-04-05 14:54:56 +07:00
|
|
|
struct kvmppc_xive_vcpu *xive_vcpu; /* XIVE virtual CPU data */
|
|
|
|
__be32 xive_cam_word; /* Cooked W2 in proper endian with valid bit */
|
2018-01-12 09:37:15 +07:00
|
|
|
u8 xive_pushed; /* Is the VP pushed on the physical CPU ? */
|
2018-01-12 09:37:16 +07:00
|
|
|
u8 xive_esc_on; /* Is the escalation irq enabled ? */
|
2017-04-05 14:54:56 +07:00
|
|
|
union xive_tma_w01 xive_saved_state; /* W0..1 of XIVE thread state */
|
2018-01-12 09:37:16 +07:00
|
|
|
u64 xive_esc_raddr; /* Escalation interrupt ESB real addr */
|
|
|
|
u64 xive_esc_vaddr; /* Escalation interrupt ESB virt addr */
|
2013-04-18 03:30:26 +07:00
|
|
|
#endif
|
2013-04-12 21:08:47 +07:00
|
|
|
|
2013-10-07 23:47:52 +07:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
struct kvm_vcpu_arch_shared shregs;
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
|
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().
In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-04 12:55:12 +07:00
|
|
|
struct mmio_hpte_cache mmio_cache;
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 19:36:37 +07:00
|
|
|
unsigned long pgfault_addr;
|
|
|
|
long pgfault_index;
|
|
|
|
unsigned long pgfault_hpte[2];
|
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().
In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-04 12:55:12 +07:00
|
|
|
struct mmio_hpte_cache_entry *pgfault_cache;
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 19:36:37 +07:00
|
|
|
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
struct task_struct *run_task;
|
|
|
|
struct kvm_run *kvm_run;
|
KVM: PPC: Book3S HV: Make virtual processor area registration more robust
The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU. Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration. If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.
This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used. Each area now has a struct kvmppc_vpa which is
used to manage these updates. There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields. (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)
This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.
Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
the need to initialize vcpu->arch.vpa_update_lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-20 00:46:32 +07:00
|
|
|
|
|
|
|
spinlock_t vpa_update_lock;
|
|
|
|
struct kvmppc_vpa vpa;
|
|
|
|
struct kvmppc_vpa dtl;
|
|
|
|
struct dtl_entry *dtl_ptr;
|
|
|
|
unsigned long dtl_index;
|
2012-02-03 07:56:21 +07:00
|
|
|
u64 stolen_logged;
|
KVM: PPC: Book3S HV: Make virtual processor area registration more robust
The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU. Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration. If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.
This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used. Each area now has a struct kvmppc_vpa which is
used to manage these updates. There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields. (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)
This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.
Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
the need to initialize vcpu->arch.vpa_update_lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-20 00:46:32 +07:00
|
|
|
struct kvmppc_vpa slb_shadow;
|
KVM: PPC: Book3S HV: Fix accounting of stolen time
Currently the code that accounts stolen time tends to overestimate the
stolen time, and will sometimes report more stolen time in a DTL
(dispatch trace log) entry than has elapsed since the last DTL entry.
This can cause guests to underflow the user or system time measured
for some tasks, leading to ridiculous CPU percentages and total runtimes
being reported by top and other utilities.
In addition, the current code was designed for the previous policy where
a vcore would only run when all the vcpus in it were runnable, and so
only counted stolen time on a per-vcore basis. Now that a vcore can
run while some of the vcpus in it are doing other things in the kernel
(e.g. handling a page fault), we need to count the time when a vcpu task
is preempted while it is not running as part of a vcore as stolen also.
To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
vcpu_load/put functions to count preemption time while the vcpu is
in that state. Handling the transitions between the RUNNING and
BUSY_IN_HOST states requires checking and updating two variables
(accumulated time stolen and time last preempted), so we add a new
spinlock, vcpu->arch.tbacct_lock. This protects both the per-vcpu
stolen/preempt-time variables, and the per-vcore variables while this
vcpu is running the vcore.
Finally, we now don't count time spent in userspace as stolen time.
The task could be executing in userspace on behalf of the vcpu, or
it could be preempted, or the vcpu could be genuinely stopped. Since
we have no way of dividing up the time between these cases, we don't
count any of it as stolen.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-10-15 08:18:07 +07:00
|
|
|
|
|
|
|
spinlock_t tbacct_lock;
|
|
|
|
u64 busy_stolen;
|
|
|
|
u64 busy_preempt;
|
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
There are two ways in which a guest instruction can be obtained from
the guest in the guest exit code in book3s_hv_rmhandlers.S. If the
exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
instruction), the offending instruction is in the HEIR register
(Hypervisor Emulation Instruction Register). If the exit was caused
by a load or store to an emulated MMIO device, we load the instruction
from the guest by turning data relocation on and loading the instruction
with an lwz instruction.
Unfortunately, in the case where the guest has opposite endianness to
the host, these two methods give results of different endianness, but
both get put into vcpu->arch.last_inst. The HEIR value has been loaded
using guest endianness, whereas the lwz will load the instruction using
host endianness. The rest of the code that uses vcpu->arch.last_inst
assumes it was loaded using host endianness.
To fix this, we define a new vcpu field to store the HEIR value. Then,
in kvmppc_handle_exit_hv(), we transfer the value from this new field to
vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
differ.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-03 09:30:39 +07:00
|
|
|
|
|
|
|
u32 emul_inst;
|
2018-04-20 12:33:21 +07:00
|
|
|
|
|
|
|
u32 online;
|
2018-10-08 12:31:04 +07:00
|
|
|
|
|
|
|
/* For support of nested guests */
|
|
|
|
struct kvm_nested_guest *nested;
|
|
|
|
u32 nested_vcpu_id;
|
2018-12-14 12:29:08 +07:00
|
|
|
gpa_t nested_io_gpr;
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
|
|
|
#endif
|
KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
This reads the timebase at various points in the real-mode guest
entry/exit code and uses that to accumulate total, minimum and
maximum time spent in those parts of the code. Currently these
times are accumulated per vcpu in 5 parts of the code:
* rm_entry - time taken from the start of kvmppc_hv_entry() until
just before entering the guest.
* rm_intr - time from when we take a hypervisor interrupt in the
guest until we either re-enter the guest or decide to exit to the
host. This includes time spent handling hcalls in real mode.
* rm_exit - time from when we decide to exit the guest until the
return from kvmppc_hv_entry().
* guest - time spend in the guest
* cede - time spent napping in real mode due to an H_CEDE hcall
while other threads in the same vcore are active.
These times are exposed in debugfs in a directory per vcpu that
contains a file called "timings". This file contains one line for
each of the 5 timings above, with the name followed by a colon and
4 numbers, which are the count (number of times the code has been
executed), the total time, the minimum time, and the maximum time,
all in nanoseconds.
The overhead of the extra code amounts to about 30ns for an hcall that
is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since
production environments may not wish to incur this overhead, the new
code is conditional on a new config symbol,
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-28 10:21:02 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
|
|
|
|
struct kvmhv_tb_accumulator *cur_activity; /* What we're timing */
|
|
|
|
u64 cur_tb_start; /* when it started */
|
|
|
|
struct kvmhv_tb_accumulator rm_entry; /* real-mode entry code */
|
|
|
|
struct kvmhv_tb_accumulator rm_intr; /* real-mode intr handling */
|
|
|
|
struct kvmhv_tb_accumulator rm_exit; /* real-mode exit code */
|
|
|
|
struct kvmhv_tb_accumulator guest_time; /* guest execution */
|
|
|
|
struct kvmhv_tb_accumulator cede_time; /* time napping inside guest */
|
|
|
|
|
|
|
|
struct dentry *debugfs_dir;
|
|
|
|
struct dentry *debugfs_timings;
|
|
|
|
#endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
|
2008-04-17 11:28:09 +07:00
|
|
|
};
|
|
|
|
|
2013-10-15 16:43:02 +07:00
|
|
|
#define VCPU_FPR(vcpu, i) (vcpu)->arch.fp.fpr[i][TS_FPROFFSET]
|
KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-21 20:12:36 +07:00
|
|
|
#define VCPU_VSX_FPR(vcpu, i, j) ((vcpu)->arch.fp.fpr[i][j])
|
|
|
|
#define VCPU_VSX_VR(vcpu, i) ((vcpu)->arch.vr.vr[i])
|
2013-10-15 16:43:02 +07:00
|
|
|
|
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
core), whenever a CPU goes idle, we have to pull all the other
hardware threads in the core out of the guest, because the H_CEDE
hcall is handled in the kernel. This is inefficient.
This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
in real mode. When a guest vcpu does an H_CEDE hcall, we now only
exit to the kernel if all the other vcpus in the same core are also
idle. Otherwise we mark this vcpu as napping, save state that could
be lost in nap mode (mainly GPRs and FPRs), and execute the nap
instruction. When the thread wakes up, because of a decrementer or
external interrupt, we come back in at kvm_start_guest (from the
system reset interrupt vector), find the `napping' flag set in the
paca, and go to the resume path.
This has some other ramifications. First, when starting a core, we
now start all the threads, both those that are immediately runnable and
those that are idle. This is so that we don't have to pull all the
threads out of the guest when an idle thread gets a decrementer interrupt
and wants to start running. In fact the idle threads will all start
with the H_CEDE hcall returning; being idle they will just do another
H_CEDE immediately and go to nap mode.
This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
These functions have been restructured to make them simpler and clearer.
We introduce a level of indirection in the wait queue that gets woken
when external and decrementer interrupts get generated for a vcpu, so
that we can have the 4 vcpus in a vcore using the same wait queue.
We need this because the 4 vcpus are being handled by one thread.
Secondly, when we need to exit from the guest to the kernel, we now
have to generate an IPI for any napping threads, because an HDEC
interrupt doesn't wake up a napping thread.
Thirdly, we now need to be able to handle virtual external interrupts
and decrementer interrupts becoming pending while a thread is napping,
and deliver those interrupts to the guest when the thread wakes.
This is done in kvmppc_cede_reentry, just before fast_guest_return.
Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
from kvm_arch_vcpu_runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 14:42:46 +07:00
|
|
|
/* Values for vcpu->arch.state */
|
2012-10-15 08:17:42 +07:00
|
|
|
#define KVMPPC_VCPU_NOTREADY 0
|
|
|
|
#define KVMPPC_VCPU_RUNNABLE 1
|
KVM: PPC: Book3S HV: Fix accounting of stolen time
Currently the code that accounts stolen time tends to overestimate the
stolen time, and will sometimes report more stolen time in a DTL
(dispatch trace log) entry than has elapsed since the last DTL entry.
This can cause guests to underflow the user or system time measured
for some tasks, leading to ridiculous CPU percentages and total runtimes
being reported by top and other utilities.
In addition, the current code was designed for the previous policy where
a vcore would only run when all the vcpus in it were runnable, and so
only counted stolen time on a per-vcore basis. Now that a vcore can
run while some of the vcpus in it are doing other things in the kernel
(e.g. handling a page fault), we need to count the time when a vcpu task
is preempted while it is not running as part of a vcore as stolen also.
To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
vcpu_load/put functions to count preemption time while the vcpu is
in that state. Handling the transitions between the RUNNING and
BUSY_IN_HOST states requires checking and updating two variables
(accumulated time stolen and time last preempted), so we add a new
spinlock, vcpu->arch.tbacct_lock. This protects both the per-vcpu
stolen/preempt-time variables, and the per-vcore variables while this
vcpu is running the vcore.
Finally, we now don't count time spent in userspace as stolen time.
The task could be executing in userspace on behalf of the vcpu, or
it could be preempted, or the vcpu could be genuinely stopped. Since
we have no way of dividing up the time between these cases, we don't
count any of it as stolen.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-10-15 08:18:07 +07:00
|
|
|
#define KVMPPC_VCPU_BUSY_IN_HOST 2
|
KVM: PPC: Allow book3s_hv guests to use SMT processor modes
This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.
This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.
To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.
When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.
When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).
It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.
Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
|
|
|
|
2012-01-07 08:07:38 +07:00
|
|
|
/* Values for vcpu->arch.io_gpr */
|
2018-05-28 08:48:26 +07:00
|
|
|
#define KVM_MMIO_REG_MASK 0x003f
|
|
|
|
#define KVM_MMIO_REG_EXT_MASK 0xffc0
|
2012-01-07 08:07:38 +07:00
|
|
|
#define KVM_MMIO_REG_GPR 0x0000
|
2018-05-28 08:48:26 +07:00
|
|
|
#define KVM_MMIO_REG_FPR 0x0040
|
|
|
|
#define KVM_MMIO_REG_QPR 0x0080
|
|
|
|
#define KVM_MMIO_REG_FQPR 0x00c0
|
|
|
|
#define KVM_MMIO_REG_VSX 0x0100
|
|
|
|
#define KVM_MMIO_REG_VMX 0x0180
|
2018-12-14 12:29:08 +07:00
|
|
|
#define KVM_MMIO_REG_NESTED_GPR 0xffc0
|
|
|
|
|
2012-01-07 08:07:38 +07:00
|
|
|
|
2012-03-14 04:35:01 +07:00
|
|
|
#define __KVM_HAVE_ARCH_WQP
|
2013-04-12 21:08:46 +07:00
|
|
|
#define __KVM_HAVE_CREATE_DEVICE
|
2012-03-09 04:44:24 +07:00
|
|
|
|
2014-08-28 20:13:03 +07:00
|
|
|
static inline void kvm_arch_hardware_disable(void) {}
|
2014-08-28 20:13:02 +07:00
|
|
|
static inline void kvm_arch_hardware_unsetup(void) {}
|
|
|
|
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
|
2019-02-06 03:54:17 +07:00
|
|
|
static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
|
2014-08-28 20:13:02 +07:00
|
|
|
static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
|
|
|
|
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
|
|
|
|
static inline void kvm_arch_exit(void) {}
|
2015-08-27 21:41:15 +07:00
|
|
|
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
|
|
|
|
static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
|
2016-05-13 17:16:35 +07:00
|
|
|
static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
|
2014-08-28 20:13:02 +07:00
|
|
|
|
2008-04-17 11:28:09 +07:00
|
|
|
#endif /* __POWERPC_KVM_HOST_H__ */
|