This starts the process of adding the code to support nested HV-style
virtualization. It defines a new H_SET_PARTITION_TABLE hypercall which
a nested hypervisor can use to set the base address and size of a
partition table in its memory (analogous to the PTCR register).
On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
hypercall from the guest is handled by code that saves the virtual
PTCR value for the guest.
This also adds code for creating and destroying nested guests and for
reading the partition table entry for a nested guest from L1 memory.
Each nested guest has its own shadow LPID value, different in general
from the LPID value used by the nested hypervisor to refer to it. The
shadow LPID value is allocated at nested guest creation time.
Nested hypervisor functionality is only available for a radix guest,
which therefore means a radix host on a POWER9 (or later) processor.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits. This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a file called 'radix' in the debugfs directory for the
guest, which when read gives all of the valid leaf PTEs in the
partition-scoped radix tree for a radix guest, in human-readable
format. It is analogous to the existing 'htab' file which dumps
the HPT entries for a HPT guest.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller). The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.
In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller. That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit. This
patch does that.
With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt. In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 1e175d2 ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
VCPU ID space", 2018-07-25) allowed use of VCPU IDs up to
KVM_MAX_VCPU_ID on POWER9 in all guest SMT modes and guest emulated
hardware SMT modes. However, with the current definition of
KVM_MAX_VCPU_ID, a guest SMT mode of 1 and an emulated SMT mode of 8,
it is only possible to create KVM_MAX_VCPUS / 2 VCPUS, because
threads_per_subcore is 4 on POWER9 CPUs. (Using an emulated SMT mode
of 8 is useful when migrating VMs to or from POWER8 hosts.)
This increases KVM_MAX_VCPU_ID to 8 * KVM_MAX_VCPUS when HV KVM is
configured in, so that a full complement of KVM_MAX_VCPUS VCPUs can
be created on POWER9 in all guest SMT modes and emulated hardware
SMT modes.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Originally PPC KVM MMIO emulation uses only 0~31#(5 bits) for VSR
reg number, and use mmio_vsx_tx_sx_enabled field together for
0~63# VSR regs.
Currently PPC KVM MMIO emulation is reimplemented with analyse_instr()
assistance. analyse_instr() returns 0~63 for VSR register number, so
it is not necessary to use additional mmio_vsx_tx_sx_enabled field
any more.
This patch extends related reg bits (expand io_gpr to u16 from u8
and use 6 bits for VSR reg#), so that mmio_vsx_tx_sx_enabled can
be removed.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The transaction memory checkpoint area save/restore behavior is
triggered when VCPU qemu process is switching out/into CPU, i.e.
at kvmppc_core_vcpu_put_pr() and kvmppc_core_vcpu_load_pr().
MSR TM active state is determined by TS bits:
active: 10(transactional) or 01 (suspended)
inactive: 00 (non-transactional)
We don't "fake" TM functionality for guest. We "sync" guest virtual
MSR TM active state(10 or 01) with shadow MSR. That is to say,
we don't emulate a transactional guest with a TM inactive MSR.
TM SPR support(TFIAR/TFAR/TEXASR) has already been supported by
commit 9916d57e64 ("KVM: PPC: Book3S PR: Expose TM registers").
Math register support (FPR/VMX/VSX) will be done at subsequent
patch.
Whether TM context need to be saved/restored can be determined
by kvmppc_get_msr() TM active state:
* TM active - save/restore TM context
* TM inactive - no need to do so and only save/restore
TM SPRs.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch reimplements LOAD_VMX/STORE_VMX MMIO emulation with
analyse_instr() input. When emulating the store, the VMX reg will need to
be flushed so that the right reg val can be retrieved before writing to
IO MEM.
This patch also adds support for lvebx/lvehx/lvewx/stvebx/stvehx/stvewx
MMIO emulation. To meet the requirement of handling different element
sizes, kvmppc_handle_load128_by2x64()/kvmppc_handle_store128_by2x64()
were replaced with kvmppc_handle_vmx_load()/kvmppc_handle_vmx_store().
The framework used is similar to VSX instruction MMIO emulation.
Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
VSX MMIO emulation uses mmio_vsx_copy_type to represent VSX emulated
element size/type, such as KVMPPC_VSX_COPY_DWORD_LOAD, etc. This
patch expands mmio_vsx_copy_type to cover VMX copy type, such as
KVMPPC_VMX_COPY_BYTE(stvebx/lvebx), etc. As a result,
mmio_vsx_copy_type is also renamed to mmio_copy_type.
It is a preparation for reimplementing VMX MMIO emulation.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Some VSX instructions like lxvwsx will splat word into VSR. This patch
adds a new VSX copy type KVMPPC_VSX_COPY_WORD_LOAD_DUMP to support this.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
tlbies to an LPAR do not have to be serialised since POWER4/PPC970,
after which the MMU_FTR_LOCKLESS_TLBIE feature was introduced to
avoid tlbie locking.
Since commit c17b98cf60 ("KVM: PPC: Book3S HV: Remove code for
PPC970 processors"), KVM no longer supports processors that do not
have this feature, so the tlbie locking can be removed completely.
A sanity check for the feature is put in kvmppc_mmu_hv_init.
Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
in HPT mode. 32 instances of the powerpc fork benchmark from selftests
were run with --fork, and the results measured.
Without this patch, total throughput was about 13.5K/sec, and this is
the top of the host profile:
74.52% [k] do_tlbies
2.95% [k] kvmppc_book3s_hv_page_fault
1.80% [k] calc_checksum
1.80% [k] kvmppc_vcpu_run_hv
1.49% [k] kvmppc_run_core
After this patch, throughput was about 51K/sec, with this profile:
21.28% [k] do_tlbies
5.26% [k] kvmppc_run_core
4.88% [k] kvmppc_book3s_hv_page_fault
3.30% [k] _raw_spin_lock_irqsave
3.25% [k] gup_pgd_range
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch moves nip/ctr/lr/xer registers from scattered places in
kvm_vcpu_arch to pt_regs structure.
cr register is "unsigned long" in pt_regs and u32 in vcpu->arch.
It will need more consideration and may move in later patches.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Current regs are scattered at kvm_vcpu_arch structure and it will
be more neat to organize them into pt_regs structure.
Also it will enable reimplementation of MMIO emulation code with
analyse_instr() later.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a new KVM_REG_PPC_ONLINE register which userspace can set
to 0 or 1 via the GET/SET_ONE_REG interface to indicate whether it
considers the VCPU to be offline (0), that is, not currently running,
or online (1). This will be used in a later patch to configure the
register which controls PURR and SPURR accumulation.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- VHE optimizations
- EL2 address space randomization
- speculative execution mitigations ("variant 3a", aka execution past invalid
privilege register access)
- bugfixes and cleanups
PPC:
- improvements for the radix page fault handler for HV KVM on POWER9
s390:
- more kvm stat counters
- virtio gpu plumbing
- documentation
- facilities improvements
x86:
- support for VMware magic I/O port and pseudo-PMCs
- AMD pause loop exiting
- support for AMD core performance extensions
- support for synchronous register access
- expose nVMX capabilities to userspace
- support for Hyper-V signaling via eventfd
- use Enlightened VMCS when running on Hyper-V
- allow userspace to disable MWAIT/HLT/PAUSE vmexits
- usual roundup of optimizations and nested virtualization bugfixes
Generic:
- API selftest infrastructure (though the only tests are for x86 as of now)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJay19UAAoJEL/70l94x66DGKYIAIu9PTHAEwaX0et15fPW5y2x
rrtS355lSAmMrPJ1nePRQ+rProD/1B0Kizj3/9O+B9OTKKRsorRYNa4CSu9neO2k
N3rdE46M1wHAPwuJPcYvh3iBVXtgbMayk1EK5aVoSXaMXEHh+PWZextkl+F+G853
kC27yDy30jj9pStwnEFSBszO9ua/URdKNKBATNx8WUP6d9U/dlfm5xv3Dc3WtKt2
UMGmog2wh0i7ecXo7hRkMK4R7OYP3ZxAexq5aa9BOPuFp+ZdzC/MVpN+jsjq2J/M
Zq6RNyA2HFyQeP0E9QgFsYS2BNOPeLZnT5Jg1z4jyiD32lAZ/iC51zwm4oNKcDM=
=bPlD
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- VHE optimizations
- EL2 address space randomization
- speculative execution mitigations ("variant 3a", aka execution past
invalid privilege register access)
- bugfixes and cleanups
PPC:
- improvements for the radix page fault handler for HV KVM on POWER9
s390:
- more kvm stat counters
- virtio gpu plumbing
- documentation
- facilities improvements
x86:
- support for VMware magic I/O port and pseudo-PMCs
- AMD pause loop exiting
- support for AMD core performance extensions
- support for synchronous register access
- expose nVMX capabilities to userspace
- support for Hyper-V signaling via eventfd
- use Enlightened VMCS when running on Hyper-V
- allow userspace to disable MWAIT/HLT/PAUSE vmexits
- usual roundup of optimizations and nested virtualization bugfixes
Generic:
- API selftest infrastructure (though the only tests are for x86 as
of now)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits)
kvm: x86: fix a prototype warning
kvm: selftests: add sync_regs_test
kvm: selftests: add API testing infrastructure
kvm: x86: fix a compile warning
KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
KVM: X86: Introduce handle_ud()
KVM: vmx: unify adjacent #ifdefs
x86: kvm: hide the unused 'cpu' variable
KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig
Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
kvm: Add emulation for movups/movupd
KVM: VMX: raise internal error for exception during invalid protected mode state
KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
KVM: x86: Fix misleading comments on handling pending exceptions
KVM: x86: Rename interrupt.pending to interrupt.injected
KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
x86/kvm: use Enlightened VMCS when running on Hyper-V
x86/hyper-v: detect nested features
x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits
...
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since commit fb1522e099 ("KVM: update to new mmu_notifier semantic
v2", 2017-08-31), the MMU notifier code in KVM no longer calls the
kvm_unmap_hva callback. This removes the PPC implementations of
kvm_unmap_hva().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch provides the MMIO load/store vector indexed
X-Form emulation.
Instructions implemented:
lvx: the quadword in storage addressed by the result of EA &
0xffff_ffff_ffff_fff0 is loaded into VRT.
stvx: the contents of VRS are stored into the quadword in storage
addressed by the result of EA & 0xffff_ffff_ffff_fff0.
Reported-by: Gopesh Kumar Chaudhary <gopchaud@in.ibm.com>
Reported-by: Balamuruhan S <bala24@linux.vnet.ibm.com>
Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This works on top of the single escalation support. When in single
escalation, with this change, we will keep the escalation interrupt
disabled unless the VCPU is in H_CEDE (idle). In any other case, we
know the VCPU will be rescheduled and thus there is no need to take
escalation interrupts in the host whenever a guest interrupt fires.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The prodded flag is only cleared at the beginning of H_CEDE,
so every time we have an escalation, we will cause the *next*
H_CEDE to return immediately.
Instead use a dedicated "irq_pending" flag to indicate that
a guest interrupt is pending for the VCPU. We don't reuse the
existing exception bitmap so as to avoid expensive atomic ops.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch allows for a mode on POWER9 hosts where we control all the
threads of a core, much as we do on POWER8. The mode is controlled by
a module parameter on the kvm_hv module, called "indep_threads_mode".
The normal mode on POWER9 is the "independent threads" mode, with
indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any
desired SMT mode) and each thread independently enters and exits from
KVM guests without reference to what other threads in the core are
doing.
If indep_threads_mode is set to N at the point when a VM is started,
KVM will expect every core that the guest runs on to be in single
threaded mode (that is, threads 1, 2 and 3 offline), and will set the
flag that prevents secondary threads from coming online. We can still
use all four threads; the code that implements dynamic micro-threading
on POWER8 will become active in over-commit situations and will allow
up to three other VCPUs to be run on the secondary threads of the core
whenever a VCPU is run.
The reason for wanting this mode is that this will allow us to run HPT
guests on a radix host on a POWER9 machine that does not support
"mixed mode", that is, having some threads in a core be in HPT mode
while other threads are in radix mode. It will also make it possible
to implement a "strict threads" mode in future, if desired.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, the HPT code in HV KVM maintains a dirty bit per guest page
in the rmap array, whether or not dirty page tracking has been enabled
for the memory slot. In contrast, the radix code maintains a dirty
bit per guest page in memslot->dirty_bitmap, and only does so when
dirty page tracking has been enabled.
This changes the HPT code to maintain the dirty bits in the memslot
dirty_bitmap like radix does. This results in slightly less code
overall, and will mean that we do not lose the dirty bits when
transitioning between HPT and radix mode in future.
There is one minor change to behaviour as a result. With HPT, when
dirty tracking was enabled for a memslot, we would previously clear
all the dirty bits at that point (both in the HPT entries and in the
rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
following would show no pages as dirty (assuming no vcpus have run
in the meantime). With this change, the dirty bits on HPT entries
are not cleared at the point where dirty tracking is enabled, so
KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
resident in the HPT and dirty. This is consistent with what happens
on radix.
This also fixes a bug in the mark_pages_dirty() function for radix
(in the sense that the function no longer exists). In the case where
a large page of 64 normal pages or more is marked dirty, the
addressing of the dirty bitmap was incorrect and could write past
the end of the bitmap. Fortunately this case was never hit in
practice because a 2MB large page is only 32 x 64kB pages, and we
don't support backing the guest with 1GB huge pages at this point.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This renames the kvm->arch.hpte_setup_done field to mmu_ready because
we will want to use it for radix guests too -- both for setting things
up before vcpu execution, and for excluding vcpus from executing while
MMU-related things get changed, such as in future switching the MMU
from radix to HPT mode or vice-versa.
This also moves the call to kvmppc_setup_partition_table() that was
done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting
of mmu_ready, into the caller in kvmppc_vcpu_run_hv().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Changed since v1 (Linus Torvalds)
- remove now useless kvm_arch_mmu_notifier_invalidate_page()
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Tested-by: Adam Borowski <kilobyte@angband.pl>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Better machine check handling for HV KVM
- Ability to support guests with threads=2, 4 or 8 on POWER9
- Fix for a race that could cause delayed recognition of signals
- Fix for a bug where POWER9 guests could sleep with interrupts
pending.
Enhance KVM to cause a guest exit with KVM_EXIT_NMI
exit reason upon a machine check exception (MCE) in
the guest address space if the KVM_CAP_PPC_FWNMI
capability is enabled (instead of delivering a 0x200
interrupt to guest). This enables QEMU to build error
log and deliver machine check exception to guest via
guest registered machine check handler.
This approach simplifies the delivery of machine
check exception to guest OS compared to the earlier
approach of KVM directly invoking 0x200 guest interrupt
vector.
This design/approach is based on the feedback for the
QEMU patches to handle machine check exception. Details
of earlier approach of handling machine check exception
in QEMU and related discussions can be found at:
https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
Note:
This patch now directly invokes machine_check_print_event_info()
from kvmppc_handle_exit_hv() to print the event to host console
at the time of guest exit before the exception is passed on to the
guest. Hence, the host-side handling which was performed earlier
via machine_check_fwnmi is removed.
The reasons for this approach is (i) it is not possible
to distinguish whether the exception occurred in the
guest or the host from the pt_regs passed on the
machine_check_exception(). Hence machine_check_exception()
calls panic, instead of passing on the exception to
the guest, if the machine check exception is not
recoverable. (ii) the approach introduced in this
patch gives opportunity to the host kernel to perform
actions in virtual mode before passing on the exception
to the guest. This approach does not require complex
tweaks to machine_check_fwnmi and friends.
Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This introduces a new KVM capability to control how KVM behaves
on machine check exception (MCE) in HV KVM guests.
If this capability has not been enabled, KVM redirects machine check
exceptions to guest's 0x200 vector, if the address in error belongs to
the guest. With this capability enabled, KVM will cause a guest exit
with the exit reason indicating an NMI.
The new capability is required to avoid problems if a new kernel/KVM
is used with an old QEMU, running a guest that doesn't issue
"ibm,nmi-register". As old QEMU does not understand the NMI exit
type, it treats it as a fatal error. However, the guest could have
handled the machine check error if the exception was delivered to
guest's 0x200 interrupt vector instead of NMI exit in case of old
QEMU.
[paulus@ozlabs.org - Reworded the commit message to be clearer,
enable only on HV KVM.]
Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
On POWER9, we no longer have the restriction that we had on POWER8
where all threads in a core have to be in the same partition, so
the CPU threads are now independent. However, we still want to be
able to run guests with a virtual SMT topology, if only to allow
migration of guests from POWER8 systems to POWER9.
A guest that has a virtual SMT mode greater than 1 will expect to
be able to use the doorbell facility; it will expect the msgsndp
and msgclrp instructions to work appropriately and to be able to read
sensible values from the TIR (thread identification register) and
DPDES (directed privileged doorbell exception status) special-purpose
registers. However, since each CPU thread is a separate sub-processor
in POWER9, these instructions and registers can only be used within
a single CPU thread.
In order for these instructions to appear to act correctly according
to the guest's virtual SMT mode, we have to trap and emulate them.
We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
register. The emulation is triggered by the hypervisor facility
unavailable interrupt that occurs when the guest uses them.
To cause a doorbell interrupt to occur within the guest, we set the
DPDES register to 1. If the guest has interrupts enabled, the CPU
will generate a doorbell interrupt and clear the DPDES register in
hardware. The DPDES hardware register for the guest is saved in the
vcpu->arch.vcore->dpdes field. Since this gets written by the guest
exit code, other VCPUs wishing to cause a doorbell interrupt don't
write that field directly, but instead set a vcpu->arch.doorbell_request
flag. This is consumed and set to 0 by the guest entry code, which
then sets DPDES to 1.
Emulating reads of the DPDES register is somewhat involved, because
it requires reading the doorbell pending interrupt status of all of the
VCPU threads in the virtual core, and if any of those VCPUs are
running, their doorbell status is only up-to-date in the hardware
DPDES registers of the CPUs where they are running. In order to get
a reasonable approximation of the current doorbell status, we send
those CPUs an IPI, causing an exit from the guest which will update
the vcpu->arch.vcore->dpdes field. We then use that value in
constructing the emulated DPDES register value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This allows userspace to set the desired virtual SMT (simultaneous
multithreading) mode for a VM, that is, the number of VCPUs that
get assigned to each virtual core. Previously, the virtual SMT mode
was fixed to the number of threads per subcore, and if userspace
wanted to have fewer vcpus per vcore, then it would achieve that by
using a sparse CPU numbering. This had the disadvantage that the
vcpu numbers can get quite large, particularly for SMT1 guests on
a POWER8 with 8 threads per core. With this patch, userspace can
set its desired virtual SMT mode and then use contiguous vcpu
numbering.
On POWER8, where the threading mode is "strict", the virtual SMT mode
must be less than or equal to the number of threads per subcore. On
POWER9, which implements a "loose" threading mode, the virtual SMT
mode can be any power of 2 between 1 and 8, even though there is
effectively one thread per subcore, since the threads are independent
and can all be in different partitions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to allow us to use a different value for the HFSCR
(Hypervisor Facilities Status and Control Register) when running the
guest from that which applies in the host. The reason for doing this
is to allow us to trap the msgsndp instruction and related operations
in future so that they can be virtualized. We also save the value of
HFSCR when a hypervisor facility unavailable interrupt occurs, because
the high byte of HFSCR indicates which facility the guest attempted to
access.
We save and restore the host value on guest entry/exit because some
bits of it affect host userspace execution.
We only do all this on POWER9, not on POWER8, because we are not
intending to virtualize any of the facilities controlled by HFSCR on
POWER8. In particular, the HFSCR bit that controls execution of
msgsndp and related operations does not exist on POWER8. The HFSCR
doesn't exist at all on POWER7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This allows userspace (e.g. QEMU) to enable large decrementer mode for
the guest when running on a POWER9 host, by setting the LPCR_LD bit in
the guest LPCR value. With this, the guest exit code saves 64 bits of
the guest DEC value on exit. Other places that use the guest DEC
value check the LPCR_LD bit in the guest LPCR value, and if it is set,
omit the 32-bit sign extension that would otherwise be done.
This doesn't change the DEC emulation used by PR KVM because PR KVM
is not supported on POWER9 yet.
This is partly based on an earlier patch by Oliver O'Halloran.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Marc Zyngier suggested that we define the arch specific VCPU request
base, rather than requiring each arch to remember to start from 8.
That suggestion, along with Radim Krcmar's recent VCPU request flag
addition, snowballed into defining something of an arch VCPU request
defining API.
No functional change.
(Looks like x86 is running out of arch VCPU request bits. Maybe
someday we'll need to extend to 64.)
Signed-off-by: Andrew Jones <drjones@redhat.com>
Acked-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
This merges in the powerpc topic/xive branch to bring in the code for
the in-kernel XICS interrupt controller emulation to use the new XIVE
(eXternal Interrupt Virtualization Engine) hardware in the POWER9 chip
directly, rather than via a XICS emulation in firmware.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch makes KVM capable of using the XIVE interrupt controller
to provide the standard PAPR "XICS" style hypercalls. It is necessary
for proper operations when the host uses XIVE natively.
This has been lightly tested on an actual system, including PCI
pass-through with a TG3 device.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build
failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and
adding empty stubs for the kvm_xive_xxx() routines, fixup subject,
integrate fixes from Paul for building PR=y HV=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.
This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in the real mode, if failed
it passes the request to the virtual mode to complete the operation.
If it a virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.
To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.
If we fail to update a hardware IOMMU table unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.
This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.
This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.
This adds real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
returns in the code, this also adds a check for already existing
vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
PR KVM page fault handler performs eaddr to pte translation for a guest,
however kvmppc_mmu_book3s_64_xlate() does not preserve WIMG bits
(storage control) in the kvmppc_pte struct. If PR KVM is running as
a second level guest under HV KVM, and PR KVM tries inserting HPT entry,
this fails in HV KVM if it already has this mapping.
This preserves WIMG bits between kvmppc_mmu_book3s_64_xlate() and
kvmppc_mmu_map_page().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Its value has never changed; we might as well make it part of the ABI instead
of using the return value of KVM_CHECK_EXTENSION(KVM_CAP_COALESCED_MMIO).
Because PPC does not always make MMIO available, the code has to be made
dependent on CONFIG_KVM_MMIO rather than KVM_COALESCED_MMIO_PAGE_OFFSET.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This adds a not yet working outline of the HPT resizing PAPR
extension. Specifically it adds the necessary ioctl() functions,
their basic steps, the work function which will handle preparation for
the resize, and synchronization between these, the guest page fault
path and guest HPT update path.
The actual guts of the implementation isn't here yet, so for now the
calls will always fail.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently the kvm_hpt_info structure stores the hashed page table's order,
and also the number of HPTEs it contains and a mask for its size. The
last two can be easily derived from the order, so remove them and just
calculate them as necessary with a couple of helper inlines.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, the powerpc kvm_arch structure contains a number of variables
tracking the state of the guest's hashed page table (HPT) in KVM HV. This
patch gathers them all together into a single kvm_hpt_info substructure.
This makes life more convenient for the upcoming HPT resizing
implementation.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With radix, the guest can do TLB invalidations itself using the tlbie
(global) and tlbiel (local) TLB invalidation instructions. Linux guests
use local TLB invalidations for translations that have only ever been
accessed on one vcpu. However, that doesn't mean that the translations
have only been accessed on one physical cpu (pcpu) since vcpus can move
around from one pcpu to another. Thus a tlbiel might leave behind stale
TLB entries on a pcpu where the vcpu previously ran, and if that task
then moves back to that previous pcpu, it could see those stale TLB
entries and thus access memory incorrectly. The usual symptom of this
is random segfaults in userspace programs in the guest.
To cope with this, we detect when a vcpu is about to start executing on
a thread in a core that is a different core from the last time it
executed. If that is the case, then we mark the core as needing a
TLB flush and then send an interrupt to any thread in the core that is
currently running a vcpu from the same guest. This will get those vcpus
out of the guest, and the first one to re-enter the guest will do the
TLB flush. The reason for interrupting the vcpus executing on the old
core is to cope with the following scenario:
CPU 0 CPU 1 CPU 4
(core 0) (core 0) (core 1)
VCPU 0 runs task X VCPU 1 runs
core 0 TLB gets
entries from task X
VCPU 0 moves to CPU 4
VCPU 0 runs task X
Unmap pages of task X
tlbiel
(still VCPU 1) task X moves to VCPU 1
task X runs
task X sees stale TLB
entries
That is, as soon as the VCPU starts executing on the new core, it
could unmap and tlbiel some page table entries, and then the task
could migrate to one of the VCPUs running on the old core and
potentially see stale TLB entries.
Since the TLB is shared between all the threads in a core, we only
use the bit of kvm->arch.need_tlb_flush corresponding to the first
thread in the core. To ensure that we don't have a window where we
can miss a flush, this moves the clearing of the bit from before the
actual flush to after it. This way, two threads might both do the
flush, but we prevent the situation where one thread can enter the
guest before the flush is finished.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds code to branch around the parts that radix guests don't
need - clearing and loading the SLB with the guest SLB contents,
saving the guest SLB contents on exit, and restoring the host SLB
contents.
Since the host is now using radix, we need to save and restore the
host value for the PID register.
On hypervisor data/instruction storage interrupts, we don't do the
guest HPT lookup on radix, but just save the guest physical address
for the fault (from the ASDR register) in the vcpu struct.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a field in struct kvm_arch and an inline helper to
indicate whether a guest is a radix guest or not, plus a new file
to contain the radix MMU code, which currently contains just a
translate function which knows how to traverse the guest page
tables to translate an address.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
for HPT guests on POWER9. With this, we can return 1 for the
KVM_CAP_PPC_MMU_HASH_V3 capability.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
KVM_HALT_POLL_NS_DEFAULT is an arch specific constant which sets the
default value of the halt_poll_ns kvm module parameter which determines
the global maximum halt polling interval.
The current value for powerpc is 500000 (500us) which means that any
repetitive workload with a period of less than that can drive the cpu
usage to 100% where it may have been mostly idle without halt polling.
This presents the possibility of a large increase in power usage with
a comparatively small performance benefit.
Reduce the default to 10000 (10us) and a user can tune this themselves
to set their affinity for halt polling based on the trade off between power
and performance which they are willing to make.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
and tlbiel (local tlbie) instructions. Both instructions get a
set of new parameters (RIC, PRS and R) which appear as bits in the
instruction word. The tlbiel instruction now has a second register
operand, which contains a PID and/or LPID value if needed, and
should otherwise contain 0.
This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
as well as older processors. Since we only handle HPT guests so
far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
word as on previous processors, so we don't need to conditionally
execute different instructions depending on the processor.
The local flush on first entry to a guest in book3s_hv_rmhandlers.S
is a loop which depends on the number of TLB sets. Rather than
using feature sections to set the number of iterations based on
which CPU we're on, we now work out this number at VM creation time
and store it in the kvm_arch struct. That will make it possible to
get the number from the device tree in future, which will help with
compatibility with future processors.
Since mmu_partition_table_set_entry() does a global flush of the
whole LPID, we don't need to do the TLB flush on first entry to the
guest on each processor. Therefore we don't set all bits in the
tlb_need_flush bitmap on VM startup on POWER9.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to handle two new guest-accessible special-purpose
registers on POWER9: TIDR (thread ID register) and PSSCR (processor
stop status and control register). They are context-switched
between host and guest, and the guest values can be read and set
via the one_reg interface.
The PSSCR contains some fields which are guest-accessible and some
which are only accessible in hypervisor mode. We only allow the
guest-accessible fields to be read or set by userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When switching from/to a guest that has a transaction in progress,
we need to save/restore the checkpointed register state. Although
XER is part of the CPU state that gets checkpointed, the code that
does this saving and restoring doesn't save/restore XER.
This fixes it by saving and restoring the XER. To allow userspace
to read/write the checkpointed XER value, we also add a new ONE_REG
specifier.
The visible effect of this bug is that the guest may see its XER
value being corrupted when it uses transactions.
Fixes: e4e3812150 ("KVM: PPC: Book3S HV: Add transactional memory support")
Fixes: 0a8eccefcb ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>