linux_dsm_epyc7002/arch/powerpc/kernel/asm-offsets.c

787 lines
30 KiB
C
Raw Normal View History

/*
* This program is used to generate definitions needed by
* assembly language modules.
*
* We use the technique used in the OSF Mach kernel code:
* generate asm statements containing #defines,
* compile this file to assembler, and then extract the
* #defines from the assembly-language output.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
compat: Move compat_timespec/ timeval to compat_time.h All the current architecture specific defines for these are the same. Refactor these common defines to a common header file. The new common linux/compat_time.h is also useful as it will eventually be used to hold all the defines that are needed for compat time types that support non y2038 safe types. New architectures need not have to define these new types as they will only use new y2038 safe syscalls. This file can be deleted after y2038 when we stop supporting non y2038 safe syscalls. The patch also requires an operation similar to: git grep "asm/compat\.h" | cut -d ":" -f 1 | xargs -n 1 sed -i -e "s%asm/compat.h%linux/compat.h%g" Cc: acme@kernel.org Cc: benh@kernel.crashing.org Cc: borntraeger@de.ibm.com Cc: catalin.marinas@arm.com Cc: cmetcalf@mellanox.com Cc: cohuck@redhat.com Cc: davem@davemloft.net Cc: deller@gmx.de Cc: devel@driverdev.osuosl.org Cc: gerald.schaefer@de.ibm.com Cc: gregkh@linuxfoundation.org Cc: heiko.carstens@de.ibm.com Cc: hoeppner@linux.vnet.ibm.com Cc: hpa@zytor.com Cc: jejb@parisc-linux.org Cc: jwi@linux.vnet.ibm.com Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: mark.rutland@arm.com Cc: mingo@redhat.com Cc: mpe@ellerman.id.au Cc: oberpar@linux.vnet.ibm.com Cc: oprofile-list@lists.sf.net Cc: paulus@samba.org Cc: peterz@infradead.org Cc: ralf@linux-mips.org Cc: rostedt@goodmis.org Cc: rric@kernel.org Cc: schwidefsky@de.ibm.com Cc: sebott@linux.vnet.ibm.com Cc: sparclinux@vger.kernel.org Cc: sth@linux.vnet.ibm.com Cc: ubraun@linux.vnet.ibm.com Cc: will.deacon@arm.com Cc: x86@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: James Hogan <jhogan@kernel.org> Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-03-14 11:03:25 +07:00
#include <linux/compat.h>
#include <linux/signal.h>
#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/mman.h>
#include <linux/mm.h>
#include <linux/suspend.h>
#include <linux/hrtimer.h>
#ifdef CONFIG_PPC64
#include <linux/time.h>
#include <linux/hardirq.h>
#endif
#include <linux/kbuild.h>
#include <asm/io.h>
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/processor.h>
#include <asm/cputable.h>
#include <asm/thread_info.h>
#include <asm/rtas.h>
#include <asm/vdso_datapage.h>
#include <asm/dbell.h>
#ifdef CONFIG_PPC64
#include <asm/paca.h>
#include <asm/lppaca.h>
#include <asm/cache.h>
#include <asm/mmu.h>
#include <asm/hvcall.h>
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per core), whenever a CPU goes idle, we have to pull all the other hardware threads in the core out of the guest, because the H_CEDE hcall is handled in the kernel. This is inefficient. This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall in real mode. When a guest vcpu does an H_CEDE hcall, we now only exit to the kernel if all the other vcpus in the same core are also idle. Otherwise we mark this vcpu as napping, save state that could be lost in nap mode (mainly GPRs and FPRs), and execute the nap instruction. When the thread wakes up, because of a decrementer or external interrupt, we come back in at kvm_start_guest (from the system reset interrupt vector), find the `napping' flag set in the paca, and go to the resume path. This has some other ramifications. First, when starting a core, we now start all the threads, both those that are immediately runnable and those that are idle. This is so that we don't have to pull all the threads out of the guest when an idle thread gets a decrementer interrupt and wants to start running. In fact the idle threads will all start with the H_CEDE hcall returning; being idle they will just do another H_CEDE immediately and go to nap mode. This required some changes to kvmppc_run_core() and kvmppc_run_vcpu(). These functions have been restructured to make them simpler and clearer. We introduce a level of indirection in the wait queue that gets woken when external and decrementer interrupts get generated for a vcpu, so that we can have the 4 vcpus in a vcore using the same wait queue. We need this because the 4 vcpus are being handled by one thread. Secondly, when we need to exit from the guest to the kernel, we now have to generate an IPI for any napping threads, because an HDEC interrupt doesn't wake up a napping thread. Thirdly, we now need to be able to handle virtual external interrupts and decrementer interrupts becoming pending while a thread is napping, and deliver those interrupts to the guest when the thread wakes. This is done in kvmppc_cede_reentry, just before fast_guest_return. Finally, since we are not using the generic kvm_vcpu_block for book3s_hv, and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef from kvm_arch_vcpu_runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 14:42:46 +07:00
#include <asm/xics.h>
#endif
#ifdef CONFIG_PPC_POWERNV
#include <asm/opal.h>
#endif
#if defined(CONFIG_KVM) || defined(CONFIG_KVM_GUEST)
#include <linux/kvm_host.h>
#endif
#if defined(CONFIG_KVM) && defined(CONFIG_PPC_BOOK3S)
#include <asm/kvm_book3s.h>
#include <asm/kvm_ppc.h>
#endif
#ifdef CONFIG_PPC32
#if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
#include "head_booke.h"
#endif
#endif
#if defined(CONFIG_PPC_FSL_BOOK3E)
#include "../mm/mmu_decl.h"
#endif
#ifdef CONFIG_PPC_8xx
#include <asm/fixmap.h>
#endif
#define STACK_PT_REGS_OFFSET(sym, val) \
DEFINE(sym, STACK_FRAME_OVERHEAD + offsetof(struct pt_regs, val))
int main(void)
{
OFFSET(THREAD, task_struct, thread);
OFFSET(MM, task_struct, mm);
powerpc/32: add stack protector support This functionality was tentatively added in the past (commit 6533b7c16ee5 ("powerpc: Initial stack protector (-fstack-protector) support")) but had to be reverted (commit f2574030b0e3 ("powerpc: Revert the initial stack protector support") because of GCC implementing it differently whether it had been built with libc support or not. Now, GCC offers the possibility to manually set the stack-protector mode (global or tls) regardless of libc support. This time, the patch selects HAVE_STACKPROTECTOR only if -mstack-protector-guard=tls is supported by GCC. On PPC32, as register r2 points to current task_struct at all time, the stack_canary located inside task_struct can be used directly by using the following GCC options: -mstack-protector-guard=tls -mstack-protector-guard-reg=r2 -mstack-protector-guard-offset=offsetof(struct task_struct, stack_canary)) The protector is disabled for prom_init and bootx_init as it is too early to handle it properly. $ echo CORRUPT_STACK > /sys/kernel/debug/provoke-crash/DIRECT [ 134.943666] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: lkdtm_CORRUPT_STACK+0x64/0x64 [ 134.943666] [ 134.955414] CPU: 0 PID: 283 Comm: sh Not tainted 4.18.0-s3k-dev-12143-ga3272be41209 #835 [ 134.963380] Call Trace: [ 134.965860] [c6615d60] [c001f76c] panic+0x118/0x260 (unreliable) [ 134.971775] [c6615dc0] [c001f654] panic+0x0/0x260 [ 134.976435] [c6615dd0] [c032c368] lkdtm_CORRUPT_STACK_STRONG+0x0/0x64 [ 134.982769] [c6615e00] [ffffffff] 0xffffffff Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-27 14:05:53 +07:00
#ifdef CONFIG_STACKPROTECTOR
OFFSET(TASK_CANARY, task_struct, stack_canary);
#ifdef CONFIG_PPC64
OFFSET(PACA_CANARY, paca_struct, canary);
#endif
powerpc/32: add stack protector support This functionality was tentatively added in the past (commit 6533b7c16ee5 ("powerpc: Initial stack protector (-fstack-protector) support")) but had to be reverted (commit f2574030b0e3 ("powerpc: Revert the initial stack protector support") because of GCC implementing it differently whether it had been built with libc support or not. Now, GCC offers the possibility to manually set the stack-protector mode (global or tls) regardless of libc support. This time, the patch selects HAVE_STACKPROTECTOR only if -mstack-protector-guard=tls is supported by GCC. On PPC32, as register r2 points to current task_struct at all time, the stack_canary located inside task_struct can be used directly by using the following GCC options: -mstack-protector-guard=tls -mstack-protector-guard-reg=r2 -mstack-protector-guard-offset=offsetof(struct task_struct, stack_canary)) The protector is disabled for prom_init and bootx_init as it is too early to handle it properly. $ echo CORRUPT_STACK > /sys/kernel/debug/provoke-crash/DIRECT [ 134.943666] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: lkdtm_CORRUPT_STACK+0x64/0x64 [ 134.943666] [ 134.955414] CPU: 0 PID: 283 Comm: sh Not tainted 4.18.0-s3k-dev-12143-ga3272be41209 #835 [ 134.963380] Call Trace: [ 134.965860] [c6615d60] [c001f76c] panic+0x118/0x260 (unreliable) [ 134.971775] [c6615dc0] [c001f654] panic+0x0/0x260 [ 134.976435] [c6615dd0] [c032c368] lkdtm_CORRUPT_STACK_STRONG+0x0/0x64 [ 134.982769] [c6615e00] [ffffffff] 0xffffffff Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-27 14:05:53 +07:00
#endif
OFFSET(MMCONTEXTID, mm_struct, context.id);
#ifdef CONFIG_PPC64
powerpc: Allow perf_counters to access user memory at interrupt time This provides a mechanism to allow the perf_counters code to access user memory in a PMU interrupt routine. Such an access can cause various kinds of interrupt: SLB miss, MMU hash table miss, segment table miss, or TLB miss, depending on the processor. This commit only deals with 64-bit classic/server processors, which use an MMU hash table. 32-bit processors are already able to access user memory at interrupt time. Since we don't soft-disable on 32-bit, we avoid the possibility of reentering hash_page or the TLB miss handlers, since they run with interrupts disabled. On 64-bit processors, an SLB miss interrupt on a user address will update the slb_cache and slb_cache_ptr fields in the paca. This is OK except in the case where a PMU interrupt occurs in switch_slb, which also accesses those fields. To prevent this, we hard-disable interrupts in switch_slb. Interrupts are already soft-disabled at this point, and will get hard-enabled when they get soft-enabled later. This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice, and to make sure that it clears the slb_cache_ptr when called from other callers than switch_slb, the existing routine is renamed to __slb_flush_and_rebolt, which is called by switch_slb and the new version of slb_flush_and_rebolt. Similarly, switch_stab (used on POWER3 and RS64 processors) gets a hard_irq_disable() to protect the per-cpu variables used there and in ste_allocate. If a MMU hashtable miss interrupt occurs, normally we would call hash_page to look up the Linux PTE for the address and create a HPTE. However, hash_page is fairly complex and takes some locks, so to avoid the possibility of deadlock, we check the preemption count to see if we are in a (pseudo-)NMI handler, and if so, we don't call hash_page but instead treat it like a bad access that will get reported up through the exception table mechanism. An interrupt whose handler runs even though the interrupt occurred when soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI handler, which should use nmi_enter()/nmi_exit() rather than irq_enter()/irq_exit(). Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-17 12:17:54 +07:00
DEFINE(SIGSEGV, SIGSEGV);
DEFINE(NMI_MASK, NMI_MASK);
#else
OFFSET(THREAD_INFO, task_struct, stack);
DEFINE(THREAD_INFO_GAP, _ALIGN_UP(sizeof(struct thread_info), 16));
OFFSET(KSP_LIMIT, thread_struct, ksp_limit);
#endif /* CONFIG_PPC64 */
powerpc/livepatch: Add live patching support on ppc64le Add the kconfig logic & assembly support for handling live patched functions. This depends on DYNAMIC_FTRACE_WITH_REGS, which in turn depends on the new -mprofile-kernel ftrace ABI, which is only supported currently on ppc64le. Live patching is handled by a special ftrace handler. This means it runs from ftrace_caller(). The live patch handler modifies the NIP so as to redirect the return from ftrace_caller() to the new patched function. However there is one particularly tricky case we need to handle. If a function A calls another function B, and it is known at link time that they share the same TOC, then A will not save or restore its TOC, and will call the local entry point of B. When we live patch B, we replace it with a new function C, which may not have the same TOC as A. At live patch time it's too late to modify A to do the TOC save/restore, so the live patching code must interpose itself between A and C, and do the TOC save/restore that A omitted. An additionaly complication is that the livepatch code can not create a stack frame in order to save the TOC. That is because if C takes > 8 arguments, or is varargs, A will have written the arguments for C in A's stack frame. To solve this, we introduce a "livepatch stack" which grows upward from the base of the regular stack, and is used to store the TOC & LR when calling a live patched function. When the patched function returns, we retrieve the real LR & TOC from the livepatch stack, restore them, and pop the livepatch "stack frame". Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Torsten Duwe <duwe@suse.de> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
2016-03-24 18:04:05 +07:00
#ifdef CONFIG_LIVEPATCH
OFFSET(TI_livepatch_sp, thread_info, livepatch_sp);
powerpc/livepatch: Add live patching support on ppc64le Add the kconfig logic & assembly support for handling live patched functions. This depends on DYNAMIC_FTRACE_WITH_REGS, which in turn depends on the new -mprofile-kernel ftrace ABI, which is only supported currently on ppc64le. Live patching is handled by a special ftrace handler. This means it runs from ftrace_caller(). The live patch handler modifies the NIP so as to redirect the return from ftrace_caller() to the new patched function. However there is one particularly tricky case we need to handle. If a function A calls another function B, and it is known at link time that they share the same TOC, then A will not save or restore its TOC, and will call the local entry point of B. When we live patch B, we replace it with a new function C, which may not have the same TOC as A. At live patch time it's too late to modify A to do the TOC save/restore, so the live patching code must interpose itself between A and C, and do the TOC save/restore that A omitted. An additionaly complication is that the livepatch code can not create a stack frame in order to save the TOC. That is because if C takes > 8 arguments, or is varargs, A will have written the arguments for C in A's stack frame. To solve this, we introduce a "livepatch stack" which grows upward from the base of the regular stack, and is used to store the TOC & LR when calling a live patched function. When the patched function returns, we retrieve the real LR & TOC from the livepatch stack, restore them, and pop the livepatch "stack frame". Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Torsten Duwe <duwe@suse.de> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
2016-03-24 18:04:05 +07:00
#endif
OFFSET(KSP, thread_struct, ksp);
OFFSET(PT_REGS, thread_struct, regs);
#ifdef CONFIG_BOOKE
OFFSET(THREAD_NORMSAVES, thread_struct, normsave[0]);
#endif
OFFSET(THREAD_FPEXC_MODE, thread_struct, fpexc_mode);
OFFSET(THREAD_FPSTATE, thread_struct, fp_state.fpr);
OFFSET(THREAD_FPSAVEAREA, thread_struct, fp_save_area);
OFFSET(FPSTATE_FPSCR, thread_fp_state, fpscr);
OFFSET(THREAD_LOAD_FP, thread_struct, load_fp);
#ifdef CONFIG_ALTIVEC
OFFSET(THREAD_VRSTATE, thread_struct, vr_state.vr);
OFFSET(THREAD_VRSAVEAREA, thread_struct, vr_save_area);
OFFSET(THREAD_VRSAVE, thread_struct, vrsave);
OFFSET(THREAD_USED_VR, thread_struct, used_vr);
OFFSET(VRSTATE_VSCR, thread_vr_state, vscr);
OFFSET(THREAD_LOAD_VEC, thread_struct, load_vec);
#endif /* CONFIG_ALTIVEC */
powerpc: Introduce VSX thread_struct and CONFIG_VSX The layout of the new VSR registers and how they overlap on top of the legacy FPR and VR registers is: VSR doubleword 0 VSR doubleword 1 ---------------------------------------------------------------- VSR[0] | FPR[0] | | ---------------------------------------------------------------- VSR[1] | FPR[1] | | ---------------------------------------------------------------- | ... | | | ... | | ---------------------------------------------------------------- VSR[30] | FPR[30] | | ---------------------------------------------------------------- VSR[31] | FPR[31] | | ---------------------------------------------------------------- VSR[32] | VR[0] | ---------------------------------------------------------------- VSR[33] | VR[1] | ---------------------------------------------------------------- | ... | | ... | ---------------------------------------------------------------- VSR[62] | VR[30] | ---------------------------------------------------------------- VSR[63] | VR[31] | ---------------------------------------------------------------- VSX has 64 128bit registers. The first 32 regs overlap with the FP registers and hence extend them with and additional 64 bits. The second 32 regs overlap with the VMX registers. This commit introduces the thread_struct changes required to reflect this register layout. Ptrace and signals code is updated so that the floating point registers are correctly accessed from the thread_struct when CONFIG_VSX is enabled. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-06-25 11:07:18 +07:00
#ifdef CONFIG_VSX
OFFSET(THREAD_USED_VSR, thread_struct, used_vsr);
powerpc: Introduce VSX thread_struct and CONFIG_VSX The layout of the new VSR registers and how they overlap on top of the legacy FPR and VR registers is: VSR doubleword 0 VSR doubleword 1 ---------------------------------------------------------------- VSR[0] | FPR[0] | | ---------------------------------------------------------------- VSR[1] | FPR[1] | | ---------------------------------------------------------------- | ... | | | ... | | ---------------------------------------------------------------- VSR[30] | FPR[30] | | ---------------------------------------------------------------- VSR[31] | FPR[31] | | ---------------------------------------------------------------- VSR[32] | VR[0] | ---------------------------------------------------------------- VSR[33] | VR[1] | ---------------------------------------------------------------- | ... | | ... | ---------------------------------------------------------------- VSR[62] | VR[30] | ---------------------------------------------------------------- VSR[63] | VR[31] | ---------------------------------------------------------------- VSX has 64 128bit registers. The first 32 regs overlap with the FP registers and hence extend them with and additional 64 bits. The second 32 regs overlap with the VMX registers. This commit introduces the thread_struct changes required to reflect this register layout. Ptrace and signals code is updated so that the floating point registers are correctly accessed from the thread_struct when CONFIG_VSX is enabled. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-06-25 11:07:18 +07:00
#endif /* CONFIG_VSX */
#ifdef CONFIG_PPC64
OFFSET(KSP_VSID, thread_struct, ksp_vsid);
#else /* CONFIG_PPC64 */
OFFSET(PGDIR, thread_struct, pgdir);
#ifdef CONFIG_SPE
OFFSET(THREAD_EVR0, thread_struct, evr[0]);
OFFSET(THREAD_ACC, thread_struct, acc);
OFFSET(THREAD_SPEFSCR, thread_struct, spefscr);
OFFSET(THREAD_USED_SPE, thread_struct, used_spe);
#endif /* CONFIG_SPE */
#endif /* CONFIG_PPC64 */
#if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
OFFSET(THREAD_DBCR0, thread_struct, debug.dbcr0);
#endif
#ifdef CONFIG_KVM_BOOK3S_32_HANDLER
OFFSET(THREAD_KVM_SVCPU, thread_struct, kvm_shadow_vcpu);
#endif
#if defined(CONFIG_KVM) && defined(CONFIG_BOOKE)
OFFSET(THREAD_KVM_VCPU, thread_struct, kvm_vcpu);
#endif
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
OFFSET(PACATMSCRATCH, paca_struct, tm_scratch);
OFFSET(THREAD_TM_TFHAR, thread_struct, tm_tfhar);
OFFSET(THREAD_TM_TEXASR, thread_struct, tm_texasr);
OFFSET(THREAD_TM_TFIAR, thread_struct, tm_tfiar);
OFFSET(THREAD_TM_TAR, thread_struct, tm_tar);
OFFSET(THREAD_TM_PPR, thread_struct, tm_ppr);
OFFSET(THREAD_TM_DSCR, thread_struct, tm_dscr);
OFFSET(PT_CKPT_REGS, thread_struct, ckpt_regs);
OFFSET(THREAD_CKVRSTATE, thread_struct, ckvr_state.vr);
OFFSET(THREAD_CKVRSAVE, thread_struct, ckvrsave);
OFFSET(THREAD_CKFPSTATE, thread_struct, ckfp_state.fpr);
/* Local pt_regs on stack for Transactional Memory funcs. */
DEFINE(TM_FRAME_SIZE, STACK_FRAME_OVERHEAD +
sizeof(struct pt_regs) + 16);
#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
OFFSET(TI_FLAGS, thread_info, flags);
OFFSET(TI_LOCAL_FLAGS, thread_info, local_flags);
OFFSET(TI_PREEMPT, thread_info, preempt_count);
OFFSET(TI_TASK, thread_info, task);
OFFSET(TI_CPU, thread_info, cpu);
#ifdef CONFIG_PPC64
OFFSET(DCACHEL1BLOCKSIZE, ppc64_caches, l1d.block_size);
OFFSET(DCACHEL1LOGBLOCKSIZE, ppc64_caches, l1d.log_block_size);
OFFSET(DCACHEL1BLOCKSPERPAGE, ppc64_caches, l1d.blocks_per_page);
OFFSET(ICACHEL1BLOCKSIZE, ppc64_caches, l1i.block_size);
OFFSET(ICACHEL1LOGBLOCKSIZE, ppc64_caches, l1i.log_block_size);
OFFSET(ICACHEL1BLOCKSPERPAGE, ppc64_caches, l1i.blocks_per_page);
/* paca */
DEFINE(PACA_SIZE, sizeof(struct paca_struct));
OFFSET(PACAPACAINDEX, paca_struct, paca_index);
OFFSET(PACAPROCSTART, paca_struct, cpu_start);
OFFSET(PACAKSAVE, paca_struct, kstack);
OFFSET(PACACURRENT, paca_struct, __current);
OFFSET(PACASAVEDMSR, paca_struct, saved_msr);
OFFSET(PACASTABRR, paca_struct, stab_rr);
OFFSET(PACAR1, paca_struct, saved_r1);
OFFSET(PACATOC, paca_struct, kernel_toc);
OFFSET(PACAKBASE, paca_struct, kernelbase);
OFFSET(PACAKMSR, paca_struct, kernel_msr);
OFFSET(PACAIRQSOFTMASK, paca_struct, irq_soft_mask);
OFFSET(PACAIRQHAPPENED, paca_struct, irq_happened);
OFFSET(PACA_FTRACE_ENABLED, paca_struct, ftrace_enabled);
#ifdef CONFIG_PPC_BOOK3S
OFFSET(PACACONTEXTID, paca_struct, mm_ctx_id);
#ifdef CONFIG_PPC_MM_SLICES
OFFSET(PACALOWSLICESPSIZE, paca_struct, mm_ctx_low_slices_psize);
OFFSET(PACAHIGHSLICEPSIZE, paca_struct, mm_ctx_high_slices_psize);
OFFSET(PACA_SLB_ADDR_LIMIT, paca_struct, mm_ctx_slb_addr_limit);
DEFINE(MMUPSIZEDEFSIZE, sizeof(struct mmu_psize_def));
#endif /* CONFIG_PPC_MM_SLICES */
#endif
#ifdef CONFIG_PPC_BOOK3E
OFFSET(PACAPGD, paca_struct, pgd);
OFFSET(PACA_KERNELPGD, paca_struct, kernel_pgd);
OFFSET(PACA_EXGEN, paca_struct, exgen);
OFFSET(PACA_EXTLB, paca_struct, extlb);
OFFSET(PACA_EXMC, paca_struct, exmc);
OFFSET(PACA_EXCRIT, paca_struct, excrit);
OFFSET(PACA_EXDBG, paca_struct, exdbg);
OFFSET(PACA_MC_STACK, paca_struct, mc_kstack);
OFFSET(PACA_CRIT_STACK, paca_struct, crit_kstack);
OFFSET(PACA_DBG_STACK, paca_struct, dbg_kstack);
OFFSET(PACA_TCD_PTR, paca_struct, tcd_ptr);
OFFSET(TCD_ESEL_NEXT, tlb_core_data, esel_next);
OFFSET(TCD_ESEL_MAX, tlb_core_data, esel_max);
OFFSET(TCD_ESEL_FIRST, tlb_core_data, esel_first);
#endif /* CONFIG_PPC_BOOK3E */
#ifdef CONFIG_PPC_BOOK3S_64
OFFSET(PACASLBCACHE, paca_struct, slb_cache);
OFFSET(PACASLBCACHEPTR, paca_struct, slb_cache_ptr);
OFFSET(PACAVMALLOCSLLP, paca_struct, vmalloc_sllp);
#ifdef CONFIG_PPC_MM_SLICES
OFFSET(MMUPSIZESLLP, mmu_psize_def, sllp);
[POWERPC] Introduce address space "slices" The basic issue is to be able to do what hugetlbfs does but with different page sizes for some other special filesystems; more specifically, my need is: - Huge pages - SPE local store mappings using 64K pages on a 4K base page size kernel on Cell - Some special 4K segments in 64K-page kernels for mapping a dodgy type of powerpc-specific infiniband hardware that requires 4K MMU mappings for various reasons I won't explain here. The main issues are: - To maintain/keep track of the page size per "segment" (as we can only have one page size per segment on powerpc, which are 256MB divisions of the address space). - To make sure special mappings stay within their allotted "segments" (including MAP_FIXED crap) - To make sure everybody else doesn't mmap/brk/grow_stack into a "segment" that is used for a special mapping Some of the necessary mechanisms to handle that were present in the hugetlbfs code, but mostly in ways not suitable for anything else. The patch relies on some changes to the generic get_unmapped_area() that just got merged. It still hijacks hugetlb callbacks here or there as the generic code hasn't been entirely cleaned up yet but that shouldn't be a problem. So what is a slice ? Well, I re-used the mechanism used formerly by our hugetlbfs implementation which divides the address space in "meta-segments" which I called "slices". The division is done using 256MB slices below 4G, and 1T slices above. Thus the address space is divided currently into 16 "low" slices and 16 "high" slices. (Special case: high slice 0 is the area between 4G and 1T). Doing so simplifies significantly the tracking of segments and avoids having to keep track of all the 256MB segments in the address space. While I used the "concepts" of hugetlbfs, I mostly re-implemented everything in a more generic way and "ported" hugetlbfs to it. Slices can have an associated page size, which is encoded in the mmu context and used by the SLB miss handler to set the segment sizes. The hash code currently doesn't care, it has a specific check for hugepages, though I might add a mechanism to provide per-slice hash mapping functions in the future. The slice code provide a pair of "generic" get_unmapped_area() (bottomup and topdown) functions that should work with any slice size. There is some trickiness here so I would appreciate people to have a look at the implementation of these and let me know if I got something wrong. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-05-08 13:27:27 +07:00
#else
OFFSET(PACACONTEXTSLLP, paca_struct, mm_ctx_sllp);
[POWERPC] Introduce address space "slices" The basic issue is to be able to do what hugetlbfs does but with different page sizes for some other special filesystems; more specifically, my need is: - Huge pages - SPE local store mappings using 64K pages on a 4K base page size kernel on Cell - Some special 4K segments in 64K-page kernels for mapping a dodgy type of powerpc-specific infiniband hardware that requires 4K MMU mappings for various reasons I won't explain here. The main issues are: - To maintain/keep track of the page size per "segment" (as we can only have one page size per segment on powerpc, which are 256MB divisions of the address space). - To make sure special mappings stay within their allotted "segments" (including MAP_FIXED crap) - To make sure everybody else doesn't mmap/brk/grow_stack into a "segment" that is used for a special mapping Some of the necessary mechanisms to handle that were present in the hugetlbfs code, but mostly in ways not suitable for anything else. The patch relies on some changes to the generic get_unmapped_area() that just got merged. It still hijacks hugetlb callbacks here or there as the generic code hasn't been entirely cleaned up yet but that shouldn't be a problem. So what is a slice ? Well, I re-used the mechanism used formerly by our hugetlbfs implementation which divides the address space in "meta-segments" which I called "slices". The division is done using 256MB slices below 4G, and 1T slices above. Thus the address space is divided currently into 16 "low" slices and 16 "high" slices. (Special case: high slice 0 is the area between 4G and 1T). Doing so simplifies significantly the tracking of segments and avoids having to keep track of all the 256MB segments in the address space. While I used the "concepts" of hugetlbfs, I mostly re-implemented everything in a more generic way and "ported" hugetlbfs to it. Slices can have an associated page size, which is encoded in the mmu context and used by the SLB miss handler to set the segment sizes. The hash code currently doesn't care, it has a specific check for hugepages, though I might add a mechanism to provide per-slice hash mapping functions in the future. The slice code provide a pair of "generic" get_unmapped_area() (bottomup and topdown) functions that should work with any slice size. There is some trickiness here so I would appreciate people to have a look at the implementation of these and let me know if I got something wrong. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-05-08 13:27:27 +07:00
#endif /* CONFIG_PPC_MM_SLICES */
OFFSET(PACA_EXGEN, paca_struct, exgen);
OFFSET(PACA_EXMC, paca_struct, exmc);
OFFSET(PACA_EXSLB, paca_struct, exslb);
OFFSET(PACA_EXNMI, paca_struct, exnmi);
#ifdef CONFIG_PPC_PSERIES
OFFSET(PACALPPACAPTR, paca_struct, lppaca_ptr);
#endif
OFFSET(PACA_SLBSHADOWPTR, paca_struct, slb_shadow_ptr);
OFFSET(SLBSHADOW_STACKVSID, slb_shadow, save_area[SLB_NUM_BOLTED - 1].vsid);
OFFSET(SLBSHADOW_STACKESID, slb_shadow, save_area[SLB_NUM_BOLTED - 1].esid);
OFFSET(SLBSHADOW_SAVEAREA, slb_shadow, save_area);
OFFSET(LPPACA_PMCINUSE, lppaca, pmcregs_in_use);
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
OFFSET(PACA_PMCINUSE, paca_struct, pmcregs_in_use);
#endif
OFFSET(LPPACA_DTLIDX, lppaca, dtl_idx);
OFFSET(LPPACA_YIELDCOUNT, lppaca, yield_count);
OFFSET(PACA_DTL_RIDX, paca_struct, dtl_ridx);
#endif /* CONFIG_PPC_BOOK3S_64 */
OFFSET(PACAEMERGSP, paca_struct, emergency_sp);
powerpc/book3s: handle machine check in Linux host. Move machine check entry point into Linux. So far we were dependent on firmware to decode MCE error details and handover the high level info to OS. This patch introduces early machine check routine that saves the MCE information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate stack frame on emergency stack and set the r1 accordingly. This allows us to be prepared to take another exception without loosing context. One thing to note here that, if we get another machine check while ME bit is off then we risk a checkstop. Hence we restrict ourselves to save only MCE information and register saved on PACA_EXMC save are before we turn the ME bit on. We use paca->in_mce flag to differentiate between first entry and nested machine check entry which helps proper use of emergency stack. We increment paca->in_mce every time we enter in early machine check handler and decrement it while leaving. When we enter machine check early handler first time (paca->in_mce == 0), we are sure nobody is using MC emergency stack and allocate a stack frame at the start of the emergency stack. During subsequent entry (paca->in_mce > 0), we know that r1 points inside emergency stack and we allocate separate stack frame accordingly. This prevents us from clobbering MCE information during nested machine checks. The early machine check handler changes are placed under CPU_FTR_HVMODE section. This makes sure that the early machine check handler will get executed only in hypervisor kernel. This is the code flow: Machine Check Interrupt | V 0x200 vector ME=0, IR=0, DR=0 | V +-----------------------------------------------+ |machine_check_pSeries_early: | ME=0, IR=0, DR=0 | Alloc frame on emergency stack | | Save srr1, srr0, dar and dsisr on stack | +-----------------------------------------------+ | (ME=1, IR=0, DR=0, RFID) | V machine_check_handle_early ME=1, IR=0, DR=0 | V +-----------------------------------------------+ | machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0 | Things to do: (in next patches) | | Flush SLB for SLB errors | | Flush TLB for TLB errors | | Decode and save MCE info | +-----------------------------------------------+ | (Fall through existing exception handler routine.) | V machine_check_pSerie ME=1, IR=0, DR=0 | (ME=1, IR=1, DR=1, RFID) | V machine_check_common ME=1, IR=1, DR=1 . . . Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-30 21:34:08 +07:00
#ifdef CONFIG_PPC_BOOK3S_64
OFFSET(PACAMCEMERGSP, paca_struct, mc_emergency_sp);
OFFSET(PACA_NMI_EMERG_SP, paca_struct, nmi_emergency_sp);
OFFSET(PACA_IN_MCE, paca_struct, in_mce);
OFFSET(PACA_IN_NMI, paca_struct, in_nmi);
powerpc/64s: Add support for RFI flush of L1-D cache On some CPUs we can prevent the Meltdown vulnerability by flushing the L1-D cache on exit from kernel to user mode, and from hypervisor to guest. This is known to be the case on at least Power7, Power8 and Power9. At this time we do not know the status of the vulnerability on other CPUs such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale CPUs. As more information comes to light we can enable this, or other mechanisms on those CPUs. The vulnerability occurs when the load of an architecturally inaccessible memory region (eg. userspace load of kernel memory) is speculatively executed to the point where its result can influence the address of a subsequent speculatively executed load. In order for that to happen, the first load must hit in the L1, because before the load is sent to the L2 the permission check is performed. Therefore if no kernel addresses hit in the L1 the vulnerability can not occur. We can ensure that is the case by flushing the L1 whenever we return to userspace. Similarly for hypervisor vs guest. In order to flush the L1-D cache on exit, we add a section of nops at each (h)rfi location that returns to a lower privileged context, and patch that with some sequence. Newer firmwares are able to advertise to us that there is a special nop instruction that flushes the L1-D. If we do not see that advertised, we fall back to doing a displacement flush in software. For guest kernels we support migration between some CPU versions, and different CPUs may use different flush instructions. So that we are prepared to migrate to a machine with a different flush instruction activated, we may have to patch more than one flush instruction at boot if the hypervisor tells us to. In the end this patch is mostly the work of Nicholas Piggin and Michael Ellerman. However a cast of thousands contributed to analysis of the issue, earlier versions of the patch, back ports testing etc. Many thanks to all of them. Tested-by: Jon Masters <jcm@redhat.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-09 23:07:15 +07:00
OFFSET(PACA_RFI_FLUSH_FALLBACK_AREA, paca_struct, rfi_flush_fallback_area);
OFFSET(PACA_EXRFI, paca_struct, exrfi);
OFFSET(PACA_L1D_FLUSH_SIZE, paca_struct, l1d_flush_size);
powerpc/64s: Add support for RFI flush of L1-D cache On some CPUs we can prevent the Meltdown vulnerability by flushing the L1-D cache on exit from kernel to user mode, and from hypervisor to guest. This is known to be the case on at least Power7, Power8 and Power9. At this time we do not know the status of the vulnerability on other CPUs such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale CPUs. As more information comes to light we can enable this, or other mechanisms on those CPUs. The vulnerability occurs when the load of an architecturally inaccessible memory region (eg. userspace load of kernel memory) is speculatively executed to the point where its result can influence the address of a subsequent speculatively executed load. In order for that to happen, the first load must hit in the L1, because before the load is sent to the L2 the permission check is performed. Therefore if no kernel addresses hit in the L1 the vulnerability can not occur. We can ensure that is the case by flushing the L1 whenever we return to userspace. Similarly for hypervisor vs guest. In order to flush the L1-D cache on exit, we add a section of nops at each (h)rfi location that returns to a lower privileged context, and patch that with some sequence. Newer firmwares are able to advertise to us that there is a special nop instruction that flushes the L1-D. If we do not see that advertised, we fall back to doing a displacement flush in software. For guest kernels we support migration between some CPU versions, and different CPUs may use different flush instructions. So that we are prepared to migrate to a machine with a different flush instruction activated, we may have to patch more than one flush instruction at boot if the hypervisor tells us to. In the end this patch is mostly the work of Nicholas Piggin and Michael Ellerman. However a cast of thousands contributed to analysis of the issue, earlier versions of the patch, back ports testing etc. Many thanks to all of them. Tested-by: Jon Masters <jcm@redhat.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-09 23:07:15 +07:00
#endif
OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
powerpc updates for 4.11 part 2 Highlights include: - An update of the disassembly code used by xmon to the latest versions in binutils. We've received permission from all the authors of the relevant binutils changes to relicense their changes to the relevant files from GPLv3 to GPLv2, for inclusion in Linux. Thanks to Peter Bergner for doing the leg work to get permission from everyone. - Addition of the "architected" Power9 CPU table entry, allowing us to boot in Power9 architected mode under a hypervisor. - Updates to the Power9 PMU code. - Implementation of clear_bit_unlock_is_negative_byte() to optimise unlock_page(). - Freescale updates from Scott: "Highlights include 8xx breakpoints and perf, t1042rdb display support, and board updates." Thanks to: Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas Miller, Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan, Michael Roth, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Peter Bergner, Paul E. McKenney, Rashmica Gupta, Russell Currey, Sahil Mehta, Stewart Smith. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJYthsKAAoJEFHr6jzI4aWAaWMQAJ7mAwX98ncoYschPgRmmIun f6DtE4IonrxiZ22gp1ct4+c9OFtA+B5FXMcEhOKpfh93lg38PTDjHs9e5kfauD7+ oTQ2Bg1eXaL48FKdmC5Vs4Kt+/J8e9guGafUC1OVIpTyyRPoZeUDH0lx+kSPV5bd PkL+wY/k3W0Njo8WgD1P9u3W15+BxISo/k8c7ajzKTHGBZlAvj5h2gO6XUBNMLyy YClB/qIymjZriSB+AeWYD79k8gPbBZPsmZG0ZF1hY060894LgqLB9mPOJdffx/DY H7/uP6jcsRDOXTOmyueW1SEmPoQbtysiMd1lNrCXKtC/Okr5uhn2cUhi88AsgWvd 1QFly2lobcDAKPah/yB7YQGMAcmYvGGNuqrWaosaV2T7r0KprzUYYgCOqzvC3WSJ QtVatBzMIqRTMYq+3U4G1aHeCXlRazVQHDuvPby8RdR5b2gIexiqMab2eS7tSMIH mCOIunRIvT14g/7wxUV7tahN+ifncNxzAk4DvPO+Wc4FQ4sy7wArv2YipSaWRWtE u7tNdBkEwlDkKhJgRU5T0Op2PyMbHwCP8pWuz7PQIhKIcgwmP9wb07BIWG/GGIqn 07TxJYX2ItabyEMZMsYhzILZqjLyiAaCARANB7ScbQbdP8wdcGZcwismhwnfROIU NuxsZg63BUDMoxk7Sauu =rspd -----END PGP SIGNATURE----- Merge tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull more powerpc updates from Michael Ellerman: "Highlights include: - an update of the disassembly code used by xmon to the latest versions in binutils. We've received permission from all the authors of the relevant binutils changes to relicense their changes to the relevant files from GPLv3 to GPLv2, for inclusion in Linux. Thanks to Peter Bergner for doing the leg work to get permission from everyone. - addition of the "architected" Power9 CPU table entry, allowing us to boot in Power9 architected mode under a hypervisor. - updates to the Power9 PMU code. - implementation of clear_bit_unlock_is_negative_byte() to optimise unlock_page(). - Freescale updates from Scott: "Highlights include 8xx breakpoints and perf, t1042rdb display support, and board updates." Thanks to: Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas Miller, Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan, Michael Roth, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Peter Bergner, Paul E. McKenney, Rashmica Gupta, Russell Currey, Sahil Mehta, Stewart Smith" * tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits) powerpc: Remove leftover cputime_to_nsecs call causing build error powerpc/mm/hash: Always clear UPRT and Host Radix bits when setting up CPU powerpc/optprobes: Fix TOC handling in optprobes trampoline powerpc/pseries: Advertise Hot Plug Event support to firmware cxl: fix nested locking hang during EEH hotplug powerpc/xmon: Dump memory in CPU endian format powerpc/pseries: Revert 'Auto-online hotplugged memory' powerpc/powernv: Make PCI non-optional powerpc/64: Implement clear_bit_unlock_is_negative_byte() powerpc/powernv: Remove unused variable in pnv_pci_sriov_disable() powerpc/kernel: Remove error message in pcibios_setup_phb_resources() powerpc/mm: Fix typo in set_pte_at() pci/hotplug/pnv-php: Disable MSI and PCI device properly pci/hotplug/pnv-php: Disable surprise hotplug capability on conflicts pci/hotplug/pnv-php: Remove WARN_ON() in pnv_php_put_slot() powerpc: Add POWER9 architected mode to cputable powerpc/perf: use is_kernel_addr macro in perf_get_misc_flags() powerpc/perf: Avoid FAB_*_MATCH checks for power9 powerpc/perf: Add restrictions to PMC5 in power9 DD1 powerpc/perf: Use Instruction Counter value ...
2017-03-02 01:10:16 +07:00
OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
OFFSET(ACCOUNT_SYSTEM_TIME, paca_struct, accounting.stime);
OFFSET(PACA_TRAP_SAVE, paca_struct, trap_save);
OFFSET(PACA_NAPSTATELOST, paca_struct, nap_state_lost);
OFFSET(PACA_SPRG_VDSO, paca_struct, sprg_vdso);
#else /* CONFIG_PPC64 */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
OFFSET(ACCOUNT_STARTTIME, thread_info, accounting.starttime);
OFFSET(ACCOUNT_STARTTIME_USER, thread_info, accounting.starttime_user);
powerpc updates for 4.11 part 2 Highlights include: - An update of the disassembly code used by xmon to the latest versions in binutils. We've received permission from all the authors of the relevant binutils changes to relicense their changes to the relevant files from GPLv3 to GPLv2, for inclusion in Linux. Thanks to Peter Bergner for doing the leg work to get permission from everyone. - Addition of the "architected" Power9 CPU table entry, allowing us to boot in Power9 architected mode under a hypervisor. - Updates to the Power9 PMU code. - Implementation of clear_bit_unlock_is_negative_byte() to optimise unlock_page(). - Freescale updates from Scott: "Highlights include 8xx breakpoints and perf, t1042rdb display support, and board updates." Thanks to: Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas Miller, Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan, Michael Roth, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Peter Bergner, Paul E. McKenney, Rashmica Gupta, Russell Currey, Sahil Mehta, Stewart Smith. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJYthsKAAoJEFHr6jzI4aWAaWMQAJ7mAwX98ncoYschPgRmmIun f6DtE4IonrxiZ22gp1ct4+c9OFtA+B5FXMcEhOKpfh93lg38PTDjHs9e5kfauD7+ oTQ2Bg1eXaL48FKdmC5Vs4Kt+/J8e9guGafUC1OVIpTyyRPoZeUDH0lx+kSPV5bd PkL+wY/k3W0Njo8WgD1P9u3W15+BxISo/k8c7ajzKTHGBZlAvj5h2gO6XUBNMLyy YClB/qIymjZriSB+AeWYD79k8gPbBZPsmZG0ZF1hY060894LgqLB9mPOJdffx/DY H7/uP6jcsRDOXTOmyueW1SEmPoQbtysiMd1lNrCXKtC/Okr5uhn2cUhi88AsgWvd 1QFly2lobcDAKPah/yB7YQGMAcmYvGGNuqrWaosaV2T7r0KprzUYYgCOqzvC3WSJ QtVatBzMIqRTMYq+3U4G1aHeCXlRazVQHDuvPby8RdR5b2gIexiqMab2eS7tSMIH mCOIunRIvT14g/7wxUV7tahN+ifncNxzAk4DvPO+Wc4FQ4sy7wArv2YipSaWRWtE u7tNdBkEwlDkKhJgRU5T0Op2PyMbHwCP8pWuz7PQIhKIcgwmP9wb07BIWG/GGIqn 07TxJYX2ItabyEMZMsYhzILZqjLyiAaCARANB7ScbQbdP8wdcGZcwismhwnfROIU NuxsZg63BUDMoxk7Sauu =rspd -----END PGP SIGNATURE----- Merge tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull more powerpc updates from Michael Ellerman: "Highlights include: - an update of the disassembly code used by xmon to the latest versions in binutils. We've received permission from all the authors of the relevant binutils changes to relicense their changes to the relevant files from GPLv3 to GPLv2, for inclusion in Linux. Thanks to Peter Bergner for doing the leg work to get permission from everyone. - addition of the "architected" Power9 CPU table entry, allowing us to boot in Power9 architected mode under a hypervisor. - updates to the Power9 PMU code. - implementation of clear_bit_unlock_is_negative_byte() to optimise unlock_page(). - Freescale updates from Scott: "Highlights include 8xx breakpoints and perf, t1042rdb display support, and board updates." Thanks to: Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas Miller, Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan, Michael Roth, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Peter Bergner, Paul E. McKenney, Rashmica Gupta, Russell Currey, Sahil Mehta, Stewart Smith" * tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits) powerpc: Remove leftover cputime_to_nsecs call causing build error powerpc/mm/hash: Always clear UPRT and Host Radix bits when setting up CPU powerpc/optprobes: Fix TOC handling in optprobes trampoline powerpc/pseries: Advertise Hot Plug Event support to firmware cxl: fix nested locking hang during EEH hotplug powerpc/xmon: Dump memory in CPU endian format powerpc/pseries: Revert 'Auto-online hotplugged memory' powerpc/powernv: Make PCI non-optional powerpc/64: Implement clear_bit_unlock_is_negative_byte() powerpc/powernv: Remove unused variable in pnv_pci_sriov_disable() powerpc/kernel: Remove error message in pcibios_setup_phb_resources() powerpc/mm: Fix typo in set_pte_at() pci/hotplug/pnv-php: Disable MSI and PCI device properly pci/hotplug/pnv-php: Disable surprise hotplug capability on conflicts pci/hotplug/pnv-php: Remove WARN_ON() in pnv_php_put_slot() powerpc: Add POWER9 architected mode to cputable powerpc/perf: use is_kernel_addr macro in perf_get_misc_flags() powerpc/perf: Avoid FAB_*_MATCH checks for power9 powerpc/perf: Add restrictions to PMC5 in power9 DD1 powerpc/perf: Use Instruction Counter value ...
2017-03-02 01:10:16 +07:00
OFFSET(ACCOUNT_USER_TIME, thread_info, accounting.utime);
OFFSET(ACCOUNT_SYSTEM_TIME, thread_info, accounting.stime);
#endif
#endif /* CONFIG_PPC64 */
/* RTAS */
OFFSET(RTASBASE, rtas_t, base);
OFFSET(RTASENTRY, rtas_t, entry);
/* Interrupt register frame */
DEFINE(INT_FRAME_SIZE, STACK_INT_FRAME_SIZE);
DEFINE(SWITCH_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct pt_regs));
STACK_PT_REGS_OFFSET(GPR0, gpr[0]);
STACK_PT_REGS_OFFSET(GPR1, gpr[1]);
STACK_PT_REGS_OFFSET(GPR2, gpr[2]);
STACK_PT_REGS_OFFSET(GPR3, gpr[3]);
STACK_PT_REGS_OFFSET(GPR4, gpr[4]);
STACK_PT_REGS_OFFSET(GPR5, gpr[5]);
STACK_PT_REGS_OFFSET(GPR6, gpr[6]);
STACK_PT_REGS_OFFSET(GPR7, gpr[7]);
STACK_PT_REGS_OFFSET(GPR8, gpr[8]);
STACK_PT_REGS_OFFSET(GPR9, gpr[9]);
STACK_PT_REGS_OFFSET(GPR10, gpr[10]);
STACK_PT_REGS_OFFSET(GPR11, gpr[11]);
STACK_PT_REGS_OFFSET(GPR12, gpr[12]);
STACK_PT_REGS_OFFSET(GPR13, gpr[13]);
#ifndef CONFIG_PPC64
STACK_PT_REGS_OFFSET(GPR14, gpr[14]);
#endif /* CONFIG_PPC64 */
/*
* Note: these symbols include _ because they overlap with special
* register names
*/
STACK_PT_REGS_OFFSET(_NIP, nip);
STACK_PT_REGS_OFFSET(_MSR, msr);
STACK_PT_REGS_OFFSET(_CTR, ctr);
STACK_PT_REGS_OFFSET(_LINK, link);
STACK_PT_REGS_OFFSET(_CCR, ccr);
STACK_PT_REGS_OFFSET(_XER, xer);
STACK_PT_REGS_OFFSET(_DAR, dar);
STACK_PT_REGS_OFFSET(_DSISR, dsisr);
STACK_PT_REGS_OFFSET(ORIG_GPR3, orig_gpr3);
STACK_PT_REGS_OFFSET(RESULT, result);
STACK_PT_REGS_OFFSET(_TRAP, trap);
#ifndef CONFIG_PPC64
/*
* The PowerPC 400-class & Book-E processors have neither the DAR
* nor the DSISR SPRs. Hence, we overload them to hold the similar
* DEAR and ESR SPRs for such processors. For critical interrupts
* we use them to hold SRR0 and SRR1.
*/
STACK_PT_REGS_OFFSET(_DEAR, dar);
STACK_PT_REGS_OFFSET(_ESR, dsisr);
#else /* CONFIG_PPC64 */
STACK_PT_REGS_OFFSET(SOFTE, softe);
STACK_PT_REGS_OFFSET(_PPR, ppr);
#endif /* CONFIG_PPC64 */
#if defined(CONFIG_PPC32)
#if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
DEFINE(EXC_LVL_SIZE, STACK_EXC_LVL_FRAME_SIZE);
DEFINE(MAS0, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, mas0));
/* we overload MMUCR for 44x on MAS0 since they are mutually exclusive */
DEFINE(MMUCR, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, mas0));
DEFINE(MAS1, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, mas1));
DEFINE(MAS2, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, mas2));
DEFINE(MAS3, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, mas3));
DEFINE(MAS6, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, mas6));
DEFINE(MAS7, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, mas7));
DEFINE(_SRR0, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, srr0));
DEFINE(_SRR1, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, srr1));
DEFINE(_CSRR0, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, csrr0));
DEFINE(_CSRR1, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, csrr1));
DEFINE(_DSRR0, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, dsrr0));
DEFINE(_DSRR1, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, dsrr1));
DEFINE(SAVED_KSP_LIMIT, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, saved_ksp_limit));
#endif
#endif
#ifndef CONFIG_PPC64
OFFSET(MM_PGD, mm_struct, pgd);
#endif /* ! CONFIG_PPC64 */
/* About the CPU features table */
OFFSET(CPU_SPEC_FEATURES, cpu_spec, cpu_features);
OFFSET(CPU_SPEC_SETUP, cpu_spec, cpu_setup);
OFFSET(CPU_SPEC_RESTORE, cpu_spec, cpu_restore);
OFFSET(pbe_address, pbe, address);
OFFSET(pbe_orig_address, pbe, orig_address);
OFFSET(pbe_next, pbe, next);
#ifndef CONFIG_PPC64
DEFINE(TASK_SIZE, TASK_SIZE);
DEFINE(NUM_USER_SEGMENTS, TASK_SIZE>>28);
#endif /* ! CONFIG_PPC64 */
/* datapage offsets for use by vdso */
OFFSET(CFG_TB_ORIG_STAMP, vdso_data, tb_orig_stamp);
OFFSET(CFG_TB_TICKS_PER_SEC, vdso_data, tb_ticks_per_sec);
OFFSET(CFG_TB_TO_XS, vdso_data, tb_to_xs);
OFFSET(CFG_TB_UPDATE_COUNT, vdso_data, tb_update_count);
OFFSET(CFG_TZ_MINUTEWEST, vdso_data, tz_minuteswest);
OFFSET(CFG_TZ_DSTTIME, vdso_data, tz_dsttime);
OFFSET(CFG_SYSCALL_MAP32, vdso_data, syscall_map_32);
OFFSET(WTOM_CLOCK_SEC, vdso_data, wtom_clock_sec);
OFFSET(WTOM_CLOCK_NSEC, vdso_data, wtom_clock_nsec);
OFFSET(STAMP_XTIME, vdso_data, stamp_xtime);
OFFSET(STAMP_SEC_FRAC, vdso_data, stamp_sec_fraction);
OFFSET(CFG_ICACHE_BLOCKSZ, vdso_data, icache_block_size);
OFFSET(CFG_DCACHE_BLOCKSZ, vdso_data, dcache_block_size);
OFFSET(CFG_ICACHE_LOGBLOCKSZ, vdso_data, icache_log_block_size);
OFFSET(CFG_DCACHE_LOGBLOCKSZ, vdso_data, dcache_log_block_size);
#ifdef CONFIG_PPC64
OFFSET(CFG_SYSCALL_MAP64, vdso_data, syscall_map_64);
OFFSET(TVAL64_TV_SEC, timeval, tv_sec);
OFFSET(TVAL64_TV_USEC, timeval, tv_usec);
OFFSET(TVAL32_TV_SEC, compat_timeval, tv_sec);
OFFSET(TVAL32_TV_USEC, compat_timeval, tv_usec);
OFFSET(TSPC64_TV_SEC, timespec, tv_sec);
OFFSET(TSPC64_TV_NSEC, timespec, tv_nsec);
OFFSET(TSPC32_TV_SEC, compat_timespec, tv_sec);
OFFSET(TSPC32_TV_NSEC, compat_timespec, tv_nsec);
#else
OFFSET(TVAL32_TV_SEC, timeval, tv_sec);
OFFSET(TVAL32_TV_USEC, timeval, tv_usec);
OFFSET(TSPC32_TV_SEC, timespec, tv_sec);
OFFSET(TSPC32_TV_NSEC, timespec, tv_nsec);
#endif
/* timeval/timezone offsets for use by vdso */
OFFSET(TZONE_TZ_MINWEST, timezone, tz_minuteswest);
OFFSET(TZONE_TZ_DSTTIME, timezone, tz_dsttime);
/* Other bits used by the vdso */
DEFINE(CLOCK_REALTIME, CLOCK_REALTIME);
DEFINE(CLOCK_MONOTONIC, CLOCK_MONOTONIC);
DEFINE(CLOCK_REALTIME_COARSE, CLOCK_REALTIME_COARSE);
DEFINE(CLOCK_MONOTONIC_COARSE, CLOCK_MONOTONIC_COARSE);
DEFINE(NSEC_PER_SEC, NSEC_PER_SEC);
DEFINE(CLOCK_REALTIME_RES, MONOTONIC_RES_NSEC);
#ifdef CONFIG_BUG
DEFINE(BUG_ENTRY_SIZE, sizeof(struct bug_entry));
#endif
powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages Recently in commit f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB"), we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong. The end result is that the PGD (Page Global Directory, ie top level page table) of the kernel (aka. swapper_pg_dir), is too small. This generally doesn't lead to a crash, as we don't use the full range in normal operation. However if we try to dump the kernel pagetables we can trigger a crash because we walk off the end of the pgd into other memory and eventually try to dereference something bogus: $ cat /sys/kernel/debug/kernel_pagetables Unable to handle kernel paging request for data at address 0xe8fece0000000000 Faulting instruction address: 0xc000000000072314 cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890] pc: c000000000072314: ptdump_show+0x164/0x430 lr: c000000000072550: ptdump_show+0x3a0/0x430 dar: e802cf0000000000 seq_read+0xf8/0x560 full_proxy_read+0x84/0xc0 __vfs_read+0x6c/0x1d0 vfs_read+0xbc/0x1b0 SyS_read+0x6c/0x110 system_call+0x38/0xfc The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move the calculation into asm-offsets.c where we can do it easily using max(). Fixes: f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-12 11:56:36 +07:00
#ifdef CONFIG_PPC_BOOK3S_64
DEFINE(PGD_TABLE_SIZE, (sizeof(pgd_t) << max(RADIX_PGD_INDEX_SIZE, H_PGD_INDEX_SIZE)));
#else
DEFINE(PGD_TABLE_SIZE, PGD_TABLE_SIZE);
#endif
DEFINE(PTE_SIZE, sizeof(pte_t));
#ifdef CONFIG_KVM
OFFSET(VCPU_HOST_STACK, kvm_vcpu, arch.host_stack);
OFFSET(VCPU_HOST_PID, kvm_vcpu, arch.host_pid);
OFFSET(VCPU_GUEST_PID, kvm_vcpu, arch.pid);
OFFSET(VCPU_GPRS, kvm_vcpu, arch.regs.gpr);
OFFSET(VCPU_VRSAVE, kvm_vcpu, arch.vrsave);
OFFSET(VCPU_FPRS, kvm_vcpu, arch.fp.fpr);
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
#ifdef CONFIG_ALTIVEC
OFFSET(VCPU_VRS, kvm_vcpu, arch.vr.vr);
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
#endif
OFFSET(VCPU_XER, kvm_vcpu, arch.regs.xer);
OFFSET(VCPU_CTR, kvm_vcpu, arch.regs.ctr);
OFFSET(VCPU_LR, kvm_vcpu, arch.regs.link);
#ifdef CONFIG_PPC_BOOK3S
OFFSET(VCPU_TAR, kvm_vcpu, arch.tar);
#endif
OFFSET(VCPU_CR, kvm_vcpu, arch.cr);
OFFSET(VCPU_PC, kvm_vcpu, arch.regs.nip);
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
OFFSET(VCPU_MSR, kvm_vcpu, arch.shregs.msr);
OFFSET(VCPU_SRR0, kvm_vcpu, arch.shregs.srr0);
OFFSET(VCPU_SRR1, kvm_vcpu, arch.shregs.srr1);
OFFSET(VCPU_SPRG0, kvm_vcpu, arch.shregs.sprg0);
OFFSET(VCPU_SPRG1, kvm_vcpu, arch.shregs.sprg1);
OFFSET(VCPU_SPRG2, kvm_vcpu, arch.shregs.sprg2);
OFFSET(VCPU_SPRG3, kvm_vcpu, arch.shregs.sprg3);
KVM: PPC: Book3S HV: Accumulate timing information for real-mode code This reads the timebase at various points in the real-mode guest entry/exit code and uses that to accumulate total, minimum and maximum time spent in those parts of the code. Currently these times are accumulated per vcpu in 5 parts of the code: * rm_entry - time taken from the start of kvmppc_hv_entry() until just before entering the guest. * rm_intr - time from when we take a hypervisor interrupt in the guest until we either re-enter the guest or decide to exit to the host. This includes time spent handling hcalls in real mode. * rm_exit - time from when we decide to exit the guest until the return from kvmppc_hv_entry(). * guest - time spend in the guest * cede - time spent napping in real mode due to an H_CEDE hcall while other threads in the same vcore are active. These times are exposed in debugfs in a directory per vcpu that contains a file called "timings". This file contains one line for each of the 5 timings above, with the name followed by a colon and 4 numbers, which are the count (number of times the code has been executed), the total time, the minimum time, and the maximum time, all in nanoseconds. The overhead of the extra code amounts to about 30ns for an hcall that is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since production environments may not wish to incur this overhead, the new code is conditional on a new config symbol, CONFIG_KVM_BOOK3S_HV_EXIT_TIMING. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-28 10:21:02 +07:00
#endif
#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
OFFSET(VCPU_TB_RMENTRY, kvm_vcpu, arch.rm_entry);
OFFSET(VCPU_TB_RMINTR, kvm_vcpu, arch.rm_intr);
OFFSET(VCPU_TB_RMEXIT, kvm_vcpu, arch.rm_exit);
OFFSET(VCPU_TB_GUEST, kvm_vcpu, arch.guest_time);
OFFSET(VCPU_TB_CEDE, kvm_vcpu, arch.cede_time);
OFFSET(VCPU_CUR_ACTIVITY, kvm_vcpu, arch.cur_activity);
OFFSET(VCPU_ACTIVITY_START, kvm_vcpu, arch.cur_tb_start);
OFFSET(TAS_SEQCOUNT, kvmhv_tb_accumulator, seqcount);
OFFSET(TAS_TOTAL, kvmhv_tb_accumulator, tb_total);
OFFSET(TAS_MIN, kvmhv_tb_accumulator, tb_min);
OFFSET(TAS_MAX, kvmhv_tb_accumulator, tb_max);
#endif
OFFSET(VCPU_SHARED_SPRG3, kvm_vcpu_arch_shared, sprg3);
OFFSET(VCPU_SHARED_SPRG4, kvm_vcpu_arch_shared, sprg4);
OFFSET(VCPU_SHARED_SPRG5, kvm_vcpu_arch_shared, sprg5);
OFFSET(VCPU_SHARED_SPRG6, kvm_vcpu_arch_shared, sprg6);
OFFSET(VCPU_SHARED_SPRG7, kvm_vcpu_arch_shared, sprg7);
OFFSET(VCPU_SHADOW_PID, kvm_vcpu, arch.shadow_pid);
OFFSET(VCPU_SHADOW_PID1, kvm_vcpu, arch.shadow_pid1);
OFFSET(VCPU_SHARED, kvm_vcpu, arch.shared);
OFFSET(VCPU_SHARED_MSR, kvm_vcpu_arch_shared, msr);
OFFSET(VCPU_SHADOW_MSR, kvm_vcpu, arch.shadow_msr);
#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
OFFSET(VCPU_SHAREDBE, kvm_vcpu, arch.shared_big_endian);
#endif
OFFSET(VCPU_SHARED_MAS0, kvm_vcpu_arch_shared, mas0);
OFFSET(VCPU_SHARED_MAS1, kvm_vcpu_arch_shared, mas1);
OFFSET(VCPU_SHARED_MAS2, kvm_vcpu_arch_shared, mas2);
OFFSET(VCPU_SHARED_MAS7_3, kvm_vcpu_arch_shared, mas7_3);
OFFSET(VCPU_SHARED_MAS4, kvm_vcpu_arch_shared, mas4);
OFFSET(VCPU_SHARED_MAS6, kvm_vcpu_arch_shared, mas6);
OFFSET(VCPU_KVM, kvm_vcpu, kvm);
OFFSET(KVM_LPID, kvm, arch.lpid);
/* book3s */
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
OFFSET(KVM_TLB_SETS, kvm, arch.tlb_sets);
OFFSET(KVM_SDR1, kvm, arch.sdr1);
OFFSET(KVM_HOST_LPID, kvm, arch.host_lpid);
OFFSET(KVM_HOST_LPCR, kvm, arch.host_lpcr);
OFFSET(KVM_HOST_SDR1, kvm, arch.host_sdr1);
OFFSET(KVM_NEED_FLUSH, kvm, arch.need_tlb_flush.bits);
OFFSET(KVM_ENABLED_HCALLS, kvm, arch.enabled_hcalls);
OFFSET(KVM_VRMA_SLB_V, kvm, arch.vrma_slb_v);
OFFSET(KVM_RADIX, kvm, arch.radix);
OFFSET(KVM_FWNMI, kvm, arch.fwnmi_enabled);
OFFSET(VCPU_DSISR, kvm_vcpu, arch.shregs.dsisr);
OFFSET(VCPU_DAR, kvm_vcpu, arch.shregs.dar);
OFFSET(VCPU_VPA, kvm_vcpu, arch.vpa.pinned_addr);
OFFSET(VCPU_VPA_DIRTY, kvm_vcpu, arch.vpa.dirty);
OFFSET(VCPU_HEIR, kvm_vcpu, arch.emul_inst);
OFFSET(VCPU_CPU, kvm_vcpu, cpu);
OFFSET(VCPU_THREAD_CPU, kvm_vcpu, arch.thread_cpu);
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
#endif
#ifdef CONFIG_PPC_BOOK3S
OFFSET(VCPU_PURR, kvm_vcpu, arch.purr);
OFFSET(VCPU_SPURR, kvm_vcpu, arch.spurr);
OFFSET(VCPU_IC, kvm_vcpu, arch.ic);
OFFSET(VCPU_DSCR, kvm_vcpu, arch.dscr);
OFFSET(VCPU_AMR, kvm_vcpu, arch.amr);
OFFSET(VCPU_UAMOR, kvm_vcpu, arch.uamor);
OFFSET(VCPU_IAMR, kvm_vcpu, arch.iamr);
OFFSET(VCPU_CTRL, kvm_vcpu, arch.ctrl);
OFFSET(VCPU_DABR, kvm_vcpu, arch.dabr);
OFFSET(VCPU_DABRX, kvm_vcpu, arch.dabrx);
OFFSET(VCPU_DAWR, kvm_vcpu, arch.dawr);
OFFSET(VCPU_DAWRX, kvm_vcpu, arch.dawrx);
OFFSET(VCPU_CIABR, kvm_vcpu, arch.ciabr);
OFFSET(VCPU_HFLAGS, kvm_vcpu, arch.hflags);
OFFSET(VCPU_DEC, kvm_vcpu, arch.dec);
OFFSET(VCPU_DEC_EXPIRES, kvm_vcpu, arch.dec_expires);
OFFSET(VCPU_PENDING_EXC, kvm_vcpu, arch.pending_exceptions);
OFFSET(VCPU_CEDED, kvm_vcpu, arch.ceded);
OFFSET(VCPU_PRODDED, kvm_vcpu, arch.prodded);
OFFSET(VCPU_IRQ_PENDING, kvm_vcpu, arch.irq_pending);
KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 On POWER9, we no longer have the restriction that we had on POWER8 where all threads in a core have to be in the same partition, so the CPU threads are now independent. However, we still want to be able to run guests with a virtual SMT topology, if only to allow migration of guests from POWER8 systems to POWER9. A guest that has a virtual SMT mode greater than 1 will expect to be able to use the doorbell facility; it will expect the msgsndp and msgclrp instructions to work appropriately and to be able to read sensible values from the TIR (thread identification register) and DPDES (directed privileged doorbell exception status) special-purpose registers. However, since each CPU thread is a separate sub-processor in POWER9, these instructions and registers can only be used within a single CPU thread. In order for these instructions to appear to act correctly according to the guest's virtual SMT mode, we have to trap and emulate them. We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR register. The emulation is triggered by the hypervisor facility unavailable interrupt that occurs when the guest uses them. To cause a doorbell interrupt to occur within the guest, we set the DPDES register to 1. If the guest has interrupts enabled, the CPU will generate a doorbell interrupt and clear the DPDES register in hardware. The DPDES hardware register for the guest is saved in the vcpu->arch.vcore->dpdes field. Since this gets written by the guest exit code, other VCPUs wishing to cause a doorbell interrupt don't write that field directly, but instead set a vcpu->arch.doorbell_request flag. This is consumed and set to 0 by the guest entry code, which then sets DPDES to 1. Emulating reads of the DPDES register is somewhat involved, because it requires reading the doorbell pending interrupt status of all of the VCPU threads in the virtual core, and if any of those VCPUs are running, their doorbell status is only up-to-date in the hardware DPDES registers of the CPUs where they are running. In order to get a reasonable approximation of the current doorbell status, we send those CPUs an IPI, causing an exit from the guest which will update the vcpu->arch.vcore->dpdes field. We then use that value in constructing the emulated DPDES register value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-16 13:41:20 +07:00
OFFSET(VCPU_DBELL_REQ, kvm_vcpu, arch.doorbell_request);
OFFSET(VCPU_MMCR, kvm_vcpu, arch.mmcr);
OFFSET(VCPU_PMC, kvm_vcpu, arch.pmc);
OFFSET(VCPU_SPMC, kvm_vcpu, arch.spmc);
OFFSET(VCPU_SIAR, kvm_vcpu, arch.siar);
OFFSET(VCPU_SDAR, kvm_vcpu, arch.sdar);
OFFSET(VCPU_SIER, kvm_vcpu, arch.sier);
OFFSET(VCPU_SLB, kvm_vcpu, arch.slb);
OFFSET(VCPU_SLB_MAX, kvm_vcpu, arch.slb_max);
OFFSET(VCPU_SLB_NR, kvm_vcpu, arch.slb_nr);
OFFSET(VCPU_FAULT_DSISR, kvm_vcpu, arch.fault_dsisr);
OFFSET(VCPU_FAULT_DAR, kvm_vcpu, arch.fault_dar);
OFFSET(VCPU_FAULT_GPA, kvm_vcpu, arch.fault_gpa);
OFFSET(VCPU_INTR_MSR, kvm_vcpu, arch.intr_msr);
OFFSET(VCPU_LAST_INST, kvm_vcpu, arch.last_inst);
OFFSET(VCPU_TRAP, kvm_vcpu, arch.trap);
OFFSET(VCPU_CFAR, kvm_vcpu, arch.cfar);
OFFSET(VCPU_PPR, kvm_vcpu, arch.ppr);
OFFSET(VCPU_FSCR, kvm_vcpu, arch.fscr);
OFFSET(VCPU_PSPB, kvm_vcpu, arch.pspb);
OFFSET(VCPU_EBBHR, kvm_vcpu, arch.ebbhr);
OFFSET(VCPU_EBBRR, kvm_vcpu, arch.ebbrr);
OFFSET(VCPU_BESCR, kvm_vcpu, arch.bescr);
OFFSET(VCPU_CSIGR, kvm_vcpu, arch.csigr);
OFFSET(VCPU_TACR, kvm_vcpu, arch.tacr);
OFFSET(VCPU_TCSCR, kvm_vcpu, arch.tcscr);
OFFSET(VCPU_ACOP, kvm_vcpu, arch.acop);
OFFSET(VCPU_WORT, kvm_vcpu, arch.wort);
OFFSET(VCPU_TID, kvm_vcpu, arch.tid);
OFFSET(VCPU_PSSCR, kvm_vcpu, arch.psscr);
OFFSET(VCPU_HFSCR, kvm_vcpu, arch.hfscr);
OFFSET(VCORE_ENTRY_EXIT, kvmppc_vcore, entry_exit_map);
OFFSET(VCORE_IN_GUEST, kvmppc_vcore, in_guest);
OFFSET(VCORE_NAPPING_THREADS, kvmppc_vcore, napping_threads);
OFFSET(VCORE_KVM, kvmppc_vcore, kvm);
OFFSET(VCORE_TB_OFFSET, kvmppc_vcore, tb_offset);
KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry Currently, the HV KVM guest entry/exit code adds the timebase offset from the vcore struct to the timebase on guest entry, and subtracts it on guest exit. Which is fine, except that it is possible for userspace to change the offset using the SET_ONE_REG interface while the vcore is running, as there is only one timebase offset per vcore but potentially multiple VCPUs in the vcore. If that were to happen, KVM would subtract a different offset on guest exit from that which it had added on guest entry, leading to the timebase being out of sync between cores in the host, which then leads to bad things happening such as hangs and spurious watchdog timeouts. To fix this, we add a new field 'tb_offset_applied' to the vcore struct which stores the offset that is currently applied to the timebase. This value is set from the vcore tb_offset field on guest entry, and is what is subtracted from the timebase on guest exit. Since it is zero when the timebase offset is not applied, we can simplify the logic in kvmhv_start_timing and kvmhv_accumulate_time. In addition, we had secondary threads reading the timebase while running concurrently with code on the primary thread which would eventually add or subtract the timebase offset from the timebase. This occurred while saving or restoring the DEC register value on the secondary threads. Although no specific incorrect behaviour has been observed, this is a race which should be fixed. To fix it, we move the DEC saving code to just before we call kvmhv_commence_exit, and the DEC restoring code to after the point where we have waited for the primary thread to switch the MMU context and add the timebase offset. That way we are sure that the timebase contains the guest timebase value in both cases. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-04-20 19:51:11 +07:00
OFFSET(VCORE_TB_OFFSET_APPL, kvmppc_vcore, tb_offset_applied);
OFFSET(VCORE_LPCR, kvmppc_vcore, lpcr);
OFFSET(VCORE_PCR, kvmppc_vcore, pcr);
OFFSET(VCORE_DPDES, kvmppc_vcore, dpdes);
OFFSET(VCORE_VTB, kvmppc_vcore, vtb);
OFFSET(VCPU_SLB_E, kvmppc_slb, orige);
OFFSET(VCPU_SLB_V, kvmppc_slb, origv);
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
DEFINE(VCPU_SLB_SIZE, sizeof(struct kvmppc_slb));
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
OFFSET(VCPU_TFHAR, kvm_vcpu, arch.tfhar);
OFFSET(VCPU_TFIAR, kvm_vcpu, arch.tfiar);
OFFSET(VCPU_TEXASR, kvm_vcpu, arch.texasr);
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 POWER9 has hardware bugs relating to transactional memory and thread reconfiguration (changes to hardware SMT mode). Specifically, the core does not have enough storage to store a complete checkpoint of all the architected state for all four threads. The DD2.2 version of POWER9 includes hardware modifications designed to allow hypervisor software to implement workarounds for these problems. This patch implements those workarounds in KVM code so that KVM guests see a full, working transactional memory implementation. The problems center around the use of TM suspended state, where the CPU has a checkpointed state but execution is not transactional. The workaround is to implement a "fake suspend" state, which looks to the guest like suspended state but the CPU does not store a checkpoint. In this state, any instruction that would cause a transition to transactional state (rfid, rfebb, mtmsrd, tresume) or would use the checkpointed state (treclaim) causes a "soft patch" interrupt (vector 0x1500) to the hypervisor so that it can be emulated. The trechkpt instruction also causes a soft patch interrupt. On POWER9 DD2.2, we avoid returning to the guest in any state which would require a checkpoint to be present. The trechkpt in the guest entry path which would normally create that checkpoint is replaced by either a transition to fake suspend state, if the guest is in suspend state, or a rollback to the pre-transactional state if the guest is in transactional state. Fake suspend state is indicated by a flag in the PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and reads back as 0. On exit from the guest, if the guest is in fake suspend state, we still do the treclaim instruction as we would in real suspend state, in order to get into non-transactional state, but we do not save the resulting register state since there was no checkpoint. Emulation of the instructions that cause a softpatch interrupt is handled in two paths. If the guest is in real suspend mode, we call kvmhv_p9_tm_emulation_early() to handle the cases where the guest is transitioning to transactional state. This is called before we do the treclaim in the guest exit path; because we haven't done treclaim, we can get back to the guest with the transaction still active. If the instruction is a case that kvmhv_p9_tm_emulation_early() doesn't handle, or if the guest is in fake suspend state, then we proceed to do the complete guest exit path and subsequently call kvmhv_p9_tm_emulation() in host context with the MMU on. This handles all the cases including the cases that generate program interrupts (illegal instruction or TM Bad Thing) and facility unavailable interrupts. The emulation is reasonably straightforward and is mostly concerned with checking for exception conditions and updating the state of registers such as MSR and CR0. The treclaim emulation takes care to ensure that the TEXASR register gets updated as if it were the guest treclaim instruction that had done failure recording, not the treclaim done in hypervisor state in the guest exit path. With this, the KVM_CAP_PPC_HTM capability returns true (1) even if transactional memory is not available to host userspace. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 17:32:01 +07:00
OFFSET(VCPU_ORIG_TEXASR, kvm_vcpu, arch.orig_texasr);
OFFSET(VCPU_GPR_TM, kvm_vcpu, arch.gpr_tm);
OFFSET(VCPU_FPRS_TM, kvm_vcpu, arch.fp_tm.fpr);
OFFSET(VCPU_VRS_TM, kvm_vcpu, arch.vr_tm.vr);
OFFSET(VCPU_VRSAVE_TM, kvm_vcpu, arch.vrsave_tm);
OFFSET(VCPU_CR_TM, kvm_vcpu, arch.cr_tm);
OFFSET(VCPU_XER_TM, kvm_vcpu, arch.xer_tm);
OFFSET(VCPU_LR_TM, kvm_vcpu, arch.lr_tm);
OFFSET(VCPU_CTR_TM, kvm_vcpu, arch.ctr_tm);
OFFSET(VCPU_AMR_TM, kvm_vcpu, arch.amr_tm);
OFFSET(VCPU_PPR_TM, kvm_vcpu, arch.ppr_tm);
OFFSET(VCPU_DSCR_TM, kvm_vcpu, arch.dscr_tm);
OFFSET(VCPU_TAR_TM, kvm_vcpu, arch.tar_tm);
#endif
#ifdef CONFIG_PPC_BOOK3S_64
#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
OFFSET(PACA_SVCPU, paca_struct, shadow_vcpu);
# define SVCPU_FIELD(x, f) DEFINE(x, offsetof(struct paca_struct, shadow_vcpu.f))
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
#else
# define SVCPU_FIELD(x, f)
#endif
# define HSTATE_FIELD(x, f) DEFINE(x, offsetof(struct paca_struct, kvm_hstate.f))
#else /* 32-bit */
# define SVCPU_FIELD(x, f) DEFINE(x, offsetof(struct kvmppc_book3s_shadow_vcpu, f))
# define HSTATE_FIELD(x, f) DEFINE(x, offsetof(struct kvmppc_book3s_shadow_vcpu, hstate.f))
#endif
SVCPU_FIELD(SVCPU_CR, cr);
SVCPU_FIELD(SVCPU_XER, xer);
SVCPU_FIELD(SVCPU_CTR, ctr);
SVCPU_FIELD(SVCPU_LR, lr);
SVCPU_FIELD(SVCPU_PC, pc);
SVCPU_FIELD(SVCPU_R0, gpr[0]);
SVCPU_FIELD(SVCPU_R1, gpr[1]);
SVCPU_FIELD(SVCPU_R2, gpr[2]);
SVCPU_FIELD(SVCPU_R3, gpr[3]);
SVCPU_FIELD(SVCPU_R4, gpr[4]);
SVCPU_FIELD(SVCPU_R5, gpr[5]);
SVCPU_FIELD(SVCPU_R6, gpr[6]);
SVCPU_FIELD(SVCPU_R7, gpr[7]);
SVCPU_FIELD(SVCPU_R8, gpr[8]);
SVCPU_FIELD(SVCPU_R9, gpr[9]);
SVCPU_FIELD(SVCPU_R10, gpr[10]);
SVCPU_FIELD(SVCPU_R11, gpr[11]);
SVCPU_FIELD(SVCPU_R12, gpr[12]);
SVCPU_FIELD(SVCPU_R13, gpr[13]);
SVCPU_FIELD(SVCPU_FAULT_DSISR, fault_dsisr);
SVCPU_FIELD(SVCPU_FAULT_DAR, fault_dar);
SVCPU_FIELD(SVCPU_LAST_INST, last_inst);
SVCPU_FIELD(SVCPU_SHADOW_SRR1, shadow_srr1);
#ifdef CONFIG_PPC_BOOK3S_32
SVCPU_FIELD(SVCPU_SR, sr);
#endif
#ifdef CONFIG_PPC64
SVCPU_FIELD(SVCPU_SLB, slb);
SVCPU_FIELD(SVCPU_SLB_MAX, slb_max);
SVCPU_FIELD(SVCPU_SHADOW_FSCR, shadow_fscr);
#endif
HSTATE_FIELD(HSTATE_HOST_R1, host_r1);
HSTATE_FIELD(HSTATE_HOST_R2, host_r2);
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
HSTATE_FIELD(HSTATE_HOST_MSR, host_msr);
HSTATE_FIELD(HSTATE_VMHANDLER, vmhandler);
HSTATE_FIELD(HSTATE_SCRATCH0, scratch0);
HSTATE_FIELD(HSTATE_SCRATCH1, scratch1);
HSTATE_FIELD(HSTATE_SCRATCH2, scratch2);
HSTATE_FIELD(HSTATE_IN_GUEST, in_guest);
KVM: PPC: book3s_pr: Simplify transitions between virtual and real mode This simplifies the way that the book3s_pr makes the transition to real mode when entering the guest. We now call kvmppc_entry_trampoline (renamed from kvmppc_rmcall) in the base kernel using a normal function call instead of doing an indirect call through a pointer in the vcpu. If kvm is a module, the module loader takes care of generating a trampoline as it does for other calls to functions outside the module. kvmppc_entry_trampoline then disables interrupts and jumps to kvmppc_handler_trampoline_enter in real mode using an rfi[d]. That then uses the link register as the address to return to (potentially in module space) when the guest exits. This also simplifies the way that we call the Linux interrupt handler when we exit the guest due to an external, decrementer or performance monitor interrupt. Instead of turning on the MMU, then deciding that we need to call the Linux handler and turning the MMU back off again, we now go straight to the handler at the point where we would turn the MMU on. The handler will then return to the virtual-mode code (potentially in the module). Along the way, this moves the setting and clearing of the HID5 DCBZ32 bit into real-mode interrupts-off code, and also makes sure that we clear the MSR[RI] bit before loading values into SRR0/1. The net result is that we no longer need any code addresses to be stored in vcpu->arch. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 14:41:44 +07:00
HSTATE_FIELD(HSTATE_RESTORE_HID5, restore_hid5);
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per core), whenever a CPU goes idle, we have to pull all the other hardware threads in the core out of the guest, because the H_CEDE hcall is handled in the kernel. This is inefficient. This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall in real mode. When a guest vcpu does an H_CEDE hcall, we now only exit to the kernel if all the other vcpus in the same core are also idle. Otherwise we mark this vcpu as napping, save state that could be lost in nap mode (mainly GPRs and FPRs), and execute the nap instruction. When the thread wakes up, because of a decrementer or external interrupt, we come back in at kvm_start_guest (from the system reset interrupt vector), find the `napping' flag set in the paca, and go to the resume path. This has some other ramifications. First, when starting a core, we now start all the threads, both those that are immediately runnable and those that are idle. This is so that we don't have to pull all the threads out of the guest when an idle thread gets a decrementer interrupt and wants to start running. In fact the idle threads will all start with the H_CEDE hcall returning; being idle they will just do another H_CEDE immediately and go to nap mode. This required some changes to kvmppc_run_core() and kvmppc_run_vcpu(). These functions have been restructured to make them simpler and clearer. We introduce a level of indirection in the wait queue that gets woken when external and decrementer interrupts get generated for a vcpu, so that we can have the 4 vcpus in a vcore using the same wait queue. We need this because the 4 vcpus are being handled by one thread. Secondly, when we need to exit from the guest to the kernel, we now have to generate an IPI for any napping threads, because an HDEC interrupt doesn't wake up a napping thread. Thirdly, we now need to be able to handle virtual external interrupts and decrementer interrupts becoming pending while a thread is napping, and deliver those interrupts to the guest when the thread wakes. This is done in kvmppc_cede_reentry, just before fast_guest_return. Finally, since we are not using the generic kvm_vcpu_block for book3s_hv, and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef from kvm_arch_vcpu_runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 14:42:46 +07:00
HSTATE_FIELD(HSTATE_NAPPING, napping);
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
HSTATE_FIELD(HSTATE_HWTHREAD_REQ, hwthread_req);
HSTATE_FIELD(HSTATE_HWTHREAD_STATE, hwthread_state);
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
HSTATE_FIELD(HSTATE_KVM_VCPU, kvm_vcpu);
KVM: PPC: Allow book3s_hv guests to use SMT processor modes This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00
HSTATE_FIELD(HSTATE_KVM_VCORE, kvm_vcore);
HSTATE_FIELD(HSTATE_XICS_PHYS, xics_phys);
HSTATE_FIELD(HSTATE_XIVE_TIMA_PHYS, xive_tima_phys);
HSTATE_FIELD(HSTATE_XIVE_TIMA_VIRT, xive_tima_virt);
HSTATE_FIELD(HSTATE_SAVED_XIRR, saved_xirr);
HSTATE_FIELD(HSTATE_HOST_IPI, host_ipi);
KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers On a threaded processor such as POWER7, we group VCPUs into virtual cores and arrange that the VCPUs in a virtual core run on the same physical core. Currently we don't enforce any correspondence between virtual thread numbers within a virtual core and physical thread numbers. Physical threads are allocated starting at 0 on a first-come first-served basis to runnable virtual threads (VCPUs). POWER8 implements a new "msgsndp" instruction which guest kernels can use to interrupt other threads in the same core or sub-core. Since the instruction takes the destination physical thread ID as a parameter, it becomes necessary to align the physical thread IDs with the virtual thread IDs, that is, to make sure virtual thread N within a virtual core always runs on physical thread N. This means that it's possible that thread 0, which is where we call __kvmppc_vcore_entry, may end up running some other vcpu than the one whose task called kvmppc_run_core(), or it may end up running no vcpu at all, if for example thread 0 of the virtual core is currently executing in userspace. However, we do need thread 0 to be responsible for switching the MMU -- a previous version of this patch that had other threads switching the MMU was found to be responsible for occasional memory corruption and machine check interrupts in the guest on POWER7 machines. To accommodate this, we no longer pass the vcpu pointer to __kvmppc_vcore_entry, but instead let the assembly code load it from the PACA. Since the assembly code will need to know the kvm pointer and the thread ID for threads which don't have a vcpu, we move the thread ID into the PACA and we add a kvm pointer to the virtual core structure. In the case where thread 0 has no vcpu to run, it still calls into kvmppc_hv_entry in order to do the MMU switch, and then naps until either its vcpu is ready to run in the guest, or some other thread needs to exit the guest. In the latter case, thread 0 jumps to the code that switches the MMU back to the host. This control flow means that now we switch the MMU before loading any guest vcpu state. Similarly, on guest exit we now save all the guest vcpu state before switching the MMU back to the host. This has required substantial code movement, making the diff rather large. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-08 17:25:20 +07:00
HSTATE_FIELD(HSTATE_PTID, ptid);
KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts This patch removes the restriction that a radix host can only run radix guests, allowing us to run HPT (hashed page table) guests as well. This is useful because it provides a way to run old guest kernels that know about POWER8 but not POWER9. Unfortunately, POWER9 currently has a restriction that all threads in a given code must either all be in HPT mode, or all in radix mode. This means that when entering a HPT guest, we have to obtain control of all 4 threads in the core and get them to switch their LPIDR and LPCR registers, even if they are not going to run a guest. On guest exit we also have to get all threads to switch LPIDR and LPCR back to host values. To make this feasible, we require that KVM not be in the "independent threads" mode, and that the CPU cores be in single-threaded mode from the host kernel's perspective (only thread 0 online; threads 1, 2 and 3 offline). That allows us to use the same code as on POWER8 for obtaining control of the secondary threads. To manage the LPCR/LPIDR changes required, we extend the kvm_split_info struct to contain the information needed by the secondary threads. All threads perform a barrier synchronization (where all threads wait for every other thread to reach the synchronization point) on guest entry, both before and after loading LPCR and LPIDR. On guest exit, they all once again perform a barrier synchronization both before and after loading host values into LPCR and LPIDR. Finally, it is also currently necessary to flush the entire TLB every time we enter a HPT guest on a radix host. We do this on thread 0 with a loop of tlbiel instructions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-19 10:11:23 +07:00
HSTATE_FIELD(HSTATE_TID, tid);
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 POWER9 has hardware bugs relating to transactional memory and thread reconfiguration (changes to hardware SMT mode). Specifically, the core does not have enough storage to store a complete checkpoint of all the architected state for all four threads. The DD2.2 version of POWER9 includes hardware modifications designed to allow hypervisor software to implement workarounds for these problems. This patch implements those workarounds in KVM code so that KVM guests see a full, working transactional memory implementation. The problems center around the use of TM suspended state, where the CPU has a checkpointed state but execution is not transactional. The workaround is to implement a "fake suspend" state, which looks to the guest like suspended state but the CPU does not store a checkpoint. In this state, any instruction that would cause a transition to transactional state (rfid, rfebb, mtmsrd, tresume) or would use the checkpointed state (treclaim) causes a "soft patch" interrupt (vector 0x1500) to the hypervisor so that it can be emulated. The trechkpt instruction also causes a soft patch interrupt. On POWER9 DD2.2, we avoid returning to the guest in any state which would require a checkpoint to be present. The trechkpt in the guest entry path which would normally create that checkpoint is replaced by either a transition to fake suspend state, if the guest is in suspend state, or a rollback to the pre-transactional state if the guest is in transactional state. Fake suspend state is indicated by a flag in the PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and reads back as 0. On exit from the guest, if the guest is in fake suspend state, we still do the treclaim instruction as we would in real suspend state, in order to get into non-transactional state, but we do not save the resulting register state since there was no checkpoint. Emulation of the instructions that cause a softpatch interrupt is handled in two paths. If the guest is in real suspend mode, we call kvmhv_p9_tm_emulation_early() to handle the cases where the guest is transitioning to transactional state. This is called before we do the treclaim in the guest exit path; because we haven't done treclaim, we can get back to the guest with the transaction still active. If the instruction is a case that kvmhv_p9_tm_emulation_early() doesn't handle, or if the guest is in fake suspend state, then we proceed to do the complete guest exit path and subsequently call kvmhv_p9_tm_emulation() in host context with the MMU on. This handles all the cases including the cases that generate program interrupts (illegal instruction or TM Bad Thing) and facility unavailable interrupts. The emulation is reasonably straightforward and is mostly concerned with checking for exception conditions and updating the state of registers such as MSR and CR0. The treclaim emulation takes care to ensure that the TEXASR register gets updated as if it were the guest treclaim instruction that had done failure recording, not the treclaim done in hypervisor state in the guest exit path. With this, the KVM_CAP_PPC_HTM capability returns true (1) even if transactional memory is not available to host userspace. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 17:32:01 +07:00
HSTATE_FIELD(HSTATE_FAKE_SUSPEND, fake_suspend);
HSTATE_FIELD(HSTATE_MMCR0, host_mmcr[0]);
HSTATE_FIELD(HSTATE_MMCR1, host_mmcr[1]);
HSTATE_FIELD(HSTATE_MMCRA, host_mmcr[2]);
HSTATE_FIELD(HSTATE_SIAR, host_mmcr[3]);
HSTATE_FIELD(HSTATE_SDAR, host_mmcr[4]);
HSTATE_FIELD(HSTATE_MMCR2, host_mmcr[5]);
HSTATE_FIELD(HSTATE_SIER, host_mmcr[6]);
HSTATE_FIELD(HSTATE_PMC1, host_pmc[0]);
HSTATE_FIELD(HSTATE_PMC2, host_pmc[1]);
HSTATE_FIELD(HSTATE_PMC3, host_pmc[2]);
HSTATE_FIELD(HSTATE_PMC4, host_pmc[3]);
HSTATE_FIELD(HSTATE_PMC5, host_pmc[4]);
HSTATE_FIELD(HSTATE_PMC6, host_pmc[5]);
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
HSTATE_FIELD(HSTATE_PURR, host_purr);
HSTATE_FIELD(HSTATE_SPURR, host_spurr);
HSTATE_FIELD(HSTATE_DSCR, host_dscr);
HSTATE_FIELD(HSTATE_DABR, dabr);
HSTATE_FIELD(HSTATE_DECEXP, dec_expires);
KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8 This builds on the ability to run more than one vcore on a physical core by using the micro-threading (split-core) modes of the POWER8 chip. Previously, only vcores from the same VM could be run together, and (on POWER8) only if they had just one thread per core. With the ability to split the core on guest entry and unsplit it on guest exit, we can run up to 8 vcpu threads from up to 4 different VMs, and we can run multiple vcores with 2 or 4 vcpus per vcore. Dynamic micro-threading is only available if the static configuration of the cores is whole-core mode (unsplit), and only on POWER8. To manage this, we introduce a new kvm_split_mode struct which is shared across all of the subcores in the core, with a pointer in the paca on each thread. In addition we extend the core_info struct to have information on each subcore. When deciding whether to add a vcore to the set already on the core, we now have two possibilities: (a) piggyback the vcore onto an existing subcore, or (b) start a new subcore. Currently, when any vcpu needs to exit the guest and switch to host virtual mode, we interrupt all the threads in all subcores and switch the core back to whole-core mode. It may be possible in future to allow some of the subcores to keep executing in the guest while subcore 0 switches to the host, but that is not implemented in this patch. This adds a module parameter called dynamic_mt_modes which controls which micro-threading (split-core) modes the code will consider, as a bitmap. In other words, if it is 0, no micro-threading mode is considered; if it is 2, only 2-way micro-threading is considered; if it is 4, only 4-way, and if it is 6, both 2-way and 4-way micro-threading mode will be considered. The default is 6. With this, we now have secondary threads which are the primary thread for their subcore and therefore need to do the MMU switch. These threads will need to be started even if they have no vcpu to run, so we use the vcore pointer in the PACA rather than the vcpu pointer to trigger them. It is now possible for thread 0 to find that an exit has been requested before it gets to switch the subcore state to the guest. In that case we haven't added the guest's timebase offset to the timebase, so we need to be careful not to subtract the offset in the guest exit path. In fact we just skip the whole path that switches back to host context, since we haven't switched to the guest context. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2015-07-02 17:38:16 +07:00
HSTATE_FIELD(HSTATE_SPLIT_MODE, kvm_split_mode);
KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per core), whenever a CPU goes idle, we have to pull all the other hardware threads in the core out of the guest, because the H_CEDE hcall is handled in the kernel. This is inefficient. This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall in real mode. When a guest vcpu does an H_CEDE hcall, we now only exit to the kernel if all the other vcpus in the same core are also idle. Otherwise we mark this vcpu as napping, save state that could be lost in nap mode (mainly GPRs and FPRs), and execute the nap instruction. When the thread wakes up, because of a decrementer or external interrupt, we come back in at kvm_start_guest (from the system reset interrupt vector), find the `napping' flag set in the paca, and go to the resume path. This has some other ramifications. First, when starting a core, we now start all the threads, both those that are immediately runnable and those that are idle. This is so that we don't have to pull all the threads out of the guest when an idle thread gets a decrementer interrupt and wants to start running. In fact the idle threads will all start with the H_CEDE hcall returning; being idle they will just do another H_CEDE immediately and go to nap mode. This required some changes to kvmppc_run_core() and kvmppc_run_vcpu(). These functions have been restructured to make them simpler and clearer. We introduce a level of indirection in the wait queue that gets woken when external and decrementer interrupts get generated for a vcpu, so that we can have the 4 vcpus in a vcore using the same wait queue. We need this because the 4 vcpus are being handled by one thread. Secondly, when we need to exit from the guest to the kernel, we now have to generate an IPI for any napping threads, because an HDEC interrupt doesn't wake up a napping thread. Thirdly, we now need to be able to handle virtual external interrupts and decrementer interrupts becoming pending while a thread is napping, and deliver those interrupts to the guest when the thread wakes. This is done in kvmppc_cede_reentry, just before fast_guest_return. Finally, since we are not using the generic kvm_vcpu_block for book3s_hv, and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef from kvm_arch_vcpu_runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-23 14:42:46 +07:00
DEFINE(IPI_PRIORITY, IPI_PRIORITY);
OFFSET(KVM_SPLIT_RPR, kvm_split_mode, rpr);
OFFSET(KVM_SPLIT_PMMAR, kvm_split_mode, pmmar);
OFFSET(KVM_SPLIT_LDBAR, kvm_split_mode, ldbar);
OFFSET(KVM_SPLIT_DO_NAP, kvm_split_mode, do_nap);
OFFSET(KVM_SPLIT_NAPPED, kvm_split_mode, napped);
KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts This patch removes the restriction that a radix host can only run radix guests, allowing us to run HPT (hashed page table) guests as well. This is useful because it provides a way to run old guest kernels that know about POWER8 but not POWER9. Unfortunately, POWER9 currently has a restriction that all threads in a given code must either all be in HPT mode, or all in radix mode. This means that when entering a HPT guest, we have to obtain control of all 4 threads in the core and get them to switch their LPIDR and LPCR registers, even if they are not going to run a guest. On guest exit we also have to get all threads to switch LPIDR and LPCR back to host values. To make this feasible, we require that KVM not be in the "independent threads" mode, and that the CPU cores be in single-threaded mode from the host kernel's perspective (only thread 0 online; threads 1, 2 and 3 offline). That allows us to use the same code as on POWER8 for obtaining control of the secondary threads. To manage the LPCR/LPIDR changes required, we extend the kvm_split_info struct to contain the information needed by the secondary threads. All threads perform a barrier synchronization (where all threads wait for every other thread to reach the synchronization point) on guest entry, both before and after loading LPCR and LPIDR. On guest exit, they all once again perform a barrier synchronization both before and after loading host values into LPCR and LPIDR. Finally, it is also currently necessary to flush the entire TLB every time we enter a HPT guest on a radix host. We do this on thread 0 with a loop of tlbiel instructions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-19 10:11:23 +07:00
OFFSET(KVM_SPLIT_DO_SET, kvm_split_mode, do_set);
OFFSET(KVM_SPLIT_DO_RESTORE, kvm_split_mode, do_restore);
#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
#ifdef CONFIG_PPC_BOOK3S_64
HSTATE_FIELD(HSTATE_CFAR, cfar);
HSTATE_FIELD(HSTATE_PPR, ppr);
HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
#endif /* CONFIG_PPC_BOOK3S_64 */
#else /* CONFIG_PPC_BOOK3S */
OFFSET(VCPU_CR, kvm_vcpu, arch.cr);
OFFSET(VCPU_XER, kvm_vcpu, arch.regs.xer);
OFFSET(VCPU_LR, kvm_vcpu, arch.regs.link);
OFFSET(VCPU_CTR, kvm_vcpu, arch.regs.ctr);
OFFSET(VCPU_PC, kvm_vcpu, arch.regs.nip);
OFFSET(VCPU_SPRG9, kvm_vcpu, arch.sprg9);
OFFSET(VCPU_LAST_INST, kvm_vcpu, arch.last_inst);
OFFSET(VCPU_FAULT_DEAR, kvm_vcpu, arch.fault_dear);
OFFSET(VCPU_FAULT_ESR, kvm_vcpu, arch.fault_esr);
OFFSET(VCPU_CRIT_SAVE, kvm_vcpu, arch.crit_save);
#endif /* CONFIG_PPC_BOOK3S */
#endif /* CONFIG_KVM */
#ifdef CONFIG_KVM_GUEST
OFFSET(KVM_MAGIC_SCRATCH1, kvm_vcpu_arch_shared, scratch1);
OFFSET(KVM_MAGIC_SCRATCH2, kvm_vcpu_arch_shared, scratch2);
OFFSET(KVM_MAGIC_SCRATCH3, kvm_vcpu_arch_shared, scratch3);
OFFSET(KVM_MAGIC_INT, kvm_vcpu_arch_shared, int_pending);
OFFSET(KVM_MAGIC_MSR, kvm_vcpu_arch_shared, msr);
OFFSET(KVM_MAGIC_CRITICAL, kvm_vcpu_arch_shared, critical);
OFFSET(KVM_MAGIC_SR, kvm_vcpu_arch_shared, sr);
#endif
#ifdef CONFIG_44x
DEFINE(PGD_T_LOG2, PGD_T_LOG2);
DEFINE(PTE_T_LOG2, PTE_T_LOG2);
#endif
#ifdef CONFIG_PPC_FSL_BOOK3E
DEFINE(TLBCAM_SIZE, sizeof(struct tlbcam));
OFFSET(TLBCAM_MAS0, tlbcam, MAS0);
OFFSET(TLBCAM_MAS1, tlbcam, MAS1);
OFFSET(TLBCAM_MAS2, tlbcam, MAS2);
OFFSET(TLBCAM_MAS3, tlbcam, MAS3);
OFFSET(TLBCAM_MAS7, tlbcam, MAS7);
#endif
#if defined(CONFIG_KVM) && defined(CONFIG_SPE)
OFFSET(VCPU_EVR, kvm_vcpu, arch.evr[0]);
OFFSET(VCPU_ACC, kvm_vcpu, arch.acc);
OFFSET(VCPU_SPEFSCR, kvm_vcpu, arch.spefscr);
OFFSET(VCPU_HOST_SPEFSCR, kvm_vcpu, arch.host_spefscr);
#endif
#ifdef CONFIG_KVM_BOOKE_HV
OFFSET(VCPU_HOST_MAS4, kvm_vcpu, arch.host_mas4);
OFFSET(VCPU_HOST_MAS6, kvm_vcpu, arch.host_mas6);
#endif
#ifdef CONFIG_KVM_XICS
DEFINE(VCPU_XIVE_SAVED_STATE, offsetof(struct kvm_vcpu,
arch.xive_saved_state));
DEFINE(VCPU_XIVE_CAM_WORD, offsetof(struct kvm_vcpu,
arch.xive_cam_word));
DEFINE(VCPU_XIVE_PUSHED, offsetof(struct kvm_vcpu, arch.xive_pushed));
DEFINE(VCPU_XIVE_ESC_ON, offsetof(struct kvm_vcpu, arch.xive_esc_on));
DEFINE(VCPU_XIVE_ESC_RADDR, offsetof(struct kvm_vcpu, arch.xive_esc_raddr));
DEFINE(VCPU_XIVE_ESC_VADDR, offsetof(struct kvm_vcpu, arch.xive_esc_vaddr));
#endif
#ifdef CONFIG_KVM_EXIT_TIMING
OFFSET(VCPU_TIMING_EXIT_TBU, kvm_vcpu, arch.timing_exit.tv32.tbu);
OFFSET(VCPU_TIMING_EXIT_TBL, kvm_vcpu, arch.timing_exit.tv32.tbl);
OFFSET(VCPU_TIMING_LAST_ENTER_TBU, kvm_vcpu, arch.timing_last_enter.tv32.tbu);
OFFSET(VCPU_TIMING_LAST_ENTER_TBL, kvm_vcpu, arch.timing_last_enter.tv32.tbl);
#endif
#ifdef CONFIG_PPC_POWERNV
OFFSET(PACA_CORE_IDLE_STATE_PTR, paca_struct, core_idle_state_ptr);
OFFSET(PACA_THREAD_IDLE_STATE, paca_struct, thread_idle_state);
OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
powerpc/powernv: Provide a way to force a core into SMT4 mode POWER9 processors up to and including "Nimbus" v2.2 have hardware bugs relating to transactional memory and thread reconfiguration. One of these bugs has a workaround which is to get the core into SMT4 state temporarily. This workaround is only needed when running bare-metal. This patch provides a function which gets the core into SMT4 mode by preventing threads from going to a stop state, and waking up those which are already in a stop state. Once at least 3 threads are not in a stop state, the core will be in SMT4 and we can continue. To do this, we add a "dont_stop" flag to the paca to tell the thread not to go into a stop state. If this flag is set, power9_idle_stop() just returns immediately with a return value of 0. The pnv_power9_force_smt4_catch() function does the following: 1. Set the dont_stop flag for each thread in the core, except ourselves (in fact we use an atomic_inc() in case more than one thread is calling this function concurrently). 2. See how many threads are awake, indicated by their requested_psscr field in the paca being 0. If this is at least 3, skip to step 5. 3. Send a doorbell interrupt to each thread that was seen as being in a stop state in step 2. 4. Until at least 3 threads are awake, scan the threads to which we sent a doorbell interrupt and check if they are awake now. This relies on the following properties: - Once dont_stop is non-zero, requested_psccr can't go from zero to non-zero, except transiently (and without the thread doing stop). - requested_psscr being zero guarantees that the thread isn't in a state-losing stop state where thread reconfiguration could occur. - Doing stop with a PSSCR value of 0 won't be a state-losing stop and thus won't allow thread reconfiguration. - Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core must be in SMT4 mode, since SMT modes are powers of 2. This does add a sync to power9_idle_stop(), which is necessary to provide the correct ordering between setting requested_psscr and checking dont_stop. The overhead of the sync should be unnoticeable compared to the latency of going into and out of a stop state. Because some objected to incurring this extra latency on systems where the XER[SO] bug is not relevant, I have put the test in power9_idle_stop inside a feature section. This means that pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will probably hang the system. In order to cater for uses where the caller has an operation that has to be done while the core is in SMT4, the core continues to be kept in SMT4 after pnv_power9_force_smt4_catch() function returns, until the pnv_power9_force_smt4_release() function is called. It undoes the effect of step 1 above and allows the other threads to go into a stop state. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 17:32:00 +07:00
OFFSET(PACA_DONT_STOP, paca_struct, dont_stop);
#define STOP_SPR(x, f) OFFSET(x, paca_struct, stop_sprs.f)
STOP_SPR(STOP_PID, pid);
STOP_SPR(STOP_LDBAR, ldbar);
STOP_SPR(STOP_FSCR, fscr);
STOP_SPR(STOP_HFSCR, hfscr);
STOP_SPR(STOP_MMCR1, mmcr1);
STOP_SPR(STOP_MMCR2, mmcr2);
STOP_SPR(STOP_MMCRA, mmcra);
#endif
DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
DEFINE(PPC_DBELL_MSGTYPE, PPC_DBELL_MSGTYPE);
#ifdef CONFIG_PPC_8xx
DEFINE(VIRT_IMMR_BASE, (u64)__fix_to_virt(FIX_IMMR_BASE));
#endif
return 0;
}