2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2005-11-09 09:38:01 +07:00
|
|
|
* This control block defines the PACA which defines the processor
|
|
|
|
* specific data for each logical processor on the system.
|
2005-04-17 05:20:36 +07:00
|
|
|
* There are some pointers defined that are utilized by PLIC.
|
|
|
|
*
|
|
|
|
* C 2001 PPC 64 Team, IBM Corp
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public License
|
|
|
|
* as published by the Free Software Foundation; either version
|
|
|
|
* 2 of the License, or (at your option) any later version.
|
2005-11-09 09:38:01 +07:00
|
|
|
*/
|
|
|
|
#ifndef _ASM_POWERPC_PACA_H
|
|
|
|
#define _ASM_POWERPC_PACA_H
|
2005-12-17 04:43:46 +07:00
|
|
|
#ifdef __KERNEL__
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-01-28 20:23:22 +07:00
|
|
|
#ifdef CONFIG_PPC64
|
|
|
|
|
2015-12-11 05:34:42 +07:00
|
|
|
#include <linux/string.h>
|
2009-07-24 06:15:42 +07:00
|
|
|
#include <asm/types.h>
|
|
|
|
#include <asm/lppaca.h>
|
|
|
|
#include <asm/mmu.h>
|
|
|
|
#include <asm/page.h>
|
2017-05-21 20:15:46 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3E
|
2009-07-24 06:15:42 +07:00
|
|
|
#include <asm/exception-64e.h>
|
2017-05-21 20:15:46 +07:00
|
|
|
#else
|
|
|
|
#include <asm/exception-64s.h>
|
|
|
|
#endif
|
2010-01-08 08:58:03 +07:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
|
2010-04-16 05:11:32 +07:00
|
|
|
#include <asm/kvm_book3s_asm.h>
|
2010-01-08 08:58:03 +07:00
|
|
|
#endif
|
2016-05-17 13:33:46 +07:00
|
|
|
#include <asm/accounting.h>
|
KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt
When a guest is assigned to a core it converts the host Timebase (TB)
into guest TB by adding guest timebase offset before entering into
guest. During guest exit it restores the guest TB to host TB. This means
under certain conditions (Guest migration) host TB and guest TB can differ.
When we get an HMI for TB related issues the opal HMI handler would
try fixing errors and restore the correct host TB value. With no guest
running, we don't have any issues. But with guest running on the core
we run into TB corruption issues.
If we get an HMI while in the guest, the current HMI handler invokes opal
hmi handler before forcing guest to exit. The guest exit path subtracts
the guest TB offset from the current TB value which may have already
been restored with host value by opal hmi handler. This leads to incorrect
host and guest TB values.
With split-core, things become more complex. With split-core, TB also gets
split and each subcore gets its own TB register. When a hmi handler fixes
a TB error and restores the TB value, it affects all the TB values of
sibling subcores on the same core. On TB errors, all the threads in the core
get an HMI. With the existing code, the individual threads call the opal hmi
handler independently, which can easily throw TB out of sync if we have guests
running on subcores. Hence we need to coordinate with all the
threads before making the opal hmi handler call followed by TB resync.
This patch introduces a sibling subcore state structure (shared by all
threads in the core) in paca which holds information about whether sibling
subcores are in Guest mode or host mode. An array in_guest[] of size
MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore.
The subcore id is used as index into in_guest[] array. Only primary
thread entering/exiting the guest is responsible to set/unset its
designated array element.
On TB error, we get HMI interrupt on every thread on the core. Upon HMI,
this patch will now force guest to vacate the core/subcore. Primary
thread from each subcore will then turn off its respective bit
from the above bitmap during the guest exit path just after the
guest->host partition switch is complete.
All other threads that have just exited the guest or were already in the host
will wait until all other subcores clear their respective bit.
Once all the subcores have turned off their respective bit, all threads will
then call the opal hmi handler.
It is not necessary that the opal hmi handler resyncs the TB value for
every HMI. It does so only for HMIs caused by
TB errors. For the rest, it does not touch the TB value. Hence to make things
simpler, primary thread would call TB resync explicitly once for each
core immediately after opal hmi handler instead of subtracting guest
offset from TB. TB resync call will restore the TB with host value.
Thus we can be sure about the TB state.
One of the primary threads exiting the guest will take up the
responsibility of calling TB resync. It will use one of the top bits
(bit 63) from subcore state flags bitmap to make the decision. The first
primary thread (among the subcores) that is able to set the bit will
have to call the TB resync. All other threads will wait until TB
resync is complete. Once TB resync is complete all threads will then
proceed.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
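The coordination described in this commit message can be sketched in user-space C. This is only an illustration under stated assumptions: the struct and function names are hypothetical, C11 atomics stand in for whatever primitives the kernel actually uses, and only the array size (4) and the use of bit 63 come from the text above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SUBCORE_PER_CORE 4   /* size given in the commit message */
#define TB_RESYNC_REQ_BIT    63  /* "one of the top bits (bit 63)" */

/* Hypothetical stand-in for the per-core state shared by all threads. */
struct subcore_state_sketch {
    _Atomic uint64_t flags;
    _Atomic uint8_t  in_guest[MAX_SUBCORE_PER_CORE];
};

/* Called by the primary thread of a subcore when it enters the guest. */
static inline void subcore_enter_guest(struct subcore_state_sketch *s,
                                       unsigned int subcore)
{
    assert(subcore < MAX_SUBCORE_PER_CORE);
    atomic_store(&s->in_guest[subcore], 1);
}

/* Called by the primary thread once the guest->host switch is complete. */
static inline void subcore_exit_guest(struct subcore_state_sketch *s,
                                      unsigned int subcore)
{
    assert(subcore < MAX_SUBCORE_PER_CORE);
    atomic_store(&s->in_guest[subcore], 0);
}

/* Threads wait on this condition before calling the HMI handler. */
static inline bool all_subcores_in_host(struct subcore_state_sketch *s)
{
    for (unsigned int i = 0; i < MAX_SUBCORE_PER_CORE; i++)
        if (atomic_load(&s->in_guest[i]))
            return false;
    return true;
}

/* Returns true only for the first caller, which then owns the TB resync. */
static inline bool try_claim_tb_resync(struct subcore_state_sketch *s)
{
    const uint64_t bit = UINT64_C(1) << TB_RESYNC_REQ_BIT;
    return (atomic_fetch_or(&s->flags, bit) & bit) == 0;
}

/* The owner clears the bit when the resync is done; waiters then proceed. */
static inline void tb_resync_done(struct subcore_state_sketch *s)
{
    atomic_fetch_and(&s->flags, ~(UINT64_C(1) << TB_RESYNC_REQ_BIT));
}
```

An atomic fetch-or makes the "first primary thread that is able to set the bit" rule race-free: exactly one caller observes the bit as previously clear.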
2016-05-15 11:14:26 +07:00
|
|
|
#include <asm/hmi.h>
|
powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle
The stop4 idle state on POWER9 is a deep idle state which loses
hypervisor resources, but whose latency is low enough that it can be
exposed via cpuidle.
Until now, the deep idle states which lose hypervisor resources (eg:
winkle) were only exposed via CPU-Hotplug. Hence currently on wakeup
from such states, barring a few SPRs which need to be restored to
their older value, the rest of the SPRs are reinitialized to their values
corresponding to that at boot time.
When stop4 is used in the context of cpuidle, we want these additional
SPRs to be restored to their older value, to ensure that the context
on the CPU coming back from idle is same as it was before going idle.
In this patch, we define a SPR save area in PACA (since we have used
up the volatile register space in the stack) and on POWER9, we restore
SPRN_PID, SPRN_LDBAR, SPRN_FSCR, SPRN_HFSCR, SPRN_MMCRA, SPRN_MMCR1,
SPRN_MMCR2 to the values they had before entering stop.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-21 17:41:37 +07:00
|
|
|
#include <asm/cpuidle.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
register struct paca_struct *local_paca asm("r13");
|
2006-11-01 01:44:54 +07:00
|
|
|
|
|
|
|
#if defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_SMP)
|
|
|
|
extern unsigned int debug_smp_processor_id(void); /* from linux/smp.h */
|
|
|
|
/*
|
|
|
|
* Add standard checks that preemption cannot occur when using get_paca():
|
|
|
|
* otherwise the paca_struct it points to may be the wrong one just after.
|
|
|
|
*/
|
|
|
|
#define get_paca() ((void) debug_smp_processor_id(), local_paca)
|
|
|
|
#else
|
2005-04-17 05:20:36 +07:00
|
|
|
#define get_paca() local_paca
|
2006-11-01 01:44:54 +07:00
|
|
|
#endif
|
|
|
|
|
2006-01-13 06:26:42 +07:00
|
|
|
#define get_lppaca() (get_paca()->lppaca_ptr)
|
2006-08-07 13:19:19 +07:00
|
|
|
#define get_slb_shadow() (get_paca()->slb_shadow_ptr)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
struct task_struct;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Defines the layout of the paca.
|
|
|
|
*
|
|
|
|
* This structure is not directly accessed by firmware or the service
|
2008-04-10 13:43:47 +07:00
|
|
|
* processor.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
struct paca_struct {
|
2009-06-03 04:17:41 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Because hw_cpu_id, unlike other paca fields, is accessed
|
|
|
|
* routinely from other CPUs (from the IRQ code), we stick to
|
|
|
|
* read-only (after boot) fields in the first cacheline to
|
|
|
|
* avoid cacheline bouncing.
|
|
|
|
*/
|
|
|
|
|
|
|
|
struct lppaca *lppaca_ptr; /* Pointer to LpPaca for PLIC */
|
2009-06-03 04:17:41 +07:00
|
|
|
#endif /* CONFIG_PPC_BOOK3S */
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2006-01-23 23:58:20 +07:00
|
|
|
* MAGIC: the spinlock functions in arch/powerpc/lib/locks.c
|
2005-04-17 05:20:36 +07:00
|
|
|
* load lock_token and paca_index with a single lwz
|
|
|
|
* instruction. They must travel together and be properly
|
|
|
|
* aligned.
|
|
|
|
*/
|
2013-08-06 23:01:51 +07:00
|
|
|
#ifdef __BIG_ENDIAN__
|
2005-04-17 05:20:36 +07:00
|
|
|
u16 lock_token; /* Constant 0x8000, used in locks */
|
|
|
|
u16 paca_index; /* Logical processor number */
|
2013-08-06 23:01:51 +07:00
|
|
|
#else
|
|
|
|
u16 paca_index; /* Logical processor number */
|
|
|
|
u16 lock_token; /* Constant 0x8000, used in locks */
|
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
u64 kernel_toc; /* Kernel TOC address */
|
2008-08-30 08:40:24 +07:00
|
|
|
u64 kernelbase; /* Base address of kernel */
|
|
|
|
u64 kernel_msr; /* MSR while running in kernel */
|
2005-04-17 05:20:36 +07:00
|
|
|
void *emergency_sp; /* pointer to emergency stack */
|
[PATCH] powerpc/64: per cpu data optimisations
The current ppc64 per cpu data implementation is quite slow. eg:
lhz 11,18(13) /* smp_processor_id() */
ld 9,.LC63-.LCTOC1(30) /* per_cpu__variable_name */
ld 8,.LC61-.LCTOC1(30) /* __per_cpu_offset */
sldi 11,11,3 /* form index into __per_cpu_offset */
mr 10,9
ldx 9,11,8 /* __per_cpu_offset[smp_processor_id()] */
ldx 0,10,9 /* load per cpu data */
5 loads for something that is supposed to be fast, pretty awful. One
reason for the large number of loads is that we have to synthesize 2
64bit constants (per_cpu__variable_name and __per_cpu_offset).
By putting __per_cpu_offset into the paca we can avoid the 2 loads
associated with it:
ld 11,56(13) /* paca->data_offset */
ld 9,.LC59-.LCTOC1(30) /* per_cpu__variable_name */
ldx 0,9,11 /* load per cpu data */
Longer term we should be able to do even better than 3 loads.
If per_cpu__variable_name wasn't a 64bit constant and paca->data_offset
was in a register we could cut it down to one load. A suggestion from
Rusty is to use gcc's __thread extension here. In order to do this we
would need to free up r13 (the __thread register and where the paca
currently is). So far I've had a few unsuccessful attempts at doing that :)
The patch also allocates per cpu memory node local on NUMA machines.
This patch from Rusty has been sitting in my queue _forever_ but stalled
when I hit the compiler bug. Sorry about that.
Finally I also only allocate per cpu data for possible cpus, which comes
straight out of the x86-64 port. On a pseries kernel (with NR_CPUS == 128)
and 4 possible cpus we see some nice gains:
             total       used       free     shared    buffers     cached
Mem:       4012228     212860    3799368          0          0     162424
             total       used       free     shared    buffers     cached
Mem:       4016200     212984    3803216          0          0     162424
A saving of 3.75MB. Quite nice for smaller machines. Note: we now have
to be careful of per cpu users that touch data for !possible cpus.
At this stage it might be worth making the NUMA and possible cpu
optimisations generic, but per cpu init is done so early we have to be
careful that all architectures have their possible map setup correctly.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
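The address arithmetic behind the shorter sequence can be sketched in plain C. This is a minimal user-space illustration, not the kernel's per-cpu macros: `paca_sketch` and `this_cpu_ptr_sketch` are hypothetical names, and only the `data_offset` field comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down paca holding only the per-cpu offset. */
struct paca_sketch {
    uint64_t data_offset;   /* byte distance from template to this CPU's copy */
};

/*
 * Address of this CPU's copy of a per-cpu variable: the address of the
 * variable's template plus the offset cached in the paca. With the paca
 * in a register, this is one load of data_offset plus one add.
 */
static inline void *this_cpu_ptr_sketch(const struct paca_sketch *paca,
                                        void *var_template)
{
    assert(paca != NULL);
    return (char *)var_template + paca->data_offset;
}
```

The point of the design is that the offset lookup no longer needs `smp_processor_id()` or an index into a global `__per_cpu_offset` table.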
2006-01-11 09:16:44 +07:00
|
|
|
u64 data_offset; /* per cpu data offset */
|
2005-04-17 05:20:36 +07:00
|
|
|
s16 hw_cpu_id; /* Physical processor number */
|
|
|
|
u8 cpu_start; /* At startup, processor spins until */
|
|
|
|
/* this becomes non-zero. */
|
2010-05-14 02:40:11 +07:00
|
|
|
u8 kexec_state; /* set when kexec down has irqs off */
|
2009-06-03 04:17:41 +07:00
|
|
|
#ifdef CONFIG_PPC_STD_MMU_64
|
2007-03-16 13:47:07 +07:00
|
|
|
struct slb_shadow *slb_shadow_ptr;
|
powerpc: Account time using timebase rather than PURR
Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
PURR register for measuring the user and system time used by
processes, as well as other related times such as hardirq and
softirq times. This turns out to be quite confusing for users
because it means that a program will often be measured as taking
less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
than it does when run on a single-threaded processor (ST mode), even
though the program takes longer to finish. The discrepancy is
accounted for as stolen time, which is also confusing, particularly
when there are no other partitions running.
This changes the accounting to use the timebase instead, meaning that
the reported user and system times are the actual number of real-time
seconds that the program was executing on the processor thread,
regardless of which SMT mode the processor is in. Thus a program will
generally show greater user and system times when run on a
multi-threaded processor than on a single-threaded processor.
On pSeries systems on POWER5 or later processors, we measure the
stolen time (time when this partition wasn't running) using the
hypervisor dispatch trace log. We check for new entries in the
log on every entry from user mode and on every transition from
kernel process context to soft or hard IRQ context (i.e. when
account_system_vtime() gets called). So that we can correctly
distinguish time stolen from user time and time stolen from system
time, without having to check the log on every exit to user mode,
we store separate timestamps for exit to user mode and entry from
user mode.
On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
in account_system_vtime() (as before), and then apportion the SPURR
ticks since the last time we read it between scaled user time and
scaled system time according to the relative proportions of user
time and system time over the same interval. This avoids having to
read the SPURR on every kernel entry and exit. On systems that have
PURR but not SPURR (i.e., POWER5), we do the same using the PURR
rather than the SPURR.
This disables the DTL user interface in /sys/kernel/debug/powerpc/dtl
for now since it conflicts with the use of the dispatch trace log
by the time accounting code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
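The SPURR apportioning step described above can be sketched as a small C function. The names are hypothetical and the handling of an empty interval is an arbitrary choice made for this sketch; the commit message only states that the SPURR ticks are split in proportion to user and system time over the same interval.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: scaled user and system time, in SPURR ticks. */
struct scaled_split_sketch {
    uint64_t user;
    uint64_t system;
};

/*
 * Split the SPURR ticks accumulated since the last read between user and
 * system time, in proportion to the timebase ticks (utime, stime) spent in
 * each mode over the same interval.
 */
static inline struct scaled_split_sketch
apportion_spurr_sketch(uint64_t spurr_delta, uint64_t utime, uint64_t stime)
{
    struct scaled_split_sketch out = { 0, 0 };
    uint64_t total = utime + stime;

    if (total == 0) {
        /* Assumption of this sketch: charge an empty interval to system. */
        out.system = spurr_delta;
        return out;
    }
    /* Assumption of this sketch: the product fits in 64 bits. */
    assert(utime == 0 || spurr_delta <= UINT64_MAX / utime);
    out.user = spurr_delta * utime / total;
    out.system = spurr_delta - out.user;   /* remainder, so nothing is lost */
    return out;
}
```

Computing the system share as the remainder keeps the two parts summing exactly to the SPURR delta despite integer rounding.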
2010-08-27 02:56:43 +07:00
|
|
|
struct dtl_entry *dispatch_log;
|
|
|
|
struct dtl_entry *dispatch_log_end;
|
2014-05-21 13:32:38 +07:00
|
|
|
#endif /* CONFIG_PPC_STD_MMU_64 */
|
|
|
|
u64 dscr_default; /* per-CPU default DSCR */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-05-21 13:32:38 +07:00
|
|
|
#ifdef CONFIG_PPC_STD_MMU_64
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Now, starting in cacheline 2, the exception save areas
|
|
|
|
*/
|
2005-11-07 07:06:55 +07:00
|
|
|
/* used for most interrupts/exceptions */
|
2017-05-21 20:15:46 +07:00
|
|
|
u64 exgen[EX_SIZE] __attribute__((aligned(0x80)));
|
|
|
|
u64 exslb[EX_SIZE]; /* used for SLB/segment table misses
|
2005-11-07 07:06:55 +07:00
|
|
|
* on the linear mapping */
|
2009-06-03 04:17:41 +07:00
|
|
|
/* SLB related definitions */
|
2006-06-15 07:45:18 +07:00
|
|
|
u16 vmalloc_sllp;
|
2005-04-17 05:20:36 +07:00
|
|
|
u16 slb_cache_ptr;
|
2012-09-10 09:52:54 +07:00
|
|
|
u32 slb_cache[SLB_CACHE_ENTRIES];
|
2009-06-03 04:17:41 +07:00
|
|
|
#endif /* CONFIG_PPC_STD_MMU_64 */
|
|
|
|
|
2009-07-24 06:15:42 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3E
|
powerpc: book3e_64: fix the align size for paca_struct
The cache line size of all current book3e 64-bit SoCs is 64 bytes.
So we should use this size to align the members of paca_struct.
This only changes the paca_struct members which are private to book3e
CPUs, and should not have any effect on book3s ones. With this, we save
192 bytes. Also change it to __aligned(size) since it is preferred over
__attribute__((aligned(size))).
Before:
/* size: 1920, cachelines: 30, members: 46 */
/* sum members: 1667, holes: 6, sum holes: 141 */
/* padding: 112 */
After:
/* size: 1728, cachelines: 27, members: 46 */
/* sum members: 1667, holes: 4, sum holes: 13 */
/* padding: 48 */
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
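Why a smaller alignment shrinks the structure can be seen in a toy example. The two structs below are hypothetical and much smaller than paca_struct; they only reuse the member names `exgen` and `pgd` and the GCC/Clang `aligned` attribute that appears in this file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy layout with members aligned to 128 bytes. */
struct ex_128_sketch {
    uint8_t  kexec_state;
    uint64_t exgen[8] __attribute__((aligned(0x80)));
    void    *pgd      __attribute__((aligned(0x80)));
};

/* Same members aligned to 64 bytes, the book3e cache line size. */
struct ex_64_sketch {
    uint8_t  kexec_state;
    uint64_t exgen[8] __attribute__((aligned(0x40)));
    void    *pgd      __attribute__((aligned(0x40)));
};

/* exgen starts at the first aligned boundary after kexec_state. */
static_assert(offsetof(struct ex_128_sketch, exgen) == 0x80, "128-byte hole");
static_assert(offsetof(struct ex_64_sketch, exgen) == 0x40, "64-byte hole");

/* pgd follows the 64-byte exgen array, rounded up to its own alignment. */
static_assert(offsetof(struct ex_128_sketch, pgd) == 0x100, "second hole");
static_assert(offsetof(struct ex_64_sketch, pgd) == 0x80, "no second hole");
```

In this toy case the 64-byte variant is half the size, because both the holes between members and the tail padding scale with the requested alignment.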
2015-03-10 19:41:31 +07:00
|
|
|
u64 exgen[8] __aligned(0x40);
|
2011-06-22 18:25:42 +07:00
|
|
|
/* Keep pgd in the same cacheline as the start of extlb */
|
2015-03-10 19:41:31 +07:00
|
|
|
pgd_t *pgd __aligned(0x40); /* Current PGD */
|
2011-06-22 18:25:42 +07:00
|
|
|
pgd_t *kernel_pgd; /* Kernel PGD */
|
2013-10-12 07:22:38 +07:00
|
|
|
|
|
|
|
/* Shared by all threads of a core -- points to tcd of first thread */
|
|
|
|
struct tlb_core_data *tcd_ptr;
|
|
|
|
|
2014-03-11 05:29:38 +07:00
|
|
|
/*
|
|
|
|
* We can have up to 3 levels of reentrancy in the TLB miss handler,
|
|
|
|
* in each of four exception levels (normal, crit, mcheck, debug).
|
|
|
|
*/
|
|
|
|
u64 extlb[12][EX_TLB_SIZE / sizeof(u64)];
|
2009-07-24 06:15:42 +07:00
|
|
|
u64 exmc[8]; /* used for machine checks */
|
|
|
|
u64 excrit[8]; /* used for crit interrupts */
|
|
|
|
u64 exdbg[8]; /* used for debug interrupts */
|
|
|
|
|
|
|
|
/* Kernel stack pointers for use by special exceptions */
|
|
|
|
void *mc_kstack;
|
|
|
|
void *crit_kstack;
|
|
|
|
void *dbg_kstack;
|
2013-10-12 07:22:38 +07:00
|
|
|
|
|
|
|
struct tlb_core_data tcd;
|
2009-07-24 06:15:42 +07:00
|
|
|
#endif /* CONFIG_PPC_BOOK3E */
|
|
|
|
|
2015-10-28 11:54:06 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S
|
2015-12-11 05:34:42 +07:00
|
|
|
mm_context_id_t mm_ctx_id;
|
|
|
|
#ifdef CONFIG_PPC_MM_SLICES
|
|
|
|
u64 mm_ctx_low_slices_psize;
|
|
|
|
unsigned char mm_ctx_high_slices_psize[SLICE_ARRAY_SIZE];
|
2017-03-22 10:36:59 +07:00
|
|
|
unsigned long addr_limit;
|
2015-12-11 05:34:42 +07:00
|
|
|
#else
|
2016-01-09 04:25:01 +07:00
|
|
|
u16 mm_ctx_user_psize;
|
2015-12-11 05:34:42 +07:00
|
|
|
u16 mm_ctx_sllp;
|
|
|
|
#endif
|
2015-10-28 11:54:06 +07:00
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* then miscellaneous read-write fields
|
|
|
|
*/
|
|
|
|
struct task_struct *__current; /* Pointer to current */
|
|
|
|
u64 kstack; /* Saved Kernel stack addr */
|
|
|
|
u64 stab_rr; /* stab/slb round-robin counter */
|
2011-01-24 14:42:41 +07:00
|
|
|
u64 saved_r1; /* r1 save for RTAS calls or PM */
|
2005-04-17 05:20:36 +07:00
|
|
|
u64 saved_msr; /* MSR saved here by enter_rtas */
|
2007-04-23 22:11:55 +07:00
|
|
|
u16 trap_save; /* Used when bad stack is encountered */
|
[POWERPC] Lazy interrupt disabling for 64-bit machines
This implements a lazy strategy for disabling interrupts. This means
that local_irq_disable() et al. just clear the 'interrupts are
enabled' flag in the paca. If an interrupt comes along, the interrupt
entry code notices that interrupts are supposed to be disabled, and
clears the EE bit in SRR1, clears the 'interrupts are hard-enabled'
flag in the paca, and returns. This means that interrupts only
actually get disabled in the processor when an interrupt comes along.
When interrupts are enabled by local_irq_enable() et al., the code
sets the interrupts-enabled flag in the paca, and then checks whether
interrupts got hard-disabled. If so, it also sets the EE bit in the
MSR to hard-enable the interrupts.
This has the potential to improve performance, and also makes it
easier to make a kernel that can boot on iSeries and on other 64-bit
machines, since this lazy-disable strategy is very similar to the
soft-disable strategy that iSeries already uses.
This version renames paca->proc_enabled to paca->soft_enabled, and
changes a couple of soft-disables in the kexec code to hard-disables,
which should fix the crash that Michael Ellerman saw. This doesn't
yet use a reserved CR field for the soft_enabled and hard_enabled
flags. This applies on top of Stephen Rothwell's patches to make it
possible to build a combined iSeries/other kernel.
Signed-off-by: Paul Mackerras <paulus@samba.org>
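The lazy scheme from this commit message can be modelled as a tiny state machine. This is a user-space sketch with hypothetical names; `msr_ee` stands in for the real MSR[EE] bit, and delivery of a still-pending interrupt after re-enabling is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU's interrupt-enable state. */
struct lazy_cpu_sketch {
    bool soft_enabled;  /* 'interrupts are enabled' flag in the paca */
    bool hard_enabled;  /* 'interrupts are hard-enabled' flag in the paca */
    bool msr_ee;        /* stand-in for the processor's EE bit */
    int  handled;       /* interrupts that actually ran a handler */
};

/* local_irq_disable(): only the soft flag changes, EE is left alone. */
static inline void local_irq_disable_sketch(struct lazy_cpu_sketch *c)
{
    c->soft_enabled = false;
}

/* An interrupt arrives at the CPU. */
static inline void raise_irq_sketch(struct lazy_cpu_sketch *c)
{
    if (!c->msr_ee)
        return;                  /* hardware holds it pending */
    if (!c->soft_enabled) {
        /* Supposed to be disabled: hard-disable now and return. */
        c->msr_ee = false;
        c->hard_enabled = false;
        return;
    }
    c->handled++;
}

/* local_irq_enable(): set the soft flag, hard-enable again if needed. */
static inline void local_irq_enable_sketch(struct lazy_cpu_sketch *c)
{
    c->soft_enabled = true;
    if (!c->hard_enabled) {
        c->hard_enabled = true;
        c->msr_ee = true;
    }
    assert(c->msr_ee);           /* soft-enabled implies EE is set */
}
```

The common case, a disable/enable pair with no interrupt in between, touches only a byte in the paca and never writes the MSR.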
2006-10-04 13:47:49 +07:00
|
|
|
u8 soft_enabled; /* irq soft-enable flag */
|
powerpc: Rework lazy-interrupt handling
The current implementation of lazy interrupts handling has some
issues that this tries to address.
We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.
The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.
Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.
This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.
The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.
When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.
We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).
This removes the need to play with the decrementer to try to create
fake interrupts, among others.
In addition, this adds a few refinements:
- We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.
- Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.
- On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appears to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)
- We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.
Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2:
- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
to retrigger an interrupt without preventing hard-enable
v3:
- Fix or vs. ori bug on Book3E
- Fix enabling of interrupts for some exceptions on Book3E
v4:
- Fix resend of doorbells on return from interrupt on Book3E
v5:
- Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.
v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq
v7:
- Fix a bug with hard irq state tracking on native power7
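The "irq_happened" idea from this commit message can be sketched as a bit mask that is recorded while soft-disabled and drained on re-enable. The names and bit values below are assumptions made for illustration, and "replay" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values; treat them as assumptions of this sketch. */
enum {
    SK_IRQ_DBELL = 0x02,
    SK_IRQ_EE    = 0x04,
    SK_IRQ_DEC   = 0x08,
};

/* Hypothetical model of the two paca bytes involved. */
struct irq_happened_sketch {
    uint8_t      soft_enabled;
    uint8_t      irq_happened;  /* sources seen while soft-disabled */
    unsigned int handled;       /* handlers run, directly or by replay */
};

/* An interrupt from 'source' arrives. */
static inline void take_interrupt_sketch(struct irq_happened_sketch *c,
                                         uint8_t source)
{
    assert(source != 0);
    if (c->soft_enabled)
        c->handled++;                 /* run the handler right away */
    else
        c->irq_happened |= source;    /* remember it for later */
}

/* Restore the soft-enable state and replay what was missed. */
static inline void irq_restore_sketch(struct irq_happened_sketch *c,
                                      uint8_t enable)
{
    c->soft_enabled = enable;
    if (!enable)
        return;
    while (c->irq_happened) {
        /* Isolate the lowest pending bit, clear it, replay that source. */
        uint8_t lowest = (uint8_t)(c->irq_happened & -c->irq_happened);
        c->irq_happened &= (uint8_t)~lowest;
        c->handled++;
    }
}
```

Because each source is a single bit, an interrupt that fires several times while masked is replayed once, which matches level-style sources such as the decrementer.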
2012-03-06 14:27:59 +07:00
|
|
|
u8 irq_happened; /* irq happened while soft-disabled */
|
2006-09-13 19:08:26 +07:00
|
|
|
u8 io_sync; /* writel() needs spin_unlock sync */
|
2010-10-14 13:01:34 +07:00
|
|
|
u8 irq_work_pending; /* IRQ_WORK interrupt while soft-disable */
|
2011-12-06 02:47:26 +07:00
|
|
|
u8 nap_state_lost; /* NV GPR values lost in power7_idle */
|
2014-03-11 05:29:38 +07:00
|
|
|
u64 sprg_vdso; /* Saved user-visible sprg */
|
2013-02-13 23:21:34 +07:00
|
|
|
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
|
|
|
|
u64 tm_scratch; /* TM scratch area for reclaim */
|
|
|
|
#endif
|
powerpc: Implement accurate task and CPU time accounting
This implements accurate task and cpu time accounting for 64-bit
powerpc kernels. Instead of accounting a whole jiffy of time to a
task on a timer interrupt because that task happened to be running at
the time, we now account time in units of timebase ticks according to
the actual time spent by the task in user mode and kernel mode. We
also count the time spent processing hardware and software interrupts
accurately. This is conditional on CONFIG_VIRT_CPU_ACCOUNTING. If
that is not set, we do tick-based approximate accounting as before.
To get this accurate information, we read either the PURR (processor
utilization of resources register) on POWER5 machines, or the timebase
on other machines on
* each entry to the kernel from usermode
* each exit to usermode
* transitions between process context, hard irq context and soft irq
context in kernel mode
* context switches.
On POWER5 systems with shared-processor logical partitioning we also
read both the PURR and the timebase at each timer interrupt and
context switch in order to determine how much time has been taken by
the hypervisor to run other partitions ("steal" time). Unfortunately,
since we need values of the PURR on both threads at the same time to
accurately calculate the steal time, and since we can only calculate
steal time on a per-core basis, the apportioning of the steal time
between idle time (time which we ceded to the hypervisor in the idle
loop) and actual stolen time is somewhat approximate at the moment.
This is all based quite heavily on what s390 does, and it uses the
generic interfaces that were added by the s390 developers,
i.e. account_system_time(), account_user_time(), etc.
This patch doesn't add any new interfaces between the kernel and
userspace, and doesn't change the units in which time is reported to
userspace by things such as /proc/stat, /proc/<pid>/stat, getrusage(),
times(), etc. Internally the various task and cpu times are stored in
timebase units, but they are converted to USER_HZ units (1/100th of a
second) when reported to userspace. Some precision is therefore lost
but there should not be any accumulating error, since the internal
accumulation is at full precision.
Signed-off-by: Paul Mackerras <paulus@samba.org>
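The claim that precision is lost on reporting but error does not accumulate can be illustrated with a small conversion helper. This is a sketch with hypothetical names; the 512 MHz timebase frequency used in the note below is an assumption chosen for the arithmetic, not a value from this commit.

```c
#include <assert.h>
#include <stdint.h>

#define USER_HZ_SKETCH 100   /* userspace reporting unit: 1/100th of a second */

/*
 * Convert accumulated timebase ticks to USER_HZ units, rounding down.
 * Splitting into whole seconds and a remainder avoids overflowing the
 * multiplication for large tick counts.
 */
static inline uint64_t tb_to_user_hz_sketch(uint64_t tb_ticks, uint64_t tb_freq)
{
    assert(tb_freq != 0);
    uint64_t secs = tb_ticks / tb_freq;
    uint64_t rem  = tb_ticks % tb_freq;
    return secs * USER_HZ_SKETCH + rem * USER_HZ_SKETCH / tb_freq;
}
```

With an assumed 512 MHz timebase, three intervals of 2048000 ticks (4 ms each) each convert to 0 when converted separately, but their accumulated total of 6144000 ticks converts to 1. Keeping the running total in timebase units and converting only when reporting is what prevents the rounding loss from compounding.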
2006-02-24 06:06:59 +07:00
|
|
|
|
2014-12-10 01:56:52 +07:00
|
|
|
#ifdef CONFIG_PPC_POWERNV
|
|
|
|
/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
|
|
|
|
u32 *core_idle_state_ptr;
|
|
|
|
u8 thread_idle_state; /* PNV_THREAD_RUNNING/NAP/SLEEP */
|
|
|
|
/* Mask to indicate thread id in core */
|
|
|
|
u8 thread_mask;
|
2014-12-10 01:56:53 +07:00
|
|
|
/* Mask to denote subcore sibling threads */
|
|
|
|
u8 subcore_sibling_mask;
|
2017-03-22 22:04:17 +07:00
|
|
|
/*
|
|
|
|
* Pointer to an array which contains pointer
|
|
|
|
* to the sibling threads' paca.
|
|
|
|
*/
|
|
|
|
struct paca_struct **thread_sibling_pacas;
|
2017-05-16 15:49:47 +07:00
|
|
|
/* The PSSCR value that the kernel requested before going to stop */
|
|
|
|
u64 requested_psscr;
|
2017-07-21 17:41:37 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Save area for additional SPRs that need to be
|
|
|
|
* saved/restored during cpuidle stop.
|
|
|
|
*/
|
|
|
|
struct stop_sprs stop_sprs;
|
2014-12-10 01:56:52 +07:00
|
|
|
#endif
|
|
|
|
|
2016-12-20 01:30:04 +07:00
|
|
|
#ifdef CONFIG_PPC_STD_MMU_64
|
|
|
|
/* Non-maskable exceptions that are not performance critical */
|
2017-05-21 20:15:46 +07:00
|
|
|
u64 exnmi[EX_SIZE]; /* used for system reset (nmi) */
|
|
|
|
u64 exmc[EX_SIZE]; /* used for machine checks */
|
2016-12-20 01:30:04 +07:00
|
|
|
#endif
|
2013-10-30 21:34:00 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S_64
|
2016-12-20 01:30:06 +07:00
|
|
|
/* Exclusive stacks for system reset and machine check exception. */
|
|
|
|
void *nmi_emergency_sp;
|
2013-10-30 21:34:00 +07:00
|
|
|
void *mc_emergency_sp;
|
2016-12-20 01:30:05 +07:00
|
|
|
|
|
|
|
u16 in_nmi; /* In nmi handler */
|
|
|
|
|
2013-10-30 21:34:00 +07:00
|
|
|
/*
|
|
|
|
* Flag to check whether we are in machine check early handler
|
|
|
|
* and already using emergency stack.
|
|
|
|
*/
|
|
|
|
u16 in_mce;
|
2016-12-20 01:30:05 +07:00
|
|
|
u8 hmi_event_available; /* HMI event is available */
|
2013-10-30 21:34:00 +07:00
|
|
|
#endif
|
2011-09-20 00:45:04 +07:00
|
|
|
|
powerpc: Implement accurate task and CPU time accounting
This implements accurate task and cpu time accounting for 64-bit
powerpc kernels. Instead of accounting a whole jiffy of time to a
task on a timer interrupt because that task happened to be running at
the time, we now account time in units of timebase ticks according to
the actual time spent by the task in user mode and kernel mode. We
also count the time spent processing hardware and software interrupts
accurately. This is conditional on CONFIG_VIRT_CPU_ACCOUNTING. If
that is not set, we do tick-based approximate accounting as before.
To get this accurate information, we read either the PURR (processor
utilization of resources register) on POWER5 machines, or the timebase
on other machines on
* each entry to the kernel from usermode
* each exit to usermode
* transitions between process context, hard irq context and soft irq
context in kernel mode
* context switches.
On POWER5 systems with shared-processor logical partitioning we also
read both the PURR and the timebase at each timer interrupt and
context switch in order to determine how much time has been taken by
the hypervisor to run other partitions ("steal" time). Unfortunately,
since we need values of the PURR on both threads at the same time to
accurately calculate the steal time, and since we can only calculate
steal time on a per-core basis, the apportioning of the steal time
between idle time (time which we ceded to the hypervisor in the idle
loop) and actual stolen time is somewhat approximate at the moment.
This is all based quite heavily on what s390 does, and it uses the
generic interfaces that were added by the s390 developers,
i.e. account_system_time(), account_user_time(), etc.
This patch doesn't add any new interfaces between the kernel and
userspace, and doesn't change the units in which time is reported to
userspace by things such as /proc/stat, /proc/<pid>/stat, getrusage(),
times(), etc. Internally the various task and cpu times are stored in
timebase units, but they are converted to USER_HZ units (1/100th of a
second) when reported to userspace. Some precision is therefore lost
but there should not be any accumulating error, since the internal
accumulation is at full precision.
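The point about precision can be sketched as follows. This is a hedged illustration, not the kernel's conversion routine: `tb_to_user_hz` is a hypothetical helper, and the timebase frequency is passed in as a parameter because the text does not state one. Converting the accumulated total loses at most one USER_HZ unit, whereas converting each short interval separately and summing the results could lose a fraction every time.

```c
#include <assert.h>
#include <stdint.h>

#define USER_HZ 100u  /* reporting unit named in the text: 1/100th of a second */

/*
 * Hypothetical helper: convert an accumulated timebase tick count to
 * USER_HZ units, rounding down. tb_freq_hz (ticks per second) is an
 * assumed parameter.
 */
static uint64_t tb_to_user_hz(uint64_t tb_ticks, uint64_t tb_freq_hz)
{
	assert(tb_freq_hz != 0);

	/* Split into whole seconds and remainder so that the multiply by
	 * USER_HZ cannot overflow for large accumulated totals. */
	uint64_t secs = tb_ticks / tb_freq_hz;
	uint64_t rem = tb_ticks % tb_freq_hz;

	return secs * USER_HZ + (rem * USER_HZ) / tb_freq_hz;
}
```

For example, with an assumed 512 MHz timebase, one thousand intervals of 3,000,000 ticks each convert to zero individually, while their full-precision sum of 3,000,000,000 ticks converts to 585 USER_HZ units.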
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24 06:06:59 +07:00
	/* Stuff for accurate time accounting */
2016-05-17 13:33:46 +07:00
	struct cpu_accounting_data accounting;
powerpc: Account time using timebase rather than PURR
Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
PURR register for measuring the user and system time used by
processes, as well as other related times such as hardirq and
softirq times. This turns out to be quite confusing for users
because it means that a program will often be measured as taking
less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
than it does when run on a single-threaded processor (ST mode), even
though the program takes longer to finish. The discrepancy is
accounted for as stolen time, which is also confusing, particularly
when there are no other partitions running.
This changes the accounting to use the timebase instead, meaning that
the reported user and system times are the actual number of real-time
seconds that the program was executing on the processor thread,
regardless of which SMT mode the processor is in. Thus a program will
generally show greater user and system times when run on a
multi-threaded processor than on a single-threaded processor.
On pSeries systems on POWER5 or later processors, we measure the
stolen time (time when this partition wasn't running) using the
hypervisor dispatch trace log. We check for new entries in the
log on every entry from user mode and on every transition from
kernel process context to soft or hard IRQ context (i.e. when
account_system_vtime() gets called). So that we can correctly
distinguish time stolen from user time and time stolen from system
time, without having to check the log on every exit to user mode,
we store separate timestamps for exit to user mode and entry from
user mode.
On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
in account_system_vtime() (as before), and then apportion the SPURR
ticks since the last time we read it between scaled user time and
scaled system time according to the relative proportions of user
time and system time over the same interval. This avoids having to
read the SPURR on every kernel entry and exit. On systems that have
PURR but not SPURR (i.e., POWER5), we do the same using the PURR
rather than the SPURR.
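The apportioning step described above can be sketched in C. This is a simplified illustration under assumptions, not the code in this patch: `scaled_split` and `apportion_spurr` are hypothetical names, the inputs are deltas over a single short interval, and charging everything to system time when no timebase ticks were recorded is an arbitrary choice made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: scaled user and scaled system ticks. */
struct scaled_split {
	uint64_t user;
	uint64_t system;
};

/*
 * Split spurr_delta (SPURR or PURR ticks since the last read) between
 * user and system in proportion to the timebase ticks spent in each
 * over the same interval.
 */
static struct scaled_split apportion_spurr(uint64_t spurr_delta,
					   uint64_t user_tb, uint64_t sys_tb)
{
	struct scaled_split out = { 0, 0 };
	uint64_t total_tb = user_tb + sys_tb;

	if (total_tb == 0) {
		/* Assumption for this sketch: nothing to split by, so
		 * charge the whole delta to system time. */
		out.system = spurr_delta;
		return out;
	}

	/* The interval is assumed short enough that the product fits. */
	assert(user_tb == 0 || spurr_delta <= UINT64_MAX / user_tb);

	out.user = spurr_delta * user_tb / total_tb;
	/* Give the remainder to system time so no tick is dropped. */
	out.system = spurr_delta - out.user;
	return out;
}
```

Computing the system share by subtraction rather than by a second division keeps the two parts summing exactly to the delta that was read.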
This disables the DTL user interface in /sys/kernel/debug/powerpc/dtl
for now since it conflicts with the use of the dispatch trace log
by the time accounting code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-08-27 02:56:43 +07:00
	u64 dtl_ridx;	/* read index in dispatch log */
	struct dtl_entry *dtl_curr;	/* pointer corresponding to dtl_ridx */
2009-10-30 12:47:22 +07:00

2010-04-16 05:11:41 +07:00
#ifdef CONFIG_KVM_BOOK3S_HANDLER
2013-10-07 23:47:51 +07:00
#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
2010-01-08 08:58:03 +07:00
	/* We use this to store guest state in */
	struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
#endif
2011-06-29 07:20:58 +07:00
	struct kvmppc_host_state kvm_hstate;
2016-08-11 20:07:43 +07:00
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
	/*
	 * Bitmap for sibling subcore status. See kvm/book3s_hv_ras.c for
	 * more details
	 */
	struct sibling_subcore_state *sibling_subcore_state;
#endif
2009-10-30 12:47:22 +07:00
#endif
2005-04-17 05:20:36 +07:00
};

2017-03-22 10:36:49 +07:00
extern void copy_mm_to_paca(struct mm_struct *mm);
2010-01-28 20:23:22 +07:00
extern struct paca_struct *paca;
extern void initialise_paca(struct paca_struct *new_paca, int cpu);
2010-07-08 04:55:37 +07:00
extern void setup_paca(struct paca_struct *new_paca);
2010-01-28 20:23:22 +07:00
extern void allocate_pacas(void);
extern void free_unused_pacas(void);

#else /* CONFIG_PPC64 */

static inline void allocate_pacas(void) { }
static inline void free_unused_pacas(void) { }

#endif /* CONFIG_PPC64 */
2005-04-17 05:20:36 +07:00

2005-12-17 04:43:46 +07:00
#endif /* __KERNEL__ */
2005-11-09 09:38:01 +07:00
#endif /* _ASM_POWERPC_PACA_H */