License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information in it,
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information.

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0                                               11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0". Results of that were:

   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                         930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point). Results summary:

   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                         270
   GPL-2.0+ WITH Linux-syscall-note                        169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
   LGPL-2.1+ WITH Linux-syscall-note                        15
   GPL-1.0+ WITH Linux-syscall-note                         14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
   LGPL-2.0+ WITH Linux-syscall-note                         4
   LGPL-2.1 WITH Linux-syscall-note                          3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.
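
The scanner heuristics above can be sketched as a small decision function.
This is an illustration only, not the script used for this series: the
function name, the use of None for "no license traces" and for "needs
manual review", and the choice to send compound (OR/AND) expressions to
manual review are all assumptions made for the sketch.

```python
from typing import Optional

SYSCALL_NOTE = "Linux-syscall-note"


def concluded_license(path: str,
                      scancode: Optional[str],
                      windriver: Optional[str]) -> Optional[str]:
    """Return an SPDX expression, or None when manual review is needed.

    `scancode` and `windriver` are the identifiers each scanner detected,
    or None when that scanner found no license traces (assumed convention).
    """
    is_uapi = "/uapi/" in path

    # Neither scanner found anything: the top level COPYING license applies.
    if scancode is None and windriver is None:
        return f"GPL-2.0 WITH {SYSCALL_NOTE}" if is_uapi else "GPL-2.0"

    # Disagreement (including one-sided detection) means manual inspection.
    if scancode != windriver:
        return None

    detected = scancode
    # Assumption: compound expressions are composed by hand, not here.
    if " OR " in detected or " AND " in detected:
        return None

    # uapi files with a GPL family license get the syscall note.
    if is_uapi and "GPL" in detected and SYSCALL_NOTE not in detected:
        return f"{detected} WITH {SYSCALL_NOTE}"
    return detected
```

For example, a non-uapi file with no detected license maps to "GPL-2.0",
while two scanners reporting different licenses yield None, which stands
for the manual inspection step described above.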

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 21:07:57 +07:00
/* SPDX-License-Identifier: GPL-2.0 */
2009-06-03 04:17:38 +07:00
/*
 * This file contains the 64-bit "server" PowerPC variant
 * of the low level exception handling including exception
 * vectors, exception return, part of the slb and stab
 * handling and other fixed offset specific things.
 *
 * This file is meant to be #included from head_64.S due to
2011-03-31 08:57:33 +07:00
 * position dependent assembly.
2009-06-03 04:17:38 +07:00
 *
 * Most of this originates from head_64.S and thus has the same
 * copyright history.
 *
 */

powerpc: Rework lazy-interrupt handling

The current implementation of lazy interrupts handling has some
issues that this tries to address.

We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.

The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.

Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.

This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.

The base idea is to replace the "hard_enabled" field with an
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.

When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.

We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).

This removes the need to play with the decrementer to try to create
fake interrupts, among others.
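
The soft-disable and replay idea can be modelled in a few lines. This is
a toy model under stated assumptions, not the kernel implementation: the
class, the method names and the bit values are made up for illustration,
and the real code also deals with hard-disabling and exception frames.

```python
# Hypothetical bit values, chosen only for this illustration.
DEC = 0x1
DOORBELL = 0x2
EXTERNAL = 0x4


class Cpu:
    """Toy model of lazy (soft) interrupt disabling with replay."""

    def __init__(self):
        self.soft_enabled = True
        self.irq_happened = 0   # bit mask of interrupts seen while masked
        self.handled = []       # order in which handlers actually ran

    def interrupt(self, source):
        if self.soft_enabled:
            self.handled.append(source)
        else:
            # Soft-disabled: only record that the interrupt occurred.
            self.irq_happened |= source

    def local_irq_disable(self):
        self.soft_enabled = False

    def local_irq_restore(self, enable):
        if not enable:
            self.soft_enabled = False
            return
        # Re-enabling: test each bit and replay what was missed.
        for bit in (DEC, DOORBELL, EXTERNAL):
            if self.irq_happened & bit:
                self.irq_happened &= ~bit
                self.handled.append(bit)
        self.soft_enabled = True
```

Note that a source that fires twice while masked is replayed once, since
the mask only records that it happened.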

In addition, this adds a few refinements:

 - We no longer hard disable decrementer interrupts that occur
   while soft-disabled. We now simply bump the decrementer back to max
   (on BookS) or leave it stopped (on BookE) and continue with hard interrupts
   enabled, which means that we'll potentially get better sample quality from
   performance monitor interrupts.

 - Timer, decrementer and doorbell interrupts now hard-enable
   shortly after removing the source of the interrupt, which means
   they no longer run entirely hard disabled. Again, this will improve
   perf sample quality.

 - On Book3E 64-bit, we now make the performance monitor interrupt
   act as an NMI like Book3S (the necessary C code for that to work
   appears to already be present in the FSL perf code, notably calling
   nmi_enter instead of irq_enter). (This also fixes a bug where BookE
   perfmon interrupts could clobber r14 ... oops)

 - We could make "masked" decrementer interrupts act as NMIs when doing
   timer-based perf sampling to improve the sample quality.

Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

v2:
 - Add hard-enable to decrementer, timer and doorbells
 - Fix CR clobber in masked irq handling on BookE
 - Make embedded perf interrupt act as an NMI
 - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
   to retrigger an interrupt without preventing hard-enable

v3:
 - Fix or vs. ori bug on Book3E
 - Fix enabling of interrupts for some exceptions on Book3E

v4:
 - Fix resend of doorbells on return from interrupt on Book3E

v5:
 - Rebased on top of my latest series, which involves some significant
   rework of some aspects of the patch.

v6:
 - 32-bit compile fix
 - more compile fixes with various .config combos
 - factor out the asm code to soft-disable interrupts
 - remove the C wrapper around preempt_schedule_irq

v7:
 - Fix a bug with hard irq state tracking on native power7
2012-03-06 14:27:59 +07:00
#include <asm/hw_irq.h>
2009-07-15 03:52:52 +07:00
#include <asm/exception-64s.h>
2010-11-18 22:06:17 +07:00
#include <asm/ptrace.h>
2014-12-10 01:56:52 +07:00
#include <asm/cpuidle.h>
2016-09-30 16:43:18 +07:00
#include <asm/head-64.h>
2018-07-05 23:25:01 +07:00
#include <asm/feature-fixups.h>
2019-04-18 13:51:24 +07:00
#include <asm/kup.h>
2009-07-15 03:52:52 +07:00

2009-06-03 04:17:38 +07:00
/*
2016-09-28 08:31:48 +07:00
 * There are a few constraints to be concerned with.
 * - Real mode exceptions code/data must be located at their physical location.
 * - Virtual mode exceptions must be mapped at their 0xc000... location.
 * - Fixed location code must not call directly beyond the __end_interrupts
 *   area when built with CONFIG_RELOCATABLE. LOAD_HANDLER / bctr sequence
 *   must be used.
 * - LOAD_HANDLER targets must be within first 64K of physical 0 /
 *   virtual 0xc00...
 * - Conditional branch targets must be within +/-32K of caller.
 *
 * "Virtual exceptions" run with relocation on (MSR_IR=1, MSR_DR=1), and
 * therefore don't have to run in physically located code or rfid to
 * virtual mode kernel code. However on relocatable kernels they do have
 * to branch to KERNELBASE offset because the rest of the kernel (outside
 * the exception vectors) may be located elsewhere.
 *
 * Virtual exceptions correspond with physical, except their entry points
 * are offset by 0xc000000000000000 and also tend to get an added 0x4000
 * offset applied. Virtual exceptions are enabled with the Alternate
 * Interrupt Location (AIL) bit set in the LPCR. However this does not
 * guarantee they will be delivered virtually. Some conditions (see the ISA)
 * cause exceptions to be delivered in real mode.
 *
 * It's impossible to receive interrupts below 0x300 via AIL.
 *
 * KVM: None of the virtual exceptions are from the guest. Anything that
 * escalated to HV=1 from HV=0 is delivered via real mode handlers.
 *
 *
2009-06-03 04:17:38 +07:00
 * We layout physical memory as follows:
 * 0x0000 - 0x00ff : Secondary processor spin code
2016-09-28 08:31:48 +07:00
 * 0x0100 - 0x18ff : Real mode pSeries interrupt vectors
 * 0x1900 - 0x3fff : Real mode trampolines
 * 0x4000 - 0x58ff : Relon (IR=1,DR=1) mode pSeries interrupt vectors
 * 0x5900 - 0x6fff : Relon mode trampolines
2009-06-03 04:17:38 +07:00
 * 0x7000 - 0x7fff : FWNMI data area
2016-09-28 08:31:48 +07:00
 * 0x8000 -   .... : Common interrupt handlers, remaining early
 *                   setup code, rest of kernel.
2016-09-21 14:44:07 +07:00
 *
 * We could reclaim 0x4000-0x42ff for real mode trampolines if the space
 * is necessary. Until then it's more consistent to explicitly put VIRT_NONE
 * vectors there.
2016-09-28 08:31:48 +07:00
 */
OPEN_FIXED_SECTION(real_vectors, 0x0100, 0x1900)
OPEN_FIXED_SECTION(real_trampolines, 0x1900, 0x4000)
OPEN_FIXED_SECTION(virt_vectors, 0x4000, 0x5900)
OPEN_FIXED_SECTION(virt_trampolines, 0x5900, 0x7000)
2019-02-26 15:51:07 +07:00

#ifdef CONFIG_PPC_POWERNV
2019-03-01 19:56:36 +07:00
	.globl start_real_trampolines
	.globl end_real_trampolines
	.globl start_virt_trampolines
	.globl end_virt_trampolines
2019-02-26 15:51:07 +07:00
#endif

2016-09-28 08:31:48 +07:00
#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
/*
 * Data area reserved for FWNMI option.
 * This address (0x7000) is fixed by the RPA.
 * pseries and powernv need to keep the whole page from
 * 0x7000 to 0x8000 free for use by the firmware
2009-06-03 04:17:38 +07:00
 */
2016-09-28 08:31:48 +07:00
ZERO_FIXED_SECTION(fwnmi_page, 0x7000, 0x8000)
OPEN_TEXT_SECTION(0x8000)
#else
OPEN_TEXT_SECTION(0x7000)
#endif

USE_FIXED_SECTION(real_vectors)

2009-06-03 04:17:38 +07:00
/*
 * This is the start of the interrupt handlers for pSeries
 * This code runs with relocation off.
 * Code from here to __end_interrupts gets copied down to real
 * address 0x100 when we are running a relocatable kernel.
 * Therefore any relative branches in this section must only
 * branch to labels in this section.
 */
	.globl __start_interrupts
__start_interrupts:

2016-09-21 14:44:07 +07:00
/* No virt vectors corresponding with 0x0..0x100 */
2016-12-06 08:41:12 +07:00
EXC_VIRT_NONE(0x4000, 0x100)
2016-09-21 14:44:07 +07:00

2016-10-13 09:17:14 +07:00

2019-06-22 20:15:15 +07:00
EXC_REAL_BEGIN(system_reset, 0x100, 0x100)
	SET_SCRATCH0(r13)
2019-06-22 20:15:19 +07:00
	EXCEPTION_PROLOG_0 PACA_EXNMI
2019-06-22 20:15:15 +07:00

	/* This is EXCEPTION_PROLOG_1 with the idle feature section added */
	OPT_SAVE_REG_TO_PACA(PACA_EXNMI+EX_PPR, r9, CPU_FTR_HAS_PPR)
	OPT_SAVE_REG_TO_PACA(PACA_EXNMI+EX_CFAR, r10, CPU_FTR_CFAR)
	INTERRUPT_TO_KERNEL
	SAVE_CTR(r10, PACA_EXNMI)
	mfcr r9

2011-01-24 14:42:41 +07:00
#ifdef CONFIG_PPC_P7_NAP
2016-10-13 09:17:14 +07:00
	/*
	 * If running native on arch 2.06 or later, check if we are waking up
2017-06-25 00:29:01 +07:00
	 * from nap/sleep/winkle, and branch to idle handler. This tests SRR1
	 * bits 46:47. A non-0 value indicates that we are coming from a power
	 * saving state. The idle wakeup handler initially runs in real mode,
	 * but we branch to the 0xc000... address so we can turn on relocation
	 * with mtmsr.
2011-01-24 14:42:41 +07:00
	 */
2019-06-22 20:15:15 +07:00
BEGIN_FTR_SECTION
	mfspr r10,SPRN_SRR1
	rlwinm. r10,r10,47-31,30,31
	beq- 1f
	cmpwi cr1,r10,2
	mfspr r3,SPRN_SRR1
	bltlr cr1 /* no state loss, return to idle caller */
	BRANCH_TO_C000(r10, system_reset_idle_common)
1:
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
2016-10-13 09:17:14 +07:00
#endif

KVM: PPC: Allow book3s_hv guests to use SMT processor modes

This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.

This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.

To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.

When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.
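
The vcpu-to-vcore mapping is plain integer division. A minimal sketch of
that arithmetic, with hypothetical helper names that are not part of KVM
or qemu:

```python
def vcore_of(vcpu_id: int, threads_per_core: int) -> int:
    """vcore number = vcpu number divided by threads per core."""
    return vcpu_id // threads_per_core


def single_threaded_vcpu_ids(n_vcpus: int, threads_per_core: int) -> list:
    """vcpu numbers that place every vcpu alone in its own vcore."""
    return [i * threads_per_core for i in range(n_vcpus)]
```

With 4 threads per core, vcpus 0 to 3 share vcore 0 and vcpu 5 lands in
vcore 1, while the single-threaded numbering 0, 4, 8 gives each vcpu its
own vcore.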

We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.

When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).

It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.

Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:23:08 +07:00

2019-06-22 20:15:15 +07:00
	KVMTEST EXC_STD 0x100
	std r11,PACA_EXNMI+EX_R11(r13)
	std r12,PACA_EXNMI+EX_R12(r13)
	GET_SCRATCH0(r10)
	std r10,PACA_EXNMI+EX_R13(r13)

	EXCEPTION_PROLOG_2_REAL system_reset_common, EXC_STD, 0
2016-12-20 01:30:05 +07:00
	/*
	 * MSR_RI is not enabled, because PACA_EXNMI and nmi stack is
	 * being used, so a nested NMI exception would corrupt it.
	 */
2016-10-13 09:17:14 +07:00

2016-12-06 08:41:12 +07:00
EXC_REAL_END(system_reset, 0x100, 0x100)
EXC_VIRT_NONE(0x4100, 0x100)
2017-11-05 19:33:55 +07:00
TRAMP_KVM(PACA_EXNMI, 0x100)
2016-10-13 09:17:14 +07:00

#ifdef CONFIG_PPC_P7_NAP
EXC_COMMON_BEGIN(system_reset_idle_common)

powerpc/64s: Reimplement book3s idle code in C

Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
specific HV idle code to the powernv platform code.

Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and restoring the stack and reloading GPRs then
returning to C after waking from idle.

The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.

This is not a strict translation to C code, there are some
significant differences:

- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
  but saves and restores them itself.

- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
  or change MSR is restored, because it's now simple to do. ESL=1
  sleeps that do not lose GPRs can use this optimization too.

- KVM secondary entry and cede is now more of a call/return style
  rather than branchy. nap_state_lost is not required because KVM
  always returns via NVGPR restoring path.

- KVM secondary wakeup from offline sequence is moved entirely into
  the offline wakeup, which avoids a hwsync in the normal idle wakeup
  path.

Performance measured with context switch ping-pong on different
threads or cores, is possibly improved a small amount, 1-3% depending
on stop state and core vs thread test for shallow states. For deep
states it's in the noise compared with other latencies.

KVM improvements:

- Idle sleepers now always return to caller rather than branch out
  to KVM first.

- This allows optimisations like very fast return to caller when no
  state has been lost.

- KVM no longer requires nap_state_lost because it controls NVGPR
  save/restore itself on the way in and out.

- The heavy idle wakeup KVM request check can be moved out of the
  normal host idle code and into the not-performance-critical offline
  code.

- KVM nap code now returns from where it is called, which makes the
  flow a bit easier to follow.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Squash the KVM changes in]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-12 21:30:52 +07:00
	/*
	 * This must be a direct branch (without linker branch stub) because
	 * we can not use TOC at this point as r2 may not be restored yet.
	 */
	b idle_return_gpr_loss
2011-06-29 07:23:08 +07:00
#endif

2016-12-20 01:30:04 +07:00
EXC_COMMON_BEGIN(system_reset_common)
2016-12-20 01:30:05 +07:00
	/*
	 * Increment paca->in_nmi then enable MSR_RI. SLB or MCE will be able
	 * to recover, but nested NMI will notice in_nmi and not recover
	 * because of the use of the NMI stack. in_nmi reentrancy is tested in
	 * system_reset_exception.
	 */
	lhz r10,PACA_IN_NMI(r13)
	addi r10,r10,1
	sth r10,PACA_IN_NMI(r13)
	li r10,MSR_RI
	mtmsrd r10,1
2014-02-26 07:08:25 +07:00

2016-12-20 01:30:06 +07:00
	mr r10,r1
	ld r1,PACA_NMI_EMERG_SP(r13)
	subi r1,r1,INT_FRAME_SIZE
2019-06-22 20:15:21 +07:00
	EXCEPTION_COMMON_STACK(PACA_EXNMI, 0x100)
	bl save_nvgprs
	/*
	 * Set IRQS_ALL_DISABLED unconditionally so arch_irqs_disabled does
	 * the right thing. We do not want to reconcile because that goes
	 * through irq tracing which we don't want in NMI.
	 *
	 * Save PACAIRQHAPPENED because some code will do a hard disable
	 * (e.g., xmon). So we want to restore this back to where it was
	 * when we return. DAR is unused in the stack, so save it there.
	 */
	li r10,IRQS_ALL_DISABLED
	stb r10,PACAIRQSOFTMASK(r13)
	lbz r10,PACAIRQHAPPENED(r13)
	std r10,_DAR(r1)

2019-06-22 20:15:20 +07:00
	addi r3,r1,STACK_FRAME_OVERHEAD
	bl system_reset_exception
2018-03-26 22:01:03 +07:00

	/* This (and MCE) can be simplified with mtmsrd L=1 */
	/* Clear MSR_RI before setting SRR0 and SRR1. */
	li r0,MSR_RI
	mfmsr r9
	andc r9,r9,r0
	mtmsrd r9,1
2016-12-20 01:30:05 +07:00

	/*
2018-03-26 22:01:03 +07:00
	 * MSR_RI is clear, now we can decrement paca->in_nmi.
2016-12-20 01:30:05 +07:00
	 */
	lhz r10,PACA_IN_NMI(r13)
	subi r10,r10,1
	sth r10,PACA_IN_NMI(r13)

2018-03-26 22:01:03 +07:00
	/*
	 * Restore soft mask settings.
	 */
	ld r10,_DAR(r1)
	stb r10,PACAIRQHAPPENED(r13)
	ld r10,SOFTE(r1)
	stb r10,PACAIRQSOFTMASK(r13)

	/*
	 * Keep below code in synch with MACHINE_CHECK_HANDLER_WINDUP.
	 * Should share common bits...
	 */

	/* Move original SRR0 and SRR1 into the respective regs */
	ld r9,_MSR(r1)
	mtspr SPRN_SRR1,r9
	ld r3,_NIP(r1)
	mtspr SPRN_SRR0,r3
	ld r9,_CTR(r1)
	mtctr r9
	ld r9,_XER(r1)
	mtxer r9
	ld r9,_LINK(r1)
	mtlr r9
	REST_GPR(0, r1)
	REST_8GPRS(2, r1)
	REST_GPR(10, r1)
	ld r11,_CCR(r1)
	mtcr r11
	REST_GPR(11, r1)
	REST_2GPRS(12, r1)
	/* restore original r1. */
	ld r1,GPR1(r1)
	RFI_TO_USER_OR_KERNEL
2016-09-21 14:43:30 +07:00

#ifdef CONFIG_PPC_PSERIES
/*
 * Vectors for the FWNMI option. Share common code.
 */
TRAMP_REAL_BEGIN(system_reset_fwnmi)
	SET_SCRATCH0(r13) /* save r13 */
2019-06-22 20:15:22 +07:00
	/* See comment at system_reset exception, don't turn on RI */
	EXCEPTION_PROLOG_0 PACA_EXNMI
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXNMI, 0, 0x100, 0
	EXCEPTION_PROLOG_2_REAL system_reset_common, EXC_STD, 0

2016-09-21 14:43:30 +07:00
#endif /* CONFIG_PPC_PSERIES */

2009-06-03 04:17:38 +07:00

2016-12-06 08:41:12 +07:00
EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
2011-06-29 07:18:26 +07:00
	/* This is moved out of line as it can be patched by FW, but
	 * some code path might still want to branch into the original
	 * vector
	 */

powerpc: Save CFAR before branching in interrupt entry paths

Some of the interrupt vectors on 64-bit POWER server processors are
only 32 bytes long, which is not enough for the full first-level
interrupt handler. For these we currently just have a branch to an
out-of-line handler. However, this means that we corrupt the CFAR
(come-from address register) on POWER7 and later processors.

To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces:
EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR
is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We
then put EXCEPTION_PROLOG_0 in the short interrupt vectors before
we branch to the out-of-line handler, which contains the rest of the
first-level interrupt handler. To facilitate this, we define new
_OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc.

In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more
than 6 instructions, it was necessary to move the stores that move
the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and
to get rid of one of the two HMT_MEDIUM instructions. Previously
there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was
nop'd out on processors with the PPR (POWER7 and later), and then
another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside
__EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR.
Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally
and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although
this leaves it in for the interrupt vectors where there is room for
it.

Previously we had a handler for hypervisor maintenance interrupts at
0xe50, which doesn't leave enough room for the vector for hypervisor
emulation assist interrupts at 0xe40, since we need 8 instructions.
The 0xe50 vector was only used on POWER6, as the HMI vector was moved
to 0xe60 on POWER7. Since we don't support running in hypervisor mode
on POWER6, we just remove the handler at 0xe50.
This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0
instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD
from the relocation-on vectors (since any CPU that supports
relocation-on interrupts also has the PPR).
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-05 01:10:15 +07:00
	SET_SCRATCH0(r13)		/* save r13 */
2019-06-22 20:15:19 +07:00
	EXCEPTION_PROLOG_0 PACA_EXMC
powerpc/book3s: handle machine check in Linux host.
Move machine check entry point into Linux. So far we were dependent on
firmware to decode MCE error details and hand over the high level info to the OS.
This patch introduces an early machine check routine that saves the MCE
information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate
a stack frame on the emergency stack and set r1 accordingly. This allows us to be
prepared to take another exception without losing context. One thing to note
here is that, if we get another machine check while the ME bit is off, we risk a
checkstop. Hence we restrict ourselves to saving only the MCE information and
the registers saved in the PACA_EXMC save area before we turn the ME bit on. We use
the paca->in_mce flag to differentiate between first entry and nested machine check
entry, which helps proper use of the emergency stack. We increment paca->in_mce
every time we enter the early machine check handler and decrement it while
leaving. When we enter the early machine check handler for the first time
(paca->in_mce == 0), we are sure nobody is using the MC emergency stack and
allocate a stack frame at the start of the emergency stack. During subsequent
entries (paca->in_mce > 0), we know that r1 points inside the emergency stack
and we allocate a separate stack frame accordingly. This prevents us from
clobbering MCE information during nested machine checks.
The early machine check handler changes are placed under the CPU_FTR_HVMODE
section. This makes sure that the early machine check handler gets executed
only in a hypervisor kernel.
This is the code flow:
                Machine Check Interrupt
                        |
                        V
                   0x200 vector                       ME=0, IR=0, DR=0
                        |
                        V
        +-----------------------------------------------+
        |machine_check_pSeries_early:                   | ME=0, IR=0, DR=0
        |       Alloc frame on emergency stack          |
        |       Save srr1, srr0, dar and dsisr on stack |
        +-----------------------------------------------+
                        |
                (ME=1, IR=0, DR=0, RFID)
                        |
                        V
                machine_check_handle_early              ME=1, IR=0, DR=0
                        |
                        V
        +-----------------------------------------------+
        |       machine_check_early (r3=pt_regs)        | ME=1, IR=0, DR=0
        |       Things to do: (in next patches)         |
        |               Flush SLB for SLB errors        |
        |               Flush TLB for TLB errors        |
        |               Decode and save MCE info        |
        +-----------------------------------------------+
                        |
        (Fall through existing exception handler routine.)
                        |
                        V
                machine_check_pSerie                    ME=1, IR=0, DR=0
                        |
                (ME=1, IR=1, DR=1, RFID)
                        |
                        V
                machine_check_common                    ME=1, IR=1, DR=1
                        .
                        .
                        .
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-30 21:34:08 +07:00
BEGIN_FTR_SECTION
2018-09-11 21:27:23 +07:00
	b	machine_check_common_early
2013-10-30 21:34:08 +07:00
FTR_SECTION_ELSE
2013-02-05 01:10:15 +07:00
	b	machine_check_pSeries_0
2013-10-30 21:34:08 +07:00
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
2016-12-06 08:41:12 +07:00
EXC_REAL_END(machine_check, 0x200, 0x100)
EXC_VIRT_NONE(0x4200, 0x100)
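The "Save CFAR" commit message for this vector reasons about vector sizes in units of instructions. As a rough sketch of that arithmetic, assuming fixed 4-byte PowerPC instructions (the helper name `insns_between` is hypothetical and not part of the kernel):

```c
#include <stdint.h>

/* Assumption: every PowerPC instruction is 4 bytes wide. */
enum { INSN_BYTES = 4 };

/* Hypothetical helper: number of instructions that fit between two
 * vector offsets, i.e. before the next vector begins. */
static inline unsigned int insns_between(uint32_t start, uint32_t next)
{
	return (next - start) / INSN_BYTES;
}
```

Under that assumption, `insns_between(0xe40, 0xe50)` gives 4, which is fewer than the 8 instructions the message says the 0xe40 vector needs, while `insns_between(0xe40, 0xe60)` gives 8. This is consistent with the message's decision to drop the handler at 0xe50.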
2018-09-11 21:27:23 +07:00
TRAMP_REAL_BEGIN(machine_check_common_early)
2019-06-22 20:15:16 +07:00
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXMC, 0, 0x200, 0
2016-09-21 14:43:31 +07:00
	/*
	 * Register contents:
	 * R13 = PACA
	 * R9 = CR
	 * Original R9 to R13 is saved on PACA_EXMC
	 *
	 * Switch to mc_emergency stack and handle re-entrancy (we limit
	 * the nested MCE up to level 4 to avoid stack overflow).
	 * Save MCE registers srr1, srr0, dar and dsisr and then set ME=1
	 *
	 * We use paca->in_mce to check whether this is the first entry or
	 * nested machine check. We increment paca->in_mce to track nested
	 * machine checks.
	 *
	 * If this is the first entry then set stack pointer to
	 * paca->mc_emergency_sp, otherwise r1 is already pointing to
	 * stack frame on mc_emergency stack.
	 *
	 * NOTE: We are here with MSR_ME=0 (off), which means we risk a
	 * checkstop if we get another machine check exception before we do
	 * rfid with MSR_ME=1.
2017-04-19 20:05:47 +07:00
	 *
	 * This interrupt can wake directly from idle. If that is the case,
	 * the machine check is handled then the idle wakeup code is called
2018-07-05 15:47:00 +07:00
	 * to restore state.
2016-09-21 14:43:31 +07:00
	 */
	mr	r11,r1			/* Save r1 */
	lhz	r10,PACA_IN_MCE(r13)
	cmpwi	r10,0			/* Are we in nested machine check */
	bne	0f			/* Yes, we are. */
	/* First machine check entry */
	ld	r1,PACAMCEMERGSP(r13)	/* Use MC emergency stack */
0:	subi	r1,r1,INT_FRAME_SIZE	/* alloc stack frame */
	addi	r10,r10,1		/* increment paca->in_mce */
	sth	r10,PACA_IN_MCE(r13)
	/* Limit nested MCE to level 4 to avoid stack overflow */
2017-09-29 11:26:53 +07:00
	cmpwi	r10,MAX_MCE_DEPTH
2016-09-21 14:43:31 +07:00
	bgt	2f			/* Check if we hit limit of 4 */
	std	r11,GPR1(r1)		/* Save r1 on the stack. */
	std	r11,0(r1)		/* make stack chain pointer */
	mfspr	r11,SPRN_SRR0		/* Save SRR0 */
	std	r11,_NIP(r1)
	mfspr	r11,SPRN_SRR1		/* Save SRR1 */
	std	r11,_MSR(r1)
	mfspr	r11,SPRN_DAR		/* Save DAR */
	std	r11,_DAR(r1)
	mfspr	r11,SPRN_DSISR		/* Save DSISR */
	std	r11,_DSISR(r1)
	std	r9,_CCR(r1)		/* Save CR in stackframe */
2019-06-22 05:55:54 +07:00
	/* We don't touch AMR here, we never go to virtual mode */
2016-09-21 14:43:31 +07:00
	/* Save r9 through r13 from EXMC save area to stack frame. */
	EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
	mfmsr	r11			/* get MSR value */
2018-09-11 21:27:23 +07:00
BEGIN_FTR_SECTION
2016-09-21 14:43:31 +07:00
	ori	r11,r11,MSR_ME		/* turn on ME bit */
2018-09-11 21:27:23 +07:00
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
2016-09-21 14:43:31 +07:00
	ori	r11,r11,MSR_RI		/* turn on RI bit */
	LOAD_HANDLER(r12, machine_check_handle_early)
1:	mtspr	SPRN_SRR0,r12
	mtspr	SPRN_SRR1,r11
2018-01-09 23:07:15 +07:00
	RFI_TO_KERNEL
2016-09-21 14:43:31 +07:00
	b	.	/* prevent speculative execution */
2:
	/* Stack overflow. Stay on emergency stack and panic.
	 * Keep the ME bit off while panic-ing, so that if we hit
	 * another machine check we checkstop.
	 */
	addi	r1,r1,INT_FRAME_SIZE	/* go back to previous stack frame */
	ld	r11,PACAKMSR(r13)
	LOAD_HANDLER(r12, unrecover_mce)
	li	r10,MSR_ME
	andc	r11,r11,r10		/* Turn off MSR_ME */
	b	1b
	b	.	/* prevent speculative execution */
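The stack selection and nesting check at the start of machine_check_common_early can be modelled in C. This is only an illustrative sketch: `struct paca_model`, `mce_enter` and the frame size value are hypothetical stand-ins, and the depth limit of 4 is taken from the comments in the assembly rather than from a header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define MAX_MCE_DEPTH	4		/* "level 4" per the comments above */
#define INT_FRAME_SIZE	0x200UL		/* hypothetical frame size */

struct paca_model {			/* hypothetical stand-in for the paca */
	uint16_t  in_mce;		/* nesting counter (lhz/sth: 16 bits) */
	uintptr_t mc_emergency_sp;	/* top of the MC emergency stack */
};

/*
 * Model of the entry sequence: choose the stack, allocate a frame and
 * bump the nesting counter. Returns false when the depth limit is
 * exceeded; the frame is then released again, as at label 2: in the
 * assembly, and the counter stays incremented.
 */
static bool mce_enter(struct paca_model *paca, uintptr_t *r1)
{
	if (paca->in_mce == 0)			/* first entry */
		*r1 = paca->mc_emergency_sp;
	*r1 -= INT_FRAME_SIZE;			/* alloc stack frame */
	paca->in_mce++;
	if (paca->in_mce > MAX_MCE_DEPTH) {	/* bgt 2f */
		*r1 += INT_FRAME_SIZE;		/* back to previous frame */
		return false;
	}
	return true;
}
```

In this model the first entry switches to the emergency stack, nested entries carve further frames below the current r1, and a fifth nested entry is rejected, which corresponds to the branch to unrecover_mce with MSR_ME left off.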
TRAMP_REAL_BEGIN(machine_check_pSeries)
	.globl machine_check_fwnmi
machine_check_fwnmi:
	SET_SCRATCH0(r13)		/* save r13 */
2019-06-22 20:15:19 +07:00
	EXCEPTION_PROLOG_0 PACA_EXMC
2018-09-11 21:27:00 +07:00
BEGIN_FTR_SECTION
2018-09-11 21:27:23 +07:00
	b	machine_check_common_early
2018-09-11 21:27:00 +07:00
END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
2016-09-21 14:43:31 +07:00
machine_check_pSeries_0:
2019-06-22 20:15:16 +07:00
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXMC, 1, 0x200, 0
2016-09-21 14:43:31 +07:00
	/*
2016-12-20 01:30:02 +07:00
	 * MSR_RI is not enabled, because PACA_EXMC is being used, so a
	 * nested machine check corrupts it. machine_check_common enables
	 * MSR_RI.
2016-09-21 14:43:31 +07:00
	 */
2019-06-22 20:15:13 +07:00
	EXCEPTION_PROLOG_2_REAL machine_check_common, EXC_STD, 0
2016-09-21 14:43:31 +07:00

TRAMP_KVM_SKIP(PACA_EXMC, 0x200)

EXC_COMMON_BEGIN(machine_check_common)
	/*
	 * Machine check is different because we use a different
	 * save area: PACA_EXMC instead of PACA_EXGEN.
	 */
	mfspr	r10,SPRN_DAR
	std	r10,PACA_EXMC+EX_DAR(r13)
	mfspr	r10,SPRN_DSISR
	stw	r10,PACA_EXMC+EX_DSISR(r13)
	EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
	FINISH_NAP
	RECONCILE_IRQ_STATE(r10, r11)
	ld	r3,PACA_EXMC+EX_DAR(r13)
	lwz	r4,PACA_EXMC+EX_DSISR(r13)
	/* Enable MSR_RI when finished with PACA_EXMC */
	li	r10,MSR_RI
	mtmsrd	r10,1
	std	r3,_DAR(r1)
	std	r4,_DSISR(r1)
	bl	save_nvgprs
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	machine_check_exception
	b	ret_from_except

#define MACHINE_CHECK_HANDLER_WINDUP			\
	/* Clear MSR_RI before setting SRR0 and SRR1. */\
	li	r0,MSR_RI;				\
	mfmsr	r9;		/* get MSR value */	\
	andc	r9,r9,r0;				\
	mtmsrd	r9,1;		/* Clear MSR_RI */	\
	/* Move original SRR0 and SRR1 into the respective regs */	\
	ld	r9,_MSR(r1);				\
	mtspr	SPRN_SRR1,r9;				\
	ld	r3,_NIP(r1);				\
	mtspr	SPRN_SRR0,r3;				\
	ld	r9,_CTR(r1);				\
	mtctr	r9;					\
	ld	r9,_XER(r1);				\
	mtxer	r9;					\
	ld	r9,_LINK(r1);				\
	mtlr	r9;					\
	REST_GPR(0, r1);				\
	REST_8GPRS(2, r1);				\
	REST_GPR(10, r1);				\
	ld	r11,_CCR(r1);				\
	mtcr	r11;					\
	/* Decrement paca->in_mce. */			\
	lhz	r12,PACA_IN_MCE(r13);			\
	subi	r12,r12,1;				\
	sth	r12,PACA_IN_MCE(r13);			\
	REST_GPR(11, r1);				\
	REST_2GPRS(12, r1);				\
	/* restore original r1. */			\
	ld	r1,GPR1(r1)

2017-04-19 20:05:47 +07:00
#ifdef CONFIG_PPC_P7_NAP
/*
 * This is an idle wakeup. Low level machine check has already been
 * done. Queue the event then call the idle code to do the wake up.
 */
EXC_COMMON_BEGIN(machine_check_idle_common)
	bl	machine_check_queue_event

	/*
	 * We have not used any non-volatile GPRs here, and as a rule
	 * most exception code including machine check does not.
	 * Therefore PACA_NAPSTATELOST does not need to be set. Idle
	 * wakeup will restore volatile registers.
	 *
	 * Load the original SRR1 into r3 for pnv_powersave_wakeup_mce.
	 *
	 * Then decrement MCE nesting after finishing with the stack.
	 */
	ld	r3,_MSR(r1)
powerpc/64s: Reimplement book3s idle code in C
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
specific HV idle code to the powernv platform code.
Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and restoring the stack and reloading GPRs then
returning to C after waking from idle.
The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.
This is not a strict translation to C code, there are some
significant differences:
- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
but saves and restores them itself.
- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
or change MSR is restored, because it's now simple to do. ESL=1
sleeps that do not lose GPRs can use this optimization too.
- KVM secondary entry and cede is now more of a call/return style
rather than branchy. nap_state_lost is not required because KVM
always returns via NVGPR restoring path.
- KVM secondary wakeup from offline sequence is moved entirely into
the offline wakeup, which avoids a hwsync in the normal idle wakeup
path.
Performance, measured with context switch ping-pong on different
threads or cores, is possibly improved by a small amount, 1-3% depending
on stop state and core vs thread test, for shallow states. For deep states
it's in the noise compared with other latencies.
KVM improvements:
- Idle sleepers now always return to caller rather than branch out
to KVM first.
- This allows optimisations like very fast return to caller when no
state has been lost.
- KVM no longer requires nap_state_lost because it controls NVGPR
save/restore itself on the way in and out.
- The heavy idle wakeup KVM request check can be moved out of the
normal host idle code and into the not-performance-critical offline
code.
- KVM nap code now returns from where it is called, which makes the
flow a bit easier to follow.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Squash the KVM changes in]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-12 21:30:52 +07:00
	ld	r4,_LINK(r1)
2017-04-19 20:05:47 +07:00

	lhz	r11,PACA_IN_MCE(r13)
	subi	r11,r11,1
	sth	r11,PACA_IN_MCE(r13)

2019-04-12 21:30:52 +07:00
	mtlr	r4
	rlwinm	r10,r3,47-31,30,31
	cmpwi	cr1,r10,2
	bltlr	cr1	/* no state loss, return to idle caller */
	b	idle_return_gpr_loss
2017-04-19 20:05:47 +07:00
#endif
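The `rlwinm r10,r3,47-31,30,31` above extracts a 2-bit field from the saved SRR1. A small C model of the instruction may make the bit arithmetic easier to follow. It is a sketch for the non-wrapping mask case only (MB <= ME), and the helper name `srr1_wake_state` is hypothetical; the reading that values below 2 mean "no state loss" comes from the comment on the `bltlr` line.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by sh bits. */
static inline uint32_t rotl32(uint32_t x, unsigned int sh)
{
	sh &= 31;
	return sh ? (x << sh) | (x >> (32 - sh)) : x;
}

/*
 * Model of "rlwinm rA,rS,SH,MB,ME" for MB <= ME, using IBM bit
 * numbering (bit 0 is the most significant bit of the 32-bit word).
 */
static inline uint32_t rlwinm(uint32_t rs, unsigned int sh,
			      unsigned int mb, unsigned int me)
{
	uint32_t mask = (0xffffffffu >> mb) & (0xffffffffu << (31 - me));

	return rotl32(rs, sh) & mask;
}

/* Hypothetical helper: the 2-bit field tested by the idle wakeup path. */
static inline unsigned int srr1_wake_state(uint64_t srr1)
{
	return rlwinm((uint32_t)srr1, 47 - 31, 30, 31);
}
```

With these definitions the extraction is equivalent to `(srr1 >> 16) & 3`, so the `cmpwi cr1,r10,2` / `bltlr cr1` pair returns to the idle caller for field values 0 and 1 and otherwise branches to idle_return_gpr_loss.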
2016-09-21 14:43:31 +07:00
	/*
	 * Handle machine check early in real mode. We come here with
	 * ME=1, MMU (IR=0 and DR=0) off and using MC emergency stack.
	 */
EXC_COMMON_BEGIN(machine_check_handle_early)
	std	r0,GPR0(r1)	/* Save r0 */
	EXCEPTION_PROLOG_COMMON_3(0x200)
	bl	save_nvgprs
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	machine_check_early
	std	r3,RESULT(r1)	/* Save result */
	ld	r12,_MSR(r1)
2018-09-11 21:27:23 +07:00
BEGIN_FTR_SECTION
	b	4f
END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
2017-04-19 20:05:47 +07:00

2016-09-21 14:43:31 +07:00
#ifdef CONFIG_PPC_P7_NAP
	/*
	 * Check if thread was in power saving mode. We come here when any
	 * of the following is true:
	 * a. thread wasn't in power saving mode
	 * b. thread was in power saving mode with no state loss,
	 *    supervisor state loss or hypervisor state loss.
	 *
	 * Go back to nap/sleep/winkle mode again if (b) is true.
	 */
2017-04-19 20:05:47 +07:00
BEGIN_FTR_SECTION
	rlwinm.	r11,r12,47-31,30,31
2017-05-04 17:41:12 +07:00
	bne	machine_check_idle_common
2017-04-19 20:05:47 +07:00
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
2016-09-21 14:43:31 +07:00
#endif
2017-04-19 20:05:47 +07:00

2016-09-21 14:43:31 +07:00
	/*
	 * Check if we are coming from hypervisor userspace. If yes then we
	 * continue in host kernel in V mode to deliver the MC event.
	 */
	rldicl.	r11,r12,4,63		/* See if MC hit while in HV mode. */
	beq	5f
2018-09-11 21:27:23 +07:00
4:	andi.	r11,r12,MSR_PR		/* See if coming from user. */
2016-09-21 14:43:31 +07:00
	bne	9f			/* continue in V mode if we are. */

5:
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
2018-09-11 21:27:23 +07:00
BEGIN_FTR_SECTION
2016-09-21 14:43:31 +07:00
	/*
	 * We are coming from kernel context. Check if we are coming from
	 * guest. If yes, then we can continue. We will fall through
	 * do_kvm_200->kvmppc_interrupt to deliver the MC event to guest.
	 */
	lbz	r11,HSTATE_IN_GUEST(r13)
	cmpwi	r11,0			/* Check if coming from guest */
	bne	9f			/* continue if we are. */
2018-09-11 21:27:23 +07:00
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
2016-09-21 14:43:31 +07:00
#endif
	/*
	 * At this point we are not sure about what context we come from.
	 * Queue up the MCE event and return from the interrupt.
	 * But before that, check if this is an un-recoverable exception.
	 * If yes, then stay on emergency stack and panic.
	 */
	andi.	r11,r12,MSR_RI
	bne	2f
1:	mfspr	r11,SPRN_SRR0
	LOAD_HANDLER(r10,unrecover_mce)
	mtspr	SPRN_SRR0,r10
	ld	r10,PACAKMSR(r13)
	/*
	 * We are going down. But there are chances that we might get hit by
	 * another MCE during panic path and we may run into unstable state
	 * with no way out. Hence, turn ME bit off while going down, so that
	 * when another MCE is hit during panic path, system will checkstop
	 * and hypervisor will get restarted cleanly by SP.
	 */
	li	r3,MSR_ME
	andc	r10,r10,r3		/* Turn off MSR_ME */
	mtspr	SPRN_SRR1,r10
2018-01-09 23:07:15 +07:00
	RFI_TO_KERNEL
2016-09-21 14:43:31 +07:00
	b	.
2:
	/*
	 * Check if we have successfully handled/recovered from error, if not
	 * then stay on emergency stack and panic.
	 */
	ld	r3,RESULT(r1)	/* Load result */
	cmpdi	r3,0		/* see if we handled MCE successfully */

	beq	1b		/* if !handled then panic */
2018-09-11 21:27:23 +07:00
BEGIN_FTR_SECTION
2016-09-21 14:43:31 +07:00
	/*
	 * Return from MC interrupt.
	 * Queue up the MCE event so that we can log it later, while
	 * returning from kernel or opal call.
	 */
	bl	machine_check_queue_event
	MACHINE_CHECK_HANDLER_WINDUP
2018-01-09 23:07:15 +07:00
	RFI_TO_USER_OR_KERNEL
2018-09-11 21:27:23 +07:00
FTR_SECTION_ELSE
	/*
	 * pSeries: Return from MC interrupt. Before that stay on emergency
	 * stack and call machine_check_exception to log the MCE event.
	 */
	LOAD_HANDLER(r10,mce_return)
	mtspr	SPRN_SRR0,r10
	ld	r10,PACAKMSR(r13)
	mtspr	SPRN_SRR1,r10
	RFI_TO_KERNEL
	b	.
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
2016-09-21 14:43:31 +07:00
9:
	/* Deliver the machine check to host kernel in V mode. */
	MACHINE_CHECK_HANDLER_WINDUP
2018-09-11 21:27:23 +07:00
	SET_SCRATCH0(r13)		/* save r13 */
2019-06-22 20:15:19 +07:00
	EXCEPTION_PROLOG_0 PACA_EXMC
2018-09-11 21:27:23 +07:00
	b	machine_check_pSeries_0
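The branch structure of machine_check_handle_early is easier to review as a decision function. The following C sketch is one reading of the assembly above, not kernel code: the `mce_ctx` fields, the `mce_path` names and the function `mce_route` are hypothetical, the feature-section and config conditions are modelled as plain booleans, and the MSR bit positions are stated as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR bit positions assumed for 64-bit Book3S. */
#define MSR_RI	(1ULL << 1)	/* recoverable interrupt */
#define MSR_PR	(1ULL << 14)	/* problem (user) state */
#define MSR_HV	(1ULL << 60)	/* hypervisor state */

enum mce_path {			/* hypothetical names for the exits */
	MCE_IDLE_WAKEUP,	/* machine_check_idle_common */
	MCE_DELIVER_V_MODE,	/* label 9: via machine_check_pSeries_0 */
	MCE_PANIC,		/* label 1: unrecover_mce, MSR_ME off */
	MCE_QUEUE_AND_RETURN,	/* HVMODE: queue event, wind up, RFI */
	MCE_LOG_AND_RETURN	/* no HVMODE: mce_return */
};

struct mce_ctx {		/* hypothetical inputs */
	uint64_t msr;		/* saved SRR1, loaded from _MSR(r1) */
	bool hvmode;		/* CPU_FTR_HVMODE is set */
	bool idle_check;	/* CONFIG_PPC_P7_NAP, HVMODE and ARCH_206 */
	bool kvm_handler;	/* CONFIG_KVM_BOOK3S_64_HANDLER */
	bool in_guest;		/* HSTATE_IN_GUEST is non-zero */
	bool handled;		/* machine_check_early() returned non-zero */
};

static enum mce_path mce_route(const struct mce_ctx *c)
{
	bool hv = (c->msr & MSR_HV) != 0;	/* rldicl. r11,r12,4,63 */

	if (c->idle_check && ((c->msr >> 16) & 3))
		return MCE_IDLE_WAKEUP;
	/* Without HVMODE the HV test is skipped (b 4f). */
	if ((!c->hvmode || hv) && (c->msr & MSR_PR))
		return MCE_DELIVER_V_MODE;
	if (c->hvmode && c->kvm_handler && c->in_guest)
		return MCE_DELIVER_V_MODE;
	if (!(c->msr & MSR_RI) || !c->handled)
		return MCE_PANIC;
	return c->hvmode ? MCE_QUEUE_AND_RETURN : MCE_LOG_AND_RETURN;
}
```

For example, a context with `hvmode` set and `msr = MSR_HV | MSR_RI` that was handled successfully takes the queue-and-return path, while the same context with MSR_RI clear ends in the panic path regardless of the handler result.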
2016-09-21 14:43:31 +07:00

EXC_COMMON_BEGIN(unrecover_mce)
	/* Invoke machine_check_exception to print MCE event and panic. */
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	machine_check_exception
	/*
	 * We will not reach here. Even if we did, there is no way out. Call
	 * unrecoverable_exception and die.
	 */
1:	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	unrecoverable_exception
	b	1b

2018-09-11 21:27:00 +07:00
EXC_COMMON_BEGIN(mce_return)
	/* Invoke machine_check_exception to print MCE event and return. */
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	machine_check_exception
2018-09-11 21:27:23 +07:00
	MACHINE_CHECK_HANDLER_WINDUP
2018-09-11 21:27:00 +07:00
	RFI_TO_KERNEL
	b	.
2009-06-03 04:17:38 +07:00

2019-02-26 15:51:09 +07:00
EXC_REAL_BEGIN(data_access, 0x300, 0x80)
	SET_SCRATCH0(r13)		/* save r13 */
2019-06-22 20:15:19 +07:00
	EXCEPTION_PROLOG_0 PACA_EXGEN
2019-02-26 15:51:09 +07:00
	b	tramp_real_data_access
EXC_REAL_END(data_access, 0x300, 0x80)

TRAMP_REAL_BEGIN(tramp_real_data_access)
2019-06-22 20:15:16 +07:00
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXGEN, 1, 0x300, 0
2019-02-26 15:51:10 +07:00
	/*
	 * DAR/DSISR must be read before setting MSR[RI], because
	 * a d-side MCE will clobber those registers so is not
	 * recoverable if they are live.
	 */
	mfspr	r10,SPRN_DAR
	mfspr	r11,SPRN_DSISR
	std	r10,PACA_EXGEN+EX_DAR(r13)
	stw	r11,PACA_EXGEN+EX_DSISR(r13)
2019-06-22 20:15:13 +07:00
	EXCEPTION_PROLOG_2_REAL data_access_common, EXC_STD, 1
2019-02-26 15:51:09 +07:00

EXC_VIRT_BEGIN(data_access, 0x4300, 0x80)
	SET_SCRATCH0(r13)		/* save r13 */
2019-06-22 20:15:19 +07:00
	EXCEPTION_PROLOG_0 PACA_EXGEN
2019-06-22 20:15:16 +07:00
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXGEN, 0, 0x300, 0
2019-02-26 15:51:10 +07:00
	mfspr	r10,SPRN_DAR
	mfspr	r11,SPRN_DSISR
	std	r10,PACA_EXGEN+EX_DAR(r13)
	stw	r11,PACA_EXGEN+EX_DSISR(r13)
2019-06-22 20:15:13 +07:00
	EXCEPTION_PROLOG_2_VIRT data_access_common, EXC_STD
2019-02-26 15:51:09 +07:00
EXC_VIRT_END(data_access, 0x4300, 0x80)

2016-09-21 14:43:32 +07:00
TRAMP_KVM_SKIP(PACA_EXGEN, 0x300)

EXC_COMMON_BEGIN(data_access_common)
	/*
	 * Here r13 points to the paca, r9 contains the saved CR,
	 * SRR0 and SRR1 are saved in r11 and r12,
	 * r9 - r13 are saved in paca->exgen.
2019-02-26 15:51:10 +07:00
	 * EX_DAR and EX_DSISR have saved DAR/DSISR
2016-09-21 14:43:32 +07:00
	 */
	EXCEPTION_PROLOG_COMMON(0x300, PACA_EXGEN)
	RECONCILE_IRQ_STATE(r10, r11)
	ld	r12,_MSR(r1)
	ld	r3,PACA_EXGEN+EX_DAR(r13)
	lwz	r4,PACA_EXGEN+EX_DSISR(r13)
	li	r5,0x300
	std	r3,_DAR(r1)
	std	r4,_DSISR(r1)
BEGIN_MMU_FTR_SECTION
	b	do_hash_page		/* Try to handle as hpte fault */
MMU_FTR_SECTION_ELSE
	b	handle_page_fault
ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)

2009-06-03 04:17:38 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_BEGIN(data_access_slb, 0x380, 0x80)
	SET_SCRATCH0(r13)		/* save r13 */
	EXCEPTION_PROLOG_0 PACA_EXSLB
	b	tramp_real_data_access_slb
EXC_REAL_END(data_access_slb, 0x380, 0x80)

TRAMP_REAL_BEGIN(tramp_real_data_access_slb)
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXSLB, 1, 0x380, 0
	mfspr	r10,SPRN_DAR
	std	r10,PACA_EXSLB+EX_DAR(r13)
	EXCEPTION_PROLOG_2_REAL data_access_slb_common, EXC_STD, 1

EXC_VIRT_BEGIN(data_access_slb, 0x4380, 0x80)
	SET_SCRATCH0(r13)		/* save r13 */
	EXCEPTION_PROLOG_0 PACA_EXSLB
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXSLB, 0, 0x380, 0
	mfspr	r10,SPRN_DAR
	std	r10,PACA_EXSLB+EX_DAR(r13)
	EXCEPTION_PROLOG_2_VIRT data_access_slb_common, EXC_STD
EXC_VIRT_END(data_access_slb, 0x4380, 0x80)

TRAMP_KVM_SKIP(PACA_EXSLB, 0x380)

EXC_COMMON_BEGIN(data_access_slb_common)
	EXCEPTION_PROLOG_COMMON(0x380, PACA_EXSLB)
	ld	r4,PACA_EXSLB+EX_DAR(r13)
	std	r4,_DAR(r1)
	addi	r3,r1,STACK_FRAME_OVERHEAD
BEGIN_MMU_FTR_SECTION
	/* HPT case, do SLB fault */
	bl	do_slb_fault
	cmpdi	r3,0
	bne-	1f
	b	fast_exception_return
1:	/* Error case */
MMU_FTR_SECTION_ELSE
	/* Radix case, access is outside page table range */
	li	r3,-EFAULT
ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
	std	r3,RESULT(r1)
	bl	save_nvgprs
	RECONCILE_IRQ_STATE(r10, r11)
	ld	r4,_DAR(r1)
	ld	r5,RESULT(r1)
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	do_bad_slb_fault
	b	ret_from_except

EXC_REAL(instruction_access, 0x400, 0x80)
EXC_VIRT(instruction_access, 0x4400, 0x80, 0x400)
TRAMP_KVM(PACA_EXGEN, 0x400)

EXC_COMMON_BEGIN(instruction_access_common)
	EXCEPTION_PROLOG_COMMON(0x400, PACA_EXGEN)
	RECONCILE_IRQ_STATE(r10, r11)
	ld	r12,_MSR(r1)
	ld	r3,_NIP(r1)
powerpc/64s: Fix masking of SRR1 bits on instruction fault
On 64-bit Book3s, when we take an instruction fault the reason for the
fault may be reported in SRR1. For data faults the reason is reported
in DSISR (Data Storage Instruction Status Register).
The reasons reported in each do not necessarily correspond, so we mask
the SRR1 bits before copying them to the DSISR, which is then used by
the page fault code.
Prior to commit b4c001dc44f0 ("powerpc/mm: Use symbolic constants for
filtering SRR1 bits on ISIs") we used a hard-coded mask of 0x58200000,
which corresponds to:
DSISR_NOHPTE 0x40000000 /* no translation found */
DSISR_NOEXEC_OR_G 0x10000000 /* exec of no-exec or guarded */
DSISR_PROTFAULT 0x08000000 /* protection fault */
DSISR_KEYFAULT 0x00200000 /* Storage Key fault */
That commit added a #define for the mask, DSISR_SRR1_MATCH_64S, but
incorrectly used a different similarly named DSISR_BAD_FAULT_64S.
This had the effect of changing the mask to 0xa43a0000, which omits
everything but DSISR_KEYFAULT.
Luckily this had no visible effect, because in practice we hardly use
the DSISR bits. The lack of DSISR_NOHPTE means a TLB flush
optimisation was missed in the native HPTE code, and DSISR_NOEXEC_OR_G
and DSISR_PROTFAULT are both only used to trigger rare warnings.
So we got lucky, but let's fix it. The new value only has bits between
17 and 30 set, so we can continue to use andis.
Fixes: b4c001dc44f0 ("powerpc/mm: Use symbolic constants for filtering SRR1 bits on ISIs")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-14 11:48:47 +07:00
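The mask arithmetic in the commit message above can be checked with a small C sketch. The four DSISR bit values and the wrong mask 0xa43a0000 are taken from the message itself; the helper names `andis_high` and `fits_andis` are hypothetical and exist only for this illustration, and the model of `andis.` is a simplification that ignores the CR0 update.

```c
#include <assert.h>
#include <stdint.h>

/* DSISR bit values as listed in the commit message. */
#define DSISR_NOHPTE		0x40000000u	/* no translation found */
#define DSISR_NOEXEC_OR_G	0x10000000u	/* exec of no-exec or guarded */
#define DSISR_PROTFAULT		0x08000000u	/* protection fault */
#define DSISR_KEYFAULT		0x00200000u	/* storage key fault */

/* The intended mask: the OR of the four bits above. */
#define SRR1_MATCH_MASK	(DSISR_NOHPTE | DSISR_NOEXEC_OR_G | \
			 DSISR_PROTFAULT | DSISR_KEYFAULT)

/* The mask that was applied by mistake, per the commit message. */
#define WRONG_MASK	0xa43a0000u

/*
 * Simplified model of "andis. rD,rS,mask@h": AND the source with the
 * upper 16 bits of the 32-bit mask, shifted back into place.
 */
static uint64_t andis_high(uint64_t srr1, uint32_t mask)
{
	return srr1 & ((uint64_t)(mask >> 16) << 16);
}

/* Hypothetical helper: a mask suits a single andis. if its low half is 0. */
static int fits_andis(uint32_t mask)
{
	return (mask & 0xffffu) == 0;
}
```

With these definitions, `SRR1_MATCH_MASK` evaluates to the old hard-coded 0x58200000, and ANDing it with `WRONG_MASK` leaves only `DSISR_KEYFAULT`, which matches the message's description of what the bad mask let through.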
	andis.	r4,r12,DSISR_SRR1_MATCH_64S@h
	li	r5,0x400
	std	r3,_DAR(r1)
	std	r4,_DSISR(r1)
BEGIN_MMU_FTR_SECTION
	b	do_hash_page		/* Try to handle as hpte fault */
MMU_FTR_SECTION_ELSE
	b	handle_page_fault
ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)

__EXC_REAL(instruction_access_slb, 0x480, 0x80, PACA_EXSLB)
__EXC_VIRT(instruction_access_slb, 0x4480, 0x80, 0x480, PACA_EXSLB)
TRAMP_KVM(PACA_EXSLB, 0x480)

EXC_COMMON_BEGIN(instruction_access_slb_common)
	EXCEPTION_PROLOG_COMMON(0x480, PACA_EXSLB)
	ld	r4,_NIP(r1)
	addi	r3,r1,STACK_FRAME_OVERHEAD
BEGIN_MMU_FTR_SECTION
	/* HPT case, do SLB fault */
	bl	do_slb_fault
	cmpdi	r3,0
	bne-	1f
	b	fast_exception_return
1:	/* Error case */
MMU_FTR_SECTION_ELSE
	/* Radix case, access is outside page table range */
	li	r3,-EFAULT
ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
	std	r3,RESULT(r1)
	bl	save_nvgprs
	RECONCILE_IRQ_STATE(r10, r11)
	ld	r4,_NIP(r1)
	ld	r5,RESULT(r1)
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	do_bad_slb_fault
	b	ret_from_except

EXC_REAL_BEGIN(hardware_interrupt, 0x500, 0x100)
	SET_SCRATCH0(r13)		/* save r13 */
	EXCEPTION_PROLOG_0 PACA_EXGEN
BEGIN_FTR_SECTION
	EXCEPTION_PROLOG_1 EXC_HV, PACA_EXGEN, 1, 0x500, IRQS_DISABLED
	EXCEPTION_PROLOG_2_REAL hardware_interrupt_common, EXC_HV, 1
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:21:34 +07:00
FTR_SECTION_ELSE
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXGEN, 1, 0x500, IRQS_DISABLED
	EXCEPTION_PROLOG_2_REAL hardware_interrupt_common, EXC_STD, 1
powerpc, KVM: Split HVMODE_206 cpu feature bit into separate HV and architecture bits
This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to
indicate that we have a usable hypervisor mode, and another to indicate
that the processor conforms to PowerISA version 2.06. We also add
another bit to indicate that the processor conforms to ISA version 2.01
and set that for PPC970 and derivatives.
Some PPC970 chips (specifically those in Apple machines) have a
hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode
is not useful in the sense that there is no way to run any code in
supervisor mode (HV=0 PR=0). On these processors, the LPES0 and LPES1
bits in HID4 are always 0, and we use that as a way of detecting that
hypervisor mode is not useful.
Where we have a feature section in assembly code around code that
only applies on POWER7 in hypervisor mode, we use a construct like
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
The definition of END_FTR_SECTION_IFSET is such that the code will
be enabled (not overwritten with nops) only if all bits in the
provided mask are set.
Note that the CPU feature check in __tlbie() only needs to check the
ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called
if we are running bare-metal, i.e. in hypervisor mode.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 07:26:11 +07:00
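The "all bits in the provided mask are set" rule described in the commit message above can be sketched in C as follows. The bit positions chosen for the two feature flags are hypothetical, picked only for illustration, and `section_enabled_ifset` is an invented name that models the condition, not the kernel's feature-fixup implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define EX_FTR_HVMODE	(1ull << 0)
#define EX_FTR_ARCH_206	(1ull << 1)

/*
 * Model of the IFSET condition as the commit message states it: the
 * section stays enabled only if every bit of the mask is present in
 * the CPU's feature word. A plain (features & mask) != 0 test would
 * be wrong for a multi-bit mask.
 */
static int section_enabled_ifset(uint64_t cpu_features, uint64_t mask)
{
	return (cpu_features & mask) == mask;
}
```

Under this model a processor that conforms to ISA 2.06 but has no usable hypervisor mode would not run code guarded by the combined mask, and neither would a processor with hypervisor mode that predates 2.06.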
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
EXC_REAL_END(hardware_interrupt, 0x500, 0x100)

EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x100)
	SET_SCRATCH0(r13)		/* save r13 */
	EXCEPTION_PROLOG_0 PACA_EXGEN
BEGIN_FTR_SECTION
	EXCEPTION_PROLOG_1 EXC_HV, PACA_EXGEN, 1, 0x500, IRQS_DISABLED
	EXCEPTION_PROLOG_2_VIRT hardware_interrupt_common, EXC_HV
FTR_SECTION_ELSE
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXGEN, 1, 0x500, IRQS_DISABLED
	EXCEPTION_PROLOG_2_VIRT hardware_interrupt_common, EXC_STD
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)

TRAMP_KVM(PACA_EXGEN, 0x500)
TRAMP_KVM_HV(PACA_EXGEN, 0x500)
EXC_COMMON_ASYNC(hardware_interrupt_common, 0x500, do_IRQ)


EXC_REAL_BEGIN(alignment, 0x600, 0x100)
	SET_SCRATCH0(r13)		/* save r13 */
	EXCEPTION_PROLOG_0 PACA_EXGEN
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXGEN, 1, 0x600, 0
	mfspr	r10,SPRN_DAR
	mfspr	r11,SPRN_DSISR
	std	r10,PACA_EXGEN+EX_DAR(r13)
	stw	r11,PACA_EXGEN+EX_DSISR(r13)
	EXCEPTION_PROLOG_2_REAL alignment_common, EXC_STD, 1
EXC_REAL_END(alignment, 0x600, 0x100)

EXC_VIRT_BEGIN(alignment, 0x4600, 0x100)
	SET_SCRATCH0(r13)		/* save r13 */
	EXCEPTION_PROLOG_0 PACA_EXGEN
	EXCEPTION_PROLOG_1 EXC_STD, PACA_EXGEN, 0, 0x600, 0
	mfspr	r10,SPRN_DAR
	mfspr	r11,SPRN_DSISR
	std	r10,PACA_EXGEN+EX_DAR(r13)
	stw	r11,PACA_EXGEN+EX_DSISR(r13)
	EXCEPTION_PROLOG_2_VIRT alignment_common, EXC_STD
EXC_VIRT_END(alignment, 0x4600, 0x100)

TRAMP_KVM(PACA_EXGEN, 0x600)
EXC_COMMON_BEGIN(alignment_common)
	EXCEPTION_PROLOG_COMMON(0x600, PACA_EXGEN)
	ld	r3,PACA_EXGEN+EX_DAR(r13)
	lwz	r4,PACA_EXGEN+EX_DSISR(r13)
	std	r3,_DAR(r1)
	std	r4,_DSISR(r1)
	bl	save_nvgprs
	RECONCILE_IRQ_STATE(r10, r11)
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	alignment_exception
	b	ret_from_except

EXC_REAL(program_check, 0x700, 0x100)
EXC_VIRT(program_check, 0x4700, 0x100, 0x700)
TRAMP_KVM(PACA_EXGEN, 0x700)
EXC_COMMON_BEGIN(program_check_common)
powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
When using transactional memory (TM), the CPU can be in one of six
states as far as TM is concerned, encoded in the Machine State
Register (MSR). Certain state transitions are illegal and if attempted
trigger a "TM Bad Thing" type program check exception.
If we ever hit one of these exceptions it's treated as a bug, ie. we
oops, and kill the process and/or panic, depending on configuration.
One case where we can trigger a TM Bad Thing, is when returning to
userspace after a system call or interrupt, using RFID. When this
happens the CPU first restores the user register state, in particular
r1 (the stack pointer) and then attempts to update the MSR. However
the MSR update is not allowed and so we take the program check with
the user register state, but the kernel MSR.
This tricks the exception entry code into thinking we have a bad
kernel stack pointer, because the MSR says we're coming from the
kernel, but r1 is pointing to userspace.
To avoid this we instead always switch to the emergency stack if we
take a TM Bad Thing from the kernel. That way none of the user
register values are used, other than for printing in the oops message.
This is the fix for CVE-2017-1000255.
Fixes: 5d176f751ee3 ("powerpc: tm: Enable transactional memory (TM) lazily for userspace")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Rewrite change log & comments, tweak asm slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-17 17:42:26 +07:00
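The decision described above, switching to the emergency stack only when SRR1 reports a TM Bad Thing taken from the kernel, can be sketched in C. The value 0x200000 for SRR1_PROGTM is the literal used by the handler's assembly; the value of MSR_PR (bit 14) is an assumption stated from memory of the kernel headers, and `use_emergency_stack` is a hypothetical name used only here.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_PR		0x4000ull	/* assumed: problem state bit, 1 << 14 */
#define SRR1_PROGTM	0x00200000ull	/* literal used by the handler's asm */

/*
 * Mirror of the asm sequence: mask SRR1 with MSR_PR | SRR1_PROGTM, shift
 * right by 8 so the comparison value fits the 16-bit unsigned immediate
 * of cmpldi, then compare. The result is true only for PR=0 with
 * SRR1_PROGTM=1, i.e. a TM Bad Thing taken while MSR said "kernel".
 */
static int use_emergency_stack(uint64_t srr1)
{
	uint64_t masked = srr1 & (MSR_PR | SRR1_PROGTM);

	return (masked >> 8) == (SRR1_PROGTM >> 8);
}
```

The shift loses nothing because both mask bits sit above bit 7, so the shifted comparison is equivalent to comparing the unshifted masked value with `SRR1_PROGTM`.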
	/*
	 * It's possible to receive a TM Bad Thing type program check with
	 * userspace register values (in particular r1), but with SRR1 reporting
	 * that we came from the kernel. Normally that would confuse the bad
	 * stack logic, and we would report a bad kernel stack pointer. Instead
	 * we switch to the emergency stack if we're taking a TM Bad Thing from
	 * the kernel.
	 */
	li	r10,MSR_PR		/* Build a mask of MSR_PR ..	*/
	oris	r10,r10,0x200000@h	/* .. and SRR1_PROGTM		*/
	and	r10,r10,r12		/* Mask SRR1 with that.		*/
	srdi	r10,r10,8		/* Shift it so we can compare	*/
	cmpldi	r10,(0x200000 >> 8)	/* .. with an immediate.	*/
	bne	1f			/* If != go to normal path.	*/

	/* SRR1 had PR=0 and SRR1_PROGTM=1, so use the emergency stack	*/
	andi.	r10,r12,MSR_PR;		/* Set CR0 correctly for label	*/
					/* 3 in EXCEPTION_PROLOG_COMMON	*/
	mr	r10,r1			/* Save r1			*/
	ld	r1,PACAEMERGSP(r13)	/* Use emergency stack		*/
	subi	r1,r1,INT_FRAME_SIZE	/* alloc stack frame		*/
	b	3f			/* Jump into the macro !!	*/
1:	EXCEPTION_PROLOG_COMMON(0x700, PACA_EXGEN)
	bl	save_nvgprs
	RECONCILE_IRQ_STATE(r10, r11)
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	program_check_exception
	b	ret_from_except

EXC_REAL(fp_unavailable, 0x800, 0x100)
EXC_VIRT(fp_unavailable, 0x4800, 0x100, 0x800)
TRAMP_KVM(PACA_EXGEN, 0x800)
EXC_COMMON_BEGIN(fp_unavailable_common)
	EXCEPTION_PROLOG_COMMON(0x800, PACA_EXGEN)
	bne	1f			/* if from user, just load it up */
	bl	save_nvgprs
	RECONCILE_IRQ_STATE(r10, r11)
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	kernel_fp_unavailable_exception
	BUG_OPCODE
1:
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
BEGIN_FTR_SECTION
	/* Test if 2 TM state bits are zero. If non-zero (ie. userspace was in
	 * transaction), go do TM stuff
	 */
	rldicl.	r0, r12, (64-MSR_TS_LG), (64-2)
	bne-	2f
END_FTR_SECTION_IFSET(CPU_FTR_TM)
#endif
	bl	load_up_fpu
	b	fast_exception_return
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
2:	/* User process was in a transaction */
	bl	save_nvgprs
	RECONCILE_IRQ_STATE(r10, r11)
	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	fp_unavailable_tm
	b	ret_from_except
#endif

EXC_REAL_OOL_MASKABLE(decrementer, 0x900, 0x80, IRQS_DISABLED)
EXC_VIRT_MASKABLE(decrementer, 0x4900, 0x80, 0x900, IRQS_DISABLED)
TRAMP_KVM(PACA_EXGEN, 0x900)
EXC_COMMON_ASYNC(decrementer_common, 0x900, timer_interrupt)

powerpc: Fix "attempt to move .org backwards" error
Building a 64-bit powerpc kernel with PR KVM enabled currently gives
this error:
AS arch/powerpc/kernel/head_64.o
arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:258: Error: attempt to move .org backwards
make[2]: *** [arch/powerpc/kernel/head_64.o] Error 1
This happens because the MASKABLE_EXCEPTION_PSERIES macro turns into
33 instructions, but we only have space for 32 at the decrementer
interrupt vector (from 0x900 to 0x980).
In the code generated by the MASKABLE_EXCEPTION_PSERIES macro, we
currently have two instances of the HMT_MEDIUM macro, which has the
effect of setting the SMT thread priority to medium. One is the
first instruction, and is overwritten by a no-op on processors where
we save the PPR (processor priority register), that is, POWER7 or
later. The other is after we have saved the PPR.
In order to reduce the code at 0x900 by one instruction, we omit the
first HMT_MEDIUM. On processors without SMT this will have no effect
since HMT_MEDIUM is a no-op there. On POWER5 and RS64 machines this
will mean that the first few instructions take a little longer in the
case where a decrementer interrupt occurs when the hardware thread is
running at low SMT priority. On POWER6 and later machines, the
hardware automatically boosts the thread priority when a decrementer
interrupt is taken if the thread priority was below medium, so this
change won't make any difference.
The alternative would be to branch out of line after saving the CFAR.
However, that would incur an extra overhead on all processors, whereas
the approach adopted here only adds overhead on older threaded processors.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 00:51:40 +07:00
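The space arithmetic behind the ".org backwards" error in the message above is easy to verify. The sketch below assumes fixed 4-byte instructions, which holds for the processors discussed; the function names are hypothetical.

```c
#include <assert.h>

#define INSN_BYTES 4	/* fixed-width 32-bit instructions */

/* Number of instructions that fit between two vector addresses. */
static unsigned int vector_capacity(unsigned int start, unsigned int end)
{
	return (end - start) / INSN_BYTES;
}

/* True if a handler of the given length fits without overrunning 'end'. */
static int handler_fits(unsigned int start, unsigned int end,
			unsigned int insns)
{
	return insns <= vector_capacity(start, end);
}
```

The slot from 0x900 to 0x980 is 0x80 bytes, or 32 instructions, so the 33-instruction expansion overruns it by exactly one instruction. That is why dropping a single HMT_MEDIUM was enough.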
EXC_REAL_HV(hdecrementer, 0x980, 0x80)
EXC_VIRT_HV(hdecrementer, 0x4980, 0x80, 0x980)
TRAMP_KVM_HV(PACA_EXGEN, 0x980)
EXC_COMMON(hdecrementer_common, 0x980, hdec_interrupt)

EXC_REAL_MASKABLE(doorbell_super, 0xa00, 0x100, IRQS_DISABLED)
EXC_VIRT_MASKABLE(doorbell_super, 0x4a00, 0x100, 0xa00, IRQS_DISABLED)
TRAMP_KVM(PACA_EXGEN, 0xa00)
#ifdef CONFIG_PPC_DOORBELL
EXC_COMMON_ASYNC(doorbell_super_common, 0xa00, doorbell_exception)
#else
EXC_COMMON_ASYNC(doorbell_super_common, 0xa00, unknown_exception)
#endif

EXC_REAL(trap_0b, 0xb00, 0x100)
EXC_VIRT(trap_0b, 0x4b00, 0x100, 0xb00)
TRAMP_KVM(PACA_EXGEN, 0xb00)
EXC_COMMON(trap_0b_common, 0xb00, unknown_exception)

/*
 * system call / hypercall (0xc00, 0x4c00)
 *
 * The system call exception is invoked with "sc 0" and does not alter HV bit.
 * There is support for kernel code to invoke system calls but there are no
 * in-tree users.
 *
 * The hypercall is invoked with "sc 1" and sets HV=1.
 *
 * In HPT, sc 1 always goes to 0xc00 real mode. In RADIX, sc 1 can go to
 * 0x4c00 virtual mode.
 *
 * Call convention:
 *
 * syscall register convention is in Documentation/powerpc/syscall64-abi.txt
 *
 * For hypercalls, the register convention is as follows:
 * r0              volatile
 * r1-2            nonvolatile
 * r3              volatile parameter and return value for status
 * r4-r10          volatile input and output value
 * r11             volatile hypercall number and output value
 * r12             volatile input and output value
 * r13-r31         nonvolatile
 * LR              nonvolatile
 * CTR             volatile
 * XER             volatile
 * CR0-1 CR5-7     volatile
 * CR2-4           nonvolatile
 * Other registers nonvolatile
 *
 * The intersection of volatile registers that don't contain possible
 * inputs is: cr0, xer, ctr. We may use these as scratch regs upon entry
 * without saving, though xer is not a good idea to use, as hardware may
 * interpret some bits so it may be costly to change them.
 */
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
/*
 * There is a little bit of juggling to get syscall and hcall
 * working well. Save r13 in ctr to avoid using SPRG scratch
 * register.
 *
 * Userspace syscalls have already saved the PPR, hcalls must save
 * it before setting HMT_MEDIUM.
 */
#define SYSCALL_KVMTEST \
	mtctr	r13; \
	GET_PACA(r13); \
	std	r10,PACA_EXGEN+EX_R10(r13); \
	INTERRUPT_TO_KERNEL; \
	KVMTEST EXC_STD 0xc00 ; /* uses r10, branch to do_kvm_0xc00_system_call */ \
	HMT_MEDIUM; \
	mfctr	r9;

#else
#define SYSCALL_KVMTEST \
	HMT_MEDIUM; \
	mr	r9,r13; \
	GET_PACA(r13); \
	INTERRUPT_TO_KERNEL;
#endif

#define LOAD_SYSCALL_HANDLER(reg) \
	__LOAD_HANDLER(reg, system_call_common)

/*
 * After SYSCALL_KVMTEST, we reach here with PACA in r13, r13 in r9,
 * and HMT_MEDIUM.
 */
#define SYSCALL_REAL \
	mfspr	r11,SPRN_SRR0 ; \
	mfspr	r12,SPRN_SRR1 ; \
	LOAD_SYSCALL_HANDLER(r10) ; \
	mtspr	SPRN_SRR0,r10 ; \
	ld	r10,PACAKMSR(r13) ; \
	mtspr	SPRN_SRR1,r10 ; \
	RFI_TO_KERNEL ; \
	b	. ;	/* prevent speculative execution */

#ifdef CONFIG_PPC_FAST_ENDIAN_SWITCH
#define SYSCALL_FASTENDIAN_TEST \
BEGIN_FTR_SECTION \
	cmpdi	r0,0x1ebe ; \
	beq-	1f ; \
END_FTR_SECTION_IFSET(CPU_FTR_REAL_LE) \

#define SYSCALL_FASTENDIAN \
	/* Fast LE/BE switch system call */ \
1:	mfspr	r12,SPRN_SRR1 ; \
	xori	r12,r12,MSR_LE ; \
	mtspr	SPRN_SRR1,r12 ; \
	mr	r13,r9 ; \
	RFI_TO_USER ;	/* return to userspace */ \
	b	. ;	/* prevent speculative execution */
#else
#define SYSCALL_FASTENDIAN_TEST
#define SYSCALL_FASTENDIAN
#endif /* CONFIG_PPC_FAST_ENDIAN_SWITCH */


#if defined(CONFIG_RELOCATABLE)
	/*
	 * We can't branch directly so we do it via the CTR which
	 * is volatile across system calls.
	 */
#define SYSCALL_VIRT \
	LOAD_SYSCALL_HANDLER(r10) ; \
	mtctr	r10 ; \
	mfspr	r11,SPRN_SRR0 ; \
	mfspr	r12,SPRN_SRR1 ; \
	li	r10,MSR_RI ; \
	mtmsrd	r10,1 ; \
	bctr ;
#else
	/* We can branch directly */
#define SYSCALL_VIRT \
	mfspr	r11,SPRN_SRR0 ; \
	mfspr	r12,SPRN_SRR1 ; \
	li	r10,MSR_RI ; \
	mtmsrd	r10,1 ;			/* Set RI (EE=0) */ \
	b	system_call_common ;
#endif

EXC_REAL_BEGIN(system_call, 0xc00, 0x100)
	SYSCALL_KVMTEST /* loads PACA into r13, and saves r13 to r9 */
	SYSCALL_FASTENDIAN_TEST
	SYSCALL_REAL
	SYSCALL_FASTENDIAN
EXC_REAL_END(system_call, 0xc00, 0x100)

EXC_VIRT_BEGIN(system_call, 0x4c00, 0x100)
	SYSCALL_KVMTEST /* loads PACA into r13, and saves r13 to r9 */
	SYSCALL_FASTENDIAN_TEST
	SYSCALL_VIRT
	SYSCALL_FASTENDIAN
EXC_VIRT_END(system_call, 0x4c00, 0x100)

#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
	/*
	 * This is a hcall, so register convention is as above, with these
	 * differences:
	 * r13 = PACA
	 * ctr = orig r13
	 * orig r10 saved in PACA
	 */
TRAMP_KVM_BEGIN(do_kvm_0xc00)
	/*
	 * Save the PPR (on systems that support it) before changing to
	 * HMT_MEDIUM. That allows the KVM code to save that value into the
	 * guest state (it is the guest's PPR value).
	 */
	OPT_GET_SPR(r10, SPRN_PPR, CPU_FTR_HAS_PPR)
	HMT_MEDIUM
	OPT_SAVE_REG_TO_PACA(PACA_EXGEN+EX_PPR, r10, CPU_FTR_HAS_PPR)
	mfctr	r10
	SET_SCRATCH0(r10)
	std	r9,PACA_EXGEN+EX_R9(r13)
	mfcr	r9
	KVM_HANDLER PACA_EXGEN, EXC_STD, 0xc00, 0
#endif


EXC_REAL(single_step, 0xd00, 0x100)
EXC_VIRT(single_step, 0x4d00, 0x100, 0xd00)
TRAMP_KVM(PACA_EXGEN, 0xd00)
EXC_COMMON(single_step_common, 0xd00, single_step_exception)

EXC_REAL_OOL_HV(h_data_storage, 0xe00, 0x20)
EXC_VIRT_OOL_HV(h_data_storage, 0x4e00, 0x20, 0xe00)
TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0xe00)
EXC_COMMON_BEGIN(h_data_storage_common)
	mfspr	r10,SPRN_HDAR
	std	r10,PACA_EXGEN+EX_DAR(r13)
	mfspr	r10,SPRN_HDSISR
	stw	r10,PACA_EXGEN+EX_DSISR(r13)
	EXCEPTION_PROLOG_COMMON(0xe00, PACA_EXGEN)
	bl	save_nvgprs
	RECONCILE_IRQ_STATE(r10, r11)
	addi	r3,r1,STACK_FRAME_OVERHEAD
BEGIN_MMU_FTR_SECTION
	ld	r4,PACA_EXGEN+EX_DAR(r13)
	lwz	r5,PACA_EXGEN+EX_DSISR(r13)
	std	r4,_DAR(r1)
	std	r5,_DSISR(r1)
	li	r5,SIGSEGV
	bl	bad_page_fault
MMU_FTR_SECTION_ELSE
	bl	unknown_exception
ALT_MMU_FTR_SECTION_END_IFSET(MMU_FTR_TYPE_RADIX)
	b	ret_from_except

powerpc: Save CFAR before branching in interrupt entry paths
Some of the interrupt vectors on 64-bit POWER server processors are
only 32 bytes long, which is not enough for the full first-level
interrupt handler. For these we currently just have a branch to an
out-of-line handler. However, this means that we corrupt the CFAR
(come-from address register) on POWER7 and later processors.
To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces:
EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR
is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We
then put EXCEPTION_PROLOG_0 in the short interrupt vectors before
we branch to the out-of-line handler, which contains the rest of the
first-level interrupt handler. To facilitate this, we define new
_OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc.
In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more
than 6 instructions, it was necessary to move the stores that move
the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and
to get rid of one of the two HMT_MEDIUM instructions. Previously
there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was
nop'd out on processors with the PPR (POWER7 and later), and then
another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside
__EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR.
Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally
and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although
this leaves it in for the interrupt vectors where there is room for
it.
Previously we had a handler for hypervisor maintenance interrupts at
0xe50, which doesn't leave enough room for the vector for hypervisor
emulation assist interrupts at 0xe40, since we need 8 instructions.
The 0xe50 vector was only used on POWER6, as the HMI vector was moved
to 0xe60 on POWER7. Since we don't support running in hypervisor mode
on POWER6, we just remove the handler at 0xe50.
This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0
instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD
from the relocation-on vectors (since any CPU that supports
relocation-on interrupts also has the PPR).
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-05 01:10:15 +07:00
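The size constraint in the commit message above comes down to simple arithmetic: PowerPC instructions are a fixed 4 bytes, so a 0x20-byte vector holds 8 of them. A minimal sketch in C, where `INSN_BYTES` and `vector_slots` are illustrative names that do not exist in the kernel:

```c
#include <stdint.h>

/* Every PowerPC instruction is a fixed 4 bytes wide. */
#define INSN_BYTES 4u

/* Number of instructions that fit between a vector and the next one. */
static inline uint32_t vector_slots(uint32_t vec, uint32_t next_vec)
{
    return (next_vec - vec) / INSN_BYTES;
}
```

With the old HMI handler at 0xe50, `vector_slots(0xe40, 0xe50)` gives only 4 slots for the emulation assist vector, short of the 8 the commit message says are needed. With the next vector at 0xe60 the result is 8.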
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_OOL_HV(h_instr_storage, 0xe20, 0x20)
|
2017-02-14 13:18:29 +07:00
|
|
|
EXC_VIRT_OOL_HV(h_instr_storage, 0x4e20, 0x20, 0xe20)
|
2016-09-21 14:43:47 +07:00
|
|
|
TRAMP_KVM_HV(PACA_EXGEN, 0xe20)
|
|
|
|
EXC_COMMON(h_instr_storage_common, 0xe20, unknown_exception)
|
|
|
|
|
2013-02-05 01:10:15 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_OOL_HV(emulation_assist, 0xe40, 0x20)
|
|
|
|
EXC_VIRT_OOL_HV(emulation_assist, 0x4e40, 0x20, 0xe40)
|
2016-09-21 14:43:48 +07:00
|
|
|
TRAMP_KVM_HV(PACA_EXGEN, 0xe40)
|
|
|
|
EXC_COMMON(emulation_assist_common, 0xe40, emulation_assist_interrupt)
|
|
|
|
|
2013-02-05 01:10:15 +07:00
|
|
|
|
2016-09-21 14:44:07 +07:00
|
|
|
/*
|
|
|
|
* hmi_exception trampoline is a special case. It jumps to hmi_exception_early
|
|
|
|
* first, and then eventually from there to the trampoline to get into virtual
|
|
|
|
* mode.
|
|
|
|
*/
|
2016-12-06 08:41:12 +07:00
|
|
|
__EXC_REAL_OOL_HV_DIRECT(hmi_exception, 0xe60, 0x20, hmi_exception_early)
|
2017-12-20 10:55:52 +07:00
|
|
|
__TRAMP_REAL_OOL_MASKABLE_HV(hmi_exception, 0xe60, IRQS_DISABLED)
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_VIRT_NONE(0x4e60, 0x20)
|
2016-09-21 14:43:49 +07:00
|
|
|
TRAMP_KVM_HV(PACA_EXGEN, 0xe60)
|
|
|
|
TRAMP_REAL_BEGIN(hmi_exception_early)
|
2019-06-22 20:15:16 +07:00
|
|
|
EXCEPTION_PROLOG_1 EXC_HV, PACA_EXGEN, 1, 0xe60, 0
|
powerpc/64s: Exception macro for stack frame and initial register save
This code is common to a few exceptions, and another user will be added.
This causes a trivial change to generated code:
- 604: std r9,416(r1)
- 608: mfspr r11,314
- 60c: std r11,368(r1)
- 610: mfspr r12,315
+ 604: mfspr r11,314
+ 608: mfspr r12,315
+ 60c: std r9,416(r1)
+ 610: std r11,368(r1)
machine_check_powernv_early could also use this, but that requires non
trivial changes to generated code, so that's for another patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-20 01:30:03 +07:00
|
|
|
mr r10,r1 /* Save r1 */
|
|
|
|
ld r1,PACAEMERGSP(r13) /* Use emergency stack for realmode */
|
2016-09-21 14:43:49 +07:00
|
|
|
subi r1,r1,INT_FRAME_SIZE /* alloc stack frame */
|
|
|
|
mfspr r11,SPRN_HSRR0 /* Save HSRR0 */
|
2016-12-20 01:30:03 +07:00
|
|
|
mfspr r12,SPRN_HSRR1 /* Save HSRR1 */
|
|
|
|
EXCEPTION_PROLOG_COMMON_1()
|
2019-04-18 13:51:24 +07:00
|
|
|
/* We don't touch AMR here, we never go to virtual mode */
|
2016-09-21 14:43:49 +07:00
|
|
|
EXCEPTION_PROLOG_COMMON_2(PACA_EXGEN)
|
|
|
|
EXCEPTION_PROLOG_COMMON_3(0xe60)
|
|
|
|
addi r3,r1,STACK_FRAME_OVERHEAD
|
2018-10-08 11:08:31 +07:00
|
|
|
BRANCH_LINK_TO_FAR(DOTSYM(hmi_exception_realmode)) /* Function call ABI */
|
2017-09-15 12:25:48 +07:00
|
|
|
cmpdi cr0,r3,0
|
|
|
|
|
2016-09-21 14:43:49 +07:00
|
|
|
/* Windup the stack. */
|
|
|
|
/* Move original HSRR0 and HSRR1 into the respective regs */
|
|
|
|
ld r9,_MSR(r1)
|
|
|
|
mtspr SPRN_HSRR1,r9
|
|
|
|
ld r3,_NIP(r1)
|
|
|
|
mtspr SPRN_HSRR0,r3
|
|
|
|
ld r9,_CTR(r1)
|
|
|
|
mtctr r9
|
|
|
|
ld r9,_XER(r1)
|
|
|
|
mtxer r9
|
|
|
|
ld r9,_LINK(r1)
|
|
|
|
mtlr r9
|
|
|
|
REST_GPR(0, r1)
|
|
|
|
REST_8GPRS(2, r1)
|
|
|
|
REST_GPR(10, r1)
|
|
|
|
ld r11,_CCR(r1)
|
2017-09-15 12:25:48 +07:00
|
|
|
REST_2GPRS(12, r1)
|
|
|
|
bne 1f
|
2016-09-21 14:43:49 +07:00
|
|
|
mtcr r11
|
|
|
|
REST_GPR(11, r1)
|
2017-09-15 12:25:48 +07:00
|
|
|
ld r1,GPR1(r1)
|
2018-01-09 23:07:15 +07:00
|
|
|
HRFI_TO_USER_OR_KERNEL
|
2017-09-15 12:25:48 +07:00
|
|
|
|
|
|
|
1: mtcr r11
|
|
|
|
REST_GPR(11, r1)
|
2016-09-21 14:43:49 +07:00
|
|
|
ld r1,GPR1(r1)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Go to virtual mode and pull the HMI event information from
|
|
|
|
* firmware.
|
|
|
|
*/
|
|
|
|
.globl hmi_exception_after_realmode
|
|
|
|
hmi_exception_after_realmode:
|
|
|
|
SET_SCRATCH0(r13)
|
2019-06-22 20:15:19 +07:00
|
|
|
EXCEPTION_PROLOG_0 PACA_EXGEN
|
2016-09-21 14:43:49 +07:00
|
|
|
b tramp_real_hmi_exception
|
|
|
|
|
2017-09-15 12:25:48 +07:00
|
|
|
EXC_COMMON_BEGIN(hmi_exception_common)
|
2019-06-22 20:15:21 +07:00
|
|
|
EXCEPTION_COMMON(PACA_EXGEN, 0xe60)
|
|
|
|
FINISH_NAP
|
|
|
|
bl save_nvgprs
|
|
|
|
RECONCILE_IRQ_STATE(r10, r11)
|
|
|
|
RUNLATCH_ON
|
2019-06-22 20:15:20 +07:00
|
|
|
addi r3,r1,STACK_FRAME_OVERHEAD
|
|
|
|
bl handle_hmi_exception
|
|
|
|
b ret_from_except
|
2013-02-05 01:10:15 +07:00
|
|
|
|
2017-12-20 10:55:52 +07:00
|
|
|
EXC_REAL_OOL_MASKABLE_HV(h_doorbell, 0xe80, 0x20, IRQS_DISABLED)
|
|
|
|
EXC_VIRT_OOL_MASKABLE_HV(h_doorbell, 0x4e80, 0x20, 0xe80, IRQS_DISABLED)
|
2016-09-21 14:43:50 +07:00
|
|
|
TRAMP_KVM_HV(PACA_EXGEN, 0xe80)
|
|
|
|
#ifdef CONFIG_PPC_DOORBELL
|
|
|
|
EXC_COMMON_ASYNC(h_doorbell_common, 0xe80, doorbell_exception)
|
|
|
|
#else
|
|
|
|
EXC_COMMON_ASYNC(h_doorbell_common, 0xe80, unknown_exception)
|
|
|
|
#endif
|
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
|
2017-12-20 10:55:52 +07:00
|
|
|
EXC_REAL_OOL_MASKABLE_HV(h_virt_irq, 0xea0, 0x20, IRQS_DISABLED)
|
|
|
|
EXC_VIRT_OOL_MASKABLE_HV(h_virt_irq, 0x4ea0, 0x20, 0xea0, IRQS_DISABLED)
|
2016-09-21 14:43:51 +07:00
|
|
|
TRAMP_KVM_HV(PACA_EXGEN, 0xea0)
|
|
|
|
EXC_COMMON_ASYNC(h_virt_irq_common, 0xea0, do_IRQ)
|
|
|
|
|
2016-07-08 13:37:06 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_NONE(0xec0, 0x20)
|
|
|
|
EXC_VIRT_NONE(0x4ec0, 0x20)
|
|
|
|
EXC_REAL_NONE(0xee0, 0x20)
|
|
|
|
EXC_VIRT_NONE(0x4ee0, 0x20)
|
2016-09-21 14:43:52 +07:00
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
|
2017-12-20 10:55:53 +07:00
|
|
|
EXC_REAL_OOL_MASKABLE(performance_monitor, 0xf00, 0x20, IRQS_PMI_DISABLED)
|
|
|
|
EXC_VIRT_OOL_MASKABLE(performance_monitor, 0x4f00, 0x20, 0xf00, IRQS_PMI_DISABLED)
|
2016-09-21 14:43:53 +07:00
|
|
|
TRAMP_KVM(PACA_EXGEN, 0xf00)
|
|
|
|
EXC_COMMON_ASYNC(performance_monitor_common, 0xf00, performance_monitor_exception)
|
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_OOL(altivec_unavailable, 0xf20, 0x20)
|
|
|
|
EXC_VIRT_OOL(altivec_unavailable, 0x4f20, 0x20, 0xf20)
|
2016-09-21 14:43:54 +07:00
|
|
|
TRAMP_KVM(PACA_EXGEN, 0xf20)
|
|
|
|
EXC_COMMON_BEGIN(altivec_unavailable_common)
|
|
|
|
EXCEPTION_PROLOG_COMMON(0xf20, PACA_EXGEN)
|
|
|
|
#ifdef CONFIG_ALTIVEC
|
|
|
|
BEGIN_FTR_SECTION
|
|
|
|
beq 1f
|
|
|
|
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
|
|
|
|
BEGIN_FTR_SECTION_NESTED(69)
|
|
|
|
/* Test if 2 TM state bits are zero. If non-zero (ie. userspace was in
|
|
|
|
* transaction), go do TM stuff
|
|
|
|
*/
|
|
|
|
rldicl. r0, r12, (64-MSR_TS_LG), (64-2)
|
|
|
|
bne- 2f
|
|
|
|
END_FTR_SECTION_NESTED(CPU_FTR_TM, CPU_FTR_TM, 69)
|
|
|
|
#endif
|
|
|
|
bl load_up_altivec
|
|
|
|
b fast_exception_return
|
|
|
|
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
|
|
|
|
2: /* User process was in a transaction */
|
|
|
|
bl save_nvgprs
|
|
|
|
RECONCILE_IRQ_STATE(r10, r11)
|
|
|
|
addi r3,r1,STACK_FRAME_OVERHEAD
|
|
|
|
bl altivec_unavailable_tm
|
|
|
|
b ret_from_except
|
|
|
|
#endif
|
|
|
|
1:
|
|
|
|
END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
|
|
|
|
#endif
|
|
|
|
bl save_nvgprs
|
|
|
|
RECONCILE_IRQ_STATE(r10, r11)
|
|
|
|
addi r3,r1,STACK_FRAME_OVERHEAD
|
|
|
|
bl altivec_unavailable_exception
|
|
|
|
b ret_from_except
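The `rldicl.` test in the handler above extracts the two transactional memory state bits from the saved MSR held in r12. A rough C equivalent is sketched below. The value 33 for `MSR_TS_LG` is an assumption taken from memory of the kernel headers, and `rotl64`, `msr_tm_state` and `msr_tm_active` are hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: bit position of the low TM state bit in the MSR. */
#define MSR_TS_LG 33

/* Rotate a 64-bit value left by n bits (0 < n < 64). */
static inline uint64_t rotl64(uint64_t x, unsigned n)
{
    return (x << n) | (x >> (64 - n));
}

/*
 * Mirrors "rldicl. r0, r12, (64-MSR_TS_LG), (64-2)": rotate left by
 * 64-MSR_TS_LG (the same as rotating right by MSR_TS_LG), then clear
 * the upper 62 bits so only the two TM state bits remain.
 */
static inline uint64_t msr_tm_state(uint64_t msr)
{
    return rotl64(msr, 64 - MSR_TS_LG) & 0x3;
}

/* Non-zero state means userspace was in a transaction ("bne- 2f"). */
static inline bool msr_tm_active(uint64_t msr)
{
    return msr_tm_state(msr) != 0;
}
```

A zero result corresponds to the fall-through path that loads the facility directly; a non-zero result corresponds to the branch to label `2:`.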
|
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_OOL(vsx_unavailable, 0xf40, 0x20)
|
|
|
|
EXC_VIRT_OOL(vsx_unavailable, 0x4f40, 0x20, 0xf40)
|
2016-09-21 14:43:55 +07:00
|
|
|
TRAMP_KVM(PACA_EXGEN, 0xf40)
|
|
|
|
EXC_COMMON_BEGIN(vsx_unavailable_common)
|
|
|
|
EXCEPTION_PROLOG_COMMON(0xf40, PACA_EXGEN)
|
|
|
|
#ifdef CONFIG_VSX
|
|
|
|
BEGIN_FTR_SECTION
|
|
|
|
beq 1f
|
|
|
|
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
|
|
|
|
BEGIN_FTR_SECTION_NESTED(69)
|
|
|
|
/* Test if 2 TM state bits are zero. If non-zero (ie. userspace was in
|
|
|
|
* transaction), go do TM stuff
|
|
|
|
*/
|
|
|
|
rldicl. r0, r12, (64-MSR_TS_LG), (64-2)
|
|
|
|
bne- 2f
|
|
|
|
END_FTR_SECTION_NESTED(CPU_FTR_TM, CPU_FTR_TM, 69)
|
|
|
|
#endif
|
|
|
|
b load_up_vsx
|
|
|
|
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
|
|
|
|
2: /* User process was in a transaction */
|
|
|
|
bl save_nvgprs
|
|
|
|
RECONCILE_IRQ_STATE(r10, r11)
|
|
|
|
addi r3,r1,STACK_FRAME_OVERHEAD
|
|
|
|
bl vsx_unavailable_tm
|
|
|
|
b ret_from_except
|
|
|
|
#endif
|
|
|
|
1:
|
|
|
|
END_FTR_SECTION_IFSET(CPU_FTR_VSX)
|
|
|
|
#endif
|
|
|
|
bl save_nvgprs
|
|
|
|
RECONCILE_IRQ_STATE(r10, r11)
|
|
|
|
addi r3,r1,STACK_FRAME_OVERHEAD
|
|
|
|
bl vsx_unavailable_exception
|
|
|
|
b ret_from_except
|
|
|
|
|
2016-09-30 16:43:18 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_OOL(facility_unavailable, 0xf60, 0x20)
|
|
|
|
EXC_VIRT_OOL(facility_unavailable, 0x4f60, 0x20, 0xf60)
|
2016-09-21 14:43:56 +07:00
|
|
|
TRAMP_KVM(PACA_EXGEN, 0xf60)
|
|
|
|
EXC_COMMON(facility_unavailable_common, 0xf60, facility_unavailable_exception)
|
|
|
|
|
2016-09-30 16:43:18 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_OOL_HV(h_facility_unavailable, 0xf80, 0x20)
|
|
|
|
EXC_VIRT_OOL_HV(h_facility_unavailable, 0x4f80, 0x20, 0xf80)
|
2016-09-21 14:43:57 +07:00
|
|
|
TRAMP_KVM_HV(PACA_EXGEN, 0xf80)
|
|
|
|
EXC_COMMON(h_facility_unavailable_common, 0xf80, facility_unavailable_exception)
|
|
|
|
|
2016-09-30 16:43:18 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_NONE(0xfa0, 0x20)
|
|
|
|
EXC_VIRT_NONE(0x4fa0, 0x20)
|
|
|
|
EXC_REAL_NONE(0xfc0, 0x20)
|
|
|
|
EXC_VIRT_NONE(0x4fc0, 0x20)
|
|
|
|
EXC_REAL_NONE(0xfe0, 0x20)
|
|
|
|
EXC_VIRT_NONE(0x4fe0, 0x20)
|
|
|
|
|
|
|
|
EXC_REAL_NONE(0x1000, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5000, 0x100)
|
|
|
|
EXC_REAL_NONE(0x1100, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5100, 0x100)
|
2013-02-13 23:21:38 +07:00
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
#ifdef CONFIG_CBE_RAS
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_HV(cbe_system_error, 0x1200, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5200, 0x100)
|
2016-09-30 16:43:18 +07:00
|
|
|
TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0x1200)
|
2016-09-21 14:43:59 +07:00
|
|
|
EXC_COMMON(cbe_system_error_common, 0x1200, cbe_system_error_exception)
|
2016-09-30 16:43:18 +07:00
|
|
|
#else /* CONFIG_CBE_RAS */
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_NONE(0x1200, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5200, 0x100)
|
2016-09-30 16:43:18 +07:00
|
|
|
#endif
|
2011-06-29 07:18:26 +07:00
|
|
|
|
2016-09-21 14:43:59 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL(instruction_breakpoint, 0x1300, 0x100)
|
|
|
|
EXC_VIRT(instruction_breakpoint, 0x5300, 0x100, 0x1300)
|
2016-09-30 16:43:18 +07:00
|
|
|
TRAMP_KVM_SKIP(PACA_EXGEN, 0x1300)
|
2016-09-21 14:44:00 +07:00
|
|
|
EXC_COMMON(instruction_breakpoint_common, 0x1300, instruction_breakpoint_exception)
|
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_NONE(0x1400, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5400, 0x100)
|
2016-09-30 16:43:18 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_BEGIN(denorm_exception_hv, 0x1500, 0x100)
|
2012-09-10 07:35:26 +07:00
|
|
|
mtspr SPRN_SPRG_HSCRATCH0,r13
|
2019-06-22 20:15:19 +07:00
|
|
|
EXCEPTION_PROLOG_0 PACA_EXGEN
|
2019-06-22 20:15:16 +07:00
|
|
|
EXCEPTION_PROLOG_1 EXC_HV, PACA_EXGEN, 0, 0x1500, 0
|
2012-09-10 07:35:26 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_PPC_DENORMALISATION
|
|
|
|
mfspr r10,SPRN_HSRR1
|
2016-09-21 14:43:31 +07:00
|
|
|
andis. r10,r10,(HSRR1_DENORM)@h /* denorm? */
|
|
|
|
bne+ denorm_assist
|
|
|
|
#endif
|
powerpc/book3s: handle machine check in Linux host.
Move the machine check entry point into Linux. So far we were dependent on
firmware to decode MCE error details and hand over the high-level info to the OS.
This patch introduces an early machine check routine that saves the MCE
information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate a
stack frame on the emergency stack and set r1 accordingly. This allows us to be
prepared to take another exception without losing context. One thing to note
here is that if we get another machine check while the ME bit is off, we risk a
checkstop. Hence we restrict ourselves to saving only the MCE information and
the registers saved in the PACA_EXMC save area before we turn the ME bit on. We use
paca->in_mce flag to differentiate between first entry and nested machine check
entry which helps proper use of emergency stack. We increment paca->in_mce
every time we enter in early machine check handler and decrement it while
leaving. When we enter machine check early handler first time (paca->in_mce ==
0), we are sure nobody is using MC emergency stack and allocate a stack frame
at the start of the emergency stack. During subsequent entry (paca->in_mce >
0), we know that r1 points inside emergency stack and we allocate separate
stack frame accordingly. This prevents us from clobbering MCE information
during nested machine checks.
The early machine check handler changes are placed under CPU_FTR_HVMODE
section. This makes sure that the early machine check handler will get executed
only in hypervisor kernel.
This is the code flow:
Machine Check Interrupt
|
V
0x200 vector ME=0, IR=0, DR=0
|
V
+-----------------------------------------------+
|machine_check_pSeries_early: | ME=0, IR=0, DR=0
| Alloc frame on emergency stack |
| Save srr1, srr0, dar and dsisr on stack |
+-----------------------------------------------+
|
(ME=1, IR=0, DR=0, RFID)
|
V
machine_check_handle_early ME=1, IR=0, DR=0
|
V
+-----------------------------------------------+
| machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0
| Things to do: (in next patches) |
| Flush SLB for SLB errors |
| Flush TLB for TLB errors |
| Decode and save MCE info |
+-----------------------------------------------+
|
(Fall through existing exception handler routine.)
|
V
machine_check_pSerie ME=1, IR=0, DR=0
|
(ME=1, IR=1, DR=1, RFID)
|
V
machine_check_common ME=1, IR=1, DR=1
.
.
.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-30 21:34:08 +07:00
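The nesting rule described in the commit message above can be modelled in a few lines of C. This is only a sketch of the bookkeeping, not the kernel's implementation: `struct paca_model`, `FRAME_SIZE`, `mce_early_enter` and `mce_early_exit` are hypothetical names, and the frame size is an arbitrary placeholder.

```c
#include <stdint.h>

/* Placeholder frame size for illustration only. */
#define FRAME_SIZE 0x200u

/* Hypothetical stand-in for the two paca fields involved. */
struct paca_model {
    uint64_t mc_emergency_sp; /* top of the machine check emergency stack */
    uint32_t in_mce;          /* nesting depth of early MCE handling */
};

/*
 * Entry: on first entry start from the top of the emergency stack; on a
 * nested entry r1 already points into it, so carve the frame below r1.
 */
static uint64_t mce_early_enter(struct paca_model *paca, uint64_t r1)
{
    uint64_t base = (paca->in_mce == 0) ? paca->mc_emergency_sp : r1;

    paca->in_mce++;
    return base - FRAME_SIZE;
}

/* Exit: drop the nesting count again. */
static void mce_early_exit(struct paca_model *paca)
{
    paca->in_mce--;
}
```

Because a nested entry allocates below the current r1 rather than at the top of the stack, the MCE information saved by the outer handler is not overwritten.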
|
|
|
|
2019-06-22 20:15:15 +07:00
|
|
|
KVMTEST EXC_HV 0x1500
|
2019-06-22 20:15:13 +07:00
|
|
|
EXCEPTION_PROLOG_2_REAL denorm_common, EXC_HV, 1
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_END(denorm_exception_hv, 0x1500, 0x100)
|
2016-08-10 17:48:43 +07:00
|
|
|
|
2016-09-21 14:44:01 +07:00
|
|
|
#ifdef CONFIG_PPC_DENORMALISATION
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_VIRT_BEGIN(denorm_exception, 0x5500, 0x100)
|
2016-09-21 14:44:01 +07:00
|
|
|
b exc_real_0x1500_denorm_exception_hv
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_VIRT_END(denorm_exception, 0x5500, 0x100)
|
2016-09-21 14:44:01 +07:00
|
|
|
#else
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_VIRT_NONE(0x5500, 0x100)
|
2016-09-21 14:43:31 +07:00
|
|
|
#endif
|
|
|
|
|
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-21 17:32:01 +07:00
|
|
|
TRAMP_KVM_HV(PACA_EXGEN, 0x1500)
|
2011-06-29 07:18:26 +07:00
|
|
|
|
2012-09-10 07:35:26 +07:00
|
|
|
#ifdef CONFIG_PPC_DENORMALISATION
|
2016-09-30 16:43:18 +07:00
|
|
|
TRAMP_REAL_BEGIN(denorm_assist)
|
2012-09-10 07:35:26 +07:00
|
|
|
BEGIN_FTR_SECTION
|
|
|
|
/*
|
|
|
|
* To denormalise we need to move a copy of the register to itself.
|
|
|
|
* For POWER6 do that here for all FP regs.
|
|
|
|
*/
|
|
|
|
mfmsr r10
|
|
|
|
ori r10,r10,(MSR_FP|MSR_FE0|MSR_FE1)
|
|
|
|
xori r10,r10,(MSR_FE0|MSR_FE1)
|
|
|
|
mtmsrd r10
|
|
|
|
sync
|
2013-05-30 04:33:18 +07:00
|
|
|
|
|
|
|
#define FMR2(n) fmr (n), (n) ; fmr n+1, n+1
|
|
|
|
#define FMR4(n) FMR2(n) ; FMR2(n+2)
|
|
|
|
#define FMR8(n) FMR4(n) ; FMR4(n+4)
|
|
|
|
#define FMR16(n) FMR8(n) ; FMR8(n+8)
|
|
|
|
#define FMR32(n) FMR16(n) ; FMR16(n+16)
|
|
|
|
FMR32(0)
|
|
|
|
|
2012-09-10 07:35:26 +07:00
|
|
|
FTR_SECTION_ELSE
|
|
|
|
/*
|
|
|
|
* To denormalise we need to move a copy of the register to itself.
|
|
|
|
* For POWER7 do that here for the first 32 VSX registers only.
|
|
|
|
*/
|
|
|
|
mfmsr r10
|
|
|
|
oris r10,r10,MSR_VSX@h
|
|
|
|
mtmsrd r10
|
|
|
|
sync
|
2013-05-30 04:33:18 +07:00
|
|
|
|
|
|
|
#define XVCPSGNDP2(n) XVCPSGNDP(n,n,n) ; XVCPSGNDP(n+1,n+1,n+1)
|
|
|
|
#define XVCPSGNDP4(n) XVCPSGNDP2(n) ; XVCPSGNDP2(n+2)
|
|
|
|
#define XVCPSGNDP8(n) XVCPSGNDP4(n) ; XVCPSGNDP4(n+4)
|
|
|
|
#define XVCPSGNDP16(n) XVCPSGNDP8(n) ; XVCPSGNDP8(n+8)
|
|
|
|
#define XVCPSGNDP32(n) XVCPSGNDP16(n) ; XVCPSGNDP16(n+16)
|
|
|
|
XVCPSGNDP32(0)
|
|
|
|
|
2012-09-10 07:35:26 +07:00
|
|
|
ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_206)
|
2013-05-30 04:33:19 +07:00
|
|
|
|
|
|
|
BEGIN_FTR_SECTION
|
|
|
|
b denorm_done
|
|
|
|
END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
|
|
|
|
/*
|
|
|
|
* To denormalise we need to move a copy of the register to itself.
|
|
|
|
* For POWER8 we need to do that for all 64 VSX registers
|
|
|
|
*/
|
|
|
|
XVCPSGNDP32(32)
|
|
|
|
denorm_done:
|
2018-09-13 12:33:47 +07:00
|
|
|
mfspr r11,SPRN_HSRR0
|
|
|
|
subi r11,r11,4
|
2012-09-10 07:35:26 +07:00
|
|
|
mtspr SPRN_HSRR0,r11
|
|
|
|
mtcrf 0x80,r9
|
|
|
|
ld r9,PACA_EXGEN+EX_R9(r13)
|
2012-12-07 04:51:04 +07:00
|
|
|
RESTORE_PPR_PACA(PACA_EXGEN, r10)
|
2013-08-12 13:12:06 +07:00
|
|
|
BEGIN_FTR_SECTION
|
|
|
|
ld r10,PACA_EXGEN+EX_CFAR(r13)
|
|
|
|
mtspr SPRN_CFAR,r10
|
|
|
|
END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
|
2012-09-10 07:35:26 +07:00
|
|
|
ld r10,PACA_EXGEN+EX_R10(r13)
|
|
|
|
ld r11,PACA_EXGEN+EX_R11(r13)
|
|
|
|
ld r12,PACA_EXGEN+EX_R12(r13)
|
|
|
|
ld r13,PACA_EXGEN+EX_R13(r13)
|
2018-01-09 23:07:15 +07:00
|
|
|
HRFI_TO_UNKNOWN
|
2012-09-10 07:35:26 +07:00
|
|
|
b .
|
|
|
|
#endif
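The `FMR2` through `FMR32` and `XVCPSGNDP2` through `XVCPSGNDP32` macros above build a 32-register sequence by doubling at each level. The same pattern is sketched below with the C preprocessor; `touch` and the `TOUCHn` macros are hypothetical stand-ins for the register-to-itself move, used here only to count what the expansion produces:

```c
#include <stdint.h>

static uint32_t touched;  /* one bit per register index visited */
static unsigned calls;    /* number of leaf operations emitted */

/* Leaf operation: stands in for "fmr n,n" or "XVCPSGNDP(n,n,n)". */
static void touch(unsigned n)
{
    touched |= (uint32_t)1 << n;
    calls++;
}

/* Each level expands to two copies of the previous one. */
#define TOUCH2(n)  touch(n);  touch((n) + 1)
#define TOUCH4(n)  TOUCH2(n); TOUCH2((n) + 2)
#define TOUCH8(n)  TOUCH4(n); TOUCH4((n) + 4)
#define TOUCH16(n) TOUCH8(n); TOUCH8((n) + 8)
#define TOUCH32(n) TOUCH16(n); TOUCH16((n) + 16)

/* Expands to 32 straight-line calls covering indices 0..31. */
static void touch_all(void)
{
    touched = 0;
    calls = 0;
    TOUCH32(0);
}
```

Five macro levels yield 2^5 = 32 leaf operations with no loop, which suits a real-mode handler where straight-line code is preferable.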
|
|
|
|
|
2018-01-12 09:28:48 +07:00
|
|
|
EXC_COMMON(denorm_common, 0x1500, unknown_exception)
|
2016-09-21 14:44:01 +07:00
|
|
|
|
|
|
|
|
|
|
|
#ifdef CONFIG_CBE_RAS
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_HV(cbe_maintenance, 0x1600, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5600, 0x100)
|
2016-09-21 14:44:01 +07:00
|
|
|
TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0x1600)
|
2016-09-21 14:44:02 +07:00
|
|
|
EXC_COMMON(cbe_maintenance_common, 0x1600, cbe_maintenance_exception)
|
2016-09-21 14:44:01 +07:00
|
|
|
#else /* CONFIG_CBE_RAS */
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_NONE(0x1600, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5600, 0x100)
|
2016-09-21 14:44:01 +07:00
|
|
|
#endif
|
|
|
|
|
2016-09-21 14:44:02 +07:00
|
|
|
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL(altivec_assist, 0x1700, 0x100)
|
|
|
|
EXC_VIRT(altivec_assist, 0x5700, 0x100, 0x1700)
|
2016-09-21 14:44:01 +07:00
|
|
|
TRAMP_KVM(PACA_EXGEN, 0x1700)
|
2016-09-21 14:44:03 +07:00
|
|
|
#ifdef CONFIG_ALTIVEC
|
|
|
|
EXC_COMMON(altivec_assist_common, 0x1700, altivec_assist_exception)
|
|
|
|
#else
|
|
|
|
EXC_COMMON(altivec_assist_common, 0x1700, unknown_exception)
|
|
|
|
#endif
|
|
|
|
|
2016-09-21 14:44:01 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_CBE_RAS
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_HV(cbe_thermal, 0x1800, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5800, 0x100)
|
2016-09-21 14:44:01 +07:00
|
|
|
TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0x1800)
|
2016-09-21 14:44:04 +07:00
|
|
|
EXC_COMMON(cbe_thermal_common, 0x1800, cbe_thermal_exception)
|
2016-09-21 14:44:01 +07:00
|
|
|
#else /* CONFIG_CBE_RAS */
|
2016-12-06 08:41:12 +07:00
|
|
|
EXC_REAL_NONE(0x1800, 0x100)
|
|
|
|
EXC_VIRT_NONE(0x5800, 0x100)
|
2016-09-21 14:44:01 +07:00
|
|
|
#endif
|
|
|
|
|
2017-08-01 19:00:52 +07:00
|
|
|
#ifdef CONFIG_PPC_WATCHDOG
|
2017-07-13 04:35:52 +07:00
|
|
|
|
|
|
|
#define MASKED_DEC_HANDLER_LABEL 3f
|
|
|
|
|
|
|
|
#define MASKED_DEC_HANDLER(_H) \
|
|
|
|
3: /* soft-nmi */ \
|
|
|
|
std r12,PACA_EXGEN+EX_R12(r13); \
|
|
|
|
GET_SCRATCH0(r10); \
|
|
|
|
std r10,PACA_EXGEN+EX_R13(r13); \
|
2019-06-22 20:15:13 +07:00
|
|
|
EXCEPTION_PROLOG_2_REAL soft_nmi_common, _H, 1
|
2017-07-13 04:35:52 +07:00
|
|
|
|
2017-07-29 19:50:27 +07:00
|
|
|
/*
|
|
|
|
* Branch to soft_nmi_interrupt using the emergency stack. The emergency
|
|
|
|
* stack is one that is usable by maskable interrupts so long as MSR_EE
|
|
|
|
* remains off. It is used for recovery when something has corrupted the
|
|
|
|
* normal kernel stack, for example. The "soft NMI" must not use the process
|
|
|
|
* stack because we want irq disabled sections to avoid touching the stack
|
|
|
|
* at all (other than PMU interrupts), so use the emergency stack for this,
|
|
|
|
* and run it entirely with interrupts hard disabled.
|
|
|
|
*/
|
2017-07-13 04:35:52 +07:00
|
|
|
EXC_COMMON_BEGIN(soft_nmi_common)
|
|
|
|
mr r10,r1
|
|
|
|
ld r1,PACAEMERGSP(r13)
|
|
|
|
subi r1,r1,INT_FRAME_SIZE
|
2019-06-22 20:15:21 +07:00
|
|
|
EXCEPTION_COMMON_STACK(PACA_EXGEN, 0x900)
|
|
|
|
bl save_nvgprs
|
|
|
|
RECONCILE_IRQ_STATE(r10, r11)
|
2019-06-22 20:15:20 +07:00
|
|
|
addi r3,r1,STACK_FRAME_OVERHEAD
|
|
|
|
bl soft_nmi_interrupt
|
2017-07-13 04:35:52 +07:00
|
|
|
b ret_from_except
|
|
|
|
|
2017-08-01 19:00:52 +07:00
|
|
|
#else /* CONFIG_PPC_WATCHDOG */
|
2017-07-13 04:35:52 +07:00
|
|
|
#define MASKED_DEC_HANDLER_LABEL 2f /* normal return */
|
|
|
|
#define MASKED_DEC_HANDLER(_H)
|
2017-08-01 19:00:52 +07:00
|
|
|
#endif /* CONFIG_PPC_WATCHDOG */
|
2016-09-21 14:44:01 +07:00
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
/*
|
2012-11-15 01:49:48 +07:00
|
|
|
* An interrupt came in while soft-disabled. We set paca->irq_happened, then:
|
|
|
|
* - If it was a decrementer interrupt, we bump the dec to max and return.
|
|
|
|
* - If it was a doorbell we return immediately since doorbells are edge
|
|
|
|
* triggered and won't automatically refire.
|
2014-07-29 20:10:01 +07:00
|
|
|
* - If it was a HMI we return immediately since we handled it in realmode
|
|
|
|
* and it won't refire.
|
2018-02-03 14:17:50 +07:00
|
|
|
* - Else it is one of PACA_IRQ_MUST_HARD_MASK, so hard disable and return.
|
2012-11-15 01:49:48 +07:00
|
|
|
* This is called with r10 containing the value to OR to the paca field.
|
2009-06-03 04:17:38 +07:00
|
|
|
*/
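The comment above can be read as a small decision procedure. The C sketch below models it under stated assumptions: the `PACA_IRQ_*` values are assumed for illustration (this listing does not show them), `struct masked_state` and its fields are hypothetical stand-ins for the paca byte, the DEC register and the MSR_EE bit in (H)SRR1, and only the non-watchdog case is modelled, where a masked decrementer simply returns.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, for illustration only; not taken from this listing. */
enum {
    PACA_IRQ_HARD_DIS       = 0x01,
    PACA_IRQ_DBELL          = 0x02,
    PACA_IRQ_EE             = 0x04,
    PACA_IRQ_DEC            = 0x08,
    PACA_IRQ_HMI            = 0x20,
    PACA_IRQ_PMI            = 0x40,
    PACA_IRQ_MUST_HARD_MASK = PACA_IRQ_EE | PACA_IRQ_PMI,
};

/* Hypothetical model of the state the macro touches. */
struct masked_state {
    uint8_t  irq_happened;   /* models paca->irq_happened */
    uint32_t dec;            /* models the DEC register */
    bool     return_with_ee; /* models MSR_EE in SRR1/HSRR1 */
};

/* 'reason' plays the role of r10: the value to OR into irq_happened. */
void masked_interrupt(struct masked_state *s, uint8_t reason)
{
    s->irq_happened |= reason;

    if (reason == PACA_IRQ_DEC) {
        /* lis/ori pair: push the decrementer out to its maximum. */
        s->dec = 0x7fffffffu;
        return;
    }

    if (reason & PACA_IRQ_MUST_HARD_MASK) {
        /* Level-triggered source: return with MSR_EE cleared. */
        s->return_with_ee = false;
        s->irq_happened |= PACA_IRQ_HARD_DIS;
    }
    /* Doorbell and HMI fall through: record the reason and return. */
}
```

In the model, a doorbell or HMI only records its reason, while an external interrupt also sets `PACA_IRQ_HARD_DIS` and clears the EE flag, matching the three branches the comment lists.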
|
2019-06-22 20:15:11 +07:00
|
|
|
.macro MASKED_INTERRUPT hsrr
|
|
|
|
.if \hsrr
|
|
|
|
masked_Hinterrupt:
|
|
|
|
.else
|
|
|
|
masked_interrupt:
|
|
|
|
.endif
|
|
|
|
std r11,PACA_EXGEN+EX_R11(r13)
|
|
|
|
lbz r11,PACAIRQHAPPENED(r13)
|
|
|
|
or r11,r11,r10
|
|
|
|
stb r11,PACAIRQHAPPENED(r13)
|
|
|
|
cmpwi r10,PACA_IRQ_DEC
|
|
|
|
bne 1f
|
|
|
|
lis r10,0x7fff
|
|
|
|
ori r10,r10,0xffff
|
|
|
|
mtspr SPRN_DEC,r10
|
|
|
|
b MASKED_DEC_HANDLER_LABEL
|
|
|
|
1: andi. r10,r10,PACA_IRQ_MUST_HARD_MASK
|
|
|
|
beq 2f
|
|
|
|
.if \hsrr
|
|
|
|
mfspr r10,SPRN_HSRR1
|
|
|
|
xori r10,r10,MSR_EE /* clear MSR_EE */
|
|
|
|
mtspr SPRN_HSRR1,r10
|
|
|
|
.else
|
|
|
|
mfspr r10,SPRN_SRR1
|
|
|
|
xori r10,r10,MSR_EE /* clear MSR_EE */
|
|
|
|
mtspr SPRN_SRR1,r10
|
|
|
|
.endif
|
|
|
|
ori r11,r11,PACA_IRQ_HARD_DIS
|
|
|
|
stb r11,PACAIRQHAPPENED(r13)
|
|
|
|
2: /* done */
|
|
|
|
mtcrf 0x80,r9
|
|
|
|
std r1,PACAR1(r13)
|
|
|
|
ld r9,PACA_EXGEN+EX_R9(r13)
|
|
|
|
ld r10,PACA_EXGEN+EX_R10(r13)
|
|
|
|
ld r11,PACA_EXGEN+EX_R11(r13)
|
|
|
|
/* returns to kernel where r13 must be set up, so don't restore it */
|
|
|
|
.if \hsrr
|
|
|
|
HRFI_TO_KERNEL
|
|
|
|
.else
|
|
|
|
RFI_TO_KERNEL
|
|
|
|
.endif
|
|
|
|
b .
|
|
|
|
MASKED_DEC_HANDLER(\hsrr\())
|
|
|
|
.endm
|
2016-09-28 08:31:48 +07:00
|
|
|
|
2018-05-22 06:00:00 +07:00
|
|
|
TRAMP_REAL_BEGIN(stf_barrier_fallback)
|
|
|
|
std r9,PACA_EXRFI+EX_R9(r13)
|
|
|
|
std r10,PACA_EXRFI+EX_R10(r13)
|
|
|
|
sync
|
|
|
|
ld r9,PACA_EXRFI+EX_R9(r13)
|
|
|
|
ld r10,PACA_EXRFI+EX_R10(r13)
|
|
|
|
ori 31,31,0
|
|
|
|
.rept 14
|
|
|
|
b 1f
|
|
|
|
1:
|
|
|
|
.endr
|
|
|
|
blr
|
|
|
|
|
powerpc/64s: Add support for RFI flush of L1-D cache
On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.
This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.
The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.
In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.
In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.
For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.
In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, backports, testing etc.
Many thanks to all of them.
Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-09 23:07:15 +07:00
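As a rough illustration of the software displacement flush mentioned above, the following C sketch mirrors the loop structure of the fallback routine that appears in this listing: the flush size is shifted right by 7 + 3 to get the iteration count (128-byte lines, unrolled 8x), and each iteration issues eight loads at byte offsets (0x80 + 8) * i before advancing by 0x80 * 8 bytes. The function names and the load counter are assumptions made for the example; the real routine is assembly and reads its area and size from the paca.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 0x80u /* 128-byte cache line, as in the srdi comment */
#define UNROLL     8u    /* loads per loop iteration */

/* Iteration count: flush size divided by (line size * unroll factor). */
size_t flush_iterations(size_t flush_size)
{
    return flush_size >> (7 + 3);
}

/* Which cache line (within one 1 KiB stride) the i-th staggered load hits. */
size_t load_line_index(size_t i)
{
    return ((LINE_BYTES + 8) * i) / LINE_BYTES;
}

/*
 * Hypothetical C rendering of the load loop. Returns the number of loads
 * issued so the structure can be checked; the assembly returns nothing.
 */
size_t displacement_flush(const volatile uint64_t *area, size_t flush_size)
{
    size_t loads = 0;

    for (size_t n = flush_iterations(flush_size); n != 0; n--) {
        for (size_t i = 0; i < UNROLL; i++) {
            /* Volatile read at byte offset (0x80 + 8) * i. */
            (void)area[((LINE_BYTES + 8) * i) / sizeof(uint64_t)];
            loads++;
        }
        /* Advance by 0x80 * 8 bytes, as the addi does. */
        area += (LINE_BYTES * UNROLL) / sizeof(uint64_t);
    }
    return loads;
}
```

The extra 8 bytes per load staggers the offset within each line while still touching eight distinct lines per iteration, since (0x80 + 8) * 7 = 0x3b8 stays inside the eighth line of the stride.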
|
|
|
TRAMP_REAL_BEGIN(rfi_flush_fallback)
|
|
|
|
SET_SCRATCH0(r13);
|
|
|
|
GET_PACA(r13);
|
powerpc/64s: Make rfi_flush_fallback a little more robust
Because rfi_flush_fallback runs immediately before the return to
userspace it currently runs with the user r1 (stack pointer). This
means if we oops in there we will report a bad kernel stack pointer in
the exception entry path, eg:
Bad kernel stack pointer 7ffff7150e40 at c0000000000023b4
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1246 Comm: klogd Not tainted 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7
NIP: c0000000000023b4 LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000fffe7d40 TRAP: 4100 Not tainted (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: c0000000f1e66a80
GPR00: 0000000002000000 00007ffff7150e40 00007fff93a99900 0000000000000020
...
NIP [c0000000000023b4] rfi_flush_fallback+0x34/0x80
LR [0000000010053e00] 0x10053e00
Although the NIP tells us where we were, and the TRAP number tells us
what happened, it would still be nicer if we could report the actual
exception rather than barfing about the stack pointer.
We can do that fairly simply by loading the kernel stack pointer on
entry and restoring the user value before returning. That way we see a
regular oops such as:
Unrecoverable exception 4100 at c00000000000239c
Oops: Unrecoverable exception, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1251 Comm: klogd Not tainted 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40
NIP: c00000000000239c LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000f1e17bb0 TRAP: 4100 Not tainted (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: 0
...
NIP [c00000000000239c] rfi_flush_fallback+0x3c/0x80
LR [0000000010053e00] 0x10053e00
Call Trace:
[c0000000f1e17e30] [c00000000000b9e4] system_call+0x5c/0x70 (unreliable)
Note this shouldn't make the kernel stack pointer vulnerable to a
meltdown attack, because it should be flushed from the cache before we
return to userspace. The user r1 value will be in the cache, because
we load it in the return path, but that is harmless.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
2018-07-26 19:42:44 +07:00
|
|
|
std r1,PACA_EXRFI+EX_R12(r13)
|
|
|
|
ld r1,PACAKSAVE(r13)
|
2018-01-09 23:07:15 +07:00
|
|
|
std r9,PACA_EXRFI+EX_R9(r13)
|
|
|
|
std r10,PACA_EXRFI+EX_R10(r13)
|
|
|
|
std r11,PACA_EXRFI+EX_R11(r13)
|
|
|
|
mfctr r9
|
|
|
|
ld r10,PACA_RFI_FLUSH_FALLBACK_AREA(r13)
|
2018-01-17 20:58:18 +07:00
|
|
|
ld r11,PACA_L1D_FLUSH_SIZE(r13)
|
|
|
|
srdi r11,r11,(7 + 3) /* 128 byte lines, unrolled 8x */
|
2018-01-09 23:07:15 +07:00
|
|
|
mtctr r11
|
2018-02-21 02:08:26 +07:00
|
|
|
DCBT_BOOK3S_STOP_ALL_STREAM_IDS(r11) /* Stop prefetch streams */
|
2018-01-09 23:07:15 +07:00
|
|
|
|
|
|
|
/* order ld/st prior to dcbt stop all streams with flushing */
|
|
|
|
sync
|
2018-01-17 20:58:18 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The load addresses are at staggered offsets within cachelines,
|
|
|
|
* which suits some pipelines better (on others it should not
|
|
|
|
* hurt).
|
|
|
|
*/
|
|
|
|
1:
|
|
|
|
ld r11,(0x80 + 8)*0(r10)
|
|
|
|
ld r11,(0x80 + 8)*1(r10)
|
|
|
|
ld r11,(0x80 + 8)*2(r10)
|
|
|
|
ld r11,(0x80 + 8)*3(r10)
|
|
|
|
ld r11,(0x80 + 8)*4(r10)
|
|
|
|
ld r11,(0x80 + 8)*5(r10)
|
|
|
|
ld r11,(0x80 + 8)*6(r10)
|
|
|
|
ld r11,(0x80 + 8)*7(r10)
|
|
|
|
addi r10,r10,0x80*8
|
2018-01-09 23:07:15 +07:00
|
|
|
bdnz 1b
|
|
|
|
|
|
|
|
mtctr r9
|
|
|
|
ld r9,PACA_EXRFI+EX_R9(r13)
|
|
|
|
ld r10,PACA_EXRFI+EX_R10(r13)
|
|
|
|
ld r11,PACA_EXRFI+EX_R11(r13)
|
2018-07-26 19:42:44 +07:00
|
|
|
ld r1,PACA_EXRFI+EX_R12(r13)
|
2018-01-09 23:07:15 +07:00
|
|
|
GET_SCRATCH0(r13);
|
|
|
|
rfid
|
|
|
|
|
|
|
|
TRAMP_REAL_BEGIN(hrfi_flush_fallback)
|
|
|
|
SET_SCRATCH0(r13);
|
|
|
|
GET_PACA(r13);
|
2018-07-26 19:42:44 +07:00
|
|
|
std r1,PACA_EXRFI+EX_R12(r13)
|
|
|
|
ld r1,PACAKSAVE(r13)
|
2018-01-09 23:07:15 +07:00
|
|
|
std r9,PACA_EXRFI+EX_R9(r13)
|
|
|
|
std r10,PACA_EXRFI+EX_R10(r13)
|
|
|
|
std r11,PACA_EXRFI+EX_R11(r13)
|
|
|
|
mfctr r9
|
|
|
|
ld r10,PACA_RFI_FLUSH_FALLBACK_AREA(r13)
|
2018-01-17 20:58:18 +07:00
|
|
|
ld r11,PACA_L1D_FLUSH_SIZE(r13)
|
|
|
|
srdi r11,r11,(7 + 3) /* 128 byte lines, unrolled 8x */
|
2018-01-09 23:07:15 +07:00
|
|
|
mtctr r11
|
2018-02-21 02:08:26 +07:00
|
|
|
DCBT_BOOK3S_STOP_ALL_STREAM_IDS(r11) /* Stop prefetch streams */
|
2018-01-09 23:07:15 +07:00
|
|
|
|
|
|
|
/* order ld/st prior to dcbt stop all streams with flushing */
|
|
|
|
sync
|
2018-01-17 20:58:18 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The load addresses are at staggered offsets within cachelines,
|
|
|
|
* which suits some pipelines better (on others it should not
|
|
|
|
* hurt).
|
|
|
|
*/
|
|
|
|
1:
|
|
|
|
ld r11,(0x80 + 8)*0(r10)
|
|
|
|
ld r11,(0x80 + 8)*1(r10)
|
|
|
|
ld r11,(0x80 + 8)*2(r10)
|
|
|
|
ld r11,(0x80 + 8)*3(r10)
|
|
|
|
ld r11,(0x80 + 8)*4(r10)
|
|
|
|
ld r11,(0x80 + 8)*5(r10)
|
|
|
|
ld r11,(0x80 + 8)*6(r10)
|
|
|
|
ld r11,(0x80 + 8)*7(r10)
|
|
|
|
addi r10,r10,0x80*8
|
2018-01-09 23:07:15 +07:00
|
|
|
bdnz 1b
|
|
|
|
|
|
|
|
mtctr r9
|
|
|
|
ld r9,PACA_EXRFI+EX_R9(r13)
|
|
|
|
ld r10,PACA_EXRFI+EX_R10(r13)
|
|
|
|
ld r11,PACA_EXRFI+EX_R11(r13)
|
powerpc/64s: Make rfi_flush_fallback a little more robust
Because rfi_flush_fallback runs immediately before the return to
userspace it currently runs with the user r1 (stack pointer). This
means if we oops in there we will report a bad kernel stack pointer in
the exception entry path, eg:
Bad kernel stack pointer 7ffff7150e40 at c0000000000023b4
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1246 Comm: klogd Not tainted 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7
NIP: c0000000000023b4 LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000fffe7d40 TRAP: 4100 Not tainted (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: c0000000f1e66a80
GPR00: 0000000002000000 00007ffff7150e40 00007fff93a99900 0000000000000020
...
NIP [c0000000000023b4] rfi_flush_fallback+0x34/0x80
LR [0000000010053e00] 0x10053e00
Although the NIP tells us where we were, and the TRAP number tells us
what happened, it would still be nicer if we could report the actual
exception rather than barfing about the stack pointer.
We can do that fairly simply by loading the kernel stack pointer on
entry and restoring the user value before returning. That way we see a
regular oops such as:
Unrecoverable exception 4100 at c00000000000239c
Oops: Unrecoverable exception, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1251 Comm: klogd Not tainted 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40
NIP: c00000000000239c LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000f1e17bb0 TRAP: 4100 Not tainted (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: 0
...
NIP [c00000000000239c] rfi_flush_fallback+0x3c/0x80
LR [0000000010053e00] 0x10053e00
Call Trace:
[c0000000f1e17e30] [c00000000000b9e4] system_call+0x5c/0x70 (unreliable)
Note this shouldn't make the kernel stack pointer vulnerable to a
meltdown attack, because it should be flushed from the cache before we
return to userspace. The user r1 value will be in the cache, because
we load it in the return path, but that is harmless.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
2018-07-26 19:42:44 +07:00
|
|
|
ld r1,PACA_EXRFI+EX_R12(r13)
|
2018-01-09 23:07:15 +07:00
|
|
|
GET_SCRATCH0(r13);
|
|
|
|
hrfid
|
|
|
|
|
2016-09-28 08:31:48 +07:00
|
|
|
/*
|
|
|
|
* Real mode exceptions actually use this too, but alternate
|
|
|
|
* instruction code patches (which end up in the common .text area)
|
|
|
|
* cannot reach these if they are put there.
|
|
|
|
*/
|
|
|
|
USE_FIXED_SECTION(virt_trampolines)
|
2019-06-22 20:15:11 +07:00
|
|
|
MASKED_INTERRUPT EXC_STD
|
|
|
|
MASKED_INTERRUPT EXC_HV
|
2009-06-03 04:17:38 +07:00
|
|
|
|
2013-09-20 11:52:50 +07:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
|
2016-09-30 16:43:18 +07:00
|
|
|
TRAMP_REAL_BEGIN(kvmppc_skip_interrupt)
|
2013-09-20 11:52:50 +07:00
|
|
|
/*
|
|
|
|
* Here all GPRs are unchanged from when the interrupt happened
|
|
|
|
* except for r13, which is saved in SPRG_SCRATCH0.
|
|
|
|
*/
|
|
|
|
mfspr r13, SPRN_SRR0
|
|
|
|
addi r13, r13, 4
|
|
|
|
mtspr SPRN_SRR0, r13
|
|
|
|
GET_SCRATCH0(r13)
|
2018-01-09 23:07:15 +07:00
|
|
|
RFI_TO_KERNEL
|
2013-09-20 11:52:50 +07:00
|
|
|
b .
|
|
|
|
|
2016-09-30 16:43:18 +07:00
|
|
|
TRAMP_REAL_BEGIN(kvmppc_skip_Hinterrupt)
|
2013-09-20 11:52:50 +07:00
|
|
|
/*
|
|
|
|
* Here all GPRs are unchanged from when the interrupt happened
|
|
|
|
* except for r13, which is saved in SPRG_SCRATCH0.
|
|
|
|
*/
|
|
|
|
mfspr r13, SPRN_HSRR0
|
|
|
|
addi r13, r13, 4
|
|
|
|
mtspr SPRN_HSRR0, r13
|
|
|
|
GET_SCRATCH0(r13)
|
2018-01-09 23:07:15 +07:00
|
|
|
HRFI_TO_KERNEL
|
2013-09-20 11:52:50 +07:00
|
|
|
b .
|
|
|
|
#endif
|
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
/*
|
2016-04-08 05:00:34 +07:00
|
|
|
* Ensure that any handlers that get invoked from the exception prologs
|
|
|
|
* above are below the first 64KB (0x10000) of the kernel image because
|
|
|
|
* the prologs assemble the addresses of these handlers using the
|
|
|
|
* LOAD_HANDLER macro, which uses an ori instruction.
|
2009-06-03 04:17:38 +07:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*** Common interrupt handlers ***/
|
|
|
|
|
|
|
|
|
2012-11-02 13:21:43 +07:00
|
|
|
/*
|
|
|
|
* Relocation-on interrupts: A subset of the interrupts can be delivered
|
|
|
|
* with IR=1/DR=1, if AIL==2 and MSR.HV won't be changed by delivering
|
|
|
|
* it. Addresses are the same as the original interrupt addresses, but
|
|
|
|
* offset by 0xc000000000004000.
|
|
|
|
* It's impossible to receive interrupts below 0x300 via this mechanism.
|
|
|
|
* KVM: None of these traps are from the guest ; anything that escalated
|
|
|
|
* to HV=1 from HV=0 is delivered via real mode handlers.
|
|
|
|
*/
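
The address rule in the comment above can be written out as a small C sketch. The function and macro names are hypothetical; only the offset `0xc000000000004000` and the `0x300` lower bound come from the comment. The sketch does not model which vectors are actually in the deliverable subset, nor the AIL and MSR.HV conditions.

```c
#include <assert.h>
#include <stdint.h>

#define RELON_OFFSET      0xc000000000004000ULL  /* offset quoted in the comment */
#define RELON_MIN_VECTOR  0x300u                 /* nothing below this is delivered relocation-on */

/*
 * Return the relocation-on entry address for an interrupt vector, or 0 if
 * the vector lies below 0x300 and so cannot be delivered this way.
 */
static uint64_t relon_vector_address(uint32_t vector)
{
    if (vector < RELON_MIN_VECTOR)
        return 0;
    return RELON_OFFSET + vector;
}
```

For example, the data storage interrupt at `0x300` maps to `0xc000000000004300` under this rule.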
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This uses the standard macro, since the original 0x300 vector
|
|
|
|
* only has extra guff for STAB-based processors -- which never
|
|
|
|
* come here.
|
|
|
|
*/
|
2016-09-30 16:43:18 +07:00
|
|
|
|
2016-09-28 08:31:48 +07:00
|
|
|
EXC_COMMON_BEGIN(ppc64_runlatch_on_trampoline)
|
2014-02-04 12:04:35 +07:00
|
|
|
b __ppc64_runlatch_on
|
2012-03-01 08:45:27 +07:00
|
|
|
|
2016-09-28 08:31:48 +07:00
|
|
|
USE_FIXED_SECTION(virt_trampolines)
|
powerpc/book3s64: Fix branching to OOL handlers in relocatable kernel
Some of the interrupt vectors on 64-bit POWER server processors are only
32 bytes long (8 instructions), which is not enough for the full
first-level interrupt handler. For these we need to branch to an
out-of-line (OOL) handler. But when we are running a relocatable kernel,
the interrupt vectors up to the __end_interrupts marker are copied down
to real address 0x100. So, branches to labels (ie. OOL handlers) outside
this section must be handled differently (see LOAD_HANDLER()) in a
relocatable kernel, which needs at least 4 instructions.
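
As a rough illustration of why a base-relative load works after relocation while a plain branch from the copied vectors does not, the address formation can be sketched in C. This assumes the handler address is built by OR-ing a 16-bit offset into a 64KB-aligned kernel base, which is what the ori-based LOAD_HANDLER described in the source comments implies; the function names and the example base values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ori carries a 16-bit immediate, so only offsets below 64KB can be encoded. */
static bool reachable_by_ori(uint64_t handler_offset)
{
    return handler_offset < 0x10000ULL;
}

/*
 * Form the run-time address of a handler from the kernel base (wherever the
 * kernel actually runs) and the handler's offset from the start of the image.
 * Assumption: kbase has its low 16 bits clear, so OR behaves like addition.
 */
static uint64_t load_handler(uint64_t kbase, uint64_t handler_offset)
{
    assert((kbase & 0xffffULL) == 0);
    assert(reachable_by_ori(handler_offset));
    return kbase | handler_offset;
}
```

Because the base is read at run time, the same code yields the right target whether the kernel runs at its link address or has been relocated; a direct branch from the vectors copied to 0x100 has no such adjustment.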
However, branching from interrupt vector means that we corrupt the
CFAR (come-from address register) on POWER7 and later processors as
mentioned in commit 1707dd16. So, EXCEPTION_PROLOG_0 (6 instructions)
that contains the part up to the point where the CFAR is saved in the
PACA should be part of the short interrupt vectors before we branch out
to OOL handlers.
But as mentioned already, there are interrupt vectors on 64-bit POWER
server processors that are only 32 bytes long (like vectors 0x4f00,
0x4f20, etc.), which cannot accommodate the above two cases at the same
time owing to space constraints. Currently, in these interrupt vectors,
we simply branch out to OOL handlers, without using LOAD_HANDLER(),
which leaves us vulnerable when running a relocatable kernel (eg. kdump
case). While this has been the case for some time now and kdump is used
widely, we were fortunate not to see any problems so far, for three
reasons:
1. In almost all cases, production kernel (relocatable) is used for
kdump as well, which would mean that crashed kernel's OOL handler
would be at the same place where we end up branching to, from short
interrupt vector of kdump kernel.
2. Also, the OOL handler was unlikely to be the reason for the crash in
almost all the kdump scenarios, which meant we had a sane OOL handler
from the crashed kernel that we branched to.
3. On most 64-bit POWER server processors, page size is large enough
that marking interrupt vector code as executable (see commit
429d2e83) leads to marking OOL handler code from crashed kernel,
that sits right below interrupt vector code from kdump kernel, as
executable as well.
Let us fix this by moving the __end_interrupts marker down past OOL
handlers to make sure that we also copy OOL handlers to real address
0x100 when running a relocatable kernel.
This fix has been tested successfully in kdump scenario, on an LPAR with
4K page size by using different default/production kernel and kdump
kernel.
Also tested by manually corrupting the OOL handlers in the first kernel
and then kdump'ing, and then causing the OOL handlers to fire - mpe.
Fixes: c1fb6816fb1b ("powerpc: Add relocation on exception vector handlers")
Cc: stable@vger.kernel.org
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-04-15 19:48:02 +07:00
|
|
|
/*
|
|
|
|
* The __end_interrupts marker must be past the out-of-line (OOL)
|
|
|
|
* handlers, so that they are copied to real address 0x100 when running
|
|
|
|
* a relocatable kernel. This ensures they can be reached from the short
|
|
|
|
* trampoline handlers (like 0x4f00, 0x4f20, etc.) which branch
|
|
|
|
* directly, without using LOAD_HANDLER().
|
|
|
|
*/
|
|
|
|
.align 7
|
|
|
|
.globl __end_interrupts
|
|
|
|
__end_interrupts:
|
2016-09-28 08:31:48 +07:00
|
|
|
DEFINE_FIXED_SYMBOL(__end_interrupts)
|
2013-01-10 13:44:19 +07:00
|
|
|
|
2013-03-25 08:31:31 +07:00
|
|
|
#ifdef CONFIG_PPC_970_NAP
|
2016-10-11 14:47:56 +07:00
|
|
|
EXC_COMMON_BEGIN(power4_fixup_nap)
|
2013-03-25 08:31:31 +07:00
|
|
|
andc r9,r9,r10
|
|
|
|
std r9,TI_LOCAL_FLAGS(r11)
|
|
|
|
ld r10,_LINK(r1) /* make idle task do the */
|
|
|
|
std r10,_NIP(r1) /* equivalent of a blr */
|
|
|
|
blr
|
|
|
|
#endif
|
|
|
|
|
2016-09-28 08:31:48 +07:00
|
|
|
CLOSE_FIXED_SECTION(real_vectors);
|
|
|
|
CLOSE_FIXED_SECTION(real_trampolines);
|
|
|
|
CLOSE_FIXED_SECTION(virt_vectors);
|
|
|
|
CLOSE_FIXED_SECTION(virt_trampolines);
|
|
|
|
|
|
|
|
USE_TEXT_SECTION()
|
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
/*
|
|
|
|
* Hash table stuff
|
|
|
|
*/
|
2016-10-13 10:43:52 +07:00
|
|
|
.balign IFETCH_ALIGN_BYTES
|
2014-02-04 12:06:11 +07:00
|
|
|
do_hash_page:
|
2017-10-19 11:08:43 +07:00
|
|
|
#ifdef CONFIG_PPC_BOOK3S_64
|
2018-01-19 08:50:40 +07:00
|
|
|
lis r0,(DSISR_BAD_FAULT_64S | DSISR_DABRMATCH | DSISR_KEYFAULT)@h
|
2017-07-19 11:49:27 +07:00
|
|
|
ori r0,r0,DSISR_BAD_FAULT_64S@l
|
|
|
|
and. r0,r4,r0 /* weird error? */
|
2009-06-03 04:17:38 +07:00
|
|
|
bne- handle_page_fault /* if not, try to insert a HPTE */
|
2019-01-12 16:55:50 +07:00
|
|
|
ld r11, PACA_THREAD_INFO(r13)
|
powerpc: Allow perf_counters to access user memory at interrupt time
This provides a mechanism to allow the perf_counters code to access
user memory in a PMU interrupt routine. Such an access can cause
various kinds of interrupt: SLB miss, MMU hash table miss, segment
table miss, or TLB miss, depending on the processor. This commit
only deals with 64-bit classic/server processors, which use an MMU
hash table. 32-bit processors are already able to access user memory
at interrupt time. Since we don't soft-disable on 32-bit, we avoid
the possibility of reentering hash_page or the TLB miss handlers,
since they run with interrupts disabled.
On 64-bit processors, an SLB miss interrupt on a user address will
update the slb_cache and slb_cache_ptr fields in the paca. This is
OK except in the case where a PMU interrupt occurs in switch_slb,
which also accesses those fields. To prevent this, we hard-disable
interrupts in switch_slb. Interrupts are already soft-disabled at
this point, and will get hard-enabled when they get soft-enabled
later.
This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice,
and to make sure that it clears the slb_cache_ptr when called from
other callers than switch_slb, the existing routine is renamed to
__slb_flush_and_rebolt, which is called by switch_slb and the new
version of slb_flush_and_rebolt.
Similarly, switch_stab (used on POWER3 and RS64 processors) gets a
hard_irq_disable() to protect the per-cpu variables used there and
in ste_allocate.
If an MMU hash table miss interrupt occurs, normally we would call
hash_page to look up the Linux PTE for the address and create a HPTE.
However, hash_page is fairly complex and takes some locks, so to
avoid the possibility of deadlock, we check the preemption count
to see if we are in a (pseudo-)NMI handler, and if so, we don't call
hash_page but instead treat it like a bad access that will get
reported up through the exception table mechanism. An interrupt
whose handler runs even though the interrupt occurred when
soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI
handler, which should use nmi_enter()/nmi_exit() rather than
irq_enter()/irq_exit().
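
The preemption-count check described above can be sketched in C. The bit position of the NMI field is an assumption chosen for illustration (the real NMI_MASK depends on the kernel version's preempt_count layout), and the function and enum names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: a single NMI bit at position 20 of preempt_count. */
#define NMI_SHIFT 20
#define NMI_MASK  (1u << NMI_SHIFT)

enum fault_action {
    CALL_HASH_PAGE,      /* normal path: look up the Linux PTE and build an HPTE */
    TREAT_AS_BAD_ACCESS  /* (pseudo-)NMI context: report via the exception table */
};

/*
 * Decide what to do on an MMU hash table miss. hash_page takes locks, so it
 * must not be entered from a handler that interrupted soft-disabled code.
 */
static enum fault_action hash_miss_action(uint32_t preempt_count)
{
    if (preempt_count & NMI_MASK)
        return TREAT_AS_BAD_ACCESS;
    return CALL_HASH_PAGE;
}
```

In the assembly the same test is an `andis.` against the high half of NMI_MASK followed by a conditional branch around the hash_page call.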
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-17 12:17:54 +07:00
|
|
|
lwz r0,TI_PREEMPT(r11) /* If we're in an "NMI" */
|
|
|
|
andis. r0,r0,NMI_MASK@h /* (i.e. an irq when soft-disabled) */
|
|
|
|
bne 77f /* then don't call hash_page now */
|
2009-06-03 04:17:38 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* r3 contains the faulting address
|
2015-12-01 10:36:44 +07:00
|
|
|
* r4 msr
|
2009-06-03 04:17:38 +07:00
|
|
|
* r5 contains the trap number
|
2014-12-04 12:30:14 +07:00
|
|
|
* r6 contains dsisr
|
2009-06-03 04:17:38 +07:00
|
|
|
*
|
powerpc: Rework lazy-interrupt handling
The current implementation of lazy interrupt handling has some
issues that this tries to address.
We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.
The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.
Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.
This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.
The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.
When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.
We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).
This removes the need to play with the decrementer to try to create
fake interrupts, among others.
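
The scheme described above can be modelled in a few lines of C. This is a simplified sketch, not the kernel's code: the bit names and values, the single global `cpu` state standing in for the per-CPU paca, and the `handled` counter are all hypothetical, and the sketch ignores hard-disabling and exception frames entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments for the irq_happened mask. */
enum { IRQ_DEC = 0x1, IRQ_DBELL = 0x2, IRQ_EE = 0x4 };

struct lazy_irq_state {
    bool     soft_enabled;  /* software view of "interrupts enabled" */
    uint8_t  irq_happened;  /* sources that fired while soft-disabled */
    unsigned handled;       /* number of handler invocations, for the demo */
};

static struct lazy_irq_state cpu = { .soft_enabled = true };

static void handle_irq(uint8_t source)
{
    (void)source;
    cpu.handled++;
}

/* An interrupt arrives: run it now, or just remember it if soft-disabled. */
static void deliver_irq(uint8_t source)
{
    if (!cpu.soft_enabled) {
        cpu.irq_happened |= source;
        return;
    }
    handle_irq(source);
}

static void soft_disable(void)
{
    cpu.soft_enabled = false;
}

/* Re-enable and replay whatever was recorded, one source at a time. */
static void soft_enable(void)
{
    cpu.soft_enabled = true;
    for (uint8_t bit = IRQ_DEC; bit <= IRQ_EE; bit <<= 1) {
        if (cpu.irq_happened & bit) {
            cpu.irq_happened &= (uint8_t)~bit;
            handle_irq(bit);
        }
    }
}
```

Because pending sources are kept as a bit mask, a source that fires several times while soft-disabled is replayed once on re-enable.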
In addition, this adds a few refinements:
- We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.
- Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.
- On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appears to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)
- We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.
Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2:
- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
to retrigger an interrupt without preventing hard-enable
v3:
- Fix or vs. ori bug on Book3E
- Fix enabling of interrupts for some exceptions on Book3E
v4:
- Fix resend of doorbells on return from interrupt on Book3E
v5:
- Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.
v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq
v7:
- Fix a bug with hard irq state tracking on native power7
2012-03-06 14:27:59 +07:00
|
|
|
* at return r3 = 0 for success, 1 for page fault, negative for error
|
2009-06-03 04:17:38 +07:00
|
|
|
*/
|
2015-12-01 10:36:44 +07:00
|
|
|
mr r4,r12
|
2014-12-04 12:30:14 +07:00
|
|
|
ld r6,_DSISR(r1)
|
2015-12-01 10:36:44 +07:00
|
|
|
bl __hash_page /* build HPTE if possible */
|
|
|
|
cmpdi r3,0 /* see if __hash_page succeeded */
|
2009-06-03 04:17:38 +07:00
|
|
|
|
2012-03-06 14:27:59 +07:00
|
|
|
/* Success */
|
2009-06-03 04:17:38 +07:00
|
|
|
beq fast_exc_return_irq /* Return from exception on success */
|
|
|
|
|
2012-03-06 14:27:59 +07:00
|
|
|
/* Error */
|
|
|
|
blt- 13f
|
2017-06-14 01:42:00 +07:00
|
|
|
|
|
|
|
/* Reload DSISR into r4 for the DABR check below */
|
|
|
|
ld r4,_DSISR(r1)
|
2017-10-19 11:08:43 +07:00
|
|
|
#endif /* CONFIG_PPC_BOOK3S_64 */
|
2010-03-30 06:59:25 +07:00
|
|
|
|
2009-06-03 04:17:38 +07:00
|
|
|
/* Here we have a page fault that hash_page can't handle. */
|
|
|
|
handle_page_fault:
|
2017-06-14 01:42:00 +07:00
|
|
|
11: andis. r0,r4,DSISR_DABRMATCH@h
|
|
|
|
bne- handle_dabr_fault
|
|
|
|
ld r4,_DAR(r1)
|
2009-06-03 04:17:38 +07:00
|
|
|
ld r5,_DSISR(r1)
|
|
|
|
addi r3,r1,STACK_FRAME_OVERHEAD
|
2014-02-04 12:04:35 +07:00
|
|
|
bl do_page_fault
|
2009-06-03 04:17:38 +07:00
|
|
|
cmpdi r3,0
|
powerpc/watchpoint: Restore NV GPRs while returning from exception
powerpc hardware triggers a watchpoint before executing the instruction.
To provide trigger-after-execute behavior, the kernel emulates the
instruction. If the instruction is 'load something into non-volatile
register', the exception handler should restore the emulated register
state when returning, otherwise there will be register state
corruption. eg, adding a watchpoint on a list can corrupt the list:
# cat /proc/kallsyms | grep kthread_create_list
c00000000121c8b8 d kthread_create_list
Add watchpoint on kthread_create_list->prev:
# perf record -e mem:0xc00000000121c8c0
Run some workload such that new kthread gets invoked. eg, I just
logged out from console:
list_add corruption. next->prev should be prev (c000000001214e00), \
but was c00000000121c8b8. (next=c00000000121c8b8).
WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
...
NIP __list_add_valid+0xb4/0xc0
LR __list_add_valid+0xb0/0xc0
Call Trace:
__list_add_valid+0xb0/0xc0 (unreliable)
__kthread_create_on_node+0xe0/0x260
kthread_create_on_node+0x34/0x50
create_worker+0xe8/0x260
worker_thread+0x444/0x560
kthread+0x160/0x1a0
ret_from_kernel_thread+0x5c/0x70
List corruption happened because it uses 'load into non-volatile
register' instruction:
Snippet from __kthread_create_on_node:
c000000000136be8: addis r29,r2,-19
c000000000136bec: ld r29,31424(r29)
if (!__list_add_valid(new, prev, next))
c000000000136bf0: mr r3,r30
c000000000136bf4: mr r5,r28
c000000000136bf8: mr r4,r29
c000000000136bfc: bl c00000000059a2f8 <__list_add_valid+0x8>
Register state from WARN_ON():
GPR00: c00000000059a3a0 c000007ff23afb50 c000000001344e00 0000000000000075
GPR04: 0000000000000000 0000000000000000 0000001852af8bc1 0000000000000000
GPR08: 0000000000000001 0000000000000007 0000000000000006 00000000000004aa
GPR12: 0000000000000000 c000007ffffeb080 c000000000137038 c000005ff62aaa00
GPR16: 0000000000000000 0000000000000000 c000007fffbe7600 c000007fffbe7370
GPR20: c000007fffbe7320 c000007fffbe7300 c000000001373a00 0000000000000000
GPR24: fffffffffffffef7 c00000000012e320 c000007ff23afcb0 c000000000cb8628
GPR28: c00000000121c8b8 c000000001214e00 c000007fef5b17e8 c000007fef5b17c0
Watchpoint hit at 0xc000000000136bec.
addis r29,r2,-19
=> r29 = 0xc000000001344e00 + (-19 << 16)
=> r29 = 0xc000000001214e00
ld r29,31424(r29)
=> r29 = *(0xc000000001214e00 + 31424)
=> r29 = *(0xc00000000121c8c0)
0xc00000000121c8c0 is where we placed a watchpoint and thus this
instruction was emulated by emulate_step. But because handle_dabr_fault
did not restore emulated register state, r29 still contains stale
value in above register state.
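
The address arithmetic above can be re-checked with a small C sketch. The helper names (`ppc_addis`, `ppc_load_ea`, `watched_address`) are made up for illustration and are not kernel functions; the constants are copied from the register dump and the disassembly in this message.

```c
#include <assert.h>
#include <stdint.h>

/* addis rt,ra,si: rt = ra + (sign-extended 16-bit immediate << 16). */
static uint64_t ppc_addis(uint64_t ra, int16_t si)
{
	/* Multiply instead of shifting: left-shifting a negative value is UB in C. */
	return ra + (uint64_t)((int64_t)si * 65536);
}

/* Effective address of "ld rt,disp(ra)": base register plus displacement. */
static uint64_t ppc_load_ea(uint64_t ra, int32_t disp)
{
	return ra + (uint64_t)(int64_t)disp;
}

/* Recompute the address that the faulting load accessed. */
static uint64_t watched_address(void)
{
	const uint64_t r2 = 0xc000000001344e00ULL;	/* GPR02 in the dump */
	uint64_t r29 = ppc_addis(r2, -19);

	/* This intermediate value is the stale one still visible in GPR29. */
	assert(r29 == 0xc000000001214e00ULL);
	return ppc_load_ea(r29, 31424);
}
```

The result is 0xc00000000121c8c0, which is kthread_create_list (0xc00000000121c8b8) plus 8, that is, the `prev` pointer of the list head the watchpoint was placed on.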
Fixes: 5aae8a5370802 ("powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors")
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: stable@vger.kernel.org # 2.6.36+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-13 10:30:14 +07:00
	beq+	ret_from_except_lite
2014-02-04 12:04:35 +07:00
	bl	save_nvgprs
2009-06-03 04:17:38 +07:00
	mr	r5,r3
	addi	r3,r1,STACK_FRAME_OVERHEAD
	lwz	r4,_DAR(r1)
2014-02-04 12:04:35 +07:00
	bl	bad_page_fault
	b	ret_from_except
2009-06-03 04:17:38 +07:00
2012-03-07 12:48:45 +07:00
/* We have a data breakpoint exception - handle it */
handle_dabr_fault:
2014-02-04 12:04:35 +07:00
	bl	save_nvgprs
2012-03-07 12:48:45 +07:00
	ld	r4,_DAR(r1)
	ld	r5,_DSISR(r1)
	addi	r3,r1,STACK_FRAME_OVERHEAD
2014-02-04 12:04:35 +07:00
	bl	do_break
2019-06-13 10:30:14 +07:00
	/*
	 * do_break() may have changed the NV GPRS while handling a breakpoint.
	 * If so, we need to restore them with their updated values. Don't use
	 * ret_from_except_lite here.
	 */
	b	ret_from_except
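
Why the full return path matters here can be shown with a deliberately simplified C model. It is not kernel code: the struct and function names are hypothetical, and it only captures the idea that the "lite" return reloads volatile registers while the full return also reloads r14-r31 from the exception frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NGPRS		32
#define FIRST_NV_GPR	14	/* r14-r31 are non-volatile in the 64-bit ELF ABI */

struct model_regs { uint64_t gpr[NGPRS]; };

/* Exception entry plus save_nvgprs: every GPR is copied into the frame. */
static void model_save_all(const struct model_regs *cpu, struct model_regs *frame)
{
	memcpy(frame, cpu, sizeof(*frame));
}

/* "Lite" return: only the volatile registers are reloaded from the frame. */
static void model_return_lite(struct model_regs *cpu, const struct model_regs *frame)
{
	for (int i = 0; i < FIRST_NV_GPR; i++)
		cpu->gpr[i] = frame->gpr[i];
}

/* Full return: the non-volatile registers are reloaded as well. */
static void model_return_full(struct model_regs *cpu, const struct model_regs *frame)
{
	memcpy(cpu, frame, sizeof(*cpu));
}

/* Live value of r29 after a watchpoint on "ld r29,..." has been handled. */
static uint64_t model_r29_after_watchpoint(bool full_return)
{
	struct model_regs cpu = { { 0 } }, frame;

	cpu.gpr[29] = 0xc000000001214e00ULL;	/* left there by the addis */
	model_save_all(&cpu, &frame);
	frame.gpr[29] = 0xc00000000121c8b8ULL;	/* emulated load updates the frame only */

	if (full_return)
		model_return_full(&cpu, &frame);
	else
		model_return_lite(&cpu, &frame);
	return cpu.gpr[29];
}
```

In this model the lite return leaves r29 at its pre-load value, which matches the stale GPR29 in the commit message, while the full return picks up the emulated result.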
2012-03-07 12:48:45 +07:00
2009-06-03 04:17:38 +07:00
2017-10-19 11:08:43 +07:00
#ifdef CONFIG_PPC_BOOK3S_64
2009-06-03 04:17:38 +07:00
/* We have a page fault that hash_page could handle but HV refused
 * the PTE insertion
 */
2014-02-04 12:04:35 +07:00
13:	bl	save_nvgprs
2009-06-03 04:17:38 +07:00
	mr	r5,r3
	addi	r3,r1,STACK_FRAME_OVERHEAD
	ld	r4,_DAR(r1)
2014-02-04 12:04:35 +07:00
	bl	low_hash_fault
	b	ret_from_except
2016-04-29 20:26:07 +07:00
#endif
2009-06-03 04:17:38 +07:00
powerpc: Allow perf_counters to access user memory at interrupt time
This provides a mechanism to allow the perf_counters code to access
user memory in a PMU interrupt routine. Such an access can cause
various kinds of interrupt: SLB miss, MMU hash table miss, segment
table miss, or TLB miss, depending on the processor. This commit
only deals with 64-bit classic/server processors, which use an MMU
hash table. 32-bit processors are already able to access user memory
at interrupt time. Since we don't soft-disable on 32-bit, we avoid
the possibility of reentering hash_page or the TLB miss handlers,
since they run with interrupts disabled.
On 64-bit processors, an SLB miss interrupt on a user address will
update the slb_cache and slb_cache_ptr fields in the paca. This is
OK except in the case where a PMU interrupt occurs in switch_slb,
which also accesses those fields. To prevent this, we hard-disable
interrupts in switch_slb. Interrupts are already soft-disabled at
this point, and will get hard-enabled when they get soft-enabled
later.
This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice,
and to make sure that it clears the slb_cache_ptr when called from
other callers than switch_slb, the existing routine is renamed to
__slb_flush_and_rebolt, which is called by switch_slb and the new
version of slb_flush_and_rebolt.
Similarly, switch_stab (used on POWER3 and RS64 processors) gets a
hard_irq_disable() to protect the per-cpu variables used there and
in ste_allocate.
If a MMU hashtable miss interrupt occurs, normally we would call
hash_page to look up the Linux PTE for the address and create a HPTE.
However, hash_page is fairly complex and takes some locks, so to
avoid the possibility of deadlock, we check the preemption count
to see if we are in a (pseudo-)NMI handler, and if so, we don't call
hash_page but instead treat it like a bad access that will get
reported up through the exception table mechanism. An interrupt
whose handler runs even though the interrupt occurred when
soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI
handler, which should use nmi_enter()/nmi_exit() rather than
irq_enter()/irq_exit().
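
The decision described in the last paragraph can be sketched in C. This is an illustrative model only: `SKETCH_NMI_MASK` is a made-up bit position (the kernel's real NMI_MASK depends on the version), and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit for this sketch; not the kernel's actual NMI_MASK. */
#define SKETCH_NMI_MASK	0x04000000u

enum sketch_dsi_action {
	SKETCH_CALL_HASH_PAGE,	/* normal path: look up the PTE, create an HPTE */
	SKETCH_BAD_ACCESS,	/* report through the exception table instead */
};

/* True when the preemption count says we are in a (pseudo-)NMI handler. */
static bool sketch_in_nmi(uint32_t preempt_count)
{
	return (preempt_count & SKETCH_NMI_MASK) != 0;
}

/* hash_page takes locks, so it is skipped when running in NMI context. */
static enum sketch_dsi_action sketch_hash_miss_action(uint32_t preempt_count)
{
	return sketch_in_nmi(preempt_count) ? SKETCH_BAD_ACCESS
					    : SKETCH_CALL_HASH_PAGE;
}
```

A handler entered through nmi_enter() would have the NMI bit set in its preemption count, so a hash miss taken inside it is routed to the bad-access path rather than to hash_page.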
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-17 12:17:54 +07:00
/*
 * We come here as a result of a DSI at a point where we don't want
 * to call hash_page, such as when we are accessing memory (possibly
 * user memory) inside a PMU interrupt that occurred while interrupts
 * were soft-disabled. We want to invoke the exception handler for
 * the access, or panic if there isn't a handler.
 */
2014-02-04 12:04:35 +07:00
77:	bl	save_nvgprs
2009-08-17 12:17:54 +07:00
	mr	r4,r3
	addi	r3,r1,STACK_FRAME_OVERHEAD
	li	r5,SIGSEGV
2014-02-04 12:04:35 +07:00
	bl	bad_page_fault
	b	ret_from_except
2014-07-15 17:25:02 +07:00
/*
 * Here we have detected that the kernel stack pointer is bad.
 * R9 contains the saved CR, r13 points to the paca,
 * r10 contains the (bad) kernel stack pointer,
 * r11 and r12 contain the saved SRR0 and SRR1.
 * We switch to using an emergency stack, save the registers there,
 * and call kernel_bad_stack(), which panics.
 */
bad_stack:
	ld	r1,PACAEMERGSP(r13)
	subi	r1,r1,64+INT_FRAME_SIZE
	std	r9,_CCR(r1)
	std	r10,GPR1(r1)
	std	r11,_NIP(r1)
	std	r12,_MSR(r1)
	mfspr	r11,SPRN_DAR
	mfspr	r12,SPRN_DSISR
	std	r11,_DAR(r1)
	std	r12,_DSISR(r1)
	mflr	r10
	mfctr	r11
	mfxer	r12
	std	r10,_LINK(r1)
	std	r11,_CTR(r1)
	std	r12,_XER(r1)
	SAVE_GPR(0,r1)
	SAVE_GPR(2,r1)
	ld	r10,EX_R3(r3)
	std	r10,GPR3(r1)
	SAVE_GPR(4,r1)
	SAVE_4GPRS(5,r1)
	ld	r9,EX_R9(r3)
	ld	r10,EX_R10(r3)
	SAVE_2GPRS(9,r1)
	ld	r9,EX_R11(r3)
	ld	r10,EX_R12(r3)
	ld	r11,EX_R13(r3)
	std	r9,GPR11(r1)
	std	r10,GPR12(r1)
	std	r11,GPR13(r1)
BEGIN_FTR_SECTION
	ld	r10,EX_CFAR(r3)
	std	r10,ORIG_GPR3(r1)
END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
	SAVE_8GPRS(14,r1)
	SAVE_10GPRS(22,r1)
	lhz	r12,PACA_TRAP_SAVE(r13)
	std	r12,_TRAP(r1)
	addi	r11,r1,INT_FRAME_SIZE
	std	r11,0(r1)
	li	r12,0
	std	r12,0(r11)
	ld	r2,PACATOC(r13)
	ld	r11,exception_marker@toc(r2)
	std	r12,RESULT(r1)
	std	r11,STACK_FRAME_OVERHEAD-16(r1)
1:	addi	r3,r1,STACK_FRAME_OVERHEAD
	bl	kernel_bad_stack
	b	1b
2017-06-30 00:49:19 +07:00
_ASM_NOKPROBE_SYMBOL(bad_stack);
2016-09-21 14:44:05 +07:00
2017-06-13 20:05:48 +07:00
/*
 * When doorbell is triggered from system reset wakeup, the message is
 * not cleared, so it would fire again when EE is enabled.
 *
 * When coming from local_irq_enable, there may be the same problem if
 * we were hard disabled.
 *
 * Execute msgclr to clear pending exceptions before handling it.
 */
h_doorbell_common_msgclr:
	LOAD_REG_IMMEDIATE(r3, PPC_DBELL_MSGTYPE << (63-36))
	PPC_MSGCLR(3)
	b	h_doorbell_common

doorbell_super_common_msgclr:
	LOAD_REG_IMMEDIATE(r3, PPC_DBELL_MSGTYPE << (63-36))
	PPC_MSGCLRP(3)
	b	doorbell_super_common
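
The operand loaded into r3 before msgclr/msgclrp is the message type shifted into its field. The C sketch below reproduces that shift. It assumes that PPC_DBELL_MSGTYPE resolves to the server doorbell type with value 5 on Book3S kernels, which is an assumption of this example rather than something shown in this file; the names prefixed `SKETCH_`/`sketch_` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: the Book3S message type is 5 (server doorbell). */
#define SKETCH_DBELL_MSGTYPE	5u

/*
 * Mirror of "PPC_DBELL_MSGTYPE << (63-36)": with IBM bit numbering
 * (bit 0 is the most significant bit), this places the low bit of the
 * type at bit 36 of the 64-bit register.
 */
static uint64_t sketch_msgclr_operand(unsigned int msgtype)
{
	return (uint64_t)msgtype << (63 - 36);
}
```

Under that assumption the value handed to msgclr is 0x28000000.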
2016-09-21 14:44:05 +07:00
/*
 * Called from arch_local_irq_enable when an interrupt needs
 * to be resent. r3 contains 0x500, 0x900, 0xa00 or 0xe80 to indicate
 * which kind of interrupt. MSR:EE is already off. We generate a
 * stackframe like if a real interrupt had happened.
 *
 * Note: While MSR:EE is off, we need to make sure that _MSR
 * in the generated frame has EE set to 1 or the exception
 * handler will not properly re-enable them.
2017-06-13 20:05:49 +07:00
 *
 * Note that we don't specify LR as the NIP (return address) for
 * the interrupt because that would unbalance the return branch
 * predictor.
2016-09-21 14:44:05 +07:00
 */
_GLOBAL(__replay_interrupt)
	/* We are going to jump to the exception common code which
	 * will retrieve various register values from the PACA which
	 * we don't give a damn about, so we don't bother storing them.
	 */
	mfmsr	r12
2017-08-22 08:51:37 +07:00
	LOAD_REG_ADDR(r11, replay_interrupt_return)
2016-09-21 14:44:05 +07:00
	mfcr	r9
	ori	r12,r12,MSR_EE
	cmpwi	r3,0x900
	beq	decrementer_common
	cmpwi	r3,0x500
2017-08-11 23:39:04 +07:00
BEGIN_FTR_SECTION
	beq	h_virt_irq_common
FTR_SECTION_ELSE
2016-09-21 14:44:05 +07:00
	beq	hardware_interrupt_common
2017-08-11 23:39:04 +07:00
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_300)
2017-12-20 10:55:53 +07:00
	cmpwi	r3,0xf00
	beq	performance_monitor_common
2016-09-21 14:44:05 +07:00
BEGIN_FTR_SECTION
2017-08-11 23:39:03 +07:00
	cmpwi	r3,0xa00
2017-06-13 20:05:48 +07:00
	beq	h_doorbell_common_msgclr
2016-09-21 14:44:05 +07:00
	cmpwi	r3,0xe60
	beq	hmi_exception_common
FTR_SECTION_ELSE
	cmpwi	r3,0xa00
2017-06-13 20:05:48 +07:00
	beq	doorbell_super_common_msgclr
2016-09-21 14:44:05 +07:00
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
2017-08-22 08:51:37 +07:00
replay_interrupt_return:
2016-09-21 14:44:05 +07:00
	blr
2017-06-13 20:05:49 +07:00
2017-06-30 00:49:19 +07:00
_ASM_NOKPROBE_SYMBOL(__replay_interrupt)
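
The compare-and-branch chain in __replay_interrupt amounts to a small dispatch table keyed on the vector in r3 and on CPU features. The C sketch below models it with a hypothetical function, `sketch_replay_target`, that returns the name of the label the assembly would branch to. It assumes that an `_IFSET` feature section with two flags is selected only when both flags are set, which is this example's reading of the feature-fixup macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns the label __replay_interrupt would reach for a given vector. */
static const char *sketch_replay_target(unsigned int vec, bool hvmode, bool arch_300)
{
	if (vec == 0x900)
		return "decrementer_common";
	if (vec == 0x500)
		return (hvmode && arch_300) ? "h_virt_irq_common"
					    : "hardware_interrupt_common";
	if (vec == 0xf00)
		return "performance_monitor_common";
	if (hvmode) {
		if (vec == 0xa00)
			return "h_doorbell_common_msgclr";
		if (vec == 0xe60)
			return "hmi_exception_common";
	} else if (vec == 0xa00) {
		return "doorbell_super_common_msgclr";
	}
	/* No match: fall through to the blr at replay_interrupt_return. */
	return "replay_interrupt_return";
}
```

For example, a replayed 0xa00 doorbell goes to the msgclr variant when running in hypervisor mode and to the msgclrp variant otherwise, and an unrecognised vector simply returns to the caller.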