The 'bus' member of struct eeh_dev is assigned to once but never used,
so remove it.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently a flag, EEH_POSTPONED_PROBE, is used to prevent an incorrect
message "EEH: No capable adapters found" from being displayed during
the boot of powernv systems.
It is necessary because, on powernv, the call to eeh_probe_devices()
made from eeh_init() is too early and EEH can't yet be enabled. A
second call is made later from eeh_pnv_post_init(), which succeeds.
(On pseries, the first call succeeds because PCI devices are set up
early enough and no second call is made.)
This can be simplified by moving the early call to eeh_probe_devices()
from eeh_init() (where it's seen by both platforms) to
pSeries_final_fixup(), so that each platform only calls
eeh_probe_devices() once, at a point where it can succeed.
This is slightly later in the boot sequence, but but still early
enough and it is now in the same place in the sequence for both
platforms (the pcibios_fixup hook).
The display of the message can be cleaned up as well, by moving it
into eeh_probe_devices().
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
eeh_add_to_parent_pe() sometimes removes the EEH_PE_KEEP flag, but it
incorrectly removes it from pe->type, instead of pe->state.
However, rather than clearing it from the correct field, remove it.
Inspection of the code shows that it can't ever have had any effect
(even if it had been cleared from the correct field), because the
field is never tested after it is cleared by the statement in
question.
The clear statement was added by commit 807a827d4e ("powerpc/eeh:
Keep PE during hotplug"), but it didn't explain why it was necessary.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If a device is removed during EEH processing (either by a driver's
handler or as part of recovery), it can lead to a null dereference
in eeh_pe_report_edev().
To handle this, skip devices that have been removed.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If an error occurs during an unplug operation, it's possible for
eeh_dump_dev_log() to be called when edev->pdn is null, which
currently leads to dereferencing a null pointer.
Handle this by skipping the error log for those devices.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently when we get an unknown RTAS event it prints the type as
"Unknown" and no other useful information. Add the raw type code to the
log message so that we have something to work off.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The powerpc kernel uses setjmp which causes a warning when building
with clang:
In file included from arch/powerpc/xmon/xmon.c:51:
./arch/powerpc/include/asm/setjmp.h:15:13: error: declaration of
built-in function 'setjmp' requires inclusion of the header <setjmp.h>
[-Werror,-Wbuiltin-requires-header]
extern long setjmp(long *);
^
./arch/powerpc/include/asm/setjmp.h:16:13: error: declaration of
built-in function 'longjmp' requires inclusion of the header <setjmp.h>
[-Werror,-Wbuiltin-requires-header]
extern void longjmp(long *, long);
^
This *is* the header and we're not using the built-in setjump but
rather the one in arch/powerpc/kernel/misc.S. As the compiler warning
does not make sense, it for the files where setjmp is used.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
[mpe: Move subdir-ccflags in xmon/Makefile to not clobber -Werror]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
instructions_to_print var is assigned value 16 and there is no
way to change it.
This patch replaces it by a constant.
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When two processes crash at the same time, we sometimes encounter
interleaving in the middle of a line:
init[1]: segfault (11) at 0 nip 0 lr 0 code 1
init[1]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
init[74]: segfault (11) at 10a74 nip 1000c198 lr 100078c8 code 1 in sh[10000000+14000]
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
init[1]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
init[74]: code: 90010024 bf61000c 91490a7c 3fa01002 3be00000 7d3e4b78 3bbd0c20 3b600000
init[74]: code: 3b9d0040 7c7fe02e 2f830000 419e0028 <89230000> 2f890000 41be001c 4b7f6e79
This patch fixes it by preparing complete lines in a buffer and
printing it at once.
Fixes: 88b0fe1757 ("powerpc: Add show_user_instructions()")
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Use seq_buf_printf() not seq_buf_puts() which doesn't NULL terminate]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As spotted by sparse:
arch/powerpc/kernel/process.c:1302:6: warning: symbol 'show_user_instructions' was not declared. Should it be static?
Fixes: 88b0fe1757 ("powerpc: Add show_user_instructions()")
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch fixes the following warnings, which are leftovers
from when __get_user() was replaced by probe_kernel_address().
arch/powerpc/kernel/process.c:1287:22: warning: incorrect type in argument 2 (different address spaces)
arch/powerpc/kernel/process.c:1287:22: expected void const *src
arch/powerpc/kernel/process.c:1287:22: got unsigned int [noderef] <asn:1>*<noident>
arch/powerpc/kernel/process.c:1319:21: warning: incorrect type in argument 2 (different address spaces)
arch/powerpc/kernel/process.c:1319:21: expected void const *src
arch/powerpc/kernel/process.c:1319:21: got unsigned int [noderef] <asn:1>*<noident>
Fixes: 7b051f665c ("powerpc: Use probe_kernel_address in show_instructions")
Reviewed-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Recently we implemented show_user_instructions() which dumps the code
around the NIP when a user space process dies with an unhandled
signal. This was modelled on the x86 code, and we even went so far as
to implement the exact same bug, namely that if the user process
crashed with its NIP pointing into the kernel we will dump kernel text
to dmesg. eg:
bad-bctr[2996]: segfault (11) at c000000000010000 nip c000000000010000 lr 12d0b0894 code 1
bad-bctr[2996]: code: fbe10068 7cbe2b78 7c7f1b78 fb610048 38a10028 38810020 fb810050 7f8802a6
bad-bctr[2996]: code: 3860001c f8010080 48242371 60000000 <7c7b1b79> 4082002c e8010080 eb610048
This was discovered on x86 by Jann Horn and fixed in commit
342db04ae7 ("x86/dumpstack: Don't dump kernel memory based on usermode RIP").
Fix it by checking the adjusted NIP value (pc) and number of
instructions against USER_DS, and bail if we fail the check, eg:
bad-bctr[2969]: segfault (11) at c000000000010000 nip c000000000010000 lr 107930894 code 1
bad-bctr[2969]: Bad NIP, not dumping instructions.
Fixes: 88b0fe1757 ("powerpc: Add show_user_instructions()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
with POWER9, and the result is undefined on earlier CPUs.
Commits 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and
d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on
POWER9") caused POWER7/8 code to use this instruction. Remove it. An
ERAT flush can be made by invalidatig the SLB, but before POWER9 that
requires a flush and rebolt.
Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If CONFIG_PPC_WATCHDOG is enabled we always cap the decrementer to
0x7fffffff:
if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
set_dec(0x7fffffff);
else
set_dec(decrementer_max);
If there are no future events, we don't reprogram the decrementer
after this and we end up with 0x7fffffff even on a large decrementer
capable system.
As suggested by Nick, add a set_state_oneshot_stopped callback
so we program the decrementer with decrementer_max if there are
no future events.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add call to early_memtest() so that kernel compiled with
CONFIG_MEMTEST really perform memtest at startup when requested
via 'memtest' boot parameter.
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The comments in this file don't conform to the coding style so take
them to "Comment Formatting Re-Education Camp".
Suggested-by: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
[mpe: Reflow some comments and add full stops, fix spelling of Sergeant.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The code in machine_check_exception excludes 64s hvmode when
incrementing the MCE counter only to call opal_machine_check to
increment it specifically for this case.
Remove the exclusion and special case.
Fixes: a43c159042 ("powerpc/pseries: Flush SLB contents on SLB MCE
errors.")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On a kernel TM Bad thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bits
value but MSR is a 64 bits register for all platforms that have HTM
enabled.
This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it trimmed
MSR to 32-bits (int).
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.
This function is not necessary, since MSR_TM_ACTIVE() just do the same and
could be used, removing the dualism and simplifying the code.
This patchset remove every instance of msr_tm_active() and replaced it
by MSR_TM_ACTIVE().
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a patch that adds support for PTRACE_SYSEMU ptrace request in
PowerPC architecture.
When ptrace(PTRACE_SYSEMU, ...) request is called, it will be handled by
the arch independent function ptrace_resume(), which will tag the task with
the TIF_SYSCALL_EMU flag. This flag needs to be handled from a platform
dependent point of view, which is what this patch does.
This patch adds this task's flag as part of the _TIF_SYSCALL_DOTRACE, which
is the MACRO that is used to trace syscalls at entrance/exit.
Since TIF_SYSCALL_EMU is now part of _TIF_SYSCALL_DOTRACE, if the task has
_TIF_SYSCALL_DOTRACE set, it will hit do_syscall_trace_enter() at syscall
entrance and do_syscall_trace_leave() at syscall leave.
do_syscall_trace_enter() needs to handle the TIF_SYSCALL_EMU flag properly,
which will interrupt the syscall executing if TIF_SYSCALL_EMU is set. The
output values should not be changed, i.e. the return value (r3) should
contain the original syscall argument on exit.
With this flag set, the syscall is not executed fundamentally, because
do_syscall_trace_enter() is returning -1 which is bigger than NR_syscall,
thus, skipping the syscall execution and exiting userspace.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Moving TIF_32BIT to use bit 20 instead of 4 in the task flag field.
This change is making room for an upcoming new task macro
(_TIF_SYSCALL_EMU) which is preferred to set a bit in the lower 16-bits
part of the word.
This upcoming flag macro will take part in a composed macro
(_TIF_SYSCALL_DOTRACE) which will contain other flags as well, and it is
preferred that the whole _TIF_SYSCALL_DOTRACE macro only sets the lower 16
bits of a word, so, it could be handled using immediate operations (as load
immediate, add immediate, ...) where the immediate operand (SI) is limited
to 16-bits.
Another possible solution would be using the LOAD_REG_IMMEDIATE() macro
to load a full 64-bits word immediate, but it takes 5 operations instead of
one.
Having TIF_32BITS being redefined to use an upper bit is not a problem
since there is only one place in the assembly code where TIF_32BIT is being
used, and it could be replaced with an operation with right shift (addis),
since it is used alone, i.e. not being part of a composed macro, which has
different bits set, and would require LOAD_REG_IMMEDIATE().
Tested on a 64 bits Big Endian machine running a 32 bits task.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PPC64, as register r13 points to the paca_struct at all time,
this patch adds a copy of the canary there, which is copied at
task_switch.
That new canary is then used by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r13
-mstack-protector-guard-offset=offsetof(struct paca_struct, canary))
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This functionality was tentatively added in the past
(commit 6533b7c16e ("powerpc: Initial stack protector
(-fstack-protector) support")) but had to be reverted
(commit f2574030b0 ("powerpc: Revert the initial stack
protector support") because of GCC implementing it differently
whether it had been built with libc support or not.
Now, GCC offers the possibility to manually set the
stack-protector mode (global or tls) regardless of libc support.
This time, the patch selects HAVE_STACKPROTECTOR only if
-mstack-protector-guard=tls is supported by GCC.
On PPC32, as register r2 points to current task_struct at
all time, the stack_canary located inside task_struct can be
used directly by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r2
-mstack-protector-guard-offset=offsetof(struct task_struct, stack_canary))
The protector is disabled for prom_init and bootx_init as
it is too early to handle it properly.
$ echo CORRUPT_STACK > /sys/kernel/debug/provoke-crash/DIRECT
[ 134.943666] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: lkdtm_CORRUPT_STACK+0x64/0x64
[ 134.943666]
[ 134.955414] CPU: 0 PID: 283 Comm: sh Not tainted 4.18.0-s3k-dev-12143-ga3272be41209 #835
[ 134.963380] Call Trace:
[ 134.965860] [c6615d60] [c001f76c] panic+0x118/0x260 (unreliable)
[ 134.971775] [c6615dc0] [c001f654] panic+0x0/0x260
[ 134.976435] [c6615dd0] [c032c368] lkdtm_CORRUPT_STACK_STRONG+0x0/0x64
[ 134.982769] [c6615e00] [ffffffff] 0xffffffff
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPC32 uses nonrecoverable_exception() while PPC64 uses
unrecoverable_exception().
Both functions are doing almost the same thing.
This patch removes nonrecoverable_exception()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reverts commits:
5e46e29e6a ("powerpc/64s/hash: convert SLB miss handlers to C")
8fed04d0f6 ("powerpc/64s/hash: remove user SLB data from the paca")
655deecf67 ("powerpc/64s/hash: SLB allocation status bitmaps")
2e1626744e ("powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup")
89ca4e126a ("powerpc/64s/hash: Add a SLB preload cache")
This series had a few bugs, and the fixes are not all trivial. So
revert most of it for now.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Current we store the userspace r1 to PACATMSCRATCH before finally
saving it to the thread struct.
In theory an exception could be taken here (like a machine check or
SLB miss) that could write PACATMSCRATCH and hence corrupt the
userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
others do.
We've never actually seen this happen but it's theoretically
possible. Either way, the code is fragile as it is.
This patch saves r1 to the kernel stack (which can't fault) before we
turn MSR[RI] back on. PACATMSCRATCH is still used but only with
MSR[RI] off. We then copy r1 from the kernel stack to the thread
struct once we have MSR[RI] back on.
Suggested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When we treclaim we store the userspace checkpointed r13 to a scratch
SPR and then later save the scratch SPR to the user thread struct.
Unfortunately, this doesn't work as accessing the user thread struct
can take an SLB fault and the SLB fault handler will write the same
scratch SPRG that now contains the userspace r13.
To fix this, we store r13 to the kernel stack (which can't fault)
before we access the user thread struct.
Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
as a random userspace segfault with r13 looking like a kernel address.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Firmware-Assisted Dump (FADump) needs to be registered again after any
memory hot add/remove operation to update the crash memory ranges. But
currently, the kernel returns '-EEXIST' if we try to register without
uregistering it first. This could expose the system to racing issues
while unregistering and registering FADump from userspace during udev
events. Spare the userspace of this and let it be taken care of in the
kernel space for a simpler interface.
Since this change, running 'echo 1 > /sys/kernel/fadump_registered'
would result in re-regisering (unregistering and registering) FADump,
if it was already registered.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In prom_check_platform_support() we retrieve and parse the
"ibm,arch-vec-5-platform-support" property of the chosen node.
Currently we use a variable length array however to avoid this use an
array of constant length 8.
This property is used to indicate the supported options of vector 5
bytes 23-26 of the ibm,architecture.vec node. Each of these options
is a pair of bytes, thus for 4 options we have a max length of 8 bytes.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When performing partition migrations all present CPUs must be online
as all present CPUs must make the H_JOIN call as part of the migration
process. Once all present CPUs make the H_JOIN call, one CPU is returned
to make the rtas call to perform the migration to the destination system.
During testing of migration and changing the SMT state we have found
instances where CPUs are offlined, as part of the SMT state change,
before they make the H_JOIN call. This results in a hung system where
every CPU is either in H_JOIN or offline.
To prevent this this patch disables CPU hotplug during the migration
process.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.
Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.
With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security masures
disabled) increases from 140K/s to 178K/s (27%).
POWER8 does not change much (within 1%), it's unclear why it does not
see a big gain like POWER9.
Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
User SLB mappig data is copied into the PACA from the mm->context so
it can be accessed by the SLB miss handlers.
After the C conversion, SLB miss handlers now run with relocation on,
and user SLB misses are able to take recursive kernel SLB misses, so
the user SLB mapping data can be removed from the paca and accessed
directly.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.
This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.
Arbitrary kernel memory may not be accessed when handling kernel space
SLB misses, so care should be taken there. However user SLB misses can
access any kernel memory, which can be used to move some fields out of
the paca (in later patches).
User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except, however that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.
[ Credit to Aneesh for bug fixes, error checks, and improvements to bad
address handling, etc ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Since RFC:
- Added MSR[RI] handling
- Fixed up a register loss bug exposed by irq tracing (Aneesh)
- Reject misses outside the defined kernel regions (Aneesh)
- Added several more sanity checks and error handling (Aneesh), we may
look at consolidating these tests and tightenig up the code but for
a first pass we decided it's better to check carefully.
Since v1:
- Fixed SLB cache corruption (Aneesh)
- Fixed untidy SLBE allocation "leak" in get_vsid error case
- Now survives some stress testing on real hardware
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
I only have POWER8/9 to test, so just remove it for those.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that other platforms also implements real mode mce handler,
lets consolidate the code by sharing existing powernv machine check
early code. Rename machine_check_powernv_early to
machine_check_common_early and reuse the code.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On pseries, as of today system crashes if we get a machine check
exceptions due to SLB errors. These are soft errors and can be fixed
by flushing the SLBs so the kernel can continue to function instead of
system crash. We do this in real mode before turning on MMU. Otherwise
we would run into nested machine checks. This patch now fetches the
rtas error log in real mode and flushes the SLBs on SLB/ERAT errors.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michal Suchanek <msuchanek@suse.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The tbl pointer is being derefenced by IOMMU_PAGE_SIZE prior the check
if it is not NULL.
Just moving the dereference code to after the check, where there will
be guarantee that 'tbl' will not be NULL.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch simply fix part of the documentation on the HTM code.
This fixes reference to old fields that were renamed in commit
000ec280e3 ("powerpc: tm: Rename transct_(*) to ck(\1)_state")
It also documents better the flow after commit eb5c3f1c86 ("powerpc:
Always save/restore checkpointed regs during treclaim/trecheckpoint"),
where tm_recheckpoint can recheckpoint what is in ck{fp,vr}_state
blindly.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When we come into the softpatch handler (0x1500), we use r11 to store
the HSRR0 for later use by the denorm handler.
We also use the softpatch handler for the TM workarounds for
POWER9. Unfortunately, in kvmppc_interrupt_hv we later store r11 out
to the vcpu assuming it's still what we got from userspace.
This causes r11 to be corrupted in the VCPU and hence when we restore
the guest, we get a corrupted r11. We've seen this when running TM
tests inside guests on P9.
This fixes the problem by only touching r11 in the denorm case.
Fixes: 4bb3c7a020 ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9")
Cc: <stable@vger.kernel.org> # 4.17+
Test-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Call Frame Information is used by gdb for back-traces and inserting
breakpoints on function return for the "finish" command. This failed
when inside __kernel_clock_gettime. More concerning than difficulty
debugging is that CFI is also used by stack frame unwinding code to
implement exceptions. If you have an app that needs to handle
asynchronous exceptions for some reason, and you are unlucky enough to
get one inside the VDSO time functions, your app will crash.
What's wrong: There is control flow in __kernel_clock_gettime that
reaches label 99 without saving lr in r12. CFI info however is
interpreted by the unwinder without reference to control flow: It's a
simple matter of "Execute all the CFI opcodes up to the current
address". That means the unwinder thinks r12 contains the return
address at label 99. Disabuse it of that notion by resetting CFI for
the return address at label 99.
Note that the ".cfi_restore lr" could have gone anywhere from the
"mtlr r12" a few instructions earlier to the instruction at label 99.
I put the CFI as late as possible, because in general that's best
practice (and if possible grouped with other CFI in order to reduce
the number of CFI opcodes executed when unwinding). Using r12 as the
return address is perfectly fine after the "mtlr r12" since r12 on
that code path still contains the return address.
__get_datapage also has a CFI error. That function temporarily saves
lr in r0, and reflects that fact with ".cfi_register lr,r0". A later
use of r0 means the CFI at that point isn't correct, as r0 no longer
contains the return address. Fix that too.
Signed-off-by: Alan Modra <amodra@gmail.com>
Tested-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently on P9N DD2.1 we end up taking infinite TM facility
unavailable exceptions on the first TM usage by userspace.
In the special case of TM no suspend (P9N DD2.1), Linux is told TM is
off via CPU dt-ftrs but told to (partially) use it via
OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED. So HFSCR[TM] will be off from
dt-ftrs but we need to turn it on for the no suspend case.
This patch fixes this by enabling HFSCR TM in this case.
Cc: stable@vger.kernel.org # 4.15+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- An implementation for the newly added hv_ops->flush() for the OPAL hvc
console driver backends, I forgot to apply this after merging the hvc driver
changes before the merge window.
- Enable all PCI bridges at boot on powernv, to avoid races when multiple
children of a bridge try to enable it simultaneously. This is a workaround
until the PCI core can be enhanced to fix the races.
- A fix to query PowerVM for the correct system topology at boot before
initialising sched domains, seen in some configurations to cause broken
scheduling etc.
- A fix for pte_access_permitted() on "nohash" platforms.
- Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9 due to
a workaround when using the nest MMU (GPUs, accelerators).
- Another fix to the VFIO code used by KVM, the previous fix had some bugs
which caused guests to not start in some configurations.
- A handful of other minor fixes.
Thanks to:
Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy, Hari Bathini, Luke
Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul Mackerras, Srikar Dronamraju.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlt//10THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLUGD/9y/MPrs+V2lNFhxP+l1jO4Ro0v8DPI
vARjqq06WfppgQgSS1dvWfzLaMFbGe8wPRfL0T1xMOXCZg8Ts/HrgxVBVFYcv6/Q
xBaU5bKztg6HKQbwwO+8B/gdTA3hO7yFVux6JGwsGO5Ebl8Q3UDOdbvYX6XTj3H8
rdiicle9LaI7qodC8bxlBo1Be0YKEW0O/ag179sxXzozzvoPIyFpeX6FL1sAft++
XlQS1MHu4hErSM2rbmyoFCm+SmyRt3CD0NTVmNd2cgw5XexPOBFlnsdgpaK1jJFc
CYu1chP83E91ol1/8NAPkcmPWvP6MGyoqOl75RghooY2D1IJ2GtLKoz2Dvc53ay2
ZlIpMyc2CYIa55Mj18tOV/NbGbh0Lf0Ta++BxqxbcCDt5fq0VxHkoDPqBgWh/tdp
Po7oQc7U2VTKKC3wiLr//nSHpgtSTAWtucDt7oT7GdP8+EMxUZ8teBFIfTkwfuD2
plroEmcYuRD3beI4FAG/iCp5POOCsnHLkKVDl7tyQPl3Yvu8hvyLY9gBS9RN4Unt
/z94YFJtz7UD+VP7jDaPiQS26y0WEJfW8ml0tNxMZVBdksZLPIPN0/EU/04V0jWf
GxynzKMhElxC67tK/Eb43EGiLEZAlbEnJxoOtiCnL1MK+OIJPjg6e7o6qlsmaa+Y
zkXVpxLtkq0OoQ==
=1AUa
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- An implementation for the newly added hv_ops->flush() for the OPAL
hvc console driver backends, I forgot to apply this after merging the
hvc driver changes before the merge window.
- Enable all PCI bridges at boot on powernv, to avoid races when
multiple children of a bridge try to enable it simultaneously. This
is a workaround until the PCI core can be enhanced to fix the races.
- A fix to query PowerVM for the correct system topology at boot before
initialising sched domains, seen in some configurations to cause
broken scheduling etc.
- A fix for pte_access_permitted() on "nohash" platforms.
- Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9
due to a workaround when using the nest MMU (GPUs, accelerators).
- Another fix to the VFIO code used by KVM, the previous fix had some
bugs which caused guests to not start in some configurations.
- A handful of other minor fixes.
Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy,
Hari Bathini, Luke Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul
Mackerras, Srikar Dronamraju.
* tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mce: Fix SLB rebolting during MCE recovery path.
KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages
powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition
powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.
powerpc/nohash: fix pte_access_permitted()
powerpc/topology: Get topology for shared processors at boot
powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
powerpc/powernv/pci: Work around races in PCI bridge enabling
powerpc/fadump: cleanup crash memory ranges support
powerpc/powernv: provide a console flush operation for opal hvc driver
powerpc/traps: Avoid rate limit messages from show unhandled signals
powerpc/64s: Fix PACA_IRQ_HARD_DIS accounting in idle_power4()