The hcall H_INT_RESET should be called to make sure XIVE is fully
reseted.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The hcall H_INT_RESET can take some time to complete and in such cases
it returns H_LONG_BUSY_* codes requiring the machine to sleep for a
while before retrying.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The kexec_state KEXEC_STATE_IRQS_OFF barrier is reached by all
secondary CPUs before the kexec_cpu_down() operation is called on
secondaries. This can raise conflicts and provoque errors in the XIVE
hcalls when XIVE is shutdown with H_INT_RESET on the primary CPU.
To synchronize the kexec_cpu_down() operations and make sure the
secondaries have completed their task before the primary starts doing
the same, let's move the primary kexec_cpu_down() after the
KEXEC_STATE_REAL_MODE barrier.
This change of the ending sequence of kexec is mostly useful on the
pseries platform but it impacts also the powernv, ps3 and 85xx
platforms. powernv can be easily tested and fixed but some caution is
required for the other two.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For consideration:
* Add NVDIMM support - Enables greater testing, mambo device.
* Add IPv6 support built in + additional modules - Because it's 2018 maan.
* Add DEFERRED_STRUCT_PAGE_INIT - Let's see what breaks.
* Add PPC_MEMTRACE - Small powernv debugfs driver for getting hardware traces.
* Add MEMORY_FAILURE - Machine check exceptions can now drive memory failure.
* Turn on FANOTIFY - This is the current filesystem notification feature.
* Turn on SCOM_DEBUGFS - Handy for hardware/firmware debugging, security risk?
* Turn on async SCSI scanning - Let's see what breaks.
* Add MLX5 driver as a module - Popular demand.
* Add CRYPTO_CRCT10DIF_VPMSUM - POWER8 T10DIF acceleration.
* Make a bunch of USB hid drivers modules.
* Make SCSI SG, SR, and FC modules - FC is huge.
* Make video drivers except AST GPU modules - Also huge.
* Make PCI serial driver a module - Uncommon.
* Make more things modules, NFS FS, RAM disk, netconsole, MS-DOS fs.
* Get rid of /dev/port - Not used.
* Remove PPS and PTP subsystms - Unusual.
* Remove legacy BSD ttys - Long dead.
* Remove IDE - Deprecated and replaced with ATA.
* Remove WIRELESS - Until we get POWER9 laptops.
* Remove RAW - Long deprecated in favour of direct IO.
* Remove floppy, parport, and PS2 input devices - not supported.
* Remove virtio drivers, ballooning - We're host only.
* Remove PPP - Sorry Paulus.
This results in a significantly smaller vmlinux:
text data bss dec filename
13143383 5277944 1317856 19739183 vanilla
12263281 4852074 1341720 18457075 patched
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The B43 driver only needs CONFIG_SSB to support the WLAN card found in
the Wii. Configure it accordingly, and disable BCMA bus support to save
a bit of space.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This allows access to the SD card and the BCM4318 Wifi module.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that there's a GPIO driver for the Wii, let's enable the following
drivers:
- the GPIO driver itself
- gpio-keys
- gpio-poweroff
- gpio-leds and a few LED triggers
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The Wii doesn't have built-in Ethernet and USB Ethernet adapters are in
a different menu. Disable CONFIG_ETHERNET to save some space in support
code for Ethernet drivers.
Note that this patch doesn't disable any Ethernet drivers, because they
are not enabled by default.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The hcall_exit() tracepoint has retval defined as unsigned long. That
leads to humours results like:
bash-3686 [009] d..2 854.134094: hcall_entry: opcode=24
bash-3686 [009] d..2 854.134095: hcall_exit: opcode=24 retval=18446744073709551609
It's normal for some hcalls to return negative values, displaying them
as unsigned isn't very helpful. So change it to signed.
bash-3711 [001] d..2 471.691008: hcall_entry: opcode=24
bash-3711 [001] d..2 471.691008: hcall_exit: opcode=24 retval=-7
Which can be more easily compared to H_NOT_FOUND in hvcall.h
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
This commit was a stop-gap to prevent crashes on hotunplug, caused by
the mismatch between the 1G mappings used for the linear mapping and the
memory block size. Those issues are now resolved because we split the
linear mapping at hotunplug time if necessary, as implemented in commit
4dd5f8a99e ("powerpc/mm/radix: Split linear mapping on hot-unplug").
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Tested-by: Rashmica Gupta <rashmica.g@gmail.com>
Tested-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
By using IS_ENABLED() we can simplify __set_pte_at() by removing
redundant *ptep = pte.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When nohash and book3s header were split, some hash related stuff
remained in the nohash header. This patch removes them.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Duplicate pte_young() to avoid circular header dependency]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Unregister fadump on kexec down path otherwise the fadump registration
in new kexec-ed kernel complains that fadump is already registered.
This makes new kernel to continue using fadump registered by previous
kernel which may lead to invalid vmcore generation. Hence this patch
fixes this issue by un-registering fadump in fadump_cleanup() which is
called during kexec path so that new kernel can register fadump with
new valid values.
Fixes: b500afff11 ("fadump: Invalidate registration and release reserved memory for general use.")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
FADump capture kernel boots in restricted memory environment preserving
the context of previous kernel to save vmcore. Supporting hugepages in
such environment makes things unnecessarily complicated, as hugepages
need memory set aside for them. This means most of the capture kernel's
memory is used in supporting hugepages. In most cases, this results in
out-of-memory issues while booting FADump capture kernel. But hugepages
are not of much use in capture kernel whose only job is to save vmcore.
So, disabling hugepages support, when fadump is active, is a reliable
solution for the out of memory issues. Introducing a flag variable to
disable HugeTLB support when fadump is active.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The second kernel, during early boot after the crash, reserves rest of
the memory above boot memory size to make sure it does not touch any of the
dump memory area. It uses memblock_reserve() that reserves the specified
memory region irrespective of memory holes present within that region.
There are chances where previous kernel would have hot removed some of
its memory leaving memory holes behind. In such cases fadump kernel reports
incorrect number of reserved pages through arch_reserved_kernel_pages()
hook causing kernel to hang or panic.
Fix this by excluding memory holes while reserving rest of the memory
above boot memory size during second kernel boot after crash.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
I no longer have a functional version of this board for even the most
basic sanity boot testing, and they have not been available for purchase
for quite some years now.
There is no point in adding a burden to testing coverage that does
walk all the possible defconfigs, so with all the above in mind, it
makes sense to remove it. Of course it will remain in the git history
for anyone who happens to stumble on one and wants to tinker with it.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We've had dynamic ftrace support for over 9 years since Steve first
wrote it, all the distros use dynamic, and static is basically
untested these days, so drop support for static ftrace.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With -mprofile-kernel, we always save the full register state in
ftrace_caller(). While this works, this is inefficient if we're not
interested in the register state, such as when we're using the function
tracer.
Rename the existing ftrace_caller() as ftrace_regs_caller() and provide
a simpler implementation for ftrace_caller() that is used when registers
are not required to be saved.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Our implementation matches that of the generic version, which also
handles FTRACE_UPDATE_MODIFY_CALL. So, remove our implementation in
favor of the generic version.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For R_PPC64_REL24 relocations, we suppress emitting instructions for TOC
load/restore in the relocation stub if the relocation is for _mcount()
call when using -mprofile-kernel ABI.
To detect this, we check if the preceding instructions are per the
standard set of instructions emitted by gcc: either the two instruction
sequence of 'mflr r0; std r0,16(r1)', or the more optimized variant of a
single 'mflr r0'. This is not sufficient since nothing prevents users
from hand coding sequences involving a 'mflr r0' followed by a 'bl'.
For removing the toc save instruction from the stub, we additionally
check if the symbol is "_mcount". Add the same check here as well.
Also rename is_early_mcount_callsite() to is_mprofile_mcount_callsite()
since that is what is being checked. The use of "early" is misleading
since there is nothing involving this function that qualifies as early.
Fixes: 153086644f ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If function_graph tracer is enabled during kexec, we see the below
exception in the simulator:
root@(none):/# kexec -e
kvm: exiting hardware virtualization
kexec_core: Starting new kernel
[ 19.262020070,5] OPAL: Switch to big-endian OS
kexec: Starting switchover sequence.
Interrupt to 0xC000000000004380 from 0xC000000000004380
** Execution stopped: Continuous Interrupt, Instruction caused exception, **
Now that we have a more effective way to completely disable ftrace on
ppc64, let's also use that before switching to a new kernel during
kexec.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
During guest entry/exit, we switch over to/from the guest MMU context
and we cannot take exceptions in the hypervisor code.
Since ftrace may be enabled and since it can result in us taking a trap,
disable ftrace by setting paca->ftrace_enabled to zero. There are two
paths through which we enter/exit a guest:
1. If we are the vcore runner, then we enter the guest via
__kvmppc_vcore_entry() and we disable ftrace around this. This is always
the case for Power9, and for the primary thread on Power8.
2. If we are a secondary thread in Power8, then we would be in nap due
to SMT being disabled. We are woken up by an IPI to enter the guest. In
this scenario, we enter the guest through kvm_start_guest(). We disable
ftrace at this point. In this scenario, ftrace would only get re-enabled
on the secondary thread when SMT is re-enabled (via start_secondary()).
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Disable ftrace when a cpu is about to go offline. When the cpu is woken
up, ftrace will get enabled in start_secondary().
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On the boot cpu, though we enable paca->ftrace_enabled in early_setup()
(via cpu_ready_for_interrupts()), we don't start tracing until much
later since ftrace is not initialized yet and since we only support
DYNAMIC_FTRACE on powerpc. However, it is possible that ftrace has been
initialized by the time some of the secondary cpus start up. In this
case, we will try to trace some of the early boot code which can cause
problems.
To address this, move setting paca->ftrace_enabled from
cpu_ready_for_interrupts() to early_setup() for the boot cpu, and towards
the end of start_secondary() for secondary cpus.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add some helpers to enable/disable ftrace through paca->ftrace_enabled.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Re-arrange the last #ifdef section in preparation for a subsequent
change.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have some C code that we call into from real mode where we cannot
take any exceptions. Though the C functions themselves are mostly safe,
if these functions are traced, there is a possibility that we may take
an exception. For instance, in certain conditions, the ftrace code uses
WARN(), which uses a 'trap' to do its job.
For such scenarios, introduce a new field in paca 'ftrace_enabled',
which is checked on ftrace entry before continuing. This field can then
be set to zero to disable/pause ftrace, and set to a non-zero value to
resume ftrace.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
smp_send_stop can lock up the IPI path for any subsequent calls,
because the receiving CPUs spin in their handler function. This
started becoming a problem with the addition of an smp_send_stop
call in the reboot path, because panics can reboot after doing
their own smp_send_stop.
The NMI IPI variant was fixed with ac61c11566 ("powerpc: Fix
smp_send_stop NMI IPI handling"), which leaves the smp_call_function
variant.
This is fixed by having smp_send_stop only ever do the
smp_call_function once. This is a bit less robust than the NMI IPI
fix, because any other call to smp_call_function after smp_send_stop
could deadlock, but that has always been the case, and it was not
been a problem before.
Fixes: f2748bdfe1 ("powerpc/powernv: Always stop secondaries before reboot/shutdown")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The NMI IPI handler for a receiving CPU increments nmi_ipi_busy_count
over the handler function call, which causes later smp_send_nmi_ipi()
callers to spin until the call is finished.
The stop_this_cpu() function never returns, so the busy count is never
decremeted, which can cause the system to hang in some cases. For
example panic() will call smp_send_stop() early on which calls
stop_this_cpu() on other CPUs, then later in the reboot path,
pnv_restart() will call smp_send_stop() again, which hangs.
Fix this by adding a special case to the stop_this_cpu() handler to
decrement the busy count, because it will never return.
Now that the NMI/non-NMI versions of stop_this_cpu() are different,
split them out into separate functions rather than doing #ifdef tricks
to share the body between the two functions.
Fixes: 6bed323762 ("powerpc: use NMI IPI for smp_send_stop")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out the functions, tweak change log a bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or
OPAL_BUSY_EVENT from firmware, which causes large scheduling
latencies, up to 50 seconds have been observed here when RTC stops
responding (BMC reboot can do it).
Fix this by converting it to the standard form OPAL_BUSY loop that
sleeps.
Fixes: 628daa8d5a ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
Cc: stable@vger.kernel.org # v3.2+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The current code extracts the physical address for UE errors and then
hooks it up into memory failure infrastructure. On successful
extraction of physical address it wrongly sets "handled = 1" which
means this UE error has been recovered. Since MCE handler gets return
value as handled = 1, it assumes that error has been recovered and
goes back to same NIP. This causes MCE interrupt again and again in a
loop leading to hard lockup.
Also, initialize phys_addr to ULONG_MAX so that we don't end up
queuing undesired page to hwpoison.
Without this patch we see:
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
...
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
Severe Machine check interrupt [Recovered]
NIP: [000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffd2755940
Physical address: 000020181a080000
Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
...
Watchdog CPU:38 Hard LOCKUP
After this patch we see:
Severe Machine check interrupt [Not recovered]
NIP: [00007fffaae585f4] PID: 7168 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address: 00007fffaafe28ac
Physical address: 00002017c0bd0000
find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
Fixes: 01eaac2b05 ("powerpc/mce: Hookup ierror (instruction) UE errors")
Fixes: ba41e1e1cc ("powerpc/mce: Hookup derror (load/store) UE errors")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The NPU has a limited number of address translation shootdown (ATSD)
registers and the GPU has limited bandwidth to process ATSDs. This can
result in contention of ATSD registers leading to soft lockups on some
threads, particularly when invalidating a large address range in
pnv_npu2_mn_invalidate_range().
At some threshold it becomes more efficient to flush the entire GPU
TLB for the given MM context (PID) than individually flushing each
address in the range. This patch will result in ranges greater than
2MB being converted from 32+ ATSDs into a single ATSD which will flush
the TLB for the given PID on each GPU.
Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Tested-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is a single npu context per set of callback parameters. Callers
should be prevented from overwriting existing callback values so
instead return an error if different parameters are passed.
Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Mark Hairgrove <mhairgrove@nvidia.com>
Tested-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pnv_npu2_init_context() and pnv_npu2_destroy_context() functions
are used to allocate/free contexts to allow address translation and
shootdown by the NPU on a particular GPU. Context initialisation is
implicitly safe as it is protected by the requirement mmap_sem be held
in write mode, however pnv_npu2_destroy_context() does not require
mmap_sem to be held and it is not safe to call with a concurrent
initialisation for a different GPU.
It was assumed the driver would ensure destruction was not called
concurrently with initialisation. However the driver may be simplified
by allowing concurrent initialisation and destruction for different
GPUs. As npu context creation/destruction is not a performance
critical path and the critical section is not large a single spinlock
is used for simplicity.
Fixes: 1ab66d1fba ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Mark Hairgrove <mhairgrove@nvidia.com>
Tested-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Don't do this via custom code, instead now that we have support in the
arch hotplug/hotunplug code, rely on those routines to do the right
thing.
The existing flush doesn't work because it uses ppc64_caches.l1d.size
instead of ppc64_caches.l1d.line_size.
Fixes: 9d5171a8f2 ("powerpc/powernv: Enable removal of memory for in memory tracing")
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds support for flushing potentially dirty cache lines
when memory is hot-plugged/hot-un-plugged. The support is currently
limited to 64 bit systems.
The bug was exposed when mappings for a device were actually
hot-unplugged and plugged in back later. A similar issue was observed
during the development of memtrace, but memtrace does it's own
flushing of region via a custom routine.
These patches do a flush both on hotplug/unplug to clear any stale
data in the cache w.r.t mappings, there is a small race window where a
clean cache line may be created again just prior to tearing down the
mapping.
The patches were tested by disabling the flush routines in memtrace
and doing I/O on the trace file. The system immediately
checkstops (quite reliablly if prior to the hot-unplug of the memtrace
region, we memset the regions we are about to hot unplug). After these
patches no custom flushing is needed in the memtrace code.
Fixes: 9d5171a8f2 ("powerpc/powernv: Enable removal of memory for in memory tracing")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Reza Arbab <arbab@linux.ibm.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 95846ecf9d ("pid: replace pid bitmap implementation with IDR
API") changed last field of /proc/loadavg (last pid allocated) to be off
by one:
# unshare -p -f --mount-proc cat /proc/loadavg
0.00 0.00 0.00 1/60 2 <===
It should be 1 after first fork into pid namespace.
This is formally a regression but given how useless this field is I
don't think anyone is affected.
Bug was found by /proc testsuite!
Link: http://lkml.kernel.org/r/20180413175408.GA27246@avx2
Fixes: 95846ecf9d ("pid: replace pid bitmap implementation with IDR API")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gargi Sharma <gs051095@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running KVM guests on Power8 we can see a lockup where one CPU
stops responding. This often leads to a message such as:
watchdog: CPU 136 detected hard LOCKUP on other CPUs 72
Task dump for CPU 72:
qemu-system-ppc R running task 10560 20917 20908 0x00040004
And then backtraces on other CPUs, such as:
Task dump for CPU 48:
ksmd R running task 10032 1519 2 0x00000804
Call Trace:
...
--- interrupt: 901 at smp_call_function_many+0x3c8/0x460
LR = smp_call_function_many+0x37c/0x460
pmdp_invalidate+0x100/0x1b0
__split_huge_pmd+0x52c/0xdb0
try_to_unmap_one+0x764/0x8b0
rmap_walk_anon+0x15c/0x370
try_to_unmap+0xb4/0x170
split_huge_page_to_list+0x148/0xa30
try_to_merge_one_page+0xc8/0x990
try_to_merge_with_ksm_page+0x74/0xf0
ksm_scan_thread+0x10ec/0x1ac0
kthread+0x160/0x1a0
ret_from_kernel_thread+0x5c/0x78
This is caused by commit 8c1c7fb0b5 ("powerpc/64s/idle: avoid sync
for KVM state when waking from idle"), which added a check in
pnv_powersave_wakeup() to see if the kvm_hstate.hwthread_state is
already set to KVM_HWTHREAD_IN_KERNEL, and if so to skip the store and
test of kvm_hstate.hwthread_req.
The problem is that the primary does not set KVM_HWTHREAD_IN_KVM when
entering the guest, so it can then come out to cede with
KVM_HWTHREAD_IN_KERNEL set. It can then go idle in kvm_do_nap after
setting hwthread_req to 1, but because hwthread_state is still
KVM_HWTHREAD_IN_KERNEL we will skip the test of hwthread_req when we
wake up from idle and won't go to kvm_start_guest. From there the
thread will return somewhere garbage and crash.
Fix it by skipping the store of hwthread_state, but not the test of
hwthread_req, when coming out of idle. It's OK to skip the sync in
that case because hwthread_req will have been set on the same thread,
so there is no synchronisation required.
Fixes: 8c1c7fb0b5 ("powerpc/64s/idle: avoid sync for KVM state when waking from idle")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On boot we save the configuration space of PCIe bridges. We do this so
when we get an EEH event and everything gets reset that we can restore
them.
Unfortunately we save this state before we've enabled the MMIO space
on the bridges. Hence if we have to reset the bridge when we come back
MMIO is not enabled and we end up taking an PE freeze when the driver
starts accessing again.
This patch forces the memory/MMIO and bus mastering on when restoring
bridges on EEH. Ideally we'd do this correctly by saving the
configuration space writes later, but that will have to come later in
a larger EEH rewrite. For now we have this simple fix.
The original bug can be triggered on a boston machine by doing:
echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0001/err_injct_outbound
On boston, this PHB has a PCIe switch on it. Without this patch,
you'll see two EEH events, 1 expected and 1 the failure we are fixing
here. The second EEH event causes the anything under the PHB to
disappear (i.e. the i40e eth).
With this patch, only 1 EEH event occurs and devices properly recover.
Fixes: 652defed48 ("powerpc/eeh: Check PCIe link after reset")
Cc: stable@vger.kernel.org # v3.11+
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When setting up a CPU, we "push" (activate) a pool VP for it.
However it's an error to do so if it already has an active
pool VP.
This happens when doing soft CPU hotplug on powernv since we
don't tear down the CPU on unplug. The HW flags the error which
gets captured by the diagnostics.
Fix this by making sure to "pull" out any already active pool
first.
Fixes: 243e25112d ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If there is no d-cache-size property in the device tree, l1d_size could
be zero. We don't actually expect that to happen, it's only been seen
on mambo (simulator) in some configurations.
A zero-size l1d_size leads to the loop in the asm wrapping around to
2^64-1, and then walking off the end of the fallback area and
eventually causing a page fault which is fatal.
Just default to 64K which is correct on some CPUs, and sane enough to
not cause a crash on others.
Fixes: aa8a5e0062 ('powerpc/64s: Add support for RFI flush of L1-D cache')
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Rewrite comment and change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When we patch an alternate feature section, we have to adjust any
relative branches that branch out of the alternate section.
But currently we have a bug if we have a branch that points to past
the last instruction of the alternate section, eg:
FTR_SECTION_ELSE
1: b 2f
or 6,6,6
2:
ALT_FTR_SECTION_END(...)
nop
This will result in a relative branch at 1 with a target that equals
the end of the alternate section.
That branch does not need adjusting when it's moved to the non-else
location. Currently we do adjust it, resulting in a branch that goes
off into the link-time location of the else section, which is junk.
The fix is to not patch branches that have a target == end of the
alternate section.
Fixes: d20fe50a7b ("KVM: PPC: Book3S HV: Branch inside feature section")
Fixes: 9b1a735de6 ("powerpc: Add logic to patch alternative feature sections")
Cc: stable@vger.kernel.org # v2.6.27+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Fix crashes when loading modules built with a different CONFIG_RELOCATABLE
value by adding CONFIG_RELOCATABLE to vermagic.
- Fix busy loops in the OPAL NVRAM driver if we get certain error conditions
from firmware.
- Remove tlbie trace points from KVM code that's called in real mode, because
it causes crashes.
- Fix checkstops caused by invalid tlbiel on Power9 Radix.
- Ensure the set of CPU features we "know" are always enabled is actually the
minimal set when we build with support for firmware supplied CPU features.
Thanks to:
Aneesh Kumar K.V, Anshuman Khandual, Nicholas Piggin.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJa0oWBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAV
EA//UQB7n33HlroMBL3841VswUrIgY36gSi9+QZPjxHiTYGVStI+u0FHq5hm8OMm
1FkronD0sfbN7t8LJ9FS6vCoAlW15vMBj95pWJXAL7NuQneO7cM5JvRbowVKbNbq
GESn4EkUj9bAXl7aTzX2yA7jruzMJIwE/2bgl1J6YkpByHx5x/fgzKTuoCRggQqY
Xd656ZZyWR/uIuWhAjCPFdrPnekVvo7/Gy5Yo5kcR6LbX4MO0JFNOHpiT3wPGGv/
LJedJtdpH1biMk7SrqLfWD7mpMsVJeMelpkftEbe8zUXf17wBd1rl4JkybLTDwZD
HznZ0nbnh0j5Q/KRHwOh0sLDSisJF6pQHH/Bnc4Xqrsn4OERwEk40/Ym/O0kDsjH
n2F3X5a5QLWoBWRsdV3BMyEzc0bgnb++tXeBWOV4jJBEOhoqxwimCU+3fmLiy95A
urJYf8Sljwve9I5kJyXKpruopgeoJ1aumRwopE8YCG9lfBgzvU/90u6+uKqAdkOv
kpZxS5FDT2lNIwbCP9WjEelAOGrtA6JQNWU0iPGwsVAvAxX/p1RwwbQeWVlWs3Ek
UdVF8eGezClpLPa/FQFdDyXfGE6WIf1vK9YnG7k3rUwwkSqoYxKgVQ3o1Vetp0Lg
fzPsDdDi2V2I3KDYNav8s6kohAIIS/a9XYk4BGvT/htRnUg=
=0e7c
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix crashes when loading modules built with a different
CONFIG_RELOCATABLE value by adding CONFIG_RELOCATABLE to vermagic.
- Fix busy loops in the OPAL NVRAM driver if we get certain error
conditions from firmware.
- Remove tlbie trace points from KVM code that's called in real mode,
because it causes crashes.
- Fix checkstops caused by invalid tlbiel on Power9 Radix.
- Ensure the set of CPU features we "know" are always enabled is
actually the minimal set when we build with support for firmware
supplied CPU features.
Thanks to: Aneesh Kumar K.V, Anshuman Khandual, Nicholas Piggin.
* tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Fix CPU_FTRS_ALWAYS vs DT CPU features
powerpc/mm/radix: Fix checkstops caused by invalid tlbiel
KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode
powerpc/8xx: Fix build with hugetlbfs enabled
powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops
powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops
powerpc/fscr: Enable interrupts earlier before calling get_user()
powerpc/64s: Fix section mismatch warnings from setup_rfi_flush()
powerpc/modules: Fix crashes by adding CONFIG_RELOCATABLE to vermagic
For s390 new kernels are loaded to fixed addresses in memory before they
are booted. With the current code this is a problem as it assumes the
kernel will be loaded to an 'arbitrary' address. In particular,
kexec_locate_mem_hole searches for a large enough memory region and sets
the load address (kexec_bufer->mem) to it.
Luckily there is a simple workaround for this problem. By returning 1
in arch_kexec_walk_mem, kexec_locate_mem_hole is turned off. This
allows the architecture to set kbuf->mem by hand. While the trick works
fine for the kernel it does not for the purgatory as here the
architectures don't have access to its kexec_buffer.
Give architectures access to the purgatories kexec_buffer by changing
kexec_load_purgatory to take a pointer to it. With this change
architectures have access to the buffer and can edit it as they need.
A nice side effect of this change is that we can get rid of the
purgatory_info->purgatory_load_address field. As now the information
stored there can directly be accessed from kbuf->mem.
Link: http://lkml.kernel.org/r/20180321112751.22196-11-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As arch_kexec_kernel_image_{probe,load}(),
arch_kimage_file_post_load_cleanup() and arch_kexec_kernel_verify_sig()
are almost duplicated among architectures, they can be commonalized with
an architecture-defined kexec_file_ops array. So let's factor them out.
Link: http://lkml.kernel.org/r/20180306102303.9063-3-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kexec_file, x86, powerpc: refactoring for other
architecutres", v2.
This is a preparatory patchset for adding kexec_file support on arm64.
It was originally included in a arm64 patch set[1], but Philipp is also
working on their kexec_file support on s390[2] and some changes are now
conflicting.
So these common parts were extracted and put into a separate patch set
for better integration. What's more, my original patch#4 was split into
a few small chunks for easier review after Dave's comment.
As such, the resulting code is basically identical with my original, and
the only *visible* differences are:
- renaming of _kexec_kernel_image_probe() and _kimage_file_post_load_cleanup()
- change one of types of arguments at prepare_elf64_headers()
Those, unfortunately, require a couple of trivial changes on the rest
(#1, #6 to #13) of my arm64 kexec_file patch set[1].
Patch #1 allows making a use of purgatory optional, particularly useful
for arm64.
Patch #2 commonalizes arch_kexec_kernel_{image_probe, image_load,
verify_sig}() and arch_kimage_file_post_load_cleanup() across
architectures.
Patches #3-#7 are also intended to generalize parse_elf64_headers(),
along with exclude_mem_range(), to be made best re-use of.
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/561182.html
[2] http://lkml.iu.edu//hypermail/linux/kernel/1802.1/02596.html
This patch (of 7):
On arm64, crash dump kernel's usable memory is protected by *unmapping*
it from kernel virtual space unlike other architectures where the region
is just made read-only. It is highly unlikely that the region is
accidentally corrupted and this observation rationalizes that digest
check code can also be dropped from purgatory. The resulting code is so
simple as it doesn't require a bit ugly re-linking/relocation stuff,
i.e. arch_kexec_apply_relocations_add().
Please see:
http://lists.infradead.org/pipermail/linux-arm-kernel/2017-December/545428.html
All that the purgatory does is to shuffle arguments and jump into a new
kernel, while we still need to have some space for a hash value
(purgatory_sha256_digest) which is never checked against.
As such, it doesn't make sense to have trampline code between old kernel
and new kernel on arm64.
This patch introduces a new configuration, ARCH_HAS_KEXEC_PURGATORY, and
allows related code to be compiled in only if necessary.
[takahiro.akashi@linaro.org: fix trivial screwup]
Link: http://lkml.kernel.org/r/20180309093346.GF25863@linaro.org
Link: http://lkml.kernel.org/r/20180306102303.9063-2-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cpu_has_feature() mechanism has an optimisation where at build
time we construct a mask of the CPU feature bits that will always be
true for the given .config, based on the platform/bitness/etc. that we
are building for.
That is incompatible with DT CPU features, where the set of CPU
features is dependent on feature flags that are given to us by
firmware.
The result is that some feature bits can not be *disabled* by DT CPU
features. Or more accurately, they can be disabled but they will still
appear in the ALWAYS mask, meaning cpu_has_feature() will always
return true for them.
In the past this hasn't really been a problem because on Book3S
64 (where we support DT CPU features), the set of ALWAYS bits has been
very small. That was because we always built for POWER4 and later,
meaning the set of common bits was small.
The only bit that could be cleared by DT CPU features that was also in
the ALWAYS mask was CPU_FTR_NODSISRALIGN, and that was only used in
the alignment handler to create a fake DSISR. That code was itself
deleted in 31bfdb036f ("powerpc: Use instruction emulation
infrastructure to handle alignment faults") (Sep 2017).
However the set of ALWAYS features changed with the recent commit
db5ae1c155 ("powerpc/64s: Refine feature sets for little endian
builds") which restricted the set of feature flags when building
little endian to Power7 or later. That caused the ALWAYS mask to
become much larger for little endian builds.
The result is that the following feature bits can currently not
be *disabled* by DT CPU features:
CPU_FTR_REAL_LE, CPU_FTR_MMCRA, CPU_FTR_CTRL, CPU_FTR_SMT,
CPU_FTR_PURR, CPU_FTR_SPURR, CPU_FTR_DSCR, CPU_FTR_PKEY,
CPU_FTR_VMX_COPY, CPU_FTR_CFAR, CPU_FTR_HAS_PPR.
To fix it we need to mask the set of ALWAYS features with the base set
of DT CPU features, ie. the features that are always enabled by DT CPU
features. That way there are no bits in the ALWAYS mask that are not
also always set by DT CPU features.
Fixes: db5ae1c155 ("powerpc/64s: Refine feature sets for little endian builds")
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>