The alignment of IOV BAR on PowerNV platform is the total size of the IOV
BAR. No matter whether the IOV BAR is extended with number of
roundup_pow_of_two(total_vfs) or number of max PE number (256), the total
size could be calculated by (vfs_expanded * VF_BAR_size).
This patch simplifies the pnv_pci_iov_resource_alignment() by removing the
first case.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PHB3, we enable SRIOV devices by mapping IOV BAR with M64 BARs. If a
SRIOV device's IOV BAR is not 64bit-prefetchable, this is not assigned from
64bit prefetchable window, which means M64 BAR can't work on it.
The reason is PCI bridges support only 2 memory windows and the kernel code
programs bridges in the way that one window is 32bit-nonprefetchable and
the other one is 64bit-prefetchable. So if devices' IOV BAR is 64bit and
non-prefetchable, it will be mapped into 32bit space and therefore M64
cannot be used for it.
This patch makes this explicit and truncate IOV resource in this case to
save MMIO space.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The EEH debugfs handlers have same prototype. This introduces
a macro to define them, then to simplify the code. No logical
changes.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, the OPAL msglog/console buffer is exposed as a sysfs file, with
the sysfs read handler responsible for retrieving the log from the OPAL
buffer. We'd like to be able to use it in xmon as well.
Refactor the OPAL msglog code to create a new function, opal_msglog_copy(),
that copies to an arbitrary buffer. Separate the initialisation code into
generic memcons init and sysfs file creation.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
"p5ioc2 is used by approximately 2 machines in the world, and has never
ever been a supported configuration."
The code for p5ioc2 is essentially unused and complicates what is already
a very complicated codebase. Its removal is essentially a "free win" in
the effort to simplify the powernv PCI code.
In addition, support for p5ioc2 has been dropped from skiboot. There's no
reason to keep it around in the kernel.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica Gupta,
Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling, Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
- cxl: Fix possible idr warning when contexts are released from Vaibhav Jain
- cxl: use correct operator when writing pcie config space values from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma Krishnan
- Freescale updates from Scott: Highlights include moving QE code out of
arch/powerpc (to be shared with arm), device tree updates, and minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmIxeAAoJEFHr6jzI4aWAA+cQAIXAw4WfVWJ2V4ZK+1eKfB57
fdXG71PuXG+WYIWy71ly8keLHdzzD1NQ2OUB64bUVRq202nRgVc15ZYKRJ/FE/sP
SkxaQ2AG/2kI2EflWshOi0Lu9qaZ+LMHJnszIqE/9lnGSB2kUI/cwsSXgziiMKXR
XNci9v14SdDd40YV/6BSZXoxApwyq9cUbZ7rnzFLmz4hrFuKmB/L3LABDF8QcpH7
sGt/YaHGOtqP0UX7h5KQTFLGe1OPvK6NWixSXeZKQ71ED6cho1iKUEOtBA9EZeIN
QM5JdHFWgX8MMRA0OHAgidkSiqO38BXjmjkVYWoIbYz7Zax3ThmrDHB4IpFwWnk3
l7WBykEXY7KEqpZzbh0GFGehZWzVZvLnNgDdvpmpk/GkPzeYKomBj7ZZfm3H1yGD
BTHPwuWCTX+/K75yEVNO8aJO12wBg7DRl4IEwBgqhwU8ga4FvUOCJkm+SCxA1Dnn
qlpS7qPwTXNIEfKMJcxp5X0KiwDY1EoOotd4glTN0jbeY5GEYcxe+7RQ302GrYxP
zcc8EGLn8h6BtQvV3ypNHF5l6QeTW/0ZlO9c236tIuUQ5gQU39SQci7jQKsYjSzv
BB1XdLHkbtIvYDkmbnr1elbeJCDbrWL9rAXRUTRyfuCzaFWTfZmfVNe8c8qwDMLk
TUxMR/38aI7bLcIQjwj9
=R5bX
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Core:
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
Misc:
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica
Gupta, Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling,
Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de
Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions
fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica
Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from
Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from
Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from
Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing
from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc
from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
cxl:
- cxl: Fix possible idr warning when contexts are released from
Vaibhav Jain
- cxl: use correct operator when writing pcie config space values
from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav
Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma
Krishnan
Freescale:
- Freescale updates from Scott: Highlights include moving QE code out
of arch/powerpc (to be shared with arm), device tree updates, and
minor fixes"
* tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (149 commits)
powerpc/module: Handle R_PPC64_ENTRY relocations
scripts/recordmcount.pl: support data in text section on powerpc
powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages
powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff
powerpc/mm: Fix _PAGE_PTE breaking swapoff
cxl: Enable PCI device ID for future IBM CXL adapter
cxl: use -Werror only with CONFIG_PPC_WERROR
cxl: fix build for GCC 4.6.x
powerpc: Add HWCAP bits for Power9
powerpc/powernv: Reserve PE#0 on NPU
powerpc/powernv: Change NPU PE# assignment
powerpc/powernv: Fix update of NVLink DMA mask
powerpc/powernv: Remove misleading comment in pci.c
powerpc: Implement save_stack_trace_regs() to enable kprobe stack tracing
powerpc: Fix build break due to paca mm_context_t changes
cxl: Fix DSI misses when the context owning task exits
MAINTAINERS: Update Scott Wood's e-mail address
powerpc/powernv: Fix minor off-by-one error in opal_mce_check_early_recovery()
powerpc: Fix style of self-test config prompts
powerpc/powernv: Only delay opal_rtc_read() retry when necessary
...
The recently added OPAL API call, OPAL_CONSOLE_FLUSH, originally took no
parameters and returned nothing. The call was updated to accept the
terminal number to flush, and returned various values depending on the
state of the output buffer.
The prototype has been updated and its usage in the OPAL kmsg dumper has
been modified to support its new behaviour as an incremental flush.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
P8+ hardware reports all errors on PE#0. This patch ensures PE#0 is
not assigned to NPU devices so that it can be used for EEH.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The P8+ hardware supports four partitionable endpoints (PEs) however
the hardware reports all errors as occurring on PE#0. This means we
need to reserve this PE for error handling (EEH) and not assign it to
a NPU device, implying that some devices will need to share PEs.
This patch changes the PE assignment for NPU devices such that NPU
devices which connect to the same GPU are assigned to the same
PE#.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The emulated NVLink PCI devices share the same IODA2 TCE tables but only
support a single TVT (instead of the normal two for PCI devices). This
requires the kernel to manually replace windows with either the bypass
or non-bypass window depending on what the driver has requested.
Unfortunately an incorrect optimisation was made in
pnv_pci_ioda_dma_set_mask() which caused updating of some NPU device PEs
to be skipped in certain configurations due to an incorrect assumption
that a NULL peer PE in the array indicated there were no more peers
present. This patch fixes the problem by ensuring all peer PEs are
updated.
Fixes: 5d2aa710e6 ("powerpc/powernv: Add support for Nvlink NPUs")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PCI in powernv now supports quite a bit more than p5ioc2, so remove the
outdated comment.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix off-by-one error in opal_mce_check_early_recovery() when checking
whether the NIP falls within OPAL space.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Only delay opal_rtc_read() when busy and are going to retry.
This has the advantage of possibly saving a massive 10ms off booting!
Kudos to Stewart for noticing.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On BMC machines, console output is controlled by the OPAL firmware and is
only flushed when its pollers are called. When the kernel is in a panic
state, it no longer calls these pollers and thus console output does not
completely flush, causing some output from the panic to be lost.
Output is only actually lost when the kernel is configured to not power off
or reboot after panic (i.e. CONFIG_PANIC_TIMEOUT is set to 0) since OPAL
flushes the console buffer as part of its power down routines. Before this
patch, however, only partial output would be printed during the timeout wait.
This patch adds a new kmsg_dumper which gets called at panic time to ensure
panic output is not lost. It accomplishes this by calling OPAL_CONSOLE_FLUSH
in the OPAL API, and if that is not available, the pollers are called enough
times to (hopefully) completely flush the buffer.
The flushing mechanism will only affect output printed at and before the
kmsg_dump call in kernel/panic.c:panic(). As such, the "end Kernel panic"
message may still be truncated as follows:
>Call Trace:
>[c000000f1f603b00] [c0000000008e9458] dump_stack+0x90/0xbc (unreliable)
>[c000000f1f603b30] [c0000000008e7e78] panic+0xf8/0x2c4
>[c000000f1f603bc0] [c000000000be4860] mount_block_root+0x288/0x33c
>[c000000f1f603c80] [c000000000be4d14] prepare_namespace+0x1f4/0x254
>[c000000f1f603d00] [c000000000be43e8] kernel_init_freeable+0x318/0x350
>[c000000f1f603dc0] [c00000000000bd74] kernel_init+0x24/0x130
>[c000000f1f603e30] [c0000000000095b0] ret_from_kernel_thread+0x5c/0xac
>---[ end Kernel panic - not
This functionality is implemented as a kmsg_dumper as it seems to be the
most sensible way to introduce platform-specific functionality to the
panic function.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 25642e1459 ("powerpc/opal-irqchip: Fix double endian
conversion") fixed an endian bug by calling opal_handle_events() in
opal_event_unmask().
However this introduced a deadlock if we find an event is active
during unmasking and call opal_handle_events() again. The bad call
sequence is:
opal_interrupt()
-> opal_handle_events()
-> generic_handle_irq()
-> handle_level_irq()
-> raw_spin_lock(&desc->lock)
handle_irq_event(desc)
unmask_irq(desc)
-> opal_event_unmask()
-> opal_handle_events()
-> generic_handle_irq()
-> handle_level_irq()
-> raw_spin_lock(&desc->lock) (BOOM)
When generating multiple opal events in quick succession this would lead
to the following stall warnings:
EEH: Fenced PHB#0 detected, location: U78C9.001.WZS09XA-P1-C32
INFO: rcu_sched detected stalls on CPUs/tasks:
12-...: (1 GPs behind) idle=68f/140000000000001/0 softirq=860/861 fqs=2065
15-...: (1 GPs behind) idle=be5/140000000000001/0 softirq=1142/1143 fqs=2065
(detected by 13, t=2102 jiffies, g=1325, c=1324, q=602)
NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [irqbalance:2696]
INFO: rcu_sched detected stalls on CPUs/tasks:
12-...: (1 GPs behind) idle=68f/140000000000001/0 softirq=860/861 fqs=8371
15-...: (1 GPs behind) idle=be5/140000000000001/0 softirq=1142/1143 fqs=8371
(detected by 20, t=8407 jiffies, g=1325, c=1324, q=1290)
This patch corrects the problem by queuing the work if an event is
active during unmasking, which is similar to the pre-endian fix
behaviour.
Fixes: 25642e1459 ("powerpc/opal-irqchip: Fix double endian conversion")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
NVLink is a high speed interconnect that is used in conjunction with a
PCI-E connection to create an interface between CPU and GPU that
provides very high data bandwidth. A PCI-E connection to a GPU is used
as the control path to initiate and report status of large data
transfers sent via the NVLink.
On IBM Power systems the NVLink processing unit (NPU) is similar to
the existing PHB3. This patch adds support for a new NPU PHB type. DMA
operations on the NPU are not supported as this patch sets the TCE
translation tables to be the same as the related GPU PCIe device for
each NVLink. Therefore all DMA operations are setup and controlled via
the PCIe device.
EEH is not presently supported for the NPU devices, although it may be
added in future.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move __raw_rm_writeq() from platforms/powernv/pci-ioda.c to
include/asm/io.h so that it can be used by other code.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This commit removed the pcidev field from struct pci_dn as it was no
longer in use by the kernel. However to support finding the
association of Nvlink devices to GPU devices from the device-tree this
field is required.
This reverts commit 250c7b277c ("powerpc/pci: Remove unused struct
pci_dn.pcidev field").
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The name of PCI root bus's M64 resource isn't initialized properly.
When dumping "/proc/iomem", "<BAD>" is seen for those M64 resources
on PCI root buses.
~# cat /proc/iomem | grep -e "BAD"
3b0000000000-3b0fefffffff : <BAD>
3b1000000000-3b1fefffffff : <BAD>
3c0000000000-3c0fefffffff : <BAD>
3c1000000000-3c1fefffffff : <BAD>
3c2000000000-3c2fefffffff : <BAD>
This fixes the issue by setting the name of PCI root bus's M64
resource to that of PHB's device node full name. With the patch,
no "<BAD>" is seen from "/proc/iomem".
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Long ago, only in the lab, there was OPALv1 and OPALv2. Now there is
just OPALv3, with nobody ever expecting anything on pre-OPALv3 to
be cared about or supported by mainline kernels.
So, let's remove FW_FEATURE_OPALv3 and instead use FW_FEATURE_OPAL
exclusively.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
OPALv2 only ever existed in the lab and didn't escape to the world.
All OPAL systems in the wild are OPALv3.
The probability of there being an OPALv2 system still powered on
anywhere inside IBM is approximately zero, let alone anyone
expecting to run mainline kernels.
So, start to remove references to OPALv2.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The OpenPower Abstraction Layer firmware went through a couple
of iterations in the lab before being released. What we now know
as OPAL advertises itself as OPALv3.
OPALv2 and OPALv1 never made it outside the lab, and the possibility
of anyone at all ever building a mainline kernel today and expecting
it to boot on such hardware is zero.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When running on newer OPAL firmware that supports sending extra
OPAL_MSG types, we would print a warning on *every* message received.
This could be a problem for kernels that don't support OPAL_MSG_OCC
on machines that are running real close to thermal limits and the
OCC is throttling the chip. For a kernel that is paying attention to
the message queue, we could get these notifications quite often.
Conceivably, future message types could also come fairly often,
and printing that we didn't understand them 10,000 times provides
no further information than printing them once.
Cc: stable@vger.kernel.org
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
GregorianDay() is supposed to calculate the day of the week
(tm->tm_wday) for a given day/month/year. In that calcuation it
indexed into an array called MonthOffset using tm->tm_mon-1. However
tm_mon is zero-based, not one-based, so this is off-by-one. It also
means that every January, GregoiranDay() will access element -1 of
the MonthOffset array.
It also doesn't appear to be a correct algorithm either: see in
contrast kernel/time/timeconv.c's time_to_tm function.
It's been broken forever, which suggests no-one in userland uses
this. It looks like no-one in the kernel uses tm->tm_wday either
(see e.g. drivers/rtc/rtc-ds1305.c:319).
tm->tm_wday is conventionally set to -1 when not available in
hardware so we can simply set it to -1 and drop the function.
(There are over a dozen other drivers in drivers/rtc that do
this.)
Found using UBSAN.
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrew Morton <akpm@linux-foundation.org> # as an example of what UBSan finds.
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: rtc-linux@googlegroups.com
Signed-off-by: Daniel Axtens <dja@axtens.net>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The OPAL event calls return a mask of events that are active in big
endian format. This is checked when unmasking the events in the
irqchip by comparison with a cached value. The cached value was stored
in big endian format but should've been converted to CPU endian
first.
This bug leads to OPAL event delivery being delayed or dropped on some
systems. Symptoms may include a non-functional console.
The bug is fixed by calling opal_handle_events(...) instead of
duplicating code in opal_event_unmask(...).
Fixes: 9f0fd0499d ("powerpc/powernv: Add a virtual irqchip for opal events")
Cc: stable@vger.kernel.org # v4.2+
Reported-by: Douglas L Lehr <dllehr@us.ibm.com>
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
platform_driver does not need to set an owner because
platform_driver_register() will set it.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Kconfig: remove BE-only platforms from LE kernel build from Boqun Feng
- Refresh ps3_defconfig from Geoff Levand
- Emit GNU & SysV hashes for the vdso from Michael Ellerman
- Define an enum for the bolted SLB indexes from Anshuman Khandual
- Use a local to avoid multiple calls to get_slb_shadow() from Michael Ellerman
- Add gettimeofday() benchmark from Michael Neuling
- Avoid link stack corruption in __get_datapage() from Michael Neuling
- Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar K.V
- Add ppc64le_defconfig from Michael Ellerman
- pseries: extract of_helpers module from Andy Shevchenko
- Correct string length in pseries_of_derive_parent() from Nathan Fontenot
- Free the MSI bitmap if it was slab allocated from Denis Kirjanov
- Shorten irq_chip name for the SIU from Christophe Leroy
- Wait 1s for secondaries to enter OPAL during kexec from Samuel Mendoza-Jonas
- Fix _ALIGN_* errors due to type difference. from Aneesh Kumar K.V
- powerpc/pseries/hvcserver: don't memset pi_buff if it is null from Colin Ian King
- Disable hugepd for 64K page size. from Aneesh Kumar K.V
- Differentiate between hugetlb and THP during page walk from Aneesh Kumar K.V
- Make PCI non-optional for pseries from Michael Ellerman
- Individual System V IPC system calls from Sam bobroff
- Add selftest of unmuxed IPC calls from Michael Ellerman
- discard .exit.data at runtime from Stephen Rothwell
- Delete old orphaned PrPMC 280/2800 DTS and boot file. from Paul Gortmaker
- Use of_get_next_parent to simplify code from Christophe Jaillet
- Paginate some xmon output from Sam bobroff
- Add some more elements to the xmon PACA dump from Michael Ellerman
- Allow the tm-syscall selftest to build with old headers from Michael Ellerman
- Run EBB selftests only on POWER8 from Denis Kirjanov
- Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael Ellerman
- Avoid reference to potentially freed memory in prom.c from Christophe Jaillet
- Quieten boot wrapper output with run_cmd from Geoff Levand
- EEH fixes and cleanups from Gavin Shan
- Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
- Use of_get_next_parent() in of_get_ibm_chip_id() from Michael Ellerman
- Fix section mismatch warning in msi_bitmap_alloc() from Denis Kirjanov
- Fix ps3-lpm white space from Rudhresh Kumar J
- Fix ps3-vuart null dereference from Colin King
- nvram: Add missing kfree in error path from Christophe Jaillet
- nvram: Fix function name in some errors messages. from Christophe Jaillet
- drivers/macintosh: adb: fix misleading Kconfig help text from Aaro Koskinen
- agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
- cxl: Free virtual PHB when removing from Andrew Donnellan
- scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from Michael Ellerman
- scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building with O= from Michael Ellerman
- Freescale updates from Scott: Highlights include 64-bit book3e kexec/kdump
support, a rework of the qoriq clock driver, device tree changes including
qoriq fman nodes, support for a new 85xx board, and some fixes.
- MPC5xxx updates from Anatolij: Highlights include a driver for MPC512x
LocalPlus Bus FIFO with its device tree binding documentation, mpc512x
device tree updates and some minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWPEZgAAoJEFHr6jzI4aWANjYQAKX2Q/95hqKfCuF5FBcUmtMC
Pu/Nff027MVzxZ2ApDcvvLGps5Nz2bn3nIhc9zjkXc5E8DuL6X3Yl8ce7qyNcc3g
cJJ8RvtUo6J1OMWetXFehtPYniAAwKMhZYKnj0+WnLr2SyH/Vhl3ehDkFbGyPtuH
r+2E7krFjfVgU+bzciIFnOaDekFuFN/pXWMb6e6zQyBJe9N8ZIp96uouGCebKVd0
VDLItzdaKErT8JFfbymMPvZm3V0rMVx4WWu3kAbQX8LrD5a18NF1zrjAOHRXc61n
kkk8/DPuNOon1PbXXyiS5BcFyZRe+KE3VBnoW5sOMqMIRg5WdO1oU3e2pEfXMO8+
leXYwFLXiKzUZuOgQG2QiUhrzD2yC1o6/TJWATv0dSl9AwrecgPX+Vj6X357slAf
A9E3eMy5tgnpndBWZmvZS3W7YDKH+NkeZ+Q40+NErAlqr++ErrTcKVndk5vWlYTT
7mMZeTXagX66al/k5ATKqwB7iUSpnYHSAa9fcUYPSM2FnXsDxPyeJGkBbcoOmkGj
QrpgNYOvJaUJd076goZCV39v0c1xpfV9/9kyVch8HUadf6JcjpVZwYnbGw2qlJjh
ZanuBG2VOeSwaKQqXiRBSBetnpAg8CVpFjDmX9wOBfSek2wxEJqDX/vQExdbIDQQ
pUs7vnUxLzhmW/x+ygOI
=YwcM
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Kconfig: remove BE-only platforms from LE kernel build from Boqun
Feng
- Refresh ps3_defconfig from Geoff Levand
- Emit GNU & SysV hashes for the vdso from Michael Ellerman
- Define an enum for the bolted SLB indexes from Anshuman Khandual
- Use a local to avoid multiple calls to get_slb_shadow() from Michael
Ellerman
- Add gettimeofday() benchmark from Michael Neuling
- Avoid link stack corruption in __get_datapage() from Michael Neuling
- Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
K.V
- Add ppc64le_defconfig from Michael Ellerman
- pseries: extract of_helpers module from Andy Shevchenko
- Correct string length in pseries_of_derive_parent() from Nathan
Fontenot
- Free the MSI bitmap if it was slab allocated from Denis Kirjanov
- Shorten irq_chip name for the SIU from Christophe Leroy
- Wait 1s for secondaries to enter OPAL during kexec from Samuel
Mendoza-Jonas
- Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
- powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
Colin Ian King
- Disable hugepd for 64K page size, from Aneesh Kumar K.V
- Differentiate between hugetlb and THP during page walk from Aneesh
Kumar K.V
- Make PCI non-optional for pseries from Michael Ellerman
- Individual System V IPC system calls from Sam bobroff
- Add selftest of unmuxed IPC calls from Michael Ellerman
- discard .exit.data at runtime from Stephen Rothwell
- Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
Gortmaker
- Use of_get_next_parent to simplify code from Christophe Jaillet
- Paginate some xmon output from Sam bobroff
- Add some more elements to the xmon PACA dump from Michael Ellerman
- Allow the tm-syscall selftest to build with old headers from Michael
Ellerman
- Run EBB selftests only on POWER8 from Denis Kirjanov
- Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
Ellerman
- Avoid reference to potentially freed memory in prom.c from Christophe
Jaillet
- Quieten boot wrapper output with run_cmd from Geoff Levand
- EEH fixes and cleanups from Gavin Shan
- Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
- Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
Ellerman
- Fix section mismatch warning in msi_bitmap_alloc() from Denis
Kirjanov
- Fix ps3-lpm white space from Rudhresh Kumar J
- Fix ps3-vuart null dereference from Colin King
- nvram: Add missing kfree in error path from Christophe Jaillet
- nvram: Fix function name in some errors messages, from Christophe
Jaillet
- drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
Koskinen
- agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
- cxl: Free virtual PHB when removing from Andrew Donnellan
- scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
Michael Ellerman
- scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
with O= from Michael Ellerman
- Freescale updates from Scott: Highlights include 64-bit book3e
kexec/kdump support, a rework of the qoriq clock driver, device tree
changes including qoriq fman nodes, support for a new 85xx board, and
some fixes.
- MPC5xxx updates from Anatolij: Highlights include a driver for
MPC512x LocalPlus Bus FIFO with its device tree binding
documentation, mpc512x device tree updates and some minor fixes.
* tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
powerpc/pseries: Correct string length in pseries_of_derive_parent()
powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
powerpc: handle error case in cpm_muram_alloc()
powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
powerpc/book3e-64: Enable kexec
powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
powerpc/book3e-64/kexec: Enable SMP release
powerpc/book3e-64/kexec: create an identity TLB mapping
powerpc/book3e-64: Don't limit paca to 256 MiB
powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
powerpc/book3e: support CONFIG_RELOCATABLE
powerpc/booke64: Fix args to copy_and_flush
powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
powerpc/e6500: kexec: Handle hardware threads
...
Pull irq updates from Thomas Gleixner:
"The irq departement delivers:
- Rework the irqdomain core infrastructure to accomodate ACPI based
systems. This is required to support ARM64 without creating
artificial device tree nodes.
- Sanitize the ACPI based ARM GIC initialization by making use of the
new firmware independent irqdomain core
- Further improvements to the generic MSI management
- Generalize the irq migration on CPU hotplug
- Improvements to the threaded interrupt infrastructure
- Allow the migration of "chained" low level interrupt handlers
- Allow optional force masking of interrupts in disable_irq[_nosysnc]
- Support for two new interrupt chips - Sigh!
- A larger set of errata fixes for ARM gicv3
- The usual pile of fixes, updates, improvements and cleanups all
over the place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
Document that IRQ_NONE should be returned when IRQ not actually handled
PCI/MSI: Allow the MSI domain to be device-specific
PCI: Add per-device MSI domain hook
of/irq: Use the msi-map property to provide device-specific MSI domain
of/irq: Split of_msi_map_rid to reuse msi-map lookup
irqchip/gic-v3-its: Parse new version of msi-parent property
PCI/MSI: Use of_msi_get_domain instead of open-coded "msi-parent" parsing
of/irq: Use of_msi_get_domain instead of open-coded "msi-parent" parsing
of/irq: Add support code for multi-parent version of "msi-parent"
irqchip/gic-v3-its: Add handling of PCI requester id.
PCI/MSI: Add helper function pci_msi_domain_get_msi_rid().
of/irq: Add new function of_msi_map_rid()
Docs: dt: Add PCI MSI map bindings
irqchip/gic-v2m: Add support for multiple MSI frames
irqchip/gic-v3: Fix translation of LPIs after conversion to irq_fwspec
irqchip/mxs: Add Alphascale ASM9260 support
irqchip/mxs: Prepare driver for hardware with different offsets
irqchip/mxs: Panic if ioremap or domain creation fails
irqdomain: Documentation updates
irqdomain/msi: Use fwnode instead of of_node
...
This fixes a bug where it is possible for an off-line CPU to fail to go
into a low-power state (nap/sleep/winkle), and to become unresponsive to
requests from the KVM subsystem to wake up and run a VCPU. What can
happen is that a maskable interrupt of some kind (external, decrementer,
hypervisor doorbell, or HMI) after we have called local_irq_disable() at
the beginning of pnv_smp_cpu_kill_self() and before interrupts are
hard-disabled inside power7_nap/sleep/winkle(). In this situation, the
pending event is marked in the irq_happened flag in the PACA. This
pending event prevents power7_nap/sleep/winkle from going to the
requested low-power state; instead they return immediately. We don't
deal with any of these pending event flags in the off-line loop in
pnv_smp_cpu_kill_self() because power7_nap et al. return 0 in this case,
so we will have srr1 == 0, and none of the processing to clear
interrupts or doorbells will be done.
Usually, the most obvious symptom of this is that a KVM guest will fail
with a console message saying "KVM: couldn't grab cpu N".
This fixes the problem by making sure we handle the irq_happened flags
properly. First, we hard-disable before the off-line loop. Once we have
hard-disabled, the irq_happened flags can't change underneath us. We
unconditionally clear the DEC and HMI flags: there is no processing of
timer interrupts while off-line, and the necessary HMI processing is all
done in lower-level code. We leave the EE and DBELL flags alone for the
first iteration of the loop, so that we won't fail to respond to a
split-core request that came in just before hard-disabling. Within the
loop, we handle external interrupts if the EE bit is set in irq_happened
as well as if the low-power state was interrupted by an external
interrupt. (We don't need to do the msgclr for a pending doorbell in
irq_happened, because doorbells are edge-triggered and don't remain
pending in hardware.) Then we clear both the EE and DBELL flags, and
once clear, they cannot be set again (until this CPU comes online again,
that is).
This also fixes the debug check to not be done when we just ran a KVM
guest or when the sleep didn't happen because of a pending event in
irq_happened.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Similar to commit b6541db ("powerpc/eeh: Block PCI config access
upon frozen PE"), this blocks the PCI config space of Broadcom
Shiner adapter until PE reset is completed, to avoid recursive
fenced PHB when dumping PCI config registers during the period
of error recovery.
~# lspci -ns 0003:03:00.0
0003:03:00.0 0200: 14e4:168a (rev 10)
~# lspci -s 0003:03:00.0
0003:03:00.0 Ethernet controller: Broadcom Corporation \
NetXtreme II BCM57800 1/10 Gigabit Ethernet (rev 10)
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This simplifies pnv_eeh_set_option() to avoid unnecessary nested
if statements, to improve readability. No functional changes.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the logic of pnv_eeh_cap_start() to pnv_eeh_find_cap()
as the function is only called by pnv_eeh_find_cap(). The logic
of both functions are pretty simple. No need to have separate
functions.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The struct irq_domain contains a "struct device_node *" field
(of_node) that is almost the only link between the irqdomain
and the device tree infrastructure.
In order to prepare for the removal of that field, convert all
users to use irq_domain_get_of_node() instead.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-and-tested-by: Hanjun Guo <hanjun.guo@linaro.org>
Tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: Tomasz Nowicki <tomasz.nowicki@linaro.org>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: Graeme Gregory <graeme@xora.org.uk>
Cc: Jake Oshins <jakeo@microsoft.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Link: http://lkml.kernel.org/r/1444737105-31573-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
All unrecovered machine check errors on PowerNV should cause an
immediate panic. There are 2 reasons that this is the right policy:
it's not safe to continue, and we're already trying to reboot.
Firstly, if we go through the recovery process and do not successfully
recover, we can't be sure about the state of the machine, and it is
not safe to recover and proceed.
Linux knows about the following sources of Machine Check Errors:
- Uncorrectable Errors (UE)
- Effective - Real Address Translation (ERAT)
- Segment Lookaside Buffer (SLB)
- Translation Lookaside Buffer (TLB)
- Unknown/Unrecognised
In the SLB, TLB and ERAT cases, we can further categorise these as
parity errors, multihit errors or unknown/unrecognised.
We can handle SLB errors by flushing and reloading the SLB. We can
handle TLB and ERAT multihit errors by flushing the TLB. (It appears
we may not handle TLB and ERAT parity errors: I will investigate
further and send a followup patch if appropriate.)
This leaves us with uncorrectable errors. Uncorrectable errors are
usually the result of ECC memory detecting an error that it cannot
correct, but they also crop up in the context of PCI cards failing
during DMA writes, and during CAPI error events.
There are several types of UE, and there are 3 places a UE can occur:
Skiboot, the kernel, and userspace. For Skiboot errors, we have the
facility to make some recoverable. For userspace, we can simply kill
(SIGBUS) the affected process. We have no meaningful way to deal with
UEs in kernel space or in unrecoverable sections of Skiboot.
Currently, these unrecovered UEs fall through to
machine_check_expection() in traps.c, which calls die(), which OOPSes
and sends SIGBUS to the process. This sometimes allows us to stumble
onwards. For example we've seen UEs kill the kernel eehd and
khugepaged. However, the process killed could have held a lock, or it
could have been a more important process, etc: we can no longer make
any assertions about the state of the machine. Similarly if we see a
UE in skiboot (and again we've seen this happen), we're not in a
position where we can make any assertions about the state of the
machine.
Likewise, for unknown or unrecognised errors, we're not able to say
anything about the state of the machine.
Therefore, if we have an unrecovered MCE, the most appropriate thing
to do is to panic.
The second reason is that since e784b6499d ("powerpc/powernv: Invoke
opal_cec_reboot2() on unrecoverable machine check errors."), we
attempt a special OPAL reboot on an unhandled MCE. This is so the
hardware can record error data for later debugging.
The comments in that commit assert that we are heading down the panic
path anyway. At the moment this is not always true. With UEs in kernel
space, for instance, they are marked as recoverable by the hardware,
so if the attempt to reboot failed (e.g. old Skiboot), we wouldn't
panic() but would simply die() and OOPS. It doesn't make sense to be
staggering on if we've just tried to reboot: we should panic().
Explicitly panic() on unrecovered MCEs on PowerNV.
Update the comments appropriately.
This fixes some hangs following EEH events on cxlflash setups.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Always include a timeout when waiting for secondary cpus to enter OPAL
in the kexec path, rather than only when crashing.
Signed-off-by: Samuel Mendoza-Jonas <sam.mj@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This fixes a race which can result in the same virtual IRQ number
being assigned to two different MSI interrupts. The most visible
consequence of that is usually a warning and stack trace from the
sysfs code about an attempt to create a duplicate entry in sysfs.
The race happens when one CPU (say CPU 0) is disposing of an MSI
while another CPU (say CPU 1) is setting up an MSI. CPU 0 calls
(for example) pnv_teardown_msi_irqs(), which calls
msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its
hardware IRQ number) is no longer in use. Then, before CPU 0 gets
to calling irq_dispose_mapping() to free up the virtal IRQ number,
CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an
MSI, and gets the same hardware IRQ number that CPU 0 just freed.
CPU 1 then calls irq_create_mapping() to get a virtual IRQ number,
which sees that there is currently a mapping for that hardware IRQ
number and returns the corresponding virtual IRQ number (which is
the same virtual IRQ number that CPU 0 was using). CPU 0 then
calls irq_dispose_mapping() and frees that virtual IRQ number.
Now, if another CPU comes along and calls irq_create_mapping(), it
is likely to get the virtual IRQ number that was just freed,
resulting in the same virtual IRQ number apparently being used for
two different hardware interrupts.
To fix this race, we just move the call to msi_bitmap_free_hwirqs()
to after the call to irq_dispose_mapping(). Since virq_to_hw()
doesn't work for the virtual IRQ number after irq_dispose_mapping()
has been called, we need to call it before irq_dispose_mapping() and
remember the result for the msi_bitmap_free_hwirqs() call.
The pattern of calling msi_bitmap_free_hwirqs() before
irq_dispose_mapping() appears in 5 places under arch/powerpc, and
appears to have originated in commit 05af7bd2d7 ("[POWERPC] MPIC
U3/U4 MSI backend") from 2007.
Fixes: 05af7bd2d7 ("[POWERPC] MPIC U3/U4 MSI backend")
Cc: stable@vger.kernel.org # v2.6.22+
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 32-bit TCE table initialization relies on the DMA window having a
size equal to a power of 2 (and checks for it explicitly). But
crashkernel= has no constraint that requires a power-of-2 be specified.
This causes the kdump kernel to fail to boot as none of the PCI devices
(including the disk controller) are successfully initialized.
After this change, the PCI devices successfully set up the 32-bit TCE
table and kdump succeeds.
Fixes: aca6913f55 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org # 4.2
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When attempting to kdump with the 4.2 kernel, we see for each PCI
device:
pci 0003:01 : [PE# 000] Assign DMA32 space
pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..80000000
pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22
PCI: Domain 0004 has 8 available 32-bit DMA segments
PCI: 4 PE# for a total weight of 70
pci 0004:01 : [PE# 002] Assign DMA32 space
pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..80000000
pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22
pci 0004:0d : [PE# 005] Assign DMA32 space
pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..80000000
pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22
pci 0004:0e : [PE# 006] Assign DMA32 space
pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..80000000
pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22
pci 0004:10 : [PE# 008] Assign DMA32 space
pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..80000000
pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22
and eventually the kdump kernel fails to boot as none of the PCI devices
(including the disk controller) are successfully initialized.
The EINVAL response is because the DMA window (the 2GB base window) is
larger than the kdump kernel's reserved memory (crashkernel=, in this
case specified to be 1024M). The check in question,
if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust
the caller to pass in a smaller window size if our maximum memory value
is smaller than the DMA window.
After this change, the PCI devices successfully set up the 32-bit TCE
table and kdump succeeds.
The problem was seen on a Firestone machine originally.
Fixes: aca6913f55 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages")
Cc: stable@vger.kernel.org # 4.2
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[mpe: Coding style pedantry, use u64, change the indentation]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Support "hybrid" iommu/direct DMA ops for coherent_mask < dma_mask from Benjamin Herrenschmidt
- EEH fixes for SRIOV from Gavin
- Introduce rtas_get_sensor_fast() for IRQ handlers from Thomas Huth
- Use hardware RNG for arch_get_random_seed_* not arch_get_random_* from Paul Mackerras
- Seccomp filter support from Michael Ellerman
- opal_cec_reboot2() handling for HMIs & machine checks from Mahesh Salgaonkar
- Add powerpc timebase as a trace clock source from Naveen N. Rao
- Misc cleanups in the xmon, signal & SLB code from Anshuman Khandual
- Add an inline function to update POWER8 HID0 from Gautham R. Shenoy
- Fix pte_pagesize_index() crash on 4K w/64K hash from Michael Ellerman
- Drop support for 64K local store on 4K kernels from Michael Ellerman
- move dma_get_required_mask() from pnv_phb to pci_controller_ops from Andrew Donnellan
- Initialize distance lookup table from drconf path from Nikunj A Dadhania
- Enable RTC class support from Vaibhav Jain
- Disable automatically blocked PCI config from Gavin Shan
- Add LEDs driver for PowerNV platform from Vasant Hegde
- Fix endianness issues in the HVSI driver from Laurent Dufour
- Kexec endian fixes from Samuel Mendoza-Jonas
- Fix corrupted pdn list from Gavin Shan
- Fix fenced PHB caused by eeh_slot_error_detail() from Gavin Shan
- Freescale updates from Scott: Highlights include 32-bit memcpy/memset
optimizations, checksum optimizations, 85xx config fragments and updates,
device tree updates, e6500 fixes for non-SMP, and misc cleanup and minor
fixes.
- A ton of cxl updates & fixes:
- Add explicit precision specifiers from Rasmus Villemoes
- use more common format specifier from Rasmus Villemoes
- Destroy cxl_adapter_idr on module_exit from Johannes Thumshirn
- Destroy afu->contexts_idr on release of an afu from Johannes Thumshirn
- Compile with -Werror from Daniel Axtens
- EEH support from Daniel Axtens
- Plug irq_bitmap getting leaked in cxl_context from Vaibhav Jain
- Add alternate MMIO error handling from Ian Munsie
- Allow release of contexts which have been OPENED but not STARTED from Andrew Donnellan
- Remove use of macro DEFINE_PCI_DEVICE_TABLE from Vaishali Thakkar
- Release irqs if memory allocation fails from Vaibhav Jain
- Remove racy attempt to force EEH invocation in reset from Daniel Axtens
- Fix + cleanup error paths in cxl_dev_context_init from Ian Munsie
- Fix force unmapping mmaps of contexts allocated through the kernel api from Ian Munsie
- Set up and enable PSL Timebase from Philippe Bergheaud
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV5+GzAAoJEFHr6jzI4aWA0iAP/jcd0kNaNBzLgcDKKygKdgz4
xn4EWu81vfMfZYWesb0ATrjlH0hLsRxSXoFUqUMhtJTa5kNAoCIaz/M8WBALS50h
aT+i7br4WEU2j2FcaMyP3iAZx/2hl+2utODJSHPRWPkec1fUDBfEyBf++e520RWM
HUQGIGZXh8yq7KMA96Pwhsvls9vOB8hS2UdU/NS8ff3J5jFvXC1/WmF2qfzJBS1V
8iHyz26Jl8+dJ+et7iC2oD5XQAjIH1oJgOyPVPBzAQttfi8RjuVzRA30TfPBAUwI
lC9nlmPy6bCe4kiQYWVB1z7GegHyW/9vkeuMj/u8mZbqpaayMEMZmd2C3iNDXNHx
i2NSvdln539t4qWYsV2v6lVCfa/ayDHD73Wackj5Dk394tzXnpCPhxNzc2yKEd5v
h7vwYc9jBhsbfSCSogaM+gSHJ1APgCidggHJMYYNA2nN2u6V62RpsMB7zp/1+Q2v
yqYdD8oYF4Dm21x/ujaNFrlizROD46WS0UqdJ3yP6HAqRYIpRXtibmpECJgt1n5h
HjADEci4hQ2UQxdMdp/Q5KZnPTJebBtrZrmkW5r6cZBUaTB5TVkFaEWN44CT/Loh
tMNeA3qOBN06CaQS2WL3UUUWpbZq9fSbWuUZ5lWZDb5AOyRxe5eWVYNLkiyIXozY
L24l1bYdBhXahnjoS/kc
=n9+X
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- support "hybrid" iommu/direct DMA ops for coherent_mask < dma_mask
from Benjamin Herrenschmidt
- EEH fixes for SRIOV from Gavin
- introduce rtas_get_sensor_fast() for IRQ handlers from Thomas Huth
- use hardware RNG for arch_get_random_seed_* not arch_get_random_*
from Paul Mackerras
- seccomp filter support from Michael Ellerman
- opal_cec_reboot2() handling for HMIs & machine checks from Mahesh
Salgaonkar
- add powerpc timebase as a trace clock source from Naveen N. Rao
- misc cleanups in the xmon, signal & SLB code from Anshuman Khandual
- add an inline function to update POWER8 HID0 from Gautham R. Shenoy
- fix pte_pagesize_index() crash on 4K w/64K hash from Michael Ellerman
- drop support for 64K local store on 4K kernels from Michael Ellerman
- move dma_get_required_mask() from pnv_phb to pci_controller_ops from
Andrew Donnellan
- initialize distance lookup table from drconf path from Nikunj A
Dadhania
- enable RTC class support from Vaibhav Jain
- disable automatically blocked PCI config from Gavin Shan
- add LEDs driver for PowerNV platform from Vasant Hegde
- fix endianness issues in the HVSI driver from Laurent Dufour
- kexec endian fixes from Samuel Mendoza-Jonas
- fix corrupted pdn list from Gavin Shan
- fix fenced PHB caused by eeh_slot_error_detail() from Gavin Shan
- Freescale updates from Scott: Highlights include 32-bit memcpy/memset
optimizations, checksum optimizations, 85xx config fragments and
updates, device tree updates, e6500 fixes for non-SMP, and misc
cleanup and minor fixes.
- a ton of cxl updates & fixes:
- add explicit precision specifiers from Rasmus Villemoes
- use more common format specifier from Rasmus Villemoes
- destroy cxl_adapter_idr on module_exit from Johannes Thumshirn
- destroy afu->contexts_idr on release of an afu from Johannes
Thumshirn
- compile with -Werror from Daniel Axtens
- EEH support from Daniel Axtens
- plug irq_bitmap getting leaked in cxl_context from Vaibhav Jain
- add alternate MMIO error handling from Ian Munsie
- allow release of contexts which have been OPENED but not STARTED
from Andrew Donnellan
- remove use of macro DEFINE_PCI_DEVICE_TABLE from Vaishali Thakkar
- release irqs if memory allocation fails from Vaibhav Jain
- remove racy attempt to force EEH invocation in reset from Daniel
Axtens
- fix + cleanup error paths in cxl_dev_context_init from Ian Munsie
- fix force unmapping mmaps of contexts allocated through the kernel
api from Ian Munsie
- set up and enable PSL Timebase from Philippe Bergheaud
* tag 'powerpc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (140 commits)
cxl: Set up and enable PSL Timebase
cxl: Fix force unmapping mmaps of contexts allocated through the kernel api
cxl: Fix + cleanup error paths in cxl_dev_context_init
powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail()
powerpc/pseries: Cleanup on pci_dn_reconfig_notifier()
powerpc/pseries: Fix corrupted pdn list
powerpc/powernv: Enable LEDS support
powerpc/iommu: Set default DMA offset in dma_dev_setup
cxl: Remove racy attempt to force EEH invocation in reset
cxl: Release irqs if memory allocation fails
cxl: Remove use of macro DEFINE_PCI_DEVICE_TABLE
powerpc/powernv: Fix mis-merge of OPAL support for LEDS driver
powerpc/powernv: Reset HILE before kexec_sequence()
powerpc/kexec: Reset secondary cpu endianness before kexec
powerpc/hvsi: Fix endianness issues in the HVSI driver
leds/powernv: Add driver for PowerNV platform
powerpc/powernv: Create LED platform device
powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states
powerpc/powernv: Fix the log message when disabling VF
cxl: Allow release of contexts which have been OPENED but not STARTED
...
Pull irq updates from Thomas Gleixner:
"This updated pull request does not contain the last few GIC related
patches which were reported to cause a regression. There is a fix
available, but I let it breed for a couple of days first.
The irq departement provides:
- new infrastructure to support non PCI based MSI interrupts
- a couple of new irq chip drivers
- the usual pile of fixlets and updates to irq chip drivers
- preparatory changes for removal of the irq argument from interrupt
flow handlers
- preparatory changes to remove IRQF_VALID"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
irqchip/imx-gpcv2: IMX GPCv2 driver for wakeup sources
irqchip: Add bcm2836 interrupt controller for Raspberry Pi 2
irqchip: Add documentation for the bcm2836 interrupt controller
irqchip/bcm2835: Add support for being used as a second level controller
irqchip/bcm2835: Refactor handle_IRQ() calls out of MAKE_HWIRQ
PCI: xilinx: Fix typo in function name
irqchip/gic: Ensure gic_cpu_if_up/down() programs correct GIC instance
irqchip/gic: Only allow the primary GIC to set the CPU map
PCI/MSI: pci-xgene-msi: Consolidate chained IRQ handler install/remove
unicore32/irq: Prepare puv3_gpio_handler for irq argument removal
tile/pci_gx: Prepare trio_handle_level_irq for irq argument removal
m68k/irq: Prepare irq handlers for irq argument removal
C6X/megamode-pic: Prepare megamod_irq_cascade for irq argument removal
blackfin: Prepare irq handlers for irq argument removal
arc/irq: Prepare idu_cascade_isr for irq argument removal
sparc/irq: Use access helper irq_data_get_affinity_mask()
sparc/irq: Use helper irq_data_get_irq_handler_data()
parisc/irq: Use access helper irq_data_get_affinity_mask()
mn10300/irq: Use access helper irq_data_get_affinity_mask()
irqchip/i8259: Prepare i8259_irq_dispatch for irq argument removal
...
Commit e91c25111a "powerpc/iommu: Cleanup setting of DMA base/offset"
expects that the default DMA offset is set from pnv_ioda_setup_bus_dma()
which is correct unless it is SRIOV where the code flow is different -
at the moment when pnv_ioda_setup_bus_dma() is called, PCI devices for
VFs are not created yet.
This adds missing set_dma_offset() to pnv_pci_ioda_dma_dev_setup() to
cover the case of SRIOV.
Note that we still need set_dma_offset() in pnv_ioda_setup_bus_dma() as
at the boot time pnv_pci_ioda_dma_dev_setup() is called when no PE was
created yet, this happens at the PHB fixup stage.
Fixes: e91c25111a ("powerpc/iommu: Cleanup setting of DMA base/offset")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On powernv secondary cpus are returned to OPAL, and will then enter
the target kernel in big-endian. However if it is set the HILE bit
will persist, causing the first exception in the target kernel to be
delivered in litte-endian regardless of the current endianness.
If running on top of OPAL make sure the HILE bit is reset once we've
finished waiting for all of the secondaries to be returned to OPAL.
Signed-off-by: Samuel Mendoza-Jonas <sam.mj@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds platform devices for leds. Also export LED related
OPAL API's so that led driver can use these APIs.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch registers the following two new OPAL interfaces calls
for the platform LED subsystem. With the help of these new OPAL calls,
the kernel will be able to get or set the state of various individual
LEDs on the system at any given location code which is passed through
the LED specific device tree nodes.
(1) OPAL_LEDS_GET_INDICATOR opal_leds_get_ind
(2) OPAL_LEDS_SET_INDICATOR opal_leds_set_ind
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Tested-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On powernv platform, IOV BAR would be shifted if necessary. While the log
message is not correct when disabling VFs.
This patch fixes this by print correct message based on the offset value.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Simplify the dma_get_required_mask call chain by moving it from pnv_phb to
pci_controller_ops, similar to commit 763d2d8df1 ("powerpc/powernv:
Move dma_set_mask from pnv_phb to pci_controller_ops").
Previous call chain:
0) call dma_get_required_mask() (kernel/dma.c)
1) call ppc_md.dma_get_required_mask, if it exists. On powernv, that
points to pnv_dma_get_required_mask() (platforms/powernv/setup.c)
2) device is PCI, therefore call pnv_pci_dma_get_required_mask()
(platforms/powernv/pci.c)
3) call phb->dma_get_required_mask if it exists
4) it only exists in the ioda case, where it points to
pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c)
New call chain:
0) call dma_get_required_mask() (kernel/dma.c)
1) device is PCI, therefore call pci_controller_ops.dma_get_required_mask
if it exists
2) in the ioda case, that points to pnv_pci_ioda_dma_get_required_mask()
(platforms/powernv/pci-ioda.c)
In the p5ioc2 case, the call chain remains the same -
dma_get_required_mask() does not find either a ppc_md call or
pci_controller_ops call, so it calls __dma_get_required_mask().
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual
prescribes that updates to HID0 be preceded by a SYNC instruction and
followed by an ISYNC instruction (Page 91).
Create an inline function name update_power8_hid0() which follows this
recipe and invoke it from the static split core path.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Tested-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Invoke new opal_cec_reboot2() call with reboot type
OPAL_REBOOT_PLATFORM_ERROR (for unrecoverable HMI interrupts) to inform
BMC/OCC about this error, so that BMC can collect relevant data for error
analysis and decide what component to de-configure before rebooting.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On non-recoverable MCE errors in kernel space, Linux kernel panics
and system reboots. On BMC based system opal-prd runs as a daemon
in the host. Hence, kernel crash may prevent opal-prd to detect and
analyze this MCE error. This may land us in a situation where the faulty
memory never gets de-configured and Linux would keep hitting same MCE error
again and again. If this happens in early stage of kernel initialization,
then Linux will keep crashing and rebooting in a loop.
This patch fixes this issue by invoking new opal_cec_reboot2() call with
reboot type OPAL_REBOOT_PLATFORM_ERROR to inform BMC/OCC about this
error, so that BMC can collect relevant data for error analysis and
decide what component to de-configure before rebooting.
This patch is dependent on OPAL patchset posted on skiboot mailing list
at https://lists.ozlabs.org/pipermail/skiboot/2015-July/001771.html that
introduces opal_cec_reboot2() opal call.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the event of unrecovered HMI the existing code panics as soon as
it receives the first unrecovered HMI event. This makes host to report
partial information about HMIs before panic. There may be more errors
which would have caused the HMI and hence more HMI event would have been
generated waiting to be pulled by host. This patch implements a logic to
pull and display all the HMI event before going down panic path.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The V2 version of HMI event now carries additional information for
Malfunction Alert. It now contains error information about CORE and NX
checkstop. This patch checks and displays the check stop reason before
panic.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pnv_eeh_next_error() re-enables the eeh opal event interrupt but it
gets called from a loop if there are more outstanding events to
process, resulting in a warning due to enabling an already enabled
interrupt. Instead the interrupt should only be re-enabled once the
last outstanding event has been processed.
Tested-by: Daniel Axtens <dja@axtens.net>
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It is not uncommon (at least with the ARM stuff) to have a piece
of hardware that implements different flavours of "interrupts".
A typical example of this is the GICv3 ITS, which implements
standard PCI/MSI support, but also some form of "generic MSI".
So far, the PCI/MSI domain is registered using the ITS device_node,
so that irq_find_host can return it. On the contrary, the raw MSI
domain is not registered with an device_node, making it impossible
to be looked up by another subsystem (obviously, using the same
device_node twice would only result in confusion, as it is not
defined which one irq_find_host would return).
A solution to this is to "type" domains that may be aliasing, and
to be able to lookup an device_node that matches a given type.
For this, we introduce irq_find_matching_host() as a superset
of irq_find_host:
struct irq_domain *irq_find_matching_host(struct device_node *node,
enum irq_domain_bus_token bus_token);
where bus_token is the "type" we want to match the domain against
(so far, only DOMAIN_BUS_ANY is defined). This result in some
moderately invasive changes on the PPC side (which is the only
user of the .match method).
This has otherwise no functionnal change.
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Ma Jun <majun258@huawei.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Duc Dang <dhdang@apm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/1438091186-10244-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The existing code stores the amount of memory allocated for a TCE table.
At the moment it uses @offset which is a virtual offset in the TCE table
which is only correct for a one level tables and it does not include
memory allocated for intermediate levels. When multilevel TCE table is
requested, WARN_ON in tce_iommu_create_table() prints a warning.
This adds an additional counter to pnv_pci_ioda2_table_do_alloc_pages()
to count actually allocated memory.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The hardware RNG on POWER8 and POWER7+ can be relatively slow, since
it can only supply one 64-bit value per microsecond. Currently we
read it in arch_get_random_long(), but that slows down reading from
/dev/urandom since the code in random.c calls arch_get_random_long()
for every longword read from /dev/urandom.
Since the hardware RNG supplies high-quality entropy on every read, it
matches the semantics of arch_get_random_seed_long() better than those
of arch_get_random_long(). Therefore this commit makes the code use
the POWER8/7+ hardware RNG only for arch_get_random_seed_{long,int}
and not for arch_get_random_{long,int}.
This won't affect any other PowerPC-based platforms because none of
them currently support a hardware RNG. To make it clear that the
ppc_md function pointer is used for arch_get_random_seed_*, we rename
it from get_random_long to get_random_seed.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When detecting EEH error on non-existing PE, including the reserved
one, the PE is simply unfrozen without dumping the PHB diag-data,
which is useful for locating the root cause of the EEH error. The
patch dumps the PHB diag-data when non-existing PE reports error.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On LE kernel, the non-existing PE number in BE format derived from
skiboot firmware isn't converted to LE format properly as following
kernel log indicates:
EEH: Clear non-existing PHB#4-PE#200000000000000
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds support for OPAL EPOW (Environmental and Power Warnings)
and DPO (Delayed Power Off) events for the PowerNV platform. These events
are generated on FSP (Flexible Service Processor) based systems. EPOW
events are generated due to various critical system conditions that
require system shutdown. A few examples of these conditions are high
ambient temperature or system running on UPS power with low UPS battery.
DPO event is generated in response to admin initiated system shutdown
request. Upon receipt of EPOW and DPO events the host kernel invokes
orderly_poweroff() for performing graceful system shutdown.
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Acked-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When releasing PE for SRIOV VF, the PE is forced to be frozen
wrongly. When the same PE is picked for another VF, it won't
work anyhow. The patch fixes the issue by unfreezing, not
freezing the VF PE when releasing it.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PELTV of PF PE should include VF PE, which is missed by current
code, so that the VF PE is frozen automatically when freezing PF PE.
The patch fixes the PELTV of PF PE to include VF PE.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PHB3, PE might be reserved in advance to reflect the M64 segments
consumed by the PE according to M64 BARs (exclude VF BARs) of the PCI
devices included in the PE. The PE is picked based on M64 BARs instead
of the bridge's M64 windows, which might include VF BARs. Otherwise,
wrong PE could be picked.
The patch calculates the used M64 segments and PE numbers according to
the M64 BARs, excluding VF BARs, of PCI devices in one particular PE,
instead of the bridge's M64 windows. Then the right PE number is picked.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The patch changes the type of last argument of pnv_ioda_setup_bus_PE()
and phb::pick_m64_pe() to boolean. No functional change.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PHB3, some PEs might be reserved in advance to reflect the M64
segments consumed by those PEs. We're reserving PEs based on the
M64 window of root port, which might contain VF BAR. The PEs for
VFs are allocated dynamically, not reserved based on the consumed
M64 segments. So the M64 window of root port isn't reliable for
the task. Instead, we go through M64 BARs (VF BARs excluded) of
PCI devices under the specified root bus and reserve PEs accordingly,
as the patch does.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PE numbers are reserved according to root port's M64 window,
which is aligned to M64 segment finely. So one PE shouldn't be
reserved for multiple times. We will reserve PE numbers according
to the M64 BARs of PCI device in subsequent patches, which aren't
aligned to M64 segment size finely. It means one particular PE
could be reserved for multiple times.
The patch allows one PE to be reserved for multiple times and we
print the warning message at debugging level.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that the table and the offset can co-exist, we no longer need
to flip/flop, we can just establish both once at boot time.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The conversion of opal events to a proper irqchip means that handlers
are called until the relevant opal event has been cleared by
processing it. Events that queue work should therefore use a threaded
handler to mask the event until processing is complete.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
opal-prd driver will mmap() firmware code/data area as private
mapping to prd user space daemon. Write to this page will
trigger COW faults. The new COW pages are normal kernel RAM
pages accounted by the kernel and are not special.
vma->vm_page_prot value will be used at page fault time
for the new COW pages, while pgprot_t value passed in
remap_pfn_range() is used for the initial page table entry.
Hence:
* Do not add _PAGE_SPECIAL in vma, but only for remap_pfn_range()
* Also remap_pfn_range() will add the _PAGE_SPECIAL flag using
pte_mkspecial() call, hence no need to specify in the driver
This fix resolves the page accounting warning shown below:
BUG: Bad rss-counter state mm:c0000007d34ac600 idx:1 val:19
The above warning is triggered since _PAGE_SPECIAL was incorrectly
being set for the normal kernel COW pages.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When pnv_pci_ioda_fixup() is called during PHB fixup time, each PE in
the sorted list of PEs (phb::pe_dma_list) is iterated to setup the PE's
DMA32 space by pnv_ioda_setup_bus_dma() if the PE's DMA32 weight is bigger
than zero. The function also assigns all the subordinate PCI devices of
the PE's primary bus with the PE's DMA32 IOMMU table. It causes the PCI
devicess in the child PEs, which don't have DMA weight, receives wrong
IOMMU table and then IOMMU group.
The patch fixes above issue by more check on the PE's coverage and don't
assign IOMMU table to those PCI devices, which belong to the child PEs.
The problem was found on Firestone platform initially.
Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pnv_pci_ioda2_unset_window() function is used to do the final
cleanup of a DMA window being released:
- via VFIO ioctl by the guest request;
- via unplugging a virtual PCI function.
However the function was under #ifdef CONFIG_IOMMU_API and was missing.
This moves the helper outside of IOMMU_API block and enables it
for either or both IOMMU_API and PCI_IOV.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We currently have a bug in the PRD code, where the contents of an
incoming message (beyond the header) will be overwritten by the list
item manipulations when adding to to the prd_msg_queue.
This change reorders struct opal_prd_msg_queue_item, so that the
message body doesn't overlap the list_head.
We also clarify the memcpy of the message, as we're copying unnecessary
bytes at the end of the message data.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The eeh subsystem for powernv requires the opal event irqchip to be
initialised prior to initialisation or the following errors are
produced (and eeh doesn't work as expected):
irq: XICS didn't like hwirq-0x9 to VIRQ17 mapping (rc=-22)
pnv_eeh_post_init: Can't request OPAL event interrupt (0)
On powernv eeh is initialised from a subsys_initcall due to a check
for machine_is(powernv) in eeh_init(). This patch increases the
initcall priority of opal_event_init() to an arch_initcall to ensure
the opal event interface is initialised prior to any users of it.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Although this init call checks for device tree properties before doing
anything, it should still only run on powernv machines.
Reviewed-by: Shreyas B Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.
This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.
This changes IODA2's onwership handler to remove the existing table
rather than manipulating with the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.
Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.
No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.
This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.
This stores the allocated table size in pnv_pci_ioda2_get_table_size()
so the locked_vm counter can be updated correctly when a table is
being disposed.
This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The existing code programmed TVT#0 with some address and then
immediately released that memory.
This makes use of pnv_pci_ioda2_unset_window() and
pnv_pci_ioda2_set_bypass() which do correct resource release and
TVT update.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.
create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculated the DMA window offset on a PCI bus from @num
and stores it in a just created table.
set_window() sets the window at specified TVT index + @num on PHB.
unset_window() unsets the window from specified TVT.
This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.
create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.
This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
on huge guests (hundreds of GB of RAM) so the kernel might be unable to
allocate contiguous chunk of physical memory to store the TCE table.
To address this, POWER8 CPU (actually, IODA2) supports multi-level
TCE tables, up to 5 levels which splits the table into a tree of
smaller subtables.
This adds multi-level TCE tables support to
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
helpers.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a part of moving DMA window programming to an iommu_ops
callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
a first parameter (not pnv_ioda_pe) as it is going to be used as
a callback for VFIO DDW code.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a part of moving TCE table allocation into an iommu_ops
callback to support multiple IOMMU groups per one VFIO container.
This moves the code which allocates the actual TCE tables to helpers:
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages().
These do not allocate/free the iommu_table struct.
This enforces window size to be a power of two.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves iommu_table creation to the beginning to make following changes
easier to review. This starts using table parameters from the iommu_table
struct.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.
Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are to protect
DMA page allocator rather than entries and since the host kernel does
not control what pages are in use, there is no point in pool locks and
exchange()+put_page(oldtce) is sufficient to avoid possible races.
This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE and DMA direction so
the caller can release the pages afterwards. The exchange() receives
a physical address unlike set() which receives linear mapping address;
and returns a physical address as the clear() does.
This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.
This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().
This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.
This moves SetPageDirty() to the IOMMU code to make it work for both
VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
available later).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This replaces direct accesses to TCE table with a helper which
returns an TCE entry address. This does not make difference now but will
when multi-level TCE tables get introduces.
No change in behavior is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The iommu_table struct keeps a list of IOMMU groups it is used for.
At the moment there is just a single group attached but further
patches will add TCE table sharing. When sharing is enabled, TCE cache
in each PE needs to be invalidated so does the patch.
This does not change pnv_pci_ioda1_tce_invalidate() as there is no plan
to enable TCE table sharing on PHBs older than IODA2.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment the DMA setup code looks for the "ibm,opal-tce-kill"
property which contains the TCE kill register address. Writing to
this register invalidates TCE cache on IODA/IODA2 hub.
This moves the register address from iommu_table to pnv_pnb as this
register belongs to PHB and invalidates TCE cache for all tables of
all attached PEs.
This moves the property reading/remapping code to a helper which is
called when DMA is being configured for PE and which does DMA setup
for both IODA1 and IODA2.
This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which
invalidates cache for the entire table. It should be called after
every call to opal_pci_map_pe_dma_window(). It was not required before
because there was just a single TCE table and 64bit DMA was handled via
bypass window (which has no table so no cache was used) but this is going
to change with Dynamic DMA windows (DDW).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.
At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.
The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.
This replaces set_bypass() with ownership callbacks as it is not
necessarily just bypass enabling, it can be something else/more
so let's give it more generic name.
The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.
As we here and touching bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much
more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
So far one TCE table could only be used by one IOMMU group. However
IODA2 hardware allows programming the same TCE table address to
multiple PE allowing sharing tables.
This replaces a single pointer to a group in a iommu_table struct
with a linked list of groups which provides the way of invalidating
TCE cache for every PE when an actual TCE table is updated. This adds
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
helpers to manage the list. However without VFIO, it is still going
to be a single IOMMU group per iommu_table.
This changes iommu_add_device() to add a device to a first group
from the group list of a table as it is only called from the platform
init code or PCI bus notifier and at these moments there is only
one group per table.
This does not change TCE invalidation code to loop through all
attached groups in order to simplify this patch and because
it is not really needed in most cases. IODA2 is fixed in a later
patch.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
for TCE tables. Right now just one table is supported.
This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This replaces iommu_table with
iommu_table_group where iommu_table was used to identify a group:
- iommu_register_group();
- iommudata of generic iommu_group;
This removes @data from iommu_table as it_table_group provides
same access to pnv_ioda_pe.
For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iommu_table structs are allocated
dynamically.
For P5IOC2, both iommu_table_group and iommu_table are embedded into
PE struct. As there is no EEH and SRIOV support for P5IOC2,
iommu_free_table() should not be called on iommu_table struct pointers
so we can keep it embedded in pnv_phb::p5ioc2.
For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_alloc_group() helper and stores the table group struct
pointer into the pci_dn struct. For release, a iommu_table_free_group()
helper is added.
This moves iommu_table struct allocation from SR-IOV code to
the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
This change is here because those lines had to be changed anyway.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pnv_pci_ioda_tce_invalidate() helper invalidates TCE cache. It is
supposed to be called on IODA1/2 and not called on p5ioc2. It receives
start and end host addresses of TCE table.
IODA2 actually needs PCI addresses to invalidate the cache. Those
can be calculated from host addresses but since we are going
to implement multi-level TCE tables, calculating PCI address from
a host address might get either tricky or ugly as TCE table remains flat
on PCI bus but not in RAM.
This moves pnv_pci_ioda_tce_invalidate() from generic pnv_tce_build/
pnt_tce_free and defines IODA1/2-specific callbacks which call generic
ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps
using generic callbacks as before.
This changes pnv_pci_ioda2_tce_invalidate() to receives TCE index and
number of pages which are PCI addresses shifted by IOMMU page shift.
No change in behaviour is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a iommu_table_ops struct and puts pointer to it into
the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct where they really belong to.
This adds the requirement for @it_ops to be initialized before calling
iommu_init_table() to make sure that we do not leave any IOMMU table
with iommu_table_ops uninitialized. This is not a parameter of
iommu_init_table() though as there will be cases when iommu_init_table()
will not be called on TCE tables, for example - VFIO.
This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_"
redundant prefixes.
This removes tce_xxx_rm handlers from ppc_md but does not add
them to iommu_table_ops as this will be done later if we decide to
support TCE hypercalls in real mode. This removes _vm callbacks as
only virtual mode is supported by now so this also removes @rm parameter.
For pSeries, this always uses tce_buildmulti_pSeriesLP/
tce_buildmulti_pSeriesLP. This changes multi callback to fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is we still have to support "multitce=off"
boot parameter in disable_multitce() and we do not want to walk through
all IOMMU tables in the system and replace "multi" callbacks with single
ones.
For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2.
This makes the callbacks for them public. Later patches will extend
callbacks for IODA1/2.
No change in behaviour is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Normally a bitmap from the iommu_table is used to track what TCE entry
is in use. Since we are going to use iommu_table without its locks and
do xchg() instead, it becomes essential not to put bits which are not
implied in the direction flag as the old TCE value (more precisely -
the permission bits) will be used to decide whether to put the page or not.
This adds iommu_direction_to_tce_perm() (its counterpart is there already)
and uses it for powernv's pnv_tce_build().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
So far an iommu_table lifetime was the same as PE. Dynamic DMA windows
will change this and iommu_free_table() will not always require
the group to be released.
This moves iommu_group_put() out of iommu_free_table().
This adds a iommu_pseries_free_table() helper which does
iommu_group_put() and iommu_free_table(). Later it will be
changed to receive a table_group and we will have to change less
lines then.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The existing code has 3 calls to iommu_register_group() and
all 3 branches actually cover all possible cases.
This replaces 3 calls with one and moves the registration earlier;
the latter will make more sense when we add TCE table sharing.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The set_iommu_table_base_and_group() name suggests that the function
sets table base and add a device to an IOMMU group.
The actual purpose for table base setting is to put some reference
into a device so later iommu_add_device() can get the IOMMU group
reference and the device to the group.
At the moment a group cannot be explicitly passed to iommu_add_device()
as we want it to work from the bus notifier, we can fix it later and
remove confusing calls of set_iommu_table_base().
This replaces set_iommu_table_base_and_group() with a couple of
set_iommu_table_base() + iommu_add_device() which makes reading the code
easier.
This adds few comments why set_iommu_table_base() and iommu_add_device()
are called where they are called.
For IODA1/2, this essentially removes iommu_add_device() call from
the pnv_pci_ioda_dma_dev_setup() as it will always fail at this particular
place:
- for physical PE, the device is already attached by iommu_add_device()
in pnv_pci_ioda_setup_dma_pe();
- for virtual PE, the sysfs entries are not ready to create all symlinks
so actual adding is happening in tce_iommu_bus_notifier.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This relies on the fact that a PCI device always has an IOMMU table
which may not be the case when we get dynamic DMA windows so
let's use more reliable check for IOMMU group here.
As we do not rely on the table presence here, remove the workaround
from pnv_pci_ioda2_set_bypass(); also remove the @add_to_iommu_group
parameter from pnv_ioda_setup_bus_dma().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This change adds a char device to access the "PRD" (processor runtime
diagnostics) channel to OPAL firmware.
Includes contributions from Vaidyanathan Srinivasan, Neelesh Gupta &
Vishal Kulkarni.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The (upcoming) opal-prd driver needs to access the message notifier and
xscom code, so add EXPORT_SYMBOL_GPL macros for these.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
opal_ipmi_init and opal_flash_init are equivalent, except for the
compatbile string. Merge these two into a common opal_pdev_init
function.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The opal_{get,set}_param calls return internal error codes which need
to be translated in errnos in Linux.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the current include file from cxl.h -> cxl-base.h. This current
include file is used only to pass information between the base driver that
needs to be built into the kernel and the cxl module.
This is to make way for a new include/misc/cxl.h which will
contain just the kernel API for other driver to use
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently pnv_pci_shutdown() calls the PHB shutdown code for all PHBs in the
system. It dereferences the private_data assuming it's a powernv PHB, which
won't be the case when we have different PHB in the systems (like when we add
vPHBs for CXL).
This moves the shutdown hook to the pci_controller_ops and fixes the call site
to use that instead.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Previously, dma_set_mask() on powernv was convoluted:
0) Call dma_set_mask() (a/p/kernel/dma.c)
1) In dma_set_mask(), ppc_md.dma_set_mask() exists, so call it.
2) On powernv, that function pointer is pnv_dma_set_mask().
In pnv_dma_set_mask(), the device is pci, so call pnv_pci_dma_set_mask().
3) In pnv_pci_dma_set_mask(), call pnv_phb->set_dma_mask() if it exists.
4) It only exists in the ioda case, where it points to
pnv_pci_ioda_dma_set_mask(), which is the final function.
So the call chain is:
dma_set_mask() ->
pnv_dma_set_mask() ->
pnv_pci_dma_set_mask() ->
pnv_pci_ioda_dma_set_mask()
Both ppc_md and pnv_phb function pointers are used.
Rip out the ppc_md call, pnv_dma_set_mask() and pnv_pci_dma_set_mask().
Instead:
0) Call dma_set_mask() (a/p/kernel/dma.c)
1) In dma_set_mask(), the device is pci, and pci_controller_ops.dma_set_mask()
exists, so call pci_controller_ops.dma_set_mask()
2) In the ioda case, that points to pnv_pci_ioda_dma_set_mask().
The new call chain is
dma_set_mask() ->
pnv_pci_ioda_dma_set_mask()
Now only the pci_controller_ops function pointer is used.
The fallback paths for p5ioc2 are the same.
Previously, pnv_pci_dma_set_mask() would find no pnv_phb->set_dma_mask()
function, to it would call __set_dma_mask().
Now, dma_set_mask() finds no ppc_md call or pci_controller_ops call,
so it calls __set_dma_mask().
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove powernv generic PCI controller operations. Replace it with
controller ops for each of the two supported PHBs.
As an added bonus, make the two new structs const, which will help
guard against bugs such as the one introduced in 65ebf4b63
("powerpc/powernv: Move controller ops from ppc_md to controller_ops")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move the PowerNV/BML platform to use the pci_controller_ops structure
rather than ppc_md for MSI related PCI controller operations.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All users of the old opal events notifier have been converted over to
the irq domain so remove the event notifier functions.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Convert the opal dump driver to the new opal irq domain.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch converts the elog code to use the opal irq domain instead
of notifier events.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch converts the opal message event to use the new opal irq
domain.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The eeh code currently uses the old notifier method to get eeh events
from OPAL. It also contains some logic to filter opal events which has
been moved into the virtual irqchip. This patch converts the eeh code
to the new event interface which simplifies event handling.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Whenever an interrupt is received for opal the linux kernel gets a
bitfield indicating certain events that have occurred and need handling
by the various device drivers. Currently this is handled using a
notifier interface where we call every device driver that has
registered to receive opal events.
This approach has several drawbacks. For example each driver has to do
its own checking to see if the event is relevant as well as event
masking. There is also no easy method of recording the number of times
we receive particular events.
This patch solves these issues by exposing opal events via the
standard interrupt APIs by adding a new interrupt chip and
domain. Drivers can then register for the appropriate events using
standard kernel calls such as irq_of_parse_and_map().
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Most of the OPAL subsystems are always compiled in for PowerNV and
many of them need to be initialised before or after other OPAL
subsystems. Rather than trying to control this ordering through
machine initcalls it is clearer and easier to control initialisation
order with explicit calls in opal_init.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Cc: Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fastsleep is one of the idle state which cpuidle subsystem currently
uses on power8 machines. In this state L2 cache is brought down to a
threshold voltage. Therefore when the core is in fastsleep, the
communication between L2 and L3 needs to be fenced. But there is a bug
in the current power8 chips surrounding this fencing.
OPAL provides a workaround which precludes the possibility of hitting
this bug. But running with this workaround applied causes checkstop
if any correctable error in L2 cache directory is detected. Hence OPAL
also provides a way to undo the workaround.
In the existing implementation, workaround is applied by the last thread
of the core entering fastsleep and undone by the first thread waking up.
But this has a performance cost. These OPAL calls account for roughly
4000 cycles everytime the core has to enter or wakeup from fastsleep.
This patch introduces a sysfs attribute (fastsleep_workaround_applyonce)
to choose the behavior of this workaround.
By default, fastsleep_workaround_applyonce = 0. In this case, workaround
is applied/undone everytime the core enters/exits fastsleep.
fastsleep_workaround_applyonce = 1. In this case the workaround is
applied once on all the cores and never undone. This can be triggered by
echo 1 > /sys/devices/system/cpu/fastsleep_workaround_applyonce
For simplicity this attribute can be modified only once. Implying, once
fastsleep_workaround_applyonce is changed to 1, it cannot be reverted
to the default state.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a cleanup patch; doesn't change any functionality. Moves
all cpuidle related code from setup.c to a new file.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
[mpe: Fix the SMP=n build by including asm/smp.h in idle.c]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As the comment indicates, powernv_eeh_get_state() will inform EEH core to
delay 1 second. This means the delay doesn't happen when
powernv_eeh_get_state() returns.
This patch moves the delay subtraction just before msleep(), which is the
same logic in pseries_eeh_wait_state().
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To retrieve the PCI slot state, EEH driver would set a timeout for that.
While current comment is not aligned to what the code does.
This patch fixes those comments according to the code.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
OpenPower BMC machines do not place any sysparams in the device tree, so
at every boot we get a warning:
[ 0.437176] SYSPARAM: Opal sysparam node not found
Remove the warning, and reorder the init so we don't peform allocations
when there is no sysparam node in the device tree.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Load the PowerNV platform pci controller ops into pci controllers
after all the operations are loaded into the platform ops struct, not
before.
Otherwise we aren't actually setting the ops properly which can break
IO for some devices.
Fixes: 65ebf4b63 ("powerpc/powernv: Move controller ops from ppc_md to controller_ops")
Reported-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Book3S HV only (debugging aids, minor performance improvements and some
cleanups). But there are also bug fixes and small cleanups for ARM,
x86 and s390.
The task_migration_notifier revert and real fix is still pending review,
but I'll send it as soon as possible after -rc1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVOONLAAoJEL/70l94x66DbsMIAIpZPsaqgXOC1sDEiZuYay+6
rD4n4id7j8hIAzcf3AlZdyf5XgLlr6I1Zyt62s1WcoRq/CCnL7k9EljzSmw31WFX
P2y7/J0iBdkn0et+PpoNThfL2GsgTqNRCLOOQlKgEQwMP9Dlw5fnUbtC1UchOzTg
eAMeBIpYwufkWkXhdMw4PAD4lJ9WxUZ1eXHEBRzJb0o0ZxAATJ1tPZGrFJzoUOSM
WsVNTuBsNd7upT02kQdvA1TUo/OPjseTOEoksHHwfcORt6bc5qvpctL3jYfcr7sk
/L6sIhYGVNkjkuredjlKGLfT2DDJjSEdJb1k2pWrDRsY76dmottQubAE9J9cDTk=
=OAi2
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second batch of KVM changes from Paolo Bonzini:
"This mostly includes the PPC changes for 4.1, which this time cover
Book3S HV only (debugging aids, minor performance improvements and
some cleanups). But there are also bug fixes and small cleanups for
ARM, x86 and s390.
The task_migration_notifier revert and real fix is still pending
review, but I'll send it as soon as possible after -rc1"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits)
KVM: arm/arm64: check IRQ number on userland injection
KVM: arm: irqfd: fix value returned by kvm_irq_map_gsi
KVM: VMX: Preserve host CR4.MCE value while in guest mode.
KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C
KVM: PPC: Book3S HV: Streamline guest entry and exit
KVM: PPC: Book3S HV: Use bitmap of active threads rather than count
KVM: PPC: Book3S HV: Use decrementer to wake napping threads
KVM: PPC: Book3S HV: Don't wake thread with no vcpu on guest IPI
KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken
KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu
KVM: PPC: Book3S HV: Minor cleanups
KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update
KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT
KVM: PPC: Book3S HV: Add ICP real mode counters
KVM: PPC: Book3S HV: Move virtual mode ICP functions to real-mode
KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock
KVM: PPC: Book3S HV: Add guest->host real mode completion counters
KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte
...
Some PowerNV systems include a hardware random-number generator.
This HWRNG is present on POWER7+ and POWER8 chips and is capable of
generating one 64-bit random number every microsecond. The random
numbers are produced by sampling a set of 64 unstable high-frequency
oscillators and are almost completely entropic.
PAPR defines an H_RANDOM hypercall which guests can use to obtain one
64-bit random sample from the HWRNG. This adds a real-mode
implementation of the H_RANDOM hypercall. This hypercall was
implemented in real mode because the latency of reading the HWRNG is
generally small compared to the latency of a guest exit and entry for
all the threads in the same virtual core.
Userspace can detect the presence of the HWRNG and the H_RANDOM
implementation by querying the KVM_CAP_PPC_HWRNG capability. The
H_RANDOM hypercall implementation will only be invoked when the guest
does an H_RANDOM hypercall if userspace first enables the in-kernel
H_RANDOM implementation using the KVM_CAP_PPC_ENABLE_HCALL capability.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
- Numerous minor fixes, cleanups etc.
- More EEH work from Gavin to remove its dependency on device_nodes.
- Memory hotplug implemented entirely in the kernel from Nathan Fontenot.
- Removal of redundant CONFIG_PPC_OF by Kevin Hao.
- Rewrite of VPHN parsing logic & tests from Greg Kurz.
- A fix from Nish Aravamudan to reduce memory usage by clamping
nodes_possible_map.
- Support for pstore on powernv from Hari Bathini.
- Removal of old powerpc specific byte swap routines by David Gibson.
- Fix from Vasant Hegde to prevent the flash driver telling you it was flashing
your firmware when it wasn't.
- Patch from Ben Herrenschmidt to add an OPAL heartbeat driver.
- Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan Stancek.
- Some fixes for migration from Tyrel Datwyler.
- A new syscall to switch the cpu endian by Michael Ellerman.
- Large series from Wei Yang to implement SRIOV, reviewed and acked by Bjorn.
- A fix for the OPAL sensor driver from Cédric Le Goater.
- Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman.
- Large series from Daniel Axtens to make our PCI hooks per PHB rather than per
machine.
- Small patch from Sam Bobroff to explicitly abort non-suspended transactions
on syscalls, plus a test to exercise it.
- Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu.
- Small patch to enable the hard lockup detector from Anton Blanchard.
- Fix from Dave Olson for missing L2 cache information on some CPUs.
- Some fixes from Michael Ellerman to get Cell machines booting again.
- Freescale updates from Scott: Highlights include BMan device tree nodes, an
MSI erratum workaround, a couple minor performance improvements, config
updates, and misc fixes/cleanup.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVL2cxAAoJEFHr6jzI4aWAR8cP/19VTo/CzCE4ffPSx7qR464n
F+WFZcbNjIMXu6+B0YLuJZEsuWtKKrCit/MCg3+mSgE4iqvxmtI+HDD0445Buszj
UD4E4HMdPrXQ+KUSUDORvRjv/FFUXIa94LSv/0g2UeMsPz/HeZlhMxEu7AkXw9Nf
rTxsmRTsOWME85Y/c9ss7XHuWKXT3DJV7fOoK9roSaN3dJAuWTtG3WaKS0nUu0ok
0M81D6ZczoD6ybwh2DUMPD9K6SGxLdQ4OzQwtW6vWzcQIBDfy5Pdeo0iAFhGPvXf
T4LLPkv4cF4AwHsAC4rKDPHQNa+oZBoLlScrHClaebAlDiv+XYKNdMogawUObvSh
h7avKmQr0Ygp1OvvZAaXLhuDJI9FJJ8lf6AOIeULgHsDR9SyKMjZWxRzPe11uarO
Fyi0qj3oJaQu6LjazZraApu8mo+JBtQuD3z3o5GhLxeFtBBF60JXj6zAXJikufnl
kk1/BUF10nKUhtKcDX767AMUCtMH3fp5hx8K/z9T5v+pobJB26Wup1bbdT68pNBT
NjdKUppV6QTjZvCsA6U2/ECu6E9KeIaFtFSL2IRRoiI0dWBN5/5eYn3RGkO2ZFoL
1NdwKA2XJcchwTPkpSRrUG70sYH0uM2AldNYyaLfjzrQqza7Y6lF699ilxWmCN/H
OplzJAE5cQ8Am078veTW
=03Yh
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
- Numerous minor fixes, cleanups etc.
- More EEH work from Gavin to remove its dependency on device_nodes.
- Memory hotplug implemented entirely in the kernel from Nathan
Fontenot.
- Removal of redundant CONFIG_PPC_OF by Kevin Hao.
- Rewrite of VPHN parsing logic & tests from Greg Kurz.
- A fix from Nish Aravamudan to reduce memory usage by clamping
nodes_possible_map.
- Support for pstore on powernv from Hari Bathini.
- Removal of old powerpc specific byte swap routines by David Gibson.
- Fix from Vasant Hegde to prevent the flash driver telling you it was
flashing your firmware when it wasn't.
- Patch from Ben Herrenschmidt to add an OPAL heartbeat driver.
- Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan
Stancek.
- Some fixes for migration from Tyrel Datwyler.
- A new syscall to switch the cpu endian by Michael Ellerman.
- Large series from Wei Yang to implement SRIOV, reviewed and acked by
Bjorn.
- A fix for the OPAL sensor driver from Cédric Le Goater.
- Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman.
- Large series from Daniel Axtens to make our PCI hooks per PHB rather
than per machine.
- Small patch from Sam Bobroff to explicitly abort non-suspended
transactions on syscalls, plus a test to exercise it.
- Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu.
- Small patch to enable the hard lockup detector from Anton Blanchard.
- Fix from Dave Olson for missing L2 cache information on some CPUs.
- Some fixes from Michael Ellerman to get Cell machines booting again.
- Freescale updates from Scott: Highlights include BMan device tree
nodes, an MSI erratum workaround, a couple minor performance
improvements, config updates, and misc fixes/cleanup.
* tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (196 commits)
powerpc/powermac: Fix build error seen with powermac smp builds
powerpc/pseries: Fix compile of memory hotplug without CONFIG_MEMORY_HOTREMOVE
powerpc: Remove PPC32 code from pseries specific find_and_init_phbs()
powerpc/cell: Fix iommu breakage caused by controller_ops change
powerpc/eeh: Fix crash in eeh_add_device_early() on Cell
powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTH
powerpc/perf/hv-24x7: Fail 24x7 initcall if create_events_from_catalog() fails
powerpc/pseries: Correct memory hotplug locking
powerpc: Fix missing L2 cache size in /sys/devices/system/cpu
powerpc: Add ppc64 hard lockup detector support
oprofile: Disable oprofile NMI timer on ppc64
powerpc/perf/hv-24x7: Add missing put_cpu_var()
powerpc/perf/hv-24x7: Break up single_24x7_request
powerpc/perf/hv-24x7: Define update_event_count()
powerpc/perf/hv-24x7: Whitespace cleanup
powerpc/perf/hv-24x7: Define add_event_to_24x7_request()
powerpc/perf/hv-24x7: Rename hv_24x7_event_update
powerpc/perf/hv-24x7: Move debug prints to separate function
powerpc/perf/hv-24x7: Drop event_24x7_request()
powerpc/perf/hv-24x7: Use pr_devel() to log message
...
Conflicts:
tools/testing/selftests/powerpc/Makefile
tools/testing/selftests/powerpc/tm/Makefile
Use orderly_reboot so userspace will to shut itself down via the reboot
path. This is required for graceful reboot initiated by the BMC, such as
when a user uses ipmitool to issue a 'chassis power cycle' command.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge Richard's work to support SR-IOV on PowerNV. All generic PCI
patches acked by Bjorn.
Some minor conflicts with Daniel's pci_controller_ops work.
Conflicts:
arch/powerpc/include/asm/machdep.h
arch/powerpc/platforms/powernv/pci-ioda.c
Pull core locking changes from Ingo Molnar:
"Main changes:
- jump label asm preparatory work for PowerPC (Anton Blanchard)
- rwsem optimizations and cleanups (Davidlohr Bueso)
- mutex optimizations and cleanups (Jason Low)
- futex fix (Oleg Nesterov)
- remove broken atomicity checks from {READ,WRITE}_ONCE() (Peter
Zijlstra)"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
powerpc, jump_label: Include linux/jump_label.h to get HAVE_JUMP_LABEL define
jump_label: Allow jump labels to be used in assembly
jump_label: Allow asm/jump_label.h to be included in assembly
locking/mutex: Further simplify mutex_spin_on_owner()
locking: Remove atomicy checks from {READ,WRITE}_ONCE
locking/rtmutex: Rename argument in the rt_mutex_adjust_prio_chain() documentation as well
locking/rwsem: Fix lock optimistic spinning when owner is not running
locking: Remove ACCESS_ONCE() usage
locking/rwsem: Check for active lock before bailing on spinning
locking/rwsem: Avoid deceiving lock spinners
locking/rwsem: Set lock ownership ASAP
locking/rwsem: Document barrier need when waking tasks
locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreads
locking/mutex: Refactor mutex_spin_on_owner()
locking/mutex: In mutex_spin_on_owner(), return true when owner changes
This change adds the OPAL interface definitions to allow Linux to read,
write and erase from system flash devices. We register platform devices
for the flash devices exported by firmware.
We clash with the existing opal_flash_init function, which is really for
the FSP flash update functionality, so we rename that initcall to
opal_flash_update_init().
A future change will add an mtd driver that uses this interface.
Changes from Joel Stanley and Jeremy Kerr.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the PowerNV platform to use the pci_controller_ops
structure rather than ppc_md for PCI controller operations.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pcibios_enable_device_hook returned an int. Every implementation
returned either -EINVAL or 0. The return value wasn't propagated by
the caller: any non-zero return value caused pcibios_enable_device
to return -EINVAL itself. Therefore, make the hook return a bool.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The powernv code has some conditional support for running on bare metal
machines that have no OPAL firmware, but provide RTAS.
No released machines ever supported that, and even in the lab it was
just a transitional hack in the days when OPAL was still being
developed.
So remove the code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Currently, when a sensor value is read, the kernel calls OPAL, which in
turn builds a message for the FSP, and waits for a message back.
The new device tree for OPAL sensors [1] adds new sensors that can be
read synchronously (core temperatures for instance) and that don't need
to wait for a response.
This patch modifies the opal call to accept an OPAL_SUCCESS return value
and cover the case above.
[1] https://lists.ozlabs.org/pipermail/skiboot/2015-March/000639.html
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
OPAL has its own list of return codes. The patch provides a translation
of such codes in errnos for the opal_sensor_read call, and possibly
others if needed.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If M64 has been supported, the prefetchable 64-bits memory resources
shouldn't be mapped to the corresponding PE# via M32DT. Unfortunately,
we're doing that in pnv_ioda_setup_pe_seg() wrongly. The issue was
introduced by commit 262af55 ("powerpc/powernv: Enable M64 aperatus
for PHB3"). The patch fixes the issue by simply skipping M64 resources
when updating to M32DT.
Cc: <stable@vger.kernel.org> # v3.17+
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In struct pci_dn, the pcidev field is assigned but not used, so remove it.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When IOV BAR is big, each is covered by 4 M64 windows. This leads to
several VF PE sits in one PE in terms of M64.
Group VF PEs according to the M64 allocation.
[bhelgaas: use dev_printk() when possible]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
M64 aperture size is limited on PHB3. When the IOV BAR is too big, this
will exceed the limitation and failed to be assigned.
Introduce a different mechanism based on the IOV BAR size:
- if IOV BAR size is smaller than 64MB, expand to total_pe
- if IOV BAR size is bigger than 64MB, roundup power2
[bhelgaas: make dev_printk() output more consistent, use PCI_SRIOV_NUM_BARS]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On PowerNV platform, resource position in M64 BAR implies the PE# the
resource belongs to. In some cases, adjustment of a resource is necessary
to locate it to a correct position in M64 BAR .
This patch adds pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR
address according to an offset.
Note:
After doing so, there would be a "hole" in the /proc/iomem when offset
is a positive value. It looks like the device return some mmio back to
the system, which actually no one could use it.
[bhelgaas: rework loops, rework overlap check, index resource[]
conventionally, remove pci_regs.h include, squashed with next patch]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Implement pcibios_iov_resource_alignment() on powernv platform.
On PowerNV platform, there are 3 cases for the IOV BAR:
1. initial state, the IOV BAR size is multiple times of VF BAR size
2. after expanded, the IOV BAR size is expanded to meet the M64 segment size
3. sizing stage, the IOV BAR is truncated to 0
pnv_pci_iov_resource_alignment() handle these three cases respectively.
[bhelgaas: adjust to drop "align" parameter, return pci_iov_resource_size()
if no ppc_md machdep_call version]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On PHB3, PF IOV BAR will be covered by M64 BAR to have better PE isolation.
M64 BAR is a type of hardware resource in PHB3, which could map a range of
MMIO to PE numbers on powernv platform. And this range is divided equally
by the number of total_pe with each divided range mapping to a PE number.
Also, the M64 BAR must map a MMIO range with power-of-two size.
The total_pe number is usually different from total_VFs, which can lead to
a conflict between MMIO space and the PE number.
For example, if total_VFs is 128 and total_pe is 256, the second half of
M64 BAR will be part of other PCI device, which may already belong to other
PEs.
This patch prevents the conflict by reserving additional space for the PF
IOV BAR, which is total_pe number of VF's BAR size.
[bhelgaas: make dev_printk() output more consistent, index resource[]
conventionally]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Previously the iommu_table had the same lifetime as a struct pnv_ioda_pe
and was embedded in it. The pnv_ioda_pe was assigned to a PE on the bootup
stage. Since PEs are based on the hardware layout which is static in the
system, they will never get released. This means the iommu_table in the
pnv_ioda_pe will never get released either.
This no longer works for VF PE. VF PEs are created and released dynamically
when VFs are created and released. So we need to assign pnv_ioda_pe to VF
PEs respectively when VFs are enabled and clean up those resources for VF
PE when VFs are disabled. And iommu_table is one of the resources we need
to handle dynamically.
Current iommu_table is a static field in pnv_ioda_pe, which will face a
problem when freeing it. During the disabling of a VF,
pnv_pci_ioda2_release_dma_pe will call iommu_free_table to release the
iommu_table for this PE. A static iommu_table will fail in
iommu_free_table.
According to these requirement, this patch allocates iommu_table
dynamically.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
pci_dn is the extension of PCI device node and is created from device node.
Unfortunately, VFs are enabled dynamically by PF's driver and they don't
have corresponding device nodes and pci_dn, which is required to access
VFs' config spaces.
The patch creates pci_dn for VFs in pcibios_sriov_enable() on their PF,
and removes pci_dn for VFs in pcibios_sriov_disable() on their PF. When
VF's pci_dn is created, it's put to the child list of the pci_dn of PF's
upstream bridge. The pci_dn is linked to pci_dev during early fixup time
to setup the fast path.
[bhelgaas: add ifdef around add_one_dev_pci_info(), use dev_printk()]
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We currently read the information about idle states from the device
tree, so as to find out the CPU idle states supported by the platform.
Use the of_property_read/count_xxx() APIs, which handle endian
conversions for us, and mean we don't need any endian annotations in the
code.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Provide an unregister interface for the opal message notifiers
to be called when not needed like during driver unload/remove.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fixes the condition check of incoming message type which can
otherwise shoot beyond the message notifiers head array.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If OPAL requests it, call it back via opal_poll_events() at a
regular interval. Some versions of OPAL on some machines require
this to operate some internal timeouts properly.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Present code checks for update_flash_data in opal_flash_term_callback().
update_flash_data has been statically initialized to zero, and that
is the value of FLASH_IMG_READY. Also code update initialization happens
during subsys init.
So if reboot is issued before the subsys init stage then we endup displaying
"Flashing new firmware" message.. which may confuse end user.
This patch fixes above described issue by initializes update_flash status
to invalid state.
Reported-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There are 3 EEH operations whose arguments contain device_node:
read_config(), write_config() and restore_config(). The patch
replaces device_node with pci_dn.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>