Commit Graph

5390 Commits

Author SHA1 Message Date
Arnd Bergmann
75d319c06e y2038: syscalls: change remaining timeval to __kernel_old_timeval
All of the remaining syscalls that pass a timeval (gettimeofday, utime,
futimesat) can trivially be changed to pass a __kernel_old_timeval
instead, which has a compatible layout, but avoids ambiguity with
the timeval type in user space.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
1bf883c1a9 y2038: stat: avoid 'time_t' in 'struct stat'
The time_t definition may differ between user space and kernel space,
so replace time_t with an unambiguous 'long' for the mips and sparc.

The same structures also contain 'off_t', which has the same problem,
so replace that as well on those two architectures and powerpc.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:28 +01:00
Arnd Bergmann
caf5e32d4e y2038: ipc: remove __kernel_time_t reference from headers
There are two structures based on time_t that conflict between libc and
kernel: timeval and timespec. Both are now renamed to __kernel_old_timeval
and __kernel_old_timespec.

For time_t, the old typedef is still __kernel_time_t. There is nothing
wrong with that name, but it would be nice to not use that going forward
as this type is used almost only in deprecated interfaces because of
the y2038 overflow.

In the IPC headers (msgbuf.h, sembuf.h, shmbuf.h), __kernel_time_t is only
used for the 64-bit variants, which are not deprecated.

Change these to a plain 'long', which is the same type as __kernel_time_t
on all 64-bit architectures anyway, to reduce the number of users of the
old type.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:28 +01:00
Arnd Bergmann
176ed98c8a y2038: vdso: powerpc: avoid timespec references
As a preparation to stop using 'struct timespec' in the kernel,
change the powerpc vdso implementation:

- split up the vdso data definition to have equivalent members
   for seconds and nanoseconds instead of an xtime structure

- use timespec64 as an intermediate for the xtime update

- change the asm-offsets definition to be based the appropriate
  fixed-length types

This is only a temporary fix for changing the types, in order
to actually support a 64-bit safe vdso32 version of clock_gettime(),
the entire powerpc vdso should be replaced with the generic
lib/vdso/ implementation. If that happens first, this patch
becomes obsolete.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:28 +01:00
Michael Ellerman
3df191118b Merge branch 'topic/kaslr-book3e32' into next
This is a slight rebase of Scott's next branch, which contained the
KASLR support for book3e 32-bit, to squash in a couple of small fixes.

See the	original pull request:
  https://lore.kernel.org/r/20191022232155.GA26174@home.buserror.net
2019-11-14 19:23:33 +11:00
Michael Ellerman
af2e8c68b9 KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernel
On some systems that are vulnerable to Spectre v2, it is up to
software to flush the link stack (return address stack), in order to
protect against Spectre-RSB.

When exiting from a guest we do some house keeping and then
potentially exit to C code which is several stack frames deep in the
host kernel. We will then execute a series of returns without
preceeding calls, opening up the possiblity that the guest could have
poisoned the link stack, and direct speculative execution of the host
to a gadget of some sort.

To prevent this we add a flush of the link stack on exit from a guest.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-14 15:37:59 +11:00
Michael Ellerman
39e72bf96f powerpc/book3s64: Fix link stack flush on context switch
In commit ee13cb249f ("powerpc/64s: Add support for software count
cache flush"), I added support for software to flush the count
cache (indirect branch cache) on context switch if firmware told us
that was the required mitigation for Spectre v2.

As part of that code we also added a software flush of the link
stack (return address stack), which protects against Spectre-RSB
between user processes.

That is all correct for CPUs that activate that mitigation, which is
currently Power9 Nimbus DD2.3.

What I got wrong is that on older CPUs, where firmware has disabled
the count cache, we also need to flush the link stack on context
switch.

To fix it we create a new feature bit which is not set by firmware,
which tells us we need to flush the link stack. We set that when
firmware tells us that either of the existing Spectre v2 mitigations
are enabled.

Then we adjust the patching code so that if we see that feature bit we
enable the link stack flush. If we're also told to flush the count
cache in software then we fall through and do that also.

On the older CPUs we don't need to do do the software count cache
flush, firmware has disabled it, so in that case we patch in an early
return after the link stack flush.

The naming of some of the functions is awkward after this patch,
because they're called "count cache" but they also do link stack. But
we'll fix that up in a later commit to ease backporting.

This is the fix for CVE-2019-18660.

Reported-by: Anthony Steinhauser <asteinhauser@google.com>
Fixes: ee13cb249f ("powerpc/64s: Add support for software count cache flush")
Cc: stable@vger.kernel.org # v4.4+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-14 15:37:52 +11:00
Jason Yan
921a79b780 powerpc/fsl_booke/kaslr: dump out kernel offset information on panic
When kaslr is enabled, the kernel offset is different for every boot.
This brings some difficult to debug the kernel. Dump out the kernel
offset when panic so that we can easily debug the kernel.

This code is derived from x86/arm64 which has similar functionality.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:51 +11:00
Jason Yan
2b0e86cc5d powerpc/fsl_booke/32: implement KASLR infrastructure
This patch add support to boot kernel from places other than KERNELBASE.
Since CONFIG_RELOCATABLE has already supported, what we need to do is
map or copy kernel to a proper place and relocate. Freescale Book-E
parts expect lowmem to be mapped by fixed TLB entries(TLB1). The TLB1
entries are not suitable to map the kernel directly in a randomized
region, so we chose to copy the kernel to a proper place and restart to
relocate.

The offset of the kernel was not randomized yet(a fixed 64M is set). We
will randomize it in the next patch.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
[mpe: Use PTRRELOC() in early_init()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:40 +11:00
Jason Yan
39f4b7bf75 powerpc: introduce kernstart_virt_addr to store the kernel base
Now the kernel base is a fixed value - KERNELBASE. To support KASLR, we
need a variable to store the kernel base.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:32 +11:00
Jason Yan
8054df0570 powerpc: unify definition of M_IF_NEEDED
M_IF_NEEDED is defined too many times. Move it to a common place and
rename it to MAS2_M_IF_NEEDED which is much readable.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:24 +11:00
Christoph Hellwig
f5817191b0 powerpc: use <asm-generic/dma-mapping.h>
The powerpc version of dma-mapping.h only contains a version of
get_arch_dma_ops that always return NULL.  Replace it with the
asm-generic version that does the same.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190807150752.17894-1-hch@lst.de
2019-11-13 16:58:10 +11:00
Thomas Huth
bbbd7f112c powerpc: Replace GPL boilerplate with SPDX identifiers
The FSF does not reside in "675 Mass Ave, Cambridge" anymore...
let's simply use proper SPDX identifiers instead.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190828060737.32531-1-thuth@redhat.com
2019-11-13 16:58:07 +11:00
Ravi Bangoria
c3f68b0478 powerpc/watchpoint: Fix ptrace code that muck around with address/len
ptrace_set_debugreg() does not consider new length while overwriting
the watchpoint. Fix that. ppc_set_hwdebug() aligns watchpoint address
to doubleword boundary but does not change the length. If address
range is crossing doubleword boundary and length is less then 8, we
will lose samples from second doubleword. So fix that as well.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191017093204.7511-4-ravi.bangoria@linux.ibm.com
2019-11-13 16:58:03 +11:00
Ravi Bangoria
b57aeab811 powerpc/watchpoint: Fix length calculation for unaligned target
Watchpoint match range is always doubleword(8 bytes) aligned on
powerpc. If the given range is crossing doubleword boundary, we need
to increase the length such that next doubleword also get
covered. Ex,

          address   len = 6 bytes
                |=========.
   |------------v--|------v--------|
   | | | | | | | | | | | | | | | | |
   |---------------|---------------|
    <---8 bytes--->

In such case, current code configures hw as:
  start_addr = address & ~HW_BREAKPOINT_ALIGN
  len = 8 bytes

And thus read/write in last 4 bytes of the given range is ignored.
Fix this by including next doubleword in the length.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191017093204.7511-3-ravi.bangoria@linux.ibm.com
2019-11-13 16:58:03 +11:00
Ravi Bangoria
b811be615c powerpc/watchpoint: Introduce macros for watchpoint length
We are hadrcoding length everywhere in the watchpoint code. Introduce
macros for the length and use them.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191017093204.7511-2-ravi.bangoria@linux.ibm.com
2019-11-13 16:58:02 +11:00
Michael Ellerman
d34a5709be Merge branch 'topic/secureboot' into next
Merge the secureboot support, as well as the IMA changes needed to
support it.

From Nayna's cover letter:
  In order to verify the OS kernel on PowerNV systems, secure boot
  requires X.509 certificates trusted by the platform. These are
  stored in secure variables controlled by OPAL, called OPAL secure
  variables. In order to enable users to manage the keys, the secure
  variables need to be exposed to userspace.

  OPAL provides the runtime services for the kernel to be able to
  access the secure variables. This patchset defines the kernel
  interface for the OPAL APIs. These APIs are used by the hooks, which
  load these variables to the keyring and expose them to the userspace
  for reading/writing.

  Overall, this patchset adds the following support:
    * expose secure variables to the kernel via OPAL Runtime API interface
    * expose secure variables to the userspace via kernel sysfs interface
    * load kernel verification and revocation keys to .platform and
      .blacklist keyring respectively.

  The secure variables can be read/written using simple linux
  utilities cat/hexdump.

  For example:
  Path to the secure variables is: /sys/firmware/secvar/vars

    Each secure variable is listed as directory.
    $ ls -l
    total 0
    drwxr-xr-x. 2 root root 0 Aug 20 21:20 db
    drwxr-xr-x. 2 root root 0 Aug 20 21:20 KEK
    drwxr-xr-x. 2 root root 0 Aug 20 21:20 PK

  The attributes of each of the secure variables are (for example: PK):
    $ ls -l
    total 0
    -r--r--r--. 1 root root  4096 Oct  1 15:10 data
    -r--r--r--. 1 root root 65536 Oct  1 15:10 size
    --w-------. 1 root root  4096 Oct  1 15:12 update

  The "data" is used to read the existing variable value using
  hexdump. The data is stored in ESL format. The "update" is used to
  write a new value using cat. The update is to be submitted as AUTH
  file.
2019-11-13 16:55:50 +11:00
Nayna Jain
9155e2341a powerpc/powernv: Add OPAL API interface to access secure variable
The X.509 certificates trusted by the platform and required to secure
boot the OS kernel are wrapped in secure variables, which are
controlled by OPAL.

This patch adds firmware/kernel interface to read and write OPAL
secure variables based on the unique key.

This support can be enabled using CONFIG_OPAL_SECVAR.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
Signed-off-by: Eric Richter <erichte@linux.ibm.com>
[mpe: Make secvar_ops __ro_after_init, only build opal-secvar.c if PPC_SECURE_BOOT=y]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1573441836-3632-2-git-send-email-nayna@linux.ibm.com
2019-11-13 00:33:22 +11:00
Nayna Jain
2702809a4a powerpc: Detect the trusted boot state of the system
While secure boot permits only properly verified signed kernels to be
booted, trusted boot calculates the file hash of the kernel image and
stores the measurement prior to boot, that can be subsequently
compared against good known values via attestation services.

This patch reads the trusted boot state of a PowerNV system. The state
is used to conditionally enable additional measurement rules in the
IMA arch-specific policies.

Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
Signed-off-by: Eric Richter <erichte@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/e9eeee6b-b9bf-1e41-2954-61dbd6fbfbcf@linux.ibm.com
2019-11-12 12:25:49 +11:00
Nayna Jain
1a8916ee3a powerpc: Detect the secure boot mode of the system
This patch defines a function to detect the secure boot state of a
PowerNV system.

The PPC_SECURE_BOOT config represents the base enablement of secure
boot for powerpc.

Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
Signed-off-by: Eric Richter <erichte@linux.ibm.com>
[mpe: Fold in change from Nayna to add "ibm,secureboot" to ids]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/46b003b9-3225-6bf7-9101-ed6580bb748c@linux.ibm.com
2019-11-12 12:25:02 +11:00
Alastair D'Silva
23eb7f560a powerpc: Convert flush_icache_range & friends to C
Similar to commit 22e9c88d48
("powerpc/64: reuse PPC32 static inline flush_dcache_range()")
this patch converts the following ASM symbols to C:
    flush_icache_range()
    __flush_dcache_icache()
    __flush_dcache_icache_phys()

This was done as we discovered a long-standing bug where the length of the
range was truncated due to using a 32 bit shift instead of a 64 bit one.

By converting these functions to C, it becomes easier to maintain.

flush_dcache_icache_phys() retains a critical assembler section as we must
ensure there are no memory accesses while the data MMU is disabled
(authored by Christophe Leroy). Since this has no external callers, it has
also been made static, allowing the compiler to inline it within
flush_dcache_icache_page().

Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[mpe: Minor fixups, don't export __flush_dcache_icache()]
Link: https://lore.kernel.org/r/20191104023305.9581-5-alastair@au1.ibm.com
2019-11-07 23:35:37 +11:00
Alastair D'Silva
7a0745c5e0 powerpc: define helpers to get L1 icache sizes
This patch adds helpers to retrieve icache sizes, and renames the existing
helpers to make it clear that they are for dcache.

Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191104023305.9581-4-alastair@au1.ibm.com
2019-11-07 22:48:34 +11:00
Daniel Axtens
5bece3d661 powerpc: support KASAN instrumentation of bitops
The powerpc-specific bitops are not being picked up by the KASAN
test suite.

Instrumentation is done via the bitops/instrumented-{atomic,lock}.h
headers. They require that arch-specific versions of bitop functions
are renamed to arch_*. Do this renaming.

For clear_bit_unlock_is_negative_byte, the current implementation
uses the PG_waiters constant. This works because it's a preprocessor
macro - so it's only actually evaluated in contexts where PG_waiters
is defined. With instrumentation however, it becomes a static inline
function, and all of a sudden we need the actual value of PG_waiters.
Because of the order of header includes, it's not available and we
fail to compile. Instead, manually specify that we care about bit 7.
This is still correct: bit 7 is the bit that would mark a negative
byte.

While we're at it, replace __inline__ with inline across the file.

Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820024941.12640-2-dja@axtens.net
2019-11-07 13:15:40 +11:00
Geert Uytterhoeven
3b05a1e517 powerpc/security: Fix debugfs data leak on 32-bit
"powerpc_security_features" is "unsigned long", i.e. 32-bit or 64-bit,
depending on the platform (PPC_FSL_BOOK3E or PPC_BOOK3S_64).  Hence
casting its address to "u64 *", and calling debugfs_create_x64() is
wrong, and leaks 32-bit of nearby data to userspace on 32-bit platforms.

While all currently defined SEC_FTR_* security feature flags fit in
32-bit, they all have "ULL" suffixes to make them 64-bit constants.
Hence fix the leak by changing the type of "powerpc_security_features"
(and the parameter types of its accessors) to "u64".  This also allows
to drop the cast.

Fixes: 398af57112 ("powerpc/security: Show powerpc_security_features in debugfs")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191021142309.28105-1-geert+renesas@glider.be
2019-11-05 22:29:27 +11:00
Aneesh Kumar K.V
52162ec784 powerpc/mm/book3s64/radix: Use freed_tables instead of need_flush_all
With commit 22a61c3c4f ("asm-generic/tlb: Track freeing of
page-table directories in struct mmu_gather") we now track whether we
freed page table in mmu_gather. Use that to decide whether to flush
Page Walk Cache.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191024075801.22434-2-aneesh.kumar@linux.ibm.com
2019-11-05 22:23:55 +11:00
Nicolas Saenz Julienne
8b5369ea58 dma/direct: turn ARCH_ZONE_DMA_BITS into a variable
Some architectures, notably ARM, are interested in tweaking this
depending on their runtime DMA addressing limitations.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-11-01 09:41:18 +00:00
Thiago Jung Bauermann
05d9a95283 powerpc/prom_init: Undo relocation before entering secure mode
The ultravisor will do an integrity check of the kernel image but we
relocated it so the check will fail. Restore the original image by
relocating it back to the kernel virtual base address.

This works because during build vmlinux is linked with an expected
virtual runtime address of KERNELBASE.

Fixes: 6a9c930bd7 ("powerpc/prom_init: Add the ESM call to prom_init")
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Tested-by: Michael Anderson <andmike@linux.ibm.com>
[mpe: Add IS_ENABLED() to fix the CONFIG_RELOCATABLE=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911163433.12822-1-bauerman@linux.ibm.com
2019-10-29 15:12:17 +11:00
Nicholas Piggin
87a45e07a5 KVM: PPC: Book3S: Replace reset_msr mmu op with inject_interrupt arch op
reset_msr sets the MSR for interrupt injection, but it's cleaner and
more flexible to provide a single op to set both MSR and PC for the
interrupt.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-10-22 16:29:02 +11:00
Nicholas Piggin
9ee6471eb9 KVM: PPC: Book3S: Define and use SRR1_MSR_BITS
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-10-22 16:29:02 +11:00
Greg Kurz
efe5ddcae4 KVM: PPC: Book3S HV: XIVE: Allow userspace to set the # of VPs
Add a new attribute to both XIVE and XICS-on-XIVE KVM devices so that
userspace can tell how many interrupt servers it needs. If a VM needs
less than the current default of KVM_MAX_VCPUS (2048), we can allocate
less VPs in OPAL. Combined with a core stride (VSMT) that matches the
number of guest threads per core, this may substantially increases the
number of VMs that can run concurrently with an in-kernel XIVE device.

Since the legacy XIVE KVM device is exposed to userspace through the
XICS KVM API, a new attribute group is added to it for this purpose.
While here, fix the syntax of the existing KVM_DEV_XICS_GRP_SOURCES
in the XICS documentation.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-10-22 16:29:02 +11:00
Christophe Leroy
d10f60ae27 powerpc/32s: fix allow/prevent_user_access() when crossing segment boundaries.
Make sure starting addr is aligned to segment boundary so that when
incrementing the segment, the starting address of the new segment is
below the end address. Otherwise the last segment might get  missed.

Fixes: a68c31fc01 ("powerpc/32s: Implement Kernel Userspace Access Protection")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/067a1b09f15f421d40797c2d04c22d4049a1cee8.1571071875.git.christophe.leroy@c-s.fr
2019-10-17 08:57:43 +11:00
Stephen Rothwell
18217da361 powerpc/64s/radix: Fix build failure with RADIX_MMU=n
After merging the powerpc tree, today's linux-next build (powerpc64
allnoconfig) failed like this:

 arch/powerpc/mm/book3s64/pgtable.c:216:3:
 error: implicit declaration of function 'radix__flush_all_lpid_guest'

radix__flush_all_lpid_guest() is only declared for
CONFIG_PPC_RADIX_MMU which is not set for this build.

Fix it by adding an empty version for the RADIX_MMU=n case, which
should never be called.

Fixes: 99161de3a2 ("powerpc/64s/radix: tidy up TLB flushing code")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[mpe: Munge change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190930101342.36c1afa0@canb.auug.org.au
2019-10-09 17:16:58 +11:00
Linus Torvalds
a3c0e7b1fe libnvdimm fixes v5.4-rc1
- Complete the reworks to interoperate with powerpc dynamic huge page sizes
 
 - Fix a crash due to missed accounting for the powerpc 'struct
   page'-memmap mapping granularity.
 
 - Fix badblock initialization for volatile (DRAM emulated) pmem ranges.
 
 - Stop triggering request_key() notifications to userspace when
   NVDIMM-security is disabled / not present.
 
 - Miscellaneous small fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdkAprAAoJEB7SkWpmfYgCjXoQAIwJE1VzNP1V+ARxfs1rTGVz
 pbNJiBnj4gxDaCkcKoatiadRkytUxeUNEcPslEKsfoNinXYqkpjMQoWm2VpILOMU
 nY+SvIudGRnuesq2/Y+CP8zrX6rV4eBDfHK05RN/Zp1IlW7pTDItUx8mJ7glmDwG
 PW0vkvK7yZ+dRFnpQ7QFjhA0Q3oudO5YcTVBDK5YYtDGlv69xfXqc9LW8SszJ1kU
 rhCIT1kdoL5of0TIgG5pTfmggPSQ9y1xPsKjllOHNa3m50eGOkkQLELOVzQb1frW
 cjAsPLjRDSzvdHHSLyu0Is04Q5JU2CucxHl2SXGHiOt5tigH8dk5XFxWt0Pc8EXx
 acYYiBqUXC3MomSYWeLK4BdO2cRTqcPPXgJYAqXblqr+/0ys+rFepjw+j8JkiLZa
 5UCC30l1GXEpw9u6gdCMqvvHN2gHvDB0BV82Sx8wTewJpeL18wCUJoKVuFmpsHko
 p1cCe7St1TzcK3eO+xfeW1rxNrcXUpKVYXVa/WOJW0vwErqAZ6YCdNuyJHocZzXn
 vNyIQmVDOlubsgBAI2ExxeZO6xc8UIwLhLg7XEJ0mg3k6UXA8HZxH2B2THJk1BSF
 RppodkYiMknh11sqgpGp+Hz5XSEg/jvmCdL/qRDGAwhsFhFaxDH37Kg4Qncj2/dg
 uDvDHXNCjbGpzCo3tyNx
 =Z6Fa
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

More libnvdimm updates from Dan Williams:

 - Complete the reworks to interoperate with powerpc dynamic huge page
   sizes

 - Fix a crash due to missed accounting for the powerpc 'struct
   page'-memmap mapping granularity

 - Fix badblock initialization for volatile (DRAM emulated) pmem ranges

 - Stop triggering request_key() notifications to userspace when
   NVDIMM-security is disabled / not present

 - Miscellaneous small fixups

* tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm/region: Enable MAP_SYNC for volatile regions
  libnvdimm: prevent nvdimm from requesting key when security is disabled
  libnvdimm/region: Initialize bad block for volatile namespaces
  libnvdimm/nfit_test: Fix acpi_handle redefinition
  libnvdimm/altmap: Track namespace boundaries in altmap
  libnvdimm: Fix endian conversion issues 
  libnvdimm/dax: Pick the right alignment default when creating dax devices
  powerpc/book3s64: Export has_transparent_hugepage() related functions.
2019-09-29 10:33:41 -07:00
Linus Torvalds
a2953204b5 powerpc fixes for 5.4 #2
An assortment of fixes that were either missed by me, or didn't arrive quite in
 time for the first v5.4 pull.
 
 Most notable is a fix for an issue with tlbie (broadcast TLB invalidation) on
 Power9, when using the Radix MMU. The tlbie can race with an mtpid (move to PID
 register, essentially MMU context switch) on another thread of the core, which
 can cause stores to continue to go to a page after it's unmapped.
 
 A fix in our KVM code to add a missing barrier, the lack of which has been
 observed to cause missed IPIs and subsequently stuck CPUs in the host.
 
 A change to the way we initialise PCR (Processor Compatibility Register) to make
 it forward compatible with future CPUs.
 
 On some older PowerVM systems our H_BLOCK_REMOVE support could oops, fix it to
 detect such systems and fallback to the old invalidation method.
 
 A fix for an oops seen on some machines when using KASAN on 32-bit.
 
 A handful of other minor fixes, and two new selftests.
 
 Thanks to:
   Alistair Popple, Aneesh Kumar K.V, Christophe Leroy, Gustavo Romero, Joel
   Stanley, Jordan Niethe, Laurent Dufour, Michael Roth, Oliver O'Halloran.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2PRGITHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgClRD/9jKIT6GjVpRMc+Dg9zHB5/Pir7gePk
 ztXKI+u15GrrXgjtWEZ1PaaXvtNIfs/IZHDQm5gyJjiBAKcGl2v+9ETaMzO5sjZ7
 GSe1F8VX/MwzRnET8Jph8w/b0cy0Q8xndkEOjcJqJ7+TF+SSWqmJEdmBfkU23jWD
 B3kW4W1x2xt/XGsX25l1HpUgpcJqzukCeYUSCdqUu2j+sXAEZmfgTRG8uD4HffzZ
 3As76TrBiJsDnkyH0qi2G1BuLXrQbAMdjTeSGi+cb0gTIunCr190gI4+Tjdu2/z7
 ywWR2ZUkueCNDcLsqXaqpZx50utPJ44//uY750sk72vixJJOVuOWM6+5HKVW83se
 /v0zkOcI9+ywNHe0vLfP3Jm/OMMHYxkIwz6kVu2NSR6sE79B9AZpBFU+Nynq7kKl
 +Hc6md/HATvR+NK6LtQKtEGydRhvxU5n3KBmjq3SQj+B/ZlU6IdgerfhUWrNvg0B
 zzHeT35X6UBpswonhkQLgqJuaWpkClK9wsUy85MuA7aub1EP8S6/X7paKoiOtAHK
 NjlXM2JYV5OKwhjGgdCiI94Bdune7yudKPdsXV3Gr8Iw7wf2bQk1p7VH+LcruyE9
 YJdXwCgN0PaoFUQh3AR4CqzzFwqDya8FQqdkFN3kqhRLVGAMq/PsV8/Tn+myTgQP
 rZnWnbfZh9BMjw==
 =dF42
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "An assortment of fixes that were either missed by me, or didn't arrive
  quite in time for the first v5.4 pull.

   - Most notable is a fix for an issue with tlbie (broadcast TLB
     invalidation) on Power9, when using the Radix MMU. The tlbie can
     race with an mtpid (move to PID register, essentially MMU context
     switch) on another thread of the core, which can cause stores to
     continue to go to a page after it's unmapped.

   - A fix in our KVM code to add a missing barrier, the lack of which
     has been observed to cause missed IPIs and subsequently stuck CPUs
     in the host.

   - A change to the way we initialise PCR (Processor Compatibility
     Register) to make it forward compatible with future CPUs.

   - On some older PowerVM systems our H_BLOCK_REMOVE support could
     oops, fix it to detect such systems and fallback to the old
     invalidation method.

   - A fix for an oops seen on some machines when using KASAN on 32-bit.

   - A handful of other minor fixes, and two new selftests.

  Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy,
  Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael
  Roth, Oliver O'Halloran"

* tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/eeh: Fix eeh eeh_debugfs_break_device() with SRIOV devices
  powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error
  powerpc/nvdimm: Use HCALL error as the return value
  selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue
  powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
  powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag
  powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions
  powerpc/pseries: Call H_BLOCK_REMOVE when supported
  powerpc/pseries: Read TLB Block Invalidate Characteristics
  KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
  powerpc/mm: Fix an Oops in kasan_mmu_init()
  powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY
  powerpc/64s: Set reserved PCR bits
  powerpc: Fix definition of PCR bits to work with old binutils
  powerpc/book3s64/radix: Remove WARN_ON in destroy_context()
  powerpc/tm: Add tm-poison test
2019-09-28 13:43:00 -07:00
Linus Torvalds
9c9fa97a8e Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few hot fixes

 - ocfs2 updates

 - almost all of -mm (slab-generic, slab, slub, kmemleak, kasan,
   cleanups, debug, pagecache, memcg, gup, pagemap, memory-hotplug,
   sparsemem, vmalloc, initialization, z3fold, compaction, mempolicy,
   oom-kill, hugetlb, migration, thp, mmap, madvise, shmem, zswap,
   zsmalloc)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  mm/zsmalloc.c: fix a -Wunused-function warning
  zswap: do not map same object twice
  zswap: use movable memory if zpool support allocate movable memory
  zpool: add malloc_support_movable to zpool_driver
  shmem: fix obsolete comment in shmem_getpage_gfp()
  mm/madvise: reduce code duplication in error handling paths
  mm: mmap: increase sockets maximum memory size pgoff for 32bits
  mm/mmap.c: refine find_vma_prev() with rb_last()
  riscv: make mmap allocation top-down by default
  mips: use generic mmap top-down layout and brk randomization
  mips: replace arch specific way to determine 32bit task with generic version
  mips: adjust brk randomization offset to fit generic version
  mips: use STACK_TOP when computing mmap base address
  mips: properly account for stack randomization and stack guard gap
  arm: use generic mmap top-down layout and brk randomization
  arm: use STACK_TOP when computing mmap base address
  arm: properly account for stack randomization and stack guard gap
  arm64, mm: make randomization selected by generic topdown mmap layout
  arm64, mm: move generic mmap layout functions to mm
  arm64: consider stack randomization for mmap base only when necessary
  ...
2019-09-24 16:10:23 -07:00
Mike Rapoport
782de70c42 mm: consolidate pgtable_cache_init() and pgd_cache_init()
Both pgtable_cache_init() and pgd_cache_init() are used to initialize kmem
cache for page table allocations on several architectures that do not use
PAGE_SIZE tables for one or more levels of the page table hierarchy.

Most architectures do not implement these functions and use __weak default
NOP implementation of pgd_cache_init().  Since there is no such default
for pgtable_cache_init(), its empty stub is duplicated among most
architectures.

Rename the definitions of pgd_cache_init() to pgtable_cache_init() and
drop empty stubs of pgtable_cache_init().

Link: http://lkml.kernel.org/r/1566457046-22637-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Will Deacon <will@kernel.org>		[arm64]
Acked-by: Thomas Gleixner <tglx@linutronix.de>	[x86]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Nicholas Piggin
13224794cb mm: remove quicklist page table caches
Patch series "mm: remove quicklist page table caches".

A while ago Nicholas proposed to remove quicklist page table caches [1].

I've rebased his patch on the curren upstream and switched ia64 and sh to
use generic versions of PTE allocation.

[1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

This patch (of 3):

Remove page table allocator "quicklists".  These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.

The numbers in the initial commit look interesting but probably don't
apply anymore.  If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.

Also it might be better to instead make more general improvements to page
allocator if this is still so slow.

Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Aneesh Kumar K.V
a6f197f889 powerpc/book3s64: Export has_transparent_hugepage() related functions.
In later patch, we want to use hash_transparent_hugepage() in a kernel module.
Export two related functions.

Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/20190924042440.27946-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24 10:22:29 -07:00
Aneesh Kumar K.V
047e6575ae powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
On POWER9, under some circumstances, a broadcast TLB invalidation will
fail to invalidate the ERAT cache on some threads when there are
parallel mtpidr/mtlpidr happening on other threads of the same core.
This can cause stores to continue to go to a page after it's unmapped.

The workaround is to force an ERAT flush using PID=0 or LPID=0 tlbie
flush. This additional TLB flush will cause the ERAT cache
invalidation. Since we are using PID=0 or LPID=0, we don't get
filtered out by the TLB snoop filtering logic.

We need to still follow this up with another tlbie to take care of
store vs tlbie ordering issue explained in commit:
a5d4b5891c ("powerpc/mm: Fixup tlbie vs store ordering issue on
POWER9"). The presence of ERAT cache implies we can still get new
stores and they may miss store queue marking flush.

Cc: stable@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
2019-09-24 20:58:55 +10:00
Aneesh Kumar K.V
09ce98cacd powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag
Rename the #define to indicate this is related to store vs tlbie
ordering issue. In the next patch, we will be adding another feature
flag that is used to handles ERAT flush vs tlbie ordering issue.

Fixes: a5d4b5891c ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
2019-09-24 20:58:47 +10:00
Michael Roth
3a83f677a6 KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
of memory running the following guest configs:

  guest A:
    - 224GB of memory
    - 56 VCPUs (sockets=1,cores=28,threads=2), where:
      VCPUs 0-1 are pinned to CPUs 0-3,
      VCPUs 2-3 are pinned to CPUs 4-7,
      ...
      VCPUs 54-55 are pinned to CPUs 108-111

  guest B:
    - 4GB of memory
    - 4 VCPUs (sockets=1,cores=4,threads=1)

with the following workloads (with KSM and THP enabled in all):

  guest A:
    stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M

  guest B:
    stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M

  host:
    stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M

the below soft-lockup traces were observed after an hour or so and
persisted until the host was reset (this was found to be reliably
reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
and 5.3-rc5):

  [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1253.183319] rcu:     124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
  [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
  [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
  [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
  [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
  [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
  [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1316.198032] rcu:     124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
  [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1379.212629] rcu:     124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
  [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1442.227115] rcu:     124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
  [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
  [ 1455.111822]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
  [ 1455.111905]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
  [ 1455.111986]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
  [ 1455.112068]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
  [ 1455.112159]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
  [ 1455.112231]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
  [ 1455.112303]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
  [ 1455.112392]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1

CPUs 45, 24, and 124 are stuck on spin locks, likely held by
CPUs 105 and 31.

CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
target CPU 42. For instance:

  # CPU 105 registers (via xmon)
  R00 = c00000000020b20c   R16 = 00007d1bcd800000
  R01 = c00000363eaa7970   R17 = 0000000000000001
  R02 = c0000000019b3a00   R18 = 000000000000006b
  R03 = 000000000000002a   R19 = 00007d537d7aecf0
  R04 = 000000000000002a   R20 = 60000000000000e0
  R05 = 000000000000002a   R21 = 0801000000000080
  R06 = c0002073fb0caa08   R22 = 0000000000000d60
  R07 = c0000000019ddd78   R23 = 0000000000000001
  R08 = 000000000000002a   R24 = c00000000147a700
  R09 = 0000000000000001   R25 = c0002073fb0ca908
  R10 = c000008ffeb4e660   R26 = 0000000000000000
  R11 = c0002073fb0ca900   R27 = c0000000019e2464
  R12 = c000000000050790   R28 = c0000000000812b0
  R13 = c000207fff623e00   R29 = c0002073fb0ca808
  R14 = 00007d1bbee00000   R30 = c0002073fb0ca800
  R15 = 00007d1bcd600000   R31 = 0000000000000800
  pc  = c00000000020b260 smp_call_function_many+0x3d0/0x460
  cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
  lr  = c00000000020b20c smp_call_function_many+0x37c/0x460
  msr = 900000010288b033   cr  = 44024824
  ctr = c000000000050790   xer = 0000000000000000   trap =  100

CPU 42 is running normally, doing VCPU work:

  # CPU 42 stack trace (via xmon)
  [link register   ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
  [c000008ed3343820] c000008ed3343850 (unreliable)
  [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
  [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
  [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
  [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
  [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
  [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
  [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
  [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
  [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
  --- Exception: c00 (System Call) at 00007d545cfd7694
  SP (7d53ff7edf50) is in userspace

It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
was set for CPU 42 by at least 1 of the CPUs waiting in
smp_call_function_many(), but somehow the corresponding
call_single_queue entries were never processed by CPU 42, causing the
callers to spin in csd_lock_wait() indefinitely.

Nick Piggin suggested something similar to the following sequence as
a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
IPI messages seems to be most common, but any mix of CALL_FUNCTION and
!CALL_FUNCTION messages could trigger it):

    CPU
      X: smp_muxed_ipi_set_message():
      X:   smp_mb()
      X:   message[RESCHEDULE] = 1
      X: doorbell_global_ipi(42):
      X:   kvmppc_set_host_ipi(42, 1)
      X:   ppc_msgsnd_sync()/smp_mb()
      X:   ppc_msgsnd() -> 42
     42: doorbell_exception(): // from CPU X
     42:   ppc_msgsync()
    105: smp_muxed_ipi_set_message():
    105:   smb_mb()
         // STORE DEFERRED DUE TO RE-ORDERING
  --105:   message[CALL_FUNCTION] = 1
  | 105: doorbell_global_ipi(42):
  | 105:   kvmppc_set_host_ipi(42, 1)
  |  42:   kvmppc_set_host_ipi(42, 0)
  |  42: smp_ipi_demux_relaxed()
  |  42: // returns to executing guest
  |      // RE-ORDERED STORE COMPLETES
  ->105:   message[CALL_FUNCTION] = 1
    105:   ppc_msgsnd_sync()/smp_mb()
    105:   ppc_msgsnd() -> 42
     42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
    105: // hangs waiting on 42 to process messages/call_single_queue

This can be prevented with an smp_mb() at the beginning of
kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
state indicated by the host_ipi flag) are ordered vs. the store to
to host_ipi.

However, doing so might still allow for the following scenario (not
yet observed):

    CPU
      X: smp_muxed_ipi_set_message():
      X:   smp_mb()
      X:   message[RESCHEDULE] = 1
      X: doorbell_global_ipi(42):
      X:   kvmppc_set_host_ipi(42, 1)
      X:   ppc_msgsnd_sync()/smp_mb()
      X:   ppc_msgsnd() -> 42
     42: doorbell_exception(): // from CPU X
     42:   ppc_msgsync()
         // STORE DEFERRED DUE TO RE-ORDERING
  -- 42:   kvmppc_set_host_ipi(42, 0)
  |  42: smp_ipi_demux_relaxed()
  | 105: smp_muxed_ipi_set_message():
  | 105:   smb_mb()
  | 105:   message[CALL_FUNCTION] = 1
  | 105: doorbell_global_ipi(42):
  | 105:   kvmppc_set_host_ipi(42, 1)
  |      // RE-ORDERED STORE COMPLETES
  -> 42:   kvmppc_set_host_ipi(42, 0)
     42: // returns to executing guest
    105:   ppc_msgsnd_sync()/smp_mb()
    105:   ppc_msgsnd() -> 42
     42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
    105: // hangs waiting on 42 to process messages/call_single_queue

Fixing this scenario would require an smp_mb() *after* clearing
host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
subsequent processing of IPI messages.

To handle both cases, this patch splits kvmppc_set_host_ipi() into
separate set/clear functions, where we execute smp_mb() prior to
setting host_ipi flag, and after clearing host_ipi flag. These
functions pair with each other to synchronize the sender and receiver
sides.

With that change in place the above workload ran for 20 hours without
triggering any lock-ups.

Fixes: 755563bc79 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
2019-09-24 12:46:26 +10:00
Linus Torvalds
299d14d4c3 pci-v5.4-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAl2JNVAUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vyTOA/9EZeyS7J+ZcOwihWz5vNijf0kfpKp
 /jZ9VF9nHjsL9Pw3/Fzha605Ssrtwcqge8g/sze9f0g/pxZk99lLHokE6dEOurEA
 GyKpNNMdiBol4YZMCsSoYji0MpwW0uMCuASPMiEwv2LxZ72A2Tu1RbgYLU+n4m1T
 fQldDTxsUMXc/OH/8SL8QDEh6o8qyDRhmSXFAOv8RGqN8N3iUwVwhQobKpwpmEvx
 ddzqWMS8f91qkhIKO7fgc9P4NI/7yI7kkF+wcdwtfiMO8Qkr4IdcdF7qwNVAtpKA
 A+sMRi59i2XxDTqRFx+wXXMa+rt+Pf1pucv77SO74xXWwpuXSxLVDYjULP1YQugK
 FTBo4SNmico/ts+n5cgm+CGMq2P2E29VYeqkI1Un6eDDvQnQlBgQdpdcBoadJ0rW
 y31OInjhRJC1ZK5bATKfCMbmB+VQxFsbyeUA7PBlrALyAmXZfw30iNxX9iHBhWqc
 myPNVEJJGp0cWTxGxMAU9MhelzeQxDAd+Eb44J5gv51bx0w9yqmZHECSDrOVdtYi
 HpOyI7E3Cb8m23BOHvCdB/v8igaYMZl08LUUJqu1S9mFclYyYVuOOIB04Yc2Qrx1
 3PHtT8TC47FbWuzKwo12RflzoAiNShJGw+tNKo6T1jC+r5jdbKWWtTnsoRqbSfaG
 rG5RJpB7EuQSP1Y=
 =/xB3
 -----END PGP SIGNATURE-----

Merge tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "Enumeration:

   - Consolidate _HPP/_HPX stuff in pci-acpi.c and simplify it
     (Krzysztof Wilczynski)

   - Fix incorrect PCIe device types and remove dev->has_secondary_link
     to simplify code that deals with upstream/downstream ports (Mika
     Westerberg)

   - After suspend, restore Resizable BAR size bits correctly for 1MB
     BARs (Sumit Saxena)

   - Enable PCI_MSI_IRQ_DOMAIN support for RISC-V (Wesley Terpstra)

  Virtualization:

   - Add ACS quirks for iProc PAXB (Abhinav Ratna), Amazon Annapurna
     Labs (Ali Saidi)

   - Move sysfs SR-IOV functions to iov.c (Kelsey Skunberg)

   - Remove group write permissions from sysfs sriov_numvfs,
     sriov_drivers_autoprobe (Kelsey Skunberg)

  Hotplug:

   - Simplify pciehp indicator control (Denis Efremov)

  Peer-to-peer DMA:

   - Allow P2P DMA between root ports for whitelisted bridges (Logan
     Gunthorpe)

   - Whitelist some Intel host bridges for P2P DMA (Logan Gunthorpe)

   - DMA map P2P DMA requests that traverse host bridge (Logan
     Gunthorpe)

  Amazon Annapurna Labs host bridge driver:

   - Add DT binding and controller driver (Jonathan Chocron)

  Hyper-V host bridge driver:

   - Fix hv_pci_dev->pci_slot use-after-free (Dexuan Cui)

   - Fix PCI domain number collisions (Haiyang Zhang)

   - Use instance ID bytes 4 & 5 as PCI domain numbers (Haiyang Zhang)

   - Fix build errors on non-SYSFS config (Randy Dunlap)

  i.MX6 host bridge driver:

   - Limit DBI register length (Stefan Agner)

  Intel VMD host bridge driver:

   - Fix config addressing issues (Jon Derrick)

  Layerscape host bridge driver:

   - Add bar_fixed_64bit property to endpoint driver (Xiaowei Bao)

   - Add CONFIG_PCI_LAYERSCAPE_EP to build EP/RC drivers separately
     (Xiaowei Bao)

  Mediatek host bridge driver:

   - Add MT7629 controller support (Jianjun Wang)

  Mobiveil host bridge driver:

   - Fix CPU base address setup (Hou Zhiqiang)

   - Make "num-lanes" property optional (Hou Zhiqiang)

  Tegra host bridge driver:

   - Fix OF node reference leak (Nishka Dasgupta)

   - Disable MSI for root ports to work around design problem (Vidya
     Sagar)

   - Add Tegra194 DT binding and controller support (Vidya Sagar)

   - Add support for sideband pins and slot regulators (Vidya Sagar)

   - Add PIPE2UPHY support (Vidya Sagar)

  Misc:

   - Remove unused pci_block_cfg_access() et al (Kelsey Skunberg)

   - Unexport pci_bus_get(), etc (Kelsey Skunberg)

   - Hide PM, VC, link speed, ATS, ECRC, PTM constants and interfaces in
     the PCI core (Kelsey Skunberg)

   - Clean up sysfs DEVICE_ATTR() usage (Kelsey Skunberg)

   - Mark expected switch fall-through (Gustavo A. R. Silva)

   - Propagate errors for optional regulators and PHYs (Thierry Reding)

   - Fix kernel command line resource_alignment parameter issues (Logan
     Gunthorpe)"

* tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (112 commits)
  PCI: Add pci_irq_vector() and other stubs when !CONFIG_PCI
  arm64: tegra: Add PCIe slot supply information in p2972-0000 platform
  arm64: tegra: Add configuration for PCIe C5 sideband signals
  PCI: tegra: Add support to enable slot regulators
  PCI: tegra: Add support to configure sideband pins
  PCI: vmd: Fix shadow offsets to reflect spec changes
  PCI: vmd: Fix config addressing when using bus offsets
  PCI: dwc: Add validation that PCIe core is set to correct mode
  PCI: dwc: al: Add Amazon Annapurna Labs PCIe controller driver
  dt-bindings: PCI: Add Amazon's Annapurna Labs PCIe host bridge binding
  PCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port
  PCI/VPD: Prevent VPD access for Amazon's Annapurna Labs Root Port
  PCI: Add ACS quirk for Amazon Annapurna Labs root ports
  PCI: Add Amazon's Annapurna Labs vendor ID
  MAINTAINERS: Add PCI native host/endpoint controllers designated reviewer
  PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers
  dt-bindings: PCI: tegra: Add PCIe slot supplies regulator entries
  dt-bindings: PCI: tegra: Add sideband pins configuration entries
  PCI: tegra: Add Tegra194 PCIe support
  PCI: Get rid of dev->has_secondary_link flag
  ...
2019-09-23 19:16:01 -07:00
Jordan Niethe
13c7bb3c57 powerpc/64s: Set reserved PCR bits
Currently the reserved bits of the Processor Compatibility
Register (PCR) are cleared as per the Programming Note in Section
1.3.3 of version 3.0B of the Power ISA. This causes all new
architecture features to be made available when running on newer
processors with new architecture features added to the PCR as bits
must be set to disable a given feature.

For example to disable new features added as part of Version 2.07 of
the ISA the corresponding bit in the PCR needs to be set.

As new processor features generally require explicit kernel support
they should be disabled until such support is implemented. Therefore
kernels should set all unknown/reserved bits in the PCR such that any
new architecture features which the kernel does not currently know
about get disabled.

An update is planned to the ISA to clarify that the PCR is an
exception to the Programming Note on reserved bits in Section 1.3.3.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190917004605.22471-2-alistair@popple.id.au
2019-09-21 08:36:53 +10:00
Alistair Popple
c6fadabb28 powerpc: Fix definition of PCR bits to work with old binutils
Commit 388cc6e133 ("KVM: PPC: Book3S HV: Support POWER6
compatibility mode on POWER7") introduced new macros defining the PCR
bits. When used from assembly files these definitions lead to build
errors using older versions of binutils that don't support the 'ul'
suffix. This fixes the build errors by updating the definitions to use
the __MASK() macro which selects the appropriate suffix.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190917004605.22471-1-alistair@popple.id.au
2019-09-21 08:36:53 +10:00
Linus Torvalds
45824fc0da powerpc updates for 5.4
- Initial support for running on a system with an Ultravisor, which is software
    that runs below the hypervisor and protects guests against some attacks by
    the hypervisor.
 
  - Support for building the kernel to run as a "Secure Virtual Machine", ie. as
    a guest capable of running on a system with an Ultravisor.
 
  - Some changes to our DMA code on bare metal, to allow devices with medium
    sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space.
 
  - Support for firmware assisted crash dumps on bare metal (powernv).
 
  - Two series fixing bugs in and refactoring our PCI EEH code.
 
  - A large series refactoring our exception entry code to use gas macros, both
    to make it more readable and also enable some future optimisations.
 
 As well as many cleanups and other minor features & fixups.
 
 Thanks to:
   Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh
   Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin
   Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy,
   Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens,
   David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar,
   Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari
   Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras,
   Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
   Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor,
   Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram
   Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj,
   Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung
   Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2EtEcTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPfsD/9uXyBXn3anI/H08+mk74k5gCsmMQpn
 D442CD/ByogZcccp23yBTlhawtCE03hcHnCLygn0Xgd8a4YvHts/RGHUe3fPHqlG
 bEyZ7jsLVz5ebNZQP7r4eGs2pSzCajwJy2N9HJ/C1ojf15rrfRxoVJtnyhE2wXpm
 DL+6o2K+nUCB3gTQ1Inr3DnWzoGOOUfNTOea2u+J+yfHwGRqOBYpevwqiwy5eelK
 aRjUJCqMTvrzra49MeFwjo0Nt3/Y8UNcwA+JlGdeR8bRuWhFrYmyBRiZEKPaujNO
 5EAfghBBlB0KQCqvF/tRM/c0OftHqK59AMobP9T7u9oOaBXeF/FpZX/iXjzNDPsN
 j9Oo2tKLTu/YVEXqBFuREGP+znANr1Wo4CFyOG8SbvYz0HFjR6XbtRJsS+0e8GWl
 kqX5/ZhYz3lBnKSNe9jgWOrh/J0KCSFigBTEWJT3xsn4YE8x8kK2l9KPqAIldWEP
 sKb2UjGS7v0NKq+NvShH88Q9AeQUEIjTcg/9aDDQDe6FaRQ7KiF8bUxSdwSPi+Fn
 j0lnF6i+1ATWZKuCr85veVi7C5qoe/+MqalnmP7MxULyzgXLLxUgN0SzEYO6QofK
 LQK/VaH2XVr5+M5YAb7K4/NX5gbM3s1bKrCiUy4EyHNvgG7gricYdbz6HgAjKpR7
 oP0rHfgmVYvF1g==
 =WlW+
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "This is a bit late, partly due to me travelling, and partly due to a
  power outage knocking out some of my test systems *while* I was
  travelling.

   - Initial support for running on a system with an Ultravisor, which
     is software that runs below the hypervisor and protects guests
     against some attacks by the hypervisor.

   - Support for building the kernel to run as a "Secure Virtual
     Machine", ie. as a guest capable of running on a system with an
     Ultravisor.

   - Some changes to our DMA code on bare metal, to allow devices with
     medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
     DMA space.

   - Support for firmware assisted crash dumps on bare metal (powernv).

   - Two series fixing bugs in and refactoring our PCI EEH code.

   - A large series refactoring our exception entry code to use gas
     macros, both to make it more readable and also enable some future
     optimisations.

  As well as many cleanups and other minor features & fixups.

  Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
  Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
  Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
  JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
  Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
  Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
  Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
  Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
  Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
  Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
  Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
  Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
  Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
  Lendacky, Vasant Hegde"

* tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
  powerpc/mm/mce: Keep irqs disabled during lockless page table walk
  powerpc: Use ftrace_graph_ret_addr() when unwinding
  powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
  ftrace: Look up the address of return_to_handler() using helpers
  powerpc: dump kernel log before carrying out fadump or kdump
  docs: powerpc: Add missing documentation reference
  powerpc/xmon: Fix output of XIVE IPI
  powerpc/xmon: Improve output of XIVE interrupts
  powerpc/mm/radix: remove useless kernel messages
  powerpc/fadump: support holes in kernel boot memory area
  powerpc/fadump: remove RMA_START and RMA_END macros
  powerpc/fadump: update documentation about option to release opalcore
  powerpc/fadump: consider f/w load area
  powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
  powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
  powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
  powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
  powerpc/fadump: improve how crashed kernel's memory is reserved
  powerpc/fadump: consider reserved ranges while releasing memory
  powerpc/fadump: make crash memory ranges array allocation generic
  ...
2019-09-20 11:48:06 -07:00
Linus Torvalds
8b53c76533 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add the ability to abort a skcipher walk.

  Algorithms:
   - Fix XTS to actually do the stealing.
   - Add library helpers for AES and DES for single-block users.
   - Add library helpers for SHA256.
   - Add new DES key verification helper.
   - Add surrounding bits for ESSIV generator.
   - Add accelerations for aegis128.
   - Add test vectors for lzo-rle.

  Drivers:
   - Add i.MX8MQ support to caam.
   - Add gcm/ccm/cfb/ofb aes support in inside-secure.
   - Add ofb/cfb aes support in media-tek.
   - Add HiSilicon ZIP accelerator support.

  Others:
   - Fix potential race condition in padata.
   - Use unbound workqueues in padata"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (311 commits)
  crypto: caam - Cast to long first before pointer conversion
  crypto: ccree - enable CTS support in AES-XTS
  crypto: inside-secure - Probe transform record cache RAM sizes
  crypto: inside-secure - Base RD fetchcount on actual RD FIFO size
  crypto: inside-secure - Base CD fetchcount on actual CD FIFO size
  crypto: inside-secure - Enable extended algorithms on newer HW
  crypto: inside-secure: Corrected configuration of EIP96_TOKEN_CTRL
  crypto: inside-secure - Add EIP97/EIP197 and endianness detection
  padata: remove cpu_index from the parallel_queue
  padata: unbind parallel jobs from specific CPUs
  padata: use separate workqueues for parallel and serial work
  padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
  crypto: pcrypt - remove padata cpumask notifier
  padata: make padata_do_parallel find alternate callback CPU
  workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
  workqueue: unconfine alloc/apply/free_workqueue_attrs()
  padata: allocate workqueue internally
  arm64: dts: imx8mq: Add CAAM node
  random: Use wait_event_freezable() in add_hwgenerator_randomness()
  crypto: ux500 - Fix COMPILE_TEST warnings
  ...
2019-09-18 12:11:14 -07:00
Linus Torvalds
fe38bd6862 * s390: ioctl hardening, selftests
* ARM: ITS translation cache; support for 512 vCPUs, various cleanups
 and bugfixes
 
 * PPC: various minor fixes and preparation
 
 * x86: bugfixes all over the place (posted interrupts, SVM, emulation
 corner cases, blocked INIT), some IPI optimizations
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdf7fdAAoJEL/70l94x66DJzkIAKDcuWXJB4Qtoto6yUvPiHZm
 LYkY/Dn1zulb/DhzrBoXFey/jZXwl9kxMYkVTefnrAl0fRwFGX+G1UYnQrtAL6Gr
 ifdTYdy3kZhXCnnp99QAantWDswJHo1THwbmHrlmkxS4MdisEaTHwgjaHrDRZ4/d
 FAEwW2isSonP3YJfTtsKFFjL9k2D4iMnwZ/R2B7UOaWvgnerZ1GLmOkilvnzGGEV
 IQ89IIkWlkKd4SKgq8RkDKlfW5JrLrSdTK2Uf0DvAxV+J0EFkEaR+WlLsqumra0z
 Eg3KwNScfQj0DyT0TzurcOxObcQPoMNSFYXLRbUu1+i0CGgm90XpF1IosiuihgU=
 =w6I3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "s390:
   - ioctl hardening
   - selftests

  ARM:
   - ITS translation cache
   - support for 512 vCPUs
   - various cleanups and bugfixes

  PPC:
   - various minor fixes and preparation

  x86:
   - bugfixes all over the place (posted interrupts, SVM, emulation
     corner cases, blocked INIT)
   - some IPI optimizations"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (75 commits)
  KVM: X86: Use IPI shorthands in kvm guest when support
  KVM: x86: Fix INIT signal handling in various CPU states
  KVM: VMX: Introduce exit reason for receiving INIT signal on guest-mode
  KVM: VMX: Stop the preemption timer during vCPU reset
  KVM: LAPIC: Micro optimize IPI latency
  kvm: Nested KVM MMUs need PAE root too
  KVM: x86: set ctxt->have_exception in x86_decode_insn()
  KVM: x86: always stop emulation on page fault
  KVM: nVMX: trace nested VM-Enter failures detected by H/W
  KVM: nVMX: add tracepoint for failed nested VM-Enter
  x86: KVM: svm: Fix a check in nested_svm_vmrun()
  KVM: x86: Return to userspace with internal error on unexpected exit reason
  KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM code
  KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers
  doc: kvm: Fix return description of KVM_SET_MSRS
  KVM: X86: Tune PLE Window tracepoint
  KVM: VMX: Change ple_window type to unsigned int
  KVM: X86: Remove tailing newline for tracepoints
  KVM: X86: Trace vcpu_id for vmexit
  KVM: x86: Manually calculate reserved bits when loading PDPTRS
  ...
2019-09-18 09:49:13 -07:00
Naveen N. Rao
370011a270 powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
This associates entries in the ftrace_ret_stack with corresponding stack
frames, enabling more robust stack unwinding. Also update the only user
of ftrace_graph_ret_addr() to pass the stack pointer.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0224f2d0971b069c678e2ff678cfc2cd1e114cfe.1567707399.git.naveen.n.rao@linux.vnet.ibm.com
2019-09-18 12:24:55 +10:00
Linus Torvalds
94d18ee934 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "This cycle's RCU changes were:

   - A few more RCU flavor consolidation cleanups.

   - Updates to RCU's list-traversal macros improving lockdep usability.

   - Forward-progress improvements for no-CBs CPUs: Avoid ignoring
     incoming callbacks during grace-period waits.

   - Forward-progress improvements for no-CBs CPUs: Use ->cblist
     structure to take advantage of others' grace periods.

   - Also added a small commit that avoids needlessly inflicting
     scheduler-clock ticks on callback-offloaded CPUs.

   - Forward-progress improvements for no-CBs CPUs: Reduce contention on
     ->nocb_lock guarding ->cblist.

   - Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
     list to further reduce contention on ->nocb_lock guarding ->cblist.

   - Miscellaneous fixes.

   - Torture-test updates.

   - minor LKMM updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
  MAINTAINERS: Update from paulmck@linux.ibm.com to paulmck@kernel.org
  rcu: Don't include <linux/ktime.h> in rcutiny.h
  rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
  rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
  rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
  rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
  rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
  rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
  rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
  rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
  rcu/nocb: Add bypass callback queueing
  rcu/nocb: Atomic ->len field in rcu_segcblist structure
  rcu/nocb: Unconditionally advance and wake for excessive CBs
  rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
  rcu/nocb: Reduce contention at no-CBs invocation-done time
  rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
  rcu/nocb: Round down for number of no-CBs grace-period kthreads
  rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
  rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
  rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
  ...
2019-09-16 16:28:19 -07:00
Linus Torvalds
e77fafe9af arm64 updates for 5.4:
- 52-bit virtual addressing in the kernel
 
 - New ABI to allow tagged user pointers to be dereferenced by syscalls
 
 - Early RNG seeding by the bootloader
 
 - Improve robustness of SMP boot
 
 - Fix TLB invalidation in light of recent architectural clarifications
 
 - Support for i.MX8 DDR PMU
 
 - Remove direct LSE instruction patching in favour of static keys
 
 - Function error injection using kprobes
 
 - Support for the PPTT "thread" flag introduced by ACPI 6.3
 
 - Move PSCI idle code into proper cpuidle driver
 
 - Relaxation of implicit I/O memory barriers
 
 - Build with RELR relocations when toolchain supports them
 
 - Numerous cleanups and non-critical fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl1yYREQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNAM3CAChqDFQkryXoHwdeEcaukMRVNxtxOi4pM4g
 5xqkb7PoqRJssIblsuhaXjrSD97yWCgaqCmFe6rKoes++lP4bFcTe22KXPPyPBED
 A+tK4nTuKKcZfVbEanUjI+ihXaHJmKZ/kwAxWsEBYZ4WCOe3voCiJVNO2fHxqg1M
 8TskZ2BoayTbWMXih0eJg2MCy/xApBq4b3nZG4bKI7Z9UpXiKN1NYtDh98ZEBK4V
 d/oNoHsJ2ZvIQsztoBJMsvr09DTCazCijWZiECadm6l41WEPFizngrACiSJLLtYo
 0qu4qxgg9zgFlvBCRQmIYSggTuv35RgXSfcOwChmW5DUjHG+f9GK
 =Ru4B
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "Although there isn't tonnes of code in terms of line count, there are
  a fair few headline features which I've noted both in the tag and also
  in the merge commits when I pulled everything together.

  The part I'm most pleased with is that we had 35 contributors this
  time around, which feels like a big jump from the usual small group of
  core arm64 arch developers. Hopefully they all enjoyed it so much that
  they'll continue to contribute, but we'll see.

  It's probably worth highlighting that we've pulled in a branch from
  the risc-v folks which moves our CPU topology code out to where it can
  be shared with others.

  Summary:

   - 52-bit virtual addressing in the kernel

   - New ABI to allow tagged user pointers to be dereferenced by
     syscalls

   - Early RNG seeding by the bootloader

   - Improve robustness of SMP boot

   - Fix TLB invalidation in light of recent architectural
     clarifications

   - Support for i.MX8 DDR PMU

   - Remove direct LSE instruction patching in favour of static keys

   - Function error injection using kprobes

   - Support for the PPTT "thread" flag introduced by ACPI 6.3

   - Move PSCI idle code into proper cpuidle driver

   - Relaxation of implicit I/O memory barriers

   - Build with RELR relocations when toolchain supports them

   - Numerous cleanups and non-critical fixes"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (114 commits)
  arm64: remove __iounmap
  arm64: atomics: Use K constraint when toolchain appears to support it
  arm64: atomics: Undefine internal macros after use
  arm64: lse: Make ARM64_LSE_ATOMICS depend on JUMP_LABEL
  arm64: asm: Kill 'asm/atomic_arch.h'
  arm64: lse: Remove unused 'alt_lse' assembly macro
  arm64: atomics: Remove atomic_ll_sc compilation unit
  arm64: avoid using hard-coded registers for LSE atomics
  arm64: atomics: avoid out-of-line ll/sc atomics
  arm64: Use correct ll/sc atomic constraints
  jump_label: Don't warn on __exit jump entries
  docs/perf: Add documentation for the i.MX8 DDR PMU
  perf/imx_ddr: Add support for AXI ID filtering
  arm64: kpti: ensure patched kernel text is fetched from PoU
  arm64: fix fixmap copy for 16K pages and 48-bit VA
  perf/smmuv3: Validate groups for global filtering
  perf/smmuv3: Validate group size
  arm64: Relax Documentation/arm64/tagged-pointers.rst
  arm64: kvm: Replace hardcoded '1' with SYS_PAR_EL1_F
  arm64: mm: Ignore spurious translation faults taken from the kernel
  ...
2019-09-16 14:31:40 -07:00
Cédric Le Goater
5896163f7f powerpc/xmon: Improve output of XIVE interrupts
When looping on the list of interrupts, add the current value of the
PQ bits with a load on the ESB page. This has the side effect of
faulting the ESB page of all interrupts.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190910081850.26038-2-clg@kaod.org
2019-09-14 00:58:47 +10:00
Hari Bathini
7dee93a9a8 powerpc/fadump: support holes in kernel boot memory area
With support to copy multiple kernel boot memory regions owing to copy
size limitation, also handle holes in the memory area to be preserved.
Support as many as 128 kernel boot memory regions. This allows having
an adequate FADump capture kernel size for different scenarios.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821385448.5656.6124791213910877759.stgit@hbathini.in.ibm.com
2019-09-14 00:04:46 +10:00
Hari Bathini
becd91d9c5 powerpc/fadump: remove RMA_START and RMA_END macros
RMA_START is defined as '0' and there is even a BUILD_BUG_ON() to
make sure it is never anything else. Remove this macro and use '0'
instead as code change is needed anyway when it has to be something
else. Also, remove unused RMA_END macro.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821384096.5656.15026984053970204652.stgit@hbathini.in.ibm.com
2019-09-14 00:04:46 +10:00
Hari Bathini
7b1b3b4825 powerpc/fadump: consider f/w load area
OPAL loads kernel & initrd at 512MB offset (256MB size), also exported
as ibm,opal/dump/fw-load-area. So, if boot memory size of FADump is
less than 768MB, kernel memory to be exported as '/proc/vmcore' would
be overwritten by f/w while loading kernel & initrd. To avoid such a
scenario, enforce a minimum boot memory size of 768MB on OPAL platform
and skip using FADump if a newer F/W version loads kernel & initrd
above 768MB.

Also, irrespective of RMA size, set the minimum boot memory size
expected on pseries platform at 320MB. This is to avoid inflating the
minimum memory requirements on systems with 512M/1024M RMA size.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821381414.5656.1592867278535469652.stgit@hbathini.in.ibm.com
2019-09-14 00:04:45 +10:00
Hari Bathini
bec53196ad powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
Add a new kernel config option, CONFIG_PRESERVE_FA_DUMP that ensures
that crash data, from previously crash'ed kernel, is preserved. This
helps in cases where FADump is not enabled but the subsequent memory
preserving kernel boot is likely to process this crash data. One
typical usecase for this config option is petitboot kernel.

As OPAL allows registering address with it in the first kernel and
retrieving it after MPIPL, use it to store the top of boot memory.
A kernel that intends to preserve crash data retrieves it and avoids
using memory beyond this address.

Move arch_reserved_kernel_pages() function as it is needed for both
FA_DUMP and PRESERVE_FA_DUMP configurations.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821375751.5656.11459483669542541602.stgit@hbathini.in.ibm.com
2019-09-14 00:04:45 +10:00
Hari Bathini
e4fc48fb4d powerpc/fadump: make crash memory ranges array allocation generic
Make allocate_crash_memory_ranges() and free_crash_memory_ranges()
functions generic to reuse them for memory management of all types of
dynamic memory range arrays. This change helps in memory management
of reserved ranges array to be added later.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821369863.5656.4375667005352155892.stgit@hbathini.in.ibm.com
2019-09-14 00:04:44 +10:00
Hari Bathini
5000a17afb powerpc/fadump: process architected register state data provided by firmware
Firmware provides architected register state data at the time of crash.
Process this data and build CPU notes to append to ELF core. In case
this data is missing or in unsupported format, at least append crashing
CPU's register data, to have something to work with in the vmcore file.

Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821367702.5656.5546683836236508389.stgit@hbathini.in.ibm.com
2019-09-14 00:04:44 +10:00
Hari Bathini
2790d01d1e powerpc/fadump: reset metadata address during clean up
During kexec boot, metadata address needs to be reset to avoid running
into errors interpreting stale metadata address, in case the kexec'ed
kernel crashes before metadata address could be setup again.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821346629.5656.10783321582005237813.stgit@hbathini.in.ibm.com
2019-09-14 00:04:43 +10:00
Hari Bathini
742a265acc powerpc/fadump: register kernel metadata address with opal
OPAL allows registering address with it in the first kernel and
retrieving it after MPIPL. Setup kernel metadata and register its
address with OPAL to use it for processing the crash dump.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821345011.5656.13567765019032928471.stgit@hbathini.in.ibm.com
2019-09-14 00:04:43 +10:00
Hari Bathini
41df592872 powerpc/fadump: add fadump support on powernv
Add basic callback functions for FADump on PowerNV platform.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821342072.5656.4346362203141486452.stgit@hbathini.in.ibm.com
2019-09-14 00:04:43 +10:00
Hari Bathini
6f5f193e84 powerpc/opal: add MPIPL interface definitions
MPIPL is Memory Preserving IPL supported from POWER9. This enables the
kernel to reset the system with memory 'preserved'. Also, it supports
copying memory from a source address to some destination address during
MPIPL boot. Add MPIPL interface definitions here to leverage these f/w
features in adding FADump support for PowerNV platform.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821340710.5656.10071829040515662624.stgit@hbathini.in.ibm.com
2019-09-14 00:04:43 +10:00
Hari Bathini
41a65d1618 pseries/fadump: define RTAS register/un-register callback functions
Move platform specific register/un-register code, the RTAS calls, to
register/un-register callback functions. This would also mean moving
code that initializes and prints the platform specific FADump data.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821332856.5656.16380417702046411631.stgit@hbathini.in.ibm.com
2019-09-14 00:04:42 +10:00
Hari Bathini
d3833a7010 powerpc/fadump: introduce callbacks for platform specific operations
Introduce callback functions for platform specific operations like
register, unregister, invalidate & such. Also, define place-holders
for the same on pSeries platform.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821330286.5656.15538934400074110770.stgit@hbathini.in.ibm.com
2019-09-14 00:04:42 +10:00
Hari Bathini
0226e55275 powerpc/fadump: move rtas specific definitions to platform code
Currently, FADump is only supported on pSeries but that is going to
change soon with FADump support being added on PowerNV platform. So,
move rtas specific definitions to platform code to allow FADump
to have multiple platforms support.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821328494.5656.16219929140866195511.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Hari Bathini
7f0ad11d3f powerpc/fadump: declare helper functions in internal header file
Declare helper functions, that can be reused by multiple platforms,
in the internal header file.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821320487.5656.2660730464212209984.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Hari Bathini
961cf26a98 powerpc/fadump: add helper functions
Add helper functions to setup & free CPU notes buffer and to find if a
given memory area is contiguous. Also, use boolean as return type for
the function that finds if boot memory area is contiguous. While at
it, save the virtual address of CPU notes buffer instead of physical
address as virtual address is used often.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821318971.5656.9281936950510635858.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Hari Bathini
ca986d7fa7 powerpc/fadump: move internal macros/definitions to a new header
Though asm/fadump.h is meant to be used by other components dealing
with FADump, it also has macros/definitions internal to FADump code.
Move them to a new header file used within FADump code. This also
makes way for refactoring platform specific FADump code.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821313134.5656.6597770626574392140.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Michael Ellerman
dac39f7885 powerpc/64s: Remove overlaps_kvm_tmp()
kvm_tmp is now in .text and so doesn't need a special overlap check.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911115746.12433-2-mpe@ellerman.id.au
2019-09-14 00:04:40 +10:00
Michael Ellerman
1b7f3b6c43 powerpc/eeh: Fix build with STACKTRACE=n
The build breaks when STACKTRACE=n, eg. skiroot_defconfig:

  arch/powerpc/kernel/eeh_event.c:124:23: error: implicit declaration of function 'stack_trace_save'

Fix it with some ifdefs for now.

Fixes: 25baf3d816 ("powerpc/eeh: Defer printing stack trace")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-09-14 00:01:14 +10:00
Greg Kurz
6ccb4ac2bf powerpc/xive: Fix bogus error code returned by OPAL
There's a bug in skiboot that causes the OPAL_XIVE_ALLOCATE_IRQ call
to return the 32-bit value 0xffffffff when OPAL has run out of IRQs.
Unfortunatelty, OPAL return values are signed 64-bit entities and
errors are supposed to be negative. If that happens, the linux code
confusingly treats 0xffffffff as a valid IRQ number and panics at some
point.

A fix was recently merged in skiboot:

e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")

but we need a workaround anyway to support older skiboots already
in the field.

Internally convert 0xffffffff to OPAL_RESOURCE which is the usual error
returned upon resource exhaustion.

Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821713818.1985334.14123187368108582810.stgit@bahia.lan
2019-09-12 09:27:05 +10:00
Vasant Hegde
587164cd59 powerpc/powernv: Add new opal message type
We have OPAL_MSG_PRD message type to pass prd related messages from
OPAL to `opal-prd`. It can handle messages upto 64 bytes. We have a
requirement to send bigger than 64 bytes of data from OPAL to
`opal-prd`. Lets add new message type (OPAL_MSG_PRD2) to pass bigger
data.

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[mpe: Make the error string clear that it's the PRD2 event that failed]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190826065701.8853-2-hegdevasant@linux.vnet.ibm.com
2019-09-12 09:27:00 +10:00
Segher Boessenkool
aa497d4352 powerpc: Add attributes for setjmp/longjmp
The setjmp function should be declared as "returns_twice", or bad
things can happen[1]. This does not actually change generated code in
my testing.

The longjmp function should be declared as "noreturn", so that the
compiler can optimise calls to it better. This makes the generated
code a little shorter.

1: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-returns_005ftwice-function-attribute

Signed-off-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c02ce4a573f3bac907e2c70957a2d1275f910013.1567605586.git.segher@kernel.crashing.org
2019-09-12 09:26:59 +10:00
Paolo Bonzini
32d1d15c52 KVM/arm updates for 5.4
- New ITS translation cache
 - Allow up to 512 CPUs to be supported with GICv3 (for real this time)
 - Now call kvm_arch_vcpu_blocking early in the blocking sequence
 - Tidy-up device mappings in S2 when DIC is available
 - Clean icache invalidation on VMID rollover
 - General cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl12QlAPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDXxUQAMd+GlOlmTXqEiuKudVApTkl6WIebfh0vkn6
 /1j8yNgJqRtZEY/YqE/XhAaqz1tx88VtzqSrNG4Pmrl9rDHMD9mDuk+w5UvEN2vy
 D5/nEe/wnzyVpuROBlHhsRbCRkT/6dNpnDnydwxCUqQPhfsAHnTNx6IygVzH9BHS
 D/1+KLI1imW8YziSSf6SGlIKJtk0eo5qo/aT6/mhb+e18Dobax3miItZL4mAqFPd
 tCV8fvOLb/phdSmOZuD/3XF9JOodk2ycvF9MW9Rp/FxDx9HULCXPv/3KnoHg9ca5
 QSGz1Chj0C2avaQJ4GbHZnZZjdvL2TmVxMpixocc/VZCqlO3ifRKf91t/rq4cElG
 HxLE9AX6kqW6UK66RHUQiHxjqRG8ynz8xEmlhwd7YhCLmtmJSXLTrmc2ABf64+BT
 RaexRa3h6D19fLBcMN5gpP8I48XaRpfxg6E/jCw5ZEr/8zhzLajFnE89ftgRR04f
 bSXOnj0kAhrBZ6jRTEata1MrFAt58wiaulxTxgMlnj1hHpqA3b+x6woRECAEVOlc
 6JJuzReJSBuCJL/rVtXGF31mXNnqUo+oTcDpQSle/fDtQ/44+xlYj6V/ZeFIRHAz
 nwUw9DHyZ/JMSwPNsqtdzCnLths1rNw34A7VgdVWiqiPYEcGGUnMzkRrXKMYjjJn
 LD4+Rh/e
 =0dD/
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm updates for 5.4

- New ITS translation cache
- Allow up to 512 CPUs to be supported with GICv3 (for real this time)
- Now call kvm_arch_vcpu_blocking early in the blocking sequence
- Tidy-up device mappings in S2 when DIC is available
- Clean icache invalidation on VMID rollover
- General cleanup
2019-09-10 19:09:14 +02:00
Jordan Niethe
67c87892e2 powerpc: Remove empty comment
Commit 2874c5fd28 ("treewide: Replace GPLv2 boilerplate/reference with
SPDX - rule 152") left an empty comment in machdep.h, as the boilerplate
was the only text in the comment. Remove the empty comment.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813051212.6387-1-jniethe5@gmail.com
2019-09-05 14:22:41 +10:00
Nicholas Piggin
2275d7b575 powerpc/64s/radix: introduce options to disable use of the tlbie instruction
Introduce two options to control the use of the tlbie instruction. A
boot time option which completely disables the kernel using the
instruction, this is currently incompatible with HASH MMU, KVM, and
coherent accelerators.

And a debugfs option can be switched at runtime and avoids using tlbie
for invalidating CPU TLBs for normal process and kernel address
mappings. Coherent accelerators are still managed with tlbie, as will
KVM partition scope translations.

Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
basic implementation which does not attempt to make any optimisation
beyond the tlbie implementation.

This is useful for performance testing among other things. For example
in certain situations on large systems, using IPIs may be faster than
tlbie as they can be directed rather than broadcast. Later we may also
take advantage of the IPIs to do more interesting things such as trim
the mm cpumask more aggressively.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com
2019-09-05 14:22:41 +10:00
Nicholas Piggin
fd13daea5f powerpc/64s: make mmu_partition_table_set_entry TLB flush optional
No functional change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-4-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
99161de3a2 powerpc/64s/radix: tidy up TLB flushing code
There should be no functional changes.

- Use calls to existing radix_tlb.c functions in flush_partition.

- Rename radix__flush_tlb_lpid to radix__flush_all_lpid and similar,
  because they flush everything, matching flush_all_mm rather than
  flush_tlb_mm for the lpid.

- Remove some unused radix_tlb.c flush primitives.

Signed-off: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-3-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
ed6546bdc6 powerpc/64s: remove register_process_table callback
This callback is only required because the partition table init comes
before process table allocation on powernv (aka bare metal aka native).

Change the order to allocate the process table first, and remove the
callback.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-2-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Oliver O'Halloran
25baf3d816 powerpc/eeh: Defer printing stack trace
Currently we print a stack trace in the event handler to help with
debugging EEH issues. In the case of suprise hot-unplug this is unneeded,
so we want to prevent printing the stack trace unless we know it's due to
an actual device error. To accomplish this, we can save a stack trace at
the point of detection and only print it once the EEH recovery handler has
determined the freeze was due to an actual error.

Since the whole point of this is to prevent spurious EEH output we also
move a few prints out of the detection thread, or mark them as pr_debug
so anyone interested can get output from the eeh_check_dev_failure()
if they want.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-6-oohall@gmail.com
2019-09-05 14:22:38 +10:00
Oliver O'Halloran
5ef753ae43 powerpc/eeh: Fix race when freeing PDNs
When hot-adding devices we rely on the hotplug driver to create pci_dn's
for the devices under the hotplug slot. Converse, when hot-removing the
driver will remove the pci_dn's that it created. This is a problem because
the pci_dev is still live until it's refcount drops to zero. This can
happen if the driver is slow to tear down it's internal state. Ideally, the
driver would not attempt to perform any config accesses to the device once
it's been marked as removed, but sometimes it happens. As a result, we
might attempt to access the pci_dn for a device that has been torn down and
the kernel may crash as a result.

To fix this, don't free the pci_dn unless the corresponding pci_dev has
been released.  If the pci_dev is still live, then we mark the pci_dn with
a flag that indicates the pci_dev's release function should free it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-3-oohall@gmail.com
2019-09-05 14:22:37 +10:00
Nicholas Piggin
a243281195 powerpc/64s/exception: move head-64.h exception code to exception-64s.S
The head-64.h code should deal only with the head code sections
and offset calculations.

No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-19-npiggin@gmail.com
2019-08-30 10:32:36 +10:00
Nicholas Piggin
9ca766f989 powerpc/64s/pseries: machine check convert to use common event code
The common machine_check_event data structures and queues are mostly
platform independent, with powernv decoding SRR1/DSISR/etc., into
machine_check_event objects.

This patch converts pseries to use this infrastructure by decoding
fwnmi/rtas data into machine_check_event objects.

This allows queueing to be used by a subsequent change to delay the
virtual mode handling of machine checks that occur in kernel space
where it is unsafe to switch immediately to virtual mode, similarly
to powernv.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix implicit fallthrough warnings in mce_handle_error()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-10-npiggin@gmail.com
2019-08-30 10:32:35 +10:00
Anshuman Khandual
2efbc58f15 powerpc/pseries/svm: Force SWIOTLB for secure guests
SWIOTLB checks range of incoming CPU addresses to be bounced and sees if
the device can access it through its DMA window without requiring bouncing.
In such cases it just chooses to skip bouncing. But for cases like secure
guests on powerpc platform all addresses need to be bounced into the shared
pool of memory because the host cannot access it otherwise. Hence the need
to do the bouncing is not related to device's DMA window and use of bounce
buffers is forced by setting swiotlb_force.

Also, connect the shared memory conversion functions into the
ARCH_HAS_MEM_ENCRYPT hooks and call swiotlb_update_mem_attributes() to
convert SWIOTLB's memory pool to shared memory.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[ bauerman: Use ARCH_HAS_MEM_ENCRYPT hooks to share swiotlb memory pool. ]
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-15-bauerman@linux.ibm.com
2019-08-30 09:55:41 +10:00
Ram Pai
256ba2c168 powerpc/pseries/svm: Unshare all pages before kexecing a new kernel
A new kernel deserves a clean slate. Any pages shared with the hypervisor
is unshared before invoking the new kernel. However there are exceptions.
If the new kernel is invoked to dump the current kernel, or if there is a
explicit request to preserve the state of the current kernel, unsharing
of pages is skipped.

NOTE: While testing crashkernel, make sure at least 256M is reserved for
crashkernel. Otherwise SWIOTLB allocation will fail and crash kernel will
fail to boot.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-11-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Anshuman Khandual
d5394c059d powerpc/pseries/svm: Use shared memory for Debug Trace Log (DTL)
Secure guests need to share the DTL buffers with the hypervisor. To that
end, use a kmem_cache constructor which converts the underlying buddy
allocated SLUB cache pages into shared memory.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-10-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Anshuman Khandual
bd104e6db6 powerpc/pseries/svm: Use shared memory for LPPACA structures
LPPACA structures need to be shared with the host. Hence they need to be in
shared memory. Instead of allocating individual chunks of memory for a
given structure from memblock, a contiguous chunk of memory is allocated
and then converted into shared memory. Subsequent allocation requests will
come from the contiguous chunk which will be always shared memory for all
structures.

While we are able to use a kmem_cache constructor for the Debug Trace Log,
LPPACAs are allocated very early in the boot process (before SLUB is
available) so we need to use a simpler scheme here.

Introduce helper is_svm_platform() which uses the S bit of the MSR to tell
whether we're running as a secure guest.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-9-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Sukadev Bhattiprolu
7f70c3815a powerpc: Introduce the MSR_S bit
Protected Execution Facility (PEF) is an architectural change for
POWER 9 that enables Secure Virtual Machines (SVMs). When enabled,
PEF adds a new higher privileged mode, called Ultravisor mode, to
POWER architecture.

The hardware changes include the following:

  * There is a new bit in the MSR that determines whether the current
    process is running in secure mode, MSR(S) bit 41. MSR(S)=1, process
    is in secure mode, MSR(s)=0 process is in normal mode.

  * The MSR(S) bit can only be set by the Ultravisor.

  * HRFID cannot be used to set the MSR(S) bit. If the hypervisor needs
    to return to a SVM it must use an ultracall. It can determine if
    the VM it is returning to is secure.

  * The privilege of a process is now determined by three MSR bits,
    MSR(S, HV, PR). In each of the tables below the modes are listed
    from least privilege to highest privilege. The higher privilege
    modes can access all the resources of the lower privilege modes.

    **Secure Mode MSR Settings**

       +---+---+---+---------------+
       | S | HV| PR|Privilege      |
       +===+===+===+===============+
       | 1 | 0 | 1 | Problem       |
       +---+---+---+---------------+
       | 1 | 0 | 0 | Privileged(OS)|
       +---+---+---+---------------+
       | 1 | 1 | 0 | Ultravisor    |
       +---+---+---+---------------+
       | 1 | 1 | 1 | Reserved      |
       +---+---+---+---------------+

    **Normal Mode MSR Settings**

       +---+---+---+---------------+
       | S | HV| PR|Privilege      |
       +===+===+===+===============+
       | 0 | 0 | 1 | Problem       |
       +---+---+---+---------------+
       | 0 | 0 | 0 | Privileged(OS)|
       +---+---+---+---------------+
       | 0 | 1 | 0 | Hypervisor    |
       +---+---+---+---------------+
       | 0 | 1 | 1 | Problem (HV)  |
       +---+---+---+---------------+

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[ cclaudio: Update the commit message ]
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-7-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Ram Pai
f7777e008c powerpc/pseries/svm: Add helpers for UV_SHARE_PAGE and UV_UNSHARE_PAGE
These functions are used when the guest wants to grant the hypervisor
access to certain pages.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-6-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Ram Pai
6a9c930bd7 powerpc/prom_init: Add the ESM call to prom_init
Make the Enter-Secure-Mode (ESM) ultravisor call to switch the VM to secure
mode. Pass kernel base address and FDT address so that the Ultravisor is
able to verify the integrity of the VM using information from the ESM blob.

Add "svm=" command line option to turn on switching to secure mode.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[ andmike: Generate an RTAS os-term hcall when the ESM ucall fails. ]
Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
[ bauerman: Cleaned up the code a bit. ]
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-5-bauerman@linux.ibm.com
2019-08-30 09:54:35 +10:00
Thiago Jung Bauermann
136bc0397a powerpc/pseries: Introduce option to build secure virtual machines
Introduce CONFIG_PPC_SVM to control support for secure guests and include
Ultravisor-related helpers when it is selected

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-3-bauerman@linux.ibm.com
2019-08-30 09:53:28 +10:00
Michael Ellerman
9044adca78 Merge branch 'topic/ppc-kvm' into next
Merge our ppc-kvm topic branch to bring in the Ultravisor support
patches.
2019-08-30 09:52:57 +10:00
Sukadev Bhattiprolu
6c85b7bc63 powerpc/kvm: Use UV_RETURN ucall to return to ultravisor
When an SVM makes an hypercall or incurs some other exception, the
Ultravisor usually forwards (a.k.a. reflects) the exceptions to the
Hypervisor. After processing the exception, Hypervisor uses the
UV_RETURN ultracall to return control back to the SVM.

The expected register state on entry to this ultracall is:

* Non-volatile registers are restored to their original values.
* If returning from an hypercall, register R0 contains the return value
  (unlike other ultracalls) and, registers R4 through R12 contain any
  output values of the hypercall.
* R3 contains the ultracall number, i.e UV_RETURN.
* If returning with a synthesized interrupt, R2 contains the
  synthesized interrupt number.

Thanks to input from Paul Mackerras, Ram Pai and Mike Anderson.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-8-cclaudio@linux.ibm.com
2019-08-30 09:40:16 +10:00
Claudio Carvalho
5223134029 powerpc/mm: Write to PTCR only if ultravisor disabled
In ultravisor enabled systems, PTCR becomes ultravisor privileged only
for writing and an attempt to write to it will cause a Hypervisor
Emulation Assitance interrupt.

This patch uses the set_ptcr_when_no_uv() function to restrict PTCR
writing to only when ultravisor is disabled.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-6-cclaudio@linux.ibm.com
2019-08-30 09:40:16 +10:00
Michael Anderson
139a1d2842 powerpc/mm: Use UV_WRITE_PATE ucall to register a PATE
When Ultravisor (UV) is enabled, the partition table is stored in secure
memory and can only be accessed via the UV. The Hypervisor (HV) however
maintains a copy of the partition table in normal memory to allow Nest MMU
translations to occur (for normal VMs). The HV copy includes partition
table entries (PATE)s for secure VMs which would currently be unused
(Nest MMU translations cannot access secure memory) but they would be
needed as we add functionality.

This patch adds the UV_WRITE_PATE ucall which is used to update the PATE
for a VM (both normal and secure) when Ultravisor is enabled.

Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[ cclaudio: Write the PATE in HV's table before doing that in UV's ]
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Reviewed-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-5-cclaudio@linux.ibm.com
2019-08-30 09:40:15 +10:00
Claudio Carvalho
bb04ffe85e powerpc/powernv: Introduce FW_FEATURE_ULTRAVISOR
In PEF enabled systems, some of the resources which were previously
hypervisor privileged are now ultravisor privileged and controlled by
the ultravisor firmware.

This adds FW_FEATURE_ULTRAVISOR to indicate if PEF is enabled.

The host kernel can use FW_FEATURE_ULTRAVISOR, for instance, to skip
accessing resources (e.g. PTCR and LDBAR) in case PEF is enabled.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
[ andmike: Device node name to "ibm,ultravisor" ]
Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-4-cclaudio@linux.ibm.com
2019-08-30 09:40:15 +10:00
Claudio Carvalho
a49dddbdb0 powerpc/kernel: Add ucall_norets() ultravisor call handler
The ultracalls (ucalls for short) allow the Secure Virtual Machines
(SVM)s and hypervisor to request services from the ultravisor such as
accessing a register or memory region that can only be accessed when
running in ultravisor-privileged mode.

This patch adds the ucall_norets() ultravisor call handler.

The specific service needed from an ucall is specified in register
R3 (the first parameter to the ucall). Other parameters to the
ucall, if any, are specified in registers R4 through R12.

Return value of all ucalls is in register R3. Other output values
from the ucall, if any, are returned in registers R4 through R12.

Each ucall returns specific error codes, applicable in the context
of the ucall. However, like with the PowerPC Architecture Platform
Reference (PAPR), if no specific error code is defined for a particular
situation, then the ucall will fallback to an erroneous
parameter-position based code. i.e U_PARAMETER, U_P2, U_P3 etc depending
on the ucall parameter that may have caused the error.

Every host kernel (powernv) needs to be able to do ucalls in case it
ends up being run in a machine with ultravisor enabled. Otherwise, the
kernel may crash early in boot trying to access ultravisor resources,
for instance, trying to set the partition table entry 0. Secure guests
also need to be able to do ucalls and its kernel may not have
CONFIG_PPC_POWERNV=y. For that reason, the ucall.S file is placed under
arch/powerpc/kernel.

If ultravisor is not enabled, the ucalls will be redirected to the
hypervisor which must handle/fail the call.

Thanks to inputs from Ram Pai and Michael Anderson.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-3-cclaudio@linux.ibm.com
2019-08-30 09:40:15 +10:00
Claudio Carvalho
70ed86f4de powerpc: Add PowerPC Capabilities ELF note
Add the PowerPC name and the PPC_ELFNOTE_CAPABILITIES type in the
kernel binary ELF note. This type is a bitmap that can be used to
advertise kernel capabilities to userland.

This patch also defines PPCCAP_ULTRAVISOR_BIT as being the bit zero.

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
[ maxiwell: Define the 'PowerPC' type in the elfnote.h ]
Signed-off-by: Maxiwell S. Garcia <maxiwell@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829155021.2915-2-maxiwell@linux.ibm.com
2019-08-30 09:40:15 +10:00
Alexey Kardashevskiy
a102f139aa powerpc/powernv/ioda: Remove obsolete iommu_table_ops::exchange callbacks
As now we have xchg_no_kill/tce_kill, these are not used anymore so
remove them.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829085252.72370-6-aik@ozlabs.ru
2019-08-30 09:40:15 +10:00
Alexey Kardashevskiy
35872480da powerpc/powernv/ioda: Split out TCE invalidation from TCE updates
At the moment updates in a TCE table are made by iommu_table_ops::exchange
which update one TCE and invalidates an entry in the PHB/NPU TCE cache
via set of registers called "TCE Kill" (hence the naming).
Writing a TCE is a simple xchg() but invalidating the TCE cache is
a relatively expensive OPAL call. Mapping a 100GB guest with PCI+NPU
passed through devices takes about 20s.

Thankfully we can do better. Since such big mappings happen at the boot
time and when memory is plugged/onlined (i.e. not often), these requests
come in 512 pages so we call call OPAL 512 times less which brings 20s
from the above to less than 10s. Also, since TCE caches can be flushed
entirely, calling OPAL for 512 TCEs helps skiboot [1] to decide whether
to flush the entire cache or not.

This implements 2 new iommu_table_ops callbacks:
- xchg_no_kill() to update a single TCE with no TCE invalidation;
- tce_kill() to invalidate multiple TCEs.
This uses the same xchg_no_kill() callback for IODA1/2.

This implements 2 new wrappers on top of the new callbacks similar to
the existing iommu_tce_xchg().

This does not use the new callbacks yet, the next patches will;
so this should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829085252.72370-2-aik@ozlabs.ru
2019-08-30 09:40:14 +10:00
Christoph Hellwig
f2902a2fb4 powerpc: use the generic dma coherent remap allocator
This switches to using common code for the DMA allocations, including
potential use of the CMA allocator if configured.

Switching to the generic code enables DMA allocations from atomic
context, which is required by the DMA API documentation, and also
adds various other minor features drivers start relying upon.  It
also makes sure we have on tested code base for all architectures
that require uncached pte bits for coherent DMA allocations.

Another advantage is that consistent memory allocations now share
the general vmalloc pool instead of needing an explicit careout
from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> # tested on 8xx
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190814132230.31874-2-hch@lst.de
2019-08-28 23:19:34 +10:00