Commit Graph

2496 Commits

Author SHA1 Message Date
Christophe Leroy
0f9aee0cb9 powerpc/mm: Don't log user reads to 0xffffffff
Running vdsotest leaves many times the following log:

  [   79.629901] vdsotest[396]: User access of kernel address (ffffffff) - exploit attempt? (uid: 0)

A pointer set to (-1) is likely a programming error similar to
a NULL pointer and is not worth logging as an exploit attempt.

Don't log user accesses to 0xffffffff.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0728849e826ba16f1fbd6fa7f5c6cc87bd64e097.1577087627.git.christophe.leroy@c-s.fr
2020-01-27 22:37:24 +11:00
Christophe Leroy
cd08f109e2 powerpc/32s: Enable CONFIG_VMAP_STACK
A few changes to retrieve DAR and DSISR from struct regs
instead of retrieving them directly, as they may have
changed due to a TLB miss.

Also modifies hash_page() and friends to work with virtual
data addresses instead of physical ones. Same on load_up_fpu()
and load_up_altivec().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Fix tovirt_vmstack call in head_32.S to fix CHRP build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2e2509a242fd5f3e23df4a06530c18060c4d321e.1576916812.git.christophe.leroy@c-s.fr
2020-01-27 22:37:24 +11:00
Jordan Niethe
736bcdd3a9 powerpc/mm: Remove kvm radix prefetch workaround for Power9 DD2.2
Commit a25bd72bad ("powerpc/mm/radix: Workaround prefetch issue with
KVM") introduced a number of workarounds as coming out of a guest with
the mmu enabled would make the cpu would start running in hypervisor
state with the PID value from the guest. The cpu will then start
prefetching for the hypervisor with that PID value.

In Power9 DD2.2 the cpu behaviour was modified to fix this. When
accessing Quadrant 0 in hypervisor mode with LPID != 0 prefetching will
not be performed. This means that we can get rid of the workarounds for
Power9 DD2.2 and later revisions. Add a new cpu feature
CPU_FTR_P9_RADIX_PREFETCH_BUG to indicate if the workarounds are needed.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191206031722.25781-1-jniethe5@gmail.com
2020-01-26 00:11:37 +11:00
Christophe Leroy
def0bfdbd6 powerpc: use probe_user_read() and probe_user_write()
Instead of opencoding, use probe_user_read() to failessly read
a user location and probe_user_write() for writing to user.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/e041f5eedb23f09ab553be8a91c3de2087147320.1579800517.git.christophe.leroy@c-s.fr
2020-01-26 00:11:35 +11:00
Christophe Leroy
991d656d72 powerpc/8xx: Fix permanently mapped IMMR region.
When not using large TLBs, the IMMR region is still
mapped as a whole block in the FIXMAP area.

Properly report that the IMMR region is block-mapped even
when not using large TLBs.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/45f4f414bcd7198b0755cf4287ff216fbfc24b9d.1574774187.git.christophe.leroy@c-s.fr
2020-01-23 21:31:14 +11:00
Christophe Leroy
d80ae83f1f powerpc/ptdump: Fix W+X verification
Verification cannot rely on simple bit checking because on some
platforms PAGE_RW is 0, checking that a page is not W means
checking that PAGE_RO is set instead of checking that PAGE_RW
is not set.

Use pte helpers instead of checking bits.

Fixes: 453d87f6a8 ("powerpc/mm: Warn if W+X pages found on boot")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0d894839fdbb19070f0e1e4140363be4f2bb62fc.1578989540.git.christophe.leroy@c-s.fr
2020-01-23 21:31:13 +11:00
Christophe Leroy
e26ad936dd powerpc/ptdump: Fix W+X verification call in mark_rodata_ro()
ptdump_check_wx() also have to be called when pages are mapped
by blocks.

Fixes: 453d87f6a8 ("powerpc/mm: Warn if W+X pages found on boot")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/37517da8310f4457f28921a4edb88fb21d27b62a.1578989531.git.christophe.leroy@c-s.fr
2020-01-23 21:31:12 +11:00
Christophe Leroy
1e1c8b2cc3 powerpc/ptdump: don't entirely rebuild kernel when selecting CONFIG_PPC_DEBUG_WX
Selecting CONFIG_PPC_DEBUG_WX only impacts ptdump and pgtable_32/64
init calls. Declaring related functions in asm/pgtable.h implies
rebuilding almost everything.

Move ptdump_check_wx() declaration in mm/mmu_decl.h

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bf34fd9dca61eadf9a134a9f89ebbc162cfd5f86.1578986011.git.christophe.leroy@c-s.fr
2020-01-23 21:31:11 +11:00
Russell Currey
970d54f99c powerpc/book3s64/hash: Disable 16M linear mapping size if not aligned
With STRICT_KERNEL_RWX on in a relocatable kernel under the hash MMU,
if the position the kernel is loaded at is not 16M aligned things go
horribly wrong. Specifically hash__mark_initmem_nx() will call
hash__change_memory_range() which then aligns down the start address,
and due to the text not being 16M aligned causes some of the kernel
text to be marked non-executable.

We can avoid this when selecting the linear mapping size, so do so and
print a warning. I tested this for various alignments and as long as
the position is 64K aligned it's fine (the base requirement for
powerpc).

Signed-off-by: Russell Currey <ruscur@russell.cc>
[mpe: Add details of the failure mode]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191224064126.183670-1-ruscur@russell.cc
2020-01-16 14:59:36 +10:00
David Hildenbrand
feee6b2989 mm/memory_hotplug: shrink zones when offlining memory
We currently try to shrink a single zone when removing memory.  We use
the zone of the first page of the memory we are removing.  If that
memmap was never initialized (e.g., memory was never onlined), we will
read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     __remove_pages+0x4b/0x640
     arch_remove_memory+0x63/0x8d
     try_remove_memory+0xdb/0x130
     __remove_memory+0xa/0x11
     acpi_memory_device_remove+0x70/0x100
     acpi_bus_trim+0x55/0x90
     acpi_device_hotplug+0x227/0x3a0
     acpi_hotplug_work_fn+0x1a/0x30
     process_one_work+0x221/0x550
     worker_thread+0x50/0x3b0
     kthread+0x105/0x140
     ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone(() for that.  We now
properly shrink the zones, even if we have DIMMs whereby

 - Some memory blocks fall into no zone (never onlined)

 - Some memory blocks fall into multiple zones (offlined+re-onlined)

 - Multiple memory blocks that fall into different zones

Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().

Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: <stable@vger.kernel.org>	[5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-04 13:55:08 -08:00
Michael Ellerman
91a063c956 powerpc/mm: Mark get_slice_psize() & slice_addr_is_low() as notrace
These slice routines are called from the SLB miss handler, which can
lead to warnings from the IRQ code, because we have not reconciled the
IRQ state properly:

  WARNING: CPU: 72 PID: 30150 at arch/powerpc/kernel/irq.c:258 arch_local_irq_restore.part.0+0xcc/0x100
  Modules linked in:
  CPU: 72 PID: 30150 Comm: ftracetest Not tainted 5.5.0-rc2-gcc9x-g7e0165b2f1a9 #1
  NIP:  c00000000001d83c LR: c00000000029ab90 CTR: c00000000026cf90
  REGS: c0000007eee3b960 TRAP: 0700   Not tainted  (5.5.0-rc2-gcc9x-g7e0165b2f1a9)
  MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 22242844  XER: 20000000
  CFAR: c00000000001d780 IRQMASK: 0
  ...
  NIP arch_local_irq_restore.part.0+0xcc/0x100
  LR  trace_graph_entry+0x270/0x340
  Call Trace:
    trace_graph_entry+0x254/0x340 (unreliable)
    function_graph_enter+0xe4/0x1a0
    prepare_ftrace_return+0xa0/0x130
    ftrace_graph_caller+0x44/0x94	# (get_slice_psize())
    slb_allocate_user+0x7c/0x100
    do_slb_fault+0xf8/0x300
    instruction_access_slb_common+0x140/0x180

Fixes: 48e7b76957 ("powerpc/64s/hash: Convert SLB miss handlers to C")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191221121337.4894-1-mpe@ellerman.id.au
2019-12-23 21:12:51 +11:00
Christophe Leroy
0601546f23 powerpc/8xx: fix bogus __init on mmu_mapin_ram_chunk()
Remove __init qualifier for mmu_mapin_ram_chunk() as it is called by
mmu_mark_initmem_nx() and mmu_mark_rodata_ro() which are not __init
functions.

At the same time, mark it static as it is only used in this file.

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: a2227a2777 ("powerpc/32: Don't populate page tables for block mapped pages except on the 8xx")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/56648921986a6b3e7315b1fbbf4684f21bd2dea8.1576310997.git.christophe.leroy@c-s.fr
2019-12-16 23:15:16 +11:00
Mike Rapoport
8fabc62323 powerpc: Ensure that swiotlb buffer is allocated from low memory
Some powerpc platforms (e.g. 85xx) limit DMA-able memory way below 4G.
If a system has more physical memory than this limit, the swiotlb
buffer is not addressable because it is allocated from memblock using
top-down mode.

Force memblock to bottom-up mode before calling swiotlb_init() to
ensure that the swiotlb buffer is DMA-able.

Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191204123524.22919-1-rppt@kernel.org
2019-12-13 21:27:55 +11:00
Aneesh Kumar K.V
6f4679b956 powerpc/pmem: Fix kernel crash due to wrong range value usage in flush_dcache_range
This patch fix the below kernel crash.

 BUG: Unable to handle kernel data access on read at 0xc000000380000000
 Faulting instruction address: 0xc00000000008b6f0
cpu 0x5: Vector: 300 (Data Access) at [c0000000d8587790]
    pc: c00000000008b6f0: arch_remove_memory+0x150/0x210
    lr: c00000000008b720: arch_remove_memory+0x180/0x210
    sp: c0000000d8587a20
   msr: 800000000280b033
   dar: c000000380000000
 dsisr: 40000000
  current = 0xc0000000d8558600
  paca    = 0xc00000000fff8f00   irqmask: 0x03   irq_happened: 0x01
    pid   = 1220, comm = ndctl
enter ? for help
 memunmap_pages+0x33c/0x410
 devm_action_release+0x30/0x50
 release_nodes+0x30c/0x3a0
 device_release_driver_internal+0x178/0x240
 unbind_store+0x74/0x190
 drv_attr_store+0x44/0x60
 sysfs_kf_write+0x74/0xa0
 kernfs_fop_write+0x1b0/0x260
 __vfs_write+0x3c/0x70
 vfs_write+0xe4/0x200
 ksys_write+0x7c/0x140
 system_call+0x5c/0x68

Fixes: 076265907c ("powerpc: Chunk calls to flush_dcache_range in arch_*_memory")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191204052909.59145-1-aneesh.kumar@linux.ibm.com
2019-12-05 00:11:45 +11:00
Linus Torvalds
596cf45cbf Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "Incoming:

   - a small number of updates to scripts/, ocfs2 and fs/buffer.c

   - most of MM

  I still have quite a lot of material (mostly not MM) staged after
  linux-next due to -next dependencies. I'll send those across next week
  as the preprequisites get merged up"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
  mm/page_io.c: annotate refault stalls from swap_readpage
  mm/Kconfig: fix trivial help text punctuation
  mm/Kconfig: fix indentation
  mm/memory_hotplug.c: remove __online_page_set_limits()
  mm: fix typos in comments when calling __SetPageUptodate()
  mm: fix struct member name in function comments
  mm/shmem.c: cast the type of unmap_start to u64
  mm: shmem: use proper gfp flags for shmem_writepage()
  mm/shmem.c: make array 'values' static const, makes object smaller
  userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
  fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
  userfaultfd: wrap the common dst_vma check into an inlined function
  userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
  userfaultfd: use vma_pagesize for all huge page size calculation
  mm/madvise.c: use PAGE_ALIGN[ED] for range checking
  mm/madvise.c: replace with page_size() in madvise_inject_error()
  mm/mmap.c: make vma_merge() comment more easy to understand
  mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
  autonuma: reduce cache footprint when scanning page tables
  autonuma: fix watermark checking in migrate_balanced_pgdat()
  ...
2019-12-01 20:36:41 -08:00
Mike Kravetz
997cdcb068 powerpc/mm: remove pmd_huge/pud_huge stubs and include hugetlb.h
Patch series "hugetlbfs: convert macros to static inline, fix sparse
warning".

The definition for huge_pte_offset() in <linux/hugetlb.h> causes a
sparse warning in the !CONFIG_HUGETLB_PAGE.  Fix this as well as
converting all macros in this block of definitions to static inlines for
better type checking.

When making the above changes, build errors were found in powerpc due to
duplicate definitions.  A separate powerpc specific patch is included as
a requisite to remove the definitions and get them from
<linux/hugetlb.h>.

This patch (of 2):

This removes the power specific stubs created by commit aad71e3928
("powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n") used when
!CONFIG_HUGETLB_PAGE.  Instead, it addresses the build break by getting
the definitions from <linux/hugetlb.h>.  This allows the macros in
<linux/hugetlb.h> to be replaced with static inlines.

Link: http://lkml.kernel.org/r/20191112194558.139389-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ben Dooks <ben.dooks@codethink.co.uk>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:08 -08:00
Linus Torvalds
7794b1d418 powerpc updates for 5.5
Highlights:
 
  - Infrastructure for secure boot on some bare metal Power9 machines. The
    firmware support is still in development, so the code here won't actually
    activate secure boot on any existing systems.
 
  - A change to xmon (our crash handler / pseudo-debugger) to restrict it to
    read-only mode when the kernel is lockdown'ed, otherwise it's trivial to drop
    into xmon and modify kernel data, such as the lockdown state.
 
  - Support for KASLR on 32-bit BookE machines (Freescale / NXP).
 
  - Fixes for our flush_icache_range() and __kernel_sync_dicache() (VDSO) to work
    with memory ranges >4GB.
 
  - Some reworks of the pseries CMM (Cooperative Memory Management) driver to
    make it behave more like other balloon drivers and enable some cleanups of
    generic mm code.
 
  - A series of fixes to our hardware breakpoint support to properly handle
    unaligned watchpoint addresses.
 
 Plus a bunch of other smaller improvements, fixes and cleanups.
 
 Thanks to:
   Alastair D'Silva, Andrew Donnellan, Aneesh Kumar K.V, Anthony Steinhauser,
   Cédric Le Goater, Chris Packham, Chris Smart, Christophe Leroy, Christopher M.
   Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens, David Hildenbrand,
   Deb McLemore, Diana Craciun, Eric Richter, Geert Uytterhoeven, Greg
   Kroah-Hartman, Greg Kurz, Gustavo L. F. Walbon, Hari Bathini, Harish, Jason
   Yan, Krzysztof Kozlowski, Leonardo Bras, Mathieu Malaterre, Mauro S. M.
   Rodrigues, Michal Suchanek, Mimi Zohar, Nathan Chancellor, Nathan Lynch, Nayna
   Jain, Nick Desaulniers, Oliver O'Halloran, Qian Cai, Rasmus Villemoes, Ravi
   Bangoria, Sam Bobroff, Santosh Sivaraj, Scott Wood, Thomas Huth, Tyrel
   Datwyler, Vaibhav Jain, Valentin Longchamp, YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl3hBycTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgApBEACk91MEQDYJ9MF9I6uN+85qb5p4pMsp
 rGzqnpt+XFidbDAc3eP63pYfIDSo3jtkQ2YL7shAnDOTvkO0md+Vqkl9Aq/G6FIf
 lDBlwbgkXMSxS/O2Lpvfn4NZAoK6dKmiV55LSgfliM62X3e2Saeg6TR55wBTgJ6/
 SlYPDwZfcVHOAiFS3UmfB+hkiIZk+AI5Zr5VAZvT2ZmeH36yAWkq4JgJI1uAk6m1
 /7iCnlfUjx/nl/BhnA3kjjmAgGCJ5s/WuVgwFMz47XpMBWGBhLWpMh/NqDTFb8ca
 kpiVQoVPQe2xyO3pL/kOwBy6sii26ftfHDhLKMy1hJdEhVQzS5LerPIMeh1qsU8Q
 hV/Cj+jfsrS/vBDOehj3jwx93t+861PmTOqgLnpYQ6Ltrt+2B/74+fufGMHE1kI3
 Ffo7xvNw4sw6bSziDxDFqUx2P1dFN5D5EJsJsYM98ekkVAAkzNqCDRvfD2QI8Pif
 VXWPYXqtNJTrVPJA0D7Yfo9FDNwhANd0f1zi7r/U5mVXBFUyKOlGqTQSkXgMrVeK
 3I7wHPOVGgdA5UUkfcd3pcuqsY081U9E//o5PUfj8ybO5JCwly8NoatbG+xHmKia
 a72uJT8MjCo9mGCHKDrwi9l/kqms6ZSv8RP+yMhGuB52YoiGc6PpVyab5jXIUd1N
 yTtBlC0YGW1JYw==
 =JHzg
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights:

   - Infrastructure for secure boot on some bare metal Power9 machines.
     The firmware support is still in development, so the code here
     won't actually activate secure boot on any existing systems.

   - A change to xmon (our crash handler / pseudo-debugger) to restrict
     it to read-only mode when the kernel is lockdown'ed, otherwise it's
     trivial to drop into xmon and modify kernel data, such as the
     lockdown state.

   - Support for KASLR on 32-bit BookE machines (Freescale / NXP).

   - Fixes for our flush_icache_range() and __kernel_sync_dicache()
     (VDSO) to work with memory ranges >4GB.

   - Some reworks of the pseries CMM (Cooperative Memory Management)
     driver to make it behave more like other balloon drivers and enable
     some cleanups of generic mm code.

   - A series of fixes to our hardware breakpoint support to properly
     handle unaligned watchpoint addresses.

  Plus a bunch of other smaller improvements, fixes and cleanups.

  Thanks to: Alastair D'Silva, Andrew Donnellan, Aneesh Kumar K.V,
  Anthony Steinhauser, Cédric Le Goater, Chris Packham, Chris Smart,
  Christophe Leroy, Christopher M. Riedl, Christoph Hellwig, Claudio
  Carvalho, Daniel Axtens, David Hildenbrand, Deb McLemore, Diana
  Craciun, Eric Richter, Geert Uytterhoeven, Greg Kroah-Hartman, Greg
  Kurz, Gustavo L. F. Walbon, Hari Bathini, Harish, Jason Yan, Krzysztof
  Kozlowski, Leonardo Bras, Mathieu Malaterre, Mauro S. M. Rodrigues,
  Michal Suchanek, Mimi Zohar, Nathan Chancellor, Nathan Lynch, Nayna
  Jain, Nick Desaulniers, Oliver O'Halloran, Qian Cai, Rasmus Villemoes,
  Ravi Bangoria, Sam Bobroff, Santosh Sivaraj, Scott Wood, Thomas Huth,
  Tyrel Datwyler, Vaibhav Jain, Valentin Longchamp, YueHaibing"

* tag 'powerpc-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (144 commits)
  powerpc/fixmap: fix crash with HIGHMEM
  x86/efi: remove unused variables
  powerpc: Define arch_is_kernel_initmem_freed() for lockdep
  powerpc/prom_init: Use -ffreestanding to avoid a reference to bcmp
  powerpc: Avoid clang warnings around setjmp and longjmp
  powerpc: Don't add -mabi= flags when building with Clang
  powerpc: Fix Kconfig indentation
  powerpc/fixmap: don't clear fixmap area in paging_init()
  selftests/powerpc: spectre_v2 test must be built 64-bit
  powerpc/powernv: Disable native PCIe port management
  powerpc/kexec: Move kexec files into a dedicated subdir.
  powerpc/32: Split kexec low level code out of misc_32.S
  powerpc/sysdev: drop simple gpio
  powerpc/83xx: map IMMR with a BAT.
  powerpc/32s: automatically allocate BAT in setbat()
  powerpc/ioremap: warn on early use of ioremap()
  powerpc: Add support for GENERIC_EARLY_IOREMAP
  powerpc/fixmap: Use __fix_to_virt() instead of fix_to_virt()
  powerpc/8xx: use the fixmapped IMMR in cpm_reset()
  powerpc/8xx: add __init to cpm1 init functions
  ...
2019-11-30 14:35:43 -08:00
Christophe Leroy
2807273f5e powerpc/fixmap: fix crash with HIGHMEM
Commit f2bb86937d ("powerpc/fixmap: don't clear fixmap area in
paging_init()") removed the clearing of fixmap area in order to
avoid clearing fixmapped areas set earlier.

However unlike all other users of fixmap which use __set_fixmap(),
HIGHMEM functions directly use __set_pte_at(). This means
the page table must pre-exist, otherwise the following crash
can be encoutered due to the lack of entry in the PGD.

Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=4K MMU=Hash PowerMac
Modules linked in:
CPU: 0 PID: 1 Comm: swapper Not tainted 5.4.0+ #2528
NIP:  c0144ce8 LR: c0144ccc CTR: 00000080
REGS: ef0b5aa0 TRAP: 0300   Not tainted  (5.4.0+)
MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 44282842  XER: 00000000
DAR: fffdf000 DSISR: 42000000
GPR00: c0144ccc ef0b5b58 ef0b0000 fffdf000 fffdf000 00000000 c0000f7c 00000000
GPR08: c0833000 fffdf000 00000000 ef1c53c9 24042842 00000000 00000000 00000000
GPR16: 00000000 00000000 ef7e7358 effe8160 00000000 c08a9660 c0851644 00000004
GPR24: c08c70a8 00002dc2 00000000 00000001 00000201 effe8160 effe8160 00000000
NIP [c0144ce8] prep_new_page+0x138/0x178
LR [c0144ccc] prep_new_page+0x11c/0x178
Call Trace:
[ef0b5b58] [c0144ccc] prep_new_page+0x11c/0x178 (unreliable)
[ef0b5b88] [c0147218] get_page_from_freelist+0x1fc/0xd88
[ef0b5c38] [c0148328] __alloc_pages_nodemask+0xd4/0xbb4
[ef0b5cf8] [c0142ba8] __vmalloc_node_range+0x1b4/0x2e0
[ef0b5d38] [c0142dd0] vzalloc+0x48/0x58
[ef0b5d58] [c0301c8c] check_partition+0x58/0x244
[ef0b5d78] [c02ffe80] blk_add_partitions+0x44/0x2cc
[ef0b5db8] [c01a32d8] bdev_disk_changed+0x68/0xfc
[ef0b5de8] [c01a4494] __blkdev_get+0x290/0x460
[ef0b5e28] [c02fdd40] __device_add_disk+0x480/0x4d8
[ef0b5e68] [c0810688] brd_init+0xc0/0x188
[ef0b5e88] [c0005194] do_one_initcall+0x40/0x19c
[ef0b5ee8] [c07dd4dc] kernel_init_freeable+0x164/0x230
[ef0b5f28] [c0005408] kernel_init+0x18/0x10c
[ef0b5f38] [c0014274] ret_from_kernel_thread+0x14/0x1c

Partially revert that commit to still clear the fixmap area dedicated
to HIGHMEM.

Fixes: f2bb86937d ("powerpc/fixmap: don't clear fixmap area in paging_init()")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d42fa9747df5afa41e67b08e374c98d3b40529c9.1574927918.git.christophe.leroy@c-s.fr
2019-11-29 22:41:15 +11:00
Christophe Leroy
f2bb86937d powerpc/fixmap: don't clear fixmap area in paging_init()
fixmap is intended to map things permanently like the IMMR region on
FSL SOC (8xx, 83xx, ...), so don't clear it when initialising paging()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/41c99bc06394a6bc2888631cb98a3ed2ae281ddb.1568295907.git.christophe.leroy@c-s.fr
2019-11-25 21:45:43 +11:00
Christoph Hellwig
d7293f79ca Merge branch 'for-next/zone-dma' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into dma-mapping-for-next
Pull in a stable branch from the arm64 tree that adds the zone_dma_bits
variable to avoid creating hard to resolve conflicts with that addition.
2019-11-21 18:13:03 +01:00
Christoph Hellwig
56e35f9c5b dma-mapping: drop the dev argument to arch_sync_dma_for_*
These are pure cache maintainance routines, so drop the unused
struct device argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2019-11-20 20:31:38 +01:00
Christophe Leroy
cbcaff7d27 powerpc/32s: automatically allocate BAT in setbat()
If no BAT is given to setbat(), select an available BAT.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a212bd36fbd6179e0929b6c727febc35132ac25c.1568665466.git.christophe.leroy@c-s.fr
2019-11-19 19:38:38 +11:00
Christophe Leroy
d538aadc27 powerpc/ioremap: warn on early use of ioremap()
Powerpc now has EARLY_IOREMAP.

Next step is to convert all early users of ioremap() to
early_ioremap().

Add a warning to help locate those users.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b4f03a68ee8e68773c8973d74ec35f9c82c72871.1568295907.git.christophe.leroy@c-s.fr
2019-11-19 19:38:38 +11:00
Christophe Leroy
a2227a2777 powerpc/32: Don't populate page tables for block mapped pages except on the 8xx.
Commit d2f15e0979 ("powerpc/32: always populate page tables for
Abatron BDI.") wrongly sets page tables for any PPC32 for using BDI,
and does't update them after init (remove RX on init section, set
text and rodata read-only)

Only the 8xx requires page tables to be populated for using the BDI.
They also need to be populated in order to see the mappings in
/sys/kernel/debug/kernel_page_tables

On BOOK3S_32, pages that are not mapped by page tables are mapped
by BATs. The BDI knows BATs and they can be viewed in
/sys/kernel/debug/powerpc/block_address_translation

Only set pagetables for RAM and IMMR on the 8xx and properly update
them at the end of init.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c8610942203e0d93fcb02ad20c57edd3adb4c9d3.1566554029.git.christophe.leroy@c-s.fr
2019-11-18 22:27:52 +11:00
Christophe Leroy
46ddcb3950 powerpc/mm: Show if a bad page fault on data is read or write.
DSISR (or ESR on some CPUs) has a bit to tell if the fault is due to a
read or a write.

Display it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4f88d7e6fda53b5f80a71040ab400242f6c8cb93.1566400889.git.christophe.leroy@c-s.fr
2019-11-18 22:27:51 +11:00
Michael Ellerman
3df191118b Merge branch 'topic/kaslr-book3e32' into next
This is a slight rebase of Scott's next branch, which contained the
KASLR support for book3e 32-bit, to squash in a couple of small fixes.

See the	original pull request:
  https://lore.kernel.org/r/20191022232155.GA26174@home.buserror.net
2019-11-14 19:23:33 +11:00
Jason Yan
8c2ae87be5 powerpc/fsl_booke/kaslr: support nokaslr cmdline parameter
One may want to disable kaslr when boot, so provide a cmdline parameter
'nokaslr' to support this.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:47 +11:00
Jason Yan
b396097200 powerpc/fsl_booke/kaslr: clear the original kernel if randomized
The original kernel still exists in the memory, clear it now.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:44 +11:00
Jason Yan
6a38ea1d7b powerpc/fsl_booke/32: randomize the kernel image offset
After we have the basic support of relocate the kernel in some
appropriate place, we can start to randomize the offset now.

Entropy is derived from the banner and timer, which will change every
build and boot. This not so much safe so additionally the bootloader may
pass entropy via the /chosen/kaslr-seed node in device tree.

We will use the first 512M of the low memory to randomize the kernel
image. The memory will be split in 64M zones. We will use the lower 8
bit of the entropy to decide the index of the 64M zone. Then we chose a
16K aligned offset inside the 64M zone to put the kernel in.

We also check if we will overlap with some areas like the dtb area, the
initrd area or the crashkernel area. If we cannot find a proper area,
kaslr will be disabled and boot from the original kernel.

Some pieces of code are derived from arch/x86/boot/compressed/kaslr.c or
arch/arm64/kernel/kaslr.c such as rotate_xor(). Credit goes to Kees and
Ard.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:41 +11:00
Jason Yan
2b0e86cc5d powerpc/fsl_booke/32: implement KASLR infrastructure
This patch add support to boot kernel from places other than KERNELBASE.
Since CONFIG_RELOCATABLE has already supported, what we need to do is
map or copy kernel to a proper place and relocate. Freescale Book-E
parts expect lowmem to be mapped by fixed TLB entries(TLB1). The TLB1
entries are not suitable to map the kernel directly in a randomized
region, so we chose to copy the kernel to a proper place and restart to
relocate.

The offset of the kernel was not randomized yet(a fixed 64M is set). We
will randomize it in the next patch.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
[mpe: Use PTRRELOC() in early_init()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:40 +11:00
Jason Yan
c061b38a3e powerpc/fsl_booke/32: introduce reloc_kernel_entry() helper
Add a new helper reloc_kernel_entry() to jump back to the start of the
new kernel. After we put the new kernel in a randomized place we can use
this new helper to enter the kernel and begin to relocate again.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:37 +11:00
Jason Yan
aa1d2090e6 powerpc/fsl_booke/32: introduce create_kaslr_tlb_entry() helper
Add a new helper create_kaslr_tlb_entry() to create a tlb entry by the
virtual and physical address. This is a preparation to support boot kernel
at a randomized address.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:34 +11:00
Jason Yan
39f4b7bf75 powerpc: introduce kernstart_virt_addr to store the kernel base
Now the kernel base is a fixed value - KERNELBASE. To support KASLR, we
need a variable to store the kernel base.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:32 +11:00
Jason Yan
4ed47dbefa powerpc: move memstart_addr and kernstart_addr to init-common.c
These two variables are both defined in init_32.c and init_64.c. Move
them to init-common.c and make them __ro_after_init.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-11-13 19:27:28 +11:00
Alastair D'Silva
ea458effa8 powerpc: Don't flush caches when adding memory
This operation takes a significant amount of time when hotplugging
large amounts of memory (~50 seconds with 890GB of persistent memory).

This was orignally in commit fb5924fddf
("powerpc/mm: Flush cache on memory hot(un)plug") to support memtrace,
but the flush on add is not needed as it is flushed on remove.

Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191104023305.9581-7-alastair@au1.ibm.com
2019-11-07 23:35:41 +11:00
Alastair D'Silva
076265907c powerpc: Chunk calls to flush_dcache_range in arch_*_memory
When presented with large amounts of memory being hotplugged
(in my test case, ~890GB), the call to flush_dcache_range takes
a while (~50 seconds), triggering RCU stalls.

This patch breaks up the call into 1GB chunks, calling
cond_resched() inbetween to allow the scheduler to run.

Fixes: fb5924fddf ("powerpc/mm: Flush cache on memory hot(un)plug")
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191104023305.9581-6-alastair@au1.ibm.com
2019-11-07 23:35:41 +11:00
Alastair D'Silva
23eb7f560a powerpc: Convert flush_icache_range & friends to C
Similar to commit 22e9c88d48
("powerpc/64: reuse PPC32 static inline flush_dcache_range()")
this patch converts the following ASM symbols to C:
    flush_icache_range()
    __flush_dcache_icache()
    __flush_dcache_icache_phys()

This was done as we discovered a long-standing bug where the length of the
range was truncated due to using a 32 bit shift instead of a 64 bit one.

By converting these functions to C, it becomes easier to maintain.

flush_dcache_icache_phys() retains a critical assembler section as we must
ensure there are no memory accesses while the data MMU is disabled
(authored by Christophe Leroy). Since this has no external callers, it has
also been made static, allowing the compiler to inline it within
flush_dcache_icache_page().

Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[mpe: Minor fixups, don't export __flush_dcache_icache()]
Link: https://lore.kernel.org/r/20191104023305.9581-5-alastair@au1.ibm.com
2019-11-07 23:35:37 +11:00
Aneesh Kumar K.V
16f6b67cf0 powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
With large memory (8TB and more) hotplug, we can get soft lockup
warnings as below. These were caused by a long loop without any
explicit cond_resched which is a problem for !PREEMPT kernels.

Avoid this using cond_resched() while inserting hash page table
entries. We already do similar cond_resched() in __add_pages(), see
commit f64ac5e6e3 ("mm, memory_hotplug: add scheduling point to
__add_pages").

  rcu:     3-....: (24002 ticks this GP) idle=13e/1/0x4000000000000002 softirq=722/722 fqs=12001
   (t=24003 jiffies g=4285 q=2002)
  NMI backtrace for cpu 3
  CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2
  Call Trace:
    dump_stack+0xb0/0xf4 (unreliable)
    nmi_cpu_backtrace+0x124/0x130
    nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
    arch_trigger_cpumask_backtrace+0x28/0x3c
    rcu_dump_cpu_stacks+0xf8/0x154
    rcu_sched_clock_irq+0x878/0xb40
    update_process_times+0x48/0x90
    tick_sched_handle.isra.16+0x4c/0x80
    tick_sched_timer+0x68/0xe0
    __hrtimer_run_queues+0x180/0x430
    hrtimer_interrupt+0x110/0x300
    timer_interrupt+0x108/0x2f0
    decrementer_common+0x114/0x120
  --- interrupt: 901 at arch_add_memory+0xc0/0x130
      LR = arch_add_memory+0x74/0x130
    memremap_pages+0x494/0x650
    devm_memremap_pages+0x3c/0xa0
    pmem_attach_disk+0x188/0x750
    nvdimm_bus_probe+0xac/0x2c0
    really_probe+0x148/0x570
    driver_probe_device+0x19c/0x1d0
    device_driver_attach+0xcc/0x100
    bind_store+0x134/0x1c0
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1a0/0x270
    __vfs_write+0x3c/0x70
    vfs_write+0xd0/0x260
    ksys_write+0xdc/0x130
    system_call+0x5c/0x68

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191001084656.31277-1-aneesh.kumar@linux.ibm.com
2019-11-05 22:24:57 +11:00
Aneesh Kumar K.V
864edb758c powerpc/mm/book3s64/radix: Flush the full mm even when need_flush_all is set
With the previous patch, we should now not be using need_flush_all for
powerpc. But then make sure we force a PID tlbie flush with RIC=2 if
we ever find need_flush_all set. Also don't reset it after a mmu
gather flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191024075801.22434-3-aneesh.kumar@linux.ibm.com
2019-11-05 22:24:18 +11:00
Aneesh Kumar K.V
52162ec784 powerpc/mm/book3s64/radix: Use freed_tables instead of need_flush_all
With commit 22a61c3c4f ("asm-generic/tlb: Track freeing of
page-table directories in struct mmu_gather") we now track whether we
freed page table in mmu_gather. Use that to decide whether to flush
Page Walk Cache.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191024075801.22434-2-aneesh.kumar@linux.ibm.com
2019-11-05 22:23:55 +11:00
Aneesh Kumar K.V
a42d6ba8c5 powerpc/mm/book3s64/radix: Remove unused code.
mm_tlb_flush_nested change was added in the mmu gather tlb flush to
handle the case of parallel pte invalidate happening with mmap_sem
held in read mode. This fix was done by commit
02390f66bd ("powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush
miss problem with THP") and the problem is explained in detail in
commit 99baac21e4 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
problem")

This was later updated by commit 7a30df49f6 ("mm: mmu_gather: remove
__tlb_reset_range() for force flush") to do a full mm flush rather
than a range flush. By commit dd2283f260 ("mm: mmap: zap pages with
read mmap_sem in munmap") we are also now allowing a page table free
in mmap_sem read mode which means we should do a PWC flush too. Our
current full mm flush imply a PWC flush.

With all the above change the mm_tlb_flush_nested(mm) branch in
radix__tlb_flush will never be taken because for the nested case we
would have taken the if (tlb->fullmm) branch. This patch removes the
unused code. Also, remove the gflush change in
__radix__flush_tlb_range that was added to handle the range tlb flush
code. We only check for THP there because hugetlb is flushed via a
different code path where page size is explicitly specified.

This is a partial revert of commit 02390f66bd ("powerpc/64s/radix:
Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP")

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191024075801.22434-1-aneesh.kumar@linux.ibm.com
2019-11-05 22:23:17 +11:00
Nicolas Saenz Julienne
8b5369ea58 dma/direct: turn ARCH_ZONE_DMA_BITS into a variable
Some architectures, notably ARM, are interested in tweaking this
depending on their runtime DMA addressing limitations.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-11-01 09:41:18 +00:00
Aneesh Kumar K.V
d78d5dace5 powerpc/book3s64/hash: Use secondary hash for bolted mapping if the primary is full
With bolted hash page table entry, kernel currently only use primary hash group
when inserting the hash page table entry. In the rare case where kernel find all the
8 primary hash slot occupied by bolted entries, this can result in hash page
table insert failure for bolted entries. Avoid this by using the secondary hash
group.

This is different from what kernel does for the non-bolted mapping. With
non-bolted entries kernel will try secondary before removing an existing entry
from hash page table group. With bolted prefer primary hash group and hence
try to insert the page table entry by removing a slot from primary before trying
the secondary hash group.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191024093542.29777-3-aneesh.kumar@linux.ibm.com
2019-10-28 21:54:16 +11:00
Aneesh Kumar K.V
75838a3290 powerpc/pseries: Don't fail hash page table insert for bolted mapping
If the hypervisor returned H_PTEG_FULL for H_ENTER hcall, retry a hash page table
insert by removing a random entry from the group.

After some runtime, it is very well possible to find all the 8 hash page table
entry slot in the hpte group used for mapping. Don't fail a bolted entry insert
in that case. With Storage class memory a user can find this error easily since
a namespace enable/disable is equivalent to memory add/remove.

This results in failures as reported below:

$ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M
libndctl: ndctl_dax_enable: dax1.3: failed to enable
  Error: namespace1.2: failed to enable

failed to create namespace: No such device or address

In kernel log we find the details as below:

Unable to create mapping for hot added memory 0xc000042006000000..0xc00004200d000000: -1
dax_pmem: probe of dax1.3 failed with error -14

This indicates that we failed to create a bolted hash table entry for direct-map
address backing the namespace.

We also observe failures such that not all namespaces will be enabled with
ndctl enable-namespace all command.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191024093542.29777-2-aneesh.kumar@linux.ibm.com
2019-10-28 21:54:16 +11:00
Aneesh Kumar K.V
5f5d6e40a0 powerpc/nvdimm: Update vmemmap_populated to check sub-section range
With commit: 7cc7867fb0 ("mm/devm_memremap_pages: enable sub-section remap")
pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we
can map the memmap area using 16MB hugepage size and that can cover
a memory range of 16G.

While enabling new pmem namespaces, since memory is added in sub-section chunks,
before creating a new memmap mapping, kernel should check whether there is an
existing memmap mapping covering the new pmem namespace. Currently, this is
validated by checking whether the section covering the range is already
initialized or not. Considering there can be multiple namespaces in the same
section this can result in wrong validation. Update this to check for
sub-sections in the range. This is done by checking for all pfns in the range we
are mapping.

We could optimize this by checking only just one pfn in each sub-section. But
since this is not fast-path we keep this simple.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190917123851.22553-1-aneesh.kumar@linux.ibm.com
2019-10-28 21:54:15 +11:00
Qian Cai
29674a1c71 powerpc/pkeys: remove unused pkey_allows_readwrite
pkey_allows_readwrite() was first introduced in the commit 5586cf61e1
("powerpc: introduce execute-only pkey"), but the usage was removed
entirely in the commit a4fcc877d4 ("powerpc/pkeys: Preallocate
execute-only key").

Found by the "-Wunused-function" compiler warning flag.

Fixes: a4fcc877d4 ("powerpc/pkeys: Preallocate execute-only key")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1568733750-14580-1-git-send-email-cai@lca.pw
2019-10-11 19:34:10 +11:00
Linus Torvalds
a3c0e7b1fe libnvdimm fixes v5.4-rc1
- Complete the reworks to interoperate with powerpc dynamic huge page sizes
 
 - Fix a crash due to missed accounting for the powerpc 'struct
   page'-memmap mapping granularity.
 
 - Fix badblock initialization for volatile (DRAM emulated) pmem ranges.
 
 - Stop triggering request_key() notifications to userspace when
   NVDIMM-security is disabled / not present.
 
 - Miscellaneous small fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdkAprAAoJEB7SkWpmfYgCjXoQAIwJE1VzNP1V+ARxfs1rTGVz
 pbNJiBnj4gxDaCkcKoatiadRkytUxeUNEcPslEKsfoNinXYqkpjMQoWm2VpILOMU
 nY+SvIudGRnuesq2/Y+CP8zrX6rV4eBDfHK05RN/Zp1IlW7pTDItUx8mJ7glmDwG
 PW0vkvK7yZ+dRFnpQ7QFjhA0Q3oudO5YcTVBDK5YYtDGlv69xfXqc9LW8SszJ1kU
 rhCIT1kdoL5of0TIgG5pTfmggPSQ9y1xPsKjllOHNa3m50eGOkkQLELOVzQb1frW
 cjAsPLjRDSzvdHHSLyu0Is04Q5JU2CucxHl2SXGHiOt5tigH8dk5XFxWt0Pc8EXx
 acYYiBqUXC3MomSYWeLK4BdO2cRTqcPPXgJYAqXblqr+/0ys+rFepjw+j8JkiLZa
 5UCC30l1GXEpw9u6gdCMqvvHN2gHvDB0BV82Sx8wTewJpeL18wCUJoKVuFmpsHko
 p1cCe7St1TzcK3eO+xfeW1rxNrcXUpKVYXVa/WOJW0vwErqAZ6YCdNuyJHocZzXn
 vNyIQmVDOlubsgBAI2ExxeZO6xc8UIwLhLg7XEJ0mg3k6UXA8HZxH2B2THJk1BSF
 RppodkYiMknh11sqgpGp+Hz5XSEg/jvmCdL/qRDGAwhsFhFaxDH37Kg4Qncj2/dg
 uDvDHXNCjbGpzCo3tyNx
 =Z6Fa
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

More libnvdimm updates from Dan Williams:

 - Complete the reworks to interoperate with powerpc dynamic huge page
   sizes

 - Fix a crash due to missed accounting for the powerpc 'struct
   page'-memmap mapping granularity

 - Fix badblock initialization for volatile (DRAM emulated) pmem ranges

 - Stop triggering request_key() notifications to userspace when
   NVDIMM-security is disabled / not present

 - Miscellaneous small fixups

* tag 'libnvdimm-fixes-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm/region: Enable MAP_SYNC for volatile regions
  libnvdimm: prevent nvdimm from requesting key when security is disabled
  libnvdimm/region: Initialize bad block for volatile namespaces
  libnvdimm/nfit_test: Fix acpi_handle redefinition
  libnvdimm/altmap: Track namespace boundaries in altmap
  libnvdimm: Fix endian conversion issues 
  libnvdimm/dax: Pick the right alignment default when creating dax devices
  powerpc/book3s64: Export has_transparent_hugepage() related functions.
2019-09-29 10:33:41 -07:00
Linus Torvalds
a2953204b5 powerpc fixes for 5.4 #2
An assortment of fixes that were either missed by me, or didn't arrive quite in
 time for the first v5.4 pull.
 
 Most notable is a fix for an issue with tlbie (broadcast TLB invalidation) on
 Power9, when using the Radix MMU. The tlbie can race with an mtpid (move to PID
 register, essentially MMU context switch) on another thread of the core, which
 can cause stores to continue to go to a page after it's unmapped.
 
 A fix in our KVM code to add a missing barrier, the lack of which has been
 observed to cause missed IPIs and subsequently stuck CPUs in the host.
 
 A change to the way we initialise PCR (Processor Compatibility Register) to make
 it forward compatible with future CPUs.
 
 On some older PowerVM systems our H_BLOCK_REMOVE support could oops, fix it to
 detect such systems and fallback to the old invalidation method.
 
 A fix for an oops seen on some machines when using KASAN on 32-bit.
 
 A handful of other minor fixes, and two new selftests.
 
 Thanks to:
   Alistair Popple, Aneesh Kumar K.V, Christophe Leroy, Gustavo Romero, Joel
   Stanley, Jordan Niethe, Laurent Dufour, Michael Roth, Oliver O'Halloran.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2PRGITHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgClRD/9jKIT6GjVpRMc+Dg9zHB5/Pir7gePk
 ztXKI+u15GrrXgjtWEZ1PaaXvtNIfs/IZHDQm5gyJjiBAKcGl2v+9ETaMzO5sjZ7
 GSe1F8VX/MwzRnET8Jph8w/b0cy0Q8xndkEOjcJqJ7+TF+SSWqmJEdmBfkU23jWD
 B3kW4W1x2xt/XGsX25l1HpUgpcJqzukCeYUSCdqUu2j+sXAEZmfgTRG8uD4HffzZ
 3As76TrBiJsDnkyH0qi2G1BuLXrQbAMdjTeSGi+cb0gTIunCr190gI4+Tjdu2/z7
 ywWR2ZUkueCNDcLsqXaqpZx50utPJ44//uY750sk72vixJJOVuOWM6+5HKVW83se
 /v0zkOcI9+ywNHe0vLfP3Jm/OMMHYxkIwz6kVu2NSR6sE79B9AZpBFU+Nynq7kKl
 +Hc6md/HATvR+NK6LtQKtEGydRhvxU5n3KBmjq3SQj+B/ZlU6IdgerfhUWrNvg0B
 zzHeT35X6UBpswonhkQLgqJuaWpkClK9wsUy85MuA7aub1EP8S6/X7paKoiOtAHK
 NjlXM2JYV5OKwhjGgdCiI94Bdune7yudKPdsXV3Gr8Iw7wf2bQk1p7VH+LcruyE9
 YJdXwCgN0PaoFUQh3AR4CqzzFwqDya8FQqdkFN3kqhRLVGAMq/PsV8/Tn+myTgQP
 rZnWnbfZh9BMjw==
 =dF42
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "An assortment of fixes that were either missed by me, or didn't arrive
  quite in time for the first v5.4 pull.

   - Most notable is a fix for an issue with tlbie (broadcast TLB
     invalidation) on Power9, when using the Radix MMU. The tlbie can
     race with an mtpid (move to PID register, essentially MMU context
     switch) on another thread of the core, which can cause stores to
     continue to go to a page after it's unmapped.

   - A fix in our KVM code to add a missing barrier, the lack of which
     has been observed to cause missed IPIs and subsequently stuck CPUs
     in the host.

   - A change to the way we initialise PCR (Processor Compatibility
     Register) to make it forward compatible with future CPUs.

   - On some older PowerVM systems our H_BLOCK_REMOVE support could
     oops, fix it to detect such systems and fallback to the old
     invalidation method.

   - A fix for an oops seen on some machines when using KASAN on 32-bit.

   - A handful of other minor fixes, and two new selftests.

  Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy,
  Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael
  Roth, Oliver O'Halloran"

* tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/eeh: Fix eeh eeh_debugfs_break_device() with SRIOV devices
  powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error
  powerpc/nvdimm: Use HCALL error as the return value
  selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue
  powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
  powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag
  powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions
  powerpc/pseries: Call H_BLOCK_REMOVE when supported
  powerpc/pseries: Read TLB Block Invalidate Characteristics
  KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
  powerpc/mm: Fix an Oops in kasan_mmu_init()
  powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY
  powerpc/64s: Set reserved PCR bits
  powerpc: Fix definition of PCR bits to work with old binutils
  powerpc/book3s64/radix: Remove WARN_ON in destroy_context()
  powerpc/tm: Add tm-poison test
2019-09-28 13:43:00 -07:00
Mark Rutland
b4ed71f557 mm: treewide: clarify pgtable_page_{ctor,dtor}() naming
The naming of pgtable_page_{ctor,dtor}() seems to have confused a few
people, and until recently arm64 used these erroneously/pointlessly for
other levels of page table.

To make it incredibly clear that these only apply to the PTE level, and to
align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them
to pgtable_pte_page_{ctor,dtor}().

These changes were generated with the following shell script:

----
git grep -lw 'pgtable_page_.tor' | while read FILE; do
    sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE;
    sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE;
done
----

... with the documentation re-flowed to remain under 80 columns, and
whitespace fixed up in macros to keep backslashes aligned.

There should be no functional change as a result of this patch.

Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 10:10:44 -07:00
Kefeng Wang
9ef258bad7 thp: update split_huge_page_pmd() comment
According to 78ddc53473 ("thp: rename split_huge_page_pmd() to
split_huge_pmd()"), update related comment.

Link: http://lkml.kernel.org/r/20190731033406.185285-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:10 -07:00
Matthew Wilcox (Oracle)
d8c6546b1a mm: introduce compound_nr()
Replace 1 << compound_order(page) with compound_nr(page).  Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Matthew Wilcox (Oracle)
94ad933810 mm: introduce page_shift()
Replace PAGE_SHIFT + compound_order(page) with the new page_shift()
function.  Minor improvements in readability.

[akpm@linux-foundation.org: fix build in tce_page_is_contained()]
  Link: http://lkml.kernel.org/r/201907241853.yNQTrJWd%25lkp@intel.com
Link: http://lkml.kernel.org/r/20190721104612.19120-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Aneesh Kumar K.V
cf387d9644 libnvdimm/altmap: Track namespace boundaries in altmap
With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
area. Some architectures map the memmap area with large page size. On
architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
This maps a namespace size of 16G.

When populating memmap region with 16MB page from the device area,
make sure the allocated space is not used to map resources outside this
namespace. Such usage of device area will prevent a namespace destroy.

Add resource end pnf in altmap and use that to check if the memmap area
allocation can map pfn outside the namespace. On ppc64 in such case we fallback
to allocation from memory.

This fix kernel crash reported below:

[  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
[  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
[  133.464760] Faulting instruction address: 0xc00000000007580c
[  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
[  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
.....
[  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
[  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
[  133.464910] Call Trace:
[  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
[  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
[  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
[  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
[  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
[  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
[  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
[  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
[  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
[  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
[  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
[  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
[  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
[  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250

djbw: Aneesh notes that this crash can likely be triggered in any kernel that
supports 'papr_scm', so flagging that commit for -stable consideration.

Fixes: b5beae5e22 ("powerpc/pseries: Add driver for PAPR SCM regions")
Cc: <stable@vger.kernel.org>
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Santosh Sivaraj <santosh@fossix.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Link: https://lore.kernel.org/r/20190910062826.10041-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24 10:24:12 -07:00
Aneesh Kumar K.V
a6f197f889 powerpc/book3s64: Export has_transparent_hugepage() related functions.
In later patch, we want to use hash_transparent_hugepage() in a kernel module.
Export two related functions.

Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/20190924042440.27946-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-24 10:22:29 -07:00
Aneesh Kumar K.V
047e6575ae powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
On POWER9, under some circumstances, a broadcast TLB invalidation will
fail to invalidate the ERAT cache on some threads when there are
parallel mtpidr/mtlpidr happening on other threads of the same core.
This can cause stores to continue to go to a page after it's unmapped.

The workaround is to force an ERAT flush using PID=0 or LPID=0 tlbie
flush. This additional TLB flush will cause the ERAT cache
invalidation. Since we are using PID=0 or LPID=0, we don't get
filtered out by the TLB snoop filtering logic.

We need to still follow this up with another tlbie to take care of
store vs tlbie ordering issue explained in commit:
a5d4b5891c ("powerpc/mm: Fixup tlbie vs store ordering issue on
POWER9"). The presence of ERAT cache implies we can still get new
stores and they may miss store queue marking flush.

Cc: stable@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
2019-09-24 20:58:55 +10:00
Aneesh Kumar K.V
09ce98cacd powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag
Rename the #define to indicate this is related to store vs tlbie
ordering issue. In the next patch, we will be adding another feature
flag that is used to handles ERAT flush vs tlbie ordering issue.

Fixes: a5d4b5891c ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
2019-09-24 20:58:47 +10:00
Linus Torvalds
84da111de0 hmm related patches for 5.4
This is more cleanup and consolidation of the hmm APIs and the very
 strongly related mmu_notifier interfaces. Many places across the tree
 using these interfaces are touched in the process. Beyond that a cleanup
 to the page walker API and a few memremap related changes round out the
 series:
 
 - General improvement of hmm_range_fault() and related APIs, more
   documentation, bug fixes from testing, API simplification &
   consolidation, and unused API removal
 
 - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and
   make them internal kconfig selects
 
 - Hoist a lot of code related to mmu notifier attachment out of drivers by
   using a refcount get/put attachment idiom and remove the convoluted
   mmu_notifier_unregister_no_release() and related APIs.
 
 - General API improvement for the migrate_vma API and revision of its only
   user in nouveau
 
 - Annotate mmu_notifiers with lockdep and sleeping region debugging
 
 Two series unrelated to HMM or mmu_notifiers came along due to
 dependencies:
 
 - Allow pagemap's memremap_pages family of APIs to work without providing
   a struct device
 
 - Make walk_page_range() and related use a constant structure for function
   pointers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl1/nnkACgkQOG33FX4g
 mxqaRg//c6FqowV1pQlLutvAOAgMdpzfZ9eaaDKngy9RVQxz+k/MmJrdRH/p/mMA
 Pq93A1XfwtraGKErHegFXGEDk4XhOustVAVFwvjyXO41dTUdoFVUkti6ftbrl/rS
 6CT+X90jlvrwdRY7QBeuo7lxx7z8Qkqbk1O1kc1IOracjKfNJS+y6LTamy6weM3g
 tIMHI65PkxpRzN36DV9uCN5dMwFzJ73DWHp1b0acnDIigkl6u5zp6orAJVWRjyQX
 nmEd3/IOvdxaubAoAvboNS5CyVb4yS9xshWWMbH6AulKJv3Glca1Aa7QuSpBoN8v
 wy4c9+umzqRgzgUJUe1xwN9P49oBNhJpgBSu8MUlgBA4IOc3rDl/Tw0b5KCFVfkH
 yHkp8n6MP8VsRrzXTC6Kx0vdjIkAO8SUeylVJczAcVSyHIo6/JUJCVDeFLSTVymh
 EGWJ7zX2iRhUbssJ6/izQTTQyCH3YIyZ5QtqByWuX2U7ZrfkqS3/EnBW1Q+j+gPF
 Z2yW8iT6k0iENw6s8psE9czexuywa/Lttz94IyNlOQ8rJTiQqB9wLaAvg9hvUk7a
 kuspL+JGIZkrL3ouCeO/VA6xnaP+Q7nR8geWBRb8zKGHmtWrb5Gwmt6t+vTnCC2l
 olIDebrnnxwfBQhEJ5219W+M1pBpjiTpqK/UdBd92A4+sOOhOD0=
 =FRGg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull hmm updates from Jason Gunthorpe:
 "This is more cleanup and consolidation of the hmm APIs and the very
  strongly related mmu_notifier interfaces. Many places across the tree
  using these interfaces are touched in the process. Beyond that a
  cleanup to the page walker API and a few memremap related changes
  round out the series:

   - General improvement of hmm_range_fault() and related APIs, more
     documentation, bug fixes from testing, API simplification &
     consolidation, and unused API removal

   - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
     and make them internal kconfig selects

   - Hoist a lot of code related to mmu notifier attachment out of
     drivers by using a refcount get/put attachment idiom and remove the
     convoluted mmu_notifier_unregister_no_release() and related APIs.

   - General API improvement for the migrate_vma API and revision of its
     only user in nouveau

   - Annotate mmu_notifiers with lockdep and sleeping region debugging

  Two series unrelated to HMM or mmu_notifiers came along due to
  dependencies:

   - Allow pagemap's memremap_pages family of APIs to work without
     providing a struct device

   - Make walk_page_range() and related use a constant structure for
     function pointers"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
  libnvdimm: Enable unit test infrastructure compile checks
  mm, notifier: Catch sleeping/blocking for !blockable
  kernel.h: Add non_block_start/end()
  drm/radeon: guard against calling an unpaired radeon_mn_unregister()
  csky: add missing brackets in a macro for tlb.h
  pagewalk: use lockdep_assert_held for locking validation
  pagewalk: separate function pointers from iterator data
  mm: split out a new pagewalk.h header from mm.h
  mm/mmu_notifiers: annotate with might_sleep()
  mm/mmu_notifiers: prime lockdep
  mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
  mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
  mm/hmm: hmm_range_fault() infinite loop
  mm/hmm: hmm_range_fault() NULL pointer bug
  mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
  mm/mmu_notifiers: remove unregister_no_release
  RDMA/odp: remove ib_ucontext from ib_umem
  RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  ...
2019-09-21 10:07:42 -07:00
Christophe Leroy
cbd18991e2 powerpc/mm: Fix an Oops in kasan_mmu_init()
Uncompressing Kernel Image ... OK
   Loading Device Tree to 01ff7000, end 01fff74f ... OK
[    0.000000] printk: bootconsole [udbg0] enabled
[    0.000000] BUG: Unable to handle kernel data access at 0xf818c000
[    0.000000] Faulting instruction address: 0xc0013c7c
[    0.000000] Thread overran stack, or stack corrupted
[    0.000000] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.000000] BE PAGE_SIZE=16K PREEMPT
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.3.0-rc4-s3k-dev-00743-g5abe4a3e8fd3-dirty #2080
[    0.000000] NIP:  c0013c7c LR: c0013310 CTR: 00000000
[    0.000000] REGS: c0c5ff38 TRAP: 0300   Not tainted  (5.3.0-rc4-s3k-dev-00743-g5abe4a3e8fd3-dirty)
[    0.000000] MSR:  00001032 <ME,IR,DR,RI>  CR: 99033955  XER: 80002100
[    0.000000] DAR: f818c000 DSISR: 82000000
[    0.000000] GPR00: c0013310 c0c5fff0 c0ad6ac0 c0c600c0 f818c031 82000000 00000000 ffffffff
[    0.000000] GPR08: 00000000 f1f1f1f1 c0013c2c c0013304 99033955 00400008 00000000 07ff9598
[    0.000000] GPR16: 00000000 07ffb94c 00000000 00000000 00000000 00000000 00000000 f818cfb2
[    0.000000] GPR24: 00000000 00000000 00001000 ffffffff 00000000 c07dbf80 00000000 f818c000
[    0.000000] NIP [c0013c7c] do_page_fault+0x50/0x904
[    0.000000] LR [c0013310] handle_page_fault+0xc/0x38
[    0.000000] Call Trace:
[    0.000000] Instruction dump:
[    0.000000] be010080 91410014 553fe8fe 3d40c001 3d20f1f1 7d800026 394a3c2c 3fffe000
[    0.000000] 6129f1f1 900100c4 9181007c 91410018 <913f0000> 3d2001f4 6129f4f4 913f0004

Don't map the early shadow page read-only yet when creating the new
page tables for the real shadow memory, otherwise the memblock
allocations that immediately follows to create the real shadow pages
that are about to replace the early shadow page trigger a page fault
if they fall into the region being worked on at the moment.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Fixes: 2edb16efc8 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/fe86886fb8db44360417cee0dc515ad47ca6ef72.1566382750.git.christophe.leroy@c-s.fr
2019-09-21 08:36:53 +10:00
Christophe Leroy
4c0f5d1eb4 powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY
In a couple of places there is a need to select whether read-only
protection of shadow pages is performed with PAGE_KERNEL_RO or with
PAGE_READONLY.

Add a helper to avoid duplicating the choice.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9f33f44b9cd741c4a02b3dce7b8ef9438fe2cd2a.1566382750.git.christophe.leroy@c-s.fr
2019-09-21 08:36:53 +10:00
Aneesh Kumar K.V
7aec584eaf powerpc/book3s64/radix: Remove WARN_ON in destroy_context()
On failed task initialization due to memory allocation failures, we can
call into destroy_context() with process_tb entry already populated.
This patch forces the process_tb entry to zero in destroy_context().
With this patch, we lose the ability to track if we are destroying a
context without flushing the process table entry.

  WARNING: CPU: 4 PID: 6368 at arch/powerpc/mm/mmu_context_book3s64.c:246 destroy_context+0x58/0x340
  NIP [c0000000000875f8] destroy_context+0x58/0x340
  LR [c00000000013da18] __mmdrop+0x78/0x270
  Call Trace:
  [c000000f7db77c80] [c00000000013da18] __mmdrop+0x78/0x270
  [c000000f7db77cf0] [c0000000004d6a34] __do_execve_file.isra.13+0xbd4/0x1000
  [c000000f7db77e00] [c0000000004d7428] sys_execve+0x58/0x70
  [c000000f7db77e30] [c00000000000b388] system_call+0x5c/0x70

Reported-by: Priya M.A <priyama2@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Reformat/tweak comment wording]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190918140103.24395-1-aneesh.kumar@linux.ibm.com
2019-09-21 08:36:53 +10:00
Linus Torvalds
45824fc0da powerpc updates for 5.4
- Initial support for running on a system with an Ultravisor, which is software
    that runs below the hypervisor and protects guests against some attacks by
    the hypervisor.
 
  - Support for building the kernel to run as a "Secure Virtual Machine", ie. as
    a guest capable of running on a system with an Ultravisor.
 
  - Some changes to our DMA code on bare metal, to allow devices with medium
    sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space.
 
  - Support for firmware assisted crash dumps on bare metal (powernv).
 
  - Two series fixing bugs in and refactoring our PCI EEH code.
 
  - A large series refactoring our exception entry code to use gas macros, both
    to make it more readable and also enable some future optimisations.
 
 As well as many cleanups and other minor features & fixups.
 
 Thanks to:
   Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh
   Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin
   Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy,
   Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens,
   David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar,
   Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari
   Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras,
   Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
   Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor,
   Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram
   Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj,
   Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung
   Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2EtEcTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPfsD/9uXyBXn3anI/H08+mk74k5gCsmMQpn
 D442CD/ByogZcccp23yBTlhawtCE03hcHnCLygn0Xgd8a4YvHts/RGHUe3fPHqlG
 bEyZ7jsLVz5ebNZQP7r4eGs2pSzCajwJy2N9HJ/C1ojf15rrfRxoVJtnyhE2wXpm
 DL+6o2K+nUCB3gTQ1Inr3DnWzoGOOUfNTOea2u+J+yfHwGRqOBYpevwqiwy5eelK
 aRjUJCqMTvrzra49MeFwjo0Nt3/Y8UNcwA+JlGdeR8bRuWhFrYmyBRiZEKPaujNO
 5EAfghBBlB0KQCqvF/tRM/c0OftHqK59AMobP9T7u9oOaBXeF/FpZX/iXjzNDPsN
 j9Oo2tKLTu/YVEXqBFuREGP+znANr1Wo4CFyOG8SbvYz0HFjR6XbtRJsS+0e8GWl
 kqX5/ZhYz3lBnKSNe9jgWOrh/J0KCSFigBTEWJT3xsn4YE8x8kK2l9KPqAIldWEP
 sKb2UjGS7v0NKq+NvShH88Q9AeQUEIjTcg/9aDDQDe6FaRQ7KiF8bUxSdwSPi+Fn
 j0lnF6i+1ATWZKuCr85veVi7C5qoe/+MqalnmP7MxULyzgXLLxUgN0SzEYO6QofK
 LQK/VaH2XVr5+M5YAb7K4/NX5gbM3s1bKrCiUy4EyHNvgG7gricYdbz6HgAjKpR7
 oP0rHfgmVYvF1g==
 =WlW+
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "This is a bit late, partly due to me travelling, and partly due to a
  power outage knocking out some of my test systems *while* I was
  travelling.

   - Initial support for running on a system with an Ultravisor, which
     is software that runs below the hypervisor and protects guests
     against some attacks by the hypervisor.

   - Support for building the kernel to run as a "Secure Virtual
     Machine", ie. as a guest capable of running on a system with an
     Ultravisor.

   - Some changes to our DMA code on bare metal, to allow devices with
     medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
     DMA space.

   - Support for firmware assisted crash dumps on bare metal (powernv).

   - Two series fixing bugs in and refactoring our PCI EEH code.

   - A large series refactoring our exception entry code to use gas
     macros, both to make it more readable and also enable some future
     optimisations.

  As well as many cleanups and other minor features & fixups.

  Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
  Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
  Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
  JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
  Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
  Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
  Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
  Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
  Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
  Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
  Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
  Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
  Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
  Lendacky, Vasant Hegde"

* tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
  powerpc/mm/mce: Keep irqs disabled during lockless page table walk
  powerpc: Use ftrace_graph_ret_addr() when unwinding
  powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
  ftrace: Look up the address of return_to_handler() using helpers
  powerpc: dump kernel log before carrying out fadump or kdump
  docs: powerpc: Add missing documentation reference
  powerpc/xmon: Fix output of XIVE IPI
  powerpc/xmon: Improve output of XIVE interrupts
  powerpc/mm/radix: remove useless kernel messages
  powerpc/fadump: support holes in kernel boot memory area
  powerpc/fadump: remove RMA_START and RMA_END macros
  powerpc/fadump: update documentation about option to release opalcore
  powerpc/fadump: consider f/w load area
  powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
  powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
  powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
  powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
  powerpc/fadump: improve how crashed kernel's memory is reserved
  powerpc/fadump: consider reserved ranges while releasing memory
  powerpc/fadump: make crash memory ranges array allocation generic
  ...
2019-09-20 11:48:06 -07:00
Qian Cai
ec5b705c48 powerpc/mm/radix: remove useless kernel messages
Booting a POWER9 PowerNV system generates a few messages below with
"____ptrval____" due to the pointers printed without a specifier
extension (i.e unadorned %p) are hashed to prevent leaking information
about the kernel memory layout.

radix-mmu: Initializing Radix MMU
radix-mmu: Partition table (____ptrval____)
radix-mmu: Mapped 0x0000000000000000-0x0000000040000000 with 1.00 GiB
pages (exec)
radix-mmu: Mapped 0x0000000040000000-0x0000002000000000 with 1.00 GiB
pages
radix-mmu: Mapped 0x0000200000000000-0x0000202000000000 with 1.00 GiB
pages
radix-mmu: Process table (____ptrval____) and radix root for kernel:
(____ptrval____)

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1566570120-16529-1-git-send-email-cai@lca.pw
2019-09-14 00:04:46 +10:00
Michael Ellerman
dac39f7885 powerpc/64s: Remove overlaps_kvm_tmp()
kvm_tmp is now in .text and so doesn't need a special overlap check.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911115746.12433-2-mpe@ellerman.id.au
2019-09-14 00:04:40 +10:00
Christoph Hellwig
7b86ac3371 pagewalk: separate function pointers from iterator data
The mm_walk structure currently mixed data and code.  Split out the
operations vectors into a new mm_walk_ops structure, and while we are
changing the API also declare the mm_walk structure inside the
walk_page_range and walk_page_vma functions.

Based on patch from Linus Torvalds.

Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:04 -03:00
Christoph Hellwig
a520110e4a mm: split out a new pagewalk.h header from mm.h
Add a new header for the two handful of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.

Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:04 -03:00
Nicholas Piggin
2275d7b575 powerpc/64s/radix: introduce options to disable use of the tlbie instruction
Introduce two options to control the use of the tlbie instruction. A
boot time option which completely disables the kernel using the
instruction, this is currently incompatible with HASH MMU, KVM, and
coherent accelerators.

And a debugfs option can be switched at runtime and avoids using tlbie
for invalidating CPU TLBs for normal process and kernel address
mappings. Coherent accelerators are still managed with tlbie, as will
KVM partition scope translations.

Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
basic implementation which does not attempt to make any optimisation
beyond the tlbie implementation.

This is useful for performance testing among other things. For example
in certain situations on large systems, using IPIs may be faster than
tlbie as they can be directed rather than broadcast. Later we may also
take advantage of the IPIs to do more interesting things such as trim
the mm cpumask more aggressively.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com
2019-09-05 14:22:41 +10:00
Nicholas Piggin
7d805accbe powerpc/64s: remove unnecessary translation cache flushes at boot
The various translation structure invalidations performed in early boot
when the MMU is off are not required, because everything is invalidated
immediately before a CPU first enables its MMU (see early_init_mmu
and early_init_mmu_secondary).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-6-npiggin@gmail.com
2019-09-05 14:22:41 +10:00
Nicholas Piggin
7e71c428a6 powerpc/64s/pseries: radix flush translations before MMU is enabled at boot
Radix guests are responsible for managing their own translation caches,
so make them match bare metal radix and hash, and make each CPU flush
all its translations right before enabling its MMU.

Radix guests may not flush partition scope translations, so in
tlbiel_all, make these flushes conditional on CPU_FTR_HVMODE. Process
scope translations are the only type visible to the guest.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-5-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
fd13daea5f powerpc/64s: make mmu_partition_table_set_entry TLB flush optional
No functional change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-4-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
99161de3a2 powerpc/64s/radix: tidy up TLB flushing code
There should be no functional changes.

- Use calls to existing radix_tlb.c functions in flush_partition.

- Rename radix__flush_tlb_lpid to radix__flush_all_lpid and similar,
  because they flush everything, matching flush_all_mm rather than
  flush_tlb_mm for the lpid.

- Remove some unused radix_tlb.c flush primitives.

Signed-off: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-3-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
ed6546bdc6 powerpc/64s: remove register_process_table callback
This callback is only required because the partition table init comes
before process table allocation on powernv (aka bare metal aka native).

Change the order to allocate the process table first, and remove the
callback.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-2-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
9b123d1ea2 powerpc/64s/exception: reduce page fault unnecessary loads
This avoids 3 loads in the radix page fault case, 1 load in the
hash fault case, and 2 loads in the hash miss page fault case.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-37-npiggin@gmail.com
2019-08-30 11:14:59 +10:00
Michael Ellerman
9044adca78 Merge branch 'topic/ppc-kvm' into next
Merge our ppc-kvm topic branch to bring in the Ultravisor support
patches.
2019-08-30 09:52:57 +10:00
Claudio Carvalho
5223134029 powerpc/mm: Write to PTCR only if ultravisor disabled
In ultravisor enabled systems, PTCR becomes ultravisor privileged only
for writing and an attempt to write to it will cause a Hypervisor
Emulation Assitance interrupt.

This patch uses the set_ptcr_when_no_uv() function to restrict PTCR
writing to only when ultravisor is disabled.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-6-cclaudio@linux.ibm.com
2019-08-30 09:40:16 +10:00
Michael Anderson
139a1d2842 powerpc/mm: Use UV_WRITE_PATE ucall to register a PATE
When Ultravisor (UV) is enabled, the partition table is stored in secure
memory and can only be accessed via the UV. The Hypervisor (HV) however
maintains a copy of the partition table in normal memory to allow Nest MMU
translations to occur (for normal VMs). The HV copy includes partition
table entries (PATE)s for secure VMs which would currently be unused
(Nest MMU translations cannot access secure memory) but they would be
needed as we add functionality.

This patch adds the UV_WRITE_PATE ucall which is used to update the PATE
for a VM (both normal and secure) when Ultravisor is enabled.

Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[ cclaudio: Write the PATE in HV's table before doing that in UV's ]
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Reviewed-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-5-cclaudio@linux.ibm.com
2019-08-30 09:40:15 +10:00
Christoph Hellwig
f2902a2fb4 powerpc: use the generic dma coherent remap allocator
This switches to using common code for the DMA allocations, including
potential use of the CMA allocator if configured.

Switching to the generic code enables DMA allocations from atomic
context, which is required by the DMA API documentation, and also
adds various other minor features drivers start relying upon.  It
also makes sure we have on tested code base for all architectures
that require uncached pte bits for coherent DMA allocations.

Another advantage is that consistent memory allocations now share
the general vmalloc pool instead of needing an explicit careout
from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> # tested on 8xx
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190814132230.31874-2-hch@lst.de
2019-08-28 23:19:34 +10:00
Christophe Leroy
12c3f1fd87 powerpc/32s: get rid of CPU_FTR_601 feature
Now that 601 is exclusive from other 6xx, CPU_FTR_601 and
associated fixups are useless.

Drop this feature and use #ifdefs instead.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ecdb7194a17dbfa01865df6a82979533adc2c70b.1566834712.git.christophe.leroy@c-s.fr
2019-08-28 23:19:33 +10:00
Christophe Leroy
163918fc57 powerpc/mm: split out early ioremap path.
ioremap does things differently depending on whether
SLAB is available or not at different levels.

Try to separate the early path from the beginning.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3acd2dbe04b04f111475e7a59f2b6f2ab9b95ab6.1566309263.git.christophe.leroy@c-s.fr
2019-08-27 13:03:35 +10:00
Christophe Leroy
4a45b7460c powerpc/mm: refactor ioremap vm area setup.
PPC32 and PPC64 are doing the same once SLAB is available.
Create a do_ioremap() function that calls get_vm_area and
do the mapping.

For PPC64, we add the 4K PFN hack sanity check to __ioremap_caller()
in order to avoid using __ioremap_at(). Other checks in __ioremap_at()
are irrelevant for __ioremap_caller().

On PPC64, VM area is allocated in the range [ioremap_bot ; IOREMAP_END]
On PPC32, VM area is allocated in the range [VMALLOC_START ; VMALLOC_END]

Lets define IOREMAP_START is ioremap_bot for PPC64, and alias
IOREMAP_START/END to VMALLOC_START/END on PPC32

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/42e7e36ad32e0fdf76692426cc642799c9f689b8.1566309263.git.christophe.leroy@c-s.fr
2019-08-27 13:03:35 +10:00
Christophe Leroy
191e42063a powerpc/mm: refactor ioremap_range() and use ioremap_page_range()
book3s64's ioremap_range() is almost same as fallback ioremap_range(),
except that it calls radix__ioremap_range() when radix is enabled.

radix__ioremap_range() is also very similar to the other ones, expect
that it calls ioremap_page_range when slab is available.

PPC32 __ioremap_caller() have a loop doing the same thing as
ioremap_range() so use it on PPC32 as well.

Lets keep only one version of ioremap_range() which calls
ioremap_page_range() on all platforms when slab is available.

At the same time, drop the nid parameter which is not used.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4b1dca7096b01823b101be7338983578641547f1.1566309263.git.christophe.leroy@c-s.fr
2019-08-27 13:03:35 +10:00
Christophe Leroy
f381d5711f powerpc/mm: Move ioremap functions out of pgtable_32/64.c
Create ioremap_32.c and ioremap_64.c and move respective ioremap
functions out of pgtable_32.c and pgtable_64.c

In the meantime, fix a few comments and changes a printk() to
pr_warn(). Also fix a few oversplitted lines.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b5c8b02ccefd4ede64c61b53cf64fb5dacb35740.1566309263.git.christophe.leroy@c-s.fr
2019-08-27 13:03:35 +10:00
Christophe Leroy
7cd9b317b6 powerpc/mm: make ioremap_bot common to all
Drop multiple definitions of ioremap_bot and make one common to
all subarches.

Only CONFIG_PPC_BOOK3E_64 had a global static init value for
ioremap_bot. Now ioremap_bot is set in early_init_mmu_global().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/920eebfd9f36f14c79d1755847f5bf7c83703bdd.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:34 +10:00
Christophe Leroy
edfe1a5679 powerpc/mm: move ioremap_prot() into ioremap.c
Both ioremap_prot() are idenfical, move them into ioremap.c

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0b3eb0e0f1490a99fd6c983e166fb8946233f151.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:34 +10:00
Christophe Leroy
4634c375db powerpc/mm: move common 32/64 bits ioremap functions into ioremap.c
ioremap(), ioremap_wc() and ioremap_coherent() are now identical on
PPC32 and PPC64 as iowa_is_active() will always return false on
PPC32. Move them into a new common location called ioremap.c

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6223803ce024d6ab4dfaa919f44098aed5b4bc33.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:34 +10:00
Christophe Leroy
14b4d97669 powerpc/mm: rework io-workaround invocation.
ppc_md.ioremap() is only used for I/O workaround on CELL platform,
so indirect function call can be avoided.

This patch reworks the io-workaround and ioremap() functions to
use the global 'io_workaround_inited' flag for the activation
of io-workaround.

When CONFIG_PPC_IO_WORKAROUNDS or CONFIG_PPC_INDIRECT_MMIO are not
selected, the I/O workaround ioremap() voids and the global flag is
not used.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5fa3ef069fbd0f152512afaae19e7a60161454cf.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:34 +10:00
Christophe Leroy
492643e81e powerpc/mm: drop function __ioremap()
__ioremap() is not used anymore, drop it.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ccc439f481a0884e00a6be1bab44bab2a4477fea.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:33 +10:00
Christophe Leroy
8aee077292 powerpc/mm: drop ppc_md.iounmap() and __iounmap()
ppc_md.iounmap() is never set, drop it.

Once ppc_md.iounmap() is gone, iounmap() remains the only user of
__iounmap() and iounmap() does nothing else than calling __iounmap().
So drop iounmap() and make __iounmap() the new iounmap().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d73ba92bb7a387cc58cc34666d7f5158a45851b0.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:33 +10:00
Nicholas Piggin
31f210cf42 powerpc/64s/radix: Fix memory hot-unplug page table split
create_physical_mapping expects physical addresses, but splitting
these mapping on hot unplug is supplying virtual (effective)
addresses.

Fixes: 4dd5f8a99e ("powerpc/mm/radix: Split linear mapping on hot-unplug")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-2-npiggin@gmail.com
2019-08-20 21:22:19 +10:00
Nicholas Piggin
8f51e39294 powerpc/64s/radix: Fix memory hotplug section page table creation
create_physical_mapping expects physical addresses, but creating and
splitting these mappings after boot is supplying virtual (effective)
addresses. This can be irritated by booting with mem= to limit memory
then probing an unused physical memory range:

  echo <addr> > /sys/devices/system/memory/probe

This mostly works by accident, firstly because __va(__va(x)) == __va(x)
so the virtual address does not get corrupted. Secondly because pfn_pte
masks out the upper bits of the pfn beyond the physical address limit,
so a pfn constructed with a 0xc000000000000000 virtual linear address
will be masked back to the correct physical address in the pte.

Fixes: 6cc27341b2 ("powerpc/mm: add radix__create_section_mapping()")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-1-npiggin@gmail.com
2019-08-20 21:22:18 +10:00
Christophe Leroy
f204338f8e powerpc/mm: ppc 603 doesn't need update_mmu_cache()
On powerpc 603, there is no hash table so get out of
update_mmu_cache() early.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6133e0f115d955fac4061536dab0fa7480a1c433.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
f49f4e2b68 powerpc/mm: Simplify update_mmu_cache() on BOOK3S32
On BOOK3S32, hash_preload() neither use is_exec nor trap,
so drop those parameters and simplify update_mmu_cached().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/35f143c6fe29f9fd25c7f3cd4448ae401029ce3c.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
e5a1edb9fe powerpc/mm: move update_mmu_cache() into book3s hash utils.
update_mmu_cache() is only for BOOK3S, and can be simplified for
BOOK3S32.

Move it out of mem.c into respective BOOK3S32 and BOOK3S64 files
containing hash utils.

BOOK3S64 version of hash_preload() is only used locally, declare it
static.

Remove the radix_enabled() stuff in BOOK3S32 version.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/107aaf43583a5f5d09e0d4e84c4c4390ecfcd512.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
4c1616ef03 powerpc/mm: move FSL_BOOK3 version of update_mmu_cache()
Move FSL_BOOK3E version of update_mmu_cache() at the same
place as book3e_hugetlb_preload() as update_mmu_cache() is
the only user of book3e_hugetlb_preload().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4d69fdc86df9c74adc71a60331a86f6afb8b5e9e.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
d964211791 powerpc/mm: define empty update_mmu_cache() as static inline
Only BOOK3S and FSL_BOOK3E have a usefull update_mmu_cache().

For the others, just define it static inline.

In the meantime, simplify the FSL_BOOK3E related ifdef as
book3e_hugetlb_preload() only exists when CONFIG_PPC_FSL_BOOK3E
is selected.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/668aba4db6b9af6d8a151174e11a4289f1a6bbcd.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
ad628a34ec powerpc/mm: don't display empty early ioremap area
On the 8xx, the layout displayed at boot is:

[    0.000000] Memory: 121856K/131072K available (5728K kernel code, 592K rwdata, 1248K rodata, 560K init, 448K bss, 9216K reserved, 0K cma-reserved)
[    0.000000] Kernel virtual memory layout:
[    0.000000]   * 0xffefc000..0xffffc000  : fixmap
[    0.000000]   * 0xffefc000..0xffefc000  : early ioremap
[    0.000000]   * 0xc9000000..0xffefc000  : vmalloc & ioremap
[    0.000000] SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=1, Nodes=1

Remove display of an empty early ioremap.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f6267226038cb25a839b567319e240576e3f8565.1565793287.git.christophe.leroy@c-s.fr
2019-08-20 21:22:13 +10:00
Christophe Leroy
9d6d712fbf powerpc/32s: Fix boot failure with DEBUG_PAGEALLOC without KASAN.
When KASAN is selected, the definitive hash table has to be
set up later, but there is already an early temporary one.

When KASAN is not selected, there is no early hash table,
so the setup of the definitive hash table cannot be delayed.

Fixes: 72f208c6a8 ("powerpc/32s: move hash code patching out of MMU_init_hw()")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Jonathan Neuschafer <j.neuschaefer@gmx.net>
Tested-by: Jonathan Neuschafer <j.neuschaefer@gmx.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b7860c5e1e784d6b96ba67edf47dd6cbc2e78ab6.1565776892.git.christophe.leroy@c-s.fr
2019-08-20 21:22:13 +10:00
Christophe Leroy
663c0c9496 powerpc/kasan: Fix shadow area set up for modules.
When loading modules, from time to time an Oops is encountered during
the init of shadow area for globals. This is due to the last page not
always being mapped depending on the exact distance between the start
and the end of the shadow area and the alignment with the page
addresses.

Fix this by aligning the starting address with the page address.

Fixes: 2edb16efc8 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4f887e9b77d0d725cbb52035c7ece485c1c5fc14.1565361881.git.christophe.leroy@c-s.fr
2019-08-20 21:22:11 +10:00
Christophe Leroy
45ff3c5595 powerpc/kasan: Fix parallel loading of modules.
Parallel loading of modules may lead to bad setup of shadow page table
entries.

First, lets align modules so that two modules never share the same
shadow page.

Second, ensure that two modules cannot allocate two page tables for
the same PMD entry at the same time. This is done by using
init_mm.page_table_lock in the same way as __pte_alloc_kernel()

Fixes: 2edb16efc8 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c97284f912128cbc3f2fe09d68e90e65fb3e6026.1565361876.git.christophe.leroy@c-s.fr
2019-08-20 21:22:11 +10:00
Christophe Leroy
65e701b2d2 powerpc/ptdump: drop non vital #ifdefs
hashpagetable.c is only compiled when CONFIG_PPC_BOOK3S_64 is
defined, so drop the test and its 'else' branch.

Use IS_ENABLED(CONFIG_PPC_PSERIES) instead of #ifdef, this allows the
code to be checked at any build. It is still optimised out by GCC.

Use IS_ENABLED(CONFIG_PPC_64K_PAGES) instead of #ifdef.

Use IS_ENABLED(CONFIG_SPARSEMEN_VMEMMAP) instead of #ifdef.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c8998ed32e4e3954b56a8dacecfe43319a2a0483.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:09 +10:00
Christophe Leroy
f3a2ac0589 powerpc/ptdump: get out of note_prot_wx() when CONFIG_PPC_DEBUG_WX is not selected.
When CONFIG_PPC_DEBUG_WX, note_prot_wx() is useless.

Get out of it early and inconditionnally in that case,
so that GCC can kick all the code out.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ff6c8f631bd4ce3a10e0cc241eb569816187bc20.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:09 +10:00
Christophe Leroy
8224235272 powerpc/ptdump: drop dummy KERN_VIRT_START on PPC32
PPC32 doesn't have KERN_VIRT_START. Make PAGE_OFFSET the
default starting address for the dump, and drop the dummy
definition of KERN_VIRT_START. Only use KERN_VIRT_START for
non radix PPC64.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/334632b1df4775b0ccf3bdc8d6b201d14e3daedd.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:09 +10:00
Christophe Leroy
e033829d2a powerpc/ptdump: fix walk_pagetables() address mismatch
walk_pagetables() always walk the entire pgdir from address 0
but considers PAGE_OFFSET or KERN_VIRT_START as the starting
address of the walk, resulting in a possible mismatch in the
displayed addresses.

Ex: on PPC32, when KERN_VIRT_START was locally defined as
PAGE_OFFSET, ptdump displayed 0x80000000
instead of 0xc0000000 for the first kernel page,
because 0xc0000000 + 0xc0000000 = 0x80000000

Start the walk at st->start_address instead of starting at 0.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5aa2ac513295f594cce8ddb1c649f61947bd063d.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:08 +10:00
Christophe Leroy
7c7a532ba3 powerpc/ptdump: Fix addresses display on PPC32
Commit 453d87f6a8 ("powerpc/mm: Warn if W+X pages found on boot")
wrongly changed KERN_VIRT_START from 0 to PAGE_OFFSET, leading to a
shift in the displayed addresses.

Lets revert that change to resync walk_pagetables()'s addr val and
pgd_t pointer for PPC32.

Fixes: 453d87f6a8 ("powerpc/mm: Warn if W+X pages found on boot")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/eb4d626514e22f85814830012642329018ef6af9.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:08 +10:00
Gautham R. Shenoy
c784be435d powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt()
The calls to arch_add_memory()/arch_remove_memory() are always made
with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin().
On pSeries, arch_add_memory()/arch_remove_memory() eventually call
resize_hpt() which in turn calls stop_machine() which acquires the
read-side cpu_hotplug_lock again, thereby resulting in the recursive
acquisition of this lock.

In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
lockup during a memory hotplug operation because cpus_read_lock() is a
per-cpu rwsem read, which, in the fast-path (in the absence of the
writer, which in our case is a CPU-hotplug operation) simply
increments the read_count on the semaphore. Thus a recursive read in
the fast-path doesn't cause any problems.

However, we can hit this problem in practice if there is a concurrent
CPU-Hotplug operation in progress which is waiting to acquire the
write-side of the lock. This will cause the second recursive read to
block until the writer finishes. While the writer is blocked since the
first read holds the lock. Thus both the reader as well as the writers
fail to make any progress thereby blocking both CPU-Hotplug as well as
Memory Hotplug operations.

Memory-Hotplug				CPU-Hotplug
CPU 0					CPU 1
------                                  ------

1. down_read(cpu_hotplug_lock.rw_sem)
   [memory_hotplug_begin]
					2. down_write(cpu_hotplug_lock.rw_sem)
					[cpu_up/cpu_down]
3. down_read(cpu_hotplug_lock.rw_sem)
   [stop_machine()]

Lockdep complains as follows in these code-paths.

 swapper/0/1 is trying to acquire lock:
 (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60

but task is already holding lock:
(____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(cpu_hotplug_lock.rw_sem);
   lock(cpu_hotplug_lock.rw_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 3 locks held by swapper/0/1:
  #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
  #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
  #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0

stack backtrace:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
 Call Trace:
   dump_stack+0xe8/0x164 (unreliable)
   __lock_acquire+0x1110/0x1c70
   lock_acquire+0x240/0x290
   cpus_read_lock+0x64/0xf0
   stop_machine+0x2c/0x60
   pseries_lpar_resize_hpt+0x19c/0x2c0
   resize_hpt_for_hotplug+0x70/0xd0
   arch_add_memory+0x58/0xfc
   devm_memremap_pages+0x5e8/0x8f0
   pmem_attach_disk+0x764/0x830
   nvdimm_bus_probe+0x118/0x240
   really_probe+0x230/0x4b0
   driver_probe_device+0x16c/0x1e0
   __driver_attach+0x148/0x1b0
   bus_for_each_dev+0x90/0x130
   driver_attach+0x34/0x50
   bus_add_driver+0x1a8/0x360
   driver_register+0x108/0x170
   __nd_driver_register+0xd0/0xf0
   nd_pmem_driver_init+0x34/0x48
   do_one_initcall+0x1e0/0x45c
   kernel_init_freeable+0x540/0x64c
   kernel_init+0x2c/0x160
   ret_from_kernel_thread+0x5c/0x68

Fix this issue by
  1) Requiring all the calls to pseries_lpar_resize_hpt() be made
     with cpu_hotplug_lock held.

  2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
     as a consequence of 1)

  3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
     with cpu_hotplug_lock held.

Fixes: dbcf929c00 ("powerpc/pseries: Add support for hash table resizing")
Cc: stable@vger.kernel.org # v4.11+
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com
2019-08-19 13:20:23 +10:00
Michael Ellerman
4215fa2d7d Merge branch 'fixes' into next
Merge in our fixes branch, which brings in clone3() as well as some
implicit fallthrough fixes we want in next.
2019-08-19 12:43:13 +10:00
Nicholas Piggin
2b87a2553a powerpc/64s: Make boot look nice(r)
Radix boot looks like this:

 -----------------------------------------------------
 phys_mem_size     = 0x200000000
 dcache_bsize      = 0x80
 icache_bsize      = 0x80
 cpu_features      = 0x0000c06f8f5fb1a7
   possible        = 0x0000fbffcf5fb1a7
   always          = 0x00000003800081a1
 cpu_user_features = 0xdc0065c2 0xaee00000
 mmu_features      = 0xbc006041
 firmware_features = 0x0000000010000000
 hash-mmu: ppc64_pft_size    = 0x0
 hash-mmu: kernel vmalloc start   = 0xc008000000000000
 hash-mmu: kernel IO start        = 0xc00a000000000000
 hash-mmu: kernel vmemmap start   = 0xc00c000000000000
 -----------------------------------------------------

Fix:

 -----------------------------------------------------
 phys_mem_size     = 0x200000000
 dcache_bsize      = 0x80
 icache_bsize      = 0x80
 cpu_features      = 0x0000c06f8f5fb1a7
   possible        = 0x0000fbffcf5fb1a7
   always          = 0x00000003800081a1
 cpu_user_features = 0xdc0065c2 0xaee00000
 mmu_features      = 0xbc006041
 firmware_features = 0x0000000010000000
 vmalloc start     = 0xc008000000000000
 IO start          = 0xc00a000000000000
 vmemmap start     = 0xc00c000000000000
 -----------------------------------------------------

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190516020437.11783-1-npiggin@gmail.com
2019-08-15 22:52:55 +10:00
Christophe Leroy
b9ee5e04fd powerpc/64e: Drop stale call to smp_processor_id() which hangs SMP startup
Commit ebb9d30a6a ("powerpc/mm: any thread in one core can be the
first to setup TLB1") removed the need to know the cpu_id in
early_init_this_mmu(), but the call to smp_processor_id() which was
marked __maybe_used remained.

Since commit ed1cd6deb0 ("powerpc: Activate CONFIG_THREAD_INFO_IN_TASK")
thread_info cannot be reached before MMU is properly set up.

Drop this stale call to smp_processor_id() which makes SMP hang when
CONFIG_PREEMPT is set.

Fixes: ebb9d30a6a ("powerpc/mm: any thread in one core can be the first to setup TLB1")
Fixes: ed1cd6deb0 ("powerpc: Activate CONFIG_THREAD_INFO_IN_TASK")
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bef479514f4c08329fa649f67735df8918bc0976.1565268248.git.christophe.leroy@c-s.fr
2019-08-12 22:56:57 +10:00
Christophe Leroy
d7e23b887f powerpc/kasan: fix early boot failure on PPC32
Due to commit 4a6d8cf900 ("powerpc/mm: don't use pte_alloc_kernel()
until slab is available on PPC32"), pte_alloc_kernel() cannot be used
during early KASAN init.

Fix it by using memblock_alloc() instead.

Fixes: 2edb16efc8 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/da89670093651437f27d2975224712e0a130b055.1564552796.git.christophe.leroy@c-s.fr
2019-07-31 22:02:52 +10:00
Linus Torvalds
bed38c3e2d powerpc fixes for 5.3 #2
An assortment of non-regression fixes that have accumulated since the start of
 the merge window.
 
 A fix for a user triggerable oops on machines where transactional memory is
 disabled, eg. Power9 bare metal, Power8 with TM disabled on the command line, or
 all Power7 or earlier machines.
 
 Three fixes for handling of PMU and power saving registers when running nested
 KVM on Power9.
 
 Two fixes for bugs found while stress testing the XIVE interrupt controller
 code, also on Power9.
 
 A fix to allow guests to boot under Qemu/KVM on Power9 using the the Hash MMU
 with >= 1TB of memory.
 
 Two fixes for bugs in the recent DMA cleanup, one of which could lead to
 checkstops.
 
 And finally three fixes for the PAPR SCM nvdimm driver.
 
 Thanks to:
   Alexey Kardashevskiy, Andrea Arcangeli, Cédric Le Goater, Christoph Hellwig,
   David Gibson, Gautham R. Shenoy, Michael Neuling, Oliver O'Halloran,, Satheesh
   Rajendran, Shawn Anastasio, Suraj Jitindar Singh, Vaibhav Jain.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdOF2iAAoJEFHr6jzI4aWAVmsQAJ//UY1a+lz39y/5jmkybJbH
 HVnja6ZhsKd3+ZAnljGmqr1zuwDmy8+X3pT+1832zBBm4Z1cNKj1c0wuK5fuhAfq
 o0XkO0N9GFcQu8HUPb5wSBOoyXwK0qUhExfCVobl7YsDAyAI2//nSQTwxNX3W4Hv
 P7hz48pBbiqRzQAOSHV8ZlcOBETbSVAXeNalSXrXqSJmXQbVWCQcd6vucMSwZ7S5
 ZiiL/gCBoO0kd0ZQRsGXCbwcjcR4NlTDN0M40og8Y9KTDkId8HdmJyXW3tMcZo/g
 W3LeMR94bUh/KrK88lMBrRXKUlxL+loZKWZaeNlA5+ShCYk/ZafkKri/QUX/glOq
 ahm8uqokdZ5VS1tgSYoJIKdA5qMGvv8V+CpHRJnZqaEhUCduQa5XmWPnDnEKkDt0
 94VBsk0D2vHYKyygv5JMgYHQVlU7XrQF8fw2pKShpqLMY7ZMpeDDmKN9AuzxhawF
 9b7HigbwNt5LvNJ0xn097KW+svCK7i3ZgiQe83W36wjSl2ystgjJ3T7yrH6Q1rKH
 o4loEGA4gASTDjTmWQM20lHT1xQHY4fQBC/wi/67as3m0TDeGXYI0fOZC5qtEBFr
 Ln/0e78VMhut/RWicDlRveszef1MCi1warR9R4I/bQ8M6O1BzHYsQ9zr6H111uxL
 vQ92Yp8G2PoqN7wlFlSG
 =Yc9T
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "An assortment of non-regression fixes that have accumulated since the
  start of the merge window.

   - A fix for a user triggerable oops on machines where transactional
     memory is disabled, eg. Power9 bare metal, Power8 with TM disabled
     on the command line, or all Power7 or earlier machines.

   - Three fixes for handling of PMU and power saving registers when
     running nested KVM on Power9.

   - Two fixes for bugs found while stress testing the XIVE interrupt
     controller code, also on Power9.

   - A fix to allow guests to boot under Qemu/KVM on Power9 using the
     the Hash MMU with >= 1TB of memory.

   - Two fixes for bugs in the recent DMA cleanup, one of which could
     lead to checkstops.

   - And finally three fixes for the PAPR SCM nvdimm driver.

  Thanks to: Alexey Kardashevskiy, Andrea Arcangeli, Cédric Le Goater,
  Christoph Hellwig, David Gibson, Gautham R. Shenoy, Michael Neuling,
  Oliver O'Halloran, Satheesh Rajendran, Shawn Anastasio, Suraj Jitindar
  Singh, Vaibhav Jain"

* tag 'powerpc-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails
  powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL
  powerpc/pseries: Update SCM hcall op-codes in hvcall.h
  powerpc/tm: Fix oops on sigreturn on systems without TM
  powerpc/dma: Fix invalid DMA mmap behavior
  KVM: PPC: Book3S HV: XIVE: fix rollback when kvmppc_xive_create fails
  powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask()
  powerpc: fix off by one in max_zone_pfn initialization for ZONE_DMA
  KVM: PPC: Book3S HV: Save and restore guest visible PSSCR bits on pseries
  powerpc/pmu: Set pmcregs_in_use in paca when running as LPAR
  KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting
  powerpc/mm: Limit rma_size to 1TB when running without HV mode
2019-07-24 09:58:39 -07:00
David Hildenbrand
80ec922dbd mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVE
We want to improve error handling while adding memory by allowing to use
arch_remove_memory() and __remove_pages() even if
CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like:

	arch_add_memory()
	rc = do_something();
	if (rc) {
		arch_remove_memory();
	}

We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
quite some dependencies for memory offlining.

Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:06 -07:00
Daniel Jordan
79eb597cba mm: add account_locked_vm utility function
locked_vm accounting is done roughly the same way in five places, so
unify them in a helper.

Include the helper's caller in the debug print to distinguish between
callsites.

Error codes stay the same, so user-visible behavior does too.  The one
exception is that the -EPERM case in tce_account_locked_vm is removed
because Alexey has never seen it triggered.

[daniel.m.jordan@oracle.com: v3]
  Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
[sfr@canb.auug.org.au: fix mm/util.c]
Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Alan Tull <atull@kernel.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Moritz Fischer <mdf@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Wu Hao <hao.wu@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:25 -07:00
Anshuman Khandual
b98cca444d mm, kprobes: generalize and rename notify_page_fault() as kprobe_page_fault()
Architectures which support kprobes have very similar boilerplate around
calling kprobe_fault_handler().  Use a helper function in kprobes.h to
unify them, based on the x86 code.

This changes the behaviour for other architectures when preemption is
enabled.  Previously, they would have disabled preemption while calling
the kprobe handler.  However, preemption would be disabled if this fault
was due to a kprobe, so we know the fault was not due to a kprobe
handler and can simply return failure.

This behaviour was introduced in commit a980c0ef9f ("x86/kprobes:
Refactor kprobes_fault() like kprobe_exceptions_notify()")

[anshuman.khandual@arm.com: export kprobe_fault_handler()]
  Link: http://lkml.kernel.org/r/1561133358-8876-1-git-send-email-anshuman.khandual@arm.com
Link: http://lkml.kernel.org/r/1560420444-25737-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:22 -07:00
Anshuman Khandual
0f472d04f5 mm/ioremap: probe platform for p4d huge map support
Finish up what commit c2febafc67 ("mm: convert generic code to 5-level
paging") started while levelling up P4D huge mapping support at par with
PUD and PMD.  A new arch call back arch_ioremap_p4d_supported() is added
which just maintains status quo (P4D huge map not supported) on x86,
arm64 and powerpc.

When HAVE_ARCH_HUGE_VMAP is enabled its just a simple check from the
arch about the support, hence runtime effects are minimal.

Link: http://lkml.kernel.org/r/1561699231-20991-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:22 -07:00
Andrea Arcangeli
03800e0526 powerpc: fix off by one in max_zone_pfn initialization for ZONE_DMA
25078dc1f7 first introduced an off by
one error in the ZONE_DMA initialization of PPC_BOOK3E_64=y and since
9739ab7eda the off by one applies to
PPC32=y too. This simply corrects the off by one and should resolve
crashes like below:

[   65.179101] page 0x7fff outside node 0 zone DMA [ 0x0 - 0x7fff ]

Unfortunately in various MM places "max" means a non inclusive end of
range. free_area_init_nodes max_zone_pfn parameter is one case and
MAX_ORDER is another one (unrelated) that comes by memory.

Reported-by: Zorro Lang <zlang@redhat.com>
Fixes: 25078dc1f7 ("powerpc: use mm zones more sensibly")
Fixes: 9739ab7eda ("powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190625141727.2883-1-aarcange@redhat.com
2019-07-15 14:55:58 +10:00
Linus Torvalds
fec88ab0af HMM patches for 5.3
Improvements and bug fixes for the hmm interface in the kernel:
 
 - Improve clarity, locking and APIs related to the 'hmm mirror' feature
   merged last cycle. In linux-next we now see AMDGPU and nouveau to be
   using this API.
 
 - Remove old or transitional hmm APIs. These are hold overs from the past
   with no users, or APIs that existed only to manage cross tree conflicts.
   There are still a few more of these cleanups that didn't make the merge
   window cut off.
 
 - Improve some core mm APIs:
   * export alloc_pages_vma() for driver use
   * refactor into devm_request_free_mem_region() to manage
     DEVICE_PRIVATE resource reservations
   * refactor duplicative driver code into the core dev_pagemap
     struct
 
 - Remove hmm wrappers of improved core mm APIs, instead have drivers use
   the simplified API directly
 
 - Remove DEVICE_PUBLIC
 
 - Simplify the kconfig flow for the hmm users and core code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
 mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
 xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
 CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
 Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
 4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
 YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
 VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
 bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
 hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
 FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
 lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
 =wKvp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:

   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau to be using this API.

   - Remove old or transitional hmm APIs. These are hold overs from the
     past with no users, or APIs that existed only to manage cross tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.

   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct

   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly

   - Remove DEVICE_PUBLIC

   - Simplify the kconfig flow for the hmm users and core code"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...
2019-07-14 19:42:11 -07:00
Suraj Jitindar Singh
da0ef93310 powerpc/mm: Limit rma_size to 1TB when running without HV mode
The virtual real mode addressing (VRMA) mechanism is used when a
partition is using HPT (Hash Page Table) translation and performs real
mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this mode
effective address bits 0:23 are treated as zero (i.e. the access is
aliased to 0) and the access is performed using an implicit 1TB SLB
entry.

The size of the RMA (Real Memory Area) is communicated to the guest as
the size of the first memory region in the device tree. And because of
the mechanism described above can be expected to not exceed 1TB. In
the event that the host erroneously represents the RMA as being larger
than 1TB, guest accesses in real mode to memory addresses above 1TB
will be aliased down to below 1TB. This means that a memory access
performed in real mode may differ to one performed in virtual mode for
the same memory address, which would likely have unintended
consequences.

To avoid this outcome have the guest explicitly limit the size of the
RMA to the current maximum, which is 1TB. This means that even if the
first memory block is larger than 1TB, only the first 1TB should be
accessed in real mode.

Fixes: c610d65c0a ("powerpc/pseries: lift RTAS limit for hash")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190710052018.14628-1-sjitindarsingh@gmail.com
2019-07-15 12:41:17 +10:00
Linus Torvalds
192f0f8e9d powerpc updates for 5.3
Notable changes:
 
  - Removal of the NPU DMA code, used by the out-of-tree Nvidia driver, as well
    as some other functions only used by drivers that haven't (yet?) made it
    upstream.
 
  - A fix for a bug in our handling of hardware watchpoints (eg. perf record -e
    mem: ...) which could lead to register corruption and kernel crashes.
 
  - Enable HAVE_ARCH_HUGE_VMAP, which allows us to use large pages for vmalloc
    when using the Radix MMU.
 
  - A large but incremental rewrite of our exception handling code to use gas
    macros rather than multiple levels of nested CPP macros.
 
 And the usual small fixes, cleanups and improvements.
 
 Thanks to:
   Alastair D'Silva, Alexey Kardashevskiy, Andreas Schwab, Aneesh Kumar K.V, Anju
   T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Cédric Le Goater,
   Christian Lamparter, Christophe Leroy, Christophe Lombard, Christoph Hellwig,
   Daniel Axtens, Denis Efremov, Enrico Weigelt, Frederic Barrat, Gautham R.
   Shenoy, Geert Uytterhoeven, Geliang Tang, Gen Zhang, Greg Kroah-Hartman, Greg
   Kurz, Gustavo Romero, Krzysztof Kozlowski, Madhavan Srinivasan, Masahiro
   Yamada, Mathieu Malaterre, Michael Neuling, Nathan Lynch, Naveen N. Rao,
   Nicholas Piggin, Nishad Kamdar, Oliver O'Halloran, Qian Cai, Ravi Bangoria,
   Sachin Sant, Sam Bobroff, Satheesh Rajendran, Segher Boessenkool, Shaokun
   Zhang, Shawn Anastasio, Stewart Smith, Suraj Jitindar Singh, Thiago Jung
   Bauermann, YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdKVoLAAoJEFHr6jzI4aWA0kIP/A6shIbbE7H5W2hFrqt/PPPK
 3+VrvPKbOFF+W6hcE/RgSZmEnUo0svdNjHUd/eMfFS1vb/uRt2QDdrsHUNNwURQL
 M2mcLXFwYpnjSjb/XMgDbHpAQxjeGfTdYLonUIejN7Rk8KQUeLyKQ3SBn6kfMc46
 DnUUcPcjuRGaETUmVuZZ4e40ZWbJp8PKDrSJOuUrTPXMaK5ciNbZk5mCWXGbYl6G
 BMQAyv4ld/417rNTjBEP/T2foMJtioAt4W6mtlgdkOTdIEZnFU67nNxDBthNSu2c
 95+I+/sML4KOp1R4yhqLSLIDDbc3bg3c99hLGij0d948z3bkSZ8bwnPaUuy70C4v
 U8rvl/+N6C6H3DgSsPE/Gnkd8DnudqWY8nULc+8p3fXljGwww6/Qgt+6yCUn8BdW
 WgixkSjKgjDmzTw8trIUNEqORrTVle7cM2hIyIK2Q5T4kWzNQxrLZ/x/3wgoYjUa
 1KwIzaRo5JKZ9D3pJnJ5U+knE2/90rJIyfcp0W6ygyJsWKi2GNmq1eN3sKOw0IxH
 Tg86RENIA/rEMErNOfP45sLteMuTR7of7peCG3yumIOZqsDVYAzerpvtSgip2cvK
 aG+9HcYlBFOOOF9Dabi8GXsTBLXLfwiyjjLSpA9eXPwW8KObgiNfTZa7ujjTPvis
 4mk9oukFTFUpfhsMmI3T
 =3dBZ
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - Removal of the NPU DMA code, used by the out-of-tree Nvidia driver,
     as well as some other functions only used by drivers that haven't
     (yet?) made it upstream.

   - A fix for a bug in our handling of hardware watchpoints (eg. perf
     record -e mem: ...) which could lead to register corruption and
     kernel crashes.

   - Enable HAVE_ARCH_HUGE_VMAP, which allows us to use large pages for
     vmalloc when using the Radix MMU.

   - A large but incremental rewrite of our exception handling code to
     use gas macros rather than multiple levels of nested CPP macros.

  And the usual small fixes, cleanups and improvements.

  Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Andreas Schwab,
  Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Arnd Bergmann,
  Athira Rajeev, Cédric Le Goater, Christian Lamparter, Christophe
  Leroy, Christophe Lombard, Christoph Hellwig, Daniel Axtens, Denis
  Efremov, Enrico Weigelt, Frederic Barrat, Gautham R. Shenoy, Geert
  Uytterhoeven, Geliang Tang, Gen Zhang, Greg Kroah-Hartman, Greg Kurz,
  Gustavo Romero, Krzysztof Kozlowski, Madhavan Srinivasan, Masahiro
  Yamada, Mathieu Malaterre, Michael Neuling, Nathan Lynch, Naveen N.
  Rao, Nicholas Piggin, Nishad Kamdar, Oliver O'Halloran, Qian Cai, Ravi
  Bangoria, Sachin Sant, Sam Bobroff, Satheesh Rajendran, Segher
  Boessenkool, Shaokun Zhang, Shawn Anastasio, Stewart Smith, Suraj
  Jitindar Singh, Thiago Jung Bauermann, YueHaibing"

* tag 'powerpc-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (163 commits)
  powerpc/powernv/idle: Fix restore of SPRN_LDBAR for POWER9 stop state.
  powerpc/eeh: Handle hugepages in ioremap space
  ocxl: Update for AFU descriptor template version 1.1
  powerpc/boot: pass CONFIG options in a simpler and more robust way
  powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h
  powerpc/irq: Don't WARN continuously in arch_local_irq_restore()
  powerpc/module64: Use symbolic instructions names.
  powerpc/module32: Use symbolic instructions names.
  powerpc: Move PPC_HA() PPC_HI() and PPC_LO() to ppc-opcode.h
  powerpc/module64: Fix comment in R_PPC64_ENTRY handling
  powerpc/boot: Add lzo support for uImage
  powerpc/boot: Add lzma support for uImage
  powerpc/boot: don't force gzipped uImage
  powerpc/8xx: Add microcode patch to move SMC parameter RAM.
  powerpc/8xx: Use IO accessors in microcode programming.
  powerpc/8xx: replace #ifdefs by IS_ENABLED() in microcode.c
  powerpc/8xx: refactor programming of microcode CPM params.
  powerpc/8xx: refactor printing of microcode patch name.
  powerpc/8xx: Refactor microcode write
  powerpc/8xx: refactor writing of CPM microcode arrays
  ...
2019-07-13 16:08:36 -07:00
Linus Torvalds
ef8f3d48af Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "Am experimenting with splitting MM up into identifiable subsystems
  perhaps with a view to gitifying it in complex ways. Also with more
  verbose "incoming" emails.

  Most of MM is here and a few other trees.

  Subsystems affected by this patch series:
   - hotfixes
   - iommu
   - scripts
   - arch/sh
   - ocfs2
   - mm:slab-generic
   - mm:slub
   - mm:kmemleak
   - mm:kasan
   - mm:cleanups
   - mm:debug
   - mm:pagecache
   - mm:swap
   - mm:memcg
   - mm:gup
   - mm:pagemap
   - mm:infrastructure
   - mm:vmalloc
   - mm:initialization
   - mm:pagealloc
   - mm:vmscan
   - mm:tools
   - mm:proc
   - mm:ras
   - mm:oom-kill

  hotfixes:
      mm: vmscan: scan anonymous pages on file refaults
      mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
      mm/memcontrol: fix wrong statistics in memory.stat
      mm/z3fold.c: lock z3fold page before  __SetPageMovable()
      nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi header
      MAINTAINERS: nilfs2: update email address

  iommu:
      include/linux/dmar.h: replace single-char identifiers in macros

  scripts:
      scripts/decode_stacktrace: match basepath using shell prefix operator, not regex
      scripts/decode_stacktrace: look for modules with .ko.debug extension
      scripts/spelling.txt: drop "sepc" from the misspelling list
      scripts/spelling.txt: add spelling fix for prohibited
      scripts/decode_stacktrace: Accept dash/underscore in modules
      scripts/spelling.txt: add more spellings to spelling.txt

  arch/sh:
      arch/sh/configs/sdk7786_defconfig: remove CONFIG_LOGFS
      sh: config: remove left-over BACKLIGHT_LCD_SUPPORT
      sh: prevent warnings when using iounmap

  ocfs2:
      fs: ocfs: fix spelling mistake "hearbeating" -> "heartbeat"
      ocfs2/dlm: use struct_size() helper
      ocfs2: add last unlock times in locking_state
      ocfs2: add locking filter debugfs file
      ocfs2: add first lock wait time in locking_state
      ocfs: no need to check return value of debugfs_create functions
      fs/ocfs2/dlmglue.c: unneeded variable: "status"
      ocfs2: use kmemdup rather than duplicating its implementation

  mm:slab-generic:
    Patch series "mm/slab: Improved sanity checking":
      mm/slab: validate cache membership under freelist hardening
      mm/slab: sanity-check page type when looking up cache
      lkdtm/heap: add tests for freelist hardening

  mm:slub:
      mm/slub.c: avoid double string traverse in kmem_cache_flags()
      slub: don't panic for memcg kmem cache creation failure

  mm:kmemleak:
      mm/kmemleak.c: fix check for softirq context
      mm/kmemleak.c: change error at _write when kmemleak is disabled
      docs: kmemleak: add more documentation details

  mm:kasan:
      mm/kasan: print frame description for stack bugs
      Patch series "Bitops instrumentation for KASAN", v5:
        lib/test_kasan: add bitops tests
        x86: use static_cpu_has in uaccess region to avoid instrumentation
        asm-generic, x86: add bitops instrumentation for KASAN
      Patch series "mm/kasan: Add object validation in ksize()", v3:
        mm/kasan: introduce __kasan_check_{read,write}
        mm/kasan: change kasan_check_{read,write} to return boolean
        lib/test_kasan: Add test for double-kzfree detection
        mm/slab: refactor common ksize KASAN logic into slab_common.c
        mm/kasan: add object validation in ksize()

  mm:cleanups:
      include/linux/pfn_t.h: remove pfn_t_to_virt()
      Patch series "remove ARCH_SELECT_MEMORY_MODEL where it has no effect":
        arm: remove ARCH_SELECT_MEMORY_MODEL
        s390: remove ARCH_SELECT_MEMORY_MODEL
        sparc: remove ARCH_SELECT_MEMORY_MODEL
      mm/gup.c: make follow_page_mask() static
      mm/memory.c: trivial clean up in insert_page()
      mm: make !CONFIG_HUGE_PAGE wrappers into static inlines
      include/linux/mm_types.h: ifdef struct vm_area_struct::swap_readahead_info
      mm: remove the account_page_dirtied export
      mm/page_isolation.c: change the prototype of undo_isolate_page_range()
      include/linux/vmpressure.h: use spinlock_t instead of struct spinlock
      mm: remove the exporting of totalram_pages
      include/linux/pagemap.h: document trylock_page() return value

  mm:debug:
      mm/failslab.c: by default, do not fail allocations with direct reclaim only
      Patch series "debug_pagealloc improvements":
        mm, debug_pagelloc: use static keys to enable debugging
        mm, page_alloc: more extensive free page checking with debug_pagealloc
        mm, debug_pagealloc: use a page type instead of page_ext flag

  mm:pagecache:
      Patch series "fix filler_t callback type mismatches", v2:
        mm/filemap.c: fix an overly long line in read_cache_page
        mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page
        jffs2: pass the correct prototype to read_cache_page
        9p: pass the correct prototype to read_cache_page
      mm/filemap.c: correct the comment about VM_FAULT_RETRY

  mm:swap:
      mm, swap: fix race between swapoff and some swap operations
      mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device()
      mm, swap: use rbtree for swap_extent
      mm/mincore.c: fix race between swapoff and mincore

  mm:memcg:
      memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL
      memcg, fsnotify: no oom-kill for remote memcg charging
      mm, memcg: introduce memory.events.local
      mm: memcontrol: dump memory.stat during cgroup OOM
      Patch series "mm: reparent slab memory on cgroup removal", v7:
        mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
        mm: memcg/slab: rename slab delayed deactivation functions and fields
        mm: memcg/slab: generalize postponed non-root kmem_cache deactivation
        mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()
        mm: memcg/slab: unify SLAB and SLUB page accounting
        mm: memcg/slab: don't check the dying flag on kmem_cache creation
        mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock
        mm: memcg/slab: rework non-root kmem_cache lifecycle management
        mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages
        mm: memcg/slab: reparent memcg kmem_caches on cgroup removal
      mm, memcg: add a memcg_slabinfo debugfs file

  mm:gup:
      Patch series "switch the remaining architectures to use generic GUP", v4:
        mm: use untagged_addr() for get_user_pages_fast addresses
        mm: simplify gup_fast_permitted
        mm: lift the x86_32 PAE version of gup_get_pte to common code
        MIPS: use the generic get_user_pages_fast code
        sh: add the missing pud_page definition
        sh: use the generic get_user_pages_fast code
        sparc64: add the missing pgd_page definition
        sparc64: define untagged_addr()
        sparc64: use the generic get_user_pages_fast code
        mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP
        mm: reorder code blocks in gup.c
        mm: consolidate the get_user_pages* implementations
        mm: validate get_user_pages_fast flags
        mm: move the powerpc hugepd code to mm/gup.c
        mm: switch gup_hugepte to use try_get_compound_head
        mm: mark the page referenced in gup_hugepte
      mm/gup: speed up check_and_migrate_cma_pages() on huge page
      mm/gup.c: remove some BUG_ONs from get_gate_page()
      mm/gup.c: mark undo_dev_pagemap as __maybe_unused

  mm:pagemap:
      asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel]
      alpha: switch to generic version of pte allocation
      arm: switch to generic version of pte allocation
      arm64: switch to generic version of pte allocation
      csky: switch to generic version of pte allocation
      m68k: sun3: switch to generic version of pte allocation
      mips: switch to generic version of pte allocation
      nds32: switch to generic version of pte allocation
      nios2: switch to generic version of pte allocation
      parisc: switch to generic version of pte allocation
      riscv: switch to generic version of pte allocation
      um: switch to generic version of pte allocation
      unicore32: switch to generic version of pte allocation
      mm/pgtable: drop pgtable_t variable from pte_fn_t functions
      mm/memory.c: fail when offset == num in first check of __vm_map_pages()

  mm:infrastructure:
      mm/mmu_notifier: use hlist_add_head_rcu()

  mm:vmalloc:
      Patch series "Some cleanups for the KVA/vmalloc", v5:
        mm/vmalloc.c: remove "node" argument
        mm/vmalloc.c: preload a CPU with one object for split purpose
        mm/vmalloc.c: get rid of one single unlink_va() when merge
        mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
      mm/vmalloc.c: spelling> s/informaion/information/

  mm:initialization:
      mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
      mm/large system hash: clear hashdist when only one node with memory is booted

  mm:pagealloc:
      arm64: move jump_label_init() before parse_early_param()
      Patch series "add init_on_alloc/init_on_free boot options", v10:
        mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
        mm: init: report memory auto-initialization features at boot time

  mm:vmscan:
      mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
      mm: vmscan: correct some vmscan counters for THP swapout

  mm:tools:
      tools/vm/slabinfo: order command line options
      tools/vm/slabinfo: add partial slab listing to -X
      tools/vm/slabinfo: add option to sort by partial slabs
      tools/vm/slabinfo: add sorting info to help menu

  mm:proc:
      proc: use down_read_killable mmap_sem for /proc/pid/maps
      proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
      proc: use down_read_killable mmap_sem for /proc/pid/pagemap
      proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
      proc: use down_read_killable mmap_sem for /proc/pid/map_files
      mm: use down_read_killable for locking mmap_sem in access_remote_vm
      mm: smaps: split PSS into components
      mm: vmalloc: show number of vmalloc pages in /proc/meminfo

  mm:ras:
      mm/memory-failure.c: clarify error message

  mm:oom-kill:
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
      mm, oom: refactor dump_tasks for memcg OOMs
      mm, oom: remove redundant task_in_mem_cgroup() check
      oom: decouple mems_allowed from oom_unkillable_task
      mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()"

* akpm: (147 commits)
  mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()
  oom: decouple mems_allowed from oom_unkillable_task
  mm, oom: remove redundant task_in_mem_cgroup() check
  mm, oom: refactor dump_tasks for memcg OOMs
  mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
  mm/memory-failure.c: clarify error message
  mm: vmalloc: show number of vmalloc pages in /proc/meminfo
  mm: smaps: split PSS into components
  mm: use down_read_killable for locking mmap_sem in access_remote_vm
  proc: use down_read_killable mmap_sem for /proc/pid/map_files
  proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
  proc: use down_read_killable mmap_sem for /proc/pid/pagemap
  proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
  proc: use down_read_killable mmap_sem for /proc/pid/maps
  tools/vm/slabinfo: add sorting info to help menu
  tools/vm/slabinfo: add option to sort by partial slabs
  tools/vm/slabinfo: add partial slab listing to -X
  tools/vm/slabinfo: order command line options
  mm: vmscan: correct some vmscan counters for THP swapout
  mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
  ...
2019-07-12 11:40:28 -07:00
Christoph Hellwig
cbd34da7dc mm: move the powerpc hugepd code to mm/gup.c
While only powerpc supports the hugepd case, the code is pretty generic
and I'd like to keep all GUP internals in one place.

Link: http://lkml.kernel.org/r/20190625143715.1689-15-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:45 -07:00
Linus Torvalds
97ff4ca46d Char / Misc driver patches for 5.3-rc1
Here is the "large" pull request for char and misc and other assorted
 smaller driver subsystems for 5.3-rc1.
 
 It seems that this tree is becoming the funnel point of lots of smaller
 driver subsystems, which is fine for me, but that's why it is getting
 larger over time and does not just contain stuff under drivers/char/ and
 drivers/misc.
 
 Lots of small updates all over the place here from different driver
 subsystems:
   - habana driver updates
   - coresight driver updates
   - documentation file movements and updates
   - Android binder fixes and updates
   - extcon driver updates
   - google firmware driver updates
   - fsi driver updates
   - smaller misc and char driver updates
   - soundwire driver updates
   - nvmem driver updates
   - w1 driver fixes
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSXmoQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylV9wCgyJGbpPch8v/ecrZGFHYS4sIMexIAoMco3zf6
 wnqFmXiz1O0tyo1sgV9R
 =7sqO
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char / misc driver updates from Greg KH:
 "Here is the "large" pull request for char and misc and other assorted
  smaller driver subsystems for 5.3-rc1.

  It seems that this tree is becoming the funnel point of lots of
  smaller driver subsystems, which is fine for me, but that's why it is
  getting larger over time and does not just contain stuff under
  drivers/char/ and drivers/misc.

  Lots of small updates all over the place here from different driver
  subsystems:
   - habana driver updates
   - coresight driver updates
   - documentation file movements and updates
   - Android binder fixes and updates
   - extcon driver updates
   - google firmware driver updates
   - fsi driver updates
   - smaller misc and char driver updates
   - soundwire driver updates
   - nvmem driver updates
   - w1 driver fixes

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (188 commits)
  coresight: Do not default to CPU0 for missing CPU phandle
  dt-bindings: coresight: Change CPU phandle to required property
  ocxl: Allow contexts to be attached with a NULL mm
  fsi: sbefifo: Don't fail operations when in SBE IPL state
  coresight: tmc: Smatch: Fix potential NULL pointer dereference
  coresight: etm3x: Smatch: Fix potential NULL pointer dereference
  coresight: Potential uninitialized variable in probe()
  coresight: etb10: Do not call smp_processor_id from preemptible
  coresight: tmc-etf: Do not call smp_processor_id from preemptible
  coresight: tmc-etr: alloc_perf_buf: Do not call smp_processor_id from preemptible
  coresight: tmc-etr: Do not call smp_processor_id() from preemptible
  docs: misc-devices: convert files without extension to ReST
  fpga: dfl: fme: align PR buffer size per PR datawidth
  fpga: dfl: fme: remove copy_to_user() in ioctl for PR
  fpga: dfl-fme-mgr: fix FME_PR_INTFC_ID register address.
  intel_th: msu: Start read iterator from a non-empty window
  intel_th: msu: Split sgt array and pointer in multiwindow mode
  intel_th: msu: Support multipage blocks
  intel_th: pci: Add Ice Lake NNPI support
  intel_th: msu: Fix single mode with disabled IOMMU
  ...
2019-07-11 15:34:05 -07:00
Linus Torvalds
5ad18b2e60 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
 "A source of error over the years has been that force_sig has taken a
  task parameter when it is only safe to use force_sig with the current
  task.

  The force_sig function is built for delivering synchronous signals
  such as SIGSEGV where the userspace application caused a synchronous
  fault (such as a page fault) and the kernel responded with a signal.

  Because the name force_sig does not make this clear, and because the
  force_sig takes a task parameter the function force_sig has been
  abused for sending other kinds of signals over the years. Slowly those
  have been fixed when the oopses have been tracked down.

  This set of changes fixes the remaining abusers of force_sig and
  carefully rips out the task parameter from force_sig and friends
  making this kind of error almost impossible in the future"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
  signal: Remove the signal number and task parameters from force_sig_info
  signal: Factor force_sig_info_to_task out of force_sig_info
  signal: Generate the siginfo in force_sig
  signal: Move the computation of force into send_signal and correct it.
  signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
  signal: Remove the task parameter from force_sig_fault
  signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
  signal: Explicitly call force_sig_fault on current
  signal/unicore32: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from ptrace_break
  signal/nds32: Remove tsk parameter from send_sigtrap
  signal/riscv: Remove tsk parameter from do_trap
  signal/sh: Remove tsk parameter from force_sig_info_fault
  signal/um: Remove task parameter from send_sigtrap
  signal/x86: Remove task parameter from send_sigtrap
  signal: Remove task parameter from force_sig_mceerr
  signal: Remove task parameter from force_sig
  signal: Remove task parameter from force_sigsegv
  ...
2019-07-08 21:48:15 -07:00
Christophe Leroy
1cfb725fb1 powerpc/64: flush_inval_dcache_range() becomes flush_dcache_range()
On most arches having function flush_dcache_range(), including PPC32,
this function does a writeback and invalidation of the cache bloc.

On PPC64, flush_dcache_range() only does a writeback while
flush_inval_dcache_range() does the invalidation in addition.

In addition it looks like within arch/powerpc/, there are no PPC64
platforms using flush_dcache_range()

This patch drops the existing 64 bits version of flush_dcache_range()
and renames flush_inval_dcache_range() into flush_dcache_range().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 01:35:10 +10:00
Aneesh Kumar K.V
ac25ba68fa powerpc/mm/hugetlb: Don't enable HugeTLB if we don't have a page table cache
This makes sure we don't enable HugeTLB if the cache is not configured.
I am still not sure about this. IMHO hugetlb support should be a hardware
support derivative and any cache allocation failure should be handled as I did
in the earlier patch. But then if we were not able to create hugetlb page table
cache, we can as well declare hugetlb support disabled thereby avoiding calling
into allocation routines.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:48:01 +10:00
Aneesh Kumar K.V
5d49275a27 powerpc/mm/hugetlb: Fix kernel crash if we fail to allocate page table caches
We only check for hugetlb allocations, because with hugetlb we do conditional
registration. For PGD/PUD/PMD levels we register them always in
pgtable_cache_init.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:48:00 +10:00
Aneesh Kumar K.V
2230ebf6e6 powerpc/mm: Handle page table allocation failures
This fixes kernel crash that arises due to not handling page table allocation
failures while allocating hugetlb page table.

Fixes: e2b3d202d1 ("powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:47:59 +10:00
Aneesh Kumar K.V
1ecf2cdc74 powerpc/mm: pmd_devmap implies pmd_large().
large devmap usage is dependent on THP. Hence once check is sufficient.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:43:59 +10:00
Aneesh Kumar K.V
d6eacedd1f powerpc/book3s: Use config independent helpers for page table walk
Even when we have HugeTLB and THP disabled, kernel linear map can still be
mapped with hugepages. This is only an issue with radix translation because hash
MMU doesn't map kernel linear range in linux page table and other kernel
map areas are not mapped using hugepage.

Add config independent helpers and put WARN_ON() when we don't expect things
to be mapped via hugepages.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:43:50 +10:00
Aneesh Kumar K.V
c0b1b23b9c powerpc/mm/nvdimm: Add an informative message if we fail to allocate altmap block
Allocation from altmap area can fail based on vmemmap page size used.
Add kernel info message to indicate the failure. That allows the user
to identify whether they are really using persistent memory reserved
space for per-page metadata.

The message looks like:
  [  136.587212] altmap block allocation failed, falling back to system memory

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:32:57 +10:00
Aneesh Kumar K.V
495c2ff4c8 powerpc/mm: Consolidate numa_enable check and min_common_depth check
If we fail to parse min_common_depth from device tree we boot with
numa disabled. Reflect the same by updating numa_enabled variable
to false. Also, switch all min_common_depth failure check to
if (!numa_enabled) check.

This helps us to avoid checking for both in different code paths.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:27:33 +10:00
Aneesh Kumar K.V
f52741c410 powerpc/mm: Fix node look up with numa=off boot
If we boot with numa=off, we need to make sure we return NUMA_NO_NODE when
looking up associativity details of resources. Without this, we hit crash
like below

BUG: Unable to handle kernel data access at 0x40000000008
Faulting instruction address: 0xc000000008f31704
cpu 0x1b: Vector: 380 (Data SLB Access) at [c00000000b9bb320]
    pc: c000000008f31704: _raw_spin_lock+0x14/0x100
    lr: c0000000083f41fc: ____cache_alloc_node+0x5c/0x290
    sp: c00000000b9bb5b0
   msr: 800000010280b033
   dar: 40000000008
  current = 0xc00000000b9a2700
  paca    = 0xc00000000a740c00   irqmask: 0x03   irq_happened: 0x01
    pid   = 1, comm = swapper/27
Linux version 5.2.0-rc4-00925-g74e188c620b1 (root@linux-d8ip) (gcc version 7.4.1 20190424 [gcc-7-branch revision 270538] (SUSE Linux)) #34 SMP Sat Jun 29 00:41:02 EDT 2019
enter ? for help
[link register   ] c0000000083f41fc ____cache_alloc_node+0x5c/0x290
[c00000000b9bb5b0] 0000000000000dc0 (unreliable)
[c00000000b9bb5f0] c0000000083f48c8 kmem_cache_alloc_node_trace+0x138/0x360
[c00000000b9bb670] c000000008aa789c devres_alloc_node+0x4c/0xa0
[c00000000b9bb6a0] c000000008337218 devm_memremap+0x58/0x130
[c00000000b9bb6f0] c000000008aed00c devm_nsio_enable+0xdc/0x170
[c00000000b9bb780] c000000008af3b6c nd_pmem_probe+0x4c/0x180
[c00000000b9bb7b0] c000000008ad84cc nvdimm_bus_probe+0xac/0x260
[c00000000b9bb840] c000000008aa0628 really_probe+0x148/0x500
[c00000000b9bb8d0] c000000008aa0d7c driver_probe_device+0x19c/0x1d0
[c00000000b9bb950] c000000008aa11bc device_driver_attach+0xcc/0x100
[c00000000b9bb990] c000000008aa12ec __driver_attach+0xfc/0x1e0
[c00000000b9bba10] c000000008a9d0a4 bus_for_each_dev+0xb4/0x130
[c00000000b9bba70] c000000008a9fc04 driver_attach+0x34/0x50
[c00000000b9bba90] c000000008a9f118 bus_add_driver+0x1d8/0x300
[c00000000b9bbb20] c000000008aa2358 driver_register+0x98/0x1a0
[c00000000b9bbb90] c000000008ad7e6c __nd_driver_register+0x5c/0x100
[c00000000b9bbbf0] c0000000093efbac nd_pmem_driver_init+0x34/0x48
[c00000000b9bbc10] c0000000080106c0 do_one_initcall+0x60/0x2d0
[c00000000b9bbce0] c00000000938463c kernel_init_freeable+0x384/0x48c
[c00000000b9bbdb0] c000000008010a5c kernel_init+0x2c/0x160
[c00000000b9bbe20] c00000000800ba54 ret_from_kernel_thread+0x5c/0x68

Reported-and-debugged-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:27:32 +10:00
Aneesh Kumar K.V
ea9f5b702f powerpc/mm/drconf: Use NUMA_NO_NODE on failures instead of node 0
If we fail to parse the associativity array we should default to
NUMA_NO_NODE instead of NODE 0. Rest of the code fallback to the
right default if we find the numa node value NUMA_NO_NODE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:27:31 +10:00
Aneesh Kumar K.V
89a3496e06 powerpc/mm/radix: Use the right page size for vmemmap mapping
We use mmu_vmemmap_psize to find the page size for mapping the vmmemap area.
With radix translation, we are suboptimally setting this value to PAGE_SIZE.

We do check for 2M page size support and update mmu_vmemap_psize to use
hugepage size but we suboptimally reset the value to PAGE_SIZE in
radix__early_init_mmu(). This resulted in always mapping vmemmap area with
64K page size.

Fixes: 2bfd65e45e ("powerpc/mm/radix: Add radix callbacks for early init routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:27:21 +10:00
Aneesh Kumar K.V
78c9498885 powerpc/mm/hash/4k: Don't use 64K page size for vmemmap with 4K pagesize
With hash translation and 4K PAGE_SIZE config, we need to make sure we don't
use 64K page size for vmemmap.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05 00:27:20 +10:00
Naveen N. Rao
d62c8deeb6 powerpc/pseries: Provide vcpu dispatch statistics
For Shared Processor LPARs, the POWER Hypervisor maintains a
relatively static mapping of the LPAR processors (vcpus) to physical
processor chips (representing the "home" node) and tries to always
dispatch vcpus on their associated physical processor chip. However,
under certain scenarios, vcpus may be dispatched on a different
processor chip (away from its home node). The actual physical
processor number on which a certain vcpu is dispatched is available to
the guest in the 'processor_id' field of each DTL entry.

The guest can discover the home node of each vcpu through the
H_HOME_NODE_ASSOCIATIVITY(flags=1) hcall. The guest can also discover
the associativity of physical processors, as represented in the DTL
entry, through the H_HOME_NODE_ASSOCIATIVITY(flags=2) hcall.

These can then be compared to determine if the vcpu was dispatched on
its home node or not. If the vcpu was not dispatched on the home node,
it is possible to determine if the vcpu was dispatched in a different
chip, socket or drawer.

Introduce a procfs file /proc/powerpc/vcpudispatch_stats that can be
used to obtain these statistics. Writing '1' to this file enables
collecting the statistics, while writing '0' disables the statistics.
The statistics themselves are available by reading the procfs file. By
default, the DTLB log for each vcpu is processed 50 times a second so
as not to miss any entries. This processing frequency can be changed
through /proc/powerpc/vcpudispatch_stats_freq.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-04 22:27:09 +10:00
Naveen N. Rao
5a1ea4774d powerpc/pseries: Move mm/book3s64/vphn.c under platforms/pseries/
hcall_vphn() is specific to pseries and will be used in a subsequent
patch. So, move it to a more appropriate place under
arch/powerpc/platforms/pseries. Also merge vphn.h into lppaca.h
and update vphn selftest to use the new files.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-04 22:23:38 +10:00
Naveen N. Rao
ef34e0efa2 powerpc/pseries: Generalize hcall_vphn()
H_HOME_NODE_ASSOCIATIVITY hcall can take two different flags and return
different associativity information in each case. Generalize the
existing hcall_vphn() function to take flags as an argument and to
return the result. Update the only existing user to pass the proper
arguments.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-04 22:23:38 +10:00
Alastair D'Silva
60e8523e2e ocxl: Allow contexts to be attached with a NULL mm
If an OpenCAPI context is to be used directly by a kernel driver, there
may not be a suitable mm to use.

The patch makes the mm parameter to ocxl_context_attach optional.

Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20190620041203.12274-1-alastair@au1.ibm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 21:29:47 +02:00
Masahiro Yamada
6d3ca7e736 powerpc/mm: mark more tlb functions as __always_inline
With CONFIG_OPTIMIZE_INLINING enabled, Laura Abbott reported error
with gcc 9.1.1:

  arch/powerpc/mm/book3s64/radix_tlb.c: In function '_tlbiel_pid':
  arch/powerpc/mm/book3s64/radix_tlb.c:104:2: warning: asm operand 3 probably doesn't match constraints
    104 |  asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
        |  ^~~
  arch/powerpc/mm/book3s64/radix_tlb.c:104:2: error: impossible constraint in 'asm'

Fixing _tlbiel_pid() is enough to address the warning above, but I
inlined more functions to fix all potential issues.

To meet the "i" (immediate) constraint for the asm operands, functions
propagating "ric" must be always inlined.

Fixes: 9012d01166 ("compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING")
Reported-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-03 20:49:50 +10:00
Nicholas Piggin
6c46fcce39 powerpc/64s/radix: keep kernel ERAT over local process/guest invalidates
ISA v3.0 radix modes provide SLBIA variants which can invalidate ERAT
for effPID!=0 or for effLPID!=0, which allows user and guest
invalidations to retain kernel/host ERAT entries.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-03 15:19:35 +10:00
Nicholas Piggin
fe7946ce08 powerpc/64s: Rename PPC_INVALIDATE_ERAT to PPC_ISA_3_0_INVALIDATE_ERAT
This makes it clear to the caller that it can only be used on POWER9
and later CPUs.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Use "ISA_3_0" rather than "ARCH_300"]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-03 15:19:35 +10:00
Christoph Hellwig
514caf23a7 memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
Add a flags field to struct dev_pagemap to replace the altmap_valid
boolean to be a little more extensible.  Also add a pgmap_altmap() helper
to find the optional altmap and clean up the code using the altmap using
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
7eb3cf7619 powerpc/powernv: remove unused NPU DMA code
None of these routines were ever used anywhere in the kernel tree
since they were added to the kernel.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-01 16:26:55 +10:00
Michael Ellerman
8b8dc69514 Merge branch 'fixes' into next
Merge our fixes branch into next, this brings in a number of commits
that fix bugs we don't want to hit in next, in particular the fix for
CVE-2019-12817.
2019-07-01 14:04:39 +10:00
Michael Ellerman
b7cbb52401 powerpc fixes for 5.2 #6
One fix for a bug in our context id handling on 64-bit hash CPUs, which can lead
 to unrelated processes being able to read/write to each other's virtual memory.
 See the commit for full details.
 
 That is the fix for CVE-2019-12817.
 
 This also adds a kernel selftest for the bug.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdCheiAAoJEFHr6jzI4aWADMcP/3gC9mVintc5iFU+bi7O73d6
 ClHLkL7fqRsAiRthUVpRo6M8kdmKXnOy+Tqoy5dnJPmCTfjVIQzhEBwuHToaj9qs
 IaJKXrJFAg6ou2xcMjnyBk8CfPAKVPDDYKU2YcM8ODsFbketeKykRfNliw/91Z4t
 /cViOHGBY/oxlq4/MqG6n+OvYBf1c2/gqW25uG+gJzVEM/reCViHLj6Veqa6Cu0i
 9H4cNi4yE4aUsApqmNlJi4zJ0SMkwTOU1cRObQyUaK1njDUuIBp5IgGw2TxkThAq
 RXcsv14VwV+AGxkAkHEmc3rLvcL0P1E04J9HINBcVpShfGR5y3oUaxGsKhNgStLl
 Rex77/LBkVaV86pWvJTWVOcGz61EYu8/3Yh02zkzOlfMuVd6QjJhRGmnW55/Ntsz
 EOp93yXjRZycm6EZQvcITlFSUZ44htj9awK2xUvDHEPUIi+wkehjyq/F4ORCnxxH
 8kV6ZSNXsTZFYgHv8DOTortn9bGV9lEnFYn0wWCoej38gXQNb5ryYpSRuoOw5n5O
 cU+4z/Y9pHfrOzQpJxHLXQdhSGfoqNIxTHwDigxoBgGXRx/hdZWAsXP7AssFrTlJ
 V6p1VtKIdAhwmrSnTqTD0zFx0A3dunuhtNRgfzppvKVrcL4fJQyi3V0juUCigYJu
 Kv9LG+KrWZCfeQVp8kAf
 =y5oH
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.2-6' into fixes

This merges the commits that were the fix for CVE-2019-12817, which was
developed under embargo. They have already been merged by Linus

Merge them into fixes now so that this branch contains all the fixes for
this release.
2019-07-01 13:59:40 +10:00
Linus Torvalds
26df62aaae powerpc fixes for 5.2 #6
One fix for a bug in our context id handling on 64-bit hash CPUs, which can lead
 to unrelated processes being able to read/write to each other's virtual memory.
 See the commit for full details.
 
 That is the fix for CVE-2019-12817.
 
 This also adds a kernel selftest for the bug.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdCheiAAoJEFHr6jzI4aWADMcP/3gC9mVintc5iFU+bi7O73d6
 ClHLkL7fqRsAiRthUVpRo6M8kdmKXnOy+Tqoy5dnJPmCTfjVIQzhEBwuHToaj9qs
 IaJKXrJFAg6ou2xcMjnyBk8CfPAKVPDDYKU2YcM8ODsFbketeKykRfNliw/91Z4t
 /cViOHGBY/oxlq4/MqG6n+OvYBf1c2/gqW25uG+gJzVEM/reCViHLj6Veqa6Cu0i
 9H4cNi4yE4aUsApqmNlJi4zJ0SMkwTOU1cRObQyUaK1njDUuIBp5IgGw2TxkThAq
 RXcsv14VwV+AGxkAkHEmc3rLvcL0P1E04J9HINBcVpShfGR5y3oUaxGsKhNgStLl
 Rex77/LBkVaV86pWvJTWVOcGz61EYu8/3Yh02zkzOlfMuVd6QjJhRGmnW55/Ntsz
 EOp93yXjRZycm6EZQvcITlFSUZ44htj9awK2xUvDHEPUIi+wkehjyq/F4ORCnxxH
 8kV6ZSNXsTZFYgHv8DOTortn9bGV9lEnFYn0wWCoej38gXQNb5ryYpSRuoOw5n5O
 cU+4z/Y9pHfrOzQpJxHLXQdhSGfoqNIxTHwDigxoBgGXRx/hdZWAsXP7AssFrTlJ
 V6p1VtKIdAhwmrSnTqTD0zFx0A3dunuhtNRgfzppvKVrcL4fJQyi3V0juUCigYJu
 Kv9LG+KrWZCfeQVp8kAf
 =y5oH
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.2-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix for a bug in our context id handling on 64-bit hash CPUs,
  which can lead to unrelated processes being able to read/write to each
  other's virtual memory. See the commit for full details.

  That is the fix for CVE-2019-12817.

  This also adds a kernel selftest for the bug"

* tag 'powerpc-5.2-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  selftests/powerpc: Add test of fork with mapping above 512TB
  powerpc/mm/64s/hash: Reallocate context ids on fork
2019-06-24 21:20:39 +08:00
Linus Torvalds
a8282bf087 powerpc fixes for 5.2 #5
Seven fixes, all for bugs introduced this cycle.
 
 The commit to add KASAN support broke booting on 32-bit SMP machines, due to a
 refactoring that moved some setup out of the secondary CPU path.
 
 A fix for another 32-bit SMP bug introduced by the fast syscall entry
 implementation for 32-bit BOOKE. And a build fix for the same commit.
 
 Our change to allow the DAWR to be force enabled on Power9 introduced a bug in
 KVM, where we clobber r3 leading to a host crash.
 
 The same commit also exposed a previously unreachable bug in the nested KVM
 handling of DAWR, which could lead to an oops in a nested host.
 
 One of the DMA reworks broke the b43legacy WiFi driver on some people's
 powermacs, fix it by enabling a 30-bit ZONE_DMA on 32-bit.
 
 A fix for TLB flushing in KVM introduced a new bug, as it neglected to also
 flush the ERAT, this could lead to memory corruption in the guest.
 
 Thanks to:
   Aaro Koskinen, Christoph Hellwig, Christophe Leroy, Larry Finger, Michael
   Neuling, Suraj Jitindar Singh.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdDg4YAAoJEFHr6jzI4aWACKYP/RG1cqYDjEWz0N9bxjAOanx6
 z//hPZqZrObEORx0mek07LNj6JDy4eL7CB9WaEudJjHt7mYugLYq0g7hUMVvBWnB
 irFEuzGJ8EgWl1aMbmz+fgf49PBIuroy2o/4pyzzQXoDaw44QyUaCke2VEBskQNG
 RW64C2rDVrPgpRHzBB9EZVNv7svmo6ERJsEpRvqP3PZG1ZxgXW+DXbEdSmJCcgAt
 8oI+z6frRv+0ez+nge7TULo8DuheShfxc7l0jFrd48i35v2qB/IowPr8cof9fRwM
 TqnB+3dZXHPKPz6J9mz80p9ZDe1omLzg6i9EiR2/7a3XGpRBo7kCg3Iri7N5pu0j
 LotK9l1+mXWLy5P6lOHH5/tEHv52Wqsvh5IetpNJ2tgXp3MzbOc1/Ut9h7Ag7cRw
 WRa7tNXQ5Ud8uPM1Pds8Ymhd+/nZ9RItjGcu6S095/OGpM1FJR9a0QnfUHMyfyuX
 kAGrJDWcAkCd/Q9tKHsQotuZAFmRCQe4JFkzTiGzwdjYWYgtTA1c/eIv3+SG7eLV
 1dsaIYzIS56b+Qz2Qc/pKHwho+I9o505Y7LFXxlCGXDDjyI72ioTQDwiSBzaZdc9
 ORwNchLfpXNpiNXRoRqAnqmhWxavYmA6oJ13RDBiMBxIUWHynVbEzLlX9fPNdBFj
 Kw3Zd15znokXBzU+1mDE
 =Ju1y
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "This is a frustratingly large batch at rc5. Some of these were sent
  earlier but were missed by me due to being distracted by other things,
  and some took a while to track down due to needing manual bisection on
  old hardware. But still we clearly need to improve our testing of KVM,
  and of 32-bit, so that we catch these earlier.

  Summary: seven fixes, all for bugs introduced this cycle.

   - The commit to add KASAN support broke booting on 32-bit SMP
     machines, due to a refactoring that moved some setup out of the
     secondary CPU path.

   - A fix for another 32-bit SMP bug introduced by the fast syscall
     entry implementation for 32-bit BOOKE. And a build fix for the same
     commit.

   - Our change to allow the DAWR to be force enabled on Power9
     introduced a bug in KVM, where we clobber r3 leading to a host
     crash.

   - The same commit also exposed a previously unreachable bug in the
     nested KVM handling of DAWR, which could lead to an oops in a
     nested host.

   - One of the DMA reworks broke the b43legacy WiFi driver on some
     people's powermacs, fix it by enabling a 30-bit ZONE_DMA on 32-bit.

   - A fix for TLB flushing in KVM introduced a new bug, as it neglected
     to also flush the ERAT, this could lead to memory corruption in the
     guest.

  Thanks to: Aaro Koskinen, Christoph Hellwig, Christophe Leroy, Larry
  Finger, Michael Neuling, Suraj Jitindar Singh"

* tag 'powerpc-5.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
  powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac
  KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode
  KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
  powerpc/32: fix build failure on book3e with KVM
  powerpc/booke: fix fast syscall entry on SMP
  powerpc/32s: fix initial setup of segment registers on secondary CPU
2019-06-22 09:09:42 -07:00
Thomas Gleixner
d2912cb15b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:55 +02:00
Christoph Hellwig
9739ab7eda powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac
With the strict dma mask checking introduced with the switch to
the generic DMA direct code common wifi chips on 32-bit powerbooks
stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
to allow them to reliably allocate dma coherent memory.

Fixes: 65a21b71f9 ("powerpc/dma: remove dma_nommu_dma_supported")
Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Acked-by: Larry Finger <Larry.Finger@lwfinger.net>
Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-19 22:31:20 +10:00
Nicholas Piggin
d909f9109c powerpc/64s/radix: Enable HAVE_ARCH_HUGE_VMAP
This sets the HAVE_ARCH_HUGE_VMAP option, and defines the required
page table functions.

This enables huge (2MB and 1GB) ioremap mappings. I don't have a
benchmark for this change, but huge vmap will be used by a later core
kernel change to enable huge vmalloc memory mappings. This improves
cached `git diff` performance by about 5% on a 2-node POWER9 with 32MB
size dentry cache hash.

  Profiling git diff dTLB misses with a vanilla kernel:

  81.75%  git      [kernel.vmlinux]    [k] __d_lookup_rcu
   7.21%  git      [kernel.vmlinux]    [k] strncpy_from_user
   1.77%  git      [kernel.vmlinux]    [k] find_get_entry
   1.59%  git      [kernel.vmlinux]    [k] kmem_cache_free

            40,168      dTLB-miss
       0.100342754 seconds time elapsed

  With powerpc huge vmalloc:

             2,987      dTLB-miss
       0.095933138 seconds time elapsed

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-19 20:05:09 +10:00
Nicholas Piggin
d38153f9cc powerpc/64s/radix: ioremap use ioremap_page_range
Radix can use ioremap_page_range for ioremap, after slab is available.
This makes it possible to enable huge ioremap mapping support.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-19 20:05:09 +10:00
Nicholas Piggin
a72808a7ec powerpc/64: __ioremap_at clean up in the error case
__ioremap_at error handling is wonky, it requires caller to clean up
after it. Implement a helper that does the map and error cleanup and
remove the requirement from the caller.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-19 20:05:09 +10:00
Andreas Schwab
46c2478af6 powerpc/mm/32s: fix condition that is always true
Move a misplaced paren that makes the condition always true.

Fixes: 63b2bc6195 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-19 20:05:08 +10:00
Linus Torvalds
fa1827d773 powerpc fixes for 5.2 #4
One fix for a regression introduced by our 32-bit KASAN support, which broke
 booting on machines with "bootx" early debugging enabled.
 
 A fix for a bug which broke kexec on 32-bit, introduced by changes to the 32-bit
 STRICT_KERNEL_RWX support in v5.1.
 
 Finally two fixes going to stable for our THP split/collapse handling,
 discovered by Nick. The first fixes random crashes and/or corruption in guests
 under sufficient load.
 
 Thanks to:
   Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu Malaterre.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdBOivAAoJEFHr6jzI4aWA/T0P/1pmr4JLWzXOPQOGOwS2mmu5
 HhXBFuDflpGiWS31syOhfKhiE2qlwdIcGaclSo1wAgUnMKp+sxVEagF6DEt484r3
 DXs3eRyrGu5vQT7Q6yReuT3Kw2ZR474a5ob00WGAQosBKyJF4ZHWz16ETVWMdAMQ
 TknEEU3hOUnMWWIEvLnZOKT7eJcmzj5IYy1OtLHBjWiHHizGC8PSxdVhiRcD/O6R
 F6C7XrFb7RRj5ran6gxwMbcTvjgu922TSQPOCw93qnXYWLfvWDUXC4yCqY21oHnr
 b3zgJNgIdSoMYxE8pfOH7Y+eaJrbzgnhlS5OJNEz/4NOfGQnJXYcSF8QO6eeVoKM
 L2SkT1Ov+QMmZQjMC5e9OAe7DFHfM59RYFg11eaUqfiaObRsmwu8rqjngITV5Ede
 Ydq3W39XQkjB3aQ4qb0MnEBbVgyQ/y6/T5hoHRlvnb5byFk5Pd3jpub56sq87UGM
 M4GD3YdmT5eBqzGxApddyIiS839PZpdw7g/Ivtp2GYCtiNNZcJqaXvN7IwfcVF3c
 YJrCfhNTUJPjICIL9k9v2RQovGuGZqtM3BHc8PmyVvDxTqtEbh8B9GEg+QC58xp/
 s+UI2sEd/Vg6NVKluIJHau7aEmJlaxYIshqsIhYvPgJRN6KrFMhdfPNx7D+zMlu8
 nbM6Z1n2VFk9Cxb0vA9K
 =MXJI
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix for a regression introduced by our 32-bit KASAN support, which
  broke booting on machines with "bootx" early debugging enabled.

  A fix for a bug which broke kexec on 32-bit, introduced by changes to
  the 32-bit STRICT_KERNEL_RWX support in v5.1.

  Finally two fixes going to stable for our THP split/collapse handling,
  discovered by Nick. The first fixes random crashes and/or corruption
  in guests under sufficient load.

  Thanks to: Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu
  Malaterre"

* tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX
  powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()
  powerpc/64s: Fix THP PMD collapse serialisation
  powerpc: Fix kexec failure on book3s/32
2019-06-15 07:29:32 -10:00
Michael Ellerman
65565a68c5 Merge branch 'context-id-fix' into fixes
This merges a fix for a bug in our context id handling on 64-bit hash
CPUs.

The fix was written against v5.1 to ease backporting to stable
releases. Here we are merging it up to a v5.2-rc2 base, which involves
a bit of manual resolution.

It also adds a test case for the bug.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-13 15:00:34 +10:00
Michael Ellerman
ca72d88378 powerpc/mm/64s/hash: Reallocate context ids on fork
When using the Hash Page Table (HPT) MMU, userspace memory mappings
are managed at two levels. Firstly in the Linux page tables, much like
other architectures, and secondly in the SLB (Segment Lookaside
Buffer) and HPT. It's the SLB and HPT that are actually used by the
hardware to do translations.

As part of the series adding support for 4PB user virtual address
space using the hash MMU, we added support for allocating multiple
"context ids" per process, one for each 512TB chunk of address space.
These are tracked in an array called extended_id in the mm_context_t
of a process that has done a mapping above 512TB.

If such a process forks (ie. clone(2) without CLONE_VM set) it's mm is
copied, including the mm_context_t, and then init_new_context() is
called to reinitialise parts of the mm_context_t as appropriate to
separate the address spaces of the two processes.

The key step in ensuring the two processes have separate address
spaces is to allocate a new context id for the process, this is done
at the beginning of hash__init_new_context(). If we didn't allocate a
new context id then the two processes would share mappings as far as
the SLB and HPT are concerned, even though their Linux page tables
would be separate.

For mappings above 512TB, which use the extended_id array, we
neglected to allocate new context ids on fork, meaning the parent and
child use the same ids and therefore share those mappings even though
they're supposed to be separate. This can lead to the parent seeing
writes done by the child, which is essentially memory corruption.

There is an additional exposure which is that if the child process
exits, all its context ids are freed, including the context ids that
are still in use by the parent for mappings above 512TB. One or more
of those ids can then be reallocated to a third process, that process
can then read/write to the parent's mappings above 512TB. Additionally
if the freed id is used for the third process's primary context id,
then the parent is able to read/write to the third process's mappings
*below* 512TB.

All of these are fundamental failures to enforce separation between
processes. The only mitigating factor is that the bug only occurs if a
process creates mappings above 512TB, and most applications still do
not create such mappings.

Only machines using the hash page table MMU are affected, eg. PowerPC
970 (G5), PA6T, Power5/6/7/8/9. By default Power9 bare metal machines
(powernv) use the Radix MMU and are not affected, unless the machine
has been explicitly booted in HPT mode (using disable_radix on the
kernel command line). KVM guests on Power9 may be affected if the host
or guest is configured to use the HPT MMU. LPARs under PowerVM on
Power9 are affected as they always use the HPT MMU. Kernels built with
PAGE_SIZE=4K are not affected.

The fix is relatively simple, we need to reallocate context ids for
all extended mappings on fork.

Fixes: f384796c40 ("powerpc/mm: Add support for handling > 512TB address in SLB miss")
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-12 23:35:07 +10:00
Nicholas Piggin
a00196a272 powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()
The change to pmdp_invalidate() to mark the pmd with _PAGE_INVALID
broke the synchronisation against lock free lookups,
__find_linux_pte()'s pmd_none() check no longer returns true for such
cases.

Fix this by adding a check for this condition as well.

Fixes: da7ad366b4 ("powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit")
Cc: stable@vger.kernel.org # v4.20+
Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-07 16:28:28 +10:00
Nicholas Piggin
33258a1db1 powerpc/64s: Fix THP PMD collapse serialisation
Commit 1b2443a547 ("powerpc/book3s64: Avoid multiple endian
conversion in pte helpers") changed the actual bitwise tests in
pte_access_permitted by using pte_write() and pte_present() helpers
rather than raw bitwise testing _PAGE_WRITE and _PAGE_PRESENT bits.

The pte_present() change now returns true for PTEs which are
!_PAGE_PRESENT and _PAGE_INVALID, which is the combination used by
pmdp_invalidate() to synchronize access from lock-free lookups.
pte_access_permitted() is used by pmd_access_permitted(), so allowing
GUP lock free access to proceed with such PTEs breaks this
synchronisation.

This bug has been observed on a host using the hash page table MMU,
with random crashes and corruption in guests, usually together with
bad PMD messages in the host.

Fix this by adding an explicit check in pmd_access_permitted(), and
documenting the condition explicitly.

The pte_write() change should be okay, and would prevent GUP from
falling back to the slow path when encountering savedwrite PTEs, which
matches what x86 (that does not implement savedwrite) does.

Fixes: 1b2443a547 ("powerpc/book3s64: Avoid multiple endian conversion in pte helpers")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-07 16:26:44 +10:00
Thomas Gleixner
b886d83c5b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 315 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:17 +02:00
Thomas Gleixner
1a59d1b8e0 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not write to the free software foundation inc
  59 temple place suite 330 boston ma 02111 1307 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1334 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:35 -07:00
Thomas Gleixner
de6cc6515a treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 153
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version this program is distributed in the hope that it will
  be useful but without any warranty without even the implied warranty
  of merchantability or fitness for a particular purpose see the gnu
  general public license for more details you should have received a
  copy of the gnu general public license along with this program if
  not write to the free software foundation inc 675 mass ave cambridge
  ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 77 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.837555891@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Eric W. Biederman
2e1661d267 signal: Remove the task parameter from force_sig_fault
As synchronous exceptions really only make sense against the current
task (otherwise how are you synchronous) remove the task parameter
from from force_sig_fault to make it explicit that is what is going
on.

The two known exceptions that deliver a synchronous exception to a
stopped ptraced task have already been changed to
force_sig_fault_to_task.

The callers have been changed with the following emacs regular expression
(with obvious variations on the architectures that take more arguments)
to avoid typos:

force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
->
force_sig_fault(\1,\2,\3)

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:43 -05:00
YueHaibing
d667edc01b powerpc/mm: Make some symbols static that can be
Noticed by sparse.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-28 12:08:10 +10:00
Eric W. Biederman
f8eac9011b signal: Remove task parameter from force_sig_mceerr
All of the callers pass current into force_sig_mceer so remove the
task parameter to make this obvious.

This also makes it clear that force_sig_mceerr passes current
into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Linus Torvalds
86a78a8b8d powerpc fixes for 5.2 #2
One fix going back to stable, for a bug on 32-bit introduced when we added
 support for THREAD_INFO_IN_TASK.
 
 A fix for a typo in a recent rework of our hugetlb code that leads to crashes on
 64-bit when using hugetlbfs with a 4K PAGE_SIZE.
 
 Two fixes for our recent rework of the address layout on 64-bit hash CPUs, both
 only triggered when userspace tries to access addresses outside the user or
 kernel address ranges.
 
 Finally a fix for a recently introduced double free in an error path in our
 cacheinfo code.
 
 Thanks to:
   Aneesh Kumar K.V, Christophe Leroy, Sachin Sant, Tobin C. Harding.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc3+WRAAoJEFHr6jzI4aWAlN4QAIP7UnQUApYzQX8YrMJEiip7
 1Jtez373pgDEX21McmaznyYRy04gtLWVRV0D5vRsTCG5RHBOVt2KPXe0PrzyjJws
 /AxS9aRuMgM+VkLkD9c6DzWsuBLP8kJMfuTbmY7+C0tByQQT9Xfp+K7/gC5kGlK4
 igyTGZvALlEVilsjTQO/FhADWisYIHM+y2/e0ocxLlK5F+TxdJKpESyUzjMOT78i
 SmAn1qufZedV7ZH91VmvLm8f8MSqg+t1NTwpAnXuS58z2eSNisRhUcP43K4XdAez
 QTGODTgUGGzLsbnIq8KJB/X4YVLwtTR2UM3p6ubTwlDuVkBfrFPUHowywDmxoPHZ
 Kh6KWlRFAL49sGmYMJpfkaEiyN+J+VLN2HMdNLW2HgnROzDSOXCXcjuAuBE9nAKe
 kvMV7Zwb4mQ0j0QUxHuugbBRzUpAcDg+PNZKKb9P/jIMWlMhGyb9fjGxC1yJfDtB
 Kq03zJ+0lzvDCr1XbcDhSv4Fu6Dxxfi7GX9BXlzHTzsGFwZKICj9sLXDXXjFh4h8
 CDNAeWtol1WE2xj315LL6WTlONUMQJ3wDAzd5VJw7SewadDzpK57vuwba88IN9hd
 2PmH7FF8VyrksKmqt19ZsF0X7n7JL3+xbOcV9OT0r0DjkkM1fC6zo+JT6YYIER6A
 FE+pFQRQJndKRzHqvyBJ
 =2265
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix going back to stable, for a bug on 32-bit introduced when we
  added support for THREAD_INFO_IN_TASK.

  A fix for a typo in a recent rework of our hugetlb code that leads to
  crashes on 64-bit when using hugetlbfs with a 4K PAGE_SIZE.

  Two fixes for our recent rework of the address layout on 64-bit hash
  CPUs, both only triggered when userspace tries to access addresses
  outside the user or kernel address ranges.

  Finally a fix for a recently introduced double free in an error path
  in our cacheinfo code.

  Thanks to: Aneesh Kumar K.V, Christophe Leroy, Sachin Sant, Tobin C.
  Harding"

* tag 'powerpc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/cacheinfo: Remove double free
  powerpc/mm/hash: Fix get_region_id() for invalid addresses
  powerpc/mm: Drop VM_BUG_ON in get_region_id()
  powerpc/mm: Fix crashes with hugepages & 4K pages
  powerpc/32s: fix flush_hash_pages() on SMP
2019-05-19 10:10:15 -07:00
Masahiro Yamada
efc344c57e powerpc/mm/radix: mark as __tlbie_pid() and friends as__always_inline
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
place.  We need to eliminate potential issues beforehand.

If it is enabled for powerpc, the following errors are reported:

  arch/powerpc/mm/tlb-radix.c: In function '__tlbie_lpid':
  arch/powerpc/mm/tlb-radix.c:148:2: warning: asm operand 3 probably doesn't match constraints
    asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
    ^~~
  arch/powerpc/mm/tlb-radix.c:148:2: error: impossible constraint in 'asm'
  arch/powerpc/mm/tlb-radix.c: In function '__tlbie_pid':
  arch/powerpc/mm/tlb-radix.c:118:2: warning: asm operand 3 probably doesn't match constraints
    asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
    ^~~
  arch/powerpc/mm/tlb-radix.c: In function '__tlbiel_pid':
  arch/powerpc/mm/tlb-radix.c:104:2: warning: asm operand 3 probably doesn't match constraints
    asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
    ^~~

Link: http://lkml.kernel.org/r/20190423034959.13525-11-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Masahiro Yamada
e12d6d7d46 powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as __always_inline
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
place.  We need to eliminate potential issues beforehand.

If it is enabled for powerpc, the following error is reported:

  arch/powerpc/mm/tlb-radix.c: In function '__radix__flush_tlb_range_psize':
  arch/powerpc/mm/tlb-radix.c:104:2: error: asm operand 3 probably doesn't match constraints [-Werror]
    asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
    ^~~
  arch/powerpc/mm/tlb-radix.c:104:2: error: impossible constraint in 'asm'

Link: http://lkml.kernel.org/r/20190423034959.13525-10-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Michael Ellerman
7338874c33 powerpc/mm: Fix crashes with hugepages & 4K pages
The recent commit to cleanup ifdefs in the hugepage initialisation led
to crashes when using 4K pages as reported by Sachin:

  BUG: Kernel NULL pointer dereference at 0x0000001c
  Faulting instruction address: 0xc000000001d1e58c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  ...
  CPU: 3 PID: 4635 Comm: futex_wake04 Tainted: G        W  O      5.1.0-next-20190507-autotest #1
  NIP:  c000000001d1e58c LR: c000000001d1e54c CTR: 0000000000000000
  REGS: c000000004937890 TRAP: 0300
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22424822  XER: 00000000
  CFAR: c00000000183e9e0 DAR: 000000000000001c DSISR: 40000000 IRQMASK: 0
  ...
  NIP kmem_cache_alloc+0xbc/0x5a0
  LR  kmem_cache_alloc+0x7c/0x5a0
  Call Trace:
    huge_pte_alloc+0x580/0x950
    hugetlb_fault+0x9a0/0x1250
    handle_mm_fault+0x490/0x4a0
    __do_page_fault+0x77c/0x1f00
    do_page_fault+0x28/0x50
    handle_page_fault+0x18/0x38

This is caused by us trying to allocate from a NULL kmem cache in
__hugepte_alloc(). The kmem cache is NULL because it was never
allocated in hugetlbpage_init(), because add_huge_page_size() returned
an error.

The reason add_huge_page_size() returned an error is a simple typo, we
are calling check_and_get_huge_psize(size) when we should be passing
shift instead.

The fact that we're able to trigger this path when the kmem caches are
NULL is a separate bug, ie. we should not advertise any hugepage sizes
if we haven't setup the required caches for them.

This was only seen with 4K pages, with 64K pages we don't need to
allocate any extra kmem caches because the 16M hugepage just occupies
a single entry at the PMD level.

Fixes: 723f268f19 ("powerpc/mm: cleanup ifdef mess in add_huge_page_size()")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
2019-05-15 11:13:35 +10:00
David Hildenbrand
ac5c942645 mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail
All callers of arch_remove_memory() ignore errors.  And we should really
try to remove any errors from the memory removal path.  No more errors are
reported from __remove_pages().  BUG() in s390x code in case
arch_remove_memory() is triggered.  We may implement that properly later.
WARN in case powerpc code failed to remove the section mapping, which is
better than ignoring the error completely right now.

Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:50 -07:00
Michal Hocko
940519f0c8 mm, memory_hotplug: provide a more generic restrictions for memory hotplug
arch_add_memory, __add_pages take a want_memblock which controls whether
the newly added memory should get the sysfs memblock user API (e.g.
ZONE_DEVICE users do not want/need this interface).  Some callers even
want to control where do we allocate the memmap from by configuring
altmap.

Add a more generic hotplug context for arch_add_memory and __add_pages.
struct mhp_restrictions contains flags which contains additional features
to be enabled by the memory hotplug (MHP_MEMBLOCK_API currently) and
altmap for alternative memmap allocator.

This patch shouldn't introduce any functional change.

[akpm@linux-foundation.org: build fix]
Link: http://lkml.kernel.org/r/20190408082633.2864-3-osalvador@suse.de
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Christoph Hellwig
4afd58e14d initramfs: provide a generic free_initrd_mem implementation
For most architectures free_initrd_mem just expands to the same
free_reserved_area call.  Provide that as a generic implementation marked
__weak.

Link: http://lkml.kernel.org/r/20190213174621.29297-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Cc: Steven Price <steven.price@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:47 -07:00
Ira Weiny
932f4a630a mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages.  These pages can be held for a significant time.  But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks.  XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move.  I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem.  Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory.  I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance.  I don't have the numbers as the
tests for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast().  Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality.  In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked.  However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages.  This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer.  This is really flagging a pin which is going to be given to
hardware and can't move.  I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem.  Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory.  I have spoken
with some RDMA users who consider MR in the performance path...  For the
overall application performance.  I don't have the numbers as the tests
for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
  Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:45 -07:00
Christophe Leroy
397d2300b0 powerpc/32s: fix flush_hash_pages() on SMP
flush_hash_pages() runs with data translation off, so current
task_struct has to be accesssed using physical address.

Fixes: f7354ccac8 ("powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU")
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-14 22:58:52 +10:00
Linus Torvalds
b970afcfca powerpc updates for 5.2
Highlights:
 
  - Support for Kernel Userspace Access/Execution Prevention (like
    SMAP/SMEP/PAN/PXN) on some 64-bit and 32-bit CPUs. This prevents the kernel
    from accidentally accessing userspace outside copy_to/from_user(), or
    ever executing userspace.
 
  - KASAN support on 32-bit.
 
  - Rework of where we map the kernel, vmalloc, etc. on 64-bit hash to use the
    same address ranges we use with the Radix MMU.
 
  - A rewrite into C of large parts of our idle handling code for 64-bit Book3S
    (ie. power8 & power9).
 
  - A fast path entry for syscalls on 32-bit CPUs, for a 12-17% speedup in the
    null_syscall benchmark.
 
  - On 64-bit bare metal we have support for recovering from errors with the time
    base (our clocksource), however if that fails currently we hang in __delay()
    and never crash. We now have support for detecting that case and short
    circuiting __delay() so we at least panic() and reboot.
 
  - Add support for optionally enabling the DAWR on Power9, which had to be
    disabled by default due to a hardware erratum. This has the effect of
    enabling hardware breakpoints for GDB, the downside is a badly behaved
    program could crash the machine by pointing the DAWR at cache inhibited
    memory. This is opt-in obviously.
 
  - xmon, our crash handler, gets support for a read only mode where operations
    that could change memory or otherwise disturb the system are disabled.
 
 Plus many clean-ups, reworks and minor fixes etc.
 
 Thanks to:
   Christophe Leroy, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Andrew
   Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Ben Hutchings,
   Bo YU, Breno Leitao, Cédric Le Goater, Christopher M. Riedl, Christoph
   Hellwig, Colin Ian King, David Gibson, Ganesh Goudar, Gautham R. Shenoy,
   George Spelvin, Greg Kroah-Hartman, Greg Kurz, Horia Geantă, Jagadeesh
   Pagadala, Joel Stanley, Joe Perches, Julia Lawall, Laurentiu Tudor, Laurent
   Vivier, Lukas Bulwahn, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu
   Malaterre, Michael Neuling, Mukesh Ojha, Nathan Fontenot, Nathan Lynch,
   Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Peng Hao, Qian Cai, Ravi
   Bangoria, Rick Lindsley, Russell Currey, Sachin Sant, Stewart Smith, Sukadev
   Bhattiprolu, Thomas Huth, Tobin C. Harding, Tyrel Datwyler, Valentin
   Schneider, Wei Yongjun, Wen Yang, YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc1WbwAAoJEFHr6jzI4aWAv5cP/iDskai4Az/GCa6yLj4b+det
 7mc7tTOaEzhUtvfrYYfHgvvdNNzo1ETv7rqTdZqtWJ3xfwdeowLFXXZwSywZKUDB
 bi4pcl2v55Qlf9kxgx9RDr6+4fTwGG4nhO2qPDJDR1umEih9mG/2HJ7d+Wnq6Va2
 E9srd+R6Fa0ty88+9vzBtdyllnDK1XHu3ahsxCH62aRm79ucuVrxyydWmbbs5lJe
 a7g/OQIPgZmObHhfXvw9DFkOvkp5Pm6hfHOeyQH2nTB5X6k0judWv00uoHTJgOuP
 DKxZtDhaGnajUfuhQYboDPOuFjY7lkfgEXaagyZsjdudqridTMmv1iU1o7iy8BT4
 AId4DyJbvFFgqRJkCwKzhKRRHPfFMfM7KTJ38GPZuPmniuULk9uiIy6JyY0tXO+l
 UQEclPzOTPkAE12FBaOBuqZqTRuBQuokWQF8ZDPOxbNAixHgFoRd4Z9diNwCPpLu
 +KoyCwd2Gm5DyX+mC85sWG28IPKi9Hhhw2XBOA5F4A2kH6uFa1BnERSRGYomx+pc
 BvEXHglf/vgV0XUQZfDCsiOecIKYuWxgre0/liLhhU5qMss2pxHczzffH4KtdykS
 9y7o3mVRcS7Moitbmb6SAJoQxbR5QhzfN832DbSd6jEfKdg1ytZlfHTG0WZYHKDs
 PHs6V1N+cQANdukutrJz
 =cUkd
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Slightly delayed due to the issue with printk() calling
  probe_kernel_read() interacting with our new user access prevention
  stuff, but all fixed now.

  The only out-of-area changes are the addition of a cpuhp_state, small
  additions to Documentation and MAINTAINERS updates.

  Highlights:

   - Support for Kernel Userspace Access/Execution Prevention (like
     SMAP/SMEP/PAN/PXN) on some 64-bit and 32-bit CPUs. This prevents
     the kernel from accidentally accessing userspace outside
     copy_to/from_user(), or ever executing userspace.

   - KASAN support on 32-bit.

   - Rework of where we map the kernel, vmalloc, etc. on 64-bit hash to
     use the same address ranges we use with the Radix MMU.

   - A rewrite into C of large parts of our idle handling code for
     64-bit Book3S (ie. power8 & power9).

   - A fast path entry for syscalls on 32-bit CPUs, for a 12-17% speedup
     in the null_syscall benchmark.

   - On 64-bit bare metal we have support for recovering from errors
     with the time base (our clocksource), however if that fails
     currently we hang in __delay() and never crash. We now have support
     for detecting that case and short circuiting __delay() so we at
     least panic() and reboot.

   - Add support for optionally enabling the DAWR on Power9, which had
     to be disabled by default due to a hardware erratum. This has the
     effect of enabling hardware breakpoints for GDB, the downside is a
     badly behaved program could crash the machine by pointing the DAWR
     at cache inhibited memory. This is opt-in obviously.

   - xmon, our crash handler, gets support for a read only mode where
     operations that could change memory or otherwise disturb the system
     are disabled.

  Plus many clean-ups, reworks and minor fixes etc.

  Thanks to: Christophe Leroy, Akshay Adiga, Alastair D'Silva, Alexey
  Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar,
  Anton Blanchard, Ben Hutchings, Bo YU, Breno Leitao, Cédric Le Goater,
  Christopher M. Riedl, Christoph Hellwig, Colin Ian King, David Gibson,
  Ganesh Goudar, Gautham R. Shenoy, George Spelvin, Greg Kroah-Hartman,
  Greg Kurz, Horia Geantă, Jagadeesh Pagadala, Joel Stanley, Joe
  Perches, Julia Lawall, Laurentiu Tudor, Laurent Vivier, Lukas Bulwahn,
  Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael
  Neuling, Mukesh Ojha, Nathan Fontenot, Nathan Lynch, Nicholas Piggin,
  Nick Desaulniers, Oliver O'Halloran, Peng Hao, Qian Cai, Ravi
  Bangoria, Rick Lindsley, Russell Currey, Sachin Sant, Stewart Smith,
  Sukadev Bhattiprolu, Thomas Huth, Tobin C. Harding, Tyrel Datwyler,
  Valentin Schneider, Wei Yongjun, Wen Yang, YueHaibing"

* tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (205 commits)
  powerpc/64s: Use early_mmu_has_feature() in set_kuap()
  powerpc/book3s/64: check for NULL pointer in pgd_alloc()
  powerpc/mm: Fix hugetlb page initialization
  ocxl: Fix return value check in afu_ioctl()
  powerpc/mm: fix section mismatch for setup_kup()
  powerpc/mm: fix redundant inclusion of pgtable-frag.o in Makefile
  powerpc/mm: Fix makefile for KASAN
  powerpc/kasan: add missing/lost Makefile
  selftests/powerpc: Add a signal fuzzer selftest
  powerpc/booke64: set RI in default MSR
  ocxl: Provide global MMIO accessors for external drivers
  ocxl: move event_fd handling to frontend
  ocxl: afu_irq only deals with IRQ IDs, not offsets
  ocxl: Allow external drivers to use OpenCAPI contexts
  ocxl: Create a clear delineation between ocxl backend & frontend
  ocxl: Don't pass pci_dev around
  ocxl: Split pci.c
  ocxl: Remove some unused exported symbols
  ocxl: Remove superfluous 'extern' from headers
  ocxl: read_pasid never returns an error, so make it void
  ...
2019-05-10 05:29:27 -07:00
Sachin Sant
04a1942933 powerpc/mm: Fix hugetlb page initialization
This patch fixes a regression by using correct kernel config variable
for HUGETLB_PAGE_SIZE_VARIABLE.

Without this huge pages are disabled during kernel boot.
[0.309496] hugetlbfs: disabling because there are no supported hugepage sizes

Fixes: c5710cd207 ("powerpc/mm: cleanup HPAGE_SHIFT setup")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-06 22:37:39 +10:00
Christophe Leroy
67d53f30e2 powerpc/mm: fix section mismatch for setup_kup()
commit b28c97505e ("powerpc/64: Setup KUP on secondary CPUs")
moved setup_kup() out of the __init section. As stated in that commit,
"this is only for 64-bit". But this function is also used on PPC32,
where the two functions called by setup_kup() are in the __init
section, so setup_kup() has to either be kept in the __init
section on PPC32 or marked __ref.

This patch marks it __ref, it fixes the below build warnings.

  MODPOST vmlinux.o
WARNING: vmlinux.o(.text+0x169ec): Section mismatch in reference from the function setup_kup() to the function .init.text:setup_kuep()
The function setup_kup() references
the function __init setup_kuep().
This is often because setup_kup lacks a __init
annotation or the annotation of setup_kuep is wrong.

WARNING: vmlinux.o(.text+0x16a04): Section mismatch in reference from the function setup_kup() to the function .init.text:setup_kuap()
The function setup_kup() references
the function __init setup_kuap().
This is often because setup_kup lacks a __init
annotation or the annotation of setup_kuap is wrong.

Fixes: b28c97505e ("powerpc/64: Setup KUP on secondary CPUs")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-06 20:21:56 +10:00
Christophe Leroy
c4e31847a5 powerpc/mm: fix redundant inclusion of pgtable-frag.o in Makefile
The patch identified below added pgtable-frag.o to obj-y
but some merge witchery kept it also for obj-CONFIG_PPC_BOOK3S_64

This patch clears the duplication.

Fixes: 737b434d3d ("powerpc/mm: convert Book3E 64 to pte_fragment")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-06 20:21:56 +10:00
Christophe Leroy
471e475c69 powerpc/mm: Fix makefile for KASAN
In commit 17312f258c ("powerpc/mm: Move book3s32 specifics in
subdirectory mm/book3s64"), ppc_mmu_32.c was moved and renamed.

This patch fixes Makefiles to disable KASAN instrumentation on
the new name and location.

Fixes: f072015c7b ("powerpc: disable KASAN instrumentation on early/critical files.")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-06 20:21:56 +10:00
Christophe Leroy
305d600123 powerpc/kasan: add missing/lost Makefile
For unknown reason (aka. mpe is a doofus), the new Makefile added via
the KASAN support patch didn't land into arch/powerpc/mm/kasan/

This patch restores it.

Fixes: 2edb16efc8 ("powerpc/32: Add KASAN support")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-06 20:21:34 +10:00
Russell Currey
453d87f6a8 powerpc/mm: Warn if W+X pages found on boot
Implement code to walk all pages and warn if any are found to be both
writable and executable.  Depends on STRICT_KERNEL_RWX enabled, and is
behind the DEBUG_WX config option.

This only runs on boot and has no runtime performance implications.

Very heavily influenced (and in some cases copied verbatim) from the
ARM64 code written by Laura Abbott (thanks!), since our ptdump
infrastructure is similar.

Signed-off-by: Russell Currey <ruscur@russell.cc>
[mpe: Fixup build error when disabled]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 02:54:45 +10:00
Russell Currey
5f18cbdbdd powerpc/mm/ptdump: Wrap seq_printf() to handle NULL pointers
Lovingly borrowed from the arch/arm64 ptdump code.

This doesn't seem to be an issue in practice, but is necessary for my
upcoming commit.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:58:11 +10:00
Christoph Hellwig
c9e0fc33b8 powerpc: remove the __kernel_io_end export
This export was added in this merge window, but without any actual
user, or justification for a modular user.

Fixes: a35a3c6f60 ("powerpc/mm/hash64: Add a variable to track the end of IO mapping")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:58:11 +10:00
Christophe Leroy
e4dccf9092 powerpc/mm: print hash info in a helper
Reduce #ifdef mess by defining a helper to print
hash info at startup.

In the meantime, remove the display of hash table address
to reduce leak of non necessary information.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
8f156c23f4 powerpc/32s: don't try to print hash table address.
Due to %p, (ptrval) is printed in lieu of the hash table address.

showing the hash table address isn't an operationnal need so just
don't print it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
57e0491b58 powerpc/32s: drop Hash_end
Hash_end has never been used, drop it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
da3a3b0a0e powerpc/32s: map kasan zero shadow with PAGE_READONLY instead of PAGE_KERNEL_RO
For hash32, the zero shadow page gets mapped with PAGE_READONLY instead
of PAGE_KERNEL_RO, because the PP bits don't provide a RO kernel, so
PAGE_KERNEL_RO is equivalent to PAGE_KERNEL. By using PAGE_READONLY,
the page is RO for both kernel and user, but this is not a security issue
as it contains only zeroes.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
215b823707 powerpc/32s: set up an early static hash table for KASAN.
KASAN requires early activation of hash table, before memblock()
functions are available.

This patch implements an early hash_table statically defined in
__initdata.

During early boot, a single page table is used.

For hash32, when doing the final init, one page table is allocated
for each PGD entry because of the _PAGE_HASHPTE flag which can't be
common to several virt pages. This is done after memblock get
available but before switching to the final hash table, otherwise
there are issues with TLB flushing due to the shared entries.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
72f208c6a8 powerpc/32s: move hash code patching out of MMU_init_hw()
For KASAN, hash table handling will be activated early for
accessing to KASAN shadow areas.

In order to avoid any modification of the hash functions while
they are still used with the early hash table, the code patching
is moved out of MMU_init_hw() and put close to the big-bang switch
to the final hash table.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
2edb16efc8 powerpc/32: Add KASAN support
This patch adds KASAN support for PPC32. The following patch
will add an early activation of hash table for book3s. Until
then, a warning will be raised if trying to use KASAN on an
hash 6xx.

To support KASAN, this patch initialises that MMU mapings for
accessing to the KASAN shadow area defined in a previous patch.

An early mapping is set as soon as the kernel code has been
relocated at its definitive place.

Then the definitive mapping is set once paging is initialised.

For modules, the shadow area is allocated at module_alloc().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
f072015c7b powerpc: disable KASAN instrumentation on early/critical files.
All files containing functions run before kasan_early_init() is called
must have KASAN instrumentation disabled.

For those file, branch profiling also have to be disabled otherwise
each if () generates a call to ftrace_likely_update().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
b4abe38fd6 powerpc/32: prepare shadow area for KASAN
This patch prepares a shadow area for KASAN.

The shadow area will be at the top of the kernel virtual
memory space above the fixmap area and will occupy one
eighth of the total kernel virtual memory space.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:26 +10:00
Christophe Leroy
b0124ff57e powerpc/mm: inline pte_alloc_one_kernel() and pte_alloc_one() on PPC32
pte_alloc_one_kernel() and pte_alloc_one() are simple calls to
pte_fragment_alloc(), so they are good candidates for inlining as
already done on PPC64.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:25 +10:00
Christophe Leroy
4a6d8cf900 powerpc/mm: don't use pte_alloc_kernel() until slab is available on PPC32
In the same way as PPC64, implement early allocation functions and
avoid calling pte_alloc_kernel() before slab is available.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:25 +10:00
Christophe Leroy
627f06c6f5 powerpc/book3e: move early_alloc_pgtable() to init section
early_alloc_pgtable() is only used during init.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
737b434d3d powerpc/mm: convert Book3E 64 to pte_fragment
Book3E 64 is the only subarch not using pte_fragment. In order
to allow refactorisation, this patch converts it to pte_fragment.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
26e66b08c3 powerpc/mm: flatten function __find_linux_pte() step 3
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.

Previous patches left a { } block. This patch removes it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
e2fb251188 powerpc/mm: flatten function __find_linux_pte() step 2
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.

Previous patch left { } blocks. This patch removes the first one
by shifting its content to the left.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
fab9a1165b powerpc/mm: flatten function __find_linux_pte() step 1
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.

This patch flattens the function by getting rid of as much if/else
as possible. In order to ease the review, this is done in three steps.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
4df4b27585 powerpc/mm: cleanup remaining ifdef mess in hugetlbpage.c
Only 3 subarches support huge pages. So when it is either 2 of them,
it is not the third one.

And mmu_has_feature() is known by all subarches so IS_ENABLED() can
be used instead of #ifdef

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
c5710cd207 powerpc/mm: cleanup HPAGE_SHIFT setup
Only book3s/64 may select default among several HPAGE_SHIFT at runtime.
8xx always defines 512K pages as default
FSL_BOOK3E always defines 4M pages as default

This patch limits HUGETLB_PAGE_SIZE_VARIABLE to book3s/64
moves the definitions in subarches files.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
45d0ba527b powerpc/mm: move hugetlb_disabled into asm/hugetlb.h
No need to have this in asm/page.h, move it into asm/hugetlb.h

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
723f268f19 powerpc/mm: cleanup ifdef mess in add_huge_page_size()
Introduce a subarch specific helper check_and_get_huge_psize()
to check the huge page sizes and cleanup the ifdef mess in
add_huge_page_size()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
5fb84fec46 powerpc/mm: add a helper to populate hugepd
This patchs adds a subarch helper to populate hugepd.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
0001e5aa5c powerpc/mm: make gup_hugepte() static
gup_huge_pd() is the only user of gup_hugepte() and it is
located in the same file. This patch moves gup_huge_pd()
after gup_hugepte() and makes gup_hugepte() static.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
b7dcf96ce0 powerpc/mm: make hugetlbpage.c depend on CONFIG_HUGETLB_PAGE
The only function in hugetlbpage.c which doesn't depend on
CONFIG_HUGETLB_PAGE is gup_hugepte(), and this function is
only called from gup_huge_pd() which depends on
CONFIG_HUGETLB_PAGE so all the content of hugetlbpage.c
depends on CONFIG_HUGETLB_PAGE.

This patch modifies Makefile to only compile hugetlbpage.c
when CONFIG_HUGETLB_PAGE is set.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
0caed4de50 powerpc/mm: move __find_linux_pte() out of hugetlbpage.c
__find_linux_pte() is the only function in hugetlbpage.c
which is compiled in regardless on CONFIG_HUGETLBPAGE

This patch moves it in pgtable.c.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
3dea7332cc powerpc/book3e: hugetlbpage is only for CONFIG_PPC_FSL_BOOK3E
As per Kconfig.cputype, only CONFIG_PPC_FSL_BOOK3E gets to
select SYS_SUPPORTS_HUGETLBFS so simplify accordingly.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
5874cabe29 powerpc/64: only book3s/64 supports CONFIG_PPC_64K_PAGES
CONFIG_PPC_64K_PAGES cannot be selected by nohash/64.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
a521c44c3d powerpc/book3e: drop mmu_get_tsize()
This function is not used anymore, drop it.

Fixes: b42279f016 ("powerpc/mm/nohash: MM_SLICE is only used by book3s 64")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
5953fb4f46 powerpc/mm: define subarch SLB_ADDR_LIMIT_DEFAULT
This patch defines a subarch specific SLB_ADDR_LIMIT_DEFAULT
to remove the #ifdefs around the setup of mm->context.slb_addr_limit

It also generalises the use of mm_ctx_set_slb_addr_limit() helper.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
43ed7909d7 powerpc/mm: define get_slice_psize() all the time
get_slice_psize() can be defined regardless of CONFIG_PPC_MM_SLICES
to avoid ifdefs

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
203a1fa628 powerpc/mm: remove a couple of #ifdef CONFIG_PPC_64K_PAGES in mm/slice.c
This patch replaces a couple of #ifdef CONFIG_PPC_64K_PAGES
by IS_ENABLED(CONFIG_PPC_64K_PAGES) to improve code maintainability.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
b4baad0b27 powerpc/mm: remove unnecessary #ifdef CONFIG_PPC64
For PPC32 that's a noop, gcc should be smart enough to ignore it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
fca5c1e9eb powerpc/mm: move slice_mask_for_size() into mmu.h
Move slice_mask_for_size() into subarch mmu.h

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Retain the BUG_ON()s, rather than converting to VM_BUG_ON()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
6f60cc98df powerpc/mm: hand a context_t over to slice_mask_for_size() instead of mm_struct
slice_mask_for_size() only uses mm->context, so hand directly a
pointer to the context. This will help moving the function in
subarch mmu.h in the next patch by avoiding having to include
the definition of struct mm_struct

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
27e23b5f5f powerpc/mm: Move nohash specifics in subdirectory mm/nohash
Many files in arch/powerpc/mm are only for nohash. This patch
creates a subdirectory for them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Shorten new filenames]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
17312f258c powerpc/mm: Move book3s32 specifics in subdirectory mm/book3s64
Several files in arch/powerpc/mm are only for book3S32. This patch
creates a subdirectory for them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Shorten new filenames]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
47d99948ee powerpc/mm: Move book3s64 specifics in subdirectory mm/book3s64
Many files in arch/powerpc/mm are only for book3S64. This patch
creates a subdirectory for them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Update the selftest sym links, shorten new filenames, cleanup some
      whitespace and formatting in the new files.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:18:38 +10:00
Christophe Leroy
9d9f2cccde powerpc/mm: change #include "mmu_decl.h" to <mm/mmu_decl.h>
This patch make inclusion of mmu_decl.h independant of the location
of the file including it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-02 21:18:58 +10:00
Christophe Leroy
a1ac2a9c4f powerpc/book3e: drop BUG_ON() in map_kernel_page()
early_alloc_pgtable() never returns NULL as it panics on failure.

This patch drops the three BUG_ON() which check the non nullity
of early_alloc_pgtable() returned value.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-02 21:18:57 +10:00
Christophe Leroy
12f363511d powerpc/32s: Fix BATs setting with CONFIG_STRICT_KERNEL_RWX
Serge reported some crashes with CONFIG_STRICT_KERNEL_RWX enabled
on a book3s32 machine.

Analysis shows two issues:
  - BATs addresses and sizes are not properly aligned.
  - There is a gap between the last address covered by BATs and the
    first address covered by pages.

Memory mapped with DBATs:
0: 0xc0000000-0xc07fffff 0x00000000 Kernel RO coherent
1: 0xc0800000-0xc0bfffff 0x00800000 Kernel RO coherent
2: 0xc0c00000-0xc13fffff 0x00c00000 Kernel RW coherent
3: 0xc1400000-0xc23fffff 0x01400000 Kernel RW coherent
4: 0xc2400000-0xc43fffff 0x02400000 Kernel RW coherent
5: 0xc4400000-0xc83fffff 0x04400000 Kernel RW coherent
6: 0xc8400000-0xd03fffff 0x08400000 Kernel RW coherent
7: 0xd0400000-0xe03fffff 0x10400000 Kernel RW coherent

Memory mapped with pages:
0xe1000000-0xefffffff  0x21000000       240M        rw       present           dirty  accessed

This patch fixes both issues. With the patch, we get the following
which is as expected:

Memory mapped with DBATs:
0: 0xc0000000-0xc07fffff 0x00000000 Kernel RO coherent
1: 0xc0800000-0xc0bfffff 0x00800000 Kernel RO coherent
2: 0xc0c00000-0xc0ffffff 0x00c00000 Kernel RW coherent
3: 0xc1000000-0xc1ffffff 0x01000000 Kernel RW coherent
4: 0xc2000000-0xc3ffffff 0x02000000 Kernel RW coherent
5: 0xc4000000-0xc7ffffff 0x04000000 Kernel RW coherent
6: 0xc8000000-0xcfffffff 0x08000000 Kernel RW coherent
7: 0xd0000000-0xdfffffff 0x10000000 Kernel RW coherent

Memory mapped with pages:
0xe0000000-0xefffffff  0x20000000       256M        rw       present           dirty  accessed

Fixes: 63b2bc6195 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
Reported-by: Serge Belyshev <belyshev@depni.sinp.msu.ru>
Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-02 15:33:46 +10:00
Aneesh Kumar K.V
2c474c0350 powerpc/mm/radix: Fix kernel crash when running subpage protect test
This patch fixes the below crash by making sure we touch the subpage
protection related structures only if we know they are allocated on
the platform. With radix translation we don't allocate hash context at
all and trying to access subpage_prot_table results in:

  Faulting instruction address: 0xc00000000008bdb4
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
  ....
  NIP [c00000000008bdb4] sys_subpage_prot+0x74/0x590
  LR [c00000000000b688] system_call+0x5c/0x70
  Call Trace:
  [c00020002c6b7d30] [c00020002c6b7d90] 0xc00020002c6b7d90 (unreliable)
  [c00020002c6b7e20] [c00000000000b688] system_call+0x5c/0x70
  Instruction dump:
  fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f821ff11 e92d1178 f9210068
  39200000 e92d0968 ebe90630 e93f03e8 <eb891038> 60000000 3860fffe e9410068

We also move the subpage_prot_table with mmp_sem held to avoid race
between two parallel subpage_prot syscall.

Fixes: 701101865f ("powerpc/mm: Reduce memory usage for mm_context_t for radix")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 23:01:33 +10:00
Nathan Fontenot
b2d3b5ee66 powerpc/pseries: Track LMB nid instead of using device tree
When removing memory we need to remove the memory from the node
it was added to instead of looking up the node it should be in
in the device tree.

During testing we have seen scenarios where the affinity for a
LMB changes due to a partition migration or PRRN event. In these
cases the node the LMB exists in may not match the node the device
tree indicates it belongs in. This can lead to a system crash
when trying to DLPAR remove the LMB after a migration or PRRN
event. The current code looks up the node in the device tree to
remove the LMB from, the crash occurs when we try to offline this
node and it does not have any data, i.e. node_data[nid] == NULL.

36:mon> e
cpu 0x36: Vector: 300 (Data Access) at [c0000001828b7810]
    pc: c00000000036d08c: try_offline_node+0x2c/0x1b0
    lr: c0000000003a14ec: remove_memory+0xbc/0x110
    sp: c0000001828b7a90
   msr: 800000000280b033
   dar: 9a28
 dsisr: 40000000
  current = 0xc0000006329c4c80
  paca    = 0xc000000007a55200   softe: 0        irq_happened: 0x01
    pid   = 76926, comm = kworker/u320:3

36:mon> t
[link register   ] c0000000003a14ec remove_memory+0xbc/0x110
[c0000001828b7a90] c00000000006a1cc arch_remove_memory+0x9c/0xd0 (unreliable)
[c0000001828b7ad0] c0000000003a14e0 remove_memory+0xb0/0x110
[c0000001828b7b20] c0000000000c7db4 dlpar_remove_lmb+0x94/0x160
[c0000001828b7b60] c0000000000c8ef8 dlpar_memory+0x7e8/0xd10
[c0000001828b7bf0] c0000000000bf828 handle_dlpar_errorlog+0xf8/0x160
[c0000001828b7c60] c0000000000bf8cc pseries_hp_work_fn+0x3c/0xa0
[c0000001828b7c90] c000000000128cd8 process_one_work+0x298/0x5a0
[c0000001828b7d20] c000000000129068 worker_thread+0x88/0x620
[c0000001828b7dc0] c00000000013223c kthread+0x1ac/0x1c0
[c0000001828b7e30] c00000000000b45c ret_from_kernel_thread+0x5c/0x80

To resolve this we need to track the node a LMB belongs to when
it is added to the system so we can remove it from that node instead
of the node that the device tree indicates it should belong to.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-29 22:27:16 +10:00
Colin Ian King
f341d89790 powerpc/mm: fix spelling mistake "Outisde" -> "Outside"
There are several identical spelling mistakes in warning messages,
fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-28 16:28:23 +10:00
Aneesh Kumar K.V
26ad26718d powerpc/mm: Fix section mismatch warning
This patch fix the below section mismatch warnings.

WARNING: vmlinux.o(.text+0x2d1f44): Section mismatch in reference from the function devm_memremap_pages_release() to the function .meminit.text:arch_remove_memory()
WARNING: vmlinux.o(.text+0x2d265c): Section mismatch in reference from the function devm_memremap_pages() to the function .meminit.text:arch_add_memory()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:40 +10:00
Aneesh Kumar K.V
5f53d28608 powerpc/mm/hash: Rename KERNEL_REGION_ID to LINEAR_MAP_REGION_ID
The region actually point to linear map. Rename the #define to
clarify thati.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:40 +10:00
Aneesh Kumar K.V
e09093927e powerpc/mm: Validate address values against different region limits
This adds an explicit check in various functions.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
0034d395f8 powerpc/mm/hash64: Map all the kernel regions in the same 0xc range
This patch maps vmalloc, IO and vmemap regions in the 0xc address range
instead of the current 0xd and 0xf range. This brings the mapping closer
to radix translation mode.

With hash 64K page size each of this region is 512TB whereas with 4K config
we are limited by the max page table range of 64TB and hence there regions
are of 16TB size.

The kernel mapping is now:

 On 4K hash

     kernel_region_map_size = 16TB
     kernel vmalloc start   = 0xc000100000000000
     kernel IO start        = 0xc000200000000000
     kernel vmemmap start   = 0xc000300000000000

64K hash, 64K radix and 4k radix:

     kernel_region_map_size = 512TB
     kernel vmalloc start   = 0xc008000000000000
     kernel IO start        = 0xc00a000000000000
     kernel vmemmap start   = 0xc00c000000000000

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
a35a3c6f60 powerpc/mm/hash64: Add a variable to track the end of IO mapping
This makes it easy to update the region mapping in the later patch

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
ef629cc5bf powerc/mm/hash: Reduce hash_mm_context size
Allocate subpage protect related variables only if we use the feature.
This helps in reducing the hash related mm context struct by around 4K

Before the patch
sizeof(struct hash_mm_context)  = 8288

After the patch
sizeof(struct hash_mm_context) = 4160

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
701101865f powerpc/mm: Reduce memory usage for mm_context_t for radix
Currently, our mm_context_t on book3s64 include all hash specific
context details like slice mask and subpage protection details. We
can skip allocating these with radix translation. This will help us to save
8K per mm_context with radix translation.

With the patch applied we have

sizeof(mm_context_t)  = 136
sizeof(struct hash_mm_context)  = 8288

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
67fda38f0d powerpc/mm: Move slb_addr_linit to early_init_mmu
Avoid #ifdef in generic code. Also enables us to do this specific to
MMU translation mode on book3s64

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
60458fba46 powerpc/mm: Add helpers for accessing hash translation related variables
We want to switch to allocating them runtime only when hash translation is
enabled. Add helpers so that both book3s and nohash can be adapted to
upcoming change easily.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:38 +10:00
Christophe Leroy
a68c31fc01 powerpc/32s: Implement Kernel Userspace Access Protection
This patch implements Kernel Userspace Access Protection for
book3s/32.

Due to limitations of the processor page protection capabilities,
the protection is only against writing. read protection cannot be
achieved using page protection.

The previous patch modifies the page protection so that RW user
pages are RW for Key 0 and RO for Key 1, and it sets Key 0 for
both user and kernel.

This patch changes userspace segment registers are set to Ku 0
and Ks 1. When kernel needs to write to RW pages, the associated
segment register is then changed to Ks 0 in order to allow write
access to the kernel.

In order to avoid having the read all segment registers when
locking/unlocking the access, some data is kept in the thread_struct
and saved on stack on exceptions. The field identifies both the
first unlocked segment and the first segment following the last
unlocked one. When no segment is unlocked, it contains value 0.

As the hash_page() function is not able to easily determine if a
protfault is due to a bad kernel access to userspace, protfaults
need to be handled by handle_page_fault when KUAP is set.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Drop allow_read/write_to/from_user() as they're now in kup.h,
      and adapt allow_user_access() to do nothing when to == NULL]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:47 +10:00
Christophe Leroy
f342adca3a powerpc/32s: Prepare Kernel Userspace Access Protection
This patch prepares Kernel Userspace Access Protection for
book3s/32.

Due to limitations of the processor page protection capabilities,
the protection is only against writing. read protection cannot be
achieved using page protection.

book3s/32 provides the following values for PP bits:

PP00 provides RW for Key 0 and NA for Key 1
PP01 provides RW for Key 0 and RO for Key 1
PP10 provides RW for all
PP11 provides RO for all

Today PP10 is used for RW pages and PP11 for RO pages, and user
segment register's Kp and Ks are set to 1. This patch modifies
page protection to use PP01 for RW pages and sets user segment
registers to Kp 0 and Ks 0.

This will allow to setup Userspace write access protection by
settng Ks to 1 in the following patch.

Kernel space segment registers remain unchanged.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
31ed2b13c4 powerpc/32s: Implement Kernel Userspace Execution Prevention.
To implement Kernel Userspace Execution Prevention, this patch
sets NX bit on all user segments on kernel entry and clears NX bit
on all user segments on kernel exit.

Note that powerpc 601 doesn't have the NX bit, so KUEP will not
work on it. A warning is displayed at startup.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
2679f9bd0a powerpc/8xx: Add Kernel Userspace Access Protection
This patch adds Kernel Userspace Access Protection on the 8xx.

When a page is RO or RW, it is set RO or RW for Key 0 and NA
for Key 1.

Up to now, the User group is defined with Key 0 for both User and
Supervisor.

By changing the group to Key 0 for User and Key 1 for Supervisor,
this patch prevents the Kernel from being able to access user data.

At exception entry, the kernel saves SPRN_MD_AP in the regs struct,
and reapply the protection. At exception exit it restores SPRN_MD_AP
with the value saved on exception entry.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Drop allow_read/write_to/from_user() as they're now in kup.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
06fbe81b59 powerpc/8xx: Add Kernel Userspace Execution Prevention
This patch adds Kernel Userspace Execution Prevention on the 8xx.

When a page is Executable, it is set Executable for Key 0 and NX
for Key 1.

Up to now, the User group is defined with Key 0 for both User and
Supervisor.

By changing the group to Key 0 for User and Key 1 for Supervisor,
this patch prevents the Kernel from being able to execute user code.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Michael Ellerman
5e5be3aed2 powerpc/mm: Detect bad KUAP faults
When KUAP is enabled we have logic to detect page faults that occur
outside of a valid user access region and are blocked by the AMR.

What we don't have at the moment is logic to detect a fault *within* a
valid user access region, that has been incorrectly blocked by AMR.
This is not meant to ever happen, but it can if we incorrectly
save/restore the AMR, or if the AMR was overwritten for some other
reason.

Currently if that happens we assume it's just a regular fault that
will be corrected by handling the fault normally, so we just return.
But there is nothing the fault handling code can do to fix it, so the
fault just happens again and we spin forever, leading to soft lockups.

So add some logic to detect that case and WARN() if we ever see it.
Arguably it should be a BUG(), but it's more polite to fail the access
and let the kernel continue, rather than taking down the box. There
should be no data integrity issue with failing the fault rather than
BUG'ing, as we're just going to disallow an access that should have
been allowed.

To make the code a little easier to follow, unroll the condition at
the end of bad_kernel_fault() and comment each case, before adding the
call to bad_kuap_fault().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:06:04 +10:00
Michael Ellerman
890274c2dc powerpc/64s: Implement KUAP for Radix MMU
Kernel Userspace Access Prevention utilises a feature of the Radix MMU
which disallows read and write access to userspace addresses. By
utilising this, the kernel is prevented from accessing user data from
outside of trusted paths that perform proper safety checks, such as
copy_{to/from}_user() and friends.

Userspace access is disabled from early boot and is only enabled when
performing an operation like copy_{to/from}_user(). The register that
controls this (AMR) does not prevent userspace from accessing itself,
so there is no need to save and restore when entering and exiting
userspace.

When entering the kernel from the kernel we save AMR and if it is not
blocking user access (because eg. we faulted doing a user access) we
reblock user access for the duration of the exception (ie. the page
fault) and then restore the AMR when returning back to the kernel.

This feature can be tested by using the lkdtm driver (CONFIG_LKDTM=y)
and performing the following:

  # (echo ACCESS_USERSPACE) > [debugfs]/provoke-crash/DIRECT

If enabled, this should send SIGSEGV to the thread.

We also add paranoid checking of AMR in switch and syscall return
under CONFIG_PPC_KUAP_DEBUG.

Co-authored-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:06:02 +10:00
Russell Currey
1bb2bae2e6 powerpc/mm/radix: Use KUEP API for Radix MMU
Execution protection already exists on radix, this just refactors
the radix init to provide the KUEP setup function instead.

Thus, the only functional change is that it can now be disabled.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:06:01 +10:00
Russell Currey
b28c97505e powerpc/64: Setup KUP on secondary CPUs
Some platforms (i.e. Radix MMU) need per-CPU initialisation for KUP.

Any platforms that only want to do KUP initialisation once
globally can just check to see if they're running on the boot CPU, or
check if whatever setup they need has already been performed.

Note that this is only for 64-bit.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:59 +10:00
Christophe Leroy
de78a9c42a powerpc: Add a framework for Kernel Userspace Access Protection
This patch implements a framework for Kernel Userspace Access
Protection.

Then subarches will have the possibility to provide their own
implementation by providing setup_kuap() and
allow/prevent_user_access().

Some platforms will need to know the area accessed and whether it is
accessed from read, write or both. Therefore source, destination and
size and handed over to the two functions.

mpe: Rename to allow/prevent rather than unlock/lock, and add
read/write wrappers. Drop the 32-bit code for now until we have an
implementation for it. Add kuap to pt_regs for 64-bit as well as
32-bit. Don't split strings, use pr_crit_ratelimited().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:57 +10:00
Christophe Leroy
0fb1c25ab5 powerpc: Add skeleton for Kernel Userspace Execution Prevention
This patch adds a skeleton for Kernel Userspace Execution Prevention.

Then subarches implementing it have to define CONFIG_PPC_HAVE_KUEP
and provide setup_kuep() function.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Don't split strings, use pr_crit_ratelimited()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:56 +10:00
Christophe Leroy
69795cabe4 powerpc: Add framework for Kernel Userspace Protection
This patch adds a skeleton for Kernel Userspace Protection
functionnalities like Kernel Userspace Access Protection and Kernel
Userspace Execution Prevention

The subsequent implementation of KUAP for radix makes use of a MMU
feature in order to patch out assembly when KUAP is disabled or
unsupported. This won't work unless there's an entry point for KUP
support before the feature magic happens, so for PPC64 setup_kup() is
called early in setup.

On PPC32, feature_fixup() is done too early to allow the same.

Suggested-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:54 +10:00
Nathan Lynch
558f86493d powerpc/numa: document topology_updates_enabled, disable by default
Changing the NUMA associations for CPUs and memory at runtime is
basically unsupported by the core mm, scheduler etc. We see all manner
of crashes, warnings and instability when the pseries code tries to do
this. Disable this behavior by default, and document the switch a bit.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:03:59 +10:00
Nathan Lynch
2d4d9b308f powerpc/numa: improve control of topology updates
When booted with "topology_updates=no", or when "off" is written to
/proc/powerpc/topology_updates, NUMA reassignments are inhibited for
PRRN and VPHN events. However, migration and suspend unconditionally
re-enable reassignments via start_topology_update(). This is
incoherent.

Check the topology_updates_enabled flag in
start/stop_topology_update() so that callers of those APIs need not be
aware of whether reassignments are enabled. This allows the
administrative decision on reassignments to remain in force across
migrations and suspensions.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:03:57 +10:00
Jagadeesh Pagadala
6917735e8f powerpc: Remove duplicate headers
Remove duplicate headers inclusions.

Signed-off-by: Jagadeesh Pagadala <jagdsh.linux@gmail.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:44 +10:00
Laurent Vivier
f172acbfae powerpc/mm: move warning from resize_hpt_for_hotplug()
resize_hpt_for_hotplug() reports a warning when it cannot
resize the hash page table ("Unable to resize hash page
table to target order") but in some cases it's not a problem
and can make user thinks something has not worked properly.

This patch moves the warning to arch_remove_memory() to
only report the problem when it is needed.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:26 +10:00
Christophe Leroy
6c84f8c5cb powerpc/highmem: Change BUG_ON() to WARN_ON()
In arch/powerpc/mm/highmem.c, BUG_ON() is called only when
CONFIG_DEBUG_HIGHMEM is selected, this means the BUG_ON() is not vital
and can be replaced by a a WARN_ON().

At the same time, use IS_ENABLED() instead of #ifdef to clean a bit.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:11 +10:00