Commit Graph

20068 Commits

Author SHA1 Message Date
Christophe Leroy
7a792d5da2 powerpc/mm: get rid of nohash/32/mmu.h and nohash/64/mmu.h
Those files have no real added values, especially the 64 bit
which only includes the common book3e mmu.h which is also
included from 32 bits side.

So lets do the final inclusion directly from nohash/mmu.h

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
696dffa24b powerpc/mm: move pgtable_t in asm/mmu.h
pgtable_t is now identical for all subarches, move it to the
top level asm/mmu.h

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
737b434d3d powerpc/mm: convert Book3E 64 to pte_fragment
Book3E 64 is the only subarch not using pte_fragment. In order
to allow refactorisation, this patch converts it to pte_fragment.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
447def3b06 powerpc/mm: drop __bad_pte()
This has never been called (since Kernel has been in git at least),
drop it.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
26e66b08c3 powerpc/mm: flatten function __find_linux_pte() step 3
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.

Previous patches left a { } block. This patch removes it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
e2fb251188 powerpc/mm: flatten function __find_linux_pte() step 2
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.

Previous patch left { } blocks. This patch removes the first one
by shifting its content to the left.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
fab9a1165b powerpc/mm: flatten function __find_linux_pte() step 1
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.

This patch flattens the function by getting rid of as much if/else
as possible. In order to ease the review, this is done in three steps.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
4df4b27585 powerpc/mm: cleanup remaining ifdef mess in hugetlbpage.c
Only 3 subarches support huge pages. So when it is either 2 of them,
it is not the third one.

And mmu_has_feature() is known by all subarches so IS_ENABLED() can
be used instead of #ifdef

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
c5710cd207 powerpc/mm: cleanup HPAGE_SHIFT setup
Only book3s/64 may select default among several HPAGE_SHIFT at runtime.
8xx always defines 512K pages as default
FSL_BOOK3E always defines 4M pages as default

This patch limits HUGETLB_PAGE_SIZE_VARIABLE to book3s/64
moves the definitions in subarches files.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
45d0ba527b powerpc/mm: move hugetlb_disabled into asm/hugetlb.h
No need to have this in asm/page.h, move it into asm/hugetlb.h

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:24 +10:00
Christophe Leroy
723f268f19 powerpc/mm: cleanup ifdef mess in add_huge_page_size()
Introduce a subarch specific helper check_and_get_huge_psize()
to check the huge page sizes and cleanup the ifdef mess in
add_huge_page_size()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
5fb84fec46 powerpc/mm: add a helper to populate hugepd
This patchs adds a subarch helper to populate hugepd.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
8197af22be powerpc/mm: split asm/hugetlb.h into dedicated subarch files
Three subarches support hugepages:
  - fsl book3e
  - book3s/64
  - 8xx

This patch splits asm/hugetlb.h to reduce the #ifdef mess.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
0001e5aa5c powerpc/mm: make gup_hugepte() static
gup_huge_pd() is the only user of gup_hugepte() and it is
located in the same file. This patch moves gup_huge_pd()
after gup_hugepte() and makes gup_hugepte() static.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
b7dcf96ce0 powerpc/mm: make hugetlbpage.c depend on CONFIG_HUGETLB_PAGE
The only function in hugetlbpage.c which doesn't depend on
CONFIG_HUGETLB_PAGE is gup_hugepte(), and this function is
only called from gup_huge_pd() which depends on
CONFIG_HUGETLB_PAGE so all the content of hugetlbpage.c
depends on CONFIG_HUGETLB_PAGE.

This patch modifies Makefile to only compile hugetlbpage.c
when CONFIG_HUGETLB_PAGE is set.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
0caed4de50 powerpc/mm: move __find_linux_pte() out of hugetlbpage.c
__find_linux_pte() is the only function in hugetlbpage.c
which is compiled in regardless on CONFIG_HUGETLBPAGE

This patch moves it in pgtable.c.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
3dea7332cc powerpc/book3e: hugetlbpage is only for CONFIG_PPC_FSL_BOOK3E
As per Kconfig.cputype, only CONFIG_PPC_FSL_BOOK3E gets to
select SYS_SUPPORTS_HUGETLBFS so simplify accordingly.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
5874cabe29 powerpc/64: only book3s/64 supports CONFIG_PPC_64K_PAGES
CONFIG_PPC_64K_PAGES cannot be selected by nohash/64.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
a521c44c3d powerpc/book3e: drop mmu_get_tsize()
This function is not used anymore, drop it.

Fixes: b42279f016 ("powerpc/mm/nohash: MM_SLICE is only used by book3s 64")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
5953fb4f46 powerpc/mm: define subarch SLB_ADDR_LIMIT_DEFAULT
This patch defines a subarch specific SLB_ADDR_LIMIT_DEFAULT
to remove the #ifdefs around the setup of mm->context.slb_addr_limit

It also generalises the use of mm_ctx_set_slb_addr_limit() helper.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
43ed7909d7 powerpc/mm: define get_slice_psize() all the time
get_slice_psize() can be defined regardless of CONFIG_PPC_MM_SLICES
to avoid ifdefs

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
33f128c649 powerpc/8xx: get rid of #ifdef CONFIG_HUGETLB_PAGE for slices
The 8xx only selects CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE
is set.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
203a1fa628 powerpc/mm: remove a couple of #ifdef CONFIG_PPC_64K_PAGES in mm/slice.c
This patch replaces a couple of #ifdef CONFIG_PPC_64K_PAGES
by IS_ENABLED(CONFIG_PPC_64K_PAGES) to improve code maintainability.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:23 +10:00
Christophe Leroy
b4baad0b27 powerpc/mm: remove unnecessary #ifdef CONFIG_PPC64
For PPC32 that's a noop, gcc should be smart enough to ignore it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
877461210e powerpc/mm: get rid of mm_ctx_slice_mask_xxx()
Now that slice_mask_for_size() is in mmu.h, the mm_ctx_slice_mask_xxx()
are not needed anymore, so drop them. Note that the 8xx ones where
not used anyway.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
fca5c1e9eb powerpc/mm: move slice_mask_for_size() into mmu.h
Move slice_mask_for_size() into subarch mmu.h

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Retain the BUG_ON()s, rather than converting to VM_BUG_ON()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
6f60cc98df powerpc/mm: hand a context_t over to slice_mask_for_size() instead of mm_struct
slice_mask_for_size() only uses mm->context, so hand directly a
pointer to the context. This will help moving the function in
subarch mmu.h in the next patch by avoiding having to include
the definition of struct mm_struct

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
02f89aed6b powerpc/mm: no slice for nohash/64
Only nohash/32 and book3s/64 support mm slices.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
5ba666d56c powerpc/mm: fix erroneous duplicate slb_addr_limit init
Commit 67fda38f0d ("powerpc/mm: Move slb_addr_linit to
early_init_mmu") moved slb_addr_limit init out of setup_arch().

Commit 701101865f ("powerpc/mm: Reduce memory usage for mm_context_t
for radix") brought it back into setup_arch() by error.

This patch reverts that erroneous regress.

Fixes: 701101865f ("powerpc/mm: Reduce memory usage for mm_context_t for radix")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
27e23b5f5f powerpc/mm: Move nohash specifics in subdirectory mm/nohash
Many files in arch/powerpc/mm are only for nohash. This patch
creates a subdirectory for them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Shorten new filenames]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
17312f258c powerpc/mm: Move book3s32 specifics in subdirectory mm/book3s64
Several files in arch/powerpc/mm are only for book3S32. This patch
creates a subdirectory for them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Shorten new filenames]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:20:22 +10:00
Christophe Leroy
47d99948ee powerpc/mm: Move book3s64 specifics in subdirectory mm/book3s64
Many files in arch/powerpc/mm are only for book3S64. This patch
creates a subdirectory for them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Update the selftest sym links, shorten new filenames, cleanup some
      whitespace and formatting in the new files.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-03 01:18:38 +10:00
Christophe Leroy
9d9f2cccde powerpc/mm: change #include "mmu_decl.h" to <mm/mmu_decl.h>
This patch make inclusion of mmu_decl.h independant of the location
of the file including it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-02 21:18:58 +10:00
Christophe Leroy
71faf8145c powerpc/nohash64: clean pgtable.h
TRANSPARENT_HUGEPAGE is only supported by book3s

VMEMMAP_REGION_ID is never used

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-02 21:18:57 +10:00
Christophe Leroy
a1ac2a9c4f powerpc/book3e: drop BUG_ON() in map_kernel_page()
early_alloc_pgtable() never returns NULL as it panics on failure.

This patch drops the three BUG_ON() which check the non nullity
of early_alloc_pgtable() returned value.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-02 21:18:57 +10:00
Christophe Leroy
12f363511d powerpc/32s: Fix BATs setting with CONFIG_STRICT_KERNEL_RWX
Serge reported some crashes with CONFIG_STRICT_KERNEL_RWX enabled
on a book3s32 machine.

Analysis shows two issues:
  - BATs addresses and sizes are not properly aligned.
  - There is a gap between the last address covered by BATs and the
    first address covered by pages.

Memory mapped with DBATs:
0: 0xc0000000-0xc07fffff 0x00000000 Kernel RO coherent
1: 0xc0800000-0xc0bfffff 0x00800000 Kernel RO coherent
2: 0xc0c00000-0xc13fffff 0x00c00000 Kernel RW coherent
3: 0xc1400000-0xc23fffff 0x01400000 Kernel RW coherent
4: 0xc2400000-0xc43fffff 0x02400000 Kernel RW coherent
5: 0xc4400000-0xc83fffff 0x04400000 Kernel RW coherent
6: 0xc8400000-0xd03fffff 0x08400000 Kernel RW coherent
7: 0xd0400000-0xe03fffff 0x10400000 Kernel RW coherent

Memory mapped with pages:
0xe1000000-0xefffffff  0x21000000       240M        rw       present           dirty  accessed

This patch fixes both issues. With the patch, we get the following
which is as expected:

Memory mapped with DBATs:
0: 0xc0000000-0xc07fffff 0x00000000 Kernel RO coherent
1: 0xc0800000-0xc0bfffff 0x00800000 Kernel RO coherent
2: 0xc0c00000-0xc0ffffff 0x00c00000 Kernel RW coherent
3: 0xc1000000-0xc1ffffff 0x01000000 Kernel RW coherent
4: 0xc2000000-0xc3ffffff 0x02000000 Kernel RW coherent
5: 0xc4000000-0xc7ffffff 0x04000000 Kernel RW coherent
6: 0xc8000000-0xcfffffff 0x08000000 Kernel RW coherent
7: 0xd0000000-0xdfffffff 0x10000000 Kernel RW coherent

Memory mapped with pages:
0xe0000000-0xefffffff  0x20000000       256M        rw       present           dirty  accessed

Fixes: 63b2bc6195 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
Reported-by: Serge Belyshev <belyshev@depni.sinp.msu.ru>
Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-02 15:33:46 +10:00
Al Viro
6d0e0d0bb8 spufs: switch to ->free_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-01 22:43:23 -04:00
Breno Leitao
e620d45065 powerpc/tm: Avoid machine crash on rt_sigreturn()
There is a kernel crash that happens if rt_sigreturn() is called inside
a transactional block.

This crash happens if the kernel hits an in-kernel page fault when
accessing userspace memory, usually through copy_ckvsx_to_user(). A
major page fault calls might_sleep() function, which can cause a task
reschedule. A task reschedule (switch_to()) reclaim and recheckpoint
the TM states, but, in the signal return path, the checkpointed memory
was already reclaimed, thus the exception stack has MSR that points to
MSR[TS]=0.

When the code returns from might_sleep() and a task reschedule
happened, then this task is returned with the memory recheckpointed,
and CPU MSR[TS] = suspended.

This means that there is a side effect at might_sleep() if it is
called with CPU MSR[TS] = 0 and the task has regs->msr[TS] != 0.

This side effect can cause a TM bad thing, since at the exception
entrance, the stack saves MSR[TS]=0, and this is what will be used at
RFID, but, the processor has MSR[TS] = Suspended, and this transition
will be invalid and a TM Bad thing will be raised, causing the
following crash:

  Unexpected TM Bad Thing exception at c00000000000e9ec (msr 0x8000000302a03031) tm_scratch=800000010280b033
  cpu 0xc: Vector: 700 (Program Check) at [c00000003ff1fd70]
      pc: c00000000000e9ec: fast_exception_return+0x100/0x1bc
      lr: c000000000032948: handle_rt_signal64+0xb8/0xaf0
      sp: c0000004263ebc40
     msr: 8000000302a03031
    current = 0xc000000415050300
    paca    = 0xc00000003ffc4080	 irqmask: 0x03	 irq_happened: 0x01
      pid   = 25006, comm = sigfuz
  Linux version 5.0.0-rc1-00001-g3bd6e94bec12 (breno@debian) (gcc version 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019
  WARNING: exception is not recoverable, can't continue
  enter ? for help
  [c0000004263ebc40] c000000000032948 handle_rt_signal64+0xb8/0xaf0 (unreliable)
  [c0000004263ebd30] c000000000022780 do_notify_resume+0x2f0/0x430
  [c0000004263ebe20] c00000000000e844 ret_from_except_lite+0x70/0x74
  --- Exception: c00 (System Call) at 00007fffbaac400c
  SP (7fffeca90f40) is in userspace

The solution for this problem is running the sigreturn code with
regs->msr[TS] disabled, thus, avoiding hitting the side effect above.
This does not seem to be a problem since regs->msr will be replaced by
the ucontext value, so, it is being flushed already. In this case, it
is flushed earlier.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 23:03:29 +10:00
Aneesh Kumar K.V
2c474c0350 powerpc/mm/radix: Fix kernel crash when running subpage protect test
This patch fixes the below crash by making sure we touch the subpage
protection related structures only if we know they are allocated on
the platform. With radix translation we don't allocate hash context at
all and trying to access subpage_prot_table results in:

  Faulting instruction address: 0xc00000000008bdb4
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
  ....
  NIP [c00000000008bdb4] sys_subpage_prot+0x74/0x590
  LR [c00000000000b688] system_call+0x5c/0x70
  Call Trace:
  [c00020002c6b7d30] [c00020002c6b7d90] 0xc00020002c6b7d90 (unreliable)
  [c00020002c6b7e20] [c00000000000b688] system_call+0x5c/0x70
  Instruction dump:
  fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f821ff11 e92d1178 f9210068
  39200000 e92d0968 ebe90630 e93f03e8 <eb891038> 60000000 3860fffe e9410068

We also move the subpage_prot_table with mmp_sem held to avoid race
between two parallel subpage_prot syscall.

Fixes: 701101865f ("powerpc/mm: Reduce memory usage for mm_context_t for radix")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 23:01:33 +10:00
Mahesh Salgaonkar
50dbabe06a powerpc/powernv/mce: Print additional information about MCE error.
Print more information about MCE error whether it is an hardware or
software error.

Some of the MCE errors can be easily categorized as hardware or
software errors e.g. UEs are due to hardware error, where as error
triggered due to invalid usage of tlbie is a pure software bug. But
not all the MCE errors can be easily categorize into either software
or hardware. There are errors like multihit errors which are usually
result of a software bug, but in some rare cases a hardware failure
can cause a multihit error. In past, we have seen case where after
replacing faulty chip, multihit errors stopped occurring. Same with
parity errors, which are usually due to faulty hardware but there are
chances where multihit can also cause an parity error. Such errors are
difficult to determine what really caused it. Hence this patch
classifies MCE errors into following four categorize:

  1. Hardware error:
  	UE and Link timeout failure errors.
  2. Probable hardware error (some chance of software cause)
  	SLB/ERAT/TLB Parity errors.
  3. Software error
  	Invalid tlbie form.
  4. Probable software error (some chance of hardware cause)
  	SLB/ERAT/TLB Multihit errors.

Sample output:

  MCE: CPU80: machine check (Warning) Guest SLB Multihit DAR: 000001001b6e0320 [Recovered]
  MCE: CPU80: PID: 24765 Comm: qemu-system-ppc Guest NIP: [00007fffa309dc60]
  MCE: CPU80: Probable Software error (some chance of hardware cause)

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 22:23:20 +10:00
Mahesh Salgaonkar
cda6618d06 powerpc/powernv/mce: Print correct severity for MCE error.
Currently all machine check errors are printed as severe errors which
isn't correct. Print soft errors as warning instead of severe errors.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 22:22:51 +10:00
Mahesh Salgaonkar
d6e8a15085 powerpc/powernv/mce: Reduce MCE console logs to lesser lines.
Also add cpu number while displaying MCE log. This will help cleaner
logs when MCE hits on multiple cpus simultaneously.

Before the changes the MCE output was:

  Severe Machine check interrupt [Recovered]
    NIP [d00000000ba80280]: insert_slb_entry.constprop.0+0x278/0x2c0 [mcetest_slb]
    Initiator: CPU
    Error type: SLB [Multihit]
      Effective address: d00000000ba80280

After this patch series changes the MCE output will be:

  MCE: CPU80: machine check (Warning) Host SLB Multihit [Recovered]
  MCE: CPU80: NIP: [d00000000b550280] insert_slb_entry.constprop.0+0x278/0x2c0 [mcetest_slb]
  MCE: CPU80: Probable software error (some chance of hardware cause)

UE in host application:

  MCE: CPU48: machine check (Severe) Host UE Load/Store DAR: 00007fffc6079a80 paddr: 0000000f8e260000 [Not recovered]
  MCE: CPU48: PID: 4584 Comm: find NIP: [0000000010023368]
  MCE: CPU48: Hardware error

and for MCE in Guest:

  MCE: CPU80: machine check (Warning) Guest SLB Multihit DAR: 000001001b6e0320 [Recovered]
  MCE: CPU80: PID: 24765 Comm: qemu-system-ppc Guest NIP: [00007fffa309dc60]
  MCE: CPU80: Probable software error (some chance of hardware cause)

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 22:22:24 +10:00
Anton Blanchard
5b2a152962 powerpc: Add doorbell tracepoints
When analysing sources of OS jitter, I noticed that doorbells cannot be
traced.

Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 16:45:05 +10:00
Mathieu Malaterre
a5ae043de7 powerpc/64s: Remove 'dummy_copy_buffer'
In commit 2bf1071a8d ("powerpc/64s: Remove POWER9 DD1 support") the
function __switch_to remove usage for 'dummy_copy_buffer'. Since it is
not used anywhere else, remove it completely.

This remove the following warning:
  arch/powerpc/kernel/process.c:1156:17: error: 'dummy_copy_buffer' defined but not used

Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 12:08:44 +10:00
Tobin C. Harding
7e8039795a powerpc/cacheinfo: Fix kobject memleak
Currently error return from kobject_init_and_add() is not followed by
a call to kobject_put(). This means there is a memory leak.

Add call to kobject_put() in error path of kobject_init_and_add().

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 10:52:29 +10:00
Nick Desaulniers
33dda8c327 powerpc/vdso: Drop unnecessary cc-ldoption
Towards the goal of removing cc-ldoption, it seems that --hash-style=
was added to binutils 2.17.50.0.2 in 2006. The minimal required
version of binutils for the kernel according to
Documentation/process/changes.rst is 2.20.

Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-01 10:49:58 +10:00
Alexey Kardashevskiy
b511cdd1c1 powerpc/powernv/ioda: Handle failures correctly in pnv_pci_ioda_iommu_bypass_supported()
When the return value type was changed from int to bool, few places
were left unchanged, this fixes them. We did not hit these failures as
the first one is not happening at all and the second one is little
more likely to happen if the user switches a 33..58bit DMA capable
device between the VFIO and vendor drivers and there are not so many
of these.

Fixes: 2d6ad41b2c ("powerpc/powernv: use the generic iommu bypass code")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-30 23:23:24 +10:00
Michael Ellerman
bdc7c970bc Merge branch 'topic/ppc-kvm' into next
Merge our topic branch shared with KVM. In particular this includes the
rewrite of the idle code into C.
2019-04-30 22:52:03 +10:00
Michael Ellerman
e9cef0189c powerpc/powernv/idle: Restore AMR/UAMOR/AMOR/IAMR after idle
This is an implementation of commits 53a712bae5
("powerpc/powernv/idle: Restore AMR/UAMOR/AMOR after idle") and
a3f3072db6 ("powerpc/powernv/idle: Restore IAMR after idle") using
the new C-based idle code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Extract from Nick's patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-30 22:38:56 +10:00
Nicholas Piggin
10d91611f4 powerpc/64s: Reimplement book3s idle code in C
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
speific HV idle code to the powernv platform code.

Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and restoring the stack and reloading GPRs then
returning to C after waking from idle.

The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.

This is not a strict translation to C code, there are some
significant differences:

- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
  but saves and restores them itself.

- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
  or change MSR is restored, because it's now simple to do. ESL=1
  sleeps that do not lose GPRs can use this optimization too.

- KVM secondary entry and cede is now more of a call/return style
  rather than branchy. nap_state_lost is not required because KVM
  always returns via NVGPR restoring path.

- KVM secondary wakeup from offline sequence is moved entirely into
  the offline wakeup, which avoids a hwsync in the normal idle wakeup
  path.

Performance measured with context switch ping-pong on different
threads or cores, is possibly improved a small amount, 1-3% depending
on stop state and core vs thread test for shallow states. Deep states
it's in the noise compared with other latencies.

KVM improvements:

- Idle sleepers now always return to caller rather than branch out
  to KVM first.

- This allows optimisations like very fast return to caller when no
  state has been lost.

- KVM no longer requires nap_state_lost because it controls NVGPR
  save/restore itself on the way in and out.

- The heavy idle wakeup KVM request check can be moved out of the
  normal host idle code and into the not-performance-critical offline
  code.

- KVM nap code now returns from where it is called, which makes the
  flow a bit easier to follow.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Squash the KVM changes in]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-30 22:37:48 +10:00
Paul Mackerras
0caecf5b00 KVM: PPC: Book3S HV: XIVE: Clear escalation interrupt pointers on device close
This adds code to ensure that after a XIVE or XICS-on-XIVE KVM device
is closed, KVM will not try to enable or disable any of the escalation
interrupts for the VCPUs.  We don't have to worry about races between
clearing the pointers and use of the pointers by the XIVE context
push/pull code, because the callers hold the vcpu->mutex, which is
also taken by the KVM_RUN code.  Therefore the vcpu cannot be entering
or exiting the guest concurrently.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:41:01 +10:00
Paul Mackerras
6f868405fa KVM: PPC: Book3S HV: XIVE: Prevent races when releasing device
Now that we have the possibility of a XIVE or XICS-on-XIVE device being
released while the VM is still running, we need to be careful about
races and potential use-after-free bugs.  Although the kvmppc_xive
struct is not freed, but kept around for re-use, the kvmppc_xive_vcpu
structs are freed, and they are used extensively in both the XIVE native
and XICS-on-XIVE code.

There are various ways in which XIVE code gets invoked:

- VCPU entry and exit, which do push and pull operations on the XIVE hardware
- one_reg get and set functions (vcpu->mutex is held)
- XICS hypercalls (but only inside guest execution, not from
  kvmppc_pseries_do_hcall)
- device creation calls (kvm->lock is held)
- device callbacks - get/set attribute, mmap, pagefault, release/destroy
- set_mapped/clr_mapped calls (kvm->lock is held)
- connect_vcpu calls
- debugfs file read callbacks

Inside a device release function, we know that userspace cannot have an
open file descriptor referring to the device, nor can it have any mmapped
regions from the device.  Therefore the device callbacks are excluded,
as are the connect_vcpu calls (since they need a fd for the device).
Further, since the caller holds the kvm->lock mutex, no other device
creation calls or set/clr_mapped calls can be executing concurrently.

To exclude VCPU execution and XICS hypercalls, we temporarily set
kvm->arch.mmu_ready to 0.  This forces any VCPU task that is trying to
enter the guest to take the kvm->lock mutex, which is held by the caller
of the release function.  Then, sending an IPI to all other CPUs forces
any VCPU currently executing in the guest to exit.

Finally, we take the vcpu->mutex for each VCPU around the process of
cleaning up and freeing its XIVE data structures, in order to exclude
any one_reg get/set calls.

To exclude the debugfs read callbacks, we just need to ensure that
debugfs_remove is called before freeing any data structures.  Once it
returns we know that no CPU can be executing the callbacks (for our
kvmppc_xive instance).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:41:01 +10:00
Cédric Le Goater
5422e95103 KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method
When a P9 sPAPR VM boots, the CAS negotiation process determines which
interrupt mode to use (XICS legacy or XIVE native) and invokes a
machine reset to activate the chosen mode.

We introduce 'release' methods for the XICS-on-XIVE and the XIVE
native KVM devices which are called when the file descriptor of the
device is closed after the TIMA and ESB pages have been unmapped.
They perform the necessary cleanups : clear the vCPU interrupt
presenters that could be attached and then destroy the device. The
'release' methods replace the 'destroy' methods as 'destroy' is not
called anymore once 'release' is. Compatibility with older QEMU is
nevertheless maintained.

This is not considered as a safe operation as the vCPUs are still
running and could be referencing the KVM device through their
presenters. To protect the system from any breakage, the kvmppc_xive
objects representing both KVM devices are now stored in an array under
the VM. Allocation is performed on first usage and memory is freed
only when the VM exits.

[paulus@ozlabs.org - Moved freeing of xive structures to book3s.c,
 put it under #ifdef CONFIG_KVM_XICS.]

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:40:39 +10:00
Cédric Le Goater
3fab2d1058 KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode
Full support for the XIVE native exploitation mode is now available,
advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on
PowerNV KVM Hypervisors only. Support for nested guests (pseries KVM
Hypervisor) is not yet available. XIVE should also have been activated
which is default setting on POWER9 systems running a recent Linux
kernel.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
232b984b7d KVM: PPC: Book3S HV: XIVE: Add passthrough support
The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
implement an IRQ space for the guest using the generic IPI interrupts
of the XIVE IC controller. These interrupts are allocated at the OPAL
level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
Interrupt management is performed in the XIVE way: using loads and
stores on the addresses of the XIVE IPI interrupt ESB pages.

Both KVM devices share the same internal structure caching information
on the interrupts, among which the xive_irq_data struct containing the
addresses of the IPI ESB pages and an extra one in case of pass-through.
The later contains the addresses of the ESB pages of the underlying HW
controller interrupts, PHB4 in all cases for now.

A guest, when running in the XICS legacy interrupt mode, lets the KVM
XICS-over-XIVE device "handle" interrupt management, that is to
perform the loads and stores on the addresses of the ESB pages of the
guest interrupts. However, when running in XIVE native exploitation
mode, the KVM XIVE native device exposes the interrupt ESB pages to
the guest and lets the guest perform directly the loads and stores.

The VMA exposing the ESB pages make use of a custom VM fault handler
which role is to populate the VMA with appropriate pages. When a fault
occurs, the guest IRQ number is deduced from the offset, and the ESB
pages of associated XIVE IPI interrupt are inserted in the VMA (using
the internal structure caching information on the interrupts).

Supporting device passthrough in the guest running in XIVE native
exploitation mode adds some extra refinements because the ESB pages
of a different HW controller (PHB4) need to be exposed to the guest
along with the initial IPI ESB pages of the XIVE IC controller. But
the overall mechanic is the same.

When the device HW irqs are mapped into or unmapped from the guest
IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped(), are called to record or clear the
passthrough interrupt information and to perform the switch.

The approach taken by this patch is to clear the ESB pages of the
guest IRQ number being mapped and let the VM fault handler repopulate.
The handler will insert the ESB page corresponding to the HW interrupt
of the device being passed-through or the initial IPI ESB page if the
device is being removed.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
6520ca64cd KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pages
Each source is associated with an Event State Buffer (ESB) with a
even/odd pair of pages which provides commands to manage the source:
to trigger, to EOI, to turn off the source for instance.

The custom VM fault handler will deduce the guest IRQ number from the
offset of the fault, and the ESB page of the associated XIVE interrupt
will be inserted into the VMA using the internal structure caching
information on the interrupts.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
39e9af3de5 KVM: PPC: Book3S HV: XIVE: Add a TIMA mapping
Each thread has an associated Thread Interrupt Management context
composed of a set of registers. These registers let the thread handle
priority management and interrupt acknowledgment. The most important
are :

    - Interrupt Pending Buffer     (IPB)
    - Current Processor Priority   (CPPR)
    - Notification Source Register (NSR)

They are exposed to software in four different pages each proposing a
view with a different privilege. The first page is for the physical
thread context and the second for the hypervisor. Only the third
(operating system) and the fourth (user level) are exposed the guest.

A custom VM fault handler will populate the VMA with the appropriate
pages, which should only be the OS page for now.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
e4945b9da5 KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE state
The state of the thread interrupt management registers needs to be
collected for migration. These registers are cached under the
'xive_saved_state.w01' field of the VCPU when the VPCU context is
pulled from the HW thread. An OPAL call retrieves the backup of the
IPB register in the underlying XIVE NVT structure and merges it in the
KVM state.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
e6714bd167 KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages
When migration of a VM is initiated, a first copy of the RAM is
transferred to the destination before the VM is stopped, but there is
no guarantee that the EQ pages in which the event notifications are
queued have not been modified.

To make sure migration will capture a consistent memory state, the
XIVE device should perform a XIVE quiesce sequence to stop the flow of
event notifications and stabilize the EQs. This is the purpose of the
KVM_DEV_XIVE_EQ_SYNC control which will also marks the EQ pages dirty
to force their transfer.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
7b46b6169a KVM: PPC: Book3S HV: XIVE: Add a control to sync the sources
This control will be used by the H_INT_SYNC hcall from QEMU to flush
event notifications on the XIVE IC owning the source.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
5ca8064748 KVM: PPC: Book3S HV: XIVE: Add a global reset control
This control is to be used by the H_INT_RESET hcall from QEMU. Its
purpose is to clear all configuration of the sources and EQs. This is
necessary in case of a kexec (for a kdump kernel for instance) to make
sure that no remaining configuration is left from the previous boot
setup so that the new kernel can start safely from a clean state.

The queue 7 is ignored when the XIVE device is configured to run in
single escalation mode. Prio 7 is used by escalations.

The XIVE VP is kept enabled as the vCPU is still active and connected
to the XIVE device.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
13ce3297c5 KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration
These controls will be used by the H_INT_SET_QUEUE_CONFIG and
H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
Event Queue in the XIVE IC. They will also be used to restore the
configuration of the XIVE EQs and to capture the internal run-time
state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
the EQ toggle bit and EQ index which are updated by the XIVE IC when
event notifications are enqueued in the EQ.

The value of the guest physical address of the event queue is saved in
the XIVE internal xive_q structure for later use. That is when
migration needs to mark the EQ pages dirty to capture a consistent
memory state of the VM.

To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra
OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
but restoring the EQ state will.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
e8676ce50e KVM: PPC: Book3S HV: XIVE: Add a control to configure a source
This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from
QEMU to configure the target of a source and also to restore the
configuration of a source when migrating the VM.

The XIVE source interrupt structure is extended with the value of the
Effective Interrupt Source Number. The EISN is the interrupt number
pushed in the event queue that the guest OS will use to dispatch
events internally. Caching the EISN value in KVM eases the test when
checking if a reconfiguration is indeed needed.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
4131f83c3d KVM: PPC: Book3S HV: XIVE: add a control to initialize a source
The XIVE KVM device maintains a list of interrupt sources for the VM
which are allocated in the pool of generic interrupts (IPIs) of the
main XIVE IC controller. These are used for the CPU IPIs as well as
for virtual device interrupts. The IRQ number space is defined by
QEMU.

The XIVE device reuses the source structures of the XICS-on-XIVE
device for the source blocks (2-level tree) and for the source
interrupts. Under XIVE native, the source interrupt caches mostly
configuration information and is less used than under the XICS-on-XIVE
device in which hcalls are still necessary at run-time.

When a source is initialized in KVM, an IPI interrupt source is simply
allocated at the OPAL level and then MASKED. KVM only needs to know
about its type: LSI or MSI.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
eacc56bb9d KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVE
The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
let QEMU connect the vCPU presenters to the XIVE KVM device if
required. The capability is not advertised for now as the full support
for the XIVE native exploitation mode is not yet available. When this
is case, the capability will be advertised on PowerNV Hypervisors
only. Nested guests (pseries KVM Hypervisor) are not supported.

Internally, the interface to the new KVM device is protected with a
new interrupt mode: KVMPPC_IRQ_XIVE.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater
90c73795af KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode
This is the basic framework for the new KVM device supporting the XIVE
native exploitation mode. The user interface exposes a new KVM device
to be created by QEMU, only available when running on a L0 hypervisor.
Support for nested guests is not available yet.

The XIVE device reuses the device structure of the XICS-on-XIVE device
as they have a lot in common. That could possibly change in the future
if the need arise.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Paul Mackerras
a878957a81 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the ppc-kvm topic branch from the powerpc tree to get
patches which touch both general powerpc code and KVM code, one of
which is a prerequisite for following patches.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:32:47 +10:00
Suraj Jitindar Singh
44b198aee1 KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry()
On POWER9 and later processors where the host can schedule vcpus on a
per thread basis, there is a streamlined entry path used when the guest
is radix. This entry path saves/restores the fp and vr state in
kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and
load_[fp/vr]_state(). This is the same as the old entry path however the
old entry path also saved/restored the VRSAVE register, which isn't done
in the new entry path.

This means that the vrsave register is now volatile across guest exit,
which is an incorrect change in behaviour.

Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry().
This restores the old, correct, behaviour.

Fixes: 95a6432ce9 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:32:19 +10:00
Paul Mackerras
70ea13f6e6 KVM: PPC: Book3S HV: Flush TLB on secondary radix threads
When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
If those guests are in radix mode, we fail to load the LPID and flush
the TLB if necessary, leading to the guest crashing with an
unsupported MMU fault.  This arises from commit 9a4506e11b ("KVM:
PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
with relocation on", 2018-05-17), which didn't consider the case
where indep_threads_mode = N.

For simplicity, this makes the real-mode guest entry path flush the
TLB in the same place for both radix and hash guests, as we did before
9a4506e11b, though the code is now C code rather than assembly code.
We also have the radix TLB flush open-coded rather than calling
radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
called in real mode, and in real mode we don't want to invoke the
tracepoint code.

Fixes: 9a4506e11b ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:32:12 +10:00
Paul Mackerras
2940ba0c48 KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C code
This replaces assembler code in book3s_hv_rmhandlers.S that checks
the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush
with C code in book3s_hv_builtin.c.  Note that unlike the radix
version, the hash version doesn't do an explicit ERAT invalidation
because we will invalidate and load up the SLB before entering the
guest, and that will invalidate the ERAT.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:32:01 +10:00
Suraj Jitindar Singh
7ae9bda7ed KVM: PPC: Book3S HV: Handle virtual mode in XIVE VCPU push code
The code in book3s_hv_rmhandlers.S that pushes the XIVE virtual CPU
context to the hardware currently assumes it is being called in real
mode, which is usually true.  There is however a path by which it can
be executed in virtual mode, in the case where indep_threads_mode = N.
A virtual CPU executing on an offline secondary thread can take a
hypervisor interrupt in virtual mode and return from the
kvmppc_hv_entry() call after the kvm_secondary_got_guest label.
It is possible for it to be given another vcpu to execute before it
gets to execute the stop instruction.  In that case it will call
kvmppc_hv_entry() for the second VCPU in virtual mode, and the XIVE
vCPU push code will be executed in virtual mode.  The result in that
case will be a host crash due to an unexpected data storage interrupt
caused by executing the stdcix instruction in virtual mode.

This fixes it by adding a code path for virtual mode, which uses the
virtual TIMA pointer and normal load/store instructions.

[paulus@ozlabs.org - wrote patch description]

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:31:52 +10:00
Paul Mackerras
1f80ba3d20 KVM: PPC: Book3S HV: Fix XICS-on-XIVE H_IPI when priority = 0
This fixes a bug in the XICS emulation on POWER9 machines which is
triggered by the guest doing a H_IPI with priority = 0 (the highest
priority).  What happens is that the notification interrupt arrives
at the destination at priority zero.  The loop in scan_interrupts()
sees that a priority 0 interrupt is pending, but because xc->mfrr is
zero, we break out of the loop before taking the notification
interrupt out of the queue and EOI-ing it.  (This doesn't happen
when xc->mfrr != 0; in that case we process the priority-0 notification
interrupt on the first iteration of the loop, and then break out of
a subsequent iteration of the loop with hirq == XICS_IPI.)

To fix this, we move the prio >= xc->mfrr check down to near the end
of the loop.  However, there are then some other things that need to
be adjusted.  Since we are potentially handling the notification
interrupt and also delivering an IPI to the guest in the same loop
iteration, we need to update pending and handle any q->pending_count
value before the xc->mfrr check, rather than at the end of the loop.
Also, we need to update the queue pointers when we have processed and
EOI-ed the notification interrupt, since we may not do it later.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:29:23 +10:00
Palmer Dabbelt
6fabc9f20c KVM: PPC: Book3S HV: smb->smp comment fixup
I made the same typo when trying to grep for uses of smp_wmb and figured
I might as well fix it.

Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:13 +10:00
Alexey Kardashevskiy
e1a1ef84cd KVM: PPC: Book3S: Allocate guest TCEs on demand too
We already allocate hardware TCE tables in multiple levels and skip
intermediate levels when we can, now it is a turn of the KVM TCE tables.
Thankfully these are allocated already in 2 levels.

This moves the table's last level allocation from the creating helper to
kvmppc_tce_put() and kvm_spapr_tce_fault(). Since such allocation cannot
be done in real mode, this creates a virtual mode version of
kvmppc_tce_put() which handles allocations.

This adds kvmppc_rm_ioba_validate() to do an additional test if
the consequent kvmppc_tce_put() needs a page which has not been allocated;
if this is the case, we bail out to virtual mode handlers.

The allocations are protected by a new mutex as kvm->lock is not suitable
for the task because the fault handler is called with the mmap_sem held
but kvmhv_setup_mmu() locks kvm->lock and mmap_sem in the reverse order.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:13 +10:00
Alexey Kardashevskiy
2001825efc KVM: PPC: Book3S HV: Avoid lockdep debugging in TCE realmode handlers
The kvmppc_tce_to_ua() helper is called from real and virtual modes
and it works fine as long as CONFIG_DEBUG_LOCKDEP is not enabled.
However if the lockdep debugging is on, the lockdep will most likely break
in kvm_memslots() because of srcu_dereference_check() so we need to use
PPC-own kvm_memslots_raw() which uses realmode safe
rcu_dereference_raw_notrace().

This creates a realmode copy of kvmppc_tce_to_ua() which replaces
kvm_memslots() with kvm_memslots_raw().

Since kvmppc_rm_tce_to_ua() becomes static and can only be used inside
HV KVM, this moves it earlier under CONFIG_KVM_BOOK3S_HV_POSSIBLE.

This moves truly virtual-mode kvmppc_tce_to_ua() to where it belongs and
drops the prmap parameter which was never used in the virtual mode.

Fixes: d3695aa4f4 ("KVM: PPC: Add support for multiple-TCE hcalls", 2016-02-15)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:13 +10:00
Alexey Kardashevskiy
3309bec85e KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest
The trace_hardirqs_on() sets current->hardirqs_enabled and from here
the lockdep assumes interrupts are enabled although they are remain
disabled until the context switches to the guest. Consequent
srcu_read_lock() checks the flags in rcu_lock_acquire(), observes
disabled interrupts and prints a warning (see below).

This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to
prevent lockdep from being confused.

DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280
[...]
NIP [c000000000185b84] check_flags.part.25+0x224/0x280
LR [c000000000185b80] check_flags.part.25+0x220/0x280
Call Trace:
[c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable)
[c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260
[c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv]
[c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv]
[c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
[c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
[c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
[c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80
[c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70
Instruction dump:
419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5
3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d
irq event stamp: 1025
hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
softirqs last disabled at (0): [<0000000000000000>]           (null)
---[ end trace 31180adcc848993e ]---
possible reason: unannotated irqs-off.
irq event stamp: 1025
hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
softirqs last disabled at (0): [<0000000000000000>]           (null)

Fixes: 8b24e69fc4 ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:12 +10:00
Suraj Jitindar Singh
eadfb1c5f8 KVM: PPC: Book3S HV: Implement real mode H_PAGE_INIT handler
Implement a real mode handler for the H_CALL H_PAGE_INIT which can be
used to zero or copy a guest page. The page is defined to be 4k and must
be 4k aligned.

The in-kernel real mode handler halves the time to handle this H_CALL
compared to handling it in userspace for a hash guest.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:12 +10:00
Suraj Jitindar Singh
2d34d1c3bb KVM: PPC: Book3S HV: Implement virtual mode H_PAGE_INIT handler
Implement a virtual mode handler for the H_CALL H_PAGE_INIT which can be
used to zero or copy a guest page. The page is defined to be 4k and must
be 4k aligned.

The in-kernel handler halves the time to handle this H_CALL compared to
handling it in userspace for a radix guest.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:41:20 +10:00
Nicholas Piggin
7ae3f6e130 powerpc/watchdog: Use hrtimers for per-CPU heartbeat
Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.

Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.

Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reported-by: Ravikumar Bangoria <ravi.bangoria@in.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reported-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-30 11:31:02 +10:00
Heiner Kallweit
51c51a48de powerpc/powernv/npu: Use pci_dev_id() helper
Use new helper pci_dev_id() to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2019-04-29 16:12:28 -05:00
Nathan Fontenot
b2d3b5ee66 powerpc/pseries: Track LMB nid instead of using device tree
When removing memory we need to remove the memory from the node
it was added to instead of looking up the node it should be in
in the device tree.

During testing we have seen scenarios where the affinity for a
LMB changes due to a partition migration or PRRN event. In these
cases the node the LMB exists in may not match the node the device
tree indicates it belongs in. This can lead to a system crash
when trying to DLPAR remove the LMB after a migration or PRRN
event. The current code looks up the node in the device tree to
remove the LMB from, the crash occurs when we try to offline this
node and it does not have any data, i.e. node_data[nid] == NULL.

36:mon> e
cpu 0x36: Vector: 300 (Data Access) at [c0000001828b7810]
    pc: c00000000036d08c: try_offline_node+0x2c/0x1b0
    lr: c0000000003a14ec: remove_memory+0xbc/0x110
    sp: c0000001828b7a90
   msr: 800000000280b033
   dar: 9a28
 dsisr: 40000000
  current = 0xc0000006329c4c80
  paca    = 0xc000000007a55200   softe: 0        irq_happened: 0x01
    pid   = 76926, comm = kworker/u320:3

36:mon> t
[link register   ] c0000000003a14ec remove_memory+0xbc/0x110
[c0000001828b7a90] c00000000006a1cc arch_remove_memory+0x9c/0xd0 (unreliable)
[c0000001828b7ad0] c0000000003a14e0 remove_memory+0xb0/0x110
[c0000001828b7b20] c0000000000c7db4 dlpar_remove_lmb+0x94/0x160
[c0000001828b7b60] c0000000000c8ef8 dlpar_memory+0x7e8/0xd10
[c0000001828b7bf0] c0000000000bf828 handle_dlpar_errorlog+0xf8/0x160
[c0000001828b7c60] c0000000000bf8cc pseries_hp_work_fn+0x3c/0xa0
[c0000001828b7c90] c000000000128cd8 process_one_work+0x298/0x5a0
[c0000001828b7d20] c000000000129068 worker_thread+0x88/0x620
[c0000001828b7dc0] c00000000013223c kthread+0x1ac/0x1c0
[c0000001828b7e30] c00000000000b45c ret_from_kernel_thread+0x5c/0x80

To resolve this we need to track the node a LMB belongs to when
it is added to the system so we can remove it from that node instead
of the node that the device tree indicates it should belong to.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-29 22:27:16 +10:00
Linus Torvalds
0d82044e1b powerpc fixes for 5.1 #6
A one-liner to make our Radix MMU support depend on HUGETLB_PAGE. We use some of
 the hugetlb inlines (eg. pud_huge()) when operating on the linear mapping and if
 they're compiled into empty wrappers we can corrupt memory.
 
 Then two fixes to our VFIO IOMMU code. The first is not a regression but fixes
 the locking to avoid a user-triggerable deadlock.
 
 The second does fix a regression since rc1, and depends on the first fix. It
 makes it possible to run guests with large amounts of memory again (~256GB).
 
 Thanks to:
   Alexey Kardashevskiy.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcxUumAAoJEFHr6jzI4aWAEXwP/0x1hscWlJdpblLKc0Mh7kud
 MYUiEjxTsw2r8QeEJ5e3AtDh44DyuhrnL09w01vpsKMzVVC5vZDOQVXuAexeI8vO
 iv3pfpUhNyvNjpVDH1rHGJuQ4hJHIKr34GoNAhAfgW7yiZEvFOir2qd+bQm7KKgQ
 H2xvRiEXUJ3p1nVzOt2XYXEInTB2VWkwEfeNXLmh0AIySlAK1OLlrvZaZ4LzI7Bn
 Binbn+mIeftFWbXMOG8jhbiaTf6v9KQeWj5fTmaGEfBP3FBHjMQNf7krgCkq5Juh
 Ijxc20ardxBXeOq8E23q8+oAuGfpZjsKO3PPtl0r/Eyub65+FMP98//7iXvmwChL
 bkRwC+05LLMZPHYJ9UcWBKreEy1BDzb7nPWzvGkSyXnOVylavw+xZk22imAj/5pk
 auuuizTpTrW8c9WDYoWDkBVfbOOo44o4Eor51pYbC8Tq6cOkIYA6uO80JAD9yz6L
 FuKwrapMYKo779LAiX6u46iB6AVAaq/2TktwwnyZSqmj/oIiwNMKhrlENiO+CwgH
 PvDkrs0HA02F9nyHUreIQr5DQDgrgw4ZnMsatxvzBu/zBoe2RITfRms4c8WmKb+w
 y63ezFx4+FiFQGkniEY5/+o7ewpUFw6JyjKI+Q02tFKUKnq+iZQHtMxgJgxswxN/
 k8EUzlyS2ZY8d7vHHQ4u
 =oJ9o
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "A one-liner to make our Radix MMU support depend on HUGETLB_PAGE. We
  use some of the hugetlb inlines (eg. pud_huge()) when operating on the
  linear mapping and if they're compiled into empty wrappers we can
  corrupt memory.

  Then two fixes to our VFIO IOMMU code. The first is not a regression
  but fixes the locking to avoid a user-triggerable deadlock.

  The second does fix a regression since rc1, and depends on the first
  fix. It makes it possible to run guests with large amounts of memory
  again (~256GB).

  Thanks to Alexey Kardashevskiy"

* tag 'powerpc-5.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm_iommu: Allow pinning large regions
  powerpc/mm_iommu: Fix potential deadlock
  powerpc/mm/radix: Make Radix require HUGETLB_PAGE
2019-04-28 10:43:15 -07:00
Colin Ian King
f341d89790 powerpc/mm: fix spelling mistake "Outisde" -> "Outside"
There are several identical spelling mistakes in warning messages,
fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-28 16:28:23 +10:00
Linus Torvalds
d286e13d53 arch: add pidfd and io_uring syscalls everywhere
This comes a bit late, but should be in 5.1 anyway: we want the newly
 added system calls to be synchronized across all architectures in
 the release.
 
 I hope that in the future, any newly added system calls can be added
 to all architectures at the same time, and tested there while they
 are in linux-next, avoiding dependencies between the architecture
 maintainer trees and the tree that contains the new system call.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJcv2aZAAoJEGCrR//JCVIncu4QALpTBqbjSu9u1/nXRGMLWo9J
 uToSBohDvsKW7wMkHcr1dU75ERIX9gqIY5pJWDrwzBdGDt02/oiy6WofXZDv4WkR
 Sp4YncdTeZENi0nNN+mrGDzNrcvBJd0FRc1MSLgPzfKXgf8P1oRzEsOaJVlGY5hS
 A8rNNUYE37m6rhTS59tNxzGvQcq3J7Q9ZRc0xjbSqIFngYVfQQiVbQCqd8RI6s9W
 +Hek+e5VF5HQnzhmTT9MQM4TsxMRMNfzrYpjhhayuLJ3CHROJPX7x9pZEGdyusQS
 5rDZxKes9SKTFS9QqycSyJkoP0awxrVrjqD1zFkWOJht0c3UCQAmw6GD7rlJkGPB
 vofuzmPzMq5XaZ8vpTucWNL+0ymzRXhhQ6esV39vRwxztRc4/DCy5MHDnrPK5yXb
 olPbltMAlHMaY5KePI/3jwpkcmzZjz9SNOKQ9/9tFlB5+RVF2qQdUgRMPE+XYa4H
 pRrZChrEAf6ZjINGeLlIVtpTlBFPl1LRF7UkOy7TYBvtRqukduXYpOFPb1XspQUl
 flIdBLOY3iF33o0eQnz10BEMxlblFhTj0SQrt0684kili7TjsWDaT+hPZSd72hhi
 Wey9l39kaexV2Sh7XZ6oUe205ay3R8sTn0Ic2+CnZaboeOuYlLYc8/w2HkTeTYmu
 9f3HAlX4Qu6RuX8bxLO0
 =Y7Kd
 -----END PGP SIGNATURE-----

Merge tag 'syscalls-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull syscall numbering updates from Arnd Bergmann:
 "arch: add pidfd and io_uring syscalls everywhere

  This comes a bit late, but should be in 5.1 anyway: we want the newly
  added system calls to be synchronized across all architectures in the
  release.

  I hope that in the future, any newly added system calls can be added
  to all architectures at the same time, and tested there while they are
  in linux-next, avoiding dependencies between the architecture
  maintainer trees and the tree that contains the new system call"

* tag 'syscalls-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  arch: add pidfd and io_uring syscalls everywhere
2019-04-23 13:34:17 -07:00
Jonas Gorski
ce0c890e2a powerpc/512x: mark clocks as big endian
These clocks' registers are accessed as big endian, so mark them as
such.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2019-04-23 10:57:49 -07:00
Aneesh Kumar K.V
26ad26718d powerpc/mm: Fix section mismatch warning
This patch fix the below section mismatch warnings.

WARNING: vmlinux.o(.text+0x2d1f44): Section mismatch in reference from the function devm_memremap_pages_release() to the function .meminit.text:arch_remove_memory()
WARNING: vmlinux.o(.text+0x2d265c): Section mismatch in reference from the function devm_memremap_pages() to the function .meminit.text:arch_add_memory()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:40 +10:00
Aneesh Kumar K.V
5f53d28608 powerpc/mm/hash: Rename KERNEL_REGION_ID to LINEAR_MAP_REGION_ID
The region actually point to linear map. Rename the #define to
clarify thati.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:40 +10:00
Aneesh Kumar K.V
a092a03fa9 powerpc/mm: Print kernel map details to dmesg
This helps in debugging. We can look at the dmesg to find out
different kernel mapping details.

On 4K config this shows

 kernel vmalloc start   = 0xc000100000000000
 kernel IO start        = 0xc000200000000000
 kernel vmemmap start   = 0xc000300000000000

On 64K config:

 kernel vmalloc start   = 0xc008000000000000
 kernel IO start        = 0xc00a000000000000
 kernel vmemmap start   = 0xc00c000000000000

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:40 +10:00
Aneesh Kumar K.V
1c946c1b7f powerpc/mm/hash: Simplify the region id calculation.
This reduces multiple comparisons in get_region_id to a bit shift operation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:40 +10:00
Aneesh Kumar K.V
53ed7a5947 powerpc/mm: Drop the unnecessary region check
All the regions are now mapped with top nibble 0xc. Hence the region id
check is not needed for virt_addr_valid()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:40 +10:00
Aneesh Kumar K.V
e09093927e powerpc/mm: Validate address values against different region limits
This adds an explicit check in various functions.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
0034d395f8 powerpc/mm/hash64: Map all the kernel regions in the same 0xc range
This patch maps vmalloc, IO and vmemap regions in the 0xc address range
instead of the current 0xd and 0xf range. This brings the mapping closer
to radix translation mode.

With hash 64K page size each of this region is 512TB whereas with 4K config
we are limited by the max page table range of 64TB and hence there regions
are of 16TB size.

The kernel mapping is now:

 On 4K hash

     kernel_region_map_size = 16TB
     kernel vmalloc start   = 0xc000100000000000
     kernel IO start        = 0xc000200000000000
     kernel vmemmap start   = 0xc000300000000000

64K hash, 64K radix and 4k radix:

     kernel_region_map_size = 512TB
     kernel vmalloc start   = 0xc008000000000000
     kernel IO start        = 0xc00a000000000000
     kernel vmemmap start   = 0xc00c000000000000

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
a35a3c6f60 powerpc/mm/hash64: Add a variable to track the end of IO mapping
This makes it easy to update the region mapping in the later patch

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
ef629cc5bf powerc/mm/hash: Reduce hash_mm_context size
Allocate subpage protect related variables only if we use the feature.
This helps in reducing the hash related mm context struct by around 4K

Before the patch
sizeof(struct hash_mm_context)  = 8288

After the patch
sizeof(struct hash_mm_context) = 4160

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
701101865f powerpc/mm: Reduce memory usage for mm_context_t for radix
Currently, our mm_context_t on book3s64 include all hash specific
context details like slice mask and subpage protection details. We
can skip allocating these with radix translation. This will help us to save
8K per mm_context with radix translation.

With the patch applied we have

sizeof(mm_context_t)  = 136
sizeof(struct hash_mm_context)  = 8288

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
67fda38f0d powerpc/mm: Move slb_addr_linit to early_init_mmu
Avoid #ifdef in generic code. Also enables us to do this specific to
MMU translation mode on book3s64

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Aneesh Kumar K.V
60458fba46 powerpc/mm: Add helpers for accessing hash translation related variables
We want to switch to allocating them runtime only when hash translation is
enabled. Add helpers so that both book3s and nohash can be adapted to
upcoming change easily.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:38 +10:00
Aneesh Kumar K.V
4f40b15f33 powerpc/mm: Remove PPC_MM_SLICES #ifdef for book3s64
Book3s64 always have PPC_MM_SLICES enabled. So remove the unncessary #ifdef

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:38 +10:00
Aneesh Kumar K.V
6161a37307 powerpc/mm: Fix build error with FLATMEM book3s64 config
The current value of MAX_PHYSMEM_BITS cannot work with 32 bit configs.
We used to have MAX_PHYSMEM_BITS not defined without SPARSEMEM and 32
bit configs never expected a value to be set for MAX_PHYSMEM_BITS.

Dependent code such as zsmalloc derived the right values based on other
fields. Instead of finding a value that works with different configs,
use new values only for book3s_64. For 64 bit booke, use the definition
of MAX_PHYSMEM_BITS as per commit a7df61a0e2 ("[PATCH] ppc64: Increase sparsemem defaults")
That change was done in 2005 and hopefully will work with book3e 64.

Fixes: 8bc0868998 ("powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:38 +10:00
Christophe Leroy
a68c31fc01 powerpc/32s: Implement Kernel Userspace Access Protection
This patch implements Kernel Userspace Access Protection for
book3s/32.

Due to limitations of the processor page protection capabilities,
the protection is only against writing. read protection cannot be
achieved using page protection.

The previous patch modifies the page protection so that RW user
pages are RW for Key 0 and RO for Key 1, and it sets Key 0 for
both user and kernel.

This patch changes userspace segment registers are set to Ku 0
and Ks 1. When kernel needs to write to RW pages, the associated
segment register is then changed to Ks 0 in order to allow write
access to the kernel.

In order to avoid having the read all segment registers when
locking/unlocking the access, some data is kept in the thread_struct
and saved on stack on exceptions. The field identifies both the
first unlocked segment and the first segment following the last
unlocked one. When no segment is unlocked, it contains value 0.

As the hash_page() function is not able to easily determine if a
protfault is due to a bad kernel access to userspace, protfaults
need to be handled by handle_page_fault when KUAP is set.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Drop allow_read/write_to/from_user() as they're now in kup.h,
      and adapt allow_user_access() to do nothing when to == NULL]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:47 +10:00
Christophe Leroy
f342adca3a powerpc/32s: Prepare Kernel Userspace Access Protection
This patch prepares Kernel Userspace Access Protection for
book3s/32.

Due to limitations of the processor page protection capabilities,
the protection is only against writing. read protection cannot be
achieved using page protection.

book3s/32 provides the following values for PP bits:

PP00 provides RW for Key 0 and NA for Key 1
PP01 provides RW for Key 0 and RO for Key 1
PP10 provides RW for all
PP11 provides RO for all

Today PP10 is used for RW pages and PP11 for RO pages, and user
segment register's Kp and Ks are set to 1. This patch modifies
page protection to use PP01 for RW pages and sets user segment
registers to Kp 0 and Ks 0.

This will allow to setup Userspace write access protection by
settng Ks to 1 in the following patch.

Kernel space segment registers remain unchanged.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
31ed2b13c4 powerpc/32s: Implement Kernel Userspace Execution Prevention.
To implement Kernel Userspace Execution Prevention, this patch
sets NX bit on all user segments on kernel entry and clears NX bit
on all user segments on kernel exit.

Note that powerpc 601 doesn't have the NX bit, so KUEP will not
work on it. A warning is displayed at startup.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
2679f9bd0a powerpc/8xx: Add Kernel Userspace Access Protection
This patch adds Kernel Userspace Access Protection on the 8xx.

When a page is RO or RW, it is set RO or RW for Key 0 and NA
for Key 1.

Up to now, the User group is defined with Key 0 for both User and
Supervisor.

By changing the group to Key 0 for User and Key 1 for Supervisor,
this patch prevents the Kernel from being able to access user data.

At exception entry, the kernel saves SPRN_MD_AP in the regs struct,
and reapply the protection. At exception exit it restores SPRN_MD_AP
with the value saved on exception entry.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Drop allow_read/write_to/from_user() as they're now in kup.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
06fbe81b59 powerpc/8xx: Add Kernel Userspace Execution Prevention
This patch adds Kernel Userspace Execution Prevention on the 8xx.

When a page is Executable, it is set Executable for Key 0 and NX
for Key 1.

Up to now, the User group is defined with Key 0 for both User and
Supervisor.

By changing the group to Key 0 for User and Key 1 for Supervisor,
this patch prevents the Kernel from being able to execute user code.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
c341a108a5 powerpc/8xx: Only define APG0 and APG1
Since the 8xx implements hardware page table walk assistance,
the PGD entries always point to a 4k aligned page, so the 2 upper
bits of the APG are not clobbered anymore and remain 0. Therefore
only APG0 and APG1 are used and need a definition. We set the
other APG to the lowest permission level.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
e2fb9f5444 powerpc/32: Prepare for Kernel Userspace Access Protection
This patch adds ASM macros for saving, restoring and checking
the KUAP state, and modifies setup_32 to call them on exceptions
from kernel.

The macros are defined as empty by default for when CONFIG_PPC_KUAP
is not selected and/or for platforms which don't handle (yet) KUAP.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Christophe Leroy
e291b6d575 powerpc/32: Remove MSR_PR test when returning from syscall
syscalls are from user only, so we can account time without checking
whether returning to kernel or user as it will only be user.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:11:46 +10:00
Michael Ellerman
5e5be3aed2 powerpc/mm: Detect bad KUAP faults
When KUAP is enabled we have logic to detect page faults that occur
outside of a valid user access region and are blocked by the AMR.

What we don't have at the moment is logic to detect a fault *within* a
valid user access region, that has been incorrectly blocked by AMR.
This is not meant to ever happen, but it can if we incorrectly
save/restore the AMR, or if the AMR was overwritten for some other
reason.

Currently if that happens we assume it's just a regular fault that
will be corrected by handling the fault normally, so we just return.
But there is nothing the fault handling code can do to fix it, so the
fault just happens again and we spin forever, leading to soft lockups.

So add some logic to detect that case and WARN() if we ever see it.
Arguably it should be a BUG(), but it's more polite to fail the access
and let the kernel continue, rather than taking down the box. There
should be no data integrity issue with failing the fault rather than
BUG'ing, as we're just going to disallow an access that should have
been allowed.

To make the code a little easier to follow, unroll the condition at
the end of bad_kernel_fault() and comment each case, before adding the
call to bad_kuap_fault().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:06:04 +10:00
Michael Ellerman
890274c2dc powerpc/64s: Implement KUAP for Radix MMU
Kernel Userspace Access Prevention utilises a feature of the Radix MMU
which disallows read and write access to userspace addresses. By
utilising this, the kernel is prevented from accessing user data from
outside of trusted paths that perform proper safety checks, such as
copy_{to/from}_user() and friends.

Userspace access is disabled from early boot and is only enabled when
performing an operation like copy_{to/from}_user(). The register that
controls this (AMR) does not prevent userspace from accessing itself,
so there is no need to save and restore when entering and exiting
userspace.

When entering the kernel from the kernel we save AMR and if it is not
blocking user access (because eg. we faulted doing a user access) we
reblock user access for the duration of the exception (ie. the page
fault) and then restore the AMR when returning back to the kernel.

This feature can be tested by using the lkdtm driver (CONFIG_LKDTM=y)
and performing the following:

  # (echo ACCESS_USERSPACE) > [debugfs]/provoke-crash/DIRECT

If enabled, this should send SIGSEGV to the thread.

We also add paranoid checking of AMR in switch and syscall return
under CONFIG_PPC_KUAP_DEBUG.

Co-authored-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:06:02 +10:00
Russell Currey
ef296729b7 powerpc/lib: Refactor __patch_instruction() to use __put_user_asm()
__patch_instruction() is called in early boot, and uses
__put_user_size(), which includes the allow/prevent calls to enforce
KUAP, which could either be called too early, or in the Radix case,
forced to use "early_" versions of functions just to safely handle
this one case.

__put_user_asm() does not do this, and thus is safe to use both in
early boot, and later on since in this case it should only ever be
touching kernel memory.

__patch_instruction() was previously refactored to use
__put_user_size() in order to be able to return -EFAULT, which would
allow the kernel to patch instructions in userspace, which should
never happen. This has the functional change of causing faults on
userspace addresses if KUAP is turned on, which should never happen in
practice.

A future enhancement could be to double check the patch address is
definitely allowed to be tampered with by the kernel.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:06:02 +10:00
Russell Currey
1bb2bae2e6 powerpc/mm/radix: Use KUEP API for Radix MMU
Execution protection already exists on radix, this just refactors
the radix init to provide the KUEP setup function instead.

Thus, the only functional change is that it can now be disabled.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:06:01 +10:00
Russell Currey
b28c97505e powerpc/64: Setup KUP on secondary CPUs
Some platforms (i.e. Radix MMU) need per-CPU initialisation for KUP.

Any platforms that only want to do KUP initialisation once
globally can just check to see if they're running on the boot CPU, or
check if whatever setup they need has already been performed.

Note that this is only for 64-bit.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:59 +10:00
Christophe Leroy
de78a9c42a powerpc: Add a framework for Kernel Userspace Access Protection
This patch implements a framework for Kernel Userspace Access
Protection.

Then subarches will have the possibility to provide their own
implementation by providing setup_kuap() and
allow/prevent_user_access().

Some platforms will need to know the area accessed and whether it is
accessed from read, write or both. Therefore source, destination and
size and handed over to the two functions.

mpe: Rename to allow/prevent rather than unlock/lock, and add
read/write wrappers. Drop the 32-bit code for now until we have an
implementation for it. Add kuap to pt_regs for 64-bit as well as
32-bit. Don't split strings, use pr_crit_ratelimited().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:57 +10:00
Christophe Leroy
0fb1c25ab5 powerpc: Add skeleton for Kernel Userspace Execution Prevention
This patch adds a skeleton for Kernel Userspace Execution Prevention.

Then subarches implementing it have to define CONFIG_PPC_HAVE_KUEP
and provide setup_kuep() function.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Don't split strings, use pr_crit_ratelimited()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:56 +10:00
Christophe Leroy
69795cabe4 powerpc: Add framework for Kernel Userspace Protection
This patch adds a skeleton for Kernel Userspace Protection
functionnalities like Kernel Userspace Access Protection and Kernel
Userspace Execution Prevention

The subsequent implementation of KUAP for radix makes use of a MMU
feature in order to patch out assembly when KUAP is disabled or
unsupported. This won't work unless there's an entry point for KUP
support before the feature magic happens, so for PPC64 setup_kup() is
called early in setup.

On PPC32, feature_fixup() is done too early to allow the same.

Suggested-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:54 +10:00
Michael Ellerman
53a712bae5 powerpc/powernv/idle: Restore AMR/UAMOR/AMOR after idle
In order to implement KUAP (Kernel Userspace Access Protection) on
Power9 we will be using the AMR, and therefore indirectly the
UAMOR/AMOR.

So save/restore these regs in the idle code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:54 +10:00
Russell Currey
a3f3072db6 powerpc/powernv/idle: Restore IAMR after idle
Without restoring the IAMR after idle, execution prevention on POWER9
with Radix MMU is overwritten and the kernel can freely execute
userspace without faulting.

This is necessary when returning from any stop state that modifies
user state, as well as hypervisor state.

To test how this fails without this patch, load the lkdtm driver and
do the following:

  $ echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT

which won't fault, then boot the kernel with powersave=off, where it
will fault. Applying this patch will fix this.

Fixes: 3b10d0095a ("powerpc/mm/radix: Prevent kernel execution of user space")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:05:52 +10:00
Michael Neuling
c1fe190c06 powerpc: Add force enable of DAWR on P9 option
This adds a flag so that the DAWR can be enabled on P9 via:
  echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous

The DAWR was previously force disabled on POWER9 in:
  9654153158 powerpc: Disable DAWR in the base POWER9 CPU features
Also see Documentation/powerpc/DAWR-POWER9.txt

This is a dangerous setting, USE AT YOUR OWN RISK.

Some users may not care about a bad user crashing their box
(ie. single user/desktop systems) and really want the DAWR.  This
allows them to force enable DAWR.

This flag can also be used to disable DAWR access. Once this is
cleared, all DAWR access should be cleared immediately and your
machine once again safe from crashing.

Userspace may get confused by toggling this. If DAWR is force
enabled/disabled between getting the number of breakpoints (via
PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
inconsistent view of what's available. Similarly for guests.

For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
enabled in the host AND the guest. For this reason, this won't work on
POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
dawr_enable_dangerous file will fail if the hypervisor doesn't support
writing the DAWR.

To double check the DAWR is working, run this kernel selftest:
  tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
Any errors/failures/skips mean something is wrong.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:20:45 +10:00
Nathan Lynch
558f86493d powerpc/numa: document topology_updates_enabled, disable by default
Changing the NUMA associations for CPUs and memory at runtime is
basically unsupported by the core mm, scheduler etc. We see all manner
of crashes, warnings and instability when the pseries code tries to do
this. Disable this behavior by default, and document the switch a bit.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:03:59 +10:00
Nathan Lynch
2d4d9b308f powerpc/numa: improve control of topology updates
When booted with "topology_updates=no", or when "off" is written to
/proc/powerpc/topology_updates, NUMA reassignments are inhibited for
PRRN and VPHN events. However, migration and suspend unconditionally
re-enable reassignments via start_topology_update(). This is
incoherent.

Check the topology_updates_enabled flag in
start/stop_topology_update() so that callers of those APIs need not be
aware of whether reassignments are enabled. This allows the
administrative decision on reassignments to remain in force across
migrations and suspensions.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:03:57 +10:00
Andrew Donnellan
2f9196b672 powerpc/powernv: Squash sparse warnings in opal-call.c
sparse complains a lot about opal-call.c:

  arch/powerpc/platforms/powernv/opal-call.c:128:1: warning: symbol 'opal_invalid_call' was not declared. Should it be static?
  arch/powerpc/platforms/powernv/opal-call.c:129:1: warning: symbol 'opal_console_write' was not declared. Should it be static?
  arch/powerpc/platforms/powernv/opal-call.c:130:1: warning: symbol 'opal_console_read' was not declared. Should it be static?

Those symbols are forward declared in opal.h, but we can't include that
because the function signatures in opal.h are different. So instead, just
add an extra forward declaration to the OPAL_CALL macro to shut sparse up.

Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:03:19 +10:00
George Spelvin
80d04b7fab powerpc/crypto: Use cheaper random numbers for crc-vpmsum self-test
This code was filling a 64K buffer from /dev/urandom in order to
compute a CRC over (on average half of) it by two different methods,
comparing the CRCs, and repeating.

This is not a remotely security-critical application, so use the far
faster and cheaper prandom_u32() generator.

And, while we're at it, only fill as much of the buffer as we plan to use.

Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:03:11 +10:00
Jagadeesh Pagadala
6917735e8f powerpc: Remove duplicate headers
Remove duplicate headers inclusions.

Signed-off-by: Jagadeesh Pagadala <jagdsh.linux@gmail.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:44 +10:00
Wen Yang
cc76404fea powerpc/8xx: Fix possible device node reference leak
The call to of_find_compatible_node() returns a node pointer with
refcount incremented thus it must be explicitly decremented after the
last usage.

irq_domain_add_linear() also calls of_node_get() to increase refcount,
so irq_domain() will not be affected when it is released.

Detected by coccinelle.

Fixes: a8db8cf0d8 ("irq_domain: Replace irq_alloc_host() with revmap-specific initializers")
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Peng Hao <peng.hao2@zte.com.cn>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:35 +10:00
Ganesh Goudar
7f177f9810 powerpc/pseries: hwpoison the pages upon hitting UE
Add support to hwpoison the pages upon hitting machine check
exception.

This patch queues the address where UE is hit to percpu array
and schedules work to plumb it into memory poison infrastructure.

Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
[mpe: Combine #ifdefs, drop PPC_BIT8(), and empty inline stub]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:35 +10:00
Julia Lawall
4df2cb633b powerpc/83xx: Add missing of_node_put() after of_device_is_available()
Add an of_node_put() when a tested device node is not available.

Fixes: c026c98739 ("powerpc/83xx: Do not configure or probe disabled FSL DR USB controllers")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:27 +10:00
Qian Cai
e663e1e060 powerpc/pseries/pmem: Fix a set but not used value
The commit 4c5d87db49 ("powerpc/pseries: PAPR persistent memory
support") set a local variable "count" in dlpar_hp_pmem() but never
use it.

  arch/powerpc/platforms/pseries/pmem.c: In function 'dlpar_hp_pmem':
  arch/powerpc/platforms/pseries/pmem.c:109:6: warning: variable 'count' set but not used

Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:26 +10:00
Qian Cai
c05f57fdc3 powerpc/pseries/iommu: Fix set but not used values
The commit b7d6bf4fdd ("powerpc/pseries/pci: Remove obsolete SW
invalidate") left 2 variables unused.

  arch/powerpc/platforms/pseries/iommu.c:108:17: warning: variable 'tces' set but not used
    __be64 *tcep, *tces;
                   ^~~~

  arch/powerpc/platforms/pseries/iommu.c:132:17: warning: variable 'tces' set but not used
    __be64 *tcep, *tces;
                   ^~~~

Also, the commit 68c0449ea1 ("powerpc/pseries/iommu: Use memory@
nodes in max RAM address calculation") set "ranges" in
ddw_memory_hotplug_max() but never use it.

  arch/powerpc/platforms/pseries/iommu.c: In function 'ddw_memory_hotplug_max':
  arch/powerpc/platforms/pseries/iommu.c:948:7: warning: variable 'ranges' set but not used
     int ranges, n_mem_addr_cells, n_mem_size_cells, len;
         ^~~~~~

Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:26 +10:00
Qian Cai
bff25143da powerpc/mm: Silence unused-but-set-variable warnings
pte_unmap() compiles away on some powerpc platforms, so silence the
warnings below by making it a static inline function.

  mm/memory.c: In function 'copy_pte_range':
  mm/memory.c:820:24: warning: variable 'orig_dst_pte' set but not used
  mm/memory.c:820:9: warning: variable 'orig_src_pte' set but not used
  mm/madvise.c: In function 'madvise_free_pte_range':
  mm/madvise.c:318:9: warning: variable 'orig_pte' set but not used
  mm/swap_state.c: In function 'swap_ra_info':
  mm/swap_state.c:634:15: warning: variable 'orig_pte' set but not used

Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:26 +10:00
Laurent Vivier
f172acbfae powerpc/mm: move warning from resize_hpt_for_hotplug()
resize_hpt_for_hotplug() reports a warning when it cannot
resize the hash page table ("Unable to resize hash page
table to target order") but in some cases it's not a problem
and can make user thinks something has not worked properly.

This patch moves the warning to arch_remove_memory() to
only report the problem when it is needed.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:26 +10:00
Aneesh Kumar K.V
f89bd8ba83 powerpc/mm/radix: Don't do SLB preload when using the radix MMU
Add radix_enabled() check to avoid SLB preload with radix translation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:26 +10:00
Thomas Huth
24c174bb23 powerpc/configs: Enable CONFIG_USB_XHCI_HCD by default
Recent versions of QEMU provide a XHCI device by default these
days instead of an old-fashioned OHCI device:

 https://git.qemu.org/?p=qemu.git;a=commitdiff;h=57040d451315320b7d27

So to get the keyboard working in the graphical console there again,
we should now include XHCI support in the kernel by default, too.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:26 +10:00
Mahesh Salgaonkar
c9d8dda423 powerpc/pseries/mce: Improve array initialization.
This is a follow up to the patch that fixed misleading print for TLB
mutlihit due to wrongly populated mc_err_types[] array. Convert all the
static array initialization to '[x] = val' style for better
readability of array indexing and avoid any further confusion.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:26 +10:00
Russell Currey
56c46bba9b powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWX
With STRICT_KERNEL_RWX enabled anything marked __init is placed at a 16M
boundary.  This is necessary so that it can be repurposed later with
different permissions.  However, in kernels with text larger than 16M,
this pushes early_setup past 32M, incapable of being reached by the
branch instruction.

Fix this by setting the CTR and branching there instead.

Fixes: 1e0fc9d1eb ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
Signed-off-by: Russell Currey <ruscur@russell.cc>
[mpe: Fix it to work on BE by using DOTSYM()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:12 +10:00
Mathieu Malaterre
62611c1e24 powerpc/embedded6xx: Remove unused functions holly_power_off and holly_halt
Silence the following warnings triggered using W=1:

  arch/powerpc/platforms/embedded6xx/holly.c:236:6: error: no previous prototype for 'holly_power_off'
  arch/powerpc/platforms/embedded6xx/holly.c:243:6: error: no previous prototype for 'holly_halt'

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:12 +10:00
Mathieu Malaterre
308be6c781 powerpc/embedded6xx: Make some functions static
In commit cb9e4d10c4 ("[POWERPC] Add support for 750CL Holly board")
new functions were added. Since most of these functions can be made
static, make it so.

Both holly_power_off and holly_halt functions were not changed since
they are unused, making them static would have triggered the following
warning (treated as error):

  arch/powerpc/platforms/embedded6xx/holly.c:244:13: error: 'holly_halt' defined but not used

Silence the following warnings triggered using W=1:

  arch/powerpc/platforms/embedded6xx/holly.c:47:5: error: no previous prototype for 'holly_exclude_device'
  arch/powerpc/platforms/embedded6xx/holly.c:190:6: error: no previous prototype for 'holly_show_cpuinfo'
  arch/powerpc/platforms/embedded6xx/holly.c:196:17: error: no previous prototype for 'holly_restart'

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:12 +10:00
Ben Hutchings
ff6d27823f powerpc: vdso: Make vdso32 installation conditional in vdso_install
The 32-bit vDSO is not needed and not normally built for 64-bit
little-endian configurations.  However, the vdso_install target still
builds and installs it.  Add the same config condition as is normally
used for the build.

Fixes: e0d0059169 ("powerpc/vdso: Disable building the 32-bit VDSO ...")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:12 +10:00
Michael Ellerman
eea86aa417 powerpc/mm/64: Document the sizes of/sizes mapped by Pxx_INDEX_SIZE
Add comments describing the size in bytes of the various levels of the
page table tree, and the size of the virtual address space mapped by
each level, to make it clear what the sizes are without having to also
look up other definitions.

The code that calculates the sizes actually uses sizeof(pgd_t) etc.,
so in theory these comments could skew vs the code, but the size of
pgd_t etc. is unlikely to change very often.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:11 +10:00
Christophe Leroy
6c84f8c5cb powerpc/highmem: Change BUG_ON() to WARN_ON()
In arch/powerpc/mm/highmem.c, BUG_ON() is called only when
CONFIG_DEBUG_HIGHMEM is selected, this means the BUG_ON() is not vital
and can be replaced by a a WARN_ON().

At the same time, use IS_ENABLED() instead of #ifdef to clean a bit.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:11 +10:00
Michael Ellerman
af5cd05de5 powerpc: Fix defconfig choice logic when cross compiling
Our logic for choosing defconfig doesn't work well in some situations.

For example if you're on a ppc64le machine but you specify a non-empty
CROSS_COMPILE, in order to use a non-default toolchain, then defconfig
will give you ppc64_defconfig (big endian):

  $ make CROSS_COMPILE=~/toolchains/gcc-8/bin/powerpc-linux- defconfig
  *** Default configuration is based on 'ppc64_defconfig'

This is because we assume that CROSS_COMPILE being set means we
can't be on a ppc machine and rather than checking we just default to
ppc64_defconfig.

We should just ignore CROSS_COMPILE, instead check the machine with
uname and if it's one of ppc, ppc64 or ppc64le then use that
defconfig. If it's none of those then we fall back to ppc64_defconfig.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:11 +10:00
Michael Ellerman
a273fa386a powerpc/32: Add ppc_defconfig
Add a generic 32-bit defconfig called ppc_defconfig. This means we'll
have a defconfig matching "uname -m" for all cases.

This config is mostly intended for build testing but if someone wants
to tweak it to get it booting on something that would be fine too.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:02:11 +10:00
Michael Ellerman
d3e76a1acd Merge branch 'fixes' into next
Merge our fixes branch. In particular the radix segment exception
handling fix is necessary to avoid odd crashes.
2019-04-20 21:59:57 +10:00
David Howells
5dd50aaeb1
Make anon_inodes unconditional
Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-04-19 14:03:11 +02:00
Paolo Bonzini
78671ab4c9 KVM/PPC fixes for 5.1
- Fix host hang in the HTM assist code for POWER9
 - Take srcu read lock around memslot lookup
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJcpup1AAoJEJ2a6ncsY3Gf9cUIAOdI0vtysd+jq2De+cxUe/8i
 qUm7ybYQQQQ+aOwapnw+BIvy3JbE2KFZ2AZQGldCBHYK/m1KrX7Gdt21kAcSouHL
 74xTCqOooG1XCJlek2wj2JCKkWbrm9K1i6WfYKNN4lnzgc4qaJxPQL0ikiZPfCkR
 7R+ulWmFDZ6ea5gjzGP93zzbq8RKaUnpwGcWVMvIOX2xwonAmZG1fxAu3d5NwUoy
 kXOXGdvmExsUsGWnwxJemqCdVj48pX34d+/263Y561/Bj7LotErdZQC+cTn/R8lL
 TSfmKfulMb1zAJCqcuQy4x6rY0q579yPbTF/PKSl+hOLG1t9y86fERsAVxAPiTo=
 =NzJU
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD

KVM/PPC fixes for 5.1

- Fix host hang in the HTM assist code for POWER9
- Take srcu read lock around memslot lookup
2019-04-18 18:53:12 +02:00
Eric Biggers
626ddb2fbe crypto: powerpc - convert to use crypto_simd_usable()
Replace all calls to in_interrupt() in the PowerPC crypto code with
!crypto_simd_usable().  This causes the crypto self-tests to test the
no-SIMD code paths when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y.

The p8_ghash algorithm is currently failing and needs to be fixed, as it
produces the wrong digest when no-SIMD updates are mixed with SIMD ones.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-18 22:15:04 +08:00
Miquel Raynal
72c5af0027 mtd: rawnand: Clarify Kconfig entry MTD_NAND
MTD_NAND is large and encloses much more than what the symbol is
actually used for: raw NAND. Clarify the symbol by naming it
MTD_RAW_NAND instead.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2019-04-18 08:54:00 +02:00
Miquel Raynal
9bb94643b9 mtd: nand: Clarify Kconfig entry for software Hamming ECC entries
The software Hamming ECC correction implementation is referred as
MTD_NAND_ECC which is too generic. Rename it
MTD_NAND_ECC_SW_HAMMING. Also rename MTD_NAND_ECC_SMC which is an
SMC quirk in the Hamming implementation as
MTD_NAND_ECC_SW_HAMMING_SMC.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2019-04-18 08:54:00 +02:00
Josh Poimboeuf
782e69efb3 powerpc/speculation: Support 'mitigations=' cmdline option
Configure powerpc CPU runtime speculation bug mitigations in accordance
with the 'mitigations=' cmdline option.  This affects Meltdown, Spectre
v1, Spectre v2, and Speculative Store Bypass.

The default behavior is unchanged.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-arch@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/245a606e1a42a558a310220312d9b6adb9159df6.1555085500.git.jpoimboe@redhat.com
2019-04-17 21:37:29 +02:00
Alexey Kardashevskiy
7a3a4d7638 powerpc/mm_iommu: Allow pinning large regions
When called with vmas_arg==NULL, get_user_pages_longterm() allocates
an array of nr_pages*8 which can easily get greater that the max order,
for example, registering memory for a 256GB guest does this and fails
in __alloc_pages_nodemask().

This adds a loop over chunks of entries to fit the max order limit.

Fixes: 678e174c4c ("powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc", 2019-03-05)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-17 21:36:51 +10:00
Alexey Kardashevskiy
eb9d7a62c3 powerpc/mm_iommu: Fix potential deadlock
Currently mm_iommu_do_alloc() is called in 2 cases:
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY ioctl() for normal memory:
	this locks &mem_list_mutex and then locks mm::mmap_sem
	several times when adjusting locked_vm or pinning pages;
- vfio_pci_nvgpu_regops::mmap() for GPU memory:
	this is called with mm::mmap_sem held already and it locks
	&mem_list_mutex.

So one can craft a userspace program to do special ioctl and mmap in
2 threads concurrently and cause a deadlock which lockdep warns about
(below).

We did not hit this yet because QEMU constructs the machine in a single
thread.

This moves the overlap check next to where the new entry is added and
reduces the amount of time spent with &mem_list_mutex held.

This moves locked_vm adjustment from under &mem_list_mutex.

This relies on mm_iommu_adjust_locked_vm() doing nothing when entries==0.

This is one of the lockdep warnings:

======================================================
WARNING: possible circular locking dependency detected
5.1.0-rc2-le_nv2_aikATfstn1-p1 #363 Not tainted
------------------------------------------------------
qemu-system-ppc/8038 is trying to acquire lock:
000000002ec6c453 (mem_list_mutex){+.+.}, at: mm_iommu_do_alloc+0x70/0x490

but task is already holding lock:
00000000fd7da97f (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xf0/0x160

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++}:
       lock_acquire+0xf8/0x260
       down_write+0x44/0xa0
       mm_iommu_adjust_locked_vm.part.1+0x4c/0x190
       mm_iommu_do_alloc+0x310/0x490
       tce_iommu_ioctl.part.9+0xb84/0x1150 [vfio_iommu_spapr_tce]
       vfio_fops_unl_ioctl+0x94/0x430 [vfio]
       do_vfs_ioctl+0xe4/0x930
       ksys_ioctl+0xc4/0x110
       sys_ioctl+0x28/0x80
       system_call+0x5c/0x70

-> #0 (mem_list_mutex){+.+.}:
       __lock_acquire+0x1484/0x1900
       lock_acquire+0xf8/0x260
       __mutex_lock+0x88/0xa70
       mm_iommu_do_alloc+0x70/0x490
       vfio_pci_nvgpu_mmap+0xc0/0x130 [vfio_pci]
       vfio_pci_mmap+0x198/0x2a0 [vfio_pci]
       vfio_device_fops_mmap+0x44/0x70 [vfio]
       mmap_region+0x5d4/0x770
       do_mmap+0x42c/0x650
       vm_mmap_pgoff+0x124/0x160
       ksys_mmap_pgoff+0xdc/0x2f0
       sys_mmap+0x40/0x80
       system_call+0x5c/0x70

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(mem_list_mutex);
                               lock(&mm->mmap_sem);
  lock(mem_list_mutex);

 *** DEADLOCK ***

1 lock held by qemu-system-ppc/8038:
 #0: 00000000fd7da97f (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xf0/0x160

Fixes: c10c21efa4 ("powerpc/vfio/iommu/kvm: Do not pin device memory", 2018-12-19)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-17 21:36:50 +10:00
Michael Ellerman
8adddf349f powerpc/mm/radix: Make Radix require HUGETLB_PAGE
Joel reported weird crashes using skiroot_defconfig, in his case we
jumped into an NX page:

  kernel tried to execute exec-protected page (c000000002bff4f0) - exploit attempt? (uid: 0)
  BUG: Unable to handle kernel instruction fetch
  Faulting instruction address: 0xc000000002bff4f0

Looking at the disassembly, we had simply branched to that address:

  c000000000c001bc  49fff335    bl     c000000002bff4f0

But that didn't match the original kernel image:

  c000000000c001bc  4bfff335    bl     c000000000bff4f0 <kobject_get+0x8>

When STRICT_KERNEL_RWX is enabled, and we're using the radix MMU, we
call radix__change_memory_range() late in boot to change page
protections. We do that both to mark rodata read only and also to mark
init text no-execute. That involves walking the kernel page tables,
and clearing _PAGE_WRITE or _PAGE_EXEC respectively.

With radix we may use hugepages for the linear mapping, so the code in
radix__change_memory_range() uses eg. pmd_huge() to test if it has
found a huge mapping, and if so it stops the page table walk and
changes the PMD permissions.

However if the kernel is built without HUGETLBFS support, pmd_huge()
is just a #define that always returns 0. That causes the code in
radix__change_memory_range() to incorrectly interpret the PMD value as
a pointer to a PTE page rather than as a PTE at the PMD level.

We can see this using `dv` in xmon which also uses pmd_huge():

  0:mon> dv c000000000000000
  pgd  @ 0xc000000001740000
  pgdp @ 0xc000000001740000 = 0x80000000ffffb009
  pudp @ 0xc0000000ffffb000 = 0x80000000ffffa009
  pmdp @ 0xc0000000ffffa000 = 0xc00000000000018f   <- this is a PTE
  ptep @ 0xc000000000000100 = 0xa64bb17da64ab07d   <- kernel text

The end result is we treat the value at 0xc000000000000100 as a PTE
and clear _PAGE_WRITE or _PAGE_EXEC, potentially corrupting the code
at that address.

In Joel's specific case we cleared the sign bit in the offset of the
branch, causing a backward branch to turn into a forward branch which
caused us to branch into a non-executable page. However the exact
nature of the crash depends on kernel version, compiler version, and
other factors.

We need to fix radix__change_memory_range() to not use accessors that
depend on HUGETLBFS, but we also have radix memory hotplug code that
uses pmd_huge() etc that will also need fixing. So for now just
disallow the broken combination of Radix with HUGETLBFS disabled.

The only defconfig we have that is affected is skiroot_defconfig, so
turn on HUGETLBFS there so that it still gets Radix.

Fixes: 566ca99af0 ("powerpc/mm/radix: Add dummy radix_enabled()")
Cc: stable@vger.kernel.org # v4.7+
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-17 18:50:26 +10:00
Paolo Bonzini
c110ae578c kvm: move KVM_CAP_NR_MEMSLOTS to common code
All architectures except MIPS were defining it in the same way,
and memory slots are handled entirely by common code so there
is no point in keeping the definition per-architecture.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:08 +02:00
Arnd Bergmann
39036cd272 arch: add pidfd and io_uring syscalls everywhere
Add the io_uring and pidfd_send_signal system calls to all architectures.

These system calls are designed to handle both native and compat tasks,
so all entries are the same across architectures, only arm-compat and
the generic tale still use an old format.

Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> (s390)
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-04-15 16:31:17 +02:00
Linus Torvalds
cf60528f8a powerpc fixes for 5.1 #5
A minor build fix for 64-bit FLATMEM configs.
 
 A fix for a boot failure on 32-bit powermacs.
 
 My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on 64-bit
 kernels, ie. compat mode, which is only used on big endian.
 
 The rewrite of the SLB code we merged in 4.20 missed the fact that the 0x380
 exception is also used with the Radix MMU to report out of range accesses. This
 could lead to an oops if userspace tried to read from addresses outside the user
 or kernel range.
 
 Thanks to:
   Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas Piggin.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcsVzhAAoJEFHr6jzI4aWAJuAP/2oLukNIIiF2UW/18xIXfvxR
 ZA9JljVqcKUHEUR4W+Y673xL4ZKtGGF79P+bzSvh8fUTMJ9cIN9mLO7eGGoDNqTn
 XhZX/jxJOh34tbHPYYbi9kYqWpZQKN4WuCjMQSPBCHOHMdx/0yn0wKgriOW1cuzG
 AQqDRHcRX4h1QT9o/hnsCAsdcnLEntdBBCTTHL1dZ8BucuUopjL+7cV0wf4qFIui
 e9SXOEl7yV03JGurmWcipE4mj9SrUioZJyHg6rJs70tlCUHFM24LQEFNIM4WczuF
 GoPfzXi5nNPrOzC3aF/v77hT5t4zD2sPRV2DuKABGsS+gfPoK8sIZC3mo7Vk5y+j
 gsbmkQSZt8/wVhRuAA0m0N6Aqg1J8NjhxoDfyM8kj0FzPe75D662VIgGSx15oMkl
 3olt/9uDyPetxuZ7tmmnFC8wkcmyaGpVurVz9xnqpt6c2r0KI+16R6Mk4OiT/e2p
 KNVBFkqRTp23ETpI8J9HUk9OtFIHqE9Zwzk2YOrX5yuLHByEwMq1T4qn2RuQsJqx
 RWPJagSalGLmM6dqDGe08gQl9rovkYKleGxNIAJuJB9rIxZQke86d2+S0eSUQbAW
 WWhP8SU0LJ5gmhEeZi5MntcuG+gcENwkz2UBK5nVDBVLFxGuBTPQATavW+w1bSi+
 SSEMXx8dNAOvsrqrZ97I
 =pfZc
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "A minor build fix for 64-bit FLATMEM configs.

  A fix for a boot failure on 32-bit powermacs.

  My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on
  64-bit kernels, ie. compat mode, which is only used on big endian.

  The rewrite of the SLB code we merged in 4.20 missed the fact that the
  0x380 exception is also used with the Radix MMU to report out of range
  accesses. This could lead to an oops if userspace tried to read from
  addresses outside the user or kernel range.

  Thanks to: Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas
  Piggin"

* tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs
  powerpc/64s/radix: Fix radix segment exception handling
  powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64
  powerpc/32: Fix early boot failure with RTAS built-in
2019-04-13 09:03:09 -07:00
Cédric Le Goater
88ec6b93c8 powerpc/xive: add OPAL extensions for the XIVE native exploitation support
The support for XIVE native exploitation mode in Linux/KVM needs a
couple more OPAL calls to get and set the state of the XIVE internal
structures being used by a sPAPR guest.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-11 15:31:41 +10:00
Ingo Molnar
54bbfe75cb Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-10 09:14:42 +02:00
Michael Ellerman
cf7cf6977f powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs
The recent commit 8bc0868998 ("powerpc/mm: Only define
MAX_PHYSMEM_BITS in SPARSEMEM configurations") removed our definition
of MAX_PHYSMEM_BITS when SPARSEMEM is disabled.

This inadvertently broke some 64-bit FLATMEM using configs with eg:

  arch/powerpc/include/asm/book3s/64/mmu-hash.h:584:6: error: "MAX_PHYSMEM_BITS" is not defined, evaluates to 0
   #if (MAX_PHYSMEM_BITS > MAX_EA_BITS_PER_CONTEXT)
        ^~~~~~~~~~~~~~~~

Fix it by making sure we define MAX_PHYSMEM_BITS for all 64-bit
configs regardless of SPARSEMEM.

Fixes: 8bc0868998 ("powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations")
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-10 14:45:57 +10:00
Nicholas Piggin
7100e8704b powerpc/64s/radix: Fix radix segment exception handling
Commit 48e7b76957 ("powerpc/64s/hash: Convert SLB miss handlers to C")
broke the radix-mode segment exception handler. In radix mode, this is
exception is not an SLB miss, rather it signals that the EA is outside
the range translated by any page table.

The commit lost the radix feature alternate code patch, which can
cause faults to some EAs to kernel BUG at arch/powerpc/mm/slb.c:639!

The original radix code would send faults to slb_miss_large_addr,
which would end up faulting due to slb_addr_limit being 0. This patch
sends radix directly to do_bad_slb_fault, which is a bit clearer.

Fixes: 48e7b76957 ("powerpc/64s/hash: Convert SLB miss handlers to C")
Cc: stable@vger.kernel.org # v4.20+
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-08 21:46:11 +10:00
Will Deacon
01e3b958ef arch: Remove dummy mmiowb() definitions from arch code
Now that no driver code is using mmiowb() directly, remove the dummy
definitions remaining in architectures that don't make use of
asm-generic/io.h, as well as the definition in asm-generic/io.h itself.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-08 12:09:27 +01:00
Will Deacon
420af15547 powerpc/mmiowb: Hook up mmwiob() implementation to asm-generic code
In a bid to kill off explicit mmiowb() usage in driver code, hook up
the asm-generic mmiowb() tracking code but provide a definition of
arch_mmiowb_state() so that the tracking data can remain in the paca
as it does at present

This replaces the existing (flawed) implementation.

Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-08 12:00:35 +01:00
Will Deacon
fdcd06a8ab arch: Use asm-generic header for asm/mmiowb.h
Hook up asm-generic/mmiowb.h to Kbuild for all architectures so that we
can subsequently include asm/mmiowb.h from core code.

Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-08 11:59:43 +01:00
Christophe Leroy
dd9a994fc6 powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64
Commit b5b4453e79 ("powerpc/vdso64: Fix CLOCK_MONOTONIC
inconsistencies across Y2038") changed the type of wtom_clock_sec
to s64 on PPC64. Therefore, VDSO32 needs to read it with a 4 bytes
shift in order to retrieve the lower part of it.

Fixes: b5b4453e79 ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-08 06:57:19 +10:00
Linus Torvalds
f654f0fc0b Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kernel/sysctl.c: fix out-of-bounds access when setting file-max
  mm/util.c: fix strndup_user() comment
  sh: fix multiple function definition build errors
  MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM
  MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM
  mm: writeback: use exact memcg dirty counts
  psi: clarify the units used in pressure files
  mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
  hugetlbfs: fix memory leak for resv_map
  mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX()
  lib/lzo: fix bugs for very short or empty input
  include/linux/bitrev.h: fix constant bitrev
  kmemleak: powerpc: skip scanning holes in the .bss section
  lib/string.c: implement a basic bcmp
2019-04-05 17:08:55 -10:00
Catalin Marinas
298a32b132 kmemleak: powerpc: skip scanning holes in the .bss section
Commit 2d4f567103 ("KVM: PPC: Introduce kvm_tmp framework") adds
kvm_tmp[] into the .bss section and then free the rest of unused spaces
back to the page allocator.

kernel_init
  kvm_guest_init
    kvm_free_tmp
      free_reserved_area
        free_unref_page
          free_unref_page_prepare

With DEBUG_PAGEALLOC=y, it will unmap those pages from kernel.  As the
result, kmemleak scan will trigger a panic when it scans the .bss
section with unmapped pages.

This patch creates dedicated kmemleak objects for the .data, .bss and
potentially .data..ro_after_init sections to allow partial freeing via
the kmemleak_free_part() in the powerpc kvm_free_tmp() function.

Link: http://lkml.kernel.org/r/20190321171917.62049-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Qian Cai <cai@lca.pw>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05 16:02:30 -10:00
Steven Rostedt (VMware)
32d9258662 syscalls: Remove start and number from syscall_set_arguments() args
After removing the start and count arguments of syscall_get_arguments() it
seems reasonable to remove them from syscall_set_arguments(). Note, as of
today, there are no users of syscall_set_arguments(). But we are told that
there will be soon. But for now, at least make it consistent with
syscall_get_arguments().

Link: http://lkml.kernel.org/r/20190327222014.GA32540@altlinux.org

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Dave Martin <dave.martin@arm.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-05 09:27:23 -04:00
Steven Rostedt (Red Hat)
b35f549df1 syscalls: Remove start and number from syscall_get_arguments() args
At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
function call syscall_get_arguments() implemented in x86 was horribly
written and not optimized for the standard case of passing in 0 and 6 for
the starting index and the number of system calls to get. When looking at
all the users of this function, I discovered that all instances pass in only
0 and 6 for these arguments. Instead of having this function handle
different cases that are never used, simply rewrite it to return the first 6
arguments of a system call.

This should help out the performance of tracing system calls by ptrace,
ftrace and perf.

Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Dave Martin <dave.martin@arm.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-05 09:26:43 -04:00
Alexey Kardashevskiy
345077c8e1 KVM: PPC: Book3S: Protect memslots while validating user address
Guest physical to user address translation uses KVM memslots and reading
these requires holding the kvm->srcu lock. However recently introduced
kvmppc_tce_validate() broke the rule (see the lockdep warning below).

This moves srcu_read_lock(&vcpu->kvm->srcu) earlier to protect
kvmppc_tce_validate() as well.

=============================
WARNING: suspicious RCU usage
5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted
-----------------------------
include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by qemu-system-ppc/8020:
 #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm]

stack backtrace:
CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380
Call Trace:
[c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable)
[c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170
[c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290
[c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm]
[c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm]
[c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv]
[c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
[c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
[c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
[c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
[c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80
[c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70

Fixes: 42de7b9e21 ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-05 14:37:24 +11:00
Suraj Jitindar Singh
7cb9eb106d KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit
There is a hardware bug in some POWER9 processors where a treclaim in
fake suspend mode can cause an inconsistency in the XER[SO] bit across
the threads of a core, the workaround being to force the core into SMT4
when doing the treclaim.

The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
thread is in fake suspend or real suspend. The important difference here
being that thread reconfiguration is blocked in real suspend but not
fake suspend mode.

When we exit a guest which was in fake suspend mode, we force the core
into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
However on the new exit path introduced with the function
kvmhv_run_single_vcpu() we restore the host PSSCR before calling
kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
put the thread into real suspend mode when we clear the
PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
and the thread which is trying to get the core into SMT4 before it can
do the treclaim spins forever since it itself is blocking thread
reconfiguration. The result is that that core is essentially lost.

This results in a trace such as:
[   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
[   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
[   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
[   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
[   93.512914] CFAR: c000000000098a5c IRQMASK: 3
[   93.512915] PACATMSCRATCH: 0000000000000001
[   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
[   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
[   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
[   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
[   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
[   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
[   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
[   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
[   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
[   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
[   93.512960] Call Trace:
[   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
[   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
[   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
[   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
[   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
[   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
[   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
[   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
[   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
[   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
[   93.512983] Instruction dump:
[   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
[   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214

To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
perform the treclaim. Note kvmppc_save_tm_hv() clears the
PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.

Fixes: 95a6432ce9 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-05 14:37:24 +11:00
Waiman Long
390a0c62c2 locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs
Currently, we have two different implementation of rwsem:

 1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c)
 2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c)

As we are going to use a single generic implementation for rwsem-xadd.c
and no architecture-specific code will be needed, there is no point
in keeping two different implementations of rwsem. In most cases, the
performance of rwsem-spinlock.c will be worse. It also doesn't get all
the performance tuning and optimizations that had been implemented in
rwsem-xadd.c over the years.

For simplication, we are going to remove rwsem-spinlock.c and make all
architectures use a single implementation of rwsem - rwsem-xadd.c.

All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM
in the code are removed.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Link: https://lkml.kernel.org/r/20190322143008.21313-3-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 14:50:52 +02:00
Waiman Long
46ad0840b1 locking/rwsem: Remove arch specific rwsem files
As the generic rwsem-xadd code is using the appropriate acquire and
release versions of the atomic operations, the arch specific rwsem.h
files will not be that much faster than the generic code as long as the
atomic functions are properly implemented. So we can remove those arch
specific rwsem.h and stop building asm/rwsem.h to reduce maintenance
effort.

Currently, only x86, alpha and ia64 have implemented architecture
specific fast paths. I don't have access to alpha and ia64 systems for
testing, but they are legacy systems that are not likely to be updated
to the latest kernel anyway.

By using a rwsem microbenchmark, the total locking rates on a 4-socket
56-core 112-thread x86-64 system before and after the patch were as
follows (mixed means equal # of read and write locks):

                      Before Patch              After Patch
   # of Threads  wlock   rlock   mixed     wlock   rlock   mixed
   ------------  -----   -----   -----     -----   -----   -----
        1        29,201  30,143  29,458    28,615  30,172  29,201
        2         6,807  13,299   1,171     7,725  15,025   1,804
        4         6,504  12,755   1,520     7,127  14,286   1,345
        8         6,762  13,412     764     6,826  13,652     726
       16         6,693  15,408     662     6,599  15,938     626
       32         6,145  15,286     496     5,549  15,487     511
       64         5,812  15,495      60     5,858  15,572      60

There were some run-to-run variations for the multi-thread tests. For
x86-64, using the generic C code fast path seems to be a little bit
faster than the assembly version with low lock contention.  Looking at
the assembly version of the fast paths, there are assembly to/from C
code wrappers that save and restore all the callee-clobbered registers
(7 registers on x86-64). The assembly generated from the generic C
code doesn't need to do that. That may explain the slight performance
gain here.

The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h
with no code change as no other code other than those under
kernel/locking needs to access the internal rwsem macros and functions.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Link: https://lkml.kernel.org/r/20190322143008.21313-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 14:50:50 +02:00
Peter Zijlstra
96bc9567cb asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:47 +02:00
Peter Zijlstra
5f307be18b asm-generic/tlb, arch: Provide generic tlb_flush() based on flush_tlb_range()
Provide a generic tlb_flush() implementation that relies on
flush_tlb_range(). This is a little awkward because flush_tlb_range()
assumes a VMA for range invalidation, but we no longer have one.

Audit of all flush_tlb_range() implementations shows only vma->vm_mm
and vma->vm_flags are used, and of the latter only VM_EXEC (I-TLB
invalidates) and VM_HUGETLB (large TLB invalidate) are used.

Therefore, track VM_EXEC and VM_HUGETLB in two more bits, and create a
'fake' VMA.

This allows architectures that have a reasonably efficient
flush_tlb_range() to not require any additional effort.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:42 +02:00
Peter Zijlstra
ed6a79352c asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
Move the mmu_gather::page_size things into the generic code instead of
PowerPC specific bits.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:40 +02:00
Christophe Leroy
fd427103e8 powerpc/32: Fix early boot failure with RTAS built-in
Commit 0df977eafc ("powerpc/6xx: Don't use SPRN_SPRG2 for storing
stack pointer while in RTAS") changes the code to use a field in
thread struct to store the stack pointer while in RTAS instead of
using SPRN_SPRG2. It therefore converts all places which were
manipulating SPRN_SPRG2 to use that field. During early startup, the
zeroing of SPRN_SPRG2 has been replaced by a zeroing of that field in
thread struct. But at least in start_here, that's done wrongly because
it used the physical address of the fields while MMU is on at that
time.

So the virtual address of the field should be used instead, but in
the meantime, thread struct has already been zeroed and initialised
so we can just drop this initialisation.

Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Fixes: 0df977eafc ("powerpc/6xx: Don't use SPRN_SPRG2 for storing stack pointer while in RTAS")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-01 22:32:52 +11:00
Linus Torvalds
6536c5f2c8 powerpc fixes for 5.1 #4
Three non-regression fixes.
 
 Our optimised memcmp could read past the end of one of the buffers and
 potentially trigger a page fault leading to an oops.
 
 Some of our code to read energy management data on PowerVM had an endian bug
 leading to bogus results.
 
 When reporting a machine check exception we incorrectly reported TLB multihits
 as D-Cache multhits due to a missing entry in the array of causes.
 
 Thanks to:
   Chandan Rajendra, Gautham R. Shenoy, Mahesh Salgaonkar, Segher Boessenkool,
   Vaidyanathan Srinivasan.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcoJG4AAoJEFHr6jzI4aWAwTkP/02lEd3G9MTaLLJUsvPTBG1G
 lUKPzTNqoWLvcqdwDqsr4Cfftn/DQvgQRTDXzFZCDPdIhUizDSDKAw0vf49Aue4l
 T8rxOiD7O7eFezsbZ86XIKqsRerWmb44NzrE28zkgcW6LEIjJTO6xz7ne6Cd+Xfc
 SCji4PBHKSHsL5L3mOU769nm5YDjQDszePN8M6WuYAhW/l7xKbQqWUw6m1zNQf/2
 pyy+KOpy1dSANCYgORltSyL3k280G3q75RZFEpqZkI8Yz9vuPImZh41L3CeVo7PU
 ktg2t+vy36r1/BXisENPF9NUBqhxUROU3ji56N1hKOhiocm6BBETRx+e/N2cXakB
 erKljjF0PMGqjfHgS0L05ZIwqjzme+amMvFDIPmGTW98UVW4+YLViAGMPBtB/NPm
 k2uap4VLAiBOsaj4XFPsR7y9WPtUyt56JBkB06e3aftUa9D8rwBP9oxBCR9M+MJ0
 V4qGaRUF1TIeAUlngbqJ/MBUqwWw6kcoApq+JX0/kf2Wc/lNjXK1+VCXDHSL3qkh
 4+WhEWRCf8XC/uTBM+/2a1ULn6kd8hh7LLZpCTt5X3vI0wXf2wGTbejC01jfTcX3
 I+PR/w9bSlxv2FfsiQWnn49l0dV4ZrCgQzTZ4wfiaRFWxnwn3z6CemyOiXn1umu7
 NK2/Q/nnNIwqquh7nJo+
 =Ugv6
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Three non-regression fixes.

   - Our optimised memcmp could read past the end of one of the buffers
     and potentially trigger a page fault leading to an oops.

   - Some of our code to read energy management data on PowerVM had an
     endian bug leading to bogus results.

   - When reporting a machine check exception we incorrectly reported
     TLB multihits as D-Cache multhits due to a missing entry in the
     array of causes.

  Thanks to: Chandan Rajendra, Gautham R. Shenoy, Mahesh Salgaonkar,
  Segher Boessenkool, Vaidyanathan Srinivasan"

* tag 'powerpc-5.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/pseries/mce: Fix misleading print for TLB mutlihit
  powerpc/pseries/energy: Use OF accessor functions to read ibm,drc-indexes
  powerpc/64: Fix memcmp reading past the end of src/dest
2019-03-31 07:44:13 -07:00
Mahesh Salgaonkar
6f845ebec2 powerpc/pseries/mce: Fix misleading print for TLB mutlihit
On pseries, TLB multihit are reported as D-Cache Multihit. This is because
the wrongly populated mc_err_types[] array. Per PAPR, TLB error type is 0x04
and mc_err_types[4] points to "D-Cache" instead of "TLB" string. Fixup the
mc_err_types[] array.

Machine check error type per PAPR:
  0x00 = Uncorrectable Memory Error (UE)
  0x01 = SLB error
  0x02 = ERAT Error
  0x04 = TLB error
  0x05 = D-Cache error
  0x07 = I-Cache error

Fixes: 8f0b80561f ("powerpc/pseries: Display machine check error details.")
Cc: stable@vger.kernel.org # v4.20+
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-29 16:59:19 +11:00
Linus Torvalds
1a9df9e29c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Fixes here and there, a couple new device IDs, as usual:

   1) Fix BQL race in dpaa2-eth driver, from Ioana Ciornei.

   2) Fix 64-bit division in iwlwifi, from Arnd Bergmann.

   3) Fix documentation for some eBPF helpers, from Quentin Monnet.

   4) Some UAPI bpf header sync with tools, also from Quentin Monnet.

   5) Set descriptor ownership bit at the right time for jumbo frames in
      stmmac driver, from Aaro Koskinen.

   6) Set IFF_UP properly in tun driver, from Eric Dumazet.

   7) Fix load/store doubleword instruction generation in powerpc eBPF
      JIT, from Naveen N. Rao.

   8) nla_nest_start() return value checks all over, from Kangjie Lu.

   9) Fix asoc_id handling in SCTP after the SCTP_*_ASSOC changes this
      merge window. From Marcelo Ricardo Leitner and Xin Long.

  10) Fix memory corruption with large MTUs in stmmac, from Aaro
      Koskinen.

  11) Do not use ipv4 header for ipv6 flows in TCP and DCCP, from Eric
      Dumazet.

  12) Fix topology subscription cancellation in tipc, from Erik Hugne.

  13) Memory leak in genetlink error path, from Yue Haibing.

  14) Valid control actions properly in packet scheduler, from Davide
      Caratti.

  15) Even if we get EEXIST, we still need to rehash if a shrink was
      delayed. From Herbert Xu.

  16) Fix interrupt mask handling in interrupt handler of r8169, from
      Heiner Kallweit.

  17) Fix leak in ehea driver, from Wen Yang"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (168 commits)
  dpaa2-eth: fix race condition with bql frame accounting
  chelsio: use BUG() instead of BUG_ON(1)
  net: devlink: skip info_get op call if it is not defined in dumpit
  net: phy: bcm54xx: Encode link speed and activity into LEDs
  tipc: change to check tipc_own_id to return in tipc_net_stop
  net: usb: aqc111: Extend HWID table by QNAP device
  net: sched: Kconfig: update reference link for PIE
  net: dsa: qca8k: extend slave-bus implementations
  net: dsa: qca8k: remove leftover phy accessors
  dt-bindings: net: dsa: qca8k: support internal mdio-bus
  dt-bindings: net: dsa: qca8k: fix example
  net: phy: don't clear BMCR in genphy_soft_reset
  bpf, libbpf: clarify bump in libbpf version info
  bpf, libbpf: fix version info and add it to shared object
  rxrpc: avoid clang -Wuninitialized warning
  tipc: tipc clang warning
  net: sched: fix cleanup NULL pointer exception in act_mirr
  r8169: fix cable re-plugging issue
  net: ethernet: ti: fix possible object reference leak
  net: ibm: fix possible object reference leak
  ...
2019-03-27 12:22:57 -07:00
Gautham R. Shenoy
ce9afe08e7 powerpc/pseries/energy: Use OF accessor functions to read ibm,drc-indexes
In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent,
we currently use of_read_property() to obtain the pointer to the array
corresponding to the property "ibm,drc-indexes". The elements of this
array are of type __be32, but are accessed without any conversion to
the OS-endianness, which is buggy on a Little Endian OS.

Fix this by using of_property_read_u32_index() accessor function to
safely read the elements of the array.

Fixes: e83636ac33 ("pseries/drc-info: Search DRC properties for CPU indexes")
Cc: stable@vger.kernel.org # v4.16+
Reported-by: Pavithra R. Prakash <pavrampu@in.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[mpe: Make the WARN_ON a WARN_ON_ONCE so it's not retriggerable]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-27 10:40:09 +11:00
Michael Ellerman
d947075739 powerpc/64: Fix memcmp reading past the end of src/dest
Chandan reported that fstests' generic/026 test hit a crash:

  BUG: Unable to handle kernel data access at 0xc00000062ac40000
  Faulting instruction address: 0xc000000000092240
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
  CPU: 0 PID: 27828 Comm: chacl Not tainted 5.0.0-rc2-next-20190115-00001-g6de6dba64dda #1
  NIP:  c000000000092240 LR: c00000000066a55c CTR: 0000000000000000
  REGS: c00000062c0c3430 TRAP: 0300   Not tainted  (5.0.0-rc2-next-20190115-00001-g6de6dba64dda)
  MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000842  XER: 20000000
  CFAR: 00007fff7f3108ac DAR: c00000062ac40000 DSISR: 40000000 IRQMASK: 0
  GPR00: 0000000000000000 c00000062c0c36c0 c0000000017f4c00 c00000000121a660
  GPR04: c00000062ac3fff9 0000000000000004 0000000000000020 00000000275b19c4
  GPR08: 000000000000000c 46494c4500000000 5347495f41434c5f c0000000026073a0
  GPR12: 0000000000000000 c0000000027a0000 0000000000000000 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: c00000062ea70020 c00000062c0c38d0 0000000000000002 0000000000000002
  GPR24: c00000062ac3ffe8 00000000275b19c4 0000000000000001 c00000062ac30000
  GPR28: c00000062c0c38d0 c00000062ac30050 c00000062ac30058 0000000000000000
  NIP memcmp+0x120/0x690
  LR  xfs_attr3_leaf_lookup_int+0x53c/0x5b0
  Call Trace:
    xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
    xfs_da3_node_lookup_int+0x32c/0x5a0
    xfs_attr_node_addname+0x170/0x6b0
    xfs_attr_set+0x2ac/0x340
    __xfs_set_acl+0xf0/0x230
    xfs_set_acl+0xd0/0x160
    set_posix_acl+0xc0/0x130
    posix_acl_xattr_set+0x68/0x110
    __vfs_setxattr+0xa4/0x110
    __vfs_setxattr_noperm+0xac/0x240
    vfs_setxattr+0x128/0x130
    setxattr+0x248/0x600
    path_setxattr+0x108/0x120
    sys_setxattr+0x28/0x40
    system_call+0x5c/0x70
  Instruction dump:
  7d201c28 7d402428 7c295040 38630008 38840008 408201f0 4200ffe8 2c050000
  4182ff6c 20c50008 54c61838 7d201c28 <7d402428> 7d293436 7d4a3436 7c295040

The instruction dump decodes as:
  subfic  r6,r5,8
  rlwinm  r6,r6,3,0,28
  ldbrx   r9,0,r3
  ldbrx   r10,0,r4      <-

Which shows us doing an 8 byte load from c00000062ac3fff9, which
crosses the page boundary at c00000062ac40000 and faults.

It's not OK for memcmp to read past the end of the source or
destination buffers if that would cross a page boundary, because we
don't know that the next page is mapped.

As pointed out by Segher, we can read past the end of the source or
destination as long as we don't cross a 4K boundary, because that's
our minimum page size on all platforms.

The bug is in the code at the .Lcmp_rest_lt8bytes label. When we get
there we know that s1 is 8-byte aligned and we have at least 1 byte to
read, so a single 8-byte load won't read past the end of s1 and cross
a page boundary.

But we have to be more careful with s2. So check if it's within 8
bytes of a 4K boundary and if so go to the byte-by-byte loop.

Fixes: 2d9ee327ad ("powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()")
Cc: stable@vger.kernel.org # v4.19+
Reported-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Tested-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-25 23:33:26 +11:00
Michael Ellerman
92edf8df0f powerpc/security: Fix spectre_v2 reporting
When I updated the spectre_v2 reporting to handle software count cache
flush I got the logic wrong when there's no software count cache
enabled at all.

The result is that on systems with the software count cache flush
disabled we print:

  Mitigation: Indirect branch cache disabled, Software count cache flush

Which correctly indicates that the count cache is disabled, but
incorrectly says the software count cache flush is enabled.

The root of the problem is that we are trying to handle all
combinations of options. But we know now that we only expect to see
the software count cache flush enabled if the other options are false.

So split the two cases, which simplifies the logic and fixes the bug.
We were also missing a space before "(hardware accelerated)".

The result is we see one of:

  Mitigation: Indirect branch serialisation (kernel only)
  Mitigation: Indirect branch cache disabled
  Mitigation: Software count cache flush
  Mitigation: Software count cache flush (hardware accelerated)

Fixes: ee13cb249f ("powerpc/64s: Add support for software count cache flush")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-21 21:09:03 +11:00
Dmitry V. Levin
16add41164 syscall_get_arch: add "struct task_struct *" argument
This argument is required to extend the generic ptrace API with
PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going
to be called from ptrace_request() along with syscall_get_nr(),
syscall_get_arguments(), syscall_get_error(), and
syscall_get_return_value() functions with a tracee as their argument.

The primary intent is that the triple (audit_arch, syscall_nr, arg1..arg6)
should describe what system call is being called and what its arguments
are.

Reverts: 5e937a9ae9 ("syscall_get_arch: remove useless function arguments")
Reverts: 1002d94d30 ("syscall.h: fix doc text for syscall_get_arch()")
Reviewed-by: Andy Lutomirski <luto@kernel.org> # for x86
Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Kees Cook <keescook@chromium.org> # seccomp parts
Acked-by: Mark Salter <msalter@redhat.com> # for the c6x bit
Cc: Elvira Khabirova <lineprinter@altlinux.org>
Cc: Eugene Syromyatnikov <esyr@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: x86@kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-mips@vger.kernel.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Cc: linux-audit@redhat.com
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-03-20 21:12:36 -04:00
Ben Hutchings
8bc0868998 powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations
MAX_PHYSMEM_BITS only needs to be defined if CONFIG_SPARSEMEM is
enabled, and that was the case before commit 4ffe713b75
("powerpc/mm: Increase the max addressable memory to 2PB").

On 32-bit systems, where CONFIG_SPARSEMEM is not enabled, we now
define it as 46.  That is larger than the real number of physical
address bits, and breaks calculations in zsmalloc:

  mm/zsmalloc.c:130:49: warning: right shift count is negative
    MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
                                                   ^~
  ...
  mm/zsmalloc.c:253:21: error: variably modified 'size_class' at file scope
    struct size_class *size_class[ZS_SIZE_CLASSES];
                       ^~~~~~~~~~

Fixes: 4ffe713b75 ("powerpc/mm: Increase the max addressable memory to 2PB")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-21 00:16:45 +11:00
Christophe Leroy
4622a2d431 powerpc/6xx: fix setup and use of SPRN_SPRG_PGDIR for hash32
Not only the 603 but all 6xx need SPRN_SPRG_PGDIR to be initialised at
startup. This patch move it from __setup_cpu_603() to start_here()
and __secondary_start(), close to the initialisation of SPRN_THREAD.

Previously, virt addr of PGDIR was retrieved from thread struct.
Now that it is the phys addr which is stored in SPRN_SPRG_PGDIR,
hash_page() shall not convert it to phys anymore.
This patch removes the conversion.

Fixes: 93c4a162b0 ("powerpc/6xx: Store PGDIR physical address in a SPRG")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-19 00:30:19 +11:00
Michael Ellerman
b5b4453e79 powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038
Jakub Drnec reported:
  Setting the realtime clock can sometimes make the monotonic clock go
  back by over a hundred years. Decreasing the realtime clock across
  the y2k38 threshold is one reliable way to reproduce. Allegedly this
  can also happen just by running ntpd, I have not managed to
  reproduce that other than booting with rtc at >2038 and then running
  ntp. When this happens, anything with timers (e.g. openjdk) breaks
  rather badly.

And included a test case (slightly edited for brevity):
  #define _POSIX_C_SOURCE 199309L
  #include <stdio.h>
  #include <time.h>
  #include <stdlib.h>
  #include <unistd.h>

  long get_time(void) {
    struct timespec tp;
    clock_gettime(CLOCK_MONOTONIC, &tp);
    return tp.tv_sec + tp.tv_nsec / 1000000000;
  }

  int main(void) {
    long last = get_time();
    while(1) {
      long now = get_time();
      if (now < last) {
        printf("clock went backwards by %ld seconds!\n", last - now);
      }
      last = now;
      sleep(1);
    }
    return 0;
  }

Which when run concurrently with:
 # date -s 2040-1-1
 # date -s 2037-1-1

Will detect the clock going backward.

The root cause is that wtom_clock_sec in struct vdso_data is only a
32-bit signed value, even though we set its value to be equal to
tk->wall_to_monotonic.tv_sec which is 64-bits.

Because the monotonic clock starts at zero when the system boots the
wall_to_montonic.tv_sec offset is negative for current and future
dates. Currently on a freshly booted system the offset will be in the
vicinity of negative 1.5 billion seconds.

However if the wall clock is set past the Y2038 boundary, the offset
from wall to monotonic becomes less than negative 2^31, and no longer
fits in 32-bits. When that value is assigned to wtom_clock_sec it is
truncated and becomes positive, causing the VDSO assembly code to
calculate CLOCK_MONOTONIC incorrectly.

That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
it is not meant to do. Worse, if the time is then set back before the
Y2038 boundary CLOCK_MONOTONIC will jump backward.

We can fix it simply by storing the full 64-bit offset in the
vdso_data, and using that in the VDSO assembly code. We also shuffle
some of the fields in vdso_data to avoid creating a hole.

The original commit that added the CLOCK_MONOTONIC support to the VDSO
did actually use a 64-bit value for wtom_clock_sec, see commit
a7f290dad3 ("[PATCH] powerpc: Merge vdso's and add vdso support to
32 bits kernel") (Nov 2005). However just 3 days later it was
converted to 32-bits in commit 0c37ec2aa8 ("[PATCH] powerpc: vdso
fixes (take #2)"), and the bug has existed since then AFAICS.

Fixes: 0c37ec2aa8 ("[PATCH] powerpc: vdso fixes (take #2)")
Cc: stable@vger.kernel.org # v2.6.15+
Link: http://lkml.kernel.org/r/HaC.ZfES.62bwlnvAvMP.1STMMj@seznam.cz
Reported-by: Jakub Drnec <jaydee@email.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-18 19:26:38 +11:00
Linus Torvalds
28d747f266 Kbuild updates for v5.1 (2nd)
- add more Build-Depends to Debian source package
 
  - prefix header search paths with $(srctree)/
 
  - make modpost show verbose section mismatch warnings
 
  - avoid hard-coded CROSS_COMPILE for h8300
 
  - fix regression for Debian make-kpkg command
 
  - add semantic patch to detect missing put_device()
 
  - fix some warnings of 'make deb-pkg'
 
  - optimize NOSTDINC_FLAGS evaluation
 
  - add warnings about redundant generic-y
 
  - clean up Makefiles and scripts
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJcjm13AAoJED2LAQed4NsG9FoQALFscagW8R5LIDmzzRPmslhF
 W1qm9rEmtdnOHGg20QbYUnJwtGZjVN4lIZp6eQ3v6mhvm6IY2VhInGJpcLnwbojb
 o7y4wKcP9/ucIpfV/z32DrUfEM+qnQwztn56u7lJBxf4cTFEOIwIIS8v1KEnsNXX
 Zzvu1kSKsc4ZHHdE7h3dmr3iC5GOz/6EAJ9U33WcLy24tRTevIxcZsYvb/SOvDAT
 NYdPK8yptuVVO+odHObNwMVBidRcXRb49gWQGWLuAvfbklh33pomYarWkNe/Syif
 UeCHDNwvqzEmjSks73EomdCjME0roWhgKbm/dXJKXhe2hBzP1psMWNzRPSRa4yIj
 SHE7UfFPXCa+tNveJo2qzTOhpMw1DRiNgZD3EM2cRvwZ1ip8emJr70qFfL+RGpqq
 4ZlLb9Tibb51ApLcn+r0AnOMrC8MkK1zC8dKNxgUwdJ7D4UqZ70348c2GXE54yfv
 kxst/gtLb9r6YEtaCsKbCk1XgR2y2QGtyYrVLKsI/v6fhPVBKxnDXIpsn0Q6NYFi
 UiYKojTpFKvEMl0tc1EaYrIGoq9ZH4wDna3q4lOSRiyrypUl8NfflWwDSIuYVP5Z
 Y2tIPYTcGeCxt3gyXu0riL6tvpy1KGVlByNB9V297rSrVenH4VcfYPLJhYAtqpRo
 gO2eyp64i9LduVZOrEEP
 =6GIM
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - add more Build-Depends to Debian source package

 - prefix header search paths with $(srctree)/

 - make modpost show verbose section mismatch warnings

 - avoid hard-coded CROSS_COMPILE for h8300

 - fix regression for Debian make-kpkg command

 - add semantic patch to detect missing put_device()

 - fix some warnings of 'make deb-pkg'

 - optimize NOSTDINC_FLAGS evaluation

 - add warnings about redundant generic-y

 - clean up Makefiles and scripts

* tag 'kbuild-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: remove stale lxdialog/.gitignore
  kbuild: force all architectures except um to include mandatory-y
  kbuild: warn redundant generic-y
  Revert "modsign: Abort modules_install when signing fails"
  kbuild: Make NOSTDINC_FLAGS a simply expanded variable
  kbuild: deb-pkg: avoid implicit effects
  coccinelle: semantic code search for missing put_device()
  kbuild: pkg: grep include/config/auto.conf instead of $KCONFIG_CONFIG
  kbuild: deb-pkg: introduce is_enabled and if_enabled_echo to builddeb
  kbuild: deb-pkg: add CONFIG_ prefix to kernel config options
  kbuild: add workaround for Debian make-kpkg
  kbuild: source include/config/auto.conf instead of ${KCONFIG_CONFIG}
  unicore32: simplify linker script generation for decompressor
  h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
  kbuild: move archive command to scripts/Makefile.lib
  modpost: always show verbose warning for section mismatch
  ia64: prefix header search path with $(srctree)/
  libfdt: prefix header search paths with $(srctree)/
  deb-pkg: generate correct build dependencies
2019-03-17 13:25:26 -07:00
Masahiro Yamada
037fc3368b kbuild: force all architectures except um to include mandatory-y
Currently, every arch/*/include/uapi/asm/Kbuild explicitly includes
the common Kbuild.asm file. Factor out the duplicated include directives
to scripts/Makefile.asm-generic so that no architecture would opt out
of the mandatory-y mechanism.

um is not forced to include mandatory-y since it is a very exceptional
case which does not support UAPI.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-03-17 12:56:32 +09:00
Masahiro Yamada
7cbbbb8bc2 kbuild: warn redundant generic-y
The generic-y is redundant under the following condition:

 - arch has its own implementation

 - the same header is added to generated-y

 - the same header is added to mandatory-y

If a redundant generic-y is found, the warning like follows is displayed:

  scripts/Makefile.asm-generic:20: redundant generic-y found in arch/arm/include/asm/Kbuild: timex.h

I fixed up arch Kbuild files found by this.

Suggested-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-03-17 12:56:31 +09:00
Linus Torvalds
f67e3fb489 device-dax for 5.1
* Replace the /sys/class/dax device model with /sys/bus/dax, and include
   a compat driver so distributions can opt-in to the new ABI.
 
 * Allow for an alternative driver for the device-dax address-range
 
 * Introduce the 'kmem' driver to hotplug / assign a device-dax
   address-range to the core-mm.
 
 * Arrange for the device-dax target-node to be onlined so that the newly
   added memory range can be uniquely referenced by numa apis.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJchWpGAAoJEB7SkWpmfYgCJk8P/0Q1DINszUDO/vKjJ09cDs9P
 Jw3it6GBIL50rDOu9QdcprSpwYDD0h1mLAV/m6oa3bVO+p4uWGvnxaxRx2HN2c/v
 vhZFtUDpHlqR63vzWMNVKRprYixCRJDUr6xQhhCcE3ak/ELN6w7LWfikKVWv15UL
 MfR96IQU38f+xRda/zSXnL9606Dvkvu/inEHj84lRcHIwj3sQAUalrE8bR3O32gZ
 bDg/l5kzT49o8ZXUo/TegvRSSSZpJmOl2DD0RW+ax5q3NI2bOXFrVDUKBKxf/hcQ
 E/V9i57TrqQx0GqRhnU7rN/v53cFZGGs31TEEIB/xs3bzCnADxwXcjL5b5K005J6
 vJjBA2ODBewHFK3uVx46Hy1iV4eCtZWj4QrMnrjdSrjXOfbF5GTbWOhPFgoq7TWf
 S7VqFEf3I2gDPaMq4o8Ej1kLH4HMYeor2NSOZjyvGn87rSZ3ZIQguwbaNIVl+itz
 gdDt0ZOU0BgOBkV+rZIeZDaGdloWCHcDPL15CkZaOZyzdWhfEZ7dod6ad+9udilU
 EUPH62RgzXZtfm5zpebYyjNVLbb9pLZ0nT+UypyGR6zqWx1SqU3mXi63NFXPco+x
 XA9j//edPeI6NHg2CXLEh8DLuCg3dG1zWRJANkiF+niBwyCR8CHtGWAoY6soXbKe
 2UrXGcIfXxyJ8V9v8v4q
 =hfa3
 -----END PGP SIGNATURE-----

Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull device-dax updates from Dan Williams:
 "New device-dax infrastructure to allow persistent memory and other
  "reserved" / performance differentiated memories, to be assigned to
  the core-mm as "System RAM".

  Some users want to use persistent memory as additional volatile
  memory. They are willing to cope with potential performance
  differences, for example between DRAM and 3D Xpoint, and want to use
  typical Linux memory management apis rather than a userspace memory
  allocator layered over an mmap() of a dax file. The administration
  model is to decide how much Persistent Memory (pmem) to use as System
  RAM, create a device-dax-mode namespace of that size, and then assign
  it to the core-mm. The rationale for device-dax is that it is a
  generic memory-mapping driver that can be layered over any "special
  purpose" memory, not just pmem. On subsequent boots udev rules can be
  used to restore the memory assignment.

  One implication of using pmem as RAM is that mlock() no longer keeps
  data off persistent media. For this reason it is recommended to enable
  NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
  at rest. We considered making this recommendation an actively enforced
  requirement, but in the end decided to leave it as a distribution /
  administrator policy to allow for emulation and test environments that
  lack security capable NVDIMMs.

  Summary:

   - Replace the /sys/class/dax device model with /sys/bus/dax, and
     include a compat driver so distributions can opt-in to the new ABI.

   - Allow for an alternative driver for the device-dax address-range

   - Introduce the 'kmem' driver to hotplug / assign a device-dax
     address-range to the core-mm.

   - Arrange for the device-dax target-node to be onlined so that the
     newly added memory range can be uniquely referenced by numa apis"

NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.

And apparently the PMEM modules can cause that a lot more than regular
RAM.  The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.

Quoting Dan from another email:
 "The exposure can be reduced in the volatile-RAM case by scanning for
  and clearing errors before it is onlined as RAM. The userspace tooling
  for that can be in place before v5.1-final. There's also runtime
  notifications of errors via acpi_nfit_uc_error_notify() from
  background scrubbers on the DIMM devices. With that mechanism the
  kernel could proactively clear newly discovered poison in the volatile
  case, but that would be additional development more suitable for v5.2.

  I understand the concern, and the need to highlight this issue by
  tapping the brakes on feature development, but I don't see PMEM as RAM
  making the situation worse when the exposure is also there via DAX in
  the PMEM case. Volatile-RAM is arguably a safer use case since it's
  possible to repair pages where the persistent case needs active
  application coordination"

* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: "Hotplug" persistent memory for use like normal RAM
  mm/resource: Let walk_system_ram_range() search child resources
  mm/memory-hotplug: Allow memory resources to be children
  mm/resource: Move HMM pr_debug() deeper into resource code
  mm/resource: Return real error codes from walk failures
  device-dax: Add a 'modalias' attribute to DAX 'bus' devices
  device-dax: Add a 'target_node' attribute
  device-dax: Auto-bind device after successful new_id
  acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
  device-dax: Add /sys/class/dax backwards compatibility
  device-dax: Add support for a dax override driver
  device-dax: Move resource pinning+mapping into the common driver
  device-dax: Introduce bus + driver model
  device-dax: Start defining a dax bus model
  device-dax: Remove multi-resource infrastructure
  device-dax: Kill dax_region base
  device-dax: Kill dax_region ida
2019-03-16 13:05:32 -07:00
David S. Miller
0aedadcf6b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-03-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a umem memory leak on cleanup in AF_XDP, from Björn.

2) Fix BTF to properly resolve forward-declared enums into their corresponding
   full enum definition types during deduplication, from Andrii.

3) Fix libbpf to reject invalid flags in xsk_socket__create(), from Magnus.

4) Fix accessing invalid pointer returned from bpf_tcp_sock() and
   bpf_sk_fullsock() after bpf_sk_release() was called, from Martin.

5) Fix generation of load/store DW instructions in PPC JIT, from Naveen.

6) Various fixes in BPF helper function documentation in bpf.h UAPI header
   used to bpf-helpers(7) man page, from Quentin.

7) Fix segfault in BPF test_progs when prog loading failed, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-16 12:20:08 -07:00
Linus Torvalds
a9c55d58bc powerpc fixes for 5.1 #2
One fix to prevent runtime allocation of 16GB pages when running in a VM (as
 opposed to bare metal), because it doesn't work.
 
 A small fix to our recently added KCOV support to exempt some more code from
 being instrumented.
 
 Plus a few minor build fixes, a small dead code removal and a defconfig update.
 
 Thanks to:
   Alexey Kardashevskiy, Aneesh Kumar K.V, Christophe Leroy, Jason Yan, Joel
   Stanley, Mahesh Salgaonkar, Mathieu Malaterre.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcjNHCAAoJEFHr6jzI4aWAJVAP/21RUgDvqAAW55jTwihH6Eit
 q6l1mJ30zwARz+UYWssqMe7qIYmnjWDeapgpZncZE3P6f3VMmepJrr75zca0LJhC
 ixWqNJOcQgUu9civDwwpaqKQvyY0CYCdF5mu1rA1RNZ2kTeuCMw7zYPPpM84UGkq
 IPFe3EgWAOURFeaQUGpH16klJVbPISq/1RCtsAkR4QifD4auM+EDYq+ML69LInc4
 m7mi2CpPQDGZyCepFL0zdfOI43zrtWerG0UwCxPbGPYzvT+T3mvxU2unV1NcYn6/
 obNYB5V0OCz4gUiu7aLoHnYZx2zK8fi1lTjSrB7XhWdi4ftEfRP3TrUntHWo420n
 FC3+ibbjS3Cr8y7eubXgEAAKh74M1xzBF2bdAEHQ/QmqHZLcG+mnUihOq/g8mCp1
 LsTKvkzXilov752wKSwdjvSNbU29a2KRaXSXAEgWJvsAQbZAidGRzX7CA9XeHQPp
 kRCWHTwzXM0E31oi5rGAk2F1l4EK12QLdk1m0DF96ZanX7xG/UK6MpDNut2y51Wr
 KsWPYhUhI6pc9xt+Fts0zehDWAtfttn7RTvE+34dkaZURGl3rQkjsKt1lQ+scRYX
 fuSAnpTinE46e6APezwjCELtHDAzOCZvOnh9RVPe+F//KEF8LcNQv6TLhQoukRAe
 ldJEhSReJfo3/agqGJ6v
 =6cp4
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix to prevent runtime allocation of 16GB pages when running in a
  VM (as opposed to bare metal), because it doesn't work.

  A small fix to our recently added KCOV support to exempt some more
  code from being instrumented.

  Plus a few minor build fixes, a small dead code removal and a
  defconfig update.

  Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Christophe Leroy,
  Jason Yan, Joel Stanley, Mahesh Salgaonkar, Mathieu Malaterre"

* tag 'powerpc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Include <asm/nmi.h> header file to fix a warning
  powerpc/powernv: Fix compile without CONFIG_TRACEPOINTS
  powerpc/mm: Disable kcov for SLB routines
  powerpc: remove dead code in head_fsl_booke.S
  powerpc/configs: Sync skiroot defconfig
  powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
2019-03-16 10:45:17 -07:00
Naveen N. Rao
86be36f650 powerpc: bpf: Fix generation of load/store DW instructions
Yauheni Kaliuta pointed out that PTR_TO_STACK store/load verifier test
was failing on powerpc64 BE, and rightfully indicated that the PPC_LD()
macro is not masking away the last two bits of the offset per the ISA,
resulting in the generation of 'lwa' instruction instead of the intended
'ld' instruction.

Segher also pointed out that we can't simply mask away the last two bits
as that will result in loading/storing from/to a memory location that
was not intended.

This patch addresses this by using ldx/stdx if the offset is not
word-aligned. We load the offset into a temporary register (TMP_REG_2)
and use that as the index register in a subsequent ldx/stdx. We fix
PPC_LD() macro to mask off the last two bits, but enhance PPC_BPF_LL()
and PPC_BPF_STL() to factor in the offset value and generate the proper
instruction sequence. We also convert all existing users of PPC_LD() and
PPC_STD() to use these macros. All existing uses of these macros have
been audited to ensure that TMP_REG_2 can be clobbered.

Fixes: 156d0e290e ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
Cc: stable@vger.kernel.org # v4.9+

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-16 01:28:22 +01:00
Linus Torvalds
636deed6c0 ARM: some cleanups, direct physical timer assignment, cache sanitization
for 32-bit guests
 
 s390: interrupt cleanup, introduction of the Guest Information Block,
 preparation for processor subfunctions in cpu models
 
 PPC: bug fixes and improvements, especially related to machine checks
 and protection keys
 
 x86: many, many cleanups, including removing a bunch of MMU code for
 unnecessary optimizations; plus AVIC fixes.
 
 Generic: memcg accounting
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJci+7XAAoJEL/70l94x66DUMkIAKvEefhceySHYiTpfefjLjIC
 16RewgHa+9CO4Oo5iXiWd90fKxtXLXmxDQOS4VGzN0rxvLGRw/fyXIxL1MDOkaAO
 l8SLSNuewY4XBUgISL3PMz123r18DAGOuy9mEcYU/IMesYD2F+wy5lJ17HIGq6X2
 RpoF1p3qO1jfkPTKOob6Ixd4H5beJNPKpdth7LY3PJaVhDxgouj32fxnLnATVSnN
 gENQ10fnt8BCjshRYW6Z2/9bF15JCkUFR1xdBW2/xh1oj+kvPqqqk2bEN1eVQzUy
 2hT/XkwtpthqjSbX8NNavWRSFnOnbMLTRKQyIXmFVsM5VoSrwtiGsCFzBgcT++I=
 =XIzU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - some cleanups
   - direct physical timer assignment
   - cache sanitization for 32-bit guests

  s390:
   - interrupt cleanup
   - introduction of the Guest Information Block
   - preparation for processor subfunctions in cpu models

  PPC:
   - bug fixes and improvements, especially related to machine checks
     and protection keys

  x86:
   - many, many cleanups, including removing a bunch of MMU code for
     unnecessary optimizations
   - AVIC fixes

  Generic:
   - memcg accounting"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
  kvm: vmx: fix formatting of a comment
  KVM: doc: Document the life cycle of a VM and its resources
  MAINTAINERS: Add KVM selftests to existing KVM entry
  Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
  KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
  KVM: PPC: Fix compilation when KVM is not enabled
  KVM: Minor cleanups for kvm_main.c
  KVM: s390: add debug logging for cpu model subfunctions
  KVM: s390: implement subfunction processor calls
  arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
  KVM: arm/arm64: Remove unused timer variable
  KVM: PPC: Book3S: Improve KVM reference counting
  KVM: PPC: Book3S HV: Fix build failure without IOMMU support
  Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
  x86: kvmguest: use TSC clocksource if invariant TSC is exposed
  KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
  KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
  KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
  KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
  KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
  ...
2019-03-15 15:00:28 -07:00
Mathieu Malaterre
de3c83c2fd powerpc/64s: Include <asm/nmi.h> header file to fix a warning
Make sure to include <asm/nmi.h> to provide the following prototype:
hv_nmi_check_nonrecoverable.

Remove the following warning treated as error (W=1):

  arch/powerpc/kernel/traps.c:393:6: error: no previous prototype for 'hv_nmi_check_nonrecoverable'

Fixes: ccd477028a ("powerpc/64s: Fix HV NMI vs HV interrupt recoverability test")
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-13 15:03:13 +11:00
Alexey Kardashevskiy
1702877621 powerpc/powernv: Fix compile without CONFIG_TRACEPOINTS
The functions returns s64 but the return statement is missing.
This adds the missing return statement.

Fixes: 75d9fc7fd9 ("powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-13 14:59:39 +11:00
Mike Rapoport
8a7f97b902 treewide: add checks for the return value of memblock_alloc*()
Add check for the return value of memblock_alloc*() functions and call
panic() in case of error.  The panic message repeats the one used by
panicing memblock allocators with adjustment of parameters to include
only relevant ones.

The replacement was mostly automated with semantic patches like the one
below with manual massaging of format strings.

  @@
  expression ptr, size, align;
  @@
  ptr = memblock_alloc(size, align);
  + if (!ptr)
  + 	panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align);

[anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type]
  Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org
[rppt@linux.ibm.com: fix format strings for panics after memblock_alloc]
  Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com
[rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails]
  Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx
[akpm@linux-foundation.org: fix xtensa printk warning]
Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: Guo Ren <ren_guo@c-sky.com>		[c-sky]
Acked-by: Paul Burton <paul.burton@mips.com>		[MIPS]
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390]
Reviewed-by: Juergen Gross <jgross@suse.com>		[Xen]
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Acked-by: Max Filippov <jcmvbkbc@gmail.com>		[xtensa]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport
0ba9e6edd4 memblock: drop memblock_alloc_base()
The memblock_alloc_base() function tries to allocate a memory up to the
limit specified by its max_addr parameter and panics if the allocation
fails.  Replace its usage with memblock_phys_alloc_range() and make the
callers check the return value and panic in case of error.

Link: http://lkml.kernel.org/r/1548057848-15136-10-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
ecc3e771f4 memblock: memblock_phys_alloc(): don't panic
Make the memblock_phys_alloc() function an inline wrapper for
memblock_phys_alloc_range() and update the memblock_phys_alloc() callers
to check the returned value and panic in case of error.

Link: http://lkml.kernel.org/r/1548057848-15136-8-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
337555744e memblock: memblock_phys_alloc_try_nid(): don't panic
The memblock_phys_alloc_try_nid() function tries to allocate memory from
the requested node and then falls back to allocation from any node in
the system.  The memblock_alloc_base() fallback used by this function
panics if the allocation fails.

Replace the memblock_alloc_base() fallback with the direct call to
memblock_alloc_range_nid() and update the memblock_phys_alloc_try_nid()
callers to check the returned value and panic in case of error.

Link: http://lkml.kernel.org/r/1548057848-15136-7-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Christophe Leroy
1269f7b83f powerpc: use memblock functions returning virtual address
Since only the virtual address of allocated blocks is used, lets use
functions returning directly virtual address.

Those functions have the advantage of also zeroing the block.

[rppt@linux.ibm.com: powerpc: remove duplicated alloc_stack() function]
  Link: http://lkml.kernel.org/r/20190226064032.GA5873@rapoport-lnx
[rppt@linux.ibm.com: updated error message in alloc_stack() to be more verbose]
[rppt@linux.ibm.com: convereted several additional call sites ]
Link: http://lkml.kernel.org/r/1548057848-15136-3-git-send-email-rppt@linux.ibm.com
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mahesh Salgaonkar
19d6907521 powerpc/mm: Disable kcov for SLB routines
The kcov instrumentation inside SLB routines causes duplicate SLB
entries to be added resulting into SLB multihit machine checks.
Disable kcov instrumentation on slb.o

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-12 14:06:12 +11:00