Radix keeps no meaningful state in addr_limit, so remove it from radix
code and rename to slb_addr_limit to make it clear it applies to hash
only.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Radix VA space allocations test addresses against mm->task_size which
is 512TB, even in cases where the intention is to limit allocation to
below 128TB.
This results in mmap with a hint address below 128TB but address +
length above 128TB succeeding when it should fail (as hash does after
the previous patch).
Set the high address limit to be considered up front, and base
subsequent allocation checks on that consistently.
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
While mapping hints with a length that cross 128TB are disallowed,
MAP_FIXED allocations that cross 128TB are allowed. These are failing
on hash (on radix they succeed). Add an additional case for fixed
mappings to expand the addr_limit when crossing 128TB.
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hash unconditionally resets the addr_limit to default (128TB) when the
mm context is initialised. If a process has > 128TB mappings when it
forks, the child will not get the 512TB addr_limit, so accesses to
valid > 128TB mappings will fail in the child.
Fix this by only resetting the addr_limit to default if it was 0. Non
zero indicates it was duplicated from the parent (0 means exec()).
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When allocating VA space with a hint that crosses 128TB, the SLB
addr_limit variable is not expanded if addr is not > 128TB, but the
slice allocation looks at task_size, which is 512TB. This results in
slice_check_fit() incorrectly succeeding because the slice_count
truncates off bit 128 of the requested mask, so the comparison to the
available mask succeeds.
Fix this by using mm->context.addr_limit instead of mm->task_size for
testing allocation limits. This causes such allocations to fail.
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently userspace is able to request mmap() search between 128T-512T
by specifying a hint address that is greater than 128T. But that means
a hint of 128T exactly will return an address below 128T, which is
confusing and wrong.
So fix the logic to check the hint is greater than *or equal* to 128T.
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Cc: stable@vger.kernel.org # v4.12+
Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of Nick's bigger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 398a719d34 ("powerpc/mm: Update bits used to skip hash_page")
mistakenly dropped the DSISR_DABRMATCH bit from the mask of bit tested
to skip trying to hash a page.
As a result, the DABR matches would no longer be detected.
This adds it back. We open code it in the 2 places where it matters
rather than fold it into DSISR_BAD_FAULT_32S/64S because this isn't
technically a bad fault and while we would never hit it with the
current code, I prefer if page_fault_is_bad() didn't trigger on these.
Fixes: 398a719d34 ("powerpc/mm: Update bits used to skip hash_page")
Cc: stable@vger.kernel.org # v4.14
Tested-by: Pedro Miraglia Franco de Carvalho <pedromfc@br.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When a uprobe is installed on an instruction that we currently do not
emulate, we copy the instruction into a xol buffer and single step
that instruction. If that instruction generates a fault, we abort the
single stepping before invoking the signal handler. Once the signal
handler is done, the uprobe trap is hit again since the instruction is
retried and the process repeats.
We use uprobe_deny_signal() to detect if the xol instruction triggered
a signal. If so, we clear TIF_SIGPENDING and set TIF_UPROBE so that the
signal is not handled until after the single stepping is aborted. In
this case, uprobe_deny_signal() returns true and get_signal() ends up
returning 0. However, in do_signal(), we are not looking at the return
value, but depending on ksig.sig for further action, all with an
uninitialized ksig that is not touched in this scenario. Fix the same
by initializing ksig.sig to 0.
Fixes: 129b69df9c ("powerpc: Use get_signal() signal_setup_done()")
Cc: stable@vger.kernel.org # v3.17+
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently sysfs store handlers in fadump use if buf[0] == 'char'.
This means input "100foo" is interpreted as '1' and "01" as '0'.
Change to kstrtoint so leading zeroes and the like is handled in
expected way.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Acked-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Michal Suchanek <a class="moz-txt-link-rfc2396E" href="mailto:msuchanek@suse.de"><msuchanek@suse.de></a></pre>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement the architecture specific portitions of the UACCESS_FLUSHCACHE
API. This provides functions for the copy_user_flushcache iterator that
ensure that when the copy is finished the destination buffer contains
a copy of the original and that the destination buffer is clean in the
processor caches.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement the architecture specific cache maintence functions that make
up the "PMEM API". Currently the writeback and invalidate functions
are the same since the function of the DCBST (data cache block store)
instruction is typically interpreted as "writeback to the point of
coherency" rather than to memory. As a result implementing the API
requires a full cache flush rather than just a cache write back. This
will probably change in the not-too-distant future.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The nest mmu required an explicit flush as a tlbi would not flush it in the
same way as the core. However an alternate firmware fix exists which should
eliminate the need for this flush, so instead add a device-tree property
(ibm,nmmu-flush) on the NVLink2 PHB to enable it only if required.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With the optimisations introduced by commit a46cc7a908 ("powerpc/mm/radix:
Improve TLB/PWC flushes"), flush_tlb_mm() no longer flushes the page walk
cache with radix. Switch to using flush_all_mm() to ensure the pwc and tlb
are properly flushed on the nmmu.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use safer string manipulation functions when dealing with a
user-provided string in kprobe_lookup_name().
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 3cdfcbfd32 ("powerpc: Change analyse_instr so it doesn't
modify *regs") introduced emulate_update_regs() to perform part of what
emulate_step() was doing earlier. However, this function was not added
to the kprobes blacklist. Add it so as to prevent it from being probed.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Per Documentation/kprobes.txt, we don't necessarily need to disable
interrupts before invoking the kprobe handlers. Masami submitted
similar changes for x86 via commit a19b2e3d78 ("kprobes/x86: Remove
IRQ disabling from ftrace-based/optimized kprobes"). Do the same for
powerpc.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Per Documentation/kprobes.txt, probe handlers need to be invoked with
preemption disabled. Update optimized_callback() to do so. Also move
get_kprobe_ctlblk() invocation post preemption disable, since it
accesses pre-cpu data.
This was not an issue so far since optprobes wasn't selected if
CONFIG_PREEMPT was enabled. Commit a30b85df7d ("kprobes: Use
synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y") changes
this.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 78adf6c214 ("powerpc/64s: Implement system reset idle wakeup
reason"), added a call to ppc_save_regs() in the book3s code.
ppc_save_regs() is only built if XMON and/or KEXEC_CORE are enabled,
which is usually the case, however if they're not enabled then the
build breaks.
Fix it by making the Makefile check also build ppc_save_regs.o if
CONFIG_PPC_BOOK3S is enabled.
Fixes: 78adf6c214 ("powerpc/64s: Implement system reset idle wakeup reason")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[mpe: Write change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When using the radix MMU on Power9 DD1, to work around a hardware
problem, radix__pte_update() is required to do a two stage update of
the PTE. First we write a zero value into the PTE, then we flush the
TLB, and then we write the new PTE value.
In the normal case that works OK, but it does not work if we're
updating the PTE that maps the code we're executing, because the
mapping is removed by the TLB flush and we can no longer execute from
it. Unfortunately the STRICT_RWX code needs to do exactly that.
The exact symptoms when we hit this case vary, sometimes we print an
oops and then get stuck after that, but I've also seen a machine just
get stuck continually page faulting with no oops printed. The variance
is presumably due to the exact layout of the text and the page size
used for the mappings. In all cases we are unable to boot to a shell.
There are possible solutions such as creating a second mapping of the
TLB flush code, executing from that, and then jumping back to the
original. However we don't want to add that level of complexity for a
DD1 work around.
So just detect that we're running on Power9 DD1 and refrain from
changing the permissions, effectively disabling STRICT_RWX on Power9
DD1.
Fixes: 7614ff3272 ("powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix")
Cc: stable@vger.kernel.org # v4.13+
Reported-by: Andrew Jeffery <andrew@aj.id.au>
[Changelog as suggested by Michael Ellerman <mpe@ellerman.id.au>]
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add support for user space receive window (for the Fast thread-wakeup
coprocessor type)
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Define an interface to return a system-wide unique id for a given VAS
window.
The vas_win_id() will be used in a follow-on patch to generate an unique
handle for a user space receive window. Applications can use this handle
to pair send and receive windows for fast thread-wakeup.
The hardware refers to this system-wide unique id as a Partition Send
Window ID which is expected to be used during fault handling. Hence the
"pswid" in the function names.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Define an interface that the NX drivers can use to find the physical
paste address of a send window. This interface is expected to be used
with the mmap() operation of the NX driver's device. i.e the user space
process can use driver's mmap() operation to map the send window's paste
address into their address space and then use copy and paste instructions
to submit the CRBs to the NX engine.
Note that kernel drivers will use vas_paste_crb() directly and don't need
this interface.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A CP_ABORT instruction is required in processes that have mapped a VAS
"paste address" with the intention of using COPY/PASTE instructions.
But since CP_ABORT is expensive, we want to restrict it to only
processes that use/intend to use COPY/PASTE.
Define an interface, set_thread_uses_vas(), that VAS can use to
indicate that the current process opened a send window. During context
switch, issue CP_ABORT only for processes that have the flag set.
Thanks for input from Nick Piggin, Michael Ellerman.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[mpe: Fix to not use new_thread after _switch() returns]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We need the SPRN_TIDR to be set for use with fast thread-wakeup (core-
to-core wakeup) and also with CAPI.
Each thread in a process needs to have a unique id within the process.
But for now, we assign globally unique thread ids to all threads in
the system.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
[mpe: Simplify tidr clearing on fork() and ctx switch code]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Export the VAS Window context information to debugfs.
We need to hold a mutex when closing the window to prevent a race
with the debugfs read(). Rather than introduce a per-instance mutex,
we use the global vas_mutex for now, since it is not heavily contended.
The window->cop field is only relevant to a receive window so we were
not setting it for a send window (which is is paired to a receive window
anyway). But to simplify reporting in debugfs, set the 'cop' field for the
send window also.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Define a helper, chip_to_vas_id() to map a given chip id to corresponding
vas id.
Normally, callers of vas_rx_win_open() and vas_tx_win_open() want the VAS
window to be on the same chip where the calling thread is executing. These
callers can pass in -1 for the VAS id.
This interface will be useful if a thread running on one chip wants to open
a window on another chip (like the NX-842 driver does during start up).
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Create a cpu to vasid mapping so callers can specify -1 instead of
trying to find a VAS id.
Changelog[v2]
[Michael Ellerman] Use per-cpu variables to simplify code.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Normally, the NX driver waits for the CRBs to be processed before closing
the window. But it is better to ensure that the credits are returned before
the window gets reassigned later.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Save the configured max window credits for a window in the vas_window
structure. We will need this when polling for return of window credits.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A VAS window is normally in "busy" state for only a short duration.
Reduce the time we wait for the window to go to "not-busy" state to
speed-up vas_win_close() a bit.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use a helper to have the hardware unpin and mark a window closed.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Polling for window cast out is listed in the spec, but turns out that
it is not strictly necessary and slows down window close. Making it a
stub for now.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Clean up vas.h and the debug code around ifdef vas_debug.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
NX-842, the only user of VAS, sets the window credits to default values
but VAS should check the credits against the possible max values.
The VAS_WCREDS_MIN is not needed and can be dropped.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Initialize a few missing window context fields from the window attributes
specified by the caller. These fields are currently set to their default
values by the caller (NX-842), but would be good to apply them anyway.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Take the DSCR value set by firmware as the dscr_default value,
rather than zero.
POWER9 recommends DSCR default to a non-zero value.
Signed-off-by: From: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make record_spr_defaults() __init]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
OPAL boot does not insert secondaries at 0x60 to wait at the secondary
hold spinloop. Instead they are started later, and inserted at
generic_secondary_smp_init(), which is after the secondary hold
spinloop.
Avoid waiting on this spinloop when booting with OPAL firmware. This
wait always times out that case.
This saves 100ms boot time on powernv, and 10s of seconds of real time
when booting on the simulator in SMP.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Unmaps that free page tables always flush the entire PID, which is
sub-optimal. Provide TLB range flushing with an additional PWC flush
that can be use for va range invalidations with PWC flush.
Time to munmap N pages of memory including last level page table
teardown (after mmap, touch), local invalidate:
N 1 2 4 8 16 32 64
vanilla 3.2us 3.3us 3.4us 3.6us 4.1us 5.2us 7.2us
patched 1.4us 1.5us 1.7us 1.9us 2.6us 3.7us 6.2us
Global invalidate:
N 1 2 4 8 16 32 64
vanilla 2.2us 2.3us 2.4us 2.6us 3.2us 4.1us 6.2us
patched 2.1us 2.5us 3.4us 5.2us 8.7us 15.7us 6.2us
Local invalidates get much better across the board. Global ones have
the same issue where multiple tlbies for va flush do get slower than
the single tlbie to invalidate the PID. None of this test captures
the TLB benefits of avoiding killing everything.
Global gets worse, but it is brought in to line with global invalidate
for munmap()s that do not free page tables.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The single page flush ceiling is the cut-off point at which we switch
from invalidating individual pages, to invalidating the entire process
address space in response to a range flush.
Introduce a local variant of this heuristic because local and global
tlbie have significantly different properties:
- Local tlbiel requires 128 instructions to invalidate a PID, global
tlbie only 1 instruction.
- Global tlbie instructions are expensive broadcast operations.
The local ceiling has been made much higher, 2x the number of
instructions required to invalidate the entire PID (i.e., 256 pages).
Time to mprotect N pages of memory (after mmap, touch), local invalidate:
N 32 34 64 128 256 512
vanilla 7.4us 9.0us 14.6us 26.4us 50.2us 98.3us
patched 7.4us 7.8us 13.8us 26.4us 51.9us 98.3us
The behaviour of both is identical at N=32 and N=512. Between there,
the vanilla kernel does a PID invalidate and the patched kernel does
a va range invalidate.
At N=128, these require the same number of tlbiel instructions, so
the patched version can be sen to be cheaper when < 128, and more
expensive when > 128. However this does not well capture the cost
of invalidated TLB.
The additional cost at 256 pages does not seem prohibitive. It may
be the case that increasing the limit further would continue to be
beneficial to avoid invalidating all of the process's TLB entries.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently for radix, flush_tlb_range flushes the entire PID, because
the Linux mm code does not tell us about page size here for THP vs
regular pages. This is quite sub-optimal for small mremap / mprotect
/ change_protection.
So implement va range flushes with two flush passes, one for each
page size (regular and THP). The second flush has an order of matnitude
fewer tlbie instructions than the first, so it is a relatively small
additional cost.
There is still room for improvement here with some changes to generic
APIs, particularly if there are mostly THP pages to be invalidated,
the small page flushes could be reduced.
Time to mprotect 1 page of memory (after mmap, touch):
vanilla 2.9us 1.8us
patched 1.2us 1.6us
Time to mprotect 30 pages of memory (after mmap, touch):
vanilla 8.2us 7.2us
patched 6.9us 17.9us
Time to mprotect 34 pages of memory (after mmap, touch):
vanilla 9.1us 8.0us
patched 9.0us 8.0us
34 pages is the point at which the invalidation switches from va
to entire PID, which tlbie can do in a single instruction. This is
why in the case of 30 pages, the new code runs slower for this test.
This is a deliberate tradeoff already present in the unmap and THP
promotion code, the idea is that the benefit from avoiding flushing
entire TLB for this PID on all threads in the system.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move the barriers and range iteration down into the _tlbie* level,
which improves readability.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Short range flushes issue a sequences of tlbie(l) instructions for
individual effective addresses. These do not all require individual
barrier sequences, only one covering all tlbie(l) instructions.
Commit f7327e0ba3 ("powerpc/mm/radix: Remove unnecessary ptesync")
made a similar optimization for tlbiel for PID flushing.
For tlbie, the ISA says:
The tlbsync instruction provides an ordering function for the
effects of all tlbie instructions executed by the thread executing
the tlbsync instruction, with respect to the memory barrier
created by a subsequent ptesync instruction executed by the same
thread.
Time to munmap 30 pages of memory (after mmap, touch):
local global
vanilla 10.9us 22.3us
patched 3.4us 14.4us
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have some dependencies & conflicts between patches in fixes and
things to go in next, both in the radix TLB flush code and the IMC PMU
driver. So merge fixes into next.
This rearranges the code in kvmppc_run_vcpu() and kvmppc_run_vcpu_hv()
to be neater and clearer. Deeply indented code in kvmppc_run_vcpu()
is moved out to a helper function, kvmhv_setup_mmu(). In
kvmppc_vcpu_run_hv(), make use of the existing variable 'kvm' in
place of 'vcpu->kvm'.
No functional change.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in a couple of fixes from the kvm-ppc-fixes branch that
modify the same areas of code as some commits from the kvm-ppc-next
branch, in order to resolve the conflicts.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We need to add "clean-files" in Makfiles to clean up DT blobs, but we
often miss to do so.
Since there are no source files that end with .dtb or .dtb.S, so we
can clean-up those files from the top-level Makefile.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rob Herring <robh@kernel.org>
Most of DT files are compiled under arch/*/boot/dts/, but we have some
other directories, like drivers/of/unittest-data/. We often miss to
add gitignore patterns per directory. Since there are no source files
that end with .dtb or .dtb.S, we can ignore the patterns globally.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Just one fix here for a host crash that can occur with HV KVM
as a result of resizing the guest hashed page table (HPT).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJaApLVAAoJEJ2a6ncsY3GfNfcIAJk93C9FK6k2urAORP3lDmKy
P6a4LnkMrQTuUCBGrkP4F1hGq2vpH6o/KeoEdhAgLMHHsarzMyBc5N7rHMHgZUzI
bUna0LaXtjdb5IP0kcDb8HmulmBaFiMf+sa2i3dIW3sCxtvqzzmxOluR0C29fG1I
gTdJV0XDzhQHJLixcQ3i4pi/K6b+wzXrY7fFPMpI2Wji6cKYr0ZL0fG8bQ0pV4OZ
0YgV9sR8mVN17JKU9R4GYz9fkp3+cXDG4xBVtczDlK6TJzF2XVUGgY/iJLMAyDRw
9gcEiIc+khkqyfuQt8iYBiHqRJ7HiT4yX1LMI9dM2vTZi23zsG3yTmsIc16QZLg=
=MzO/
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-fixes-4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
PPC KVM fixes for 4.14
Just one fix here for a host crash that can occur with HV KVM
as a result of resizing the guest hashed page table (HPT).
It would be nice to be able to dump page tables in a particular
context.
eg: dumping vmalloc space:
0:mon> dv 0xd00037fffff00000
pgd @ 0xc0000000017c0000
pgdp @ 0xc0000000017c00d8 = 0x00000000f10b1000
pudp @ 0xc0000000f10b13f8 = 0x00000000f10d0000
pmdp @ 0xc0000000f10d1ff8 = 0x00000000f1102000
ptep @ 0xc0000000f1102780 = 0xc0000000f1ba018e
Maps physical address = 0x00000000f1ba0000
Flags = Accessed Dirty Read Write
This patch does not replicate the complex code of dump_pagetable and
has no support for bolted linear mapping, thats why I've it's called
dump virtual page table support. The format of the PTE can be expanded
even further to add more useful information about the flags in the PTE
if required.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Bike shed the output format, show the pgdir, fix build failures]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 5e9859699a ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing
implementation", 2016-12-20) added code that tries to exclude any use
or update of the hashed page table (HPT) while the HPT resizing code
is iterating through all the entries in the HPT. It does this by
taking the kvm->lock mutex, clearing the kvm->arch.hpte_setup_done
flag and then sending an IPI to all CPUs in the host. The idea is
that any VCPU task that tries to enter the guest will see that the
hpte_setup_done flag is clear and therefore call kvmppc_hv_setup_htab_rma,
which also takes the kvm->lock mutex and will therefore block until
we release kvm->lock.
However, any VCPU that is already in the guest, or is handling a
hypervisor page fault or hypercall, can re-enter the guest without
rechecking the hpte_setup_done flag. The IPI will cause a guest exit
of any VCPUs that are currently in the guest, but does not prevent
those VCPU tasks from immediately re-entering the guest.
The result is that after resize_hpt_rehash_hpte() has made a HPTE
absent, a hypervisor page fault can occur and make that HPTE present
again. This includes updating the rmap array for the guest real page,
meaning that we now have a pointer in the rmap array which connects
with pointers in the old rev array but not the new rev array. In
fact, if the HPT is being reduced in size, the pointer in the rmap
array could point outside the bounds of the new rev array. If that
happens, we can get a host crash later on such as this one:
[91652.628516] Unable to handle kernel paging request for data at address 0xd0000000157fb10c
[91652.628668] Faulting instruction address: 0xc0000000000e2640
[91652.628736] Oops: Kernel access of bad area, sig: 11 [#1]
[91652.628789] LE SMP NR_CPUS=1024 NUMA PowerNV
[91652.628847] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal ipmi_powernv ipmi_devintf i2c_core ipmi_msghandler powernv_op_panel nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm scsi_dh_alua dm_service_time dm_multipath tg3 ptp pps_core [last unloaded: stap_552b612747aec2da355051e464fa72a1_14259]
[91652.629566] CPU: 136 PID: 41315 Comm: CPU 21/KVM Tainted: G O 4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le #1
[91652.629684] task: c0000007a419e400 task.stack: c0000000028d8000
[91652.629750] NIP: c0000000000e2640 LR: d00000000c36e498 CTR: c0000000000e25f0
[91652.629829] REGS: c0000000028db5d0 TRAP: 0300 Tainted: G O (4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le)
[91652.629932] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022422 XER: 00000000
[91652.630034] CFAR: d00000000c373f84 DAR: d0000000157fb10c DSISR: 40000000 SOFTE: 1
[91652.630034] GPR00: d00000000c36e498 c0000000028db850 c000000001403900 c0000007b7960000
[91652.630034] GPR04: d0000000117fb100 d000000007ab00d8 000000000033bb10 0000000000000000
[91652.630034] GPR08: fffffffffffffe7f 801001810073bb10 d00000000e440000 d00000000c373f70
[91652.630034] GPR12: c0000000000e25f0 c00000000fdb9400 f000000003b24680 0000000000000000
[91652.630034] GPR16: 00000000000004fb 00007ff7081a0000 00000000000ec91a 000000000033bb10
[91652.630034] GPR20: 0000000000010000 00000000001b1190 0000000000000001 0000000000010000
[91652.630034] GPR24: c0000007b7ab8038 d0000000117fb100 0000000ec91a1190 c000001e6a000000
[91652.630034] GPR28: 00000000033bb100 000000000073bb10 c0000007b7960000 d0000000157fb100
[91652.630735] NIP [c0000000000e2640] kvmppc_add_revmap_chain+0x50/0x120
[91652.630806] LR [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
[91652.630884] Call Trace:
[91652.630913] [c0000000028db850] [c0000000028db8b0] 0xc0000000028db8b0 (unreliable)
[91652.630996] [c0000000028db8b0] [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
[91652.631091] [c0000000028db9e0] [d00000000c36a078] kvmppc_vcpu_run_hv+0xdf8/0x1300 [kvm_hv]
[91652.631179] [c0000000028dbb30] [d00000000c2248c4] kvmppc_vcpu_run+0x34/0x50 [kvm]
[91652.631266] [c0000000028dbb50] [d00000000c220d54] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm]
[91652.631351] [c0000000028dbbd0] [d00000000c2139d8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
[91652.631433] [c0000000028dbd40] [c0000000003832e0] do_vfs_ioctl+0xd0/0x8c0
[91652.631501] [c0000000028dbde0] [c000000000383ba4] SyS_ioctl+0xd4/0x130
[91652.631569] [c0000000028dbe30] [c00000000000b8e0] system_call+0x58/0x6c
[91652.631635] Instruction dump:
[91652.631676] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ffa1 2fa70000 793d0020 e9432110
[91652.631814] 7bbf26e4 7c7e1b78 7feafa14 409e0094 <807f000c> 786326e4 7c6a1a14 93a40008
[91652.631959] ---[ end trace ac85ba6db72e5b2e ]---
To fix this, we tighten up the way that the hpte_setup_done flag is
checked to ensure that it does provide the guarantee that the resizing
code needs. In kvmppc_run_core(), we check the hpte_setup_done flag
after disabling interrupts and refuse to enter the guest if it is
clear (for a HPT guest). The code that checks hpte_setup_done and
calls kvmppc_hv_setup_htab_rma() is moved from kvmppc_vcpu_run_hv()
to a point inside the main loop in kvmppc_run_vcpu(), ensuring that
we don't just spin endlessly calling kvmppc_run_core() while
hpte_setup_done is clear, but instead have a chance to block on the
kvm->lock mutex.
Finally we also check hpte_setup_done inside the region in
kvmppc_book3s_hv_page_fault() where the HPTE is locked and we are about
to update the HPTE, and bail out if it is clear. If another CPU is
inside kvm_vm_ioctl_resize_hpt_commit) and has cleared hpte_setup_done,
then we know that either we are looking at a HPTE
that resize_hpt_rehash_hpte() has not yet processed, which is OK,
or else we will see hpte_setup_done clear and refuse to update it,
because of the full barrier formed by the unlock of the HPTE in
resize_hpt_rehash_hpte() combined with the locking of the HPTE
in kvmppc_book3s_hv_page_fault().
Fixes: 5e9859699a ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Satheesh Rajendran <satheera@in.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
<linux/pci.h> defines struct pci_bus and struct pci_dev and includes the
struct resource definition before including <asm/pci.h>. Nobody includes
<asm/pci.h> directly, so they don't need their own declarations.
Remove the redundant struct pci_dev, pci_bus, resource declarations.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> # CRIS
Acked-by: Ralf Baechle <ralf@linux-mips.org> # MIPS
In preperation for a new function that will need additional resource
information during the resource walk, update the resource walk callback to
pass the resource structure. Since the current callback start and end
arguments are pulled from the resource structure, the callback functions
can obtain them from the resource structure directly.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20171020143059.3291-10-brijesh.singh@amd.com
In commit e6f81a9201 ("powerpc/mm/hash: Support 68 bit VA") the
masking is folded into ASM_VSID_SCRAMBLE but the comment about masking
is removed only from the firt use of ASM_VSID_SCRAMBLE.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
DMA windows can only have a size of power of two on IODA2 hardware and
using memory_hotplug_max() to determine the upper limit won't work
correcly if it returns not power of two value.
This removes the check as the platform code does this check in
pnv_pci_ioda2_setup_default_config() anyway; the other client is VFIO
and that thing checks against locked_vm limit which prevents the userspace
from locking too much memory.
It is expected to impact DPDK on machines with non-power-of-two RAM size,
mostly. KVM guests are less likely to be affected as usually guests get
less than half of hosts RAM.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The call to /proc/cpuinfo in turn calls cpufreq_quick_get() which
returns the last frequency requested by the kernel, but may not
reflect the actual frequency the processor is running at. This patch
makes a call to cpufreq_get() instead which returns the current
frequency reported by the hardware.
Fixes: fb5153d05a ("powerpc: powernv: Implement ppc_md.get_proc_freq()")
Signed-off-by: Shriya <shriyak@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
DD2.1 does not have to save MMCR0 for all state-loss idle states,
only after deep idle states (like other PMU registers).
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
DD2.1 does not have to flush the ERAT after a state-loss idle.
Performance testing was done on a DD2.1 using only the stop0 idle state
(the shallowest state which supports state loss), using context_switch
selftest configured to ping-poing between two threads on the same core
and two different cores.
Performance improvement for same core is 7.0%, different cores is 14.8%.
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
After handling a transactional FP, Altivec or VSX unavailable exception.
The return to userspace code will detect that the TIF_RESTORE_TM bit is
set and call restore_tm_state(). restore_tm_state() will call
restore_math() to ensure that the correct facilities are loaded.
This means that all the loadup code in {fp,altivec,vsx}_unavailable_tm()
is doing pointless work and can simply be removed.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Lazy save and restore of FP/Altivec means that a userspace process can
be sent to userspace with FP or Altivec disabled and loaded only as
required (by way of an FP/Altivec unavailable exception). Transactional
Memory complicates this situation as a transaction could be started
without FP/Altivec being loaded up. This causes the hardware to
checkpoint incorrect registers. Handling FP/Altivec unavailable
exceptions while a thread is transactional requires a reclaim and
recheckpoint to ensure the CPU has correct state for both sets of
registers.
tm_reclaim() has optimisations to not always save the FP/Altivec
registers to the checkpointed save area. This was originally done
because the caller might have information that the checkpointed
registers aren't valid due to lazy save and restore. We've also been a
little vague as to how tm_reclaim() leaves the FP/Altivec state since it
doesn't necessarily always save it to the thread struct. This has lead
to an (incorrect) assumption that it leaves the checkpointed state on
the CPU.
tm_recheckpoint() has similar optimisations in reverse. It may not
always reload the checkpointed FP/Altivec registers from the thread
struct before the trecheckpoint. It is therefore quite unclear where it
expects to get the state from. This didn't help with the assumption
made about tm_reclaim().
These optimisations sit in what is by definition a slow path. If a
process has to go through a reclaim/recheckpoint then its transaction
will be doomed on returning to userspace. This mean that the process
will be unable to complete its transaction and be forced to its failure
handler. This is already an out if line case for userspace. Furthermore,
the cost of copying 64 times 128 bits from registers isn't very long[0]
(at all) on modern processors. As such it appears these optimisations
have only served to increase code complexity and are unlikely to have
had a measurable performance impact.
Our transactional memory handling has been riddled with bugs. A cause
of this has been difficulty in following the code flow, code complexity
has not been our friend here. It makes sense to remove these
optimisations in favour of a (hopefully) more stable implementation.
This patch does mean that some times the assembly will needlessly save
'junk' registers which will subsequently get overwritten with the
correct value by the C code which calls the assembly function. This
small inefficiency is far outweighed by the reduction in complexity for
general TM code, context switching paths, and transactional facility
unavailable exception handler.
0: I tried to measure it once for other work and found that it was
hiding in the noise of everything else I was working with. I find it
exceedingly likely this will be the case here.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Lazy save and restore of FP/Altivec means that a userspace process can
be sent to userspace with FP or Altivec disabled and loaded only as
required (by way of an FP/Altivec unavailable exception). Transactional
Memory complicates this situation as a transaction could be started
without FP/Altivec being loaded up. This causes the hardware to
checkpoint incorrect registers. Handling FP/Altivec unavailable
exceptions while a thread is transactional requires a reclaim and
recheckpoint to ensure the CPU has correct state for both sets of
registers.
tm_reclaim() has optimisations to not always save the FP/Altivec
registers to the checkpointed save area. This was originally done
because the caller might have information that the checkpointed
registers aren't valid due to lazy save and restore. We've also been a
little vague as to how tm_reclaim() leaves the FP/Altivec state since it
doesn't necessarily always save it to the thread struct. This has lead
to an (incorrect) assumption that it leaves the checkpointed state on
the CPU.
tm_recheckpoint() has similar optimisations in reverse. It may not
always reload the checkpointed FP/Altivec registers from the thread
struct before the trecheckpoint. It is therefore quite unclear where it
expects to get the state from. This didn't help with the assumption
made about tm_reclaim().
This patch is a minimal fix for ease of backporting. A more correct fix
which removes the msr parameter to tm_reclaim() and tm_recheckpoint()
altogether has been upstreamed to apply on top of this patch.
Fixes: dc3106690b ("powerpc: tm: Always use fp_state and vr_state to
store live registers")
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Lazy save and restore of FP/Altivec means that a userspace process can
be sent to userspace with FP or Altivec disabled and loaded only as
required (by way of an FP/Altivec unavailable exception). Transactional
Memory complicates this situation as a transaction could be started
without FP/Altivec being loaded up. This causes the hardware to
checkpoint incorrect registers. Handling FP/Altivec unavailable
exceptions while a thread is transactional requires a reclaim and
recheckpoint to ensure the CPU has correct state for both sets of
registers.
Lazy save and restore of FP/Altivec cannot be done if a process is
transactional. If a facility was enabled it must remain enabled whenever
a thread is transactional.
Commit dc16b553c9 ("powerpc: Always restore FPU/VEC/VSX if hardware
transactional memory in use") ensures that the facilities are always
enabled if a thread is transactional. A bug in the introduced code may
cause it to inadvertently enable a facility that was (and should remain)
disabled. The problem with this extraneous enablement is that the
registers for the erroneously enabled facility have not been correctly
recheckpointed - the recheckpointing code assumed the facility would
remain disabled.
Further compounding the issue, the transactional {fp,altivec,vsx}
unavailable code has been incorrectly using the MSR to enable
facilities. The presence of the {FP,VEC,VSX} bit in the regs->msr simply
means if the registers are live on the CPU, not if the kernel should
load them before returning to userspace. This has worked due to the bug
mentioned above.
This causes transactional threads which return to their failure handler
to observe incorrect checkpointed registers. Perhaps an example will
help illustrate the problem:
A userspace process is running and uses both FP and Altivec registers.
This process then continues to run for some time without touching
either sets of registers. The kernel subsequently disables the
facilities as part of lazy save and restore. The userspace process then
performs a tbegin and the CPU checkpoints 'junk' FP and Altivec
registers. The process then performs a floating point instruction
triggering a fp unavailable exception in the kernel.
The kernel then loads the FP registers - and only the FP registers.
Since the thread is transactional it must perform a reclaim and
recheckpoint to ensure both the checkpointed registers and the
transactional registers are correct. It then (correctly) enables
MSR[FP] for the process. Later (on exception exist) the kernel also
(inadvertently) enables MSR[VEC]. The process is then returned to
userspace.
Since the act of loading the FP registers doomed the transaction we know
CPU will fail the transaction, restore its checkpointed registers, and
return the process to its failure handler. The problem is that we're
now running with Altivec enabled and the 'junk' checkpointed registers
are restored. The kernel had only recheckpointed FP.
This patch solves this by only activating FP/Altivec if userspace was
using them when it entered the kernel and not simply if the process is
transactional.
Fixes: dc16b553c9 ("powerpc: Always restore FPU/VEC/VSX if hardware
transactional memory in use")
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Also export opal_error_code() so that it can be used in modules
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds an _interruptible version of opal_async_wait_response().
This is useful when a long running OPAL call is performed on behalf of
a userspace thread, for example, the opal_flash_{read,write,erase}
functions performed by the powernv-flash MTD driver.
It is foreseeable that these functions would take upwards of two
minutes causing the wait_event() to block long enough to cause hung
task warnings. Furthermore, wait_event_interruptible() is preferable
as otherwise there is no way for signals to stop the process which is
going to be confusing in userspace.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Parallel sensor reads could run out of async tokens due to
opal_get_sensor_data grabbing tokens but then doing the sensor
read behind a mutex, essentially serializing the (possibly
asynchronous and relatively slow) sensor read.
It turns out that the mutex isn't needed at all, not only
should the OPAL interface allow concurrent reads, the implementation
is certainly safe for that, and if any sensor we were reading
from somewhere isn't, doing the mutual exclusion in the kernel
is the wrong place to do it, OPAL should be doing it for the kernel.
So, remove the mutex.
Additionally, we shouldn't be printing out an error when we don't
get a token as the only way this should happen is if we've been
interrupted in down_interruptible() on the semaphore.
Reported-by: Robert Lippert <rlippert@google.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Future work will add an opal_async_wait_response_interruptible()
which will call wait_event_interruptible(). This work requires extra
token state to be tracked as wait_event_interruptible() can return and
the caller could release the token before OPAL responds.
Currently token state is tracked with two bitfields which are 64 bits
big but may not need to be as OPAL informs Linux how many async tokens
there are. It also uses an array indexed by token to store response
messages for each token.
The bitfields make it difficult to add more state and also provide a
hard maximum as to how many tokens there can be - it is possible that
OPAL will inform Linux that there are more than 64 tokens.
Rather than add a bitfield to track the extra state, rework the
internals slightly.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Fix __opal_async_get_token() when no tokens are free]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are no callers of both __opal_async_get_token() and
__opal_async_release_token().
This patch also removes the possibility of "emergency through
synchronous call to __opal_async_get_token()" as such it makes more
sense to initialise opal_sync_sem for the maximum number of async
tokens.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The current code checks the completion map to look for the first token
that is complete. In some cases, a completion can come in but the
token can still be on lease to the caller processing the completion.
If this completed but unreleased token is the first token found in the
bitmap by another tasks trying to acquire a token, then the
__test_and_set_bit call will fail since the token will still be on
lease. The acquisition will then fail with an EBUSY.
This patch reorganizes the acquisition code to look at the
opal_async_token_map for an unleased token. If the token has no lease
it must have no outstanding completions so we should never see an
EBUSY, unless we have leased out too many tokens. Since
opal_async_get_token_inrerruptible is protected by a semaphore, we
will practically never see EBUSY anymore.
Fixes: 8d72482322 ("powerpc/powernv: Infrastructure to support OPAL async completion")
Signed-off-by: William A. Kennington III <wak@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This interface is inefficient and deprecated because of the y2038
overflow.
ktime_get_seconds() is an appropriate replacement here, since it
has sufficient granularity but is more efficient and uses monotonic
time.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Take advantage of stack_depth tracking, originally introduced for
x64, in powerpc JIT as well. Round up allocated stack by 16 bytes
to make sure it stays aligned for functions called from JITed bpf
program.
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently when we take a TM Bad Thing program check exception, we
search the bug table to see if the program check was generated by a
WARN/WARN_ON etc.
That makes no sense, the WARN macros use trap instructions, which
should never generate a TM Bad Thing exception. If they ever did that
would be a bug and we should oops.
We do have some hand-coded bugs in tm.S, using EMIT_BUG_ENTRY, but
those are all BUGs not WARNs, and they all use trap instructions
anyway. Almost certainly this check was incorrectly copied from the
REASON_TRAP handling in the same function.
Remove it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently if the hardware supports the radix MMU we will use
it, *unless* "disable_radix" is passed on the kernel command line.
However some users would like the reverse semantics. ie. The kernel
uses the hash MMU by default, unless radix is explicitly requested on
the command line.
So add a CONFIG option to choose whether we use radix by default or
not, and expand the disable_radix command line option to allow
"disable_radix=no" which *enables* radix.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
CONFIG_PPC_STD_MMU_64 indicates support for the "standard" powerpc MMU
on 64-bit CPUs. The "standard" MMU refers to the hash page table MMU
found in "server" processors, from IBM mainly.
Currently CONFIG_PPC_STD_MMU_64 is == CONFIG_PPC_BOOK3S_64. While it's
annoying to have two symbols that always have the same value, it's not
quite annoying enough to bother removing one.
However with the arrival of Power9, we now have the situation where
CONFIG_PPC_STD_MMU_64 is enabled, but the kernel is running using the
Radix MMU - *not* the "standard" MMU. So it is now actively confusing
to use it, because it implies that code is disabled or inactive when
the Radix MMU is in use, however that is not necessarily true.
So s/CONFIG_PPC_STD_MMU_64/CONFIG_PPC_BOOK3S_64/, and do some minor
formatting updates of some of the affected lines.
This will be a pain for backports, but c'est la vie.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The last user of CPU_FTR_ICSWX was removed in commit
6ff4d3e966 ("powerpc: Remove old unused icswx based coprocessor
support"), so free the bit up for future use.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
IPIC Status is provided by register IPIC_SERSR and not by IPIC_SERMR
which is the mask register.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to make generic IOV code work, the physical function IOV BAR
should start from offset of the first VF. Since M64 segments share
PE number space across PHB, and some PEs may be in use at the time
when IOV is enabled, the existing code shifts the IOV BAR to the index
of the first PE/VF. This creates a hole in IOMEM space which can be
potentially taken by some other device.
This reserves a temporary hole on a parent and releases it when IOV is
disabled; the temporary resources are stored in pci_dn to avoid
kmalloc/free.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When a vdevice is DLPAR removed from the system the vio subsystem
doesn't bother unmapping the virq from the irq_domain. As a result we
have a virq mapped to a hardware irq that is no longer valid for the
irq_domain. A side effect is that we are left with /proc/irq/<irq#>
affinity entries, and attempts to modify the smp_affinity of the irq
will fail.
In the following observed example the kernel log is spammed by
ics_rtas_set_affinity errors after the removal of a VSCSI adapter.
This is a result of irqbalance trying to adjust the affinity every 10
seconds.
rpadlpar_io: slot U8408.E8E.10A7ACV-V5-C25 removed
ics_rtas_set_affinity: ibm,set-xive irq=655385 returns -3
ics_rtas_set_affinity: ibm,set-xive irq=655385 returns -3
This patch fixes the issue by calling irq_dispose_mapping() on the
virq of the viodev on unregister.
Fixes: f2ab621996 ("powerpc/pseries: Add PFO support to the VIO bus")
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
According to the architecture, the process table entry cache must be
flushed with tlbie RIC=2.
Currently the process table entry is set to invalid right before the
PID is returned to the allocator, with no invalidation. This works on
existing implementations that are known to not cache the process table
entry for any except the current PIDR.
It is architecturally correct and cleaner to invalidate with RIC=2
after clearing the process table entry and before the PID is returned
to the allocator. This can be done in arch_exit_mmap that runs before
the final flush, and to ensure the final flush (fullmm) is always a
RIC=2 variant.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Preempt should be consistently disabled for mm_is_thread_local tests,
so bring the rest of these under preempt_disable().
Preempt does not need to be disabled for the mm->context.id tests,
which allows simplification and removal of gotos.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Close the recoverability gap for OPAL calls by using FIXUP_ENDIAN_HV
in the return path.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add an HV variant of FIXUP_ENDIAN which uses HSRR[01] and does not
clear MSR[RI], which improves recoverability.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When returning from an exception to a soft-enabled context, pending
IRQs are replayed but IRQ tracing is not reset, so a number of them
can get chained together into the same IRQ-disabled trace.
Fix this by having __check_irq_replay re-set IRQ trace. This is
conceptually where we respond to the next interrupt, so it fits the
semantics of the IRQ tracer.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If the host takes a system reset interrupt while a guest is running,
the CPU must exit the guest before processing the host exception
handler.
After this patch, taking a sysrq+x with a CPU running in a guest
gives a trace like this:
cpu 0x27: Vector: 100 (System Reset) at [c000000fdf5776f0]
pc: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
lr: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
sp: c000000fdf577850
msr: 9000000002803033
current = 0xc000000fdf4b1e00
paca = 0xc00000000fd4d680 softe: 3 irq_happened: 0x01
pid = 6608, comm = qemu-system-ppc
Linux version 4.14.0-rc7-01489-g47e1893a404a-dirty #26 SMP
[c000000fdf577a00] c008000010159dd4 kvmppc_vcpu_run_hv+0x3dc/0x12d0 [kvm_hv]
[c000000fdf577b30] c0080000100a537c kvmppc_vcpu_run+0x44/0x60 [kvm]
[c000000fdf577b60] c0080000100a1ae0 kvm_arch_vcpu_ioctl_run+0x118/0x310 [kvm]
[c000000fdf577c00] c008000010093e98 kvm_vcpu_ioctl+0x530/0x7c0 [kvm]
[c000000fdf577d50] c000000000357bf8 do_vfs_ioctl+0xd8/0x8c0
[c000000fdf577df0] c000000000358448 SyS_ioctl+0x68/0x100
[c000000fdf577e30] c00000000000b220 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00007fff76868df0
SP (7fff7069baf0) is in userspace
Fixes: e36d0a2ed5 ("powerpc/powernv: Implement NMI IPI with OPAL_SIGNAL_SYSTEM_RESET")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A fix to the handling of misaligned paste instructions (P9 only), where a change
to a #define has caused the check for the instruction to always fail.
The preempt handling was unbalanced in the radix THP flush (P9 only). Though we
don't generally use preempt we want to keep it working as much as possible.
Two fixes for IMC (P9 only), one when booting with restricted number of CPUs and
one in the error handling when initialisation fails due to firmware etc.
A revert to fix function_graph on big endian machines, and then a rework of the
reverted patch to fix kprobes blacklist handling on big endian machines.
Thanks to:
Anju T Sudhakar, Guilherme G. Piccoli, Madhavan Srinivasan, Naveen N. Rao,
Nicholas Piggin, Paul Mackerras.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZ/ASDAAoJEFHr6jzI4aWAtWYP/34uCIJ83roZHfBdASGcczeO
AEOnxZdVdAB6ynU0eM2JSlJMhx39mqT5qJQP4ucUlqNa+SHhFxTW4EmoQm/qAr9N
4xhwix6OAsMleiPJx+RGGAnKB8rFhUEYH/+cql67FAq3u8zRbAdurR/A1lnIAY9x
2bQ85GsjFlRSFdmSohXegBJ7Xey3QE3jpj+co3dAEm3JQVRHNpGqgvRy8yeowvod
c/hHH7vmdKyDCZDtS5uF/egZOJLK1mueVuV69O0KKxT7xtEKk1cjFB08aHhcuJDm
QIk5Efr8iCbn4EyRBzthij7YJ0VPBTNCmsGcf6RorrYv3hT0uN72qQPjg13WH7HY
wkdMtkQvaFLbnZDqOxm9saczE27Ce7QNe1PNh6eapNlunz2nsb8Z/SvSyqskK3Yh
5cBxHrfi+NcJHvdDAJNobNoJ3vlTJGI0qQ/dSexJsXxZw2zR3tPojRKCtK0pZGql
xKDSqU7AbuUhLMxVBDDSQ+KONwyaBS84baLc98dapflw7htGWU339c5f9JTmnGvZ
7fWT3hNihAmZNZZ957BC5J0h5kgkKgZ7y67SA/+RI0o2ZoJlxoeDYAyqc7dwfvs+
ncxG4S6okSQP/EHf9bRSdYeJsZup/zP7+wD+02cbJd5zFJ+6cfu0AhPDRcBNMjHg
EPkZvfZzRxTlGvmHCoWv
=vTi1
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 4.14.
This is bigger than I like to send at rc7, but that's at least partly
because I didn't send any fixes last week. If it wasn't for the IMC
driver, which is new and getting heavy testing, the diffstat would
look a bit better. I've also added ftrace on big endian to my test
suite, so we shouldn't break that again in future.
- A fix to the handling of misaligned paste instructions (P9 only),
where a change to a #define has caused the check for the
instruction to always fail.
- The preempt handling was unbalanced in the radix THP flush (P9
only). Though we don't generally use preempt we want to keep it
working as much as possible.
- Two fixes for IMC (P9 only), one when booting with restricted
number of CPUs and one in the error handling when initialisation
fails due to firmware etc.
- A revert to fix function_graph on big endian machines, and then a
rework of the reverted patch to fix kprobes blacklist handling on
big endian machines.
Thanks to: Anju T Sudhakar, Guilherme G. Piccoli, Madhavan Srinivasan,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras"
* tag 'powerpc-4.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/perf: Fix core-imc hotplug callback failure during imc initialization
powerpc/kprobes: Dereference function pointers only if the address does not belong to kernel text
Revert "powerpc64/elfv1: Only dereference function descriptor for non-text symbols"
powerpc/64s/radix: Fix preempt imbalance in TLB flush
powerpc: Fix check for copy/paste instructions in alignment handler
powerpc/perf: Fix IMC allocation routine
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Call trace observed during boot:
nest_capp0_imc performance monitor hardware support registered
nest_capp1_imc performance monitor hardware support registered
core_imc memory allocation for cpu 56 failed
Unable to handle kernel paging request for data at address 0xffa400010
Faulting instruction address: 0xc000000000bf3294
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c000000ff38ff8d0]
pc: c000000000bf3294: mutex_lock+0x34/0x90
lr: c000000000bf3288: mutex_lock+0x28/0x90
sp: c000000ff38ffb50
msr: 9000000002009033
dar: ffa400010
dsisr: 80000
current = 0xc000000ff383de00
paca = 0xc000000007ae0000 softe: 0 irq_happened: 0x01
pid = 13, comm = cpuhp/0
Linux version 4.11.0-39.el7a.ppc64le (mockbuild@ppc-058.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Oct 3 07:42:44 EDT 2017
0:mon> t
[c000000ff38ffb80] c0000000002ddfac perf_pmu_migrate_context+0xac/0x470
[c000000ff38ffc40] c00000000011385c ppc_core_imc_cpu_offline+0x1ac/0x1e0
[c000000ff38ffc90] c000000000125758 cpuhp_invoke_callback+0x198/0x5d0
[c000000ff38ffd00] c00000000012782c cpuhp_thread_fun+0x8c/0x3d0
[c000000ff38ffd60] c0000000001678d0 smpboot_thread_fn+0x290/0x2a0
[c000000ff38ffdc0] c00000000015ee78 kthread+0x168/0x1b0
[c000000ff38ffe30] c00000000000b368 ret_from_kernel_thread+0x5c/0x74
While registering the cpuhoplug callbacks for core-imc, if we fails
in the cpuhotplug online path for any random core (either because opal call to
initialize the core-imc counters fails or because memory allocation fails for
that core), ppc_core_imc_cpu_offline() will get invoked for other cpus who
successfully returned from cpuhotplug online path.
But in the ppc_core_imc_cpu_offline() path we are trying to migrate the event
context, when core-imc counters are not even initialized. Thus creating the
above stack dump.
Add a check to see if core-imc counters are enabled or not in the cpuhotplug
offline path before migrating the context to handle this failing scenario.
Fixes: 885dcd709b ("powerpc/perf: Add nest IMC PMU support")
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWfswbQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykvEwCfXU1MuYFQGgMdDmAZXEc+xFXZvqgAoKEcHDNA
6dVh26uchcEQLN/XqUDt
=x306
-----END PGP SIGNATURE-----
Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull initial SPDX identifiers from Greg KH:
"License cleanup: add SPDX license identifiers to some files
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the
'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
binding shorthand, which can be used instead of the full boiler plate
text.
This patch is based on work done by Thomas Gleixner and Kate Stewart
and Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset
of the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to
license had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied
to a file was done in a spreadsheet of side by side results from of
the output of two independent scanners (ScanCode & Windriver)
producing SPDX tag:value files created by Philippe Ombredanne.
Philippe prepared the base worksheet, and did an initial spot review
of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537
files assessed. Kate Stewart did a file by file comparison of the
scanner results in the spreadsheet to determine which SPDX license
identifier(s) to be applied to the file. She confirmed any
determination that was not immediately clear with lawyers working with
the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained
>5 lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that
became the concluded license(s).
- when there was disagreement between the two scanners (one detected
a license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply
(and which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases,
confirmation by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.
The Windriver scanner is based on an older version of FOSSology in
part, so they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot
checks in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect
the correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial
patch version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch
license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
applied SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
License cleanup: add SPDX license identifier to uapi header files with a license
License cleanup: add SPDX license identifier to uapi header files with no license
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many user space API headers have licensing information, which is either
incomplete, badly formatted or just a shorthand for referring to the
license under which the file is supposed to be. This makes it hard for
compliance tools to determine the correct license.
Update these files with an SPDX license identifier. The identifier was
chosen based on the license information in the file.
GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license
identifier with the added 'WITH Linux-syscall-note' exception, which is
the officially assigned exception identifier for the kernel syscall
exception:
NOTE! This copyright does *not* cover user programs that use kernel
services by normal system calls - this is merely considered normal use
of the kernel, and does *not* fall under the heading of "derived work".
This exception makes it possible to include GPL headers into non GPL
code, without confusing license compliance tools.
Headers which have either explicit dual licensing or are just licensed
under a non GPL license are updated with the corresponding SPDX
identifier and the GPLv2 with syscall exception identifier. The format
is:
((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE)
SPDX license identifiers are a legally binding shorthand, which can be
used instead of the full boiler plate text. The update does not remove
existing license information as this has to be done on a case by case
basis and the copyright holders might have to be consulted. This will
happen in a separate step.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne. See the previous patch in this series for the
methodology of how this patch was researched.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Many user space API headers are missing licensing information, which
makes it hard for compliance tools to determine the correct license.
By default are files without license information under the default
license of the kernel, which is GPLV2. Marking them GPLV2 would exclude
them from being included in non GPLV2 code, which is obviously not
intended. The user space API headers fall under the syscall exception
which is in the kernels COPYING file:
NOTE! This copyright does *not* cover user programs that use kernel
services by normal system calls - this is merely considered normal use
of the kernel, and does *not* fall under the heading of "derived work".
otherwise syscall usage would not be possible.
Update the files which contain no license information with an SPDX
license identifier. The chosen identifier is 'GPL-2.0 WITH
Linux-syscall-note' which is the officially assigned identifier for the
Linux syscall exception. SPDX license identifiers are a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne. See the previous patch in this series for the
methodology of how this patch was researched.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZ9kEFAAoJEHm+PkMAQRiGw6wH/0j197qyGd0hkVFMJO6LAgN3
KQWS4nZ5BkVDocwv0RVnUJTtXqU1eozFgdVEtSoaFXpzlHGuptR2Tau9efDCJ7w3
/utZxqvhGebZd2T+j+/o/LE8BRQxhADBNJq2D/o0WNt8ecxuG0GIkhkEYt/o3z1v
/sxlwVwzXB7Dc/h1WcgGJG7cS6L9KzzAzGAS/iNvdFrPOygHBv8c0MxVZIiBIeeK
1nZdyvbyM8uenSyG+prGt9ENrqXZxxfwUxIchi2V7A9m1WmD5zijNkf1JCWji/O+
UsA1auxna7MwoxjxqZuGm4MlKOwZ+8xutk4JGgc+aP/ulndJbJYu+4op/3vaFBM=
=Mhx+
-----END PGP SIGNATURE-----
Backmerge tag 'v4.14-rc7' into drm-next
Linux 4.14-rc7
Requested by Ben Skeggs for nouveau to avoid major conflicts,
and things were getting a bit conflicty already, esp around amdgpu
reverts.
This makes the changes introduced in commit 83e840c770
("powerpc64/elfv1: Only dereference function descriptor for non-text
symbols") to be specific to the kprobe subsystem.
We previously changed ppc_function_entry() to always check the provided
address to confirm if it needed to be dereferenced. This is actually
only an issue for kprobe blacklisted asm labels (through use of
_ASM_NOKPROBE_SYMBOL) and can cause other issues with ftrace. Also, the
additional checks are not really necessary for our other uses.
As such, move this check to the kprobes subsystem.
Fixes: 83e840c770 ("powerpc64/elfv1: Only dereference function descriptor for non-text symbols")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reverts commit 83e840c770 ("powerpc64/elfv1: Only dereference
function descriptor for non-text symbols").
Chandan reported that on newer kernels, trying to enable function_graph
tracer on ppc64 (BE) locks up the system with the following trace:
Unable to handle kernel paging request for data at address 0x600000002fa30010
Faulting instruction address: 0xc0000000001f1300
Thread overran stack, or stack corrupted
Oops: Kernel access of bad area, sig: 11 [#1]
BE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
Modules linked in:
CPU: 1 PID: 6586 Comm: bash Not tainted 4.14.0-rc3-00162-g6e51f1f-dirty #20
task: c000000625c07200 task.stack: c000000625c07310
NIP: c0000000001f1300 LR: c000000000121cac CTR: c000000000061af8
REGS: c000000625c088c0 TRAP: 0380 Not tainted (4.14.0-rc3-00162-g6e51f1f-dirty)
MSR: 8000000000001032 <SF,ME,IR,DR,RI> CR: 28002848 XER: 00000000
CFAR: c0000000001f1320 SOFTE: 0
...
NIP [c0000000001f1300] .__is_insn_slot_addr+0x30/0x90
LR [c000000000121cac] .kernel_text_address+0x18c/0x1c0
Call Trace:
[c000000625c08b40] [c0000000001bd040] .is_module_text_address+0x20/0x40 (unreliable)
[c000000625c08bc0] [c000000000121cac] .kernel_text_address+0x18c/0x1c0
[c000000625c08c50] [c000000000061960] .prepare_ftrace_return+0x50/0x130
[c000000625c08cf0] [c000000000061b10] .ftrace_graph_caller+0x14/0x34
[c000000625c08d60] [c000000000121b40] .kernel_text_address+0x20/0x1c0
[c000000625c08df0] [c000000000061960] .prepare_ftrace_return+0x50/0x130
...
[c000000625c0ab30] [c000000000061960] .prepare_ftrace_return+0x50/0x130
[c000000625c0abd0] [c000000000061b10] .ftrace_graph_caller+0x14/0x34
[c000000625c0ac40] [c000000000121b40] .kernel_text_address+0x20/0x1c0
[c000000625c0acd0] [c000000000061960] .prepare_ftrace_return+0x50/0x130
[c000000625c0ad70] [c000000000061b10] .ftrace_graph_caller+0x14/0x34
[c000000625c0ade0] [c000000000121b40] .kernel_text_address+0x20/0x1c0
This is because ftrace is using ppc_function_entry() for obtaining the
address of return_to_handler() in prepare_ftrace_return(). The call to
kernel_text_address() itself gets traced and we end up in a recursive
loop.
Fixes: 83e840c770 ("powerpc64/elfv1: Only dereference function descriptor for non-text symbols")
Cc: stable@vger.kernel.org # v4.13+
Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch removes the restriction that a radix host can only run
radix guests, allowing us to run HPT (hashed page table) guests as
well. This is useful because it provides a way to run old guest
kernels that know about POWER8 but not POWER9.
Unfortunately, POWER9 currently has a restriction that all threads
in a given code must either all be in HPT mode, or all in radix mode.
This means that when entering a HPT guest, we have to obtain control
of all 4 threads in the core and get them to switch their LPIDR and
LPCR registers, even if they are not going to run a guest. On guest
exit we also have to get all threads to switch LPIDR and LPCR back
to host values.
To make this feasible, we require that KVM not be in the "independent
threads" mode, and that the CPU cores be in single-threaded mode from
the host kernel's perspective (only thread 0 online; threads 1, 2 and
3 offline). That allows us to use the same code as on POWER8 for
obtaining control of the secondary threads.
To manage the LPCR/LPIDR changes required, we extend the kvm_split_info
struct to contain the information needed by the secondary threads.
All threads perform a barrier synchronization (where all threads wait
for every other thread to reach the synchronization point) on guest
entry, both before and after loading LPCR and LPIDR. On guest exit,
they all once again perform a barrier synchronization both before
and after loading host values into LPCR and LPIDR.
Finally, it is also currently necessary to flush the entire TLB every
time we enter a HPT guest on a radix host. We do this on thread 0
with a loop of tlbiel instructions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch allows for a mode on POWER9 hosts where we control all the
threads of a core, much as we do on POWER8. The mode is controlled by
a module parameter on the kvm_hv module, called "indep_threads_mode".
The normal mode on POWER9 is the "independent threads" mode, with
indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any
desired SMT mode) and each thread independently enters and exits from
KVM guests without reference to what other threads in the core are
doing.
If indep_threads_mode is set to N at the point when a VM is started,
KVM will expect every core that the guest runs on to be in single
threaded mode (that is, threads 1, 2 and 3 offline), and will set the
flag that prevents secondary threads from coming online. We can still
use all four threads; the code that implements dynamic micro-threading
on POWER8 will become active in over-commit situations and will allow
up to three other VCPUs to be run on the secondary threads of the core
whenever a VCPU is run.
The reason for wanting this mode is that this will allow us to run HPT
guests on a radix host on a POWER9 machine that does not support
"mixed mode", that is, having some threads in a core be in HPT mode
while other threads are in radix mode. It will also make it possible
to implement a "strict threads" mode in future, if desired.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This sets up the machinery for switching a guest between HPT (hashed
page table) and radix MMU modes, so that in future we can run a HPT
guest on a radix host on POWER9 machines.
* The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or
radix mode, on a radix host.
* The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9
with HV KVM on a radix host.
* The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a
radix host.
* The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the
guest to HPT mode and allocate a HPT.
* For simplicity, we now allocate the rmap array for each memslot,
even on a radix host, since it will be needed if the guest switches
to HPT mode.
* Since we cannot yet run a HPT guest on a radix host, the KVM_RUN
ioctl will return an EINVAL error in that case.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>