Use a function for byte swapping instructions in preparation of a more
complicated instruction type.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Balamuruhan S <bala24@linux.ibm.com>
Link: https://lore.kernel.org/r/20200506034050.24806-10-jniethe5@gmail.com
In preparation for using a data type for instructions that can not be
directly used with the '>>' operator use a function for getting the op
code of an instruction.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-9-jniethe5@gmail.com
In preparation for introducing a more complicated instruction type to
accommodate prefixed instructions use an accessor for getting an
instruction as a u32.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-8-jniethe5@gmail.com
In preparation for instructions having a more complex data type start
using a macro, ppc_inst(), for making an instruction out of a u32. A
macro is used so that instructions can be used as initializer elements.
Currently this does nothing, but it will allow for creating a data type
that can represent prefixed instructions.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
[mpe: Change include guard to _ASM_POWERPC_INST_H]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-7-jniethe5@gmail.com
create_branch(), create_cond_branch() and translate_branch() return the
instruction that they create, or return 0 to signal an error. Separate
these concerns in preparation for an instruction type that is not just
an unsigned int. Fill the created instruction to a pointer passed as
the first parameter to the function and use a non-zero return value to
signify an error.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-6-jniethe5@gmail.com
A modulo operation is used for calculating the current offset from a
breakpoint within the breakpoint table. As instruction lengths are
always a power of 2, this can be replaced with a bitwise 'and'. The
current check for word alignment can be replaced with checking that the
lower 2 bits are not set.
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-5-jniethe5@gmail.com
The instructions for xmon's breakpoint are stored bpt_table[] which is in
the data section. This is problematic as the data section may be marked
as no execute. Move bpt_table[] to the text section.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-4-jniethe5@gmail.com
To execute an instruction out of line after a breakpoint, the NIP is set
to the address of struct bpt::instr. Here a copy of the instruction that
was replaced with a breakpoint is kept, along with a trap so normal flow
can be resumed after XOLing. The struct bpt's are located within the
data section. This is problematic as the data section may be marked as
no execute.
Instead of each struct bpt holding the instructions to be XOL'd, make a
new array, bpt_table[], with enough space to hold instructions for the
number of supported breakpoints. A later patch will move this to the
text section.
Make struct bpt::instr a pointer to the instructions in bpt_table[]
associated with that breakpoint. This association is a simple mapping:
bpts[n] -> bpt_table[n * words per breakpoint]. Currently we only need
the copied instruction followed by a trap, so 2 words per breakpoint.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-3-jniethe5@gmail.com
For modifying instructions in xmon, patch_instruction() can serve the
same role that store_inst() is performing with the advantage of not
being specific to xmon. In some places patch_instruction() is already
being using followed by store_inst(). In these cases just remove the
store_inst(). Otherwise replace store_inst() with patch_instruction().
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20200506034050.24806-2-jniethe5@gmail.com
The ps3_mm_region_destroy() and ps3_mm_vas_destroy() routines
are called very late in the shutdown via kexec's mmu_cleanup_all
routine. By the time mmu_cleanup_all runs it is too late to use
udbg_printf, and calling it will cause PS3 systems to hang.
Remove all debugging statements from ps3_mm_region_destroy() and
ps3_mm_vas_destroy() and replace any error reporting with calls
to lv1_panic.
With this change builds with 'DEBUG' defined will not cause kexec
reboots to hang, and builds with 'DEBUG' defined or not will end
in lv1_panic if an error is encountered.
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7325c4af2b4c989c19d6a26b90b1fec9c0615ddf.1589049250.git.geoff@infradead.org
System Reset and Machine Check interrupts that are not recoverable due
to being nested or interrupting when RI=0 currently panic. This is not
necessary, and can often just kill the current context and recover.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Link: https://lore.kernel.org/r/20200508043408.886394-16-npiggin@gmail.com
Similarly to the previous patch, do not trace system reset. This code
is used when there is a crash or hang, and tracing disturbs the system
more and has been known to crash in the crash handling path.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20200508043408.886394-15-npiggin@gmail.com
Rather than notrace annotations throughout a significant part of the
machine check code across kernel/ pseries/ and powernv/ which can
easily be broken and is infrequently tested, use paca->ftrace_enabled
to blanket-disable tracing of the real-mode non-maskable handler.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20200508043408.886394-14-npiggin@gmail.com
machine_check_early() is taken as an NMI, so nmi_enter() is used
there. machine_check_exception() is no longer taken as an NMI (it's
invoked via irq_work in the case a machine check hits in kernel mode),
so remove the nmi_enter() from that case.
In NMI context, hash faults don't try to refill the hash table, which
can lead to crashes accessing non-pinned kernel pages. System reset
still has this potential problem.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop change in show_regs() which breaks Book3E]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200508043408.886394-12-npiggin@gmail.com
With the previous patch, machine checks can use rtas_call_unlocked()
which avoids the RTAS spinlock which would deadlock if a machine
check hits while making an RTAS call.
This also avoids the complex RTAS error logging which has more RTAS
calls and includes kmalloc (which can return memory beyond RMA, which
would also crash).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200508043408.886394-11-npiggin@gmail.com
This allows rtas_args to be put on the machine check stack, which
avoids a lot of complications with re-entrancy deadlocks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200508043408.886394-10-npiggin@gmail.com
PAPR does not specify that fwnmi sreset should be interlocked, and
PowerVM (and therefore now QEMU) do not require it.
These "ibm,nmi-interlock" calls are ignored by firmware, but there
is a possibility that the sreset could have interrupted a machine
check and release the machine check's interlock too early, corrupting
it if another machine check came in.
This is an extremely rare case, but it should be fixed for clarity
and reducing the code executed in the sreset path. Firmware also
does not provide error information for the sreset case to look at, so
remove that comment.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Use __be64 to silence some sparse warnings]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200508043408.886394-9-npiggin@gmail.com
If there is some error with the fwnmi save area, r3 has already been
modified which doesn't help with debugging.
Only update r3 when to restore the saved value.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200508043408.886394-8-npiggin@gmail.com
This was discovered developing qemu fwnmi sreset support. This
off-by-one bug means the last 16 bytes of the rtas area can not
be used for a 16 byte save area.
It's not a serious bug, and QEMU implementation has to retain a
workaround for old kernels, but it's good to tighten it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200508043408.886394-7-npiggin@gmail.com
In the interest of reducing code and possible failures in the
machine check and system reset paths, grab the "ibm,nmi-interlock"
token at init time.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200508043408.886394-6-npiggin@gmail.com
A spare interrupt stack slot is needed to save irq state when
reconciling NMIs (sreset and decrementer soft-nmi). _DAR is used
for this, but we want to reconcile machine checks as well, which
do use _DAR. Switch to using RESULT instead, as it's used by
system calls.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200508043408.886394-4-npiggin@gmail.com
The architecture allows for machine check exceptions to cause idle
wakeups which resume at the 0x200 address which has to return via
the idle wakeup code, but the early machine check handler is run
first.
The case of a no state-loss sleep is broken because the early
handler uses non-volatile register r1 , which is needed for the wakeup
protocol, but it is not restored.
Fix this by loading r1 from the MCE exception frame before returning
to the idle wakeup code. Also update the comment which has become
stale since the idle rewrite in C.
This crash was found and fix confirmed with a machine check injection
test in qemu powernv model (which is not upstream in qemu yet).
Fixes: 10d91611f4 ("powerpc/64s: Reimplement book3s idle code in C")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200508043408.886394-2-npiggin@gmail.com
EEH device state is currently removed (by eeh_remove_device()) during
the device release handler, which is invoked as the device's reference
count drops to zero. This may take some time, or forever, as other
threads may hold references.
However, the PCI device state is released synchronously by
pci_stop_and_remove_bus_device(). This mismatch causes problems, for
example the device may be re-discovered as a new device before the
release handler has been called, leaving the PCI and EEH state
mismatched.
So instead, call eeh_remove_device() from the bus device removal
handlers, which are called synchronously in the removal path.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0a1f5105d3a33b1c090bba31de63eb0cdd25de7b.1588045502.git.sbobroff@linux.ibm.com
If a device is hot unplgged during EEH recovery, it's possible for the
RTAS call to ibm,configure-pe in pseries_eeh_configure() to return
parameter error (-3), however negative return values are not checked
for and this leads to an infinite loop.
Fix this by correctly bailing out on negative values.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Link: https://lore.kernel.org/r/1b0a6010a647dc915816e44845b64d72066676a7.1588045502.git.sbobroff@linux.ibm.com
Currently we don't report anything useful in /proc/<pid>/status:
$ grep Speculation_Store_Bypass /proc/self/status
Speculation_Store_Bypass: unknown
Our mitigation is currently always a barrier instruction, which
doesn't map that well onto the existing possibilities for the PR_SPEC
values.
However even if we added a "barrier" type PR_SPEC value, userspace
would still need to consult some other source to work out which type
of barrier to use. So reporting "vulnerable" seems sufficient, as
userspace can see that and then consult its source to determine what
barrier to use.
Signed-off-by: Gustavo Walbon <gwalbon@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200402124929.3574166-1-mpe@ellerman.id.au
There's no need to cast in task_pt_regs() as tsk->thread.regs should
already be a struct pt_regs. If someone's using task_pt_regs() on
something that's not a task but happens to have a thread.regs then
we'll deal with them later.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200428123152.73566-1-mpe@ellerman.id.au
Aneesh increased the size of struct pt_regs by 16 bytes and started
seeing this WARN_ON:
smp: Bringing up secondary CPUs ...
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at arch/powerpc/kernel/process.c:455 giveup_all+0xb4/0x110
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-rc2-gcc-8.2.0-1.g8f6a41f-default+ #318
NIP: c00000000001a2b4 LR: c00000000001a29c CTR: c0000000031d0000
REGS: c0000000026d3980 TRAP: 0700 Not tainted (5.7.0-rc2-gcc-8.2.0-1.g8f6a41f-default+)
MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48048224 XER: 00000000
CFAR: c000000000019cc8 IRQMASK: 1
GPR00: c00000000001a264 c0000000026d3c20 c0000000026d7200 800000000280b033
GPR04: 0000000000000001 0000000000000000 0000000000000077 30206d7372203164
GPR08: 0000000000002000 0000000002002000 800000000280b033 3230303030303030
GPR12: 0000000000008800 c0000000031d0000 0000000000800050 0000000002000066
GPR16: 000000000309a1a0 000000000309a4b0 000000000309a2d8 000000000309a890
GPR20: 00000000030d0098 c00000000264da40 00000000fd620000 c0000000ff798080
GPR24: c00000000264edf0 c0000001007469f0 00000000fd620000 c0000000020e5e90
GPR28: c00000000264edf0 c00000000264d200 000000001db60000 c00000000264d200
NIP [c00000000001a2b4] giveup_all+0xb4/0x110
LR [c00000000001a29c] giveup_all+0x9c/0x110
Call Trace:
[c0000000026d3c20] [c00000000001a264] giveup_all+0x64/0x110 (unreliable)
[c0000000026d3c90] [c00000000001ae34] __switch_to+0x104/0x480
[c0000000026d3cf0] [c000000000e0b8a0] __schedule+0x320/0x970
[c0000000026d3dd0] [c000000000e0c518] schedule_idle+0x38/0x70
[c0000000026d3df0] [c00000000019c7c8] do_idle+0x248/0x3f0
[c0000000026d3e70] [c00000000019cbb8] cpu_startup_entry+0x38/0x40
[c0000000026d3ea0] [c000000000011bb0] rest_init+0xe0/0xf8
[c0000000026d3ed0] [c000000002004820] start_kernel+0x990/0x9e0
[c0000000026d3f90] [c00000000000c49c] start_here_common+0x1c/0x400
Which was unexpected. The warning is checking the thread.regs->msr
value of the task we are switching from:
usermsr = tsk->thread.regs->msr;
...
WARN_ON((usermsr & MSR_VSX) && !((usermsr & MSR_FP) && (usermsr & MSR_VEC)));
ie. if MSR_VSX is set then both of MSR_FP and MSR_VEC are also set.
Dumping tsk->thread.regs->msr we see that it's: 0x1db60000
Which is not a normal looking MSR, in fact the only valid bit is
MSR_VSX, all the other bits are reserved in the current definition of
the MSR.
We can see from the oops that it was swapper/0 that we were switching
from when we hit the warning, ie. init_task. So its thread.regs points
to the base (high addresses) in init_stack.
Dumping the content of init_task->thread.regs, with the members of
pt_regs annotated (the 16 bytes larger version), we see:
0000000000000000 c000000002780080 gpr[0] gpr[1]
0000000000000000 c000000002666008 gpr[2] gpr[3]
c0000000026d3ed0 0000000000000078 gpr[4] gpr[5]
c000000000011b68 c000000002780080 gpr[6] gpr[7]
0000000000000000 0000000000000000 gpr[8] gpr[9]
c0000000026d3f90 0000800000002200 gpr[10] gpr[11]
c000000002004820 c0000000026d7200 gpr[12] gpr[13]
000000001db60000 c0000000010aabe8 gpr[14] gpr[15]
c0000000010aabe8 c0000000010aabe8 gpr[16] gpr[17]
c00000000294d598 0000000000000000 gpr[18] gpr[19]
0000000000000000 0000000000001ff8 gpr[20] gpr[21]
0000000000000000 c00000000206d608 gpr[22] gpr[23]
c00000000278e0cc 0000000000000000 gpr[24] gpr[25]
000000002fff0000 c000000000000000 gpr[26] gpr[27]
0000000002000000 0000000000000028 gpr[28] gpr[29]
000000001db60000 0000000004750000 gpr[30] gpr[31]
0000000002000000 000000001db60000 nip msr
0000000000000000 0000000000000000 orig_r3 ctr
c00000000000c49c 0000000000000000 link xer
0000000000000000 0000000000000000 ccr softe
0000000000000000 0000000000000000 trap dar
0000000000000000 0000000000000000 dsisr result
0000000000000000 0000000000000000 ppr kuap
0000000000000000 0000000000000000 pad[2] pad[3]
This looks suspiciously like stack frames, not a pt_regs. If we look
closely we can see return addresses from the stack trace above,
c000000002004820 (start_kernel) and c00000000000c49c (start_here_common).
init_task->thread.regs is setup at build time in processor.h:
#define INIT_THREAD { \
.ksp = INIT_SP, \
.regs = (struct pt_regs *)INIT_SP - 1, /* XXX bogus, I think */ \
The early boot code where we setup the initial stack is:
LOAD_REG_ADDR(r3,init_thread_union)
/* set up a stack pointer */
LOAD_REG_IMMEDIATE(r1,THREAD_SIZE)
add r1,r3,r1
li r0,0
stdu r0,-STACK_FRAME_OVERHEAD(r1)
Which creates a stack frame of size 112 bytes (STACK_FRAME_OVERHEAD).
Which is far too small to contain a pt_regs.
So the result is init_task->thread.regs is pointing at some stack
frames on the init stack, not at a pt_regs.
We have gotten away with this for so long because with pt_regs at its
current size the MSR happens to point into the first frame, at a
location that is not written to by the early asm. With the 16 byte
expansion the MSR falls into the second frame, which is used by the
compiler, and collides with a saved register that tends to be
non-zero.
As far as I can see this has been wrong since the original merge of
64-bit ppc support, back in 2002.
Conceptually swapper should have no regs, it never entered from
userspace, and in fact that's what we do on 32-bit. It's also
presumably what the "bogus" comment is referring to.
So I think the right fix is to just not-initialise regs at all. I'm
slightly worried this will break some code that isn't prepared for a
NULL regs, but we'll have to see.
Remove the comment in head_64.S which refers to us setting up the
regs (even though we never did), and is otherwise not really accurate
any more.
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200428123130.73078-1-mpe@ellerman.id.au
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507185755.GA15014@embeddedor
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507185749.GA14994@embeddedor
It's not very nice to zero trap for this, because then system calls no
longer have trap_is_syscall(regs) invariant, and we can't distinguish
between sc and scv system calls (in a later patch).
Take one last unused bit from the low bits of the pt_regs.trap word
for this instead. There is not a really good reason why it should be
in trap as opposed to another field, but trap has some concept of
flags and it exists. Ideally I think we would move trap to 2-byte
field and have 2 more bytes available independently.
Add a selftests case for this, which can be seen to fail if
trap_norestart() is changed to return false.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make them static inlines]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507121332.2233629-4-mpe@ellerman.id.au
A new system call interrupt will be added with a new trap number.
Hide the explicit 0xc00 test behind an accessor to reduce churn
in callers.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make it a static inline]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507121332.2233629-3-mpe@ellerman.id.au
The pt_regs.trap field keeps 4 low bits for some metadata about the
trap or how it was handled, which is masked off in order to test the
architectural trap number.
Add a set_trap() accessor to set this, equivalent to TRAP() for
returning it. This is actually not quite the equivalent of TRAP()
because it always clears the low bits, which may be harmless if
it can only be updated via ptrace syscall, but it seems dangerous.
In fact settting TRAP from ptrace doesn't seem like a great idea
so maybe it's better deleted.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make it a static inline rather than a shouty macro]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200507121332.2233629-2-mpe@ellerman.id.au
When CONFIG_VIRT_CPU_ACCOUNTING is selected, system call exception
handler doesn't fit below 0xd00 and build fails.
As exception 0xd00 doesn't exist and is never generated by 40x,
comment it out in order to get more space for system call exception.
Fixes: 9e27086292 ("powerpc/32: Warn and return ENOSYS on syscalls from kernel")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/633165d72f75b4ef4c0901aebe99d3915c93e9a2.1589043863.git.christophe.leroy@csgroup.eu
My 'pengutronix' address is defunct for years. Merge the entries and use
the proper contact address.
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200502142642.18979-1-wsa@kernel.org
Replace relaswap with built-in one, because relaswap
does a simple byte to byte swap.
Since Spectre mitigations have made indirect function calls more
expensive, and the default simple byte copies swap is implemented
without them, an "optimized" custom swap function is now
a waste of time as well as code.
Signed-off-by: Andrey Abramov <st5pub@yandex.ru>
Reviewed-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/994931554238042@iva8-b333b7f98ab0.qloud-c.yandex.net
Fix a cut'n'paste error in a warning message. This should be
'cpu-idle-state-residency-ns' to match the property searched in the
previous 'of_property_read_u32_array()'
Fixes: 9c7b185ab2 ("powernv/cpuidle: Parse dt idle properties into global structure")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200502115949.139000-1-christophe.jaillet@wanadoo.fr
There are other clocks than the standard ones, for instance
per process clocks. Therefore, being above the last standard clock
doesn't mean it is a bad clock. So, fallback to syscall instead
of returning -EINVAL inconditionaly.
Fixes: e33ffc956b ("powerpc/vdso32: implement clock_getres entirely")
Cc: stable@vger.kernel.org # v5.6+
Reported-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Link: https://lore.kernel.org/r/7316a9e2c0c2517923eb4b0411c4a08d15e675a4.1589017281.git.christophe.leroy@csgroup.eu
The "m<>" constraint breaks compilation with GCC 4.6.x era compilers.
The use of the constraint allows the compiler to use update-form
instructions, however in practice current compilers never generate
those forms for any of the current uses of __put_user_asm_goto().
We anticipate that GCC 4.6 will be declared unsupported for building
the kernel in the not too distant future. So for now just switch to
the "m" constraint.
Fixes: 334710b149 ("powerpc/uaccess: Implement unsafe_put_user() using 'asm goto'")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
Link: https://lore.kernel.org/r/20200507123324.2250024-1-mpe@ellerman.id.au
When an interrupt has been handled, the OS notifies the interrupt
controller with a EOI sequence. On a POWER9 system using the XIVE
interrupt controller, this can be done with a load or a store
operation on the ESB interrupt management page of the interrupt. The
StoreEOI operation has less latency and improves interrupt handling
performance but it was deactivated during the POWER9 DD2.0 timeframe
because of ordering issues. We use the LoadEOI today but we plan to
reactivate StoreEOI in future architectures.
There is usually no need to enforce ordering between ESB load and
store operations as they should lead to the same result. E.g. a store
trigger and a load EOI can be executed in any order. Assuming the
interrupt state is PQ=10, a store trigger followed by a load EOI will
return a Q bit. In the reverse order, it will create a new interrupt
trigger from HW. In both cases, the handler processing interrupts is
notified.
In some cases, the XIVE_ESB_SET_PQ_10 load operation is used to
disable temporarily the interrupt source (mask/unmask). When the
source is reenabled, the OS can detect if interrupts were received
while the source was disabled and reinject them. This process needs
special care when StoreEOI is activated. The ESB load and store
operations should be correctly ordered because a XIVE_ESB_STORE_EOI
operation could leave the source enabled if it has not completed
before the loads.
For those cases, we enforce Load-after-Store ordering with a special
load operation offset. To avoid performance impact, this ordering is
only enforced when really needed, that is when interrupt sources are
temporarily disabled with the XIVE_ESB_SET_PQ_10 load. It should not
be needed for other loads.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200220081506.31209-1-clg@kaod.org
To prevent verifying the kernel module appended signature
twice (finit_module), once by the module_sig_check() and again by IMA,
powerpc secure boot rules define an IMA architecture specific policy
rule only if CONFIG_MODULE_SIG_FORCE is not enabled. This,
unfortunately, does not take into account the ability of enabling
"sig_enforce" on the boot command line (module.sig_enforce=1).
Including the IMA module appraise rule results in failing the
finit_module syscall, unless the module signing public key is loaded
onto the IMA keyring.
This patch fixes secure boot policy rules to be based on
CONFIG_MODULE_SIG instead.
Fixes: 4238fad366 ("powerpc/ima: Add support to initialize ima policy rules")
Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Link: https://lore.kernel.org/r/1588342612-14532-1-git-send-email-nayna@linux.ibm.com
Interrupts that use fast_interrupt_return actually do lock AMR, but
they have been ones which tend to come from userspace (or kernel bugs)
in radix mode. With kuap on hash, segment interrupts are taken in
kernel often, which quickly breaks due to the missing restore.
Fixes: 890274c2dc ("powerpc/64s: Implement KUAP for Radix MMU")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200429065654.1677541-6-npiggin@gmail.com
This merges the lockless page table walk rework series from Aneesh.
Because it touches powerpc KVM code we are sharing it with the kvm-ppc
tree in our topic/ppc-kvm branch.
This is the cover letter from Aneesh:
Avoid IPI while updating page table entries.
Problem Summary:
Slow termination of KVM guest with large guest RAM config due to a
large number of IPIs that were caused by clearing level 1 PTE
entries (THP) entries. This is shown in the stack trace below.
- qemu-system-ppc [kernel.vmlinux] [k] smp_call_function_many
- smp_call_function_many
- 36.09% smp_call_function_many
serialize_against_pte_lookup
radix__pmdp_huge_get_and_clear
zap_huge_pmd
unmap_page_range
unmap_vmas
unmap_region
__do_munmap
__vm_munmap
sys_munmap
system_call
__munmap
qemu_ram_munmap
qemu_anon_ram_free
reclaim_ramblock
call_rcu_thread
qemu_thread_start
start_thread
__clone
Why we need to do IPI when clearing PMD entries:
This was added as part of commit: 13bd817bb8 ("powerpc/thp: Serialize pmd clear against a linux page table walk")
serialize_against_pte_lookup makes sure that all parallel lockless
page table walk completes before we convert a PMD pte entry to regular
pmd entry. We end up doing that conversion in the below scenarios
1) __split_huge_zero_page_pmd
2) do_huge_pmd_wp_page_fallback
3) MADV_DONTNEED running parallel to page faults.
local_irq_disable and lockless page table walk:
The lockless page table walk work with the assumption that we can
dereference the page table contents without holding a lock. For this
to work, we need to make sure we read the page table contents
atomically and page table pages are not going to be freed/released
while we are walking the table pages. We can achieve by using a rcu
based freeing for page table pages or if the architecture implements
broadcast tlbie, we can block the IPI as we walk the page table pages.
To support both the above framework, lockless page table walk is done
with irq disabled instead of rcu_read_lock()
We do have two interface for lockless page table walk, gup fast and
__find_linux_pte. This patch series makes __find_linux_pte table walk
safe against the conversion of PMD PTE to regular PMD.
gup fast:
gup fast is already safe against THP split because kernel now
differentiate between a pmd split and a compound page split. gup fast
can run parallel to a pmd split and we prevent a parallel gup fast to
a hugepage split, by freezing the page refcount and failing the
speculative page ref increment.
Similar to how gup is safe against parallel pmd split, this patch
series updates the __find_linux_pte callers to be safe against a
parallel pmd split. We do that by enforcing the following rules.
1) Don't reload the pte value, because that can be updated in
parallel.
2) Code should be able to work with a stale PTE value and not the
recent one. ie, the pte value that we are looking at may not be the
latest value in the page table.
3) Before looking at pte value check for _PAGE_PTE bit. We now do this
as part of pte_present() check.
Performance:
This speeds up Qemu guest RAM del/unplug time as below
128 core, 496GB guest:
Without patch:
munmap start: timer = 13162 ms, PID=7684
munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms
With patch (upto removing IPI)
munmap start: timer = 196449 ms, PID=6681
munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms
With patch (with adding the tlb invalidate in pmdp_huge_get_and_clear_full)
munmap start: timer = 196345 ms, PID=6879
munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms
Link: https://lore.kernel.org/r/20200505071729.54912-1-aneesh.kumar@linux.ibm.com
MADV_DONTNEED holds mmap_sem in read mode and that implies a
parallel page fault is possible and the kernel can end up with a level 1 PTE
entry (THP entry) converted to a level 0 PTE entry without flushing
the THP TLB entry.
Most architectures including POWER have issues with kernel instantiating a level
0 PTE entry while holding level 1 TLB entries.
The code sequence I am looking at is
down_read(mmap_sem) down_read(mmap_sem)
zap_pmd_range()
zap_huge_pmd()
pmd lock held
pmd_cleared
table details added to mmu_gather
pmd_unlock()
insert a level 0 PTE entry()
tlb_finish_mmu().
Fix this by forcing a tlb flush before releasing pmd lock if this is
not a fullmm invalidate. We can safely skip this invalidate for
task exit case (fullmm invalidate) because in that case we are sure
there can be no parallel fault handlers.
This do change the Qemu guest RAM del/unplug time as below
128 core, 496GB guest:
Without patch:
munmap start: timer = 196449 ms, PID=6681
munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms
With patch:
munmap start: timer = 196345 ms, PID=6879
munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-23-aneesh.kumar@linux.ibm.com
Now that all the lockless page table walk is careful w.r.t the PTE
address returned, we can now revert
commit: 13bd817bb8 ("powerpc/thp: Serialize pmd clear against a linux page table walk.")
We also drop the equivalent IPI from other pte updates routines. We still keep
IPI in hash pmdp collapse and that is to take care of parallel hash page table
insert. The radix pmdp collapse flush can possibly be removed once I am sure
generic code doesn't have the any expectations around parallel gup walk.
This speeds up Qemu guest RAM del/unplug time as below
128 core, 496GB guest:
Without patch:
munmap start: timer = 13162 ms, PID=7684
munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms
With patch:
munmap start: timer = 196449 ms, PID=6681
munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-21-aneesh.kumar@linux.ibm.com
This adds _PAGE_PTE check and makes sure we validate the pte value returned via
find_kvm_host_pte.
NOTE: this also considers _PAGE_INVALID to the software valid bit.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-20-aneesh.kumar@linux.ibm.com
Current code just hold rmap lock to ensure parallel page table update is
prevented. That is not sufficient. The kernel should also check whether
a mmu_notifer callback was running in parallel.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-16-aneesh.kumar@linux.ibm.com
Since kvmppc_do_h_enter can get called in realmode use low level
arch_spin_lock which is safe to be called in realmode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-15-aneesh.kumar@linux.ibm.com
The locking rules for walking nested shadow linux page table is different from process
scoped table. Hence add a helper for nested page table walk and also
add check whether we are holding the right locks.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-11-aneesh.kumar@linux.ibm.com
The locking rules for walking partition scoped table is different from process
scoped table. Hence add a helper for secondary linux page table walk and also
add check whether we are holding the right locks.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-10-aneesh.kumar@linux.ibm.com
These functions can get called in realmode. Hence use low level
arch_spin_lock which is safe to be called in realmode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-9-aneesh.kumar@linux.ibm.com
read_user_stack_slow is called with interrupts soft disabled and it copies contents
from the page which we find mapped to a specific address. To convert
userspace address to pfn, the kernel now uses lockless page table walk.
The kernel needs to make sure the pfn value read remains stable and is not released
and reused for another process while the contents are read from the page. This
can only be achieved by holding a page reference.
One of the first approaches I tried was to check the pte value after the kernel
copies the contents from the page. But as shown below we can still get it wrong
CPU0 CPU1
pte = READ_ONCE(*ptep);
pte_clear(pte);
put_page(page);
page = alloc_page();
memcpy(page_address(page), "secret password", nr);
memcpy(buf, kaddr + offset, nb);
put_page(page);
handle_mm_fault()
page = alloc_page();
set_pte(pte, page);
if (pte_val(pte) != pte_val(*ptep))
Hence switch to __get_user_pages_fast.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-8-aneesh.kumar@linux.ibm.com
A lockless page table walk should be safe against parallel THP collapse, THP
split and madvise(MADV_DONTNEED)/parallel fault. This patch makes sure kernel
won't reload the pteval when checking for different conditions. The patch also added
a check for pte_present to make sure the kernel is indeed operating
on a PTE and not a pointer to level 0 table page.
The pfn value we find here can be different from the actual pfn on which
machine check happened. This can happen if we raced with a parallel update
of the page table. In such a scenario we end up isolating a wrong pfn. But that
doesn't have any other side effect.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-7-aneesh.kumar@linux.ibm.com
Don't fetch the pte value using lockless page table walk. Instead use the value from the
caller. hash_preload is called with ptl lock held. So it is safe to use the
pte_t address directly.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-6-aneesh.kumar@linux.ibm.com
This is only used with init_mm currently. Walking init_mm is much simpler
because we don't need to handle concurrent page table like other mm_context
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-5-aneesh.kumar@linux.ibm.com
This makes the pte_present check stricter by checking for additional _PAGE_PTE
bit. A level 1 pte pointer (THP pte) can be switched to a pointer to level 0 pte
page table page by following two operations.
1) THP split.
2) madvise(MADV_DONTNEED) in parallel to page fault.
A lockless page table walk need to make sure we can handle such changes
gracefully.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-4-aneesh.kumar@linux.ibm.com
If multiple threads in userspace keep changing the protection keys
mapping a range, there can be a scenario where kernel takes a key fault
but the pkey value found in the siginfo struct is a permissive one.
This can confuse the userspace as shown in the below test case.
/* use this to control the number of test iterations */
static void pkeyreg_set(int pkey, unsigned long rights)
{
unsigned long reg, shift;
shift = (NR_PKEYS - pkey - 1) * PKEY_BITS_PER_PKEY;
asm volatile("mfspr %0, 0xd" : "=r"(reg));
reg &= ~(((unsigned long) PKEY_BITS_MASK) << shift);
reg |= (rights & PKEY_BITS_MASK) << shift;
asm volatile("mtspr 0xd, %0" : : "r"(reg));
}
static unsigned long pkeyreg_get(void)
{
unsigned long reg;
asm volatile("mfspr %0, 0xd" : "=r"(reg));
return reg;
}
static int sys_pkey_mprotect(void *addr, size_t len, int prot, int pkey)
{
return syscall(SYS_pkey_mprotect, addr, len, prot, pkey);
}
static int sys_pkey_alloc(unsigned long flags, unsigned long access_rights)
{
return syscall(SYS_pkey_alloc, flags, access_rights);
}
static int sys_pkey_free(int pkey)
{
return syscall(SYS_pkey_free, pkey);
}
static int faulting_pkey;
static int permissive_pkey;
static pthread_barrier_t pkey_set_barrier;
static pthread_barrier_t mprotect_barrier;
static void pkey_handle_fault(int signum, siginfo_t *sinfo, void *ctx)
{
unsigned long pkeyreg;
/* FIXME: printf is not signal-safe but for the current purpose,
it gets the job done. */
printf("pkey: exp = %d, got = %d\n", faulting_pkey, sinfo->si_pkey);
fflush(stdout);
assert(sinfo->si_code == SEGV_PKUERR);
assert(sinfo->si_pkey == faulting_pkey);
/* clear pkey permissions to let the faulting instruction continue */
pkeyreg_set(faulting_pkey, 0x0);
}
static void *do_mprotect_fault(void *p)
{
unsigned long rights, pkeyreg, pgsize;
unsigned int i;
void *region;
int pkey;
srand(time(NULL));
pgsize = sysconf(_SC_PAGESIZE);
rights = PKEY_DISABLE_WRITE;
region = p;
/* allocate key, no permissions */
assert((pkey = sys_pkey_alloc(0, PKEY_DISABLE_ACCESS)) > 0);
pkeyreg_set(4, 0x0);
/* cache the pkey here as the faulting pkey for future reference
in the signal handler */
faulting_pkey = pkey;
printf("%s: faulting pkey = %d\n", __func__, faulting_pkey);
/* try to allocate, mprotect and free pkeys repeatedly */
for (i = 0; i < NUM_ITERATIONS; i++) {
/* sync up with the other thread here */
pthread_barrier_wait(&pkey_set_barrier);
/* make sure that the pkey used by the non-faulting thread
is made permissive for this thread's context too so that
no faults are triggered because it still might have been
set to a restrictive value */
// pkeyreg_set(permissive_pkey, 0x0);
/* sync up with the other thread here */
pthread_barrier_wait(&mprotect_barrier);
/* perform mprotect */
assert(!sys_pkey_mprotect(region, pgsize, PROT_READ | PROT_WRITE, pkey));
/* choose a random byte from the protected region and
attempt to write to it, this will generate a fault */
*((char *) region + (rand() % pgsize)) = rand();
/* restore pkey permissions as the signal handler may have
cleared the bit out for the sake of continuing */
pkeyreg_set(pkey, PKEY_DISABLE_WRITE);
}
/* free pkey */
sys_pkey_free(pkey);
return NULL;
}
static void *do_mprotect_nofault(void *p)
{
unsigned long pgsize;
unsigned int i, j;
void *region;
int pkey;
pgsize = sysconf(_SC_PAGESIZE);
region = p;
/* try to allocate, mprotect and free pkeys repeatedly */
for (i = 0; i < NUM_ITERATIONS; i++) {
/* allocate pkey, all permissions */
assert((pkey = sys_pkey_alloc(0, 0)) > 0);
permissive_pkey = pkey;
/* sync up with the other thread here */
pthread_barrier_wait(&pkey_set_barrier);
pthread_barrier_wait(&mprotect_barrier);
/* perform mprotect on the common page, no faults will
be triggered as this is most permissive */
assert(!sys_pkey_mprotect(region, pgsize, PROT_READ | PROT_WRITE, pkey));
/* free pkey */
assert(!sys_pkey_free(pkey));
}
return NULL;
}
int main(int argc, char **argv)
{
pthread_t fault_thread, nofault_thread;
unsigned long pgsize;
struct sigaction act;
pthread_attr_t attr;
cpu_set_t fault_cpuset, nofault_cpuset;
unsigned int i;
void *region;
/* allocate memory region to protect */
pgsize = sysconf(_SC_PAGESIZE);
assert(region = memalign(pgsize, pgsize));
CPU_ZERO(&fault_cpuset);
CPU_SET(0, &fault_cpuset);
CPU_ZERO(&nofault_cpuset);
CPU_SET(8, &nofault_cpuset);
assert(!pthread_attr_init(&attr));
/* setup sigsegv signal handler */
act.sa_handler = 0;
act.sa_sigaction = pkey_handle_fault;
assert(!sigprocmask(SIG_SETMASK, 0, &act.sa_mask));
act.sa_flags = SA_SIGINFO;
act.sa_restorer = 0;
assert(!sigaction(SIGSEGV, &act, NULL));
/* setup barrier for the two threads */
pthread_barrier_init(&pkey_set_barrier, NULL, 2);
pthread_barrier_init(&mprotect_barrier, NULL, 2);
/* setup and start threads */
assert(!pthread_create(&fault_thread, &attr, &do_mprotect_fault, region));
assert(!pthread_setaffinity_np(fault_thread, sizeof(cpu_set_t), &fault_cpuset));
assert(!pthread_create(&nofault_thread, &attr, &do_mprotect_nofault, region));
assert(!pthread_setaffinity_np(nofault_thread, sizeof(cpu_set_t), &nofault_cpuset));
/* cleanup */
assert(!pthread_attr_destroy(&attr));
assert(!pthread_join(fault_thread, NULL));
assert(!pthread_join(nofault_thread, NULL));
assert(!pthread_barrier_destroy(&pkey_set_barrier));
assert(!pthread_barrier_destroy(&mprotect_barrier));
free(region);
puts("PASS");
return EXIT_SUCCESS;
}
The above test can result the below failure without this patch.
pkey: exp = 3, got = 3
pkey: exp = 3, got = 4
a.out: pkey-siginfo-race.c💯 pkey_handle_fault: Assertion `sinfo->si_pkey == faulting_pkey' failed.
Aborted
Check for vma access before considering this a key fault. If vma pkey allow
access retry the acess again.
Test case is written by Sandipan Das <sandipan@linux.ibm.com> hence added SOB
from him.
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-3-aneesh.kumar@linux.ibm.com
Fetch pkey from vma instead of linux page table. Also document the fact that in
some cases the pkey returned in siginfo won't be the same as the one we took
keyfault on. Even with linux page table walk, we can end up in a similar scenario.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-2-aneesh.kumar@linux.ibm.com
- Fix a regression introduced in the last merge window, which results
in guests in HPT mode dying randomly.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJeni/pAAoJEJ2a6ncsY3GfTRoIANAQjIZi96AfJcfnrYQ4yUF7
scxawTiJ9VavvsEJLJ7vsozrJ4xxmvmA0fFWC84uw9+BwPqoLFFvZTjazbGEDVvF
FGwNBR/k7nfFVMIHS3K9iy9KjvYL3xkL26AgFTDJFq8hmOO9pH0txuk4r7SXb+NX
bGG0mScAD/Dg/HwAHAS6EP3jT35QtGTK62p8foqVTziTNcmBn9Ywtg0lEzAcq2iY
Y1BUD4Ov3cggshMI9SqHE8Yyq0XA2Wi6ggcyz/gVzvcbdFQmtg57Tri8nN8661LX
XKh+VTpYSIxNs5GgjwlNesJzJ9h6CSynJF556qrjQ0XsXcNqvn8fcZdNQ+hnRYw=
=Y19W
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-fixes-5.7-1' into topic/ppc-kvm
This brings in a fix from the kvm-ppc tree that was merged to mainline
after rc2, and so isn't in the base of our topic branch. We'd like it
in the topic branch because it interacts with patches we plan to carry
in this branch.
Commit 0962e8004e ("powerpc/prom: Scan reserved-ranges node for
memory reservations") enabled support to parse reserved-ranges DT
node and reserve kernel memory falling in these ranges for F/W
purposes. Memory reserved for FADump should not overlap with these
ranges as it could corrupt memory meant for F/W or crash'ed kernel
memory to be exported as vmcore.
But since commit 579ca1a276 ("powerpc/fadump: make use of memblock's
bottom up allocation mode"), memblock_find_in_range() is being used to
find the appropriate area to reserve memory for FADump, which can't
account for reserved-ranges as these ranges are reserved only after
FADump memory reservation.
With reserved-ranges now being populated during early boot, look out
for these memory ranges while reserving memory for FADump. Without
this change, MPIPL on PowerNV systems aborts with hostboot failure,
when memory reserved for FADump is less than 4096MB.
Fixes: 579ca1a276 ("powerpc/fadump: make use of memblock's bottom up allocation mode")
Cc: stable@vger.kernel.org
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/158737297693.26700.16193820746269425424.stgit@hbathini.in.ibm.com
At times, memory ranges have to be looked up during early boot, when
kernel couldn't be initialized for dynamic memory allocation. In fact,
reserved-ranges look up is needed during FADump memory reservation.
Without accounting for reserved-ranges in reserving memory for FADump,
MPIPL boot fails with memory corruption issues. So, extend memory
ranges handling to support static allocation and populate reserved
memory ranges during early boot.
Fixes: dda9dbfeeb ("powerpc/fadump: consider reserved ranges while releasing memory")
Cc: stable@vger.kernel.org
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/158737294432.26700.4830263187856221314.stgit@hbathini.in.ibm.com
The system reset interrupt handler locks AMR and exits with
EXCEPTION_RESTORE_REGS without restoring AMR. Similarly to the
soft-NMI handler, it needs to restore.
Fixes: 890274c2dc ("powerpc/64s: Implement KUAP for Radix MMU")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200429065654.1677541-5-npiggin@gmail.com
Any kind of WARN causes a program check that will crash with
unrecoverable exception if it occurs when RI is clear.
Fixes: 68b34588e2 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200429065654.1677541-2-npiggin@gmail.com
Hugh reported that his trusty G5 crashed after a few hours under load
with an "Unrecoverable exception 380".
The crash is in interrupt_return() where we check lazy_irq_pending(),
which calls get_paca() and with CONFIG_DEBUG_PREEMPT=y that goes to
check_preemption_disabled() via debug_smp_processor_id().
As Nick explained on the list:
Problem is MSR[RI] is cleared here, ready to do the last few things
for interrupt return where we're not allowed to take any other
interrupts.
SLB interrupts can happen just about anywhere aside from kernel
text, global variables, and stack. When that hits, it appears to be
unrecoverable due to RI=0.
The problematic access is in preempt_count() which is:
return READ_ONCE(current_thread_info()->preempt_count);
Because of THREAD_INFO_IN_TASK, current_thread_info() just points to
current, so the access is to somewhere in kernel memory, but not on
the stack or in .data, which means it can cause an SLB miss. If we
take an SLB miss with RI=0 it is fatal.
The easiest solution is to add a version of lazy_irq_pending() that
doesn't do the preemption check and call it from the interrupt return
path.
Fixes: 68b34588e2 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200502143316.929341-1-mpe@ellerman.id.au
At the time being, unsafe_copy_to_user() is based on
raw_copy_to_user() which calls __copy_tofrom_user().
__copy_tofrom_user() is a big optimised function to copy big amount
of data. It aligns destinations to cache line in order to use
dcbz instruction.
Today unsafe_copy_to_user() is called only from filldir().
It is used to mainly copy small amount of data like filenames,
so __copy_tofrom_user() is not fit.
Also, unsafe_copy_to_user() is used within user_access_begin/end
sections. In those section, it is preferable to not call functions.
Rewrite unsafe_copy_to_user() as a macro that uses __put_user_goto().
We first perform a loop of long, then we finish with necessary
complements.
unsafe_copy_to_user() might be used in the near future to copy
fixed-size data, like pt_regs structs during signal processing.
Having it as a macro allows GCC to optimise it for instead when
it knows the size in advance, it can unloop loops, drop complements
when the size is a multiple of longs, etc ...
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/fe952112c29bf6a0a2778c9e6bbb4f4afd2c4258.1587143308.git.christophe.leroy@c-s.fr
get/put_user() can be called with nontrivial arguments. fs/proc/page.c
has a good example:
if (put_user(stable_page_flags(ppage), out)) {
stable_page_flags() is quite a lot of code, including spin locks in
the page allocator.
Ensure these arguments are evaluated before user access is allowed.
This improves security by reducing code with access to userspace, but
it also fixes a PREEMPT bug with KUAP on powerpc/64s:
stable_page_flags() is currently called with AMR set to allow writes,
it ends up calling spin_unlock(), which can call preempt_schedule. But
the task switch code can not be called with AMR set (it relies on
interrupts saving the register), so this blows up.
It's fine if the code inside allow_user_access() is preemptible,
because a timer or IPI will save the AMR, but it's not okay to
explicitly cause a reschedule.
Fixes: de78a9c42a ("powerpc: Add a framework for Kernel Userspace Access Protection")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200407041245.600651-1-npiggin@gmail.com
Currently, it is possible to have CONFIG_FUNCTION_TRACER disabled, but
CONFIG_MPROFILE_KERNEL enabled. Though all existing users of
MPROFILE_KERNEL are doing the right thing, it is weird to have
MPROFILE_KERNEL enabled when the function tracer isn't. Fix this by
making MPROFILE_KERNEL depend on FUNCTION_TRACER.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200422092612.514301-1-naveen.n.rao@linux.vnet.ibm.com
On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.
The total PURR and SPURR ticks are already exposed via the per-cpu
sysfs files "purr" and "spurr". This patch adds support for exposing
the idle PURR and SPURR ticks via new per-cpu sysfs files named
"idle_purr" and "idle_spurr".
This patch also adds helper functions to accurately read the values of
idle_purr and idle_spurr especially from an interrupt context between
when the interrupt has occurred between the pseries_idle_prolog() and
pseries_idle_epilog(). This will ensure that the idle purr/spurr
values corresponding to the latest idle period is accounted for before
these values are read.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1586249263-14048-5-git-send-email-ego@linux.vnet.ibm.com
On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.
Via pseries_idle_prolog(), pseries_idle_epilog(), we track the idle
PURR ticks in the VPA variable "wait_state_cycles". This patch extends
the support to account for the idle SPURR ticks.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1586249263-14048-4-git-send-email-ego@linux.vnet.ibm.com
Currently when CPU goes idle, we take a snapshot of PURR via
pseries_idle_prolog() which is used at the CPU idle exit to compute
the idle PURR cycles via the function pseries_idle_epilog(). Thus,
the value of idle PURR cycle thus read before pseries_idle_prolog() and
after pseries_idle_epilog() is always correct.
However, if we were to read the idle PURR cycles from an interrupt
context between pseries_idle_prolog() and pseries_idle_epilog() (this
will be done in a future patch), then, the value of the idle PURR thus
read will not include the cycles spent in the most recent idle period.
Thus, in that interrupt context, we will need access to the snapshot
of the PURR before going idle, in order to compute the idle PURR
cycles for the latest idle duration.
In this patch, we save the snapshot of PURR in pseries_idle_prolog()
in a per-cpu variable, instead of on the stack, so that it can be
accessed from an interrupt context.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1586249263-14048-3-git-send-email-ego@linux.vnet.ibm.com
Currently prior to entering an idle state on a Linux Guest, the
pseries cpuidle driver implement an idle_loop_prolog() and
idle_loop_epilog() functions which ensure that idle_purr is correctly
computed, and the hypervisor is informed that the CPU cycles have been
donated.
These prolog and epilog functions are also required in the default
idle call, i.e pseries_lpar_idle(). Hence move these accessor
functions to a common header file and call them from
pseries_lpar_idle(). Since the existing header files such as
asm/processor.h have enough clutter, create a new header file
asm/idle.h. Finally rename idle_loop_prolog() and idle_loop_epilog()
to pseries_idle_prolog() and pseries_idle_epilog() as they are only
relavent for on pseries guests.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1586249263-14048-2-git-send-email-ego@linux.vnet.ibm.com
CONFIG_PPC_KUAP_DEBUG is not selectable because it depends on PPC_32
which doesn't exists.
Fixing it leads to a deadlock due to a vital register getting
clobbered in _switch().
Change dependency to PPC32 and use r0 instead of r4 in _switch()
Fixes: e2fb9f5444 ("powerpc/32: Prepare for Kernel Userspace Access Protection")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/540242f7d4573f7cdf1b3bf46bb35f743b2cd68f.1587124651.git.christophe.leroy@c-s.fr
WRITE_RO lkdtm test works.
But when selecting CONFIG_DEBUG_RODATA_TEST, the kernel reports
rodata_test: test data was not read only
This is because when rodata test runs, there are still old entries
in TLB.
Flush TLB after setting kernel pages RO or NX.
Fixes: d5f17ee964 ("powerpc/8xx: don't disable large TLBs with CONFIG_STRICT_KERNEL_RWX")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/485caac75f195f18c11eb077b0031fdd2bb7fb9e.1587361039.git.christophe.leroy@c-s.fr
allyesconfig fails with:
./usr/include/asm/vas-api.h:15:2: error: unknown type name '__u32'
15 | __u32 version;
| ^~~~~
./usr/include/asm/vas-api.h:16:2: error: unknown type name '__s16'
16 | __s16 vas_id; /* specific instance of vas or -1 for default */
| ^~~~~
./usr/include/asm/vas-api.h:17:2: error: unknown type name '__u16'
17 | __u16 reserved1;
| ^~~~~
./usr/include/asm/vas-api.h:18:2: error: unknown type name '__u64'
18 | __u64 flags; /* Future use */
| ^~~~~
./usr/include/asm/vas-api.h:19:2: error: unknown type name '__u64'
19 | __u64 reserved2[6];
| ^~~~~
uapi headers should be self contained, so add an include of
linux/types.h.
Fixes: 45f25a79fe ("powerpc/vas: Define VAS_TX_WIN_OPEN ioctl API")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Haren Myneni <haren@linux.ibm.com>
[mpe: Flesh out change log from linux-next error report]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200422154129.11f988fd@canb.auug.org.au
If {i,d}-cache-block-size is set and {i,d}-cache-line-size is not, use
the block-size value for both. Per the devicetree spec cache-line-size
is only needed if it differs from the block size.
Originally the code would fallback from block size to line size. An
error message was printed if both properties were missing.
Later the code was refactored to use clearer names and logic but it
inadvertently made line size a required property, meaning on systems
without a line size property we fall back to the default from the
cputable.
On powernv (OPAL) platforms, since the introduction of device tree CPU
features (5a61ef74f2 ("powerpc/64s: Support new device tree binding
for discovering CPU features")), that has led to the wrong value being
used, as the fallback value is incorrect for Power8/Power9 CPUs.
The incorrect values flow through to the VDSO and also to the sysconf
values, SC_LEVEL1_ICACHE_LINESIZE etc.
Fixes: bd067f83b0 ("powerpc/64: Fix naming of cache block vs. cache line")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reported-by: Qian Cai <cai@lca.pw>
[mpe: Add even more detail to change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200416221908.7886-1-chris.packham@alliedtelesis.co.nz
Since cd758a9b57 "KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT
page fault handler", it's been possible in fairly rare circumstances to
load a non-present PTE in kvmppc_book3s_hv_page_fault() when running a
guest on a POWER8 host.
Because that case wasn't checked for, we could misinterpret the non-present
PTE as being a cache-inhibited PTE. That could mismatch with the
corresponding hash PTE, which would cause the function to fail with -EFAULT
a little further down. That would propagate up to the KVM_RUN ioctl()
generally causing the KVM userspace (usually qemu) to fall over.
This addresses the problem by catching that case and returning to the guest
instead.
For completeness, this fixes the radix page fault handler in the same
way. For radix this didn't cause any obvious misbehaviour, because we
ended up putting the non-present PTE into the guest's partition-scoped
page tables, leading immediately to another hypervisor data/instruction
storage interrupt, which would go through the page fault path again
and fix things up.
Fixes: cd758a9b57 "KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT page fault handler"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1820402
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When window is opened, pid reference is taken for user space
windows. Not needed for kernel windows. So remove 'pid' in
vas_tx_win_attr struct.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587114674.2275.1132.camel@hbabu-laptop
On power9, userspace can send GZIP compression requests directly to NX
once kernel establishes NX channel / window with VAS. This patch provides
user space API which allows user space to establish channel using open
VAS_TX_WIN_OPEN ioctl, mmap and close operations.
Each window corresponds to file descriptor and application can open
multiple windows. After the window is opened, VAS_TX_WIN_OPEN icoctl to
open a window on specific VAS instance, mmap() system call to map
the hardware address of engine's request queue into the application's
virtual address space.
Then the application can then submit one or more requests to the the
engine by using the copy/paste instructions and pasting the CRBs to
the virtual address (aka paste_address) returned by mmap().
Only NX GZIP coprocessor type is supported right now and allow GZIP
engine access via /dev/crypto/nx-gzip device node.
Thanks to Michael Ellerman for his changes and suggestions to make the
ioctl generic to support any coprocessor type.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587114121.2275.1109.camel@hbabu-laptop
Define the VAS_TX_WIN_OPEN ioctl interface for NX GZIP access
from user space. This interface is used to open GZIP send window and
mmap region which can be used by userspace to send requests to NX
directly with copy/paste instructions.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587114065.2275.1106.camel@hbabu-laptop
Initialize send and receive window attributes for GZIP high and
normal priority types.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587114029.2275.1103.camel@hbabu-laptop
set_thread_uses_vas() sets used_vas flag for a process that opened VAS
window and issue CP_ABORT during context switch for only that process.
In multi-thread application, windows can be shared. For example Thread
A can open a window and Thread B can run COPY/PASTE instructions to
send NX request which may cause corruption or snooping or a covert
channel Also once this flag is set, continue to run CP_ABORT even the
VAS window is closed.
So define vas-windows counter in process mm_context, increment this
counter for each window open and decrement it for window close. If
vas-windows is set, issue CP_ABORT during context switch. It means
clear the foreign real address mapping only if the process / thread
uses COPY/PASTE. Then disable it for that process if windows are not
open.
Moved set_thread_uses_vas() code to vas_tx_win_open() as this
functionality is needed only for userspace open windows. We are adding
VAS userspace support along with this fix. So no need to include this
fix in stable releases.
Fixes: 9d2a4d7133 ("powerpc: Define set_thread_uses_vas()")
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Suggested-by: Milton Miller <miltonm@us.ibm.com>
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587017291.2275.1077.camel@hbabu-laptop
NX may be processing requests while trying to close window. Wait until
all credits are returned and then free send window from VAS instance.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587017256.2275.1076.camel@hbabu-laptop
Process can not close send window until all requests are processed.
Means wait until window state is not busy and send credits are
returned. Display debug messages in case taking longer to close the
window.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587017219.2275.1073.camel@hbabu-laptop
System checkstops if RxFIFO overruns with more requests than the
maximum possible number of CRBs allowed in FIFO at any time. So
max credits value (rxattr.wcreds_max) is set and is passed to
vas_rx_win_open() by the the driver.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587017136.2275.1070.camel@hbabu-laptop
Dump FIFO entries if could not find send window and print CRB
for debugging.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587017099.2275.1067.camel@hbabu-laptop
NX uses credit mechanism to control the number of requests issued on
a specific window at any point of time. Only send windows and fault
window are used credits. When the request is issued on a given window,
a credit is taken. This credit will be returned after that request is
processed. If credits are not available, returns RMA_Busy for send
window and RMA_Reject for fault window.
NX expects OS to return credit for send window after processing fault
CRB. Also credit has to be returned for fault window after handling
the fault.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587017059.2275.1064.camel@hbabu-laptop
Applications polls on CSB for the status update after requests are
issued. NX process these requests and update the CSB with the status.
If it encounters translation error, pastes CRB in fault FIFO and
raises an interrupt. The kernel handles fault by reading CRB from
fault FIFO and process the fault CRB.
For each fault CRB, update fault address in CRB (fault_storage_addr)
and translation error status in CSB so that user space can touch the
fault address and resend the request. If the user space passed invalid
CSB address send signal to process with SIGSEGV.
In the case of multi-thread applications, child thread may not be
available. So if the task is not running, send signal to tgid.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587017022.2275.1063.camel@hbabu-laptop
When NX encounters translation error on CRB and any request buffer,
raises an interrupt on the CPU to handle the fault. It can raise one
interrupt for multiple faults. Expects OS to handle these faults and
return credits for fault window after processing faults.
Setup thread IRQ handler and IRQ thread function per each VAS instance.
IRQ handler checks if the thread is already woken up and can handle new
faults. If so returns with IRQ_HANDLED, otherwise wake up thread to
process new faults.
The thread functions reads each CRB entry from fault FIFO until sees
invalid entry. After reading each CRB, determine the corresponding
send window using pswid (from CRB) and process fault CRB. Then
invalidate the entry and return credit. Processing fault CRB and
return credit is described in subsequent patches.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587016982.2275.1060.camel@hbabu-laptop
When process opens a window, its pid and tgid will be saved in the
vas_window struct. This window will be closed when the process exits.
The kernel handles NX faults by updating CSB or send SEGV signal to pid
of the process if the userspace csb addr is invalid.
In multi-thread applications, a window can be opened by a child thread,
but it will not be closed when this thread exits. It is expected that
the parent will clean up all resources including NX windows opened by
child threads. A child thread can send NX requests using this window
and could be killed before completion is reported. If the pid assigned
to this thread is reused while requests are pending, a failure SEGV
would be directed to the wrong place.
To prevent reusing the pid, take references to pid and mm when the window
is opened and release them when when the window is closed. Then if child
thread is not running, SEGV signal will be sent to thread group leader
(tgid).
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587016936.2275.1057.camel@hbabu-laptop
For each user space send window, register NX with fault window ID
and port value so that NX paste CRBs in this fault FIFO when it
sees fault on the request buffer.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587016888.2275.1054.camel@hbabu-laptop
Setup fault window for each VAS instance. When NX gets a fault on
request buffer, pastes fault CRB in the corresponding fault FIFO and
then raises an interrupt to the OS. The kernel handles this fault
and process faults CRB from this FIFO.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587016846.2275.1053.camel@hbabu-laptop
Allocate a xive irq on each chip with a vas instance. The NX coprocessor
raises a host CPU interrupt via vas if it encounters page fault on user
space request buffer. Subsequent patches register the trigger port with
the NX coprocessor, and create a vas fault handler for this interrupt
mapping.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587016806.2275.1050.camel@hbabu-laptop
Kernel sets fault address and status in CRB for NX page fault on user
space address after processing page fault. User space gets the signal
and handles the fault mentioned in CRB by bringing the page in to
memory and send NX request again.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587016769.2275.1048.camel@hbabu-laptop
This function allocates IRQ on a specific chip. VAS needs per chip
IRQ allocation and will have IRQ handler per VAS instance.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1587016720.2275.1047.camel@hbabu-laptop
devm_memremap_pages() is currently used by the PCI P2PDMA code to create
struct page mappings for IO memory. At present, these mappings are
created with PAGE_KERNEL which implies setting the PAT bits to be WB.
However, on x86, an mtrr register will typically override this and force
the cache type to be UC-. In the case firmware doesn't set this
register it is effectively WB and will typically result in a machine
check exception when it's accessed.
Other arches are not currently likely to function correctly seeing they
don't have any MTRR registers to fall back on.
To solve this, provide a way to specify the pgprot value explicitly to
arch_add_memory().
Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a
simple change to pass the pgprot_t down to their respective functions
which set up the page tables. For x86_32, set the page tables
explicitly using _set_memory_prot() (seeing they are already mapped).
For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
should be fine, for now, seeing these architectures don't support
ZONE_DEVICE.
A check in __add_pages() is also added to ensure the pgprot parameter
was set for all arches.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Badger <ebadger@gigaio.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In prepartion to support a pgprot_t argument for arch_add_memory().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Badger <ebadger@gigaio.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200306170846.9333-6-logang@deltatee.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mhp_restrictions struct really doesn't specify anything resembling a
restriction anymore so rename it to be mhp_params as it is a list of
extended parameters.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Badger <ebadger@gigaio.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many places where all basic VMA access flags (read, write,
exec) are initialized or checked against as a group. One such example
is during page fault. Existing vma_is_accessible() wrapper already
creates the notion of VMA accessibility as a group access permissions.
Hence lets just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which
will not only reduce code duplication but also extend the VMA
accessibility concept in general.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Springer <rspringer@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many platforms with exact same value for VM_DATA_DEFAULT_FLAGS
This creates a default value for VM_DATA_DEFAULT_FLAGS in line with the
existing VM_STACK_DEFAULT_FLAGS. While here, also define some more
macros with standard VMA access flag combinations that are used
frequently across many platforms. Apart from simplification, this
reduces code duplication as well.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Chris Zankel <chris@zankel.net>
Link: http://lkml.kernel.org/r/1583391014-8170-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- A fix for a crash in machine check handling on pseries (ie. guests)
- A small series to make it possible to disable CONFIG_COMPAT, and turn it off
by default for ppc64le where it's not used.
- A few other miscellaneous fixes and small improvements.
Thanks to:
Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann, Christophe Leroy, Dan
Carpenter, Ganesh Goudar, Geert Uytterhoeven, Geoff Levand, Mahesh Salgaonkar,
Markus Elfring, Michal Suchanek, Nicholas Piggin, Stephen Boyd, Wen Xiong.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl6O8LgTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgMcYEACbGf+Z9brLSasYajoqU6QdqGPacHEN
1a9TEmUnN+HWgtfkkoEBFbyuYHnhyhYuf7hvNccDjDA91ESVylO+Wq7Q+v/xoz29
LVyb3V6uuVMLHnoqwP5jpr0lS0aOpuu3Nc2SpfBuolDtJqeMxpVEEK3Ln3uATlq6
SnEAxQEKb2x4Y4Tfuq5A3txupj0s/UYrmeR6GAdkN3Oapbb9uQl8Ql2smqrZo0cq
6TLxtoFzbfXkV6NmY2V0OKMTeXt0fhrvdDhFEBckCUpRZLv4Fd7CwPWNER2bUVs6
04kg87BAO8qRyfr3G93oP0mWgi65kpXI8yN6Vt5Lig+5PnxNh7swvEqDpQR1s9se
uVoeN7RBHOKQppZRkpzf6yZbelCcMuwTeIBQ9XOx/jYFAHJ2nUkJm9TYgKckVvNY
E4shiM1eoOVMrEmODFBCUmUkJLyn5jU1+r5mn708v2Nb5E1XgoTejitB6bHyL+Aa
zo/0DdsZO86iNE7th94oHQRgVbx1vtP9kV6vK6BLB5M95RSGEdVAEMo5CTL70wr+
hz7suOaijPi+TeCW9YPrcOHcHOQE9pzTFQoVZHuv8egbvd9Rni3aBAMMqHx/lQdE
Hk4zN7IytOLqQTfINNYgoo0kUKI1ipJQOBfR5jyeLhgg6yp36CnO/m+0QGvrILbi
18tlf2qkD75Z3g==
=72sb
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
"The bulk of this is the series to make CONFIG_COMPAT user-selectable,
it's been around for a long time but was blocked behind the
syscall-in-C series.
Plus there's also a few fixes and other minor things.
Summary:
- A fix for a crash in machine check handling on pseries (ie. guests)
- A small series to make it possible to disable CONFIG_COMPAT, and
turn it off by default for ppc64le where it's not used.
- A few other miscellaneous fixes and small improvements.
Thanks to: Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann,
Christophe Leroy, Dan Carpenter, Ganesh Goudar, Geert Uytterhoeven,
Geoff Levand, Mahesh Salgaonkar, Markus Elfring, Michal Suchanek,
Nicholas Piggin, Stephen Boyd, Wen Xiong"
* tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Always build the tm-poison test 64-bit
powerpc: Improve ppc_save_regs()
Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"
powerpc/time: Replace <linux/clk-provider.h> by <linux/of_clk.h>
powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory
powerpc/perf: split callchain.c by bitness
powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.
powerpc/64: make buildable without CONFIG_COMPAT
powerpc/perf: consolidate valid_user_sp -> invalid_user_sp
powerpc/perf: consolidate read_user_stack_32
powerpc: move common register copy functions from signal_32.c to signal.c
powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
powerpc/ps3: Set CONFIG_UEVENT_HELPER=y in ps3_defconfig
powerpc/ps3: Remove an unneeded NULL check
powerpc/ps3: Remove duplicate error message
powerpc/powernv: Re-enable imc trace-mode in kernel
powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.
powerpc/pseries: Fix MCE handling on pseries
selftests/eeh: Skip ahci adapters
powerpc/64s: Fix doorbell wakeup msgclr optimisation
- Add support for region alignment configuration and enforcement to
fix compatibility across architectures and PowerPC page size
configurations.
- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach in
the platform-memory subsystem before the platform will consider them
power-fail protected.
- Promote numa_map_to_online_node() to a cross-kernel generic facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Pick up some miscellaneous minor fixes, that missed v5.6-final,
including a some smatch reports in the ioctl path and some unit test
compilation fixups.
- Fixup some flexible-array declarations.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAl6LtIAACgkQHtKRamZ9
iAIwRA/8CLVVuQpgHQ1tqK4h8CZPrISFXh7wy7uhocEU2xrDh6iGVnLztmoLRr2k
5f8T9lRzreSAwIVL5DbGqP1pFncqIt9VMnKsFlaPMBGCBNR+hURY0iBCNjIT+jiq
BOzLd52MR2rqJxeXGTMUbWrBrbmuj4mZPdmGVuFFe7GFRpoaVpCgOo+296eWa/ot
gIOFUTonZY7STYjNvDok0TXCmiCFuJb+P+y5ldfCPShHvZhTiaF53jircja8vAjO
G5dt8ixBKUK0rXRc4SEQsQhAZNcAFHb6Gy5lg4C2QzhTF374xTc9usJZNWbIE9iM
5mipBYvjVuoY+XaCNZDkaRcJIy/jqB15O6l3QIWbZLGaK9m95YPp9LmkPFwd3JpO
e3rO24ML471DxqB9iWIiJCNcBBocLOlnd6qAQTpppWDpGNbudwXvfsmKHmKIScSE
x+IDCdscLmmm+WG2dLmLraWOVPu42xZFccoQCi4M3TTqfeB9pZ9XckFQ37zX62zG
5t+7Ek+t1W4QVt/JQYVKH03XT15sqUpVknvx0Hl4Y5TtbDOkFLkO8RN0/HyExDef
7iegS35kqTsM4EfZQ+9juKbI2JBAjHANcbj0V4dogqaRj6vr3akumBzUtuYqAofv
qU3s9skmLsEemOJC+ns2PT8vl5dyIoeDfH0r2XvGWxYqolMqJpA=
=sY4N
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and dax updates from Dan Williams:
"There were multiple touches outside of drivers/nvdimm/ this round to
add cross arch compatibility to the devm_memremap_pages() interface,
enhance numa information for persistent memory ranges, and add a
zero_page_range() dax operation.
This cycle I switched from the patchwork api to Konstantin's b4 script
for collecting tags (from x86, PowerPC, filesystem, and device-mapper
folks), and everything looks to have gone ok there. This has all
appeared in -next with no reported issues.
Summary:
- Add support for region alignment configuration and enforcement to
fix compatibility across architectures and PowerPC page size
configurations.
- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach
in the platform-memory subsystem before the platform will consider
them power-fail protected.
- Promote numa_map_to_online_node() to a cross-kernel generic
facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Pick up some miscellaneous minor fixes, that missed v5.6-final,
including a some smatch reports in the ioctl path and some unit
test compilation fixups.
- Fixup some flexible-array declarations"
* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
dax: Move mandatory ->zero_page_range() check in alloc_dax()
dax,iomap: Add helper dax_iomap_zero() to zero a range
dax: Use new dax zero page method for zeroing a page
dm,dax: Add dax zero_page_range operation
s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
dax, pmem: Add a dax operation zero_page_range
pmem: Add functions for reading/writing page to/from pmem
libnvdimm: Update persistence domain value for of_pmem and papr_scm device
tools/test/nvdimm: Fix out of tree build
libnvdimm/region: Fix build error
libnvdimm/region: Replace zero-length array with flexible-array member
libnvdimm/label: Replace zero-length array with flexible-array member
ACPI: NFIT: Replace zero-length array with flexible-array member
libnvdimm/region: Introduce an 'align' attribute
libnvdimm/region: Introduce NDD_LABELING
libnvdimm/namespace: Enforce memremap_compat_align()
libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
libnvdimm: Out of bounds read in __nd_ioctl()
acpi/nfit: improve bounds checking for 'func'
mm/memremap_pages: Introduce memremap_compat_align()
...
Some bug fixes.
The new vdpa subsystem with two first drivers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAl6MS7wPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpGp8H/2H49Gya1cfVbGU13qgmBSQqQXC8hS3iNLuG
ltRgU+jafJT//kvkdm3/DUzfK3eRUWUfqZLKEbAQDtMY0OGHi/KGEBYVLDde7Zxt
Lg4VnwBhkYDR/f01ZZDbHxzj9JAr83i28nILjLIqf3a1BX4zf203+ZE0/JM8a7wL
dOPoH7NAfyz5ul2F67bR1IOF8vC6TidpavzR2+HC/MocHYXb6Bgfvt+i4EcrfuMf
9lnBfajgklKr9sNJniwvvR1pWVg+YyG3VeC6T8tIC/xzbCmIoNT+5b3q2XPSIHq1
EuQTeXH9CBFXS0qcFlq2ktR1xd1Lx95hKwZpqLwLFDmfgjhV2QU=
=/84P
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- Some bug fixes
- The new vdpa subsystem with two first drivers
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio-balloon: Revert "virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM"
vdpa: move to drivers/vdpa
virtio: Intel IFC VF driver for VDPA
vdpasim: vDPA device simulator
vhost: introduce vDPA-based backend
virtio: introduce a vDPA based transport
vDPA: introduce vDPA bus
vringh: IOTLB support
vhost: factor out IOTLB
vhost: allow per device message handler
vhost: refine vhost and vringh kconfig
virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM
virtio-net: Introduce hash report feature
virtio-net: Introduce RSS receive steering feature
virtio-net: Introduce extended RSC feature
tools/virtio: option to build an out of tree module
Generated files are also checked by sparse that's why add newline to
remove sparse (C=1) warning.
The issue was found on Microblaze and reported like this:
./arch/microblaze/include/generated/uapi/asm/unistd_32.h:438:45: warning:
no newline at end of file
Mips and PowerPC have it already but let's align with style used by m68k.
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Stefan Asserhall <stefan.asserhall@xilinx.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com> (xtensa)
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/4d32ab4e1fb2edb691d2e1687e8fb303c09fd023.1581504803.git.michal.simek@xilinx.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's always try to online the re-added memory blocks. In case
add_memory() already onlined the added memory blocks, the first
device_online() call will fail and stop processing the remaining memory
blocks.
This avoids manually having to check memhp_auto_online.
Note: PPC always onlines all hotplugged memory directly from the kernel as
well - something that is handled by user space on other architectures.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Yumei Huang <yuhuang@redhat.com>
Link: http://lkml.kernel.org/r/20200317104942.11178-5-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- A large series from Nick for 64-bit to further rework our exception vectors,
and rewrite portions of the syscall entry/exit and interrupt return in C. The
result is much easier to follow code that is also faster in general.
- Cleanup of our ptrace code to split various parts out that had become badly
intertwined with #ifdefs over the years.
- Changes to our NUMA setup under the PowerVM hypervisor which should
hopefully avoid non-sensical topologies which can lead to warnings from the
workqueue code and other problems.
- MAINTAINERS updates to remove some of our old orphan entries and update the
status of others.
- Quite a few other small changes and fixes all over the map.
Thanks to:
Abdul Haleem, afzal mohammed, Alexey Kardashevskiy, Andrew Donnellan, Aneesh
Kumar K.V, Balamuruhan S, Cédric Le Goater, Chen Zhou, Christophe JAILLET,
Christophe Leroy, Christoph Hellwig, Clement Courbet, Daniel Axtens, David
Gibson, Douglas Miller, Fabiano Rosas, Fangrui Song, Ganesh Goudar, Gautham R.
Shenoy, Greg Kroah-Hartman, Greg Kurz, Gustavo Luiz Duarte, Hari Bathini, Ilie
Halip, Jan Kara, Joe Lawrence, Joe Perches, Kajol Jain, Larry Finger,
Laurentiu Tudor, Leonardo Bras, Libor Pechacek, Madhavan Srinivasan, Mahesh
Salgaonkar, Masahiro Yamada, Masami Hiramatsu, Mauricio Faria de Oliveira,
Michael Neuling, Michal Suchanek, Mike Rapoport, Nageswara R Sastry, Nathan
Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers,
Oliver O'Halloran, Po-Hsu Lin, Pratik Rajesh Sampat, Rasmus Villemoes, Ravi
Bangoria, Roman Bolshakov, Sam Bobroff, Sandipan Das, Santosh S, Sedat Dilek,
Segher Boessenkool, Shilpasri G Bhat, Sourabh Jain, Srikar Dronamraju, Stephen
Rothwell, Tyrel Datwyler, Vaibhav Jain, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl6JypATHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgOTyD/0U90tXb3VXlQcc4OFIb8vWIj76k4Zn
ZSZ7RyOuvb5pCISBZjSK79XkR9eMHT77qagX4V41q64k4yQl8nbgLeVnwL76hLLc
IJCs23f4nsO0uqX/MhSCc5dfOOOS2i8V+OQYtsYWsH5QaG95v0cHIqVaHHMlfQxu
507GO/W5W6KTd4x008b5unQOuE51zMKlKvqEJXkT59obQFpaa2S5Wn7OzhsnarCH
YSRNxaC7vtgBKLA9wUnFh8UUbh0FbOwXBCaq4OhHMhgRihdteVBCzlcR/6c+IRbt
EoZxKzfQ0hI1z5f++kJNaRXMtUbSpM8D1HdKKHgiWjpdBSD0eu2X106KQT2R2ZOF
qhX8xPLWNzdBglA6L43AaZUu+4ayd3QrrJIkjDv/K1rCHZjfGOzSQfoZgTEBNLFA
tC0crhEfw8m98e4EwhCtekGQxdczRdLS9YvtC/h6mU2xkpA35yNSwB1/iuVQdkYD
XyrEqImAQ1PJla7NL0hxSy5ZxrBtMeKT4WZZ0BNgKXryemldg8Tuv3AEyach3BHz
eU0pIwpbnPm1JAPyrpDQ1yEf7QsD77gTPfEvilEci60R9DhvIMGAY+pt0qfME3yX
wOLp2yVBEXlRmvHk/y/+r+m4aCsmwSrikbWwmLLwAAA6JehtzFOWxTEfNpACP23V
mZyyZznsHIIE3Q==
=ARdm
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Slightly late as I had to rebase mid-week to insert a bug fix:
- A large series from Nick for 64-bit to further rework our exception
vectors, and rewrite portions of the syscall entry/exit and
interrupt return in C. The result is much easier to follow code
that is also faster in general.
- Cleanup of our ptrace code to split various parts out that had
become badly intertwined with #ifdefs over the years.
- Changes to our NUMA setup under the PowerVM hypervisor which should
hopefully avoid non-sensical topologies which can lead to warnings
from the workqueue code and other problems.
- MAINTAINERS updates to remove some of our old orphan entries and
update the status of others.
- Quite a few other small changes and fixes all over the map.
Thanks to: Abdul Haleem, afzal mohammed, Alexey Kardashevskiy, Andrew
Donnellan, Aneesh Kumar K.V, Balamuruhan S, Cédric Le Goater, Chen
Zhou, Christophe JAILLET, Christophe Leroy, Christoph Hellwig, Clement
Courbet, Daniel Axtens, David Gibson, Douglas Miller, Fabiano Rosas,
Fangrui Song, Ganesh Goudar, Gautham R. Shenoy, Greg Kroah-Hartman,
Greg Kurz, Gustavo Luiz Duarte, Hari Bathini, Ilie Halip, Jan Kara,
Joe Lawrence, Joe Perches, Kajol Jain, Larry Finger, Laurentiu Tudor,
Leonardo Bras, Libor Pechacek, Madhavan Srinivasan, Mahesh Salgaonkar,
Masahiro Yamada, Masami Hiramatsu, Mauricio Faria de Oliveira, Michael
Neuling, Michal Suchanek, Mike Rapoport, Nageswara R Sastry, Nathan
Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nick
Desaulniers, Oliver O'Halloran, Po-Hsu Lin, Pratik Rajesh Sampat,
Rasmus Villemoes, Ravi Bangoria, Roman Bolshakov, Sam Bobroff,
Sandipan Das, Santosh S, Sedat Dilek, Segher Boessenkool, Shilpasri G
Bhat, Sourabh Jain, Srikar Dronamraju, Stephen Rothwell, Tyrel
Datwyler, Vaibhav Jain, YueHaibing"
* tag 'powerpc-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
powerpc: Make setjmp/longjmp signature standard
powerpc/cputable: Remove unnecessary copy of cpu_spec->oprofile_type
powerpc: Suppress .eh_frame generation
powerpc: Drop -fno-dwarf2-cfi-asm
powerpc/32: drop unused ISA_DMA_THRESHOLD
powerpc/powernv: Add documentation for the opal sensor_groups sysfs interfaces
selftests/powerpc: Fix try-run when source tree is not writable
powerpc/vmlinux.lds: Explicitly retain .gnu.hash
powerpc/ptrace: move ptrace_triggered() into hw_breakpoint.c
powerpc/ptrace: create ppc_gethwdinfo()
powerpc/ptrace: create ptrace_get_debugreg()
powerpc/ptrace: split out ADV_DEBUG_REGS related functions.
powerpc/ptrace: move register viewing functions out of ptrace.c
powerpc/ptrace: split out TRANSACTIONAL_MEM related functions.
powerpc/ptrace: split out SPE related functions.
powerpc/ptrace: split out ALTIVEC related functions.
powerpc/ptrace: split out VSX related functions.
powerpc/ptrace: drop PARAMETER_SAVE_AREA_OFFSET
powerpc/ptrace: drop unnecessary #ifdefs CONFIG_PPC64
powerpc/ptrace: remove unused header includes
...
Here are 3 SPDX patches for 5.7-rc1.
One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.
Nothing too complex, but you will get a merge conflict with your current
tree, that should be trivial to handle (one file modified by two things,
one file deleted.)
All 3 of these have been in linux-next for a while, with no reported
issues other than the merge conflict.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXodg5A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykySQCgy9YDrkz7nWq6v3Gohl6+lW/L+rMAnRM4uTZm
m5AuCzO3Azt9KBi7NL+L
=2Lm5
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull SPDX updates from Greg KH:
"Here are three SPDX patches for 5.7-rc1.
One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.
Nothing too complex, but you will get a merge conflict with your
current tree, that should be trivial to handle (one file modified by
two things, one file deleted.)
All three of these have been in linux-next for a while, with no
reported issues other than the merge conflict"
* tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
ASoC: MT6660: make spdxcheck.py happy
.gitignore: add SPDX License Identifier
.gitignore: remove too obvious comments
This reverts commit ebb37cf3ff.
That commit does not play well with soft-masked irq state
manipulations in idle, interrupt replay, and possibly others due to
tracing code sometimes using irq_work_queue (e.g., in
trace_hardirqs_on()). That can cause PACA_IRQ_DEC to become set when
it is not expected, and be ignored or cleared or cause warnings.
The net result seems to be missing an irq_work until the next timer
interrupt in the worst case which is usually not going to be noticed,
however it could be a long time if the tick is disabled, which is
against the spirit of irq_work and might cause real problems.
The idea is still solid, but it would need more work. It's not really
clear if it would be worth added complexity, so revert this for
now (not a straight revert, but replace with a comment explaining why
we might see interrupts happening, and gives git blame something to
find).
Fixes: ebb37cf3ff ("powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200402120401.1115883-1-npiggin@gmail.com
The PowerPC time code is not a clock provider, and just needs to call
of_clk_init().
Hence it can include <linux/of_clk.h> instead of <linux/clk-provider.h>.
Remove the #ifdef protecting the of_clk_init() call, as a stub is
available for the !CONFIG_COMMON_CLK case.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200213083804.24315-1-geert+renesas@glider.be
- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.
- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach in
the platform-memory subsystem before the platform will consider them
power-fail protected.
- Fixup some flexible-array declarations.
- Promote numa_map_to_online_node() to a cross-kernel generic facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
Unlike normal memory ("memory" compatible type in the FDT), the
persistent memory ("ibm,pmemory" in the FDT) can be mapped anywhere in
the guest physical space and it can be used for DMA.
In order to maintain 1:1 mapping via the huge DMA window, we need to
know the maximum physical address at the time of the window setup. So
far we've been looking at "memory" nodes but "ibm,pmemory" does not
have fixed addresses and the persistent memory may be mapped
afterwards.
Since the persistent memory is still backed with page structs, use
MAX_PHYSMEM_BITS as the upper limit.
This effectively disables huge DMA window in LPAR under pHyp if
persistent memory is present but this is the best we can do for the
moment.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Wen Xiong<wenxiong@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200331012338.23773-1-aik@ozlabs.ru
- Unit test for overlays with GPIO hogs
- Improve dma-ranges parsing to handle dma-ranges with multiple entries
- Update dtc to upstream version v1.6.0-2-g87a656ae5ff9
- Improve overlay error reporting
- Device link support for power-domains and hwlocks bindings
- Add vendor prefixes for Beacon, Topwise, ENE, Dell, SG Micro, Elida,
PocketBook, Xiaomi, Linutronix, OzzMaker, Waveshare Electronics, and
ITE Tech
- Add deprecated Marvell vendor prefix 'mrvl'
- A bunch of binding conversions to DT schema continues. Of note, the
common serial and USB connector bindings are converted.
- Add more Arm CPU compatibles
- Drop Mark Rutland as DT maintainer :(
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAl6GTJUQHHJvYmhAa2Vy
bmVsLm9yZwAKCRD6+121jbxhw8vaD/9HDT0EEDRdfr8s5OiLHYQMSXlSExJgR+ON
iLI4v19apWwyycil176RXuWL5fFxh9HHOKbxOL27mmfqSLdv/xe/ev6Z4oNLVZzm
YiUalm37K+R9yWsWTqxJKtQxWqSBBtOlyJmUYnvxI1bjfF4PANKF+IMpQVDflj1w
9UMixoPRFdlOoMXVmC7yT0sfE1B6REJIh0Qfa4pqDwnHRhs8imgoIYA+JwFnmFPr
k5Q8xS2YCwopH/7W5aRLzyEI2dF6Rfy9FID8jfyNoKZLtBW6cIEkA4zTv41BZSuj
4Zj2etELXAnkIWMZXXdeKSz8IoSOlgjsCR482QmzTRuoFdmVLoJgsgh9hO2+oK07
vJDEVJnXKxq1pF6DVbYwuzNgPB1pRzg6Y8EitdH3SBgjsOQ/m4RpKaqN0MkQhYrB
BI9PCiF7GKCzNfNl3b/Op+HSsZ6aJkOsQPzC89/Ww0mB8azypgAwvLsjte5R3hib
fS8z+V+lxAAC0/M981TsPKJ96HNv4VXbLI3bY9qJBd2LB+NDortsF/OkUuObsh+6
hVQCWnae5V5iq8WmMTFABfBAK5lHafop87edn6X33LOyZZQCT9svNz/zrpakXh0e
/SBGTUudtjCOy/BLj1VQrb6tczEzos7Ij+bszeIBAcSzrWg8gtVPjw9TvuAo2b/b
9nwR4yrm0Q==
=db/l
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree updates from Rob Herring:
- Unit test for overlays with GPIO hogs
- Improve dma-ranges parsing to handle dma-ranges with multiple entries
- Update dtc to upstream version v1.6.0-2-g87a656ae5ff9
- Improve overlay error reporting
- Device link support for power-domains and hwlocks bindings
- Add vendor prefixes for Beacon, Topwise, ENE, Dell, SG Micro, Elida,
PocketBook, Xiaomi, Linutronix, OzzMaker, Waveshare Electronics, and
ITE Tech
- Add deprecated Marvell vendor prefix 'mrvl'
- A bunch of binding conversions to DT schema continues. Of note, the
common serial and USB connector bindings are converted.
- Add more Arm CPU compatibles
- Drop Mark Rutland as DT maintainer :(
* tag 'devicetree-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (106 commits)
MAINTAINERS: drop an old reference to stm32 pwm timers doc
MAINTAINERS: dt: update etnaviv file reference
dt-bindings: usb: dwc2: fix bindings for amlogic, meson-gxbb-usb
dt-bindings: uniphier-system-bus: fix warning in the example
dt-bindings: display: meson-vpu: fix indentation of reg-names' "items"
dt-bindings: iio: Fix adi, ltc2983 uint64-matrix schema constraints
dt-bindings: power: Fix example for power-domain
dt-bindings: arm: Add some constraints for PSCI nodes
of: some unittest overlays not untracked
of: gpio unittest kfree() wrong object
dt-bindings: phy: convert phy-rockchip-inno-usb2 bindings to yaml
dt-bindings: serial: sh-sci: Convert to json-schema
dt-bindings: serial: Document serialN aliases
dt-bindings: thermal: tsens: Set 'additionalProperties: false'
dt-bindings: thermal: tsens: Fix nvmem-cell-names schema
dt-bindings: vendor-prefixes: Add Beacon vendor prefix
dt-bindings: vendor-prefixes: Add Topwise
of: of_private.h: Replace zero-length array with flexible-array member
docs: dt: fix a broken reference to input.yaml
docs: dt: fix references to ap806-system-controller.txt
...
update changing all our txt files to rst ones. Excluding that, we
have the usual driver updates (qla2xxx, ufs, lpfc, zfcp, ibmvfc,
pm80xx, aacraid), a treewide update for scnprintf and some other minor
updates. The major core update is Hannes moving functions out of the
aacraid driver and into the core.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXoYKiyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishSasAP4iGwSB
Y8tFaZgWadu76+wj5MdqTBoXdhnIuFF0rZG3pQEAiIKdsfQlbSFdm75+gUtx5hG/
GOilX/pJczTRJDCGNis=
=g7Sk
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series has a huge amount of churn because it pulls in Mauro's doc
update changing all our txt files to rst ones.
Excluding that, we have the usual driver updates (qla2xxx, ufs, lpfc,
zfcp, ibmvfc, pm80xx, aacraid), a treewide update for scnprintf and
some other minor updates.
The major core change is Hannes moving functions out of the aacraid
driver and into the core"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (223 commits)
scsi: aic7xxx: aic97xx: Remove FreeBSD-specific code
scsi: ufs: Do not rely on prefetched data
scsi: dc395x: remove dc395x_bios_param
scsi: libiscsi: Fix error count for active session
scsi: hpsa: correct race condition in offload enabled
scsi: message: fusion: Replace zero-length array with flexible-array member
scsi: qedi: Add PCI shutdown handler support
scsi: qedi: Add MFW error recovery process
scsi: ufs: Enable block layer runtime PM for well-known logical units
scsi: ufs-qcom: Override devfreq parameters
scsi: ufshcd: Let vendor override devfreq parameters
scsi: ufshcd: Update the set frequency to devfreq
scsi: ufs: Resume ufs host before accessing ufs device
scsi: ufs-mediatek: customize the delay for enabling host
scsi: ufs: make HCE polling more compact to improve initialization latency
scsi: ufs: allow custom delay prior to host enabling
scsi: ufs-mediatek: use common delay function
scsi: ufs: introduce common and flexible delay function
scsi: ufs: use an enum for host capabilities
scsi: ufs: fix uninitialized tx_lanes in ufshcd_disable_tx_lcc()
...
* GICv4.1 support
* 32bit host removal
PPC:
* secure (encrypted) using under the Protected Execution Framework
ultravisor
s390:
* allow disabling GISA (hardware interrupt injection) and protected
VMs/ultravisor support.
x86:
* New dirty bitmap flag that sets all bits in the bitmap when dirty
page logging is enabled; this is faster because it doesn't require bulk
modification of the page tables.
* Initial work on making nested SVM event injection more similar to VMX,
and less buggy.
* Various cleanups to MMU code (though the big ones and related
optimizations were delayed to 5.8). Instead of using cr3 in function
names which occasionally means eptp, KVM too has standardized on "pgd".
* A large refactoring of CPUID features, which now use an array that
parallels the core x86_features.
* Some removal of pointer chasing from kvm_x86_ops, which will also be
switched to static calls as soon as they are available.
* New Tigerlake CPUID features.
* More bugfixes, optimizations and cleanups.
Generic:
* selftests: cleanups, new MMU notifier stress test, steal-time test
* CSV output for kvm_stat.
KVM/MIPS has been broken since 5.5, it does not compile due to a patch committed
by MIPS maintainers. I had already prepared a fix, but the MIPS maintainers
prefer to fix it in generic code rather than KVM so they are taking care of it.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl6GOnIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMfxwf/ZKLZiRoaovXCOG71M/eHtQb8ZIqU
3MPy+On3eC5Sk/aBxWUL9EFZsbYG6kYdbZ1VOvG9XPBoLlnkDSm/IR0kaELHtnjj
oGVda/tvGn46Ne39y8xBptmb91WDcWH0vFthT/CwlMxAw3xjr+gG7Qyo+8F2CW6m
SSSuLiHSBnyO1cQKruBTHZ8qnR8LlnfXEqtd6Y4LFLic0LbLIoIdRcT3wjQrcZrm
Djd7wbTEYZjUfoqZ72ekwEDUsONcDLDSKcguDO9pSMSCGhpxCVT5Vy68KRpoIMs2
nzNWDKjvqQo5zb2+GWxJgkd12Hv+n7PCXZMbVrWBu1pQsewUns9m4mkpGw==
=6fGt
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- GICv4.1 support
- 32bit host removal
PPC:
- secure (encrypted) using under the Protected Execution Framework
ultravisor
s390:
- allow disabling GISA (hardware interrupt injection) and protected
VMs/ultravisor support.
x86:
- New dirty bitmap flag that sets all bits in the bitmap when dirty
page logging is enabled; this is faster because it doesn't require
bulk modification of the page tables.
- Initial work on making nested SVM event injection more similar to
VMX, and less buggy.
- Various cleanups to MMU code (though the big ones and related
optimizations were delayed to 5.8). Instead of using cr3 in
function names which occasionally means eptp, KVM too has
standardized on "pgd".
- A large refactoring of CPUID features, which now use an array that
parallels the core x86_features.
- Some removal of pointer chasing from kvm_x86_ops, which will also
be switched to static calls as soon as they are available.
- New Tigerlake CPUID features.
- More bugfixes, optimizations and cleanups.
Generic:
- selftests: cleanups, new MMU notifier stress test, steal-time test
- CSV output for kvm_stat"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (277 commits)
x86/kvm: fix a missing-prototypes "vmread_error"
KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y
KVM: VMX: Add a trampoline to fix VMREAD error handling
KVM: SVM: Annotate svm_x86_ops as __initdata
KVM: VMX: Annotate vmx_x86_ops as __initdata
KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup()
KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection
KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes
KVM: VMX: Configure runtime hooks using vmx_x86_ops
KVM: VMX: Move hardware_setup() definition below vmx_x86_ops
KVM: x86: Move init-only kvm_x86_ops to separate struct
KVM: Pass kvm_init()'s opaque param to additional arch funcs
s390/gmap: return proper error code on ksm unsharing
KVM: selftests: Fix cosmetic copy-paste error in vm_mem_region_move()
KVM: Fix out of range accesses to memslots
KVM: X86: Micro-optimize IPI fastpath delay
KVM: X86: Delay read msr data iff writes ICR MSR
KVM: PPC: Book3S HV: Add a capability for enabling secure guests
KVM: arm64: GICv4.1: Expose HW-based SGIs in debugfs
KVM: arm64: GICv4.1: Allow non-trapping WFI when using HW SGIs
...
Pull integrity updates from Mimi Zohar:
"Just a couple of updates for linux-5.7:
- A new Kconfig option to enable IMA architecture specific runtime
policy rules needed for secure and/or trusted boot, as requested.
- Some message cleanup (eg. pr_fmt, additional error messages)"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: add a new CONFIG for loading arch-specific policies
integrity: Remove duplicate pr_fmt definitions
IMA: Add log statements for failure conditions
IMA: Update KBUILD_MODNAME for IMA files to ima
Merge updates from Andrew Morton:
"A large amount of MM, plenty more to come.
Subsystems affected by this patch series:
- tools
- kthread
- kbuild
- scripts
- ocfs2
- vfs
- mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
hugetlbfs, hugetlb"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
selftests/vm: fix map_hugetlb length used for testing read and write
mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
mm/hugetlb.c: clean code by removing unnecessary initialization
hugetlb_cgroup: add hugetlb_cgroup reservation docs
hugetlb_cgroup: add hugetlb_cgroup reservation tests
hugetlb: support file_region coalescing again
hugetlb_cgroup: support noreserve mappings
hugetlb_cgroup: add accounting for shared mappings
hugetlb: disable region_add file_region coalescing
hugetlb_cgroup: add reservation accounting for private mappings
mm/hugetlb_cgroup: fix hugetlb_cgroup migration
hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
hugetlb_cgroup: add hugetlb_cgroup reservation counter
hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
mm/memblock.c: remove redundant assignment to variable max_addr
mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
...
After introducing mem sub section concept, pfn_present() loses its literal
meaning, and will not be necessary a truth on partial populated mem
section.
Since all of the callers use it to judge an absent section, it is better
to rename pfn_present() as pfn_in_present_section().
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Leonardo Bras <leonardo@linux.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Link: http://lkml.kernel.org/r/1581919110-29575-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to retry once. We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time. This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle the
page fault on a single page. However that should hardly happen, and after
all for each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
now can retry the page fault for multiple times if necessary without the
need to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault allows to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In existing code we have multiple places that has taken special care of
the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
even the 2nd try will have the ALLOW_RETRY set, then use that helper in
all existing special paths. One example is in __lock_page_or_retry(), now
we'll drop the mmap_sem only in the first attempt of page fault and we'll
keep it in follow up retries, so old locking behavior will be retained.
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although there're tons of arch-specific page fault handlers, most of them
are still sharing the same initial value of the page fault flags. Say,
merely all of the page fault handlers would allow the fault to be retried,
and they also allow the fault to respond to SIGKILL.
Let's define a default value for the fault flags to replace those initial
page fault flags that were copied over. With this, it'll be far easier to
introduce new fault flag that can be used by all the architectures instead
of touching all the archs.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160238.9694-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let powerpc code to use the new helper, by moving the signal handling
earlier before the retry logic.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160222.9422-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Idea of a foreign VMA with respect to the present context is very generic.
But currently there are two identical definitions for this in powerpc and
x86 platforms. Lets consolidate those redundant definitions while making
vma_is_foreign() available for general use later. This should not cause
any functional change.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Link: http://lkml.kernel.org/r/1582782965-3274-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change a header to mandatory-y if both of the following are met:
[1] At least one architecture (except um) specifies it as generic-y in
arch/*/include/asm/Kbuild
[2] Every architecture (except um) either has its own implementation
(arch/*/include/asm/*.h) or specifies it as generic-y in
arch/*/include/asm/Kbuild
This commit was generated by the following shell script.
----------------------------------->8-----------------------------------
arches=$(cd arch; ls -1 | sed -e '/Kconfig/d' -e '/um/d')
tmpfile=$(mktemp)
grep "^mandatory-y +=" include/asm-generic/Kbuild > $tmpfile
find arch -path 'arch/*/include/asm/Kbuild' |
xargs sed -n 's/^generic-y += \(.*\)/\1/p' | sort -u |
while read header
do
mandatory=yes
for arch in $arches
do
if ! grep -q "generic-y += $header" arch/$arch/include/asm/Kbuild &&
! [ -f arch/$arch/include/asm/$header ]; then
mandatory=no
break
fi
done
if [ "$mandatory" = yes ]; then
echo "mandatory-y += $header" >> $tmpfile
for arch in $arches
do
sed -i "/generic-y += $header/d" arch/$arch/include/asm/Kbuild
done
fi
done
sed -i '/^mandatory-y +=/d' include/asm-generic/Kbuild
LANG=C sort $tmpfile >> include/asm-generic/Kbuild
----------------------------------->8-----------------------------------
One obvious benefit is the diff stat:
25 files changed, 52 insertions(+), 557 deletions(-)
It is tedious to list generic-y for each arch that needs it.
So, mandatory-y works like a fallback default (by just wrapping
asm-generic one) when arch does not have a specific header
implementation.
See the following commits:
def3f7cefea1b39bae16
It is tedious to convert headers one by one, so I processed by a shell
script.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/20200210175452.5030-1-masahiroy@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge the 32bit and 64bit version.
Halve the check constants on 32bit.
Use STACK_TOP since it is defined.
Passing is_64 is now redundant since is_32bit_task() is used to
determine which callchain variant should be used. Use STACK_TOP and
is_32bit_task() directly.
This removes a page from the valid 32bit area on 64bit:
#define TASK_SIZE_USER32 (0x0000000100000000UL - (1 * PAGE_SIZE))
#define STACK_TOP_USER32 TASK_SIZE_USER32
Change return value to bool. It is inverted by users anyway.
Change to invalid_user_sp to avoid inverting the return value twice.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/be8e40fc0737fb28ad08b198552dee7cac1c5ce2.1584699455.git.msuchanek@suse.de
There are two almost identical copies for 32bit and 64bit.
The function is used only in 32bit code which will be split out in next
patch so consolidate to one function.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0c21c919ed1296420199c78f7c3cfd29d3c7e909.1584699455.git.msuchanek@suse.de
Set CONFIG_UEVENT_HELPER=y in ps3_defconfig.
commit 1be01d4a57 (driver: base: Disable
CONFIG_UEVENT_HELPER by default) disabled the CONFIG_UEVENT_HELPER option
that is needed for hotplug and module loading by most older 32bit powerpc
distributions that users typically install on the PS3.
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/410cda9aa1a6e04434dfe1f9aa2103d0694f706c.1585340156.git.geoff@infradead.org
commit <249fad734a25> ""powerpc/perf: Disable trace_imc pmu"
disables IMC(In-Memory Collection) trace-mode in kernel, since frequent
mode switching between accumulation mode and trace mode via the spr LDBAR
in the hardware can trigger a checkstop(system crash).
Patch to re-enable imc-trace mode in kernel.
The previous patch(1/2) in this series will address the mode switching issue
by implementing a global lock, and will restrict the usage of
accumulation and trace-mode at a time.
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200313055238.8656-2-anju@linux.vnet.ibm.com
IMC(In-memory Collection Counters) does performance monitoring in
two different modes, i.e accumulation mode(core-imc and thread-imc events),
and trace mode(trace-imc events). A cpu thread can either be in
accumulation-mode or trace-mode at a time and this is done via the LDBAR
register in POWER architecture. The current design does not address the
races between thread-imc and trace-imc events.
Patch implements a global id and lock to avoid the races between
core, trace and thread imc events. With this global id-lock
implementation, the system can either run core, thread or trace imc
events at a time. i.e. to run any core-imc events, thread/trace imc events
should not be enabled/monitored.
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200313055238.8656-1-anju@linux.vnet.ibm.com
MCE handling on pSeries platform fails as recent rework to use common
code for pSeries and PowerNV in machine check error handling tries to
access per-cpu variables in realmode. The per-cpu variables may be
outside the RMO region on pSeries platform and needs translation to be
enabled for access. Just moving these per-cpu variable into RMO region
did'nt help because we queue some work to workqueues in real mode, which
again tries to touch per-cpu variables. Also fwnmi_release_errinfo()
cannot be called when translation is not enabled.
This patch fixes this by enabling translation in the exception handler
when all required real mode handling is done. This change only affects
the pSeries platform.
Without this fix below kernel crash is seen on injecting
SLB multihit:
BUG: Unable to handle kernel data access on read at 0xc00000027b205950
Faulting instruction address: 0xc00000000003b7e0
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: mcetest_slb(OE+) af_packet(E) xt_tcpudp(E) ip6t_rpfilter(E) ip6t_REJECT(E) ipt_REJECT(E) xt_conntrack(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) ip6table_nat(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) xfs(E) ibmveth(E) vmx_crypto(E) gf128mul(E) uio_pdrv_genirq(E) uio(E) crct10dif_vpmsum(E) rtc_generic(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) sr_mod(E) sd_mod(E) cdrom(E) ibmvscsi(E) scsi_transport_srp(E) crc32c_vpmsum(E) dm_mod(E) sg(E) scsi_mod(E)
CPU: 34 PID: 8154 Comm: insmod Kdump: loaded Tainted: G OE 5.5.0-mahesh #1
NIP: c00000000003b7e0 LR: c0000000000f2218 CTR: 0000000000000000
REGS: c000000007dcb960 TRAP: 0300 Tainted: G OE (5.5.0-mahesh)
MSR: 8000000000001003 <SF,ME,RI,LE> CR: 28002428 XER: 20040000
CFAR: c0000000000f2214 DAR: c00000027b205950 DSISR: 40000000 IRQMASK: 0
GPR00: c0000000000f2218 c000000007dcbbf0 c000000001544800 c000000007dcbd70
GPR04: 0000000000000001 c000000007dcbc98 c008000000d00258 c0080000011c0000
GPR08: 0000000000000000 0000000300000003 c000000001035950 0000000003000048
GPR12: 000000027a1d0000 c000000007f9c000 0000000000000558 0000000000000000
GPR16: 0000000000000540 c008000001110000 c008000001110540 0000000000000000
GPR20: c00000000022af10 c00000025480fd70 c008000001280000 c00000004bfbb300
GPR24: c000000001442330 c00800000800000d c008000008000000 4009287a77000510
GPR28: 0000000000000000 0000000000000002 c000000001033d30 0000000000000001
NIP [c00000000003b7e0] save_mce_event+0x30/0x240
LR [c0000000000f2218] pseries_machine_check_realmode+0x2c8/0x4f0
Call Trace:
Instruction dump:
3c4c0151 38429050 7c0802a6 60000000 fbc1fff0 fbe1fff8 f821ffd1 3d42ffaf
3fc2ffaf e98d0030 394a1150 3bdef530 <7d6a62aa> 1d2b0048 2f8b0063 380b0001
---[ end trace 46fd63f36bbdd940 ]---
Fixes: 9ca766f989 ("powerpc/64s/pseries: machine check convert to use common event code")
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200320110119.10207-1-ganeshgr@linux.ibm.com
Commit 3282a3da25 ("powerpc/64: Implement soft interrupt replay in C")
broke the doorbell wakeup optimisation introduced by commit a9af97aa0a
("powerpc/64s: msgclr when handling doorbell exceptions from system
reset").
This patch restores the msgclr, in C code. It's now done in the system
reset wakeup path rather than doorbell interrupt replay where it used
to be, because it is always the right thing to do in the wakeup case,
but it may be rarely of use in other interrupt replay situations in
which case it's wasted work - we would have to run measurements to see
if that was a worthwhile optimisation, and I suspect it would not be.
The results are similar to those in the original commit, test on POWER8
of context_switch selftests benchmark with polling idle disabled (e.g.,
always nap, giving cross-CPU IPIs) gives the following results:
broken patched
Different threads, same core: 317k/s 375k/s +18.7%
Different cores: 280k/s 282k/s +1.0%
Fixes: 3282a3da25 ("powerpc/64: Implement soft interrupt replay in C")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200402121212.1118218-1-npiggin@gmail.com
This series focuses on corner case bug fixes and general clarity
improvements to hmm_range_fault().
- 9 bug fixes
- Allow pgmap to track the 'owner' of a DEVICE_PRIVATE - in this case the
owner tells the driver if it can understand the DEVICE_PRIVATE page or
not. Use this to resolve a bug in nouveau where it could touch
DEVICE_PRIVATE pages from other drivers.
- Remove a bunch of dead, redundant or unused code and flags
- Clarity improvements to hmm_range_fault()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl6CUDEACgkQOG33FX4g
mxqXgQ/+I2o4QIq86xTddRCesINab73sA4bdvIYgixojTBvo5a9Tu9JO3nmzJOEi
mmUXsdN+MnyTdnrkcfbEvo0b5ktCGlwGYQmcRv9MrB1QvSChg6Zy3F3cWR+dsv5c
Mo3z0sWRx+//dVSI0p1BMhd5rym0nZCh/rUO94C/XgZ3YoC+MjX4lKeCMvjJpZFc
DDEN3Z+zjOvjeiT9TMNMmkPnZSY+N4RQwTRKek5l95QwcHyFy5SBI+uzOHyE0lVw
KG5+yk+TFRMoiHjZh4BFqkibQbVKG0qMKiOymIncTr3kL4DK0Hwf/4Zk4DMKucEb
Rs2B7C+UShiyIOzdjbnBsqPiHevAhR3nOpozJy3x/Z6fLapVoIlVx3UJJ/djuxZa
7+H2VzxKz1xGRDH4Js/WD0smZ8jisA04vW5THJVFr0Vd2sqKBo3G/5ay2Kw9NX7T
MRII1KkA/luZbmM6WLEQkliVuNkMCsVUU3hiY96tLI7uN73M9fQUe6s3NubeoFxi
UKg0zuMsAlsJxkHGeY+IMHtUR/law+k9/aZoB4idMGwV/i7KiiolbRcB0H5yn4L5
OV+4uxVBFS/n37sCpMwpx5WJtgbPzww3b6Cd0hfIUqnQzSOjP40Pl5ZX/c/2M7ps
bUdZjve5j653YeC/7NBPDprvzbfmyyxJdZZZzzXLQl5kyL2o9LE=
=UEpV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This series focuses on corner case bug fixes and general clarity
improvements to hmm_range_fault(). It arose from a review of
hmm_range_fault() by Christoph, Ralph and myself.
hmm_range_fault() is being used by these 'SVM' style drivers to
non-destructively read the page tables. It is very similar to
get_user_pages() except that the output is an array of PFNs and
per-pfn flags, and it has various modes of reading.
This is necessary before RDMA ODP can be converted, as we don't want
to have weird corner case regressions, which is still a looking
forward item. Ralph has a nice tester for this routine, but it is
waiting for feedback from the selftests maintainers.
Summary:
- 9 bug fixes
- Allow pgmap to track the 'owner' of a DEVICE_PRIVATE - in this case
the owner tells the driver if it can understand the DEVICE_PRIVATE
page or not. Use this to resolve a bug in nouveau where it could
touch DEVICE_PRIVATE pages from other drivers.
- Remove a bunch of dead, redundant or unused code and flags
- Clarity improvements to hmm_range_fault()"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (25 commits)
mm/hmm: return error for non-vma snapshots
mm/hmm: do not set pfns when returning an error code
mm/hmm: do not unconditionally set pfns when returning EBUSY
mm/hmm: use device_private_entry_to_pfn()
mm/hmm: remove HMM_FAULT_SNAPSHOT
mm/hmm: remove unused code and tidy comments
mm/hmm: return the fault type from hmm_pte_need_fault()
mm/hmm: remove pgmap checking for devmap pages
mm/hmm: check the device private page owner in hmm_range_fault()
mm: simplify device private page handling in hmm_range_fault
mm: handle multiple owners of device private pages in migrate_vma
memremap: add an owner field to struct dev_pagemap
mm: merge hmm_vma_do_fault into into hmm_vma_walk_hole_
mm/hmm: don't handle the non-fault case in hmm_vma_walk_hole_()
mm/hmm: simplify hmm_vma_walk_hugetlb_entry()
mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flag
mm/hmm: don't provide a stub for hmm_range_fault()
mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()
mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling
mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages
...
Currently, CONFIG_VHOST depends on CONFIG_VIRTUALIZATION. But vhost is
not necessarily for VM since it's a generic userspace and kernel
communication protocol. Such dependency may prevent archs without
virtualization support from using vhost.
To solve this, a dedicated vhost menu is created under drivers so
CONIFG_VHOST can be decoupled out of CONFIG_VIRTUALIZATION.
While at it, also squash Kconfig.vringh into vhost Kconfig file. This
avoids the trick of conditional inclusion from VOP or CAIF. Then it
will be easier to introduce new vringh users and common dependency for
both vringh and vhost.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20200326140125.19794-2-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Declaring setjmp()/longjmp() as taking longs makes the signature
non-standard, and makes clang complain. In the past, this has been
worked around by adding -ffreestanding to the compile flags.
The implementation looks like it only ever propagates the value
(in longjmp) or sets it to 1 (in setjmp), and we only call longjmp
with integer parameters.
This allows removing -ffreestanding from the compilation flags.
Fixes: c9029ef9c9 ("powerpc: Avoid clang warnings around setjmp and longjmp")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Clement Courbet <courbet@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200330080400.124803-1-courbet@google.com
Before checking for cpu_type == NULL, this same copy happens, so doing
it here will just write the same value to the t->oprofile_type
again.
Remove the repeated copy, as it is unnecessary.
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200215053637.280880-1-leonardo@linux.ibm.com
GCC v8 defaults to enabling -fasynchronous-unwind-tables due to
https://gcc.gnu.org/r259298, which results in .eh_frame section being
generated. This results in additional disk usage by the build, as well
as the kernel modules. Since the kernel has no use for this, this
section is discarded.
Add -fno-asynchronous-unwind-tables to KBUILD_CFLAGS to suppress
generation of .eh_frame section. Note that our VDSOs need .eh_frame, but
are not affected by this change since our VDSO code are all in assembly.
Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1ed7cd84a7d1a3180b30c0c60e70eed8bb8b40c3.1583415544.git.naveen.n.rao@linux.vnet.ibm.com
The ISA_DMA_THRESHOLD variable is set by several platforms but never
referenced.
Remove it.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191125092033.20014-1-rppt@kernel.org
Relocatable kernel builds produce a warning about .gnu.hash being an
orphan section:
ld: warning: orphan section `.gnu.hash' from `linker stubs' being placed in section `.gnu.hash'
If we try to discard it the build fails:
ld -EL -m elf64lppc -pie --orphan-handling=warn --build-id -o
.tmp_vmlinux1 -T ./arch/powerpc/kernel/vmlinux.lds --whole-archive
arch/powerpc/kernel/head_64.o arch/powerpc/kernel/entry_64.o
...
sound/built-in.a net/built-in.a virt/built-in.a --no-whole-archive
--start-group lib/lib.a --end-group
ld: could not find section .gnu.hash
So add an entry to explicitly retain it, as we do for .hash.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200227045933.22967-1-mpe@ellerman.id.au
This reconciles interrupts in the system call case like all other
interrupts. This allows system_call_common to be shared with the scv
system call implementation in a subsequent patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-31-npiggin@gmail.com
Regular interrupt return restores NVGPRS whereas lite returns do not.
This is clumsy: most interrupts can return without restoring NVGPRS in
most of the time, but there are special cases that require it (when
registers have been modified by the kernel). So change interrupt
return to not restore NVGPRS, and have interrupt handlers restore them
explicitly in the cases that requires it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-30-npiggin@gmail.com
Implement the bulk of interrupt return logic in C. The asm return code
must handle a few cases: restoring full GPRs, and emulating stack
store.
The stack store emulation is significantly simplfied, rather than
creating a new return frame and switching to that before performing
the store, it uses the PACA to keep a scratch register around to
perform the store.
The asm return code is moved into 64e for now. The new logic has made
allowance for 64e, but I don't have a full environment that works well
to test it, and even booting in emulated qemu is not great for stress
testing. 64e shouldn't be too far off working with this, given a bit
more testing and auditing of the logic.
This is slightly faster on a POWER9 (page fault speed increases about
1.1%), probably due to reduced mtmsrd.
mpe: Includes fixes from Nick for _TIF_EMULATE_STACK_STORE
handling (including the fast_interrupt_return path), to remove
trace_hardirqs_on(), and fixes the interrupt-return part of the
MSR_VSX restore bug caught by tm-unavailable selftest.
mpe: Incorporate fix from Nick:
The return-to-kernel path has to replay any soft-pending interrupts if
it is returning to a context that had interrupts soft-enabled. It has
to do this carefully and avoid plain enabling interrupts if this is an
irq context, which can cause multiple nesting of interrupts on the
stack, and other unexpected issues.
The code which avoided this case got the soft-mask state wrong, and
marked interrupts as enabled before going around again to retry. This
seems to be mostly harmless except when PREEMPT=y, this calls
preempt_schedule_irq with irqs apparently enabled and runs into a BUG
in kernel/sched/core.c
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-29-npiggin@gmail.com
When local_irq_enable() finds a pending soft-masked interrupt, it
"replays" it by setting up registers like the initial interrupt entry,
then calls into the low level handler to set up an interrupt stack
frame and process the interrupt.
This is not necessary, and uses more stack than needed. The high level
interrupt handler can be called directly from C, with just pt_regs set
up on stack. This should be faster and use less stack.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-28-npiggin@gmail.com
Kernel addresses and potentially other sensitive data could be leaked
in volatile registers after a syscall.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-27-npiggin@gmail.com
System call entry and particularly exit code is beyond the limit of
what is reasonable to implement in asm.
This conversion moves all conditional branches out of the asm code,
except for the case that all GPRs should be restored at exit.
Null syscall test is about 5% faster after this patch, because the
exit work is handled under local_irq_disable, and the hard mask and
pending interrupt replay is handled after that, which avoids games
with MSR.
mpe: Includes subsequent fixes from Nick:
This fixes 4 issues caught by TM selftests. First was a tm-syscall bug
that hit due to tabort_syscall being called after interrupts were
reconciled (in a subsequent patch), which led to interrupts being
enabled before tabort_syscall was called. Rather than going through an
un-reconciling interrupts for the return, I just go back to putting
the test early in asm, the C-ification of that wasn't a big win
anyway.
Second is the syscall return _TIF_USER_WORK_MASK check would go into
an infinite loop if _TIF_RESTORE_TM became set. The asm code uses
_TIF_USER_WORK_MASK to brach to slowpath which includes
restore_tm_state.
Third is system call return was not calling restore_tm_state, I missed
this completely (alhtough it's in the return from interrupt C
conversion because when the asm syscall code encountered problems it
would branch to the interrupt return code.
Fourth is MSR_VEC missing from restore_math, which was caught by
tm-unavailable selftest taking an unexpected facility unavailable
interrupt when testing VSX unavailble exception with MSR.FP=1
MSR.VEC=1. Fourth case also has a fixup in a subsequent patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-26-npiggin@gmail.com
powerpc has an optimisation where interrupts avoid saving the
non-volatile (or callee saved) registers to the interrupt stack frame
if they are not required.
Two problems with this are that an interrupt does not always know
whether it will need non-volatiles; and if it does need them, they can
only be saved from the entry-scoped asm code (because we don't control
what the C compiler does with these registers).
system calls are the most difficult: some system calls always require
all registers (e.g., fork, to copy regs into the child). Sometimes
registers are only required under certain conditions (e.g., tracing,
signal delivery). These cases require ugly logic in the call
chains (e.g., ppc_fork), and require a lot of logic to be implemented
in asm.
So remove the optimisation for system calls, and always save NVGPRs on
entry. Modern high performance CPUs are not so sensitive, because the
stores are dense in cache and can be hidden by other expensive work in
the syscall path -- the null syscall selftests benchmark on POWER9 is
not slowed (124.40ns before and 123.64ns after, i.e., within the
noise).
Other interrupts retain the NVGPR optimisation for now.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-24-npiggin@gmail.com
The soft NMI handler does not reconcile interrupt state, so it should
not return via the normal ret_from_except path. Return like other NMIs,
using the EXCEPTION_RESTORE_REGS macro.
This becomes important when the scv interrupt is implemented, which
must handle soft-masked interrupts that have r13 set to something
other than the PACA -- returning to kernel in this case must restore
r13.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-23-npiggin@gmail.com
This adds IRQ_HARD_DIS to irq_happened. Although it doesn't seem to
matter much because we're not allowed to enable irqs in an NMI
handler, the soft-irq debugging code is becoming more strict about
ensuring IRQ_HARD_DIS is in sync with MSR[EE], this may help avoid
asserts or other issues.
Add a comment explaining why MCE does not have this. Early machine
check is generally much smaller and more contained code which will
explode if you look at it wrong anyway as it runs in real mode, though
there's an argument that we should do similar reconciling for the MCE
as well.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-22-npiggin@gmail.com
Apart from SRESET, MCE, and syscall (hcall variant), the SRR type
interrupts are not escalated to hypervisor mode, so are delivered to
the OS.
When running PR KVM, the OS is the hypervisor, and the guest runs with
MSR[PR]=1 (ie. usermode), so these interrupts must test if a guest was
running when interrupted. These tests are required at the real-mode
entry points because the PR KVM host runs with LPCR[AIL]=0.
In HV KVM and nested HV KVM, the guest always receives these
interrupts, so there is no need for the host to make this test. So
remove the tests if PR KVM is not configured.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-21-npiggin@gmail.com
A few of the non-standard handlers are left uncommented. Some more
description could be added to some.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-20-npiggin@gmail.com
The hdec interrupt handler is reported to sometimes fire in Linux if
KVM leaves it pending after a guest exists. This is harmless, so there
is a no-op handler for it.
The interrupt handler currently uses the regular kernel stack. Change
this to avoid touching the stack entirely.
This should be the last place where the regular Linux stack can be
accessed with asynchronous interrupts (including PMI) soft-masked.
It might be possible to take advantage of this invariant, e.g., to
context switch the kernel stack SLB entry without clearing MSR[EE].
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-17-npiggin@gmail.com
Replace IEARLY=1 and IEARLY=2 with IBRANCH_COMMON, which controls if
the entry code branches to a common handler; and IREALMODE_COMMON,
which controls whether the common handler should remain in real mode.
These special cases no longer avoid loading the SRR registers, there
is no point as most of them load the registers immediately anyway.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-14-npiggin@gmail.com
This allows more code to be moved out of unrelocated regions. The
system call KVMTEST is changed to be open-coded and remain in the
tramp area to avoid having to move it to entry_64.S. The custom nature
of the system call entry code means the hcall case can be made more
streamlined than regular interrupt handlers.
mpe: Incorporate fix from Nick:
Moving KVM test to the common entry code missed the case of HMI and
MCE, which do not do __GEN_COMMON_ENTRY (because they don't want to
switch to virt mode).
This means a MCE or HMI exception that is taken while KVM is running a
guest context will not be switched out of that context, and KVM won't
be notified. Found by running sigfuz in guest with patched host on
POWER9 DD2.3, which causes some TM related HMI interrupts (which are
expected and supposed to be handled by KVM).
This fix adds a __GEN_REALMODE_COMMON_ENTRY for those handlers to add
the KVM test. This makes them look a little more like other handlers
that all use __GEN_COMMON_ENTRY.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-13-npiggin@gmail.com
As well as moving code out of the unrelocated vectors, this allows the
masked handlers to be moved to common code, and allows the soft_nmi
handler to be generated more like a regular handler.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-12-npiggin@gmail.com