This commit enables seccomp filter on powerpc, now that we have all the
necessary pieces in place.
To support seccomp's desire to modify the syscall return value under
some circumstances, we use a different ABI to the ptrace ABI. That is we
use r3 as the syscall return value, and orig_gpr3 is the first syscall
parameter.
This means the seccomp code, or a ptracer via SECCOMP_RET_TRACE, will
see -ENOSYS preloaded in r3. This is identical to the behaviour on x86,
and allows seccomp or the ptracer to either leave the -ENOSYS or change
it to something else, as well as rejecting or not the syscall by
modifying r0.
If seccomp does not reject the syscall, we restore the register state to
match what ptrace and audit expect, ie. r3 is the first syscall
parameter again. We do this restore using orig_gpr3, which may have been
modified by seccomp, which allows seccomp to modify the first syscall
paramater and allow the syscall to proceed.
We need to #ifdef the the additional handling of r3 for seccomp, so move
it all out of line.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
SIG_SYS was added in commit a0727e8ce5 "signal, x86: add SIGSYS info
and make it synchronous."
Because we use the asm-generic struct siginfo, we got support for
SIG_SYS for free as part of that commit.
However there was no compat handling added for powerpc. That means we've
been advertising the existence of signfo._sifields._sigsys to compat
tasks, but not actually filling in the fields correctly.
Luckily it looks like no one has noticed, presumably because the only
user of SIGSYS in the kernel is seccomp filter, which we don't support
yet.
So before we enable seccomp filter, add compat handling for SIGSYS.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
The documentation for syscall_get_nr() in asm-generic says:
Note this returns int even on 64-bit machines. Only 32 bits of
system call number can be meaningful. If the actual arch value
is 64 bits, this truncates to 32 bits so 0xffffffff means -1.
However our implementation was never updated to reflect this.
Generally it's not important, but there is once case where it matters.
For seccomp filter with SECCOMP_RET_TRACE, the tracer will set
regs->gpr[0] to -1 to reject the syscall. When the task is a compat
task, this means we end up with 0xffffffff in r0 because ptrace will
zero extend the 32-bit value.
If syscall_get_nr() returns an unsigned long, then a 64-bit kernel will
see a positive value in r0 and will incorrectly allow the syscall
through seccomp.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Currently syscall_get_arguments() is used by syscall tracepoints, and
collect_syscall() which is used in some debugging as well as
/proc/pid/syscall.
The current implementation just copies regs->gpr[3 .. 5] out, which is
fine for all the current use cases.
When we enable seccomp filter, that will also start using
syscall_get_arguments(). However for seccomp filter we want to use r3
as the return value of the syscall, and orig_gpr3 as the first
parameter. This will allow seccomp to modify the return value in r3.
To support this we need to modify syscall_get_arguments() to return
orig_gpr3 instead of r3. This is safe for all uses because orig_gpr3
always contains the r3 value that was passed to the syscall. We store it
in the syscall entry path and never modify it.
Update syscall_set_arguments() while we're here, even though it's never
used.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Currently syscall_get_arguments() has two loops, one for compat and one
for regular tasks. In prepartion for the next patch, which changes which
registers we use, switch it to only have one loop, so we only have one
place to update.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Currently the only caller of syscall_set_return_value() is seccomp
filter, which is not enabled on powerpc.
This means we have not noticed that our implementation of
syscall_set_return_value() negates error, even though the value passed
in is already negative.
So remove the negation in syscall_set_return_value(), and expect the
caller to do it like all other implementations do.
Also add a comment about the ccr handling.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
syscall_get_error() is unused, and never has been.
It's also probably wrong, as it negates r3 before returning it, but that
depends on what the caller is expecting.
It also doesn't deal with compat, and doesn't deal with TIF_NOERROR.
Although we could fix those, until it has a caller and it's clear what
semantics the caller wants it's just untested code. So drop it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
The API for calling do_syscall_trace_enter() is currently sensible
enough, it just returns the (modified) syscall number.
However once we enable seccomp filter it will get more complicated. When
seccomp filter runs, the seccomp kernel code (via SECCOMP_RET_ERRNO), or
a ptracer (via SECCOMP_RET_TRACE), may reject the syscall and *may* or may
*not* set a return value in r3.
That means the assembler that calls do_syscall_trace_enter() can not
blindly return ENOSYS, it needs to only return ENOSYS if a return value
has not already been set.
There is no way to implement that logic with the current API. So change
the do_syscall_trace_enter() API to make it deal with the return code
juggling, and the assembler can then just return whatever return code it
is given.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Currently on powerpc we have our own #define for the highest (negative)
errno value, called _LAST_ERRNO. This is defined to be 516, for reasons
which are not clear.
The generic code, and x86, use MAX_ERRNO, which is defined to be 4095.
In particular seccomp uses MAX_ERRNO to restrict the value that a
seccomp filter can return.
Currently with the mismatch between _LAST_ERRNO and MAX_ERRNO, a seccomp
tracer wanting to return 600, expecting it to be seen as an error, would
instead find on powerpc that userspace sees a successful syscall with a
return value of 600.
To avoid this inconsistency, switch powerpc to use MAX_ERRNO.
We are somewhat confident that generic syscalls that can return a
non-error value above negative MAX_ERRNO have already been updated to
use force_successful_syscall_return().
I have also checked all the powerpc specific syscalls, and believe that
none of them expect to return a non-error value between -MAX_ERRNO and
-516. So this change should be safe ...
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
This patch just changes data type of bhrb_users variable from
int to unsigned int because it never contains a negative value.
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Simplify code that extracts a 24x7 counter from the HCALL's result buffer.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix parameter alignment to be consistent with coding style.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The hardware RNG on POWER8 and POWER7+ can be relatively slow, since
it can only supply one 64-bit value per microsecond. Currently we
read it in arch_get_random_long(), but that slows down reading from
/dev/urandom since the code in random.c calls arch_get_random_long()
for every longword read from /dev/urandom.
Since the hardware RNG supplies high-quality entropy on every read, it
matches the semantics of arch_get_random_seed_long() better than those
of arch_get_random_long(). Therefore this commit makes the code use
the POWER8/7+ hardware RNG only for arch_get_random_seed_{long,int}
and not for arch_get_random_{long,int}.
This won't affect any other PowerPC-based platforms because none of
them currently support a hardware RNG. To make it clear that the
ppc_md function pointer is used for arch_get_random_seed_*, we rename
it from get_random_long to get_random_seed.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The EPOW interrupt handler uses rtas_get_sensor(), which in turn
uses rtas_busy_delay() to wait for RTAS becoming ready in case it
is necessary. But rtas_busy_delay() is annotated with might_sleep()
and thus may not be used by interrupts handlers like the EPOW handler!
This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is
enabled:
BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496
in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6
Call Trace:
[c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable)
[c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180
[c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0
[c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0
[c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450
[c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300
[c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0
[c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260
[c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80
[c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200
[c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24
[c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110
[c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180
Fix this issue by introducing a new rtas_get_sensor_fast() function
that does not use rtas_busy_delay() - and thus can only be used for
sensors that do not cause a BUSY condition - known as "fast" sensors.
The EPOW sensor is defined to be "fast" in sPAPR - mpe.
Fixes: 587f83e8dd ("powerpc/pseries: Use rtas_get_sensor in RAS code")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
rtas.h already has some nice #defines for RTAS return status
codes - let's use them instead of hard-coded "magic" values!
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When detecting EEH error on non-existing PE, including the reserved
one, the PE is simply unfrozen without dumping the PHB diag-data,
which is useful for locating the root cause of the EEH error. The
patch dumps the PHB diag-data when non-existing PE reports error.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On LE kernel, the non-existing PE number in BE format derived from
skiboot firmware isn't converted to LE format properly as following
kernel log indicates:
EEH: Clear non-existing PHB#4-PE#200000000000000
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds one helper function 'sigcontext_vmx_regs' which computes
quad word aligned pointer for 'vmx_reserve' array element in sigcontext
structure making the code more readable.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Reword comment and fix build for CONFIG_ALTIVEC=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit ce48b21007 "powerpc: Add VSX context save/restore, ptrace and
signal support" expanded the 'vmx_reserve' array element to contain 101
double words, but the comment block above was not updated.
Also reorder the constants in the array size declaration to reflect the
logic mentioned in the comment block above. This change helps in
explaining how the HW registers are represented in the array. But no
functional change.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Reworded change log and added whitespace around +'s]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently tm_orig_msr is getting used during process context switch only.
Then there is ckpt_regs which saves the checkpointed userspace context
The MSR slot contained in ckpt_regs structure can be used during process
context switch instead of tm_orig_msr, thus allowing us to drop it from
thread_struct structure. This patch does that change.
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Destroy afu->contexts_idr on release of an afu, reclaiming the allocated
memory.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds support for OPAL EPOW (Environmental and Power Warnings)
and DPO (Delayed Power Off) events for the PowerNV platform. These events
are generated on FSP (Flexible Service Processor) based systems. EPOW
events are generated due to various critical system conditions that
require system shutdown. A few examples of these conditions are high
ambient temperature or system running on UPS power with low UPS battery.
DPO event is generated in response to admin initiated system shutdown
request. Upon receipt of EPOW and DPO events the host kernel invokes
orderly_poweroff() for performing graceful system shutdown.
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Acked-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When releasing PE for SRIOV VF, the PE is forced to be frozen
wrongly. When the same PE is picked for another VF, it won't
work anyhow. The patch fixes the issue by unfreezing, not
freezing the VF PE when releasing it.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PELTV of PF PE should include VF PE, which is missed by current
code, so that the VF PE is frozen automatically when freezing PF PE.
The patch fixes the PELTV of PF PE to include VF PE.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PHB3, PE might be reserved in advance to reflect the M64 segments
consumed by the PE according to M64 BARs (exclude VF BARs) of the PCI
devices included in the PE. The PE is picked based on M64 BARs instead
of the bridge's M64 windows, which might include VF BARs. Otherwise,
wrong PE could be picked.
The patch calculates the used M64 segments and PE numbers according to
the M64 BARs, excluding VF BARs, of PCI devices in one particular PE,
instead of the bridge's M64 windows. Then the right PE number is picked.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The patch changes the type of last argument of pnv_ioda_setup_bus_PE()
and phb::pick_m64_pe() to boolean. No functional change.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PHB3, some PEs might be reserved in advance to reflect the M64
segments consumed by those PEs. We're reserving PEs based on the
M64 window of root port, which might contain VF BAR. The PEs for
VFs are allocated dynamically, not reserved based on the consumed
M64 segments. So the M64 window of root port isn't reliable for
the task. Instead, we go through M64 BARs (VF BARs excluded) of
PCI devices under the specified root bus and reserve PEs accordingly,
as the patch does.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PE numbers are reserved according to root port's M64 window,
which is aligned to M64 segment finely. So one PE shouldn't be
reserved for multiple times. We will reserve PE numbers according
to the M64 BARs of PCI device in subsequent patches, which aren't
aligned to M64 segment size finely. It means one particular PE
could be reserved for multiple times.
The patch allows one PE to be reserved for multiple times and we
print the warning message at debugging level.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
mtmsr() does the right thing on 32bit and 64bit, so use it everywhere.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The encoding of the lengths in the ibm_architecture_vec array is
"interesting" to say the least. It's non-obvious how the number of bytes
we provide relates to the length value.
In fact we already got it wrong once, see 11e9ed43ca "Fix up
ibm_architecture_vec definition".
So add some macros to make it (hopefully) clearer. These at least have
the property that the integer present in the code is equal to the number
of bytes that follows it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
This patch adds the ability to the DMA direct ops to fallback to the IOMMU
ops for coherent alloc/free if the coherent mask of the device isn't
suitable for accessing the direct DMA space and the device also happens
to have an active IOMMU table.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that the table and the offset can co-exist, we no longer need
to flip/flop, we can just establish both once at boot time.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To support "hybrid" DMA ops in a subsequent patch, we will need both
a direct DMA offset and an iommu pointer. Those are currently exclusive
(a union), so change them to be separate fields.
While there, also type iommu_table_base properly and make exist only
on CONFIG_PPC64 since it's not referenced on 32-bit at all.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A precision of 16 (%.16llx) has the same effect as a field width of 16
along with passing the 0 flag (%016llx), but the latter is much more
common in the kernel tree. Update cxl to use that.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
C99 says that a precision given as simply '.' with no following digits
or * should be interpreted as 0. The kernel's printf implementation,
however, treats this case as if the precision was omitted. C99 also
says that if both the precision and value are 0, no digits should be
printed. Even if the kernel followed C99 to the letter, I don't think
that would be particularly useful in these cases. For consistency with
most other format strings in the file, use an explicit precision of 16
and add a 0x prefix.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reverts commit dec4f799d0.
Jörg Otte reports a NULL pointder dereference due to this commit, as
'crtc_state' very much can be NULL:
crtc_state = state->base.state ?
intel_atomic_get_crtc_state(state->base.state, intel_crtc) : NULL;
So the change to test 'crtc_state->base.active' cannot possibly be
correct as-is.
There may be some other minimal fix (like just checking crtc_state for
NULL), but I'm just reverting it now for the rc2 release, and people
like Daniel Vetter who actually know this code will figure out what the
right solution is in the longer term.
Reported-and-bisected-by: Jörg Otte <jrg.otte@gmail.com>
Cc: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
CC: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull VFS fixes from Al Viro:
"Fixes for this cycle regression in overlayfs and a couple of
long-standing (== all the way back to 2.6.12, at least) bugs"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
freeing unlinked file indefinitely delayed
fix a braino in ovl_d_select_inode()
9p: don't leave a half-initialized inode sitting around
Pull MIPS fixes from Ralf Baechle:
"A fair number of 4.2 fixes also because Markos opened the flood gates.
- Patch up the math used calculate the location for the page bitmap.
- The FDC (Not what you think, FDC stands for Fast Debug Channel) IRQ
around was causing issues on non-Malta platforms, so move the code
to a Malta specific location.
- A spelling fix replicated through several files.
- Fix to the emulation of an R2 instruction for R6 cores.
- Fix the JR emulation for R6.
- Further patching of mindless 64 bit issues.
- Ensure the kernel won't crash on CPUs with L2 caches with >= 8
ways.
- Use compat_sys_getsockopt for O32 ABI on 64 bit kernels.
- Fix cache flushing for multithreaded cores.
- A build fix"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: O32: Use compat_sys_getsockopt.
MIPS: c-r4k: Extend way_string array
MIPS: Pistachio: Support CDMM & Fast Debug Channel
MIPS: Malta: Make GIC FDC IRQ workaround Malta specific
MIPS: c-r4k: Fix cache flushing for MT cores
Revert "MIPS: Kconfig: Disable SMP/CPS for 64-bit"
MIPS: cps-vec: Use macros for various arithmetics and memory operations
MIPS: kernel: cps-vec: Replace KSEG0 with CKSEG0
MIPS: kernel: cps-vec: Use ta0-ta3 pseudo-registers for 64-bit
MIPS: kernel: cps-vec: Replace mips32r2 ISA level with mips64r2
MIPS: kernel: cps-vec: Replace 'la' macro with PTR_LA
MIPS: kernel: smp-cps: Fix 64-bit compatibility errors due to pointer casting
MIPS: Fix erroneous JR emulation for MIPS R6
MIPS: Fix branch emulation for BLTC and BGEC instructions
MIPS: kernel: traps: Fix broken indentation
MIPS: bootmem: Don't use memory holes for page bitmap
MIPS: O32: Do not handle require 32 bytes from the stack to be readable.
MIPS, CPUFREQ: Fix spelling of Institute.
MIPS: Lemote 2F: Fix build caused by recent mass rename.
Pull x86 fixes from Thomas Gleixner:
- the high latency PIT detection fix, which slipped through the cracks
for rc1
- a regression fix for the early printk mechanism
- the x86 part to plug irq/vector related hotplug races
- move the allocation of the espfix pages on cpu hotplug to non atomic
context. The current code triggers a might_sleep() warning.
- a series of KASAN fixes addressing boot crashes and usability
- a trivial typo fix for Kconfig help text
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kconfig: Fix typo in the CONFIG_CMDLINE_BOOL help text
x86/irq: Retrieve irq data after locking irq_desc
x86/irq: Use proper locking in check_irq_vectors_for_cpu_disable()
x86/irq: Plug irq vector hotplug race
x86/earlyprintk: Allow early_printk() to use console style parameters like '115200n8'
x86/espfix: Init espfix on the boot CPU side
x86/espfix: Add 'cpu' parameter to init_espfix_ap()
x86/kasan: Move KASAN_SHADOW_OFFSET to the arch Kconfig
x86/kasan: Add message about KASAN being initialized
x86/kasan: Fix boot crash on AMD processors
x86/kasan: Flush TLBs after switching CR3
x86/kasan: Fix KASAN shadow region page tables
x86/init: Clear 'init_level4_pgt' earlier
x86/tsc: Let high latency PIT fail fast in quick_pit_calibrate()
Pull timer fixes from Thomas Gleixner:
"This update from the timer departement contains:
- A series of patches which address a shortcoming in the tick
broadcast code.
If the broadcast device is not available or an hrtimer emulated
broadcast device, some of the original assumptions lead to boot
failures. I rather plugged all of the corner cases instead of only
addressing the issue reported, so the change got a little larger.
Has been extensivly tested on x86 and arm.
- Get rid of the last holdouts using do_posix_clock_monotonic_gettime()
- A regression fix for the imx clocksource driver
- An update to the new state callbacks mechanism for clockevents.
This is required to simplify the conversion, which will take place
in 4.3"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Prevent NULL pointer dereference
time: Get rid of do_posix_clock_monotonic_gettime
cris: Replace do_posix_clock_monotonic_gettime()
tick/broadcast: Unbreak CONFIG_GENERIC_CLOCKEVENTS=n build
tick/broadcast: Handle spurious interrupts gracefully
tick/broadcast: Check for hrtimer broadcast active early
tick/broadcast: Return busy when IPI is pending
tick/broadcast: Return busy if periodic mode and hrtimer broadcast
tick/broadcast: Move the check for periodic mode inside state handling
tick/broadcast: Prevent deep idle if no broadcast device available
tick/broadcast: Make idle check independent from mode and config
tick/broadcast: Sanity check the shutdown of the local clock_event
tick/broadcast: Prevent hrtimer recursion
clockevents: Allow set-state callbacks to be optional
clocksource/imx: Define clocksource for mx27
Pull irq fix from Thomas Gleixner:
"A single fix for a cpu hotplug race vs. interrupt descriptors:
Prevent irq setup/teardown across the cpu starting/dying parts of cpu
hotplug so that the starting/dying cpu has a stable view of the
descriptor space. This has been an issue for all architectures in the
cpu dying phase, where interrupts are migrated away from the dying
cpu. In the starting phase its mostly a x86 issue vs the vector space
update"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hotplug: Prevent alloc/free of irq descriptors during cpu up/down
Normally opening a file, unlinking it and then closing will have
the inode freed upon close() (provided that it's not otherwise busy and
has no remaining links, of course). However, there's one case where that
does *not* happen. Namely, if you open it by fhandle with cold dcache,
then unlink() and close().
In normal case you get d_delete() in unlink(2) notice that dentry
is busy and unhash it; on the final dput() it will be forcibly evicted from
dcache, triggering iput() and inode removal. In this case, though, we end
up with *two* dentries - disconnected (created by open-by-fhandle) and
regular one (used by unlink()). The latter will have its reference to inode
dropped just fine, but the former will not - it's considered hashed (it
is on the ->s_anon list), so it will stay around until the memory pressure
will finally do it in. As the result, we have the final iput() delayed
indefinitely. It's trivial to reproduce -
void flush_dcache(void)
{
system("mount -o remount,rw /");
}
static char buf[20 * 1024 * 1024];
main()
{
int fd;
union {
struct file_handle f;
char buf[MAX_HANDLE_SZ];
} x;
int m;
x.f.handle_bytes = sizeof(x);
chdir("/root");
mkdir("foo", 0700);
fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
close(fd);
name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
flush_dcache();
fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
unlink("foo/bar");
write(fd, buf, sizeof(buf));
system("df ."); /* 20Mb eaten */
close(fd);
system("df ."); /* should've freed those 20Mb */
flush_dcache();
system("df ."); /* should be the same as #2 */
}
will spit out something like
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 322023 303843 1131 100% /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 322023 303843 1131 100% /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 322023 283282 21692 93% /
- inode gets freed only when dentry is finally evicted (here we trigger
than by remount; normally it would've happened in response to memory
pressure hell knows when).
Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
when opening a directory we want the overlayfs inode, not one from
the topmost layer.
Reported-By: Andrey Jr. Melnikov <temnota.am@gmail.com>
Tested-By: Andrey Jr. Melnikov <temnota.am@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull libnvdimm fixes from Dan Williams:
"1) Fixes for a handful of smatch reports (Thanks Dan C.!) and minor
bug fixes (patches 1-6)
2) Correctness fixes to the BLK-mode nvdimm driver (patches 7-10).
Granted these are slightly large for a -rc update. They have been
out for review in one form or another since the end of May and were
deferred from the merge window while we settled on the "PMEM API"
for the PMEM-mode nvdimm driver (ie memremap_pmem, memcpy_to_pmem,
and wmb_pmem).
Now that those apis are merged we implement them in the BLK driver
to guarantee that mmio aperture moves stay ordered with respect to
incoming read/write requests, and that writes are flushed through
those mmio-windows and platform-buffers to be persistent on media.
These pass the sub-system unit tests with the updates to
tools/testing/nvdimm, and have received a successful build-report from
the kbuild robot (468 configs).
With acks from Rafael for the touches to drivers/acpi/"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm:
nfit: add support for NVDIMM "latch" flag
nfit: update block I/O path to use PMEM API
tools/testing/nvdimm: add mock acpi_nfit_flush_address entries to nfit_test
tools/testing/nvdimm: fix return code for unimplemented commands
tools/testing/nvdimm: mock ioremap_wt
pmem: add maintainer for include/linux/pmem.h
nfit: fix smatch "use after null check" report
nvdimm: Fix return value of nvdimm_bus_init() if class_create() fails
libnvdimm: smatch cleanups in __nd_ioctl
sparse: fix misplaced __pmem definition
Pull i2c fixes from Wolfram Sang:
"Mostly slight adjusments for new drivers, but also one core fix for
which finally the dependencies are now available as well"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: Mark instantiated device nodes with OF_POPULATE
i2c: jz4780: Fix return value if probe fails
i2c: xgene-slimpro: Fix missing mbox_free_channel call in probe error path
i2c: I2C_MT65XX should depend on HAS_DMA
Pull input fixes from Dmitry Torokhov:
"A fix (revert) for a recent regression in Synaptics driver and a fix
for Elan i2c touchpad driver"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Revert "Input: synaptics - allocate 3 slots to keep stability in image sensors"
Input: elan_i2c - change the hover event from MT to ST
that we added this rc and a handful of driver fixes that came in
during the merge window.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCAAGBQJVoDbjAAoJENidgRMleOc98WkP/05EgvPc/S6FUBKpkZ9aRHMJ
cdFQ7anxcESj4rQo/AV2TQGiHBBqYLEmBt7Bf7LfbWfKu4OC3510PKBjbpsxteQG
mWLNKtDoqlqSBN2wC883HbJFEKb6cMN0IMGzT/zwaolrePfDinjPv/KQsg0eh16Q
nFcgZX0HAgoAHoJG5Kl0pPt5aGrbM0/PdtETasDf5EK9l6UUQmQfHKLfDY/OXO+T
2QD28aeL/FChqQ4MEGNApy0p8fdQPs8sZet48AIz0c9Frjsu4rlXuU0QAJ+jP2Ag
sjq/JaC/bwFnhAqy0vnUtYUFQpf075D4/FZ9AcrFzVpHfu3xJBCJWFvv47zJEvdX
iwunJe7FopCA2/va13RyUQgVzEqGDolipcbsq/ZONTwnAI/Vc2mvX3jEnDAWiejS
BeBzaOa9gKBuVG/Jyl1vXLoDPlEauBGmz8JmNA8wF9wW8Svlw7KKwfpjN0vlHoL/
CYuDIeZiUXaSe4WhxuAsOkHow2PgY+QupbId7rq7WEDPm0UyEuYMHMcKAbpF74d+
zD/dZBwPs2vfsmCYLCKZrDULYeHLTEPjpOf13x40IAyTvKiMsVj07eAnGCbrP1CS
/Tj7dCBSfyuwH6XMjVh0QGMNBdTC5DMqhG2aR4eRHrNXVkNVp6ty/CL7EgF/FrWl
qDW2dMoi3XVkz6rE2Mvb
=vTi8
-----END PGP SIGNATURE-----
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A small set of fixes for problems found by smatch in new drivers that
we added this rc and a handful of driver fixes that came in during the
merge window"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
drivers: clk: st: Incorrect register offset used for lock_status
clk: mediatek: mt8173: Fix enabling of critical clocks
drivers: clk: st: Fix mux bit-setting for Cortex A9 clocks
drivers: clk: st: Add CLK_GET_RATE_NOCACHE flag to clocks
drivers: clk: st: Fix flexgen lock init
drivers: clk: st: Fix FSYN channel values
drivers: clk: st: Remove unused code
clk: qcom: Use parent rate when set rate to pixel RCG clock
clk: at91: do not leak resources
clk: stm32: Fix out-by-one error path in the index lookup
clk: iproc: fix bit manipulation arithmetic
clk: iproc: fix memory leak from clock name