In a similar manner to array_index_mask_nospec, this patch introduces an
assembly macro (mask_nospec64) which can be used to bound a value under
speculation. This macro is then used to ensure that the indirect branch
through the syscall table is bounded under speculation, with out-of-range
addresses speculating as calls to sys_io_setup (0).
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Similarly to x86, mitigate speculation past an access_ok() check by
masking the pointer against the address limit before use.
Even if we don't expect speculative writes per se, it is plausible that
a CPU may still speculate at least as far as fetching a cache line for
writing, hence we also harden put_user() and clear_user() for peace of
mind.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Currently, USER_DS represents an exclusive limit while KERNEL_DS is
inclusive. In order to do some clever trickery for speculation-safe
masking, we need them both to behave equivalently - there aren't enough
bits to make KERNEL_DS exclusive, so we have precisely one option. This
also happens to correct a longstanding false negative for a range
ending on the very top byte of kernel memory.
Mark Rutland points out that we've actually got the semantics of
addresses vs. segments muddled up in most of the places we need to
amend, so shuffle the {USER,KERNEL}_DS definitions around such that we
can correct those properly instead of just pasting "-1"s everywhere.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Provide an optimised, assembly implementation of array_index_mask_nospec()
for arm64 so that the compiler is not in a position to transform the code
in ways which affect its ability to inhibit speculation (e.g. by introducing
conditional branches).
This is similar to the sequence used by x86, modulo architectural differences
in the carry/borrow flags.
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
For CPUs capable of data value prediction, CSDB waits for any outstanding
predictions to architecturally resolve before allowing speculative execution
to continue. Provide macros to expose it to the arch code.
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The identity map is mapped as both writeable and executable by the
SWAPPER_MM_MMUFLAGS and this is relied upon by the kpti code to manage
a synchronisation flag. Update the .pushsection flags to reflect the
actual mapping attributes.
Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
pte_to_phys lives in assembler.h and takes its destination register as
the first argument. Move phys_to_pte out of head.S to sit with its
counterpart and rejig it to follow the same calling convention.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
We don't fully understand the Cavium ThunderX erratum, but it appears
that mapping the kernel as nG can lead to horrible consequences such as
attempting to execute userspace from kernel context. Since kpti isn't
enabled for these CPUs anyway, simplify the comment justifying the lack
of post_ttbr_update_workaround in the exception trampoline.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Since AArch64 assembly instructions take the destination register as
their first operand, do the same thing for the phys_to_ttbr macro.
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cavium ThunderX's erratum 27456 results in a corruption of icache
entries that are loaded from memory that is mapped as non-global
(i.e. ASID-tagged).
As KPTI is based on memory being mapped non-global, let's prevent
it from kicking in if this erratum is detected.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
[will: Update comment]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Defaulting to global mappings for kernel space is generally good for
performance and appears to be necessary for Cavium ThunderX. If we
subsequently decide that we need to enable kpti, then we need to rewrite
our existing page table entries to be non-global. This is fiddly, and
made worse by the possible use of contiguous mappings, which require
a strict break-before-make sequence.
Since the enable callback runs on each online CPU from stop_machine
context, we can have all CPUs enter the idmap, where secondaries can
wait for the primary CPU to rewrite swapper with its MMU off. It's all
fairly horrible, but at least it only runs once.
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Break-before-make is not needed when transitioning from Global to
Non-Global mappings, provided that the contiguous hint is not being used.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
To allow systems which do not require kpti to continue running with
global kernel mappings (which appears to be a requirement for Cavium
ThunderX due to a CPU erratum), make the use of nG in the kernel page
tables dependent on arm64_kernel_unmapped_at_el0(), which is resolved
at runtime.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The ARM architecture defines the memory locations that are permitted
to be accessed as the result of a speculative instruction fetch from
an exception level for which all stages of translation are disabled.
Specifically, the core is permitted to speculatively fetch from the
4KB region containing the current program counter 4K and next 4K.
When translation is changed from enabled to disabled for the running
exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the
Falkor core may errantly speculatively access memory locations outside
of the 4KB region permitted by the architecture. The errant memory
access may lead to one of the following unexpected behaviors.
1) A System Error Interrupt (SEI) being raised by the Falkor core due
to the errant memory access attempting to access a region of memory
that is protected by a slave-side memory protection unit.
2) Unpredictable device behavior due to a speculative read from device
memory. This behavior may only occur if the instruction cache is
disabled prior to or coincident with translation being changed from
enabled to disabled.
The conditions leading to this erratum will not occur when either of the
following occur:
1) A higher exception level disables translation of a lower exception level
(e.g. EL2 changing SCTLR_EL1[M] from a value of 1 to 0).
2) An exception level disabling its stage-1 translation if its stage-2
translation is enabled (e.g. EL1 changing SCTLR_EL1[M] from a value of 1
to 0 when HCR_EL2[VM] has a value of 1).
To avoid the errant behavior, software must execute an ISB immediately
prior to executing the MSR that will change SCTLR_ELn[M] from 1 to 0.
Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
If the spinlock "next" ticket wraps around between the initial LDR
and the cmpxchg in the LSE version of spin_trylock, then we can erroneously
think that we have successfuly acquired the lock because we only check
whether the next ticket return by the cmpxchg is equal to the owner ticket
in our updated lock word.
This patch fixes the issue by performing a full 32-bit check of the lock
word when trying to determine whether or not the CASA instruction updated
memory.
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The classic error injection mechanism, should_fail_request() does not
support use cases where more information is required (from the entire
struct bio, for example).
To that end, this patch introduces should_fail_bio(), which calls
should_fail_request() under the hood but provides a convenient
place for kprobes to hook into if they require the entire struct bio.
This patch also replaces some existing calls to should_fail_request()
with should_fail_bio() with no degradation in performance.
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the idr kernel-doc to its own idr.rst file and add a few
paragraphs about how to use it. Also add some more kernel-doc.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
About 20% of the IDR users in the kernel want the allocated IDs to start
at 1. The implementation currently searches all the way down the left
hand side of the tree, finds no free ID other than ID 0, walks all the
way back up, and then all the way down again. This patch 'rebases' the
ID so we fill the entire radix tree, rather than leave a gap at 0.
Chris Wilson says: "I did the quick hack of allocating index 0 of the
idr and that eradicated idr_get_free() from being at the top of the
profiles for the many-object stress tests. This improvement will be
much appreciated."
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Now that the IDR can be used to store large IDs, it is possible somebody
might only partially convert their old code and use the iterators which
can only handle IDs up to INT_MAX. It's probably unwise to show them a
truncated ID, so settle for spewing warnings to dmesg, and terminating
the iteration.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Most places in the kernel that we need to distinguish functions by the
type of their arguments, we use '_ul' as a suffix for the unsigned long
variant, not '_ext'. Also add kernel-doc.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
It has no more users, so remove it. Move idr_alloc() back into idr.c,
move the guts of idr_alloc_cmn() into idr_alloc_u32(), remove the
wrappers around idr_get_free_cmn() and rename it to idr_get_free().
While there is now no interface to allocate IDs larger than a u32,
the IDR internals remain ready to handle a larger ID should a need arise.
These changes make it possible to provide the guarantee that, if the
nextid pointer points into the object, the object's ID will be initialised
before a concurrent lookup can find the object.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
No real benefit to this classifier, but since we're allocating a u32
anyway, we should use this function.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Commit e7614370d6 ("net_sched: use idr to allocate u32 filter handles)
converted htid allocation to use the IDR. The ID allocated by this
scheme changes; it used to be cyclic, but now always allocates the
lowest available. The IDR supports cyclic allocation, so just use
the right function.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Use the new helper. Also untangle the error path, and in so doing
noticed that estimator generator failure would lead to us leaking an
ID. Fix that bug.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
All current users of idr_alloc_ext() actually want to allocate a u32
and idr_alloc_u32() fits their needs better.
Like idr_get_next(), it uses a 'nextid' argument which serves as both
a pointer to the start ID and the assigned ID (instead of a separate
minimum and pointer-to-assigned-ID argument). It uses a 'max' argument
rather than 'end' because the semantics that idr_alloc has for 'end'
don't work well for unsigned types.
Since idr_alloc_u32() returns an errno instead of the allocated ID, mark
it as __must_check to help callers use it correctly. Include copious
kernel-doc. Chris Mi <chrism@mellanox.com> has promised to contribute
test-cases for idr_alloc_u32.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Changing idr_replace's 'id' argument to 'unsigned long' works for all
callers. Callers which passed a negative ID now get -ENOENT instead of
-EINVAL. No callers relied on this error value.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
If a dentry is deleted, then a dentry is recreated with the same handle
but a different type (i.e. it was a file and now it's a symlink), then
its a different inode. The check was backwards, so d_revalidate would
not have noticed.
Due to the design of the OrangeFS server, this is rather unlikely.
It's also possible for the dentry to be deleted and recreated with the
same type. This would be undetectable. It's a bit of a ship of
Theseus.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Check whether this is a new inode at location of call.
Raises the question of what to do with an unknown inode type. Old code
would've marked the inode bad and returned ESTALE.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When we get an error return code from userspace (the client-core)
we check to make sure it is a valid code.
This patch maps the whacky return code to -EINVAL instead of
propagating garbage back up the call chain potentially resulting
in a hard-to-find train-wreck.
The client-core doesn't have any business returning whacky return
codes, but if it does, we don't want the kernel to crash as a result.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
gcc-8 reports
fs/orangefs/dcache.c: In function 'orangefs_d_revalidate':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 256 equals destination size [-Wstringop-truncation]
fs/orangefs/namei.c: In function 'orangefs_rename':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 256 equals destination size [-Wstringop-truncation]
fs/orangefs/super.c: In function 'orangefs_mount':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 256 equals destination size [-Wstringop-truncation]
We need one less byte or call strlcpy() to make it a nul-terminated
string.
Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
It wasn't possible to enable it, and it would've had very little effect.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
gossip_ldebug is unused.
gossip_lerr is used in two places. The messages are unique so line
numbers are unnecessary.
Also remove support for compiling gossip messages out. It wasn't
possible to enable it anyway.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Mikulas reported a workload that saw bad performance, and figured
out what it was due to various other types of requests being
accounted as reads. Flush requests, for instance. Due to the
high latency of those, we heavily throttle the writes to keep
the latencies in balance. But they really should be accounted
as writes.
Fix this by checking the exact type of the request. If it's a
read, account as a read, if it's a write or a flush, account
as a write. Any other request we disregard. Previously everything
would have been mistakenly accounted as reads.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlp6D5MACgkQ18tUv7Cl
QOt2IhAA3lY89PxSWhQRpQbxCAvnCATL2IjjT8z3v8eLJrNrkJ0X7mOUIP47dQHb
cvF/XNfemQssAqd9lWuKCRzx9vFbBrUmavzGyYXQBBULGZr6d9ARapIJU0pUJpv3
Mf/tGonj8FJSLT84dbbYpuU5xRUCS/WM53N0mYPUGOfWoOgL2vQQOHqgUDPLRjyx
T+2IHzCuyU0fp/QpSRCyg9OvScY9BOp0P0MER1UTH5+PGI5ApzBSd19Go3/gupj2
H8kuHsI800EWJbUynDFwDwsh+YDwDIs+WVkfEkD27LHhqT7L0W7RusV9xSzguoaJ
CK3g8L4GsqIKASQQCgqRJ1hcUbaHTHgqOvejIH8CBlH4YWMXanzT2aRh1AdDwERY
07n1zsJSyFZ4YMDRg786snWK/b3udeGv/eKS5/jEMxhNNNaTjPyceQHt3ZTNw/KO
/TYSvxYJWvaw/e4UoqcL7wb2rrr78uGTdLRDUcbtNllYk6hla4bUQkDp4D/rdoWk
vQoNpEvMty7r2+GosKgfZru/bOfZpAi8AXqyGPZPjz4LC/woYaro3HhvA4BLRRfO
NDsCSdejLO5iIssOvKfz99nxqD9VGkYF3rjxUlZ4ksQw+kdJ3IU+As5NLCdWd2xW
jCELPCXE2zlps6RjowJSykKqmtD0oGzdP3vVCiU4IbmVYF4LLpw=
=eXee
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-4.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
NFS-over-RDMA client fixes for Linux 4.16 #2
Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet
One of the charming quirks of the idr_alloc() interface is that you
can pass a negative end and it will be interpreted as "maximum". Ensure
we don't break that.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
The test was checking the wrong errno; ida_get_new_above() returns
EAGAIN, not ENOMEM on memory allocation failure. Double the number of
threads to increase the chance that we actually exercise this path
during the test suite (it was a bit sporadic before).
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
If the table result is out of bounds on the array map
there is something really wrong with VBT pin so we don't
return that vbt_pin, but only return 0 instead.
This basically reverts commit 'a8e6f3888b05 ("drm/i915/cnp:
Ignore VBT request for know invalid DDC pin.")'
Also this properly fixes commit 9c3b2689d0 ("drm/i915/cnl:
Map VBT DDC Pin to BSpec DDC Pin.")
v2: Do in a way that we don't break other platforms. (Jani)
v3: Keep debug message (Jani)
v4: Don't mess with 0 mapping was noticed by Jani and
addressed with a simple solution suggested by Lucas
that makes this even simpler.
Fixes: a8e6f3888b ("drm/i915/cnp: Ignore VBT request for know invalid DDC pin.")
Fixes: 9c3b2689d0 ("drm/i915/cnl: Map VBT DDC Pin to BSpec DDC Pin.")
Cc: Radhakrishna Sripada <radhakrishna.sripada@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Kai Heng Feng <kai.heng.feng@canonical.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Suggested-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180125222524.22059-1-rodrigo.vivi@intel.com
(cherry picked from commit 3393ce1ed8)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>