Enable io-wq hashing stuff for dependent works simply by re-enqueueing
such requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a preparation patch removing io_wq_enqueue_hashed(); its job
should now be done by io_wq_hash_work() + io_wq_enqueue().
Also, set the hash value for dependent works, and do it as late as
possible, because req->file can be unavailable before that. This hash
will be ignored by io-wq.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This little tweak restores the behaviour from before the recent
io_worker_handle_work() optimisation patches. It makes the function do
cond_resched() and flush_signals() only if there is actual work to
execute.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When processing links, io_submit_sqe() prepares requests, drops sqes,
and passes them with sqe=NULL to io_queue_sqe(). There, IOSQE_DRAIN
and/or IOSQE_ASYNC requests will go through the same prep, which doesn't
expect sqe=NULL and fails with a NULL pointer dereference.
Always do the full prepare, including io_alloc_async_ctx(), for linked
requests, so the second preparation can be skipped.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix the handling of signals in client rxrpc calls made by the afs
filesystem. Ignore signals completely, leaving call abandonment or
connection loss to be detected by timeouts inside AF_RXRPC.
Allowing a filesystem call to be interrupted after the entire request has
been transmitted and an abort sent means that the server may or may not
have done the action - and we don't know. It may even be worse than that
for older servers.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
When an AFS service handler function aborts a call, AF_RXRPC marks the call
as complete - which means that it's not going to get any more packets from
the receiver. This is a problem because reception of the final ACK is what
triggers afs_deliver_to_call() to drop the final ref on the afs_call
object.
Instead, aborted AFS service calls may then just sit around waiting
forever, or until they're displaced by a new call on the same connection
channel or by a connection-level abort.
Fix this by calling afs_set_call_complete() to finalise the afs_call struct
representing the call.
However, we then need to drop the ref that stops the call from being
deallocated. We can do this in afs_set_call_complete(), as the work queue
is holding a separate ref of its own, but then we shouldn't do it in
afs_process_async_call() and afs_delete_async_call().
call->drop_ref is set to indicate that a ref needs dropping for a call and
this is dealt with when we transition a call to AFS_CALL_COMPLETE.
But then we also need to get rid of the ref that pins an asynchronous
client call. We can do this by the same mechanism, setting call->drop_ref
for an async client call too.
We can also get rid of call->incoming since nothing ever sets it and only
one thing ever checks it (futilely).
A trace of the rxrpc_call and afs_call struct ref counting looks like:
<idle>-0 [001] ..s5 164.764892: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_new_incoming_call+0x473/0xb34 a=00000000442095b5
<idle>-0 [001] .Ns5 164.766001: rxrpc_call: c=00000002 QUE u=4 sp=rxrpc_propose_ACK+0xbe/0x551 a=00000000442095b5
<idle>-0 [001] .Ns4 164.766005: rxrpc_call: c=00000002 PUT u=3 sp=rxrpc_new_incoming_call+0xa3f/0xb34 a=00000000442095b5
<idle>-0 [001] .Ns7 164.766433: afs_call: c=00000002 WAKE u=2 o=11 sp=rxrpc_notify_socket+0x196/0x33c
kworker/1:2-1810 [001] ...1 164.768409: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_process_call+0x25/0x7ae a=00000000442095b5
kworker/1:2-1810 [001] ...1 164.769439: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000001 00000000 02 22 ACK CallAck
kworker/1:2-1810 [001] ...1 164.769459: rxrpc_call: c=00000002 PUT u=2 sp=rxrpc_process_call+0x74f/0x7ae a=00000000442095b5
kworker/1:2-1810 [001] ...1 164.770794: afs_call: c=00000002 QUEUE u=3 o=12 sp=afs_deliver_to_call+0x449/0x72c
kworker/1:2-1810 [001] ...1 164.770829: afs_call: c=00000002 PUT u=2 o=12 sp=afs_process_async_call+0xdb/0x11e
kworker/1:2-1810 [001] ...2 164.771084: rxrpc_abort: c=00000002 95786a88:00000008 s=0 a=1 e=1 K-1
kworker/1:2-1810 [001] ...1 164.771461: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000002 00000000 04 00 ABORT CallAbort
kworker/1:2-1810 [001] ...1 164.771466: afs_call: c=00000002 PUT u=1 o=12 sp=SRXAFSCB_ProbeUuid+0xc1/0x106
The abort generated in SRXAFSCB_ProbeUuid(), labelled "K-1", indicates that
the local filesystem/cache manager didn't recognise the UUID as its own.
Fixes: 2067b2b3f4 ("afs: Fix the CB.ProbeUuid service handler to reply correctly")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix a couple of tracelines to indicate the usage count after the atomic
op, not the usage count before it, to be consistent with other afs and
rxrpc trace lines.
Change the wording of the afs_call_trace_work trace ID label from "WORK" to
"QUEUE" to reflect the fact that it's queueing work, not doing work.
Fixes: 341f741f04 ("afs: Refcount the afs_call struct")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the interruptibility of kernel-initiated client calls so that they're
either only interruptible when they're waiting for a call slot to come
available or they're not interruptible at all. Either way, they're not
interruptible during transmission.
This should help prevent StoreData calls from being interrupted when
writeback is in progress. It doesn't, however, handle interruption during
the receive phase.
Userspace-initiated calls are still interruptible. After the signal has
been handled, sendmsg() will return the amount of data copied out of the
buffer and userspace can perform another sendmsg() call to continue
transmission.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
Merge tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"These are mostly fscontext fixes, but there is also one that fixes
collisions seen in fscache:
- Ensure the fs_context has the correct fs_type when mounting and
submounting
- Fix leaking of ctx->nfs_server.hostname
- Add minor version to fscache key to prevent collisions"
* tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
nfs: add minor version to nfs_server_key for fscache
NFS: Fix leak of ctx->nfs_server.hostname
NFS: Don't hard-code the fs_type when submounting
NFS: Ensure the fs_context has the correct fs_type before mounting
Merge tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
"Fix an Oops introduced in v5.4"
* tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix stack use after return
Merge tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix three bugs introduced in this cycle"
* tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix lockdep warning for async write
ovl: fix some xino configurations
ovl: fix lock in ovl_llseek()
During a rename whiteout, if btrfs_whiteout_for_rename() returns an error
we can end up returning from btrfs_rename() with the log context object
still in the root's log context list - this happens if 'sync_log' was
set to true before we called btrfs_whiteout_for_rename() and it is
dangerous because we end up with a corrupt linked list (root->log_ctxs)
as the log context object was allocated on the stack.
After btrfs_rename() returns, any task that is running btrfs_sync_log()
concurrently can end up crashing because that linked list is traversed by
btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in
the same issue that commit e6c617102c ("Btrfs: fix log context list
corruption after rename exchange operation") fixed.
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Just a single fix here, improving the RCU callback ordering from last
week. After a bit more perusing by Paul, he poked a hole in the
original"
* tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block:
io_uring: ensure RCU callback ordering with rcu_barrier()
afs_put_addrlist() casts kfree() to rcu_callback_t. Apart from being wrong
in theory, this might also blow up when people start enforcing function
types via compiler instrumentation, and it means the rcu_head has to be
first in struct afs_addr_list.
Use kfree_rcu() instead, it's simpler and more correct.
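In other words, the change amounts to something like this (a sketch,
assuming the rcu_head member of struct afs_addr_list is named 'rcu'):

    -	call_rcu(&alist->rcu, (rcu_callback_t)kfree);
    +	kfree_rcu(alist, rcu);

kfree_rcu() computes the container offset itself, which is also what
lifts the requirement that the rcu_head come first in the struct.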
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lockdep reports "WARNING: lock held when returning to user space!" due to
async write holding freeze lock over the write. Apparently aio.c already
deals with this by lying to lockdep about the state of the lock.
Do the same here. No need to check for S_IFREG() here since these file ops
are regular-only.
Reported-by: syzbot+9331a354f4f624a52a55@syzkaller.appspotmail.com
Fixes: 2406a307ac ("ovl: implement async IO routines")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fix up two bugs in the conversion to xino_mode:
1. xino=off does not always end up in disabled mode
2. xino=auto on 32bit arch should end up in disabled mode
Take a proactive approach to disabling xino on 32bit kernel:
1. Disable XINO_AUTO config during build time
2. Disable xino with a warning on mount time
As a by-product, xino=on on 32bit arch also ends up in disabled mode.
We never intended to enable xino on 32bit arch and this will make the
rest of the logic simpler.
Fixes: 0f831ec85e ("ovl: simplify ovl_same_sb() helper")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull vfs fixes from Al Viro:
"A couple of fixes for old crap in ->atomic_open() instances"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
cifs_atomic_open(): fix double-put on late allocation failure
gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open(). Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified. Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling. Trivially
fixed...
Fixes: fe9ec8291f "do_last(): take fput() on error after opening to out:"
Cc: stable@kernel.org # v4.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
with the way fs/namei.c:do_last() had been done, ->atomic_open()
instances needed to recognize the case when existing file got
found with O_EXCL|O_CREAT, either by falling back to finish_no_open()
or failing themselves. gfs2 one didn't.
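Schematically, the case such an instance has to handle looks like this
(a sketch of the general pattern, not the actual gfs2 diff):

    /* existing file found with O_EXCL|O_CREAT: must not open it */
    if (d_really_is_positive(dentry) &&
        (flags & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))
            return -EEXIST;	/* or fall back to finish_no_open() */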
Fixes: 6d4ade986f (GFS2: Add atomic_open support)
Cc: stable@kernel.org # v3.11
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek()
was replaced with ovl_inode_lock(), we did not add a check for error.
Fix this by making ovl_inode_lock() uninterruptible and change the
existing call sites to use an _interruptible variant.
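A minimal sketch of the split described above, assuming the overlay
inode lock is the mutex embedded in struct ovl_inode:

    static inline void ovl_inode_lock(struct inode *inode)
    {
            mutex_lock(&OVL_I(inode)->lock);
    }

    static inline int ovl_inode_lock_interruptible(struct inode *inode)
    {
            return mutex_lock_interruptible(&OVL_I(inode)->lock);
    }

Call sites that cannot back out keep the uninterruptible variant, while
ovl_llseek() uses the _interruptible one and checks its return value.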
Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com
Fixes: b1f9d3858f ("ovl: use ovl_inode_lock in ovl_llseek()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Deduplicate the cancellation parts, as many of them look the same,
e.g.
- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt fix from Eric Biggers:
"Fix a bug where if userspace is writing to encrypted files while the
FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running,
dirty inodes could be evicted, causing writes to be lost or the
filesystem to hang due to a use-after-free. This was encountered
during real-world use, not just theoretical.
Tested with the existing fscrypt xfstests, and with a new xfstest I
wrote to reproduce this bug. This fix does expose an existing bug with
'-o lazytime' that Ted is working on fixing, but this fix is more
critical and needed anyway regardless of the lazytime fix"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: don't evict dirty inodes after removing key
Ensure we keep the truncated value, if we did truncate it. If not, we
might read/write more than the registered buffer size.
Also for retry, ensure that we return the truncated mapped value for
the vectorized versions of the read/write commands.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't
need to poll for io completion events themselves; they can rely on
io_sq_thread to do the polling work, which reduces cpu usage and
uring_lock contention.
I modify fio io_uring engine codes a bit to evaluate the performance:
static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
continue;
}
- if (!o->sqpoll_thread) {
+ if (o->sqpoll_thread && o->hipri) {
r = io_uring_enter(ld, 0, actual_min,
IORING_ENTER_GETEVENTS);
if (r < 0) {
and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
-rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k
-size=10G -numjobs=1 -time_based -runtime=120"
original codes
--------------------------------------------------------------------
iodepth | 4 | 8 | 16 | 32 | 64
bw | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
fio cpu usage | 100% | 100% | 100% | 100% | 100%
--------------------------------------------------------------------
with patch
--------------------------------------------------------------------
iodepth | 4 | 8 | 16 | 32 | 64
bw | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage | 63.8% | 74.4% | 81.1% | 83.7% | 82.4%
--------------------------------------------------------------------
bw improve | 5.5% | 13.2% | 12.3% | 9.8% | 11.5%
--------------------------------------------------------------------
From the above test results, we can see that bandwidth improves by
5.5%~13%, and the fio process's cpu usage also drops a lot. Note this
won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
are both enabled, in this case, io_sq_thread always has 100% cpu usage.
I think this patch will be friendly to applications which will often use
io_uring_wait_cqe() or similar from liburing.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If CONFIG_NET is not set, gcc warns:
fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
static int io_setup_async_msg(struct io_kiocb *req,
^~~~~~~~~~~~~~~~~~
There are many functions wrapped by CONFIG_NET; move them
together to simplify the code, which also fixes this warning.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Minor tweaks.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Not easy to tell if we're going over the size of bits we can shove
in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON()
check for it.
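The idea, schematically (the flag names are illustrative, not the
exact list):

    enum {
            REQ_F_FIXED_FILE_BIT,
            /* ... more flag bits ... */

            /* not a real flag, only an end-of-bits marker; keep last */
            __REQ_F_LAST_BIT,
    };

    BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof_field(struct io_kiocb, flags));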
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
is to trigger IO on them. The usual case of shrinking a buffer pool
would be to just not replenish the buffers when IO completes, and
instead just free it. But it may be nice to have a way to manually
remove a number of buffers from a given group, and
IORING_OP_REMOVE_BUFFERS provides that functionality.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for the vectored read. This is limited to supporting
just 1 segment in the iov, and is provided just for convenience for
applications that use IORING_OP_READV already.
The iov helpers will be used for IORING_OP_RECVMSG as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.
Now that we have fast (non-async thread) support, a task can have tons
of reads or writes pending. But that means they need buffers to
back that data, and if the number of connections is high enough, having
them preallocated for all possible connections is unfeasible.
With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:
(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.
Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission; a CQE
with res == -EOPNOTSUPP will be posted if attempted on unsupported
requests.
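On the application side, the buffer ID can be recovered from the CQE
along these lines (a sketch using the uapi constants; error handling
omitted):

    #include <linux/io_uring.h>

    static unsigned int cqe_buffer_id(const struct io_uring_cqe *cqe)
    {
            /* flags carry a buffer ID only if a buffer was selected */
            if (!(cqe->flags & IORING_CQE_F_BUFFER))
                    return 0;
            return cqe->flags >> IORING_CQE_BUFFER_SHIFT;
    }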
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
support passing in an addr/len that is associated with a buffer ID and
buffer group ID. The group ID is used to index and lookup the buffers,
while the buffer ID can be used to notify the application which buffer
in the group was used. The addr passed in is the starting buffer address,
and length is each buffer's length. The number of buffers to add can be
specified, in which case addr is incremented by length for each
addition, and each addition increments the buffer ID specified.
No validation is done of the buffer ID. If the application provides
buffers within the same group with identical buffer IDs, then it'll have
a hard time telling which buffer ID was used. The only restriction is
that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
maximum ID that can be used.
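In other words, registration walks the range like this (a schematic
sketch; add_buffer() and the variable names are hypothetical):

    unsigned int i;

    for (i = 0; i < nbufs; i++) {
            void *buf = (char *)addr + (size_t)i * len; /* addr advances by len */
            unsigned short bid = start_bid + i;         /* ID increments per buffer */

            add_buffer(group_id, bid, buf, len);        /* hypothetical helper */
    }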
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
        int stuff;
        struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kinds of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
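For completeness, such a flexible structure is then typically allocated
with the struct_size() helper, so the trailing array is sized without
overflow (a sketch, not part of this patch):

    struct foo *p = kmalloc(struct_size(p, array, count), GFP_KERNEL);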
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309202327.GA8813@embeddedor
Signed-off-by: Kees Cook <keescook@chromium.org>
After more careful studying, Paul informs me that we cannot rely on
the ordering of RCU callbacks in the way that the tagged commit did.
The current construct looks like this:
void C(struct rcu_head *rhp)
{
        do_something(rhp);
        call_rcu(&p->rh, B);
}

call_rcu(&p->rh, A);
call_rcu(&p->rh, C);
and we're relying on ordering between A and B, which isn't guaranteed.
Make this explicit instead, and have a work item issue the rcu_barrier()
to ensure that A has run before we manually execute B.
While thorough testing never showed this issue, it's dependent on the
per-cpu load in terms of RCU callbacks. The updated method simplifies
the code as well, and eliminates the need to maintain an rcu_head in
the fileset data.
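The replacement construct looks roughly like this (a sketch of the
shape, with hypothetical names):

    static void fixed_file_data_free_work(struct work_struct *work)
    {
            rcu_barrier();              /* A (and anything queued) has run */
            free_fixed_file_data();     /* B's work, done synchronously */
    }

    /* instead of the second, nested call_rcu(): */
    INIT_WORK(&data->free_work, fixed_file_data_free_work);
    queue_work(system_wq, &data->free_work);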
Fixes: c1e2148f8e ("io_uring: free fixed_file_data after RCU grace period")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs fixes from Greg KH:
"Here are four small driver core / debugfs patches for 5.6-rc3:
- debugfs api cleanup now that all debugfs_create_regset32() callers
have been fixed up. This was waiting until after the -rc1 merge as
these fixes came in through different trees
- driver core sync state fixes based on reports of minor issues found
in the feature
All of these have been in linux-next with no reported issues"
* tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: Skip unnecessary work when device doesn't have sync_state()
driver core: Add dev_has_sync_state()
driver core: Call sync_state() even if supplier has no consumers
debugfs: remove return value of debugfs_create_regset32()
After FS_IOC_REMOVE_ENCRYPTION_KEY removes a key, it syncs the
filesystem and tries to get and put all inodes that were unlocked by the
key so that unused inodes get evicted via fscrypt_drop_inode().
Normally, the inodes are all clean due to the sync.
However, after the filesystem is sync'ed, userspace can modify and close
one of the files. (Userspace is *supposed* to close the files before
removing the key. But it doesn't always happen, and the kernel can't
assume it.) This causes the inode to be dirtied and have i_count == 0.
Then, fscrypt_drop_inode() fails to consider this case and indicates
that the inode can be dropped, causing the write to be lost.
On f2fs, other problems such as a filesystem freeze could occur due to
the inode being freed while still on f2fs's dirty inode list.
Fix this bug by making fscrypt_drop_inode() only drop clean inodes.
I've written an xfstest which detects this bug on ext4, f2fs, and ubifs.
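The essence of the fix is an early bail-out in fscrypt_drop_inode(),
along these lines (a sketch; the exact dirty-state test in the patch
may differ):

    /* keep dirty inodes so that writeback can still reach them */
    if (inode->i_state & I_DIRTY_ALL)
            return 0;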
Fixes: b1c0ec3599 ("fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200305084138.653498-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Merge tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Here are a few io_uring fixes that should go into this release. This
contains:
- Removal of (now) unused io_wq_flush() and associated flag (Pavel)
- Fix cancelation lockup with linked timeouts (Pavel)
- Fix for potential use-after-free when freeing percpu ref for fixed
file sets
- io-wq cancelation fixups (Pavel)"
* tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
io_uring: fix lockup with timeouts
io_uring: free fixed_file_data after RCU grace period
io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL
io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation
There is a recipe to deadlock the kernel: submit a timeout sqe with a
linked_timeout (e.g. test_single_link_timeout_ception() from liburing),
and SIGKILL the process.
Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout
isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it
during io_put_free() to cancel the linked timeout. Probably, the same
can happen with another io_kill_timeout() call site, that is
io_commit_cqring().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One fixup for DIO when in use with the new checksums, a missed case
where the checksum size was still assuming u32"
* tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix RAID direct I/O reads with alternate csums
Merge tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking fixes from Jeff Layton:
"Just a couple of late-breaking patches for the file locking code. The
second patch (from yangerkun) fixes a rather nasty looking potential
use-after-free that should go to stable.
The other patch could technically wait for 5.7, but it's fairly
innocuous so I figured we might as well take it"
* tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: fix a potential use-after-free problem when wakeup a waiter
fcntl: Distribute switch variables for initialization
The percpu refcount protects this structure, and we can have an atomic
switch in progress when exiting. This makes it unsafe to just free the
struct normally, and can trigger the following KASAN warning:
BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
Read of size 1 at addr ffff888181a19a30 by task swapper/0/0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
<IRQ>
dump_stack+0x76/0xa0
print_address_description.constprop.0+0x3b/0x60
? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
__kasan_report.cold+0x1a/0x3d
? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
rcu_core+0x370/0x830
? percpu_ref_exit+0x50/0x50
? rcu_note_context_switch+0x7b0/0x7b0
? run_rebalance_domains+0x11d/0x140
__do_softirq+0x10a/0x3e9
irq_exit+0xd5/0xe0
smp_apic_timer_interrupt+0x86/0x200
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:default_idle+0x26/0x1f0
Fix this by punting the final exit and free of the struct to RCU, then
we know that it's safe to do so. Jann suggested the approach of using a
double rcu callback to achieve this. It's important that we do a nested
call_rcu() callback, as otherwise the free could be ordered before the
atomic switch, even if the latter was already queued.
Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com
Suggested-by: Jann Horn <jannh@google.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 16306a61d3b7 ("fs/locks: always delete_block after waiting.")
added logic to check waiter->fl_blocker without holding
blocked_lock_lock, and it can trigger a UAF when we try to wake up a
waiter:
Thread 1 has created write flock a on a file; now thread 2 tries to
unlock and delete flock a, and thread 3 tries to add flock b on the
same file.
Thread2                                Thread3
                                       flock syscall(create flock b)
                                       ...flock_lock_inode_wait
                                           flock_lock_inode(will insert
                                           our fl_blocked_member list
                                           to flock a's fl_blocked_requests)
                                           sleep
flock syscall(unlock)
...flock_lock_inode_wait
    locks_delete_lock_ctx
    ...__locks_wake_up_blocks
        __locks_delete_blocks(
        b->fl_blocker = NULL)
        ...
                                       break by a signal
                                       locks_delete_block
                                           b->fl_blocker == NULL &&
                                           list_empty(&b->fl_blocked_requests)
                                           success, return directly
                                       locks_free_lock b
wake_up(&b->fl_waiter)
trigger UAF
Fix it by removing this logic; this patch may also fix CVE-2019-19769.
Cc: stable@vger.kernel.org
Fixes: 16306a61d3 ("fs/locks: always delete_block after waiting.")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
When we get an error in the middle of reading an inode, some fields in
the inode might still be uninitialized, and the evict_inode path may
then access those fields via iput().
To fix this, make sure that the inode fields are initialized.
Reported-by: syzbot+9d82b8de2992579da5d0@syzkaller.appspotmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/871rqnreqx.fsf@mail.parknet.co.jp
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per-boot, machine-wide, unique inode identifier.
This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.
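A sketch of such a per-boot identifier, assuming an i_sequence field
added to struct inode (the names follow the description above, not
necessarily the final patch):

    static u64 get_inode_sequence_number(struct inode *inode)
    {
            static atomic64_t i_seq;
            u64 old = atomic64_read(&inode->i_sequence);
            u64 new;

            if (old)
                    return old;

            /* lazily assign a machine-wide unique, non-zero id */
            new = atomic64_add_return(1, &i_seq);
            old = atomic64_cmpxchg_relaxed(&inode->i_sequence, 0, new);
            return old ? old : new; /* losers of the race reuse the winner's id */
    }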
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
This just prepares the ring for having lists of buffers associated with
it, that the application can provide for SQEs to consume instead of
providing their own.
The buffers are organized by group ID.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First it changes the io-wq interfaces: it replaces {get,put}_work()
with free_work(), which is guaranteed to be called exactly once. It also
enforces the free_work() callback to be non-NULL.
io_uring follows the changes, and instead of putting a submission
reference in io_put_req_async_completion(), it will be done in
io_free_work(). As this removes io_get_work() with the corresponding
refcount_inc(), the ref balance is maintained.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When executing non-linked hashed work, io_worker_handle_work()
will lock-unlock wqe->lock to update the hash, and then immediately
lock-unlock to get the next work. Optimise this case and do the
lock/unlock only once.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are 2 optimisations:
- Now, io_worker_handle_work() does io_assign_current_work() twice per
request, and each one adds a lock/unlock(worker->lock) pair. The first
is to reset worker->cur_work to NULL, and the second to set a real work
shortly after. If there is a dependent work, set it immediately, which
effectively removes the extra NULL'ing.
- And there is no use in taking wqe->lock for linked works, as they are
not hashed now. Optimise it out.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a preparation patch; it adds some helpers and makes
the next patches cleaner:
- extract io_impersonate_work() and io_assign_current_work()
- replace @next label with nested do-while
- move put_work() right after NULL'ing cur_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If after dropping the submission reference req->refs == 1, the request
is done, because this one is for io_put_work() and will be dropped
synchronously shortly after. In this case it's safe to steal a next
work from the request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There will be no use for @nxt in the handlers, and it doesn't work
anyway, so purge it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The rule is simple: any async handler gets a submission ref and should
put it at the end. Make them all follow it, which is more consistent.
This is a preparation patch, and as io_wq_assign_next() currently won't
ever work, this doesn't care to use io_put_req_find_next() instead of
io_put_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
refcount_inc_not_zero() -> refcount_inc() fix.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five small cifs/smb3 fixes, two for stable (one for a reconnect
problem and the other fixes a use case when renaming an open file)"
* tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Use #define in cifs_dbg
cifs: fix rename() by ensuring source handle opened with DELETE bit
cifs: add missing mount option to /proc/mounts
cifs: fix potential mismatch of UNC paths
cifs: don't leak -EAGAIN for stat() during reconnect
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.
To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.
fs/fcntl.c: In function ‘send_sigio_to_task’:
fs/fcntl.c:738:20: warning: statement will never be executed [-Wswitch-unreachable]
738 | kernel_siginfo_t si;
| ^~
[1] https://bugs.llvm.org/show_bug.cgi?id=44916
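Schematically, the move looks like this (illustrative only; send_si()
is a hypothetical helper, not the fcntl.c code):

    /* before: the declaration precedes any case label, so no execution
     * flow reaches it and instrumentation cannot initialize it */
    switch (signum) {
            kernel_siginfo_t si;
    default:
            send_si(&si);
    }

    /* after: the declaration lives inside the case that uses it */
    switch (signum) {
    default: {
            kernel_siginfo_t si;

            send_si(&si);
    }
    }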
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
As Lasse pointed out, "Looking at fs/erofs/decompress.c,
the return value from LZ4_decompress_safe_partial is only
checked for negative value to catch errors. ... So if
I understood it correctly, if there is bad data whose
uncompressed size is much less than it should be, it can
leave part of the output buffer untouched and expose the
previous data as the file content. "
Let's fix it now.
Cc: Lasse Collin <lasse.collin@tukaani.org>
Fixes: 7fc45dbc93 ("staging: erofs: introduce generic decompression backend")
[ Gao Xiang: v5.3+, I will manually backport this to stable later. ]
Link: https://lore.kernel.org/r/20200226081008.86348-3-gaoxiang25@huawei.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
As Lasse pointed out, "EROFS uses LZ4_decompress_safe_partial
for both partial and full blocks. Thus when it is decoding a
full block, it doesn't know if the LZ4 decoder actually decoded
all the input. The real uncompressed size could be bigger than
the value stored in the file system metadata.
Using LZ4_decompress_safe instead of _safe_partial when
decompressing a full block would help to detect errors."
So it's reasonable to use _safe in case of potentially corrupted
images, and it might have some speed gain as well, although I didn't
observe much difference.
Note that legacy compressor (< 5.3, no LZ4_0PADDING) could
encode extra data in a pcluster, which is excluded as well.
Cc: Lasse Collin <lasse.collin@tukaani.org>
Fixes: 0ffd71bcc3 ("staging: erofs: introduce LZ4 decompression inplace")
[ Gao Xiang: v5.3+, I will manually backport this to stable later. ]
Link: https://lore.kernel.org/r/20200226081008.86348-2-gaoxiang25@huawei.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
XArray has friendly APIs and it will replace the old radix tree in the
near future.
This conversion makes use of __xa_cmpxchg when inserting on an item just
inserted by another thread. In detail, instead of doing the whole lookup
again as we did for the old radix tree, it will try to legitimize the
current in-tree item in the XArray, which is therefore more efficient.
In addition, naming is rather a challenge for a non-English speaker like
me. The basic idea of workstn is to provide a runtime sparse array with
items arranged in physical block number order. Such items (formerly
called workgroups) can be used to record compress clusters or for later
new features.
However, neither workgroup nor workstn seems a good name from whatever
point of view, so I'd like to rename them as pslot and managed_pslots,
standing for physical slots. This patch handles the second as a part of
the radix tree conversion.
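The insert-or-legitimize step with __xa_cmpxchg looks roughly like this
(a sketch with assumed names; the real code also deals with reference
counting):

    xa_lock(&sbi->managed_pslots);
    pre = __xa_cmpxchg(&sbi->managed_pslots, index, NULL, grp, GFP_NOFS);
    if (pre)
            grp = pre;  /* take the item another thread just inserted */
    xa_unlock(&sbi->managed_pslots);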
Cc: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/20200220024642.91529-1-gaoxiang25@huawei.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
btrfs_lookup_and_bind_dio_csum() does pointer arithmetic which assumes
32-bit checksums. If using a larger checksum, this leads to spurious
failures when a direct I/O read crosses a stripe. This is easy
to reproduce:
# mkfs.btrfs -f --checksum blake2 -d raid0 /dev/vdc /dev/vdd
...
# mount /dev/vdc /mnt
# cd /mnt
# dd if=/dev/urandom of=foo bs=1M count=1 status=none
# dd if=foo of=/dev/null bs=1M iflag=direct status=none
dd: error reading 'foo': Input/output error
# dmesg | tail -1
[ 135.821568] BTRFS warning (device vdc): csum failed root 5 ino 257 off 421888 ...
Fix it by using the actual checksum size.
Fixes: 1e25a2e3ca ("btrfs: don't assume ordered sums to be 4 bytes")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Clang warns:
fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized
whenever 'if' condition is false [-Wsometimes-uninitialized]
if (def->pollin)
^~~~~~~~~~~
fs/io_uring.c:4182:2: note: uninitialized use occurs here
mask |= POLLERR | POLLPRI;
^~~~
fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always
true
if (def->pollin)
^~~~~~~~~~~~~~~~
fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence
this warning
__poll_t mask, ret;
^
= 0
1 warning generated.
io_op_defs has many definitions where pollin is not set so mask indeed
might be uninitialized. Initialize it to zero and change the next
assignment to |=, in case further masks are added in the future to avoid
missing changing the assignment then.
Fixes: d7718a9d25 ("io_uring: use poll driven retry for files that support it")
Link: https://github.com/ClangBuiltLinux/linux/issues/916
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io-wq cares about the IO_WQ_WORK_UNBOUND flag only while enqueueing, so
it's useless to set it for the next req of a link. Thus, remove it from
io_prep_linked_timeout(), and inline the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After __io_queue_sqe() ended up in io_queue_async_work(), it's already
known that there is no @nxt req, so skip the check and return from the
function.
Also, @nxt initialisation can now be done just before
io_put_req_find_next(), as there is no jumping until it's checked.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently io_uring tries any request in a non-blocking manner, if it can,
and then retries from a worker thread if we get -EAGAIN. Now that we have
a new and fancy poll based retry backend, use that to retry requests if
the file supports it.
This means that, for example, an IORING_OP_RECVMSG on a socket no longer
requires an async thread to complete the IO. If we get -EAGAIN reading
from the socket in a non-blocking manner, we arm a poll handler for
notification on when the socket becomes readable. When it does, the
pending read is executed directly by the task again, through the io_uring
task work handlers. Not only is this faster and more efficient, it also
means we're not generating potentially tons of async threads that just
sit and block, waiting for the IO to complete.
The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
pollable IO is fast, and that poll<link>other_op is fast as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a pollin/pollout field to the request table, and have commands that
we can safely poll for properly marked.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For poll requests, it's not uncommon to link a read (or write) after
the poll to execute immediately after the file is marked as ready.
Since the poll completion is called inside the waitqueue wake up handler,
we have to punt that linked request to async context. This slows down
the processing, and actually means it's faster to not use a link for this
use case.
We also run into problems if the completion_lock is contended, as we're
doing a different lock ordering than the issue side is. Hence we have
to do trylock for completion, and if that fails, go async. Poll removal
needs to go async as well, for the same reason.
eventfd notification needs special case as well, to avoid stack blowing
recursion or deadlocks.
These are all deficiencies that were inherited from the aio poll
implementation, but I think we can do better. When a poll completes,
simply queue it up in the task poll list. When the task completes the
list, we can run dependent links inline as well. This means we never
have to go async, and we can remove a bunch of code associated with
that, and optimizations to try and make that run faster. The diffstat
speaks for itself.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Store the io_kiocb in the private field instead of the poll entry, this
is in preparation for allowing multiple waitqueues.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@hash_map is unsigned long, but BIT_ULL() is used for manipulating it.
BIT() is a better match, as it returns exactly an unsigned long value.
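i.e., schematically (not the literal diff):

    -	if (wqe->hash_map & BIT_ULL(bit))	/* evaluates as u64 */
    +	if (wqe->hash_map & BIT(bit))		/* unsigned long, like hash_map */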
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
before the work setup (i.e. mm, override creds, etc). The setup
shouldn't take long, so it's ok to arm it a bit later and get rid
of IO_WQ_WORK_CB.
Make io-wq call work->func() only once; callbacks will handle the rest,
i.e. the linked timeout handler will do the actual issue. And as a
bonus, it removes an extra indirect call.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Deduplicate the call to io_cqring_fill_event(); plain and easy.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for splice(2).
- output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file handled by generic code as well
- len is 32bit, but should be fine
- the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which
is a splice flag (i.e. sqe->splice_flags).
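Filling an SQE for this opcode would look roughly like the following (a
sketch based on the uapi field names; offsets and error handling
omitted):

    sqe->opcode = IORING_OP_SPLICE;
    sqe->fd = fd_out;                         /* output file, generic fd handling */
    sqe->len = nbytes;                        /* 32-bit length */
    sqe->splice_fd_in = fd_in;
    sqe->splice_flags = SPLICE_F_FD_IN_FIXED; /* fd_in is a registered file */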
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Preparation without functional changes. Adds io_get_file(), which
allows grabbing files not only into req->file.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make do_splice() public, so other kernel parts can reuse it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->in_async is not really needed, it only prevents propagation of
@nxt for fast not-blocked submissions. Remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_prep_async_work(), as called from io_wq_assign_next(), does many
useless checks: io_req_work_grab_env() was already called during prep,
and @do_hashed is never used. Add io_prep_next_work() -- a simplified
version that can be called from io-wq.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Many operations define custom work.func before getting into an io-wq.
There are several points against:
- it calls io_wq_assign_next() from outside io-wq, that may be confusing
- sync context would go unnecessary through io_req_cancelled()
- prototypes are quite different, so work!=old_work looks strange
- makes async/sync responsibilities fuzzy
- adds extra overhead
Don't call the generic path and io-wq handlers from each other; use
helpers instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't drop an early reference, hang on to it and let the caller drop
it. This makes it behave more like "regular" requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the -EAGAIN happens because of a static condition, then a poll or
later retry won't fix it; we must call it again from a blocking
context. Play it safe and ensure that any -EAGAIN condition from read
or write retries from async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_wq_flush() is buggy: during cancellation of a flush, the associated
work may be passed to the caller's (i.e. io_uring) @match callback. That
callback expects it to be embedded in struct io_kiocb. Cancellation of
internal work probably doesn't make a lot of sense to begin with.
As the flush helper is no longer used, just delete it and the associated
work flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When converting and moving nfsroot.txt to nfsroot.rst, the references
to the old text file were not updated to match the change; fix this.
Fixes: f9a9349846 ("Documentation: nfsroot.txt: convert to ReST")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20200212181332.520545-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the
callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may
return the next work, which will be ignored and lost.
Cancel the whole link.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Two more bug fixes (including a regression) for 5.6"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
jbd2: fix data races at struct journal_head
If sbi->s_flex_groups_allocated is zero and the first allocation fails
then this code will crash. The problem is that "i--" will set "i" to
-1, but when we compare "i >= sbi->s_flex_groups_allocated", the -1
is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX
is more than zero, the condition is true, so we call
kvfree(new_groups[-1]).
The loop will carry on freeing invalid memory until it crashes.
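The problematic unwind, reconstructed schematically from the
description above (names follow the text; sz is assumed):

    unsigned int i;

    for (i = sbi->s_flex_groups_allocated; i < size; i++) {
            new_groups[i] = kvzalloc(sz, GFP_KERNEL);
            if (!new_groups[i])
                    goto failed;    /* i == 0 on a first-allocation failure */
    }
    return 0;
    failed:
    for (i--; i >= sbi->s_flex_groups_allocated; i--)
            kvfree(new_groups[i]);  /* i-- wraps 0 to UINT_MAX */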
Fixes: 7c990728b9 ("ext4: fix potential race between s_flex_groups online resizing and access")
Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
journal_head::b_transaction and journal_head::b_next_transaction could
be accessed concurrently as noticed by KCSAN,
LTP: starting fsync04
/dev/zero: Can't open blockdev
EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
==================================================================
BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]
write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
__jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
__jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
(inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
kjournald2+0x13b/0x450 [jbd2]
kthread+0x1cd/0x1f0
ret_from_fork+0x27/0x50
read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
jbd2_write_access_granted+0x1b2/0x250 [jbd2]
jbd2_write_access_granted at fs/jbd2/transaction.c:1155
jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
__ext4_journal_get_write_access+0x50/0x90 [ext4]
ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
ext4_mb_new_blocks+0x54f/0xca0 [ext4]
ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
ext4_map_blocks+0x3b4/0x950 [ext4]
_ext4_get_block+0xfc/0x270 [ext4]
ext4_get_block+0x3b/0x50 [ext4]
__block_write_begin_int+0x22e/0xae0
__block_write_begin+0x39/0x50
ext4_write_begin+0x388/0xb50 [ext4]
generic_perform_write+0x15d/0x290
ext4_buffered_write_iter+0x11f/0x210 [ext4]
ext4_file_write_iter+0xce/0x9e0 [ext4]
new_sync_write+0x29c/0x3b0
__vfs_write+0x92/0xa0
vfs_write+0x103/0x260
ksys_write+0x9d/0x130
__x64_sys_write+0x4c/0x60
do_syscall_64+0x91/0xb05
entry_SYSCALL_64_after_hwframe+0x49/0xbe
5 locks held by fsync04/25724:
#0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
#1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
#2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
#3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
#4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
irq event stamp: 1407125
hardirqs last enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
softirqs last enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0
Reported by Kernel Concurrency Sanitizer on:
CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
The plain reads are outside of the jh->b_state_lock critical section,
which results in data races. Fix them by adding pairs of
READ_ONCE()/WRITE_ONCE().
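The annotation pattern looks roughly like this (a sketch with stand-in
structures and simplified logic, not the actual jbd2 code):

#include <linux/compiler.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct jh_example {                     /* stand-in for journal_head */
        spinlock_t b_state_lock;
        void *b_transaction;
        void *b_next_transaction;
};

/* writer side: still runs with jh->b_state_lock held */
static void refile_example(struct jh_example *jh)
{
        WRITE_ONCE(jh->b_transaction, jh->b_next_transaction);
        WRITE_ONCE(jh->b_next_transaction, NULL);
}

/* lockless reader: runs under rcu_read_lock() only */
static bool access_granted_example(struct jh_example *jh, void *t)
{
        return READ_ONCE(jh->b_transaction) == t &&
               READ_ONCE(jh->b_next_transaction) == NULL;
}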
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Merge tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix for a race with IOPOLL used with SQPOLL (Xiaoguang)
- Only show ->fdinfo if procfs is enabled (Tobias)
- Fix for a chain with multiple personalities in the SQEs
- Fix for a missing free of personality idr on exit
- Removal of the spin-for-work optimization
- Fix for next work lookup on request completion
- Fix for non-vec read/write result propagation in case of links
- Fix for fileset references on switch
- Fix for recvmsg/sendmsg 32-bit compatibility mode
* tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
io_uring: fix 32-bit compatibility with sendmsg/recvmsg
io_uring: define and set show_fdinfo only if procfs is enabled
io_uring: drop file set ref put/get on switch
io_uring: import_single_range() returns 0/-ERROR
io_uring: pick up link work on submit reference drop
io-wq: ensure work->task_pid is cleared on init
io-wq: remove spin-for-work optimization
io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
io_uring: fix personality idr leak
io_uring: handle multiple personalities in link chains
Merge tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fixes from Damien Le Moal:
"Two fixes in here:
- Revert the initial decision to silently ignore IOCB_NOWAIT for
asynchronous direct IOs to sequential zone files. Instead, return
an error to the user to signal that the feature is not supported
(from Christoph)
- A fix to zonefs Kconfig to select FS_IOMAP to avoid build failures
if no other file system already selected this option (from
Johannes)"
* tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: select FS_IOMAP
zonefs: fix IOCB_NOWAIT handling
We must set MSG_CMSG_COMPAT if we're in compatibility mode, otherwise
the iovec import for these commands will not do the right thing and fail
the command with -EINVAL.
Found by running the test suite compiled as 32-bit.
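Roughly, the pattern of the fix (a sketch with stand-in types; the real
io_uring field layout differs):

#include <linux/socket.h>
#include <linux/types.h>

struct ctx_example {                    /* stand-in for io_ring_ctx */
        bool compat;                    /* ring set up by a 32-bit task */
};

static void sendrecv_prep_example(struct ctx_example *ctx,
                                  unsigned int *msg_flags)
{
#ifdef CONFIG_COMPAT
        if (ctx->compat)
                *msg_flags |= MSG_CMSG_COMPAT;
#endif
}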
Cc: stable@vger.kernel.org
Fixes: aa1fa28fc7 ("io_uring: add support for recvmsg()")
Fixes: 0fa03c624d ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In Aug 2018, NeilBrown noted in
commit 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code and interface"):
"Some ->next functions do not increment *pos when they return NULL...
Note that such ->next functions are buggy and should be fixed.
A simple demonstration is
dd if=/proc/swaps bs=1000 skip=1
Choose any block size larger than the size of /proc/swaps. This will
always show the whole last line of /proc/swaps"
/proc/swaps output was fixed recently; however, there are a lot of
other affected files, and one of them is related to the pstore
subsystem.
If a .next function does not change the position index, the following
.show call will repeat the output for the current position index.
There are at least 2 related problems:
- a read after an lseek beyond the end of file, described above by
NeilBrown: "dd if=<AFFECTED_FILE> bs=1000 skip=1" will output the
whole last line
- a read after an lseek into the middle of the last line will output
the expected rest of the last line, but then repeat the whole last
line once again
If a .show() function generates multi-line output (as
pstore_ftrace_seq_show() appears to do), the following bash script
cycles endlessly:
$ q=;while read -r r;do echo "$((++q)) $r";done < AFFECTED_FILE
Unfortunately I'm not familiar enough with the pstore subsystem and was
unable to find an affected pstore-related file on my test node.
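For reference, a well-behaved ->next advances the index even at
end-of-data (a sketch using hypothetical lookup helpers):

#include <linux/seq_file.h>

extern loff_t example_nr_records;       /* hypothetical record count */
extern void *example_lookup(loff_t pos); /* hypothetical record lookup */

static void *example_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
        (*pos)++;                       /* advance even when returning NULL */
        if (*pos >= example_nr_records)
                return NULL;            /* end of data, index still moved */
        return example_lookup(*pos);
}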
Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Link: https://lore.kernel.org/r/4e49830d-4c88-0171-ee24-1ee540028dad@virtuozzo.com
[kees: with robustness tweak from Joel Fernandes <joelaf@google.com>]
Signed-off-by: Kees Cook <keescook@chromium.org>
Follow the pattern used with other *_show_fdinfo functions and only
define and use io_uring_show_fdinfo and its helper functions if
CONFIG_PROC_FS is set.
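The resulting shape is roughly (a sketch with a simplified body and
illustrative names):

#include <linux/fs.h>
#include <linux/seq_file.h>

#ifdef CONFIG_PROC_FS
static void show_fdinfo_example(struct seq_file *m, struct file *f)
{
        seq_printf(m, "...\n");         /* dump ring state here */
}
#endif

static const struct file_operations example_fops = {
        /* .release, .mmap, .poll, ... as before */
#ifdef CONFIG_PROC_FS
        .show_fdinfo    = show_fdinfo_example,
#endif
};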
Fixes: 87ce955b24 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core
Pull EFI updates for v5.7 from Ard Biesheuvel:
This time, the set of changes for the EFI subsystem is much larger than
usual. The main reasons are:
- Get things cleaned up before EFI support for RISC-V arrives, which will
increase the size of the validation matrix, and therefore the threshold to
making drastic changes,
- After years of defunct maintainership, the GRUB project has finally started
to consider changes from the distros regarding UEFI boot, some of which are
highly specific to the way x86 does UEFI secure boot and measured boot,
based on knowledge of both shim internals and the layout of bootparams and
the x86 setup header. Having this maintenance burden on other architectures
(which don't need shim in the first place) is hard to justify, so instead,
we are introducing a generic Linux/UEFI boot protocol.
Summary of changes:
- Boot time GDT handling changes (Arvind)
- Simplify handling of EFI properties table on arm64
- Generic EFI stub cleanups, to improve command line handling, file I/O,
memory allocation, etc.
- Introduce a generic initrd loading method based on calling back into
the firmware, instead of relying on the x86 EFI handover protocol or
device tree.
- Introduce a mixed mode boot method that does not rely on the x86 EFI
handover protocol either, and could potentially be adopted by other
architectures (if another one ever surfaces where one execution mode
is a superset of another)
- Clean up the contents of struct efi, and move out everything that
doesn't need to be stored there.
- Incorporate support for UEFI spec v2.8A changes that permit firmware
implementations to return EFI_UNSUPPORTED from UEFI runtime services at
OS runtime, and expose a mask of which ones are supported or unsupported
via a configuration table.
- Various documentation updates and minor code cleanups (Heinrich)
- Partial fix for the lack of by-VA cache maintenance in the decompressor
on 32-bit ARM. Note that these patches were deliberately put at the
beginning so they can be used as a stable branch that will be shared with
a PR containing the complete fix, which I will send to the ARM tree.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unlike the other core import helpers, import_single_range() returns 0
on success, not the length imported. This means that links depending on
the result of the non-vectored IORING_OP_{READ,WRITE} commands, added
for 5.5, get failed when they should not be.
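The corrected pattern is roughly the following (a sketch; the in-tree
import path handles more cases):

#include <linux/types.h>
#include <linux/uio.h>

static ssize_t import_len_example(int rw, void __user *buf, size_t len,
                                  struct iovec *iov, struct iov_iter *iter)
{
        int ret = import_single_range(rw, buf, len, iov, iter);

        if (ret)
                return ret;             /* negative errno on failure */
        return iov_iter_count(iter);    /* success: report the real length */
}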
Fixes: 3a6820f2bb ("io_uring: add non-vectored read/write commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If work completes inline, then we should pick up a dependent link item
in __io_queue_sqe() as well. If we don't do so, we're forced to go async
with that item, which is suboptimal.
This also fixes an issue with io_put_req_find_next(), which always looks
up the next work item. That should only be done if we're dropping the
last reference to the request, to prevent multiple lookups of the same
work item.
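Roughly, with stand-in types (not the actual io_uring structures):

#include <linux/refcount.h>

struct req_example {                    /* stand-in for io_kiocb */
        refcount_t refs;
        struct req_example *link;       /* dependent work, if any */
};

static struct req_example *put_req_find_next_example(struct req_example *req)
{
        struct req_example *nxt = NULL;

        if (refcount_dec_and_test(&req->refs)) {
                nxt = req->link;        /* safe: we held the last ref */
                /* free req here */
        }
        return nxt;
}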
Outside of being a fix, this also enables a good cleanup series for 5.7,
where we never have to pass 'nxt' around or into the work handlers.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zonefs makes use of iomap internally, so it should also select
FS_IOMAP in Kconfig.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
IOCB_NOWAIT can't just be ignored as it breaks applications expecting
it not to block. Just refuse the operation as applications must handle
that (e.g. by falling back to a thread pool).
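The check amounts to roughly this (a sketch with an illustrative helper
name; the in-tree code may differ in detail):

#include <linux/errno.h>
#include <linux/fs.h>

static ssize_t nowait_check_example(struct kiocb *iocb)
{
        /* an async direct write to a sequential zone cannot honour NOWAIT */
        if (!is_sync_kiocb(iocb) && (iocb->ki_flags & IOCB_NOWAIT))
                return -EOPNOTSUPP;
        return 0;
}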
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
We use ->task_pid for exit cancellation, but we need to ensure it's
cleared to zero for io_req_work_grab_env() to do the right thing. Take
a suggestion from Bart and clear the whole struct, setting only the
work function passed in. This makes it more future-proof as well.
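The suggested initializer is roughly (a sketch with illustrative
fields):

#include <linux/types.h>

struct io_wq_work_example {             /* fields illustrative */
        void (*func)(struct io_wq_work_example **);
        unsigned int flags;
        pid_t task_pid;
};

/* assigning a compound literal zeroes every field not named */
#define INIT_IO_WORK_EXAMPLE(work, _func)                               \
        do {                                                            \
                *(work) = (struct io_wq_work_example){ .func = (_func) }; \
        } while (0)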
Fixes: 36282881a7 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid")
Signed-off-by: Jens Axboe <axboe@kernel.dk>