-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EXPEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiR4EAC3trm1ojXVF7y9/XRhcPpb4Pror+ZA1coO
gyoy+zUuCEl9WCzzHWqXULMYMP0YzNJnJs0oLQPA1s0sx1H4uDMl/UXg0OXZisYG
Y59Kca3c1DHFwj9KPQXfGmCEjc/rbDWK5TqRc2iZMz+6E5Mt71UFZHtenwgV1zD8
hTmZZkzLCu2ePfOvrPONgL5tDqPWGVyn61phoC7cSzMF66juXGYuvQGktzi/m6q+
jAxUnhKvKTlLB9wsq3s5X/20/QD56Yuba9U+YxeeNDBE8MDWQOsjz0mZCV1fn4p3
h/6762aRaWaXH7EwMtsHFUWy7arJZg/YoFYNYLv4Ksyy3y4sMABZCy3A+JyzrgQ0
hMu7vjsP+k22X1WH8nyejBfWNEmxu6dpgckKrgF0dhJcXk/acWA3XaDWZ80UwfQy
isKRAP1rA0MJKHDMIwCzSQJDPvtUAkPptbNZJcUSU78o+pPoCaQ93V++LbdgGtKn
iGJJqX05dVbcsDx5X7fluphjkUTC4yFr7ZgLgbOIedXajWRD8iOkO2xxCHk6SKFl
iv9entvRcX9k3SHK9uffIUkRBUujMU0+HCIQFCO1qGmkCaS5nSrovZl4HoL7L/Dj
+T8+v7kyJeklLXgJBaE7jk01O4HwZKjwPWMbCjvL9NKk8j7c1soYnRu5uNvi85Mu
/9wn671s+w==
=udgj
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Add blkcg accounting for io-wq offload (Dennis)
- A use-after-free fix for io-wq (Hillf)
- Cancelation fixes and improvements
- Use proper files_struct references for offload
- Cleanup of io_uring_get_socket() since that can now go into our own
header
- SQPOLL fixes and cleanups, and support for sharing the thread
- Improvement to how page accounting is done for registered buffers and
huge pages, accounting the real pinned state
- Series cleaning up the xarray code (Willy)
- Various cleanups, refactoring, and improvements (Pavel)
- Use raw spinlock for io-wq (Sebastian)
- Add support for ring restrictions (Stefano)
* tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (62 commits)
io_uring: keep a pointer ref_node in file_data
io_uring: refactor *files_register()'s error paths
io_uring: clean file_data access in files_register
io_uring: don't delay io_init_req() error check
io_uring: clean leftovers after splitting issue
io_uring: remove timeout.list after hrtimer cancel
io_uring: use a separate struct for timeout_remove
io_uring: improve submit_state.ios_left accounting
io_uring: simplify io_file_get()
io_uring: kill extra check in fixed io_file_get()
io_uring: clean up ->files grabbing
io_uring: don't io_prep_async_work() linked reqs
io_uring: Convert advanced XArray uses to the normal API
io_uring: Fix XArray usage in io_uring_add_task_file
io_uring: Fix use of XArray in __io_uring_files_cancel
io_uring: fix break condition for __io_uring_register() waiting
io_uring: no need to call xa_destroy() on empty xarray
io_uring: batch account ->req_issue and task struct references
io_uring: kill callback_head argument for io_req_task_work_add()
io_uring: move req preps out of io_issue_sqe()
...
Pull compat iovec cleanups from Al Viro:
"Christoph's series around import_iovec() and compat variant thereof"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
security/keys: remove compat_keyctl_instantiate_key_iov
mm: remove compat_process_vm_{readv,writev}
fs: remove compat_sys_vmsplice
fs: remove the compat readv/writev syscalls
fs: remove various compat readv/writev helpers
iov_iter: transparently handle compat iovecs in import_iovec
iov_iter: refactor rw_copy_check_uvector and import_iovec
iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
compat.h: fix a spelling error in <linux/compat.h>
->cur_refs of struct fixed_file_data always points to percpu_ref
embedded into struct fixed_file_ref_node. Don't overuse container_of()
and offsetting, and point directly to fixed_file_ref_node.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't keep repeating cleaning sequences in error paths, write it once
in the and use labels. It's less error prone and looks cleaner.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keep file_data in a local var and replace with it complex references
such as ctx->file_data.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't postpone io_init_req() error checks and do that right after
calling it. There is no control-flow statements or dependencies with
sqe/submitted accounting, so do those earlier, that makes the code flow
a bit more natural.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kill extra if in io_issue_sqe() and place send/recv[msg] calls
appropriately under switch's cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove timeouts from ctx->timeout_list after hrtimer_try_to_cancel()
successfully cancels it. With this we don't need to care whether there
was a race and it was removed in io_timeout_fn(), and that will be handy
for following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't use struct io_timeout for both IORING_OP_TIMEOUT and
IORING_OP_TIMEOUT_REMOVE, they're quite different. Split them in two,
that allows to remove an unused field in struct io_timeout, and btw kill
->flags not used by either. This also easier to follow, especially for
timeout remove.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
state->ios_left isn't decremented for requests that don't need a file,
so it might be larger than number of SQEs left. That in some
circumstances makes us to grab more files that is needed so imposing
extra put.
Deaccount one ios_left for each request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keep ->needs_file_no_error check out of io_file_get(), and let callers
handle it. It makes it more straightforward. Also, as the only error it
can hand back -EBADF, make it return a file or NULL.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ctx->nr_user_files == 0 IFF ctx->file_data == NULL and there fixed files
are not used. Hence, verifying fds only against ctx->nr_user_files is
enough. Remove the other check from hot path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move work.files grabbing into io_prep_async_work() to all other work
resources initialisation. We don't need to keep it separately now, as
->ring_fd/file are gone. It also allows to not grab it when a request
is not going to io-wq.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no real reason left for preparing io-wq work context for linked
requests in advance, remove it as this might become a bottleneck in some
cases.
Reported-by: Roman Gershman <romger@amazon.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are no bugs here that I've spotted, it's just easier to use the
normal API and there are no performance advantages to using the more
verbose advanced API.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The xas_store() wasn't paired with an xas_nomem() loop, so if it couldn't
allocate memory using GFP_NOWAIT, it would leak the reference to the file
descriptor. Also the node pointed to by the xas could be freed between
the call to xas_load() under the rcu_read_lock() and the acquisition of
the xa_lock.
It's easier to just use the normal xa_load/xa_store interface here.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[axboe: fix missing assign after alloc, cur_uring -> tctx rename]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have to drop the lock during each iteration, so there's no advantage
to using the advanced API. Convert this to a standard xa_for_each() loop.
Reported-by: syzbot+27c12725d8ff0bfe1a13@syzkaller.appspotmail.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Colin reports that there's unreachable code, since we only ever break
if ret == 0. This is correct, and is due to a reversed logic condition
in when to break or not.
Break out of the loop if we don't process any task work, in that case
we do want to return -EINTR.
Fixes: af9c1a44f8 ("io_uring: process task work in io_uring_register()")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Identical to how we handle the ctx reference counts, increase by the
batch we're expecting to submit, and handle any slow path residual,
if any. The request alloc-and-issue path is very hot, and this makes
a noticeable difference by avoiding an two atomic incs for each
individual request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.
This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All request preparations are done only during submission, reflect it in
the code by moving io_req_prep() much earlier into io_queue_sqe().
That's much cleaner, because it doen't expose bits to async code which
it won't ever use. Also it makes the interface harder to misuse, and
there are potential places for bugs.
For instance, __io_queue() doesn't clear @sqe before proceeding to a
next linked request, that could have been disastrous, but hopefully
there are linked requests IFF sqe==NULL, so not actually a bug.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_issue_sqe() does two things at once, trying to prepare request and
issuing them. Split it in two and deduplicate with io_defer_prep().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All io_*_prep() functions including io_{read,write}_prep() are called
only during submission where @force_nonblock is always true. Don't keep
propagating it and instead remove the @force_nonblock argument
from prep() altogether.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
it's set/cleared in a single place. Also remove @force_nonblock
parameter from io_prep_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to
be called only once, so there is no one who may have set the flag
before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Put brackets around bitwise ops in a complex expression
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extract common code from if/else branches. That is cleaner and optimised
even better.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
io_uring instances in a host. Since all sqthreads are named
"io_uring-sq", it's hard to distinguish the relations between
application process and its io_uring sqthread.
With this patch, application can get its corresponding sqthread pid
and cpu through show_fdinfo.
Steps:
1. Get io_uring fd first.
$ ls -l /proc/<pid>/fd | grep -w io_uring
2. Then get io_uring instance related info, including corresponding
sqthread pid and cpu.
$ cat /proc/<pid>/fdinfo/<io_uring_fd>
pos: 0
flags: 02000002
mnt_id: 13
SqThread: 6929
SqThreadCpu: 2
UserFiles: 1
0: testfile
UserBufs: 0
PollList:
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
[axboe: fixed for new shared SQPOLL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We do this for CQ ring wait, in case task_work completions come in. We
should do the same in io_uring_register(), to avoid spurious -EINTR
if the ring quiescing ends up having to process task_work to complete
the operation
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a few operations that are offloaded to the worker threads. In
this case, we lose process context and end up in kthread context. This
results in ios to be not accounted to the issuing cgroup and
consequently end up as issued by root. Just like others, adopt the
personality of the blkcg too when issuing via the workqueues.
For the SQPOLL thread, it will live and attach in the inited cgroup's
context.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring does account any registered buffer as pinned/locked memory, and
checks limit and fails if the given user doesn't have a big enough limit
to register the ranges specified. However, if huge pages are used, we
are potentially under-accounting the memory in terms of what gets pinned
on the vm side.
This patch rectifies that, by ensuring that we account the full size of
a compound page, regardless of how much of it is being registered. Huge
pages are not accounted mulitple times - if multiple sections of a huge
page is registered, then the page is only accounted once.
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the spirit of fairness, cap the max number of SQ entries we'll submit
for SQPOLL if we have multiple rings. If we don't do that, we could be
submitting tons of entries for one ring, while others are waiting to get
service.
The value of 8 is somewhat arbitrarily chosen as something that allows
a fair bit of batching, without using an excessive time per ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's really no point in having this union, it just means that we're
always allocating enough room to cater to any command. But that's
pointless, as the ->io field is request type private anyway.
This gets rid of the io_async_ctx structure, and fills in the required
size in the io_op_defs[] instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary,
because in that case ctx->nr_user_bufs would be zero, and the following
check would fail.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()
iorw->iter into itself. Even though it doesn't lead to an error, such a
memcpy()'s aliasing rules violation is considered to be a bad practise.
Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
remapping there, so it's much simpler than the generic implementation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set rw->free_iovec to @iovec, that gives an identical result and stresses
that @iovec param rw->free_iovec play the same role.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't touch iter->iov and iov in between __io_import_iovec() and
io_req_map_rw(), the former function aleady sets it correctly, because it
creates one more case with NULL'ed iov to consider in io_req_map_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When using SQPOLL, applications can run into the issue of running out of
SQ ring entries because the thread hasn't consumed them yet. The only
option for dealing with that is checking later, or busy checking for the
condition.
Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
condition.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We support using IORING_SETUP_ATTACH_WQ to share async backends between
rings created by the same process, this now also allows the same to
happen with SQPOLL. The setup procedure remains the same, the caller
sets io_uring_params->wq_fd to the 'parent' context, and then the newly
created ring will attach to that async backend.
This means that multiple rings can share the same SQPOLL thread, saving
resources.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
data structure we pass in. io_sq_data has a list of ctx's that we can
then iterate over and handle.
As of now we're ready to handle multiple ctx's, though we're still just
handling a single one after this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move all the necessary state out of io_ring_ctx, and into a new
structure, io_sq_data. The latter now deals with any state or
variables associated with the SQPOLL thread itself.
In preparation for supporting more than one io_ring_ctx per SQPOLL
thread.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is done in preparation for handling more than one ctx, but it also
cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
get a get overview on.
__io_sq_thread() is now the main handler, and it returns an enum sq_ret
that tells io_sq_thread() what it ended up doing. The parent then makes
a decision on idle, spinning, or work handling based on that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to decouple the clearing on wakeup from the the inline schedule,
as that is going to be required for handling multiple rings in one
thread.
Wrap our wakeup handler so we can clear it when we get the wakeup, by
definition that is when we no longer need the flag set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation to sharing the poller thread between rings. For
that we need per-ring wait_queue_entry storage, and we can't easily put
that on the stack if one thread is managing multiple rings.
We'll also be sharing the wait_queue_head across rings for the purposes
of wakeups, provide the usual private ring wait_queue_head for now but
make it a pointer so we can easily override it when sharing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're not handling signals by default in kernel threads, and we never
use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
have a signal pending, and we don't need to check for it (nor flush it).
Signed-off-by: Jens Axboe <axboe@kernel.dk>