If the credentials or the mm doesn't match, don't allow the task to
submit anything on behalf of this ring. The task that owns the ring can
pass the file descriptor to another task, but we don't want to allow
that task to submit an SQE that then assumes the ring mm and creds if
it needs to go async.
Cc: stable@vger.kernel.org
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit moved the locking for the async sqthread, but didn't
take into account that the io-wq workers still need it. We can't use
req->in_async for this anymore as both the sqthread and io-wq workers
set it, gate the need for locking on io_wq_current_is_worker() instead.
Fixes: 8a4955ff1c ("io_uring: sqthread should grab ctx->uring_lock for submissions")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->result is cleared when io_issue_sqe() calls io_read/write_pre()
routines. Those routines however are not called when the sqe
argument is NULL, which is the case when io_issue_sqe() is called from
io_wq_submit_work(). io_issue_sqe() may then examine a stale result if
a polled request had previously failed with -EAGAIN:
if (ctx->flags & IORING_SETUP_IOPOLL) {
if (req->result == -EAGAIN)
return -EAGAIN;
io_iopoll_req_issued(req);
}
and in turn cause a subsequently completed request to be re-issued in
io_wq_submit_work().
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we pass back dependent work in case of links, we need to always
ensure that we call the link setup and work prep handler. If not, we
might be missing some setup for the next work item.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need it, and if we have it, then the retry handler will attempt
to copy the non-existent iovec with the inline iovec, with a segment
count that doesn't make sense.
Fixes: f67676d160 ("io_uring: ensure async punted read/write requests copy iovec")
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently punt any short read on a regular file to async context,
but this fails if the short read is due to running into EOF. This is
especially problematic since we only do the single prep for commands
now, as we don't reset kiocb->ki_pos. This can result in a 4k read on
a 1k file returning zero, as we detect the short read and then retry
from async context. At the time of retry, the position is now 1k, and
we end up reading nothing, and hence return 0.
Instead of trying to patch around the fact that short reads can be
legitimate and won't succeed in case of retry, remove the logic to punt
a short read to async context. Simply return it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reschedule the current IO worker to cut the risk that it is becoming
a cpu hog.
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit e61df66c69 ("io-wq: ensure free/busy list browsing see all
items") added a list for io workers in addition to the free and busy
lists, not only making worker walk cleaner, but leaving the busy list
unused. Let's remove it.
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This moves the prep handlers outside of the opcode handlers, and allows
us to pass in the sqe directly. If the sqe is non-NULL, it means that
the request should be prepared for the first time.
With the opcode handlers not having access to the sqe at all, we are
guaranteed that the prep handler has setup the request fully by the
time we get there. As before, for opcodes that need to copy in more
data then the io_kiocb allows for, the io_async_ctx holds that info. If
a prep handler is invoked with req->io set, it must use that to retain
information for later.
Finally, we can remove io_kiocb->sqe as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently have a mix of use cases. Most of the newer ones are pretty
uniform, but we have some older ones that use different calling
calling conventions. This is confusing.
For the opcodes that currently rely on the req->io->sqe copy saving
them from reuse, add a request type struct in the io_kiocb command
union to store the data they need.
Prepare for all opcodes having a standard prep method, so we can call
it in a uniform fashion and outside of the opcode handler. This is in
preparation for passing in the 'sqe' pointer, rather than storing it
in the io_kiocb. Once we have uniform prep handlers, we can leave all
the prep work to that part, and not even pass in the sqe to the opcode
handler. This ensures that we don't reuse sqe data inadvertently.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the count field to struct io_timeout, and ensure the prep handler
has read it. Timeout also needs an async context always, set it up
in the prep handler if we don't have one.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add struct io_sr_msg in our io_kiocb per-command union, and ensure that
the send/recvmsg prep handlers have grabbed what they need from the SQE
by the time prep is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add struct io_connect in our io_kiocb per-command union, and ensure
that io_connect_prep() has grabbed what it needs from the SQE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Put the kiocb in struct io_rw, and add the addr/len for the request as
well. Use the kiocb->private field for the buffer index for fixed reads
and writes.
Any use of kiocb->ki_filp is flipped to req->file. It's the same thing,
and less confusing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use it in some spots, but not consistently. Convert the rest over,
makes it easier to read as well.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I've been chasing a weird and obscure crash that was userspace stack
corruption, and finally narrowed it down to a bit flip that made a
stack address invalid. io_wq_submit_work() unconditionally flips
the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work
handler, this isn't valid. Normal read/write operations own that
part of the request, on other types it could be something else.
Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error).
Then it will wait for not-yet-submitted requests, deadlocking the user
in most cases.
Don't wait/poll if can't submit all sqes
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we have all the opcodes handled in terms of command prep and
SQE reuse, add a printk_once() to warn about any potentially new and
unhandled ones.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we defer a request, we can't be reading the opcode again. Ensure that
the user_data and opcode fields are stable. For the user_data we already
have a place for it, for the opcode we can fill a one byte hold and store
that as well. For both of them, assign them when we originally read the
SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data
is switched to req->opcode and req->user_data.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the timeout remove op into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the async cancel op into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we defer these commands as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the poll add/remove into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a
link and there is no need to set IOSQE_IO_LINK separately, though it
could be there. Add proper check and ensure that IOSQE_IO_HARDLINK
implies IOSQE_IO_LINK.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're currently not retaining sqe data for accept, fsync, and
sync_file_range. None of these commands need data outside of what
is directly provided, hence it can't go stale when the request is
deferred. However, it can get reused, if an application reuses
SQE entries.
Ensure that we retain the information we need and only read the sqe
contents once, off the submission path. Most of this is just moving
code into a prep and finish function.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We pass in req->sqe for all of them, no need to pass it in as the
request is always passed in. This is a necessary prep patch to be
able to cleanup/fix the request prep path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some of these code paths assume that any force_nonblock == true issue
is not prepped, but that's not true if we did prep as part of link setup
earlier. Check if we already have an async context allocate before
setting up a new one.
Cleanup the async context setup in general, we have a lot of duplicated
code there.
Fixes: 03b1230ca1 ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Fixes: f67676d160 ("io_uring: ensure async punted read/write requests copy iovec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have to punt the recvmsg to async context, we copy all the
context. But since the iovec used can be either on-stack (if small) or
dynamically allocated, if it's on-stack, then we need to ensure we reset
the iov pointer. If we don't, then we're reusing old stack data, and
that can lead to -EFAULTs if things get overwritten.
Ensure we retain the right pointers for the iov, and free it as well if
we end up having to go beyond UIO_FASTIOV number of vectors.
Fixes: 03b1230ca1 ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix a few typos found while reading the code.
- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter
was renamed to 'req', but the comment still holds.
Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y5kIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv0MEADY9LOC2JkG2aNP53gCXGu74rFQNstJfusr
xPpkMs5I7hxoeVp/bkVAvR6BbpfPDsyGMhSMURELgMFzkuw+03z2IbxVuexdGOEu
X1+6EkunAq/r331fXL+cXUKJM3ThlkUBbNTuV8z5oWWSM1EfoBAWYfh2u4L4oBcc
yIHRH8by9qipx+fhBWPU/ZWeCcMVt5Ju2b6PHAE/GWwteZh00PsTwq0oqQcVcm/5
4Ennr0OELb7LaZWblAIIZe96KeJuCPzGfbRV/y12Ne/t6STH/sWA1tMUzgT22Eju
27uXtwAJtoLsep4V66a4VYObs/hXmEP281wIsXEnIkX0YCvbAiM+J6qVHGjYSMGf
mRrzfF6nDC1vaMduE4waoO3VFiDFQU/qfQRa21ZP50dfQOAXTpEz8kJNG/PjN1VB
gmz9xsujDU1QPD7IRiTreiPPHE5AocUzlUYuYOIMt7VSbFZIMHW5S/pML93avkJt
nx6g3gOP0JKkjaCEvIKY4VLljDT8eDZ/WScDsSedSbPZikMzEo8DSx4U6nbGPqzy
qSljgQniDcrH8GdRaJFDXgLkaB8pu83NH7zUH+xioUZjAHq/XEzKQFYqysf1DbtU
SEuSnUnLXBvbwfb3Z2VpQ/oz8G3a0nD9M7oudt1sN19oTCJKxYSbOUgNUrj9JsyQ
QlAVrPKRkQ==
=fbsf
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- A tweak to IOSQE_IO_LINK (also marked for stable) to allow links that
don't sever if the result is < 0.
This is mostly for linked timeouts, where if we ask for a pure
timeout we always get -ETIME. This makes links useless for that case,
hence allow a case where it works.
- Five minor optimizations to fix and improve cases that regressed
since v5.4.
- An SQTHREAD locking fix.
- A sendmsg/recvmsg iov assignment fix.
- Net fix where read_iter/write_iter don't honor IOCB_NOWAIT, and
subsequently ensuring that works for io_uring.
- Fix a case where for an invalid opcode we might return -EBADF instead
of -EINVAL, if the ->fd of that sqe was set to an invalid fd value.
* tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block:
io_uring: ensure we return -EINVAL on unknown opcode
io_uring: add sockets to list of files that support non-blocking issue
net: make socket read/write_iter() honor IOCB_NOWAIT
io_uring: only hash regular files for async work execution
io_uring: run next sqe inline if possible
io_uring: don't dynamically allocate poll data
io_uring: deferred send/recvmsg should assign iov
io_uring: sqthread should grab ctx->uring_lock for submissions
io-wq: briefly spin for new work after finishing work
io-wq: remove worker->wait waitqueue
io_uring: allow unbreakable links
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3umDgWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlvsD/49R12HK7UzTxNTrcpvbadJ4t7j
j/qJvjMerW7iVNAPOoNAOePUa21+y3rI1AZPvoPyzIqp1Bf2eOICf5SdisG2cG+O
X0A8EKWvS0SSQWSKaT6udUKJ3nBJItwvOvQ5B58KQzcOj3S4X7B9iVBWgieMHrzz
urkZm7pqowrZB3wuF8keRtli5IZaoiCwzApy48Qrn70G3OeXymknFbpHTDwIAiGw
RiE5Xh0R4EzQdsYyCgjR8U56gBchadAmj8BUJU0ppMnOFMyIAG670hNLrs0L3roP
8TOIeyb993ZC5GZaMlnR8mz0jfibfkPa3Z85VAsVyQSPaOQldwc9j8TGBqD5Gfat
1PjOU5RVwma0pH5xTPOeevWPQpIK9KovQpQYqMMN9GMxOEx96IOUjwTrnNK2xWoN
UGyOVlESFGoniClhCiKYzPSrYOjlIBk5ovf15PdTe+bwyUDMfyfy5CZV88OS2DHz
ZBZvpLrH/EMW9zJ+FqMTp0C4s4wa2Ioid3bSh6XuNUTtltKSjp71eUja8ZEz+2sd
5AGstCC+hYqxaEk+6/851pfkQ9sbBjwuGtNrtX+pqreiLUvWLhQ0yUj6cLXlEQNH
aucjCukCjI+4lMzofeaQ2LbNhtff4YsfO4b1Ye8maoDdHjzUVL57n3bTOxKhdzbt
y6FM3lApOjk3OyaTJQ==
=YU4A
-----END PGP SIGNATURE-----
Merge tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull FIELD_SIZEOF conversion from Kees Cook:
"A mostly mechanical treewide conversion from FIELD_SIZEOF() to
sizeof_field(). This avoids the redundancy of having 2 macros
(actually 3) doing the same thing, and consolidates on sizeof_field().
While "field" is not an accurate name, it is the common name used in
the kernel, and doesn't result in any unintended innuendo.
As there are still users of FIELD_SIZEOF() in -next, I will clean up
those during this coming development cycle and send the final old
macro removal patch at that time"
* tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Use sizeof_field() macro
MIPS: OCTEON: Replace SIZEOF_FIELD() macro
from Xiubo, a patch to add some observability into cap waiters from
Jeff and a couple of cleanups.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3yiJMTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5k9CACmM3fJGrTUuOLgXAxxllCfiV6UQLoY
nuTo/bx0DmG603n+Ze8+Z0iz7hDc1Gw2XUeLkJcAE/xSetgZXO/MvJ0Ionq5Ac/k
CrqS6ucIa1bPxbE1QMTHswHjkajKwBpAZ5+khdLNLuXJxy3c9HDCGOT4VZav7Yc9
99W4kIdzOKdYLpZHAedMK97IJIrD5WhYTAFW4rNPY0GL6OPD1V0uiS9v7xUWIxnZ
Uusnu+zY8miQlLVx/V9DyLh/6G5X7XyQO1nkSQcVXZOOG7+qnkq6jDhQW8adgOSZ
wUFigTxxhSTIcntWg01TaCRNoi1N3/P8Z9/rD27zBHPbl93ANH+lUkCh
=NicF
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A fix to avoid a corner case when scheduling cap reclaim in batches
from Xiubo, a patch to add some observability into cap waiters from
Jeff and a couple of cleanups"
* tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client:
ceph: add more debug info when decoding mdsmap
ceph: switch to global cap helper
ceph: trigger the reclaim work once there has enough pending caps
ceph: show tasks waiting on caps in debugfs caps file
ceph: convert int fields in ceph_mount_options to unsigned int
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3xW+cACgkQ+7dXa6fL
C2t+UQ/9ETwAYJ2XgfwatNhAHX0+Tf4XMlgEDybf5a+ERRLbkSQEqRjTImopk5m4
T7yKea0ctsVb1avdUVdayh2KuJhjmO627u3w48kd/uf2ut4IyOyaraatYKYOJ+XL
upr8WxAMXUHgnwsb38avjH0dwq7+eDyX8FNLHiDY4Sk3mTAVHbafL6V0ujp0ms2o
VmHLX6ihwECehUqrNEyy1pLX2nuFmd+MeLnoi/EWLa47/X0te21G8u8UdPWHGgHn
cZp8kqWSEoaeChO7x6XOoJZCY6N/7o+hogsrU/N5YrB6FXPpFWGdwFpej3YcyFso
QqffyXX5jVAyUg9I/v3WCrtDmTQ9xVG4kxhmBMVId6bBk3ZVpGaHuLE3ENLbr1w/
Z1k26BI41nJwrqV0C2bPoMalpDprP2WIuWsBIpYTqaICpc53KJ72c6mLhcDEt+oc
YCSd9X0Bv5O7tZS9rLI8mt0JtCr9Uw+VfnK+uOuUTPv2YcqQwK1/xZ0/9hd2P8XS
L5GKcEeuk3indgmefjwsz5YJcmzeCttaOQTlmaN/Hz3MbkK2CC+spMG4OZWzUXL4
axZmCIo/kKrv64bLyVZul6BjkhawyvfWzCs2NM6Ni63akW3Yb0S2WRdgJhvGILml
48YgqcG7R1y8LRPYiJzcuAPnbumuu6ZSxVuyZN0FU0qc4lt5mPM=
=ldOp
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Fixes for AFS plus one patch to make debugging easier:
- Fix how addresses are matched to server records. This is currently
incorrect which means cache invalidation callbacks from the server
don't necessarily get delivered correctly. This causes stale data
and metadata to be seen under some circumstances.
- Make the dynamic root superblock R/W so that rpm/dnf can reapply
the SELinux label to it when upgrading the Fedora filesystem-afs
package. If the filesystem is R/O, this fails and the upgrade
fails.
It might be better in future to allow setxattr from an LSM to
bypass the R/O protections, if only for pseudo-filesystems.
- Fix the parsing of mountpoint strings. The mountpoint object has to
have a terminal dot, whereas the source/device string passed to
mount should not. This confuses type-forcing suffix detection
leading to the wrong volume variant being mounted.
- Make lookups in the dynamic root superblock for creation events
(such as mkdir) fail with EOPNOTSUPP rather than something like
EEXIST. The dynamic root only allows implicit creation by the
->lookup() method - and only if the target cell exists.
- Fix the looking up of an AFS superblock to include the cell in the
matching key - otherwise all volumes with the same ID number are
treated as the same thing, irrespective of which cell they're in.
- Show the volume name of each volume in the volume records displayed
in /proc/net/afs/<cell>/volumes. This proved useful in debugging as
it provides a way to map the volume IDs to names, where the names
are what appear in /proc/mounts"
* tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Show volume name in /proc/net/afs/<cell>/volumes
afs: Fix missing cell comparison in afs_test_super()
afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP
afs: Fix mountpoint parsing
afs: Fix SELinux setting security label on /afs
afs: Fix afs_find_server lookups for ipv4 peers
If we submit an unknown opcode and have fd == -1, io_op_needs_file()
will return true as we default to needing a file. Then when we go and
assign the file, we find the 'fd' invalid and return -EBADF. We really
should be returning -EINVAL for that case, as we normally do for
unsupported opcodes.
Change io_op_needs_file() to have the following return values:
0 - does not need a file
1 - does need a file
< 0 - error value
and use this to pass back the right value for this invalid case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix improper return value of listxattr() with no xattr;
- Keep up documentation with latest code.
-----BEGIN PGP SIGNATURE-----
iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXfELlBYcZ2FveGlhbmcy
NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEtUABAN164UwGU9QKEsqgZQcmbz23qXSJ
QDR8r/ch2LxzXKkVAQDXCNU+ol6jkiapLcTvsXEjBk8sUxsCEVnmZ36jru+TBA==
=kRp9
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Mainly address a regression reported by David recently observed
together with overlayfs due to the improper return value of
listxattr() without xattr. Update outdated expressions in document as
well.
Summary:
- Fix improper return value of listxattr() with no xattr
- Keep up documentation with latest code"
* tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: update documentation
erofs: zero out when listxattr is called with no xattr
There's no need to separately check for signals while inside the locked
region, since we're going to do "wait_event_interruptible()" right
afterwards anyway, and the error handling is much simpler there.
The check for whether we had already read anything was also redundant,
since we no longer do the odd merging of reads when there are pending
writers.
But perhaps more importantly, this adds commentary about why we still
need to wake up possible writers even though we didn't read any data,
and why we can skip all the finishing touches now if we get a signal (or
had a signal pending) while waiting for more data.
[ This is a split-out cleanup from my "make pipe IO use exclusive wait
queues" thing, which I can't apply because it triggers a nasty bug in
the GNU make jobserver - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Show the name of each volume in /proc/net/afs/<cell>/volumes to make it
easier to work out the name corresponding to a volume ID. This makes it
easier to work out which mounts in /proc/mounts correspond to which volume
ID.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Fix missing cell comparison in afs_test_super(). Without this, any pair
volumes that have the same volume ID will share a superblock, no matter the
cell, unless they're in different network namespaces.
Normally, most users will only deal with a single cell and so they won't
see this. Even if they do look into a second cell, they won't see a
problem unless they happen to hit a volume with the same ID as one they've
already got mounted.
Before the patch:
# ls /afs/grand.central.org/archive
linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/
# ls /afs/kth.se/
linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/
# cat /proc/mounts | grep afs
none /afs afs rw,relatime,dyn,autocell 0 0
#grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0
#grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0
#grand.central.org:root.archive /afs/kth.se afs ro,relatime 0 0
After the patch:
# ls /afs/grand.central.org/archive
linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/
# ls /afs/kth.se/
admin/ common/ install/ OldFiles/ service/ system/
bakrestores/ home/ misc/ pkg/ src/ wsadmin/
# cat /proc/mounts | grep afs
none /afs afs rw,relatime,dyn,autocell 0 0
#grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0
#grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0
#kth.se:root.cell /afs/kth.se afs ro,relatime 0 0
Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Carsten Jacobi <jacobi@de.ibm.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
cc: Todd DeSantis <atd@us.ibm.com>
Fix the lookup method on the dynamic root directory such that creation
calls, such as mkdir, open(O_CREAT), symlink, etc. fail with EOPNOTSUPP
rather than failing with some odd error (such as EEXIST).
lookup() itself tries to create automount directories when it is invoked.
These are cached locally in RAM and not committed to storage.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
Each AFS mountpoint has strings that define the target to be mounted. This
is required to end in a dot that is supposed to be stripped off. The
string can include suffixes of ".readonly" or ".backup" - which are
supposed to come before the terminal dot. To add to the confusion, the "fs
lsmount" afs utility does not show the terminal dot when displaying the
string.
The kernel mount source string parser, however, assumes that the terminal
dot marks the suffix and that the suffix is always "" and is thus ignored.
In most cases, there is no suffix and this is not a problem - but if there
is a suffix, it is lost and this affects the ability to mount the correct
volume.
The command line mount command, on the other hand, is expected not to
include a terminal dot - so the problem doesn't arise there.
Fix this by making sure that the dot exists and then stripping it when
passing the string to the mount configuration.
Fixes: bec5eb6141 ("AFS: Implement an autocell mount capability [ver #2]")
Reported-by: Jonathan Billings <jsbillings@jsbillings.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
In chasing a performance issue between using IORING_OP_RECVMSG and
IORING_OP_READV on sockets, tracing showed that we always punt the
socket reads to async offload. This is due to io_file_supports_async()
not checking for S_ISSOCK on the inode. Since sockets supports the
O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list
of file types that we can do a non-blocking issue to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We hash regular files to avoid having multiple threads hammer on the
inode mutex, but it should not be needed on other types of files
(like sockets).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
One major use case of linked commands is the ability to run the next
link inline, if at all possible. This is done correctly for async
offload, but somewhere along the line we lost the ability to do so when
we were able to complete a request without having to punt it. Ensure
that we do so correctly.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This essentially reverts commit e944475e69. For high poll ops
workloads, like TAO, the dynamic allocation of the wait_queue
entry for IORING_OP_POLL_ADD adds considerable extra overhead.
Go back to embedding the wait_queue_entry, but keep the usage of
wait->private for the pointer stashing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't just assign it from the main call path, that can miss the case
when we're called from issue deferral.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use the mutex to guard against registered file updates, for instance.
Ensure we're safe in accessing that state against concurrent updates.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To avoid going to sleep only to get woken shortly thereafter, spin
briefly for new work upon completion of work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only have one cases of using the waitqueue to wake the worker, the
rest are using wake_up_process(). Since we can save some cycles not
fiddling with the waitqueue io_wqe_worker(), switch the work activation
to task wakeup and get rid of the now unused wait_queue_head_t in
struct io_worker.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some commands will invariably end in a failure in the sense that the
completion result will be less than zero. One such example is timeouts
that don't have a completion count set, they will always complete with
-ETIME unless cancelled.
For linked commands, we sever links and fail the rest of the chain if
the result is less than zero. Since we have commands where we know that
will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
regardless of the completion result. Note that the link will still sever
if we fail submitting the parent request, hard links are only resilient
in the presence of completion results for requests that did submit
correctly.
Cc: stable@vger.kernel.org # v5.4
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>