commit 398840f8bb935d33c64df4ec4fed77a7d24c267d upstream.
This was an oversight in the original implementation, as it makes no
sense to specify both scoping flags to the same openat2(2) invocation
(before this patch, the result of such an invocation was equivalent to
RESOLVE_IN_ROOT being ignored).
This is a userspace-visible ABI change, but the only user of openat2(2)
at the moment is LXC which doesn't specify both flags and so no
userspace programs will break as a result.
Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <stable@vger.kernel.org> # v5.6+
Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c61b3e4839007668360ed8b87d7da96d2e59fc6c upstream.
Bounds checking tools can flag a bug in dbAdjTree() for an array index
out of bounds in dmt_stree. Since dmt_stree can refer to the stree in
both structures dmaptree and dmapctl, use the larger array to eliminate
the false positive.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fecc4559780d52d174ea05e3bf543669165389c3 upstream.
fsnotify_parent() used to send two separate events to backends when a
parent inode is watching children and the child inode is also watching.
In an attempt to avoid duplicate events in fanotify, we unified the two
backend callbacks to a single callback and handled the reporting of the
two separate events for the relevant backends (inotify and dnotify).
However the handling is buggy and can result in inotify and dnotify
listeners receiving events of the type they never asked for or spurious
events.
The problem is the unified event callback with two inode marks (parent and
child) is called when any of the parent and child inodes are watched and
interested in the event, but the parent inode's mark that is interested
in the event on the child is not necessarily the one we are currently
reporting to (it could belong to a different group).
So before reporting the parent or child event flavor to backend we need
to check that the mark is really interested in that event flavor.
The semantics of INODE and CHILD marks were hard to follow and made the
logic more complicated than it should have been. Replace it with INODE
and PARENT marks semantics to hopefully make the logic more clear.
Thanks to Hugh Dickins for spotting a bug in the earlier version of this
patch.
Fixes: 497b0c5a7c ("fsnotify: send event to parent and child with single callback")
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201202120713.702387-4-amir73il@gmail.com
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1a2620a99803ad660edc5d22fd9c66cce91ceb1c upstream.
Convert inotify to use the simple handle_inode_event() interface to
get rid of the code duplication between the generic helper
fsnotify_handle_event() and the inotify_handle_event() callback, which
also happen to be buggy code.
The bug will be fixed in the generic helper.
Link: https://lore.kernel.org/r/20201202120713.702387-3-amir73il@gmail.com
CC: stable@vger.kernel.org
Fixes: b9a1b97725 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 950cc0d2bef078e1f6459900ca4d4b2a2e0e3c37 upstream.
The handle_inode_event() interface was added as (quoting comment):
"a simple variant of handle_event() for groups that only have inode
marks and don't have ignore mask".
In other words, all backends except fanotify. The inotify backend
also falls under this category, but because it required extra arguments
it was left out of the initial pass of backends conversion to the
simple interface.
This results in code duplication between the generic helper
fsnotify_handle_event() and the inotify_handle_event() callback
which also happen to be buggy code.
Generalize the handle_inode_event() arguments and add the check for
FS_EXCL_UNLINK flag to the generic helper, so inotify backend could
be converted to use the simple interface.
Link: https://lore.kernel.org/r/20201202120713.702387-2-amir73il@gmail.com
CC: stable@vger.kernel.org
Fixes: b9a1b97725 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 08cd274f9b8283a1da93e2ccab216a336da83525 upstream.
The jffs2 mount options will be ignored when remounting jffs2.
It can be easily reproduced with the steps listed below.
1. mount -t jffs2 -o compr=none /dev/mtdblockx /mnt
2. mount -o remount compr=zlib /mnt
Since ec10a24f10, the option parsing happens before fill_super and
then pass fc, which contains the options parsing results, to function
jffs2_reconfigure during remounting. But function jffs2_reconfigure do
not update c->mount_opts.
This patch add a function jffs2_update_mount_opts to fix this problem.
By the way, I notice that tmpfs use the same way to update remounting
options. If it is necessary to unify them?
Cc: <stable@vger.kernel.org>
Fixes: ec10a24f10 ("vfs: Convert jffs2 to use the new mount API")
Signed-off-by: lizhe <lizhe67@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9afc9a8a4909fece0e911e72b1060614ba2f7969 upstream.
The log of this problem is:
jffs2: Error garbage collecting node at 0x***!
jffs2: No space for garbage collection. Aborting GC thread
This is because GC believe that it do nothing, so it abort.
After going over the image of jffs2, I find a scene that
can trigger this problem stably.
The scene is: there is a normal dirent node at summary-area,
but abnormal at corresponding not-summary-area with error
name_crc.
The reason that GC exit abnormally is because it find that
abnormal dirent node to GC, but when it goes to function
jffs2_add_fd_to_list, it cannot meet the condition listed
below:
if ((*prev)->nhash == new->nhash && !strcmp((*prev)->name, new->name))
So no node is marked obsolete, statistical information of
erase_block do not change, which cause GC exit abnormally.
The root cause of this problem is: we do not check the
name_crc of the abnormal dirent node with summary is enabled.
Noticed that in function jffs2_scan_dirent_node, we use
function jffs2_scan_dirty_space to deal with the dirent
node with error name_crc. So this patch add a checking
code in function read_direntry to ensure the correctness
of dirent node. If checked failed, the dirent node will
be marked obsolete so GC will pass this node and this
problem will be fixed.
Cc: <stable@vger.kernel.org>
Signed-off-by: Zhe Li <lizhe67@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 20f1431160c6b590cdc269a846fc5a448abf5b98 upstream.
Write buffers use a kmalloc()'ed buffer, they can leak
up to seven bytes of kernel memory to flash if writes are not
aligned.
So use ubifs_pad() to fill these gaps with padding bytes.
This was never a problem while scanning because the scanner logic
manually aligns node lengths and skips over these gaps.
Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7955f105afb6034af344038d663bc98809483cdd upstream.
In the negotiate protocol preauth context, the server is not required
to populate the salt (although it is done by most servers) so do
not warn on mount.
We retain the checks (warn) that the preauth context is the minimum
size and that the salt does not exceed DataLength of the SMB response.
Although we use the defaults in the case that the preauth context
response is invalid, these checks may be useful in the future
as servers add support for additional mechanisms.
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc7c4129d4cdc56d1b5477c1714246f27df914dd upstream.
Azure does not send an SPNEGO blob in the negotiate protocol response,
so we shouldn't assume that it is there when validating the location
of the first negotiate context. This avoids the potential confusing
mount warning:
CIFS: Invalid negotiate context offset
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ebcd6de98754d9b6a5f89d7835864b1c365d432f upstream.
Mounts to Azure cause an unneeded warning message in dmesg
"CIFS: VFS: parse_server_interfaces: incomplete interface info"
Azure rounds up the size (by 8 additional bytes, to a
16 byte boundary) of the structure returned on the query
of the server interfaces at mount time. This is permissible
even though different than other servers so do not log a warning
if query network interfaces response is only rounded up by 8
bytes or fewer.
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e5cafce3ad0f8652d6849314d951459c2bff7233 upstream.
A NULL pointer dereference may occur in __ceph_remove_cap with some of the
callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
remove_session_caps_cb. Those callers hold the session->s_mutex, so they
are prevented from concurrent execution, but ceph_evict_inode does not.
Since the callers of this function hold the i_ceph_lock, the fix is simply
a matter of returning immediately if caps->ci is NULL.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/43272
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 89bdfaf93d9157499c3a0d61f489df66f2dead7f upstream.
ovl_ioctl_set_flags() does a capability check using flags, but then the
real ioctl double-fetches flags and uses potentially different value.
The "Check the capability before cred override" comment misleading: user
can skip this check by presenting benign flags first and then overwriting
them to non-benign flags.
Just remove the cred override for now, hoping this doesn't cause a
regression.
The proper solution is to create a new setxflags i_op (patches are in the
works).
Xfstests don't show a regression.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: dab5ca8fd9 ("ovl: add lsattr/chattr support")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b08070eca9e247f60ab39d79b2c25d274750441f upstream.
ext4_handle_error() with errors=continue mount option can accidentally
remount the filesystem read-only when the system is rebooting. Fix that.
Fixes: 1dc1097ff6 ("ext4: avoid panic during forced reboot")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream.
Xattr code using inodes with large xattr data can end up dropping last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:
CPU1 CPU2
removexattr() freeze_super()
vfs_removexattr()
ext4_xattr_set()
handle = ext4_journal_start()
...
ext4_xattr_set_entry()
iput(old_ea_inode)
ext4_evict_inode(old_ea_inode)
sb->s_writers.frozen = SB_FREEZE_FS;
sb_wait_write(sb, SB_FREEZE_FS);
ext4_freeze()
jbd2_journal_lock_updates()
-> blocks waiting for all
handles to stop
sb_start_intwrite()
-> blocks as sb is already in SB_FREEZE_FS state
Generally it is advisable to delete inodes from a separate transaction
as it can consume quite some credits however in this case it would be
quite clumsy and furthermore the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we have transaction already started and thus
it is not really needed anyway.
Cc: stable@vger.kernel.org
Fixes: dec214d00e ("ext4: xattr inode deduplication")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127110649.24730-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cca415537244f6102cbb09b5b90db6ae2c953bdd upstream.
When freeing metadata, we will create an ext4_free_data and
insert it into the pending free list. After the current
transaction is committed, the object will be freed.
ext4_mb_free_metadata() will check whether the area to be freed
overlaps with the pending free list. If true, return directly. At this
time, ext4_free_data is leaked. Fortunately, the probability of this
problem is small, since it only occurs if the file system is corrupted
such that a block is claimed by more one inode and those inodes are
deleted within a single jbd2 transaction.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc18546bf68e47996a359d2533168d5770a22024 upstream.
The ext4_find_extent() function never returns NULL, it returns error
pointers.
Fixes: 44059e503b03 ("ext4: fast commit recovery path")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201023112232.GB282278@mwanda
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 upstream.
When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, that possibly want to operate on other
file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).
So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.
I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due a race that from a functional point of view is
not harmful (it could be triggered with btrfs/062 from fstests).
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 27d56e62e4748c2135650c260024e9904b8c1a0a upstream.
While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old style caching. Consider
the following scenario
commit root
+----+----+----+----+----+----+----+
|\\\\| |\\\\|\\\\| |\\\\|\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7
new commit root
+----+----+----+----+----+----+----+
| | | |\\\\| | |\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7
Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots. In this example lets assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1. Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we drop the commit_root_sem and re-acquired
it. Now caching_ctl->progress == 3. We swap out the commit root and
carry on to unpin.
The race can happen like:
1) The caching thread was running using the old commit root when it
found the extent for [2, 3);
2) Then it released the commit_root_sem because it was in the last
item of a leaf and the semaphore was contended, and set ->progress
to 3 (value of 'last'), as the last extent item in the current leaf
was for the extent for range [2, 3);
3) Next time it gets the commit_root_sem, will start using the new
commit root and search for a key with offset 3, so it never finds
the hole for [2, 3).
So the caching thread never saw [2, 3) as free space in any of the
commit roots, and by the time finish_extent_commit() was called for
the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
subrange [0, 1) to the free space cache, skipping [2, 3).
In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3). However because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache. This results in us missing a chunk of free space
in memory (on disk too, unless we have a power failure before writing
the free space cache to disk).
Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots, this ensures that we will always be
consistent.
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ update changelog with Filipe's review comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9076dbd5ee837c3882fc42891c14cecd0354a849 upstream.
While fixing up our ->last_byte_to_unpin locking I noticed that we will
shorten len based on ->last_byte_to_unpin if we're caching when we're
adding back the free space. This is correct for the free space, as we
cannot unpin more than ->last_byte_to_unpin, however we use len to
adjust the ->bytes_pinned counters and such, which need to track the
actual pinned usage. This could result in
WARN_ON(space_info->bytes_pinned) triggering at unmount time.
Fix this by using a local variable for the amount to add to free space
cache, and leave len untouched in this case.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 00c18640c2430c4bafaaeede1f9dd6f7ec0e4b25 upstream.
Before IORING_SETUP_ATTACH_WQ, we could just cancel everything on the
io-wq when exiting. But that's not the case if they are shared, so
cancel for the specific ctx instead.
Cc: stable@vger.kernel.org
Fixes: 24369c2e3b ("io_uring: add io-wq workqueue sharing")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9faadcc8abe4b83d0263216dc3a6321d5bbd616b upstream.
Once we created a file for current context during setup, we should not
call io_ring_ctx_wait_and_kill() directly as it'll be done by fput(file)
Cc: stable@vger.kernel.org # 5.10
Reported-by: syzbot+c9937dfb2303a5f18640@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix unused 'ret' for !CONFIG_UNIX]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c07e6719511e77c4b289f62bfe96423eb6ea061d upstream.
io_iopoll_complete() does not hold completion_lock to complete polled io,
so in io_wq_submit_work(), we can not call io_req_complete() directly, to
complete polled io, otherwise there maybe concurrent access to cqring,
defer_list, etc, which is not safe. Commit dad1b1242fd5 ("io_uring: always
let io_iopoll_complete() complete polled io") has fixed this issue, but
Pavel reported that IOPOLL apart from rw can do buf reg/unreg requests(
IORING_OP_PROVIDE_BUFFERS or IORING_OP_REMOVE_BUFFERS), so the fix is not
good.
Given that io_iopoll_complete() is always called under uring_lock, so here
for polled io, we can also get uring_lock to fix this issue.
Fixes: dad1b1242fd5 ("io_uring: always let io_iopoll_complete() complete polled io")
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: don't deref 'req' after completing it']
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0020ef04e48571a88d4f482ad08f71052c5c5a08 upstream.
For the first time a req punted to io-wq, we'll initialize io_wq_work's
list to be NULL, then insert req to io_wqe->work_list. If this req is not
inserted into tail of io_wqe->work_list, this req's io_wq_work list will
point to another req's io_wq_work. For splitted bio case, this req maybe
inserted to io_wqe->work_list repeatedly, once we insert it to tail of
io_wqe->work_list for the second time, now io_wq_work->list->next will be
invalid pointer, which then result in many strang error, panic, kernel
soft-lockup, rcu stall, etc.
In my vm, kernel doest not have commit cc29e1bf0d63f7 ("block: disable
iopoll for split bio"), below fio job can reproduce this bug steadily:
[global]
name=iouring-sqpoll-iopoll-1
ioengine=io_uring
iodepth=128
numjobs=1
thread
rw=randread
direct=1
registerfiles=1
hipri=1
bs=4m
size=100M
runtime=120
time_based
group_reporting
randrepeat=0
[device]
directory=/home/feiman.wxg/mntpoint/ # an ext4 mount point
If we have commit cc29e1bf0d63f7 ("block: disable iopoll for split bio"),
there will no splitted bio case for polled io, but I think we still to need
to fix this list corruption, it also should maybe go to stable branchs.
To fix this corruption, if a req is inserted into tail of io_wqe->work_list,
initialize req->io_wq_work->list->next to bu NULL.
Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 31bff9a51b264df6d144931a6a5f1d6cc815ed4b upstream.
IOPOLL allows buffer remove/provide requests, but they doesn't
synchronise by rules of IOPOLL, namely it have to hold uring_lock.
Cc: <stable@vger.kernel.org> # 5.7+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 59850d226e4907a6f37c1d2fe5ba97546a8691a4 upstream.
Checking !list_empty(&ctx->cq_overflow_list) around noflush in
io_cqring_events() is racy, because if it fails but a request overflowed
just after that, io_cqring_overflow_flush() still will be called.
Remove the second check, it shouldn't be a problem for performance,
because there is cq_check_overflow bit check just above.
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 14e3e989f6a5d9646b6cf60690499cc8bdc11f7d ]
Since commit 36e2c7421f ("fs: don't allow splice read/write without
explicit ops") we've required that file operation structures explicitly
enable splice support, rather than falling back to the default handlers.
Most /proc files use the indirect 'struct proc_ops' to describe their
file operations, and were fixed up to support splice earlier in commits
40be821d627c..b24c30c67863, but the mountinfo files interact with the
VFS directly using their own 'struct file_operations' and got missed as
a result.
This adds the necessary support for splice to work for /proc/*/mountinfo
and friends.
Reported-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reported-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209971
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cda286f0715c82f8117e166afd42cca068876dde ]
io_uring_cancel_task_requests() doesn't imply that the ring is going
away, it may continue to work well after that. The problem is that it
sets ->cq_overflow_flushed effectively disabling the CQ overflow feature
Split setting cq_overflow_flushed from flush, and do the first one only
on exit. It's ok in terms of cancellations because there is a
io_uring->in_idle check in __io_cqring_fill_event().
It also fixes a race with setting ->cq_overflow_flushed in
io_uring_cancel_task_requests, whuch's is not atomic and a part of a
bitmask with other flags. Though, the only other flag that's not set
during init is drain_next, so it's not as bad for sane architectures.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 0f2122045b ("io_uring: don't rely on weak ->files references")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 634578f800652035debba3098d8ab0d21af7c7a5 ]
It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without
hodling uring_lock, because it does synchronisation differently. Make
sure we have it.
As for io_ring_exit_work(), we don't even need it there because
io_ring_ctx_wait_and_kill() already set force flag making all overflowed
requests to be dropped.
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 289caf5d8f6c61c6d2b7fd752a7f483cd153f182 ]
Patch series "simplify ep_poll".
This patch series is a followup based on the suggestions and feedback by
Linus:
https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
The first patch in the series is a fix for the epoll race in presence of
timeouts, so that it can be cleanly backported to all affected stable
kernels.
The rest of the patch series simplify the ep_poll() implementation. Some
of these simplifications result in minor performance enhancements as well.
We have kept these changes under self tests and internal benchmarks for a
few days, and there are minor (1-2%) performance enhancements as a result.
This patch (of 8):
After abc610e01c ("fs/epoll: avoid barrier after an epoll_wait(2)
timeout"), we break out of the ep_poll loop upon timeout, without checking
whether there is any new events available. Prior to that patch-series we
always called ep_events_available() after exiting the loop.
This can cause races and missed wakeups. For example, consider the
following scenario reported by Guantao Liu:
Suppose we have an eventfd added using EPOLLET to an epollfd.
Thread 1: Sleeps for just below 5ms and then writes to an eventfd.
Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an
event of the eventfd, it will write back on that fd.
Thread 3: Calls epoll_wait with a negative timeout.
Prior to abc610e01c, it is guaranteed that Thread 3 will wake up either
by Thread 1 or Thread 2. After abc610e01c, Thread 3 can be blocked
indefinitely if Thread 2 sees a timeout right before the write to the
eventfd by Thread 1. Thread 2 will be woken up from
schedule_hrtimeout_range and, with evail 0, it will not call
ep_send_events().
To fix this issue:
1) Simplify the timed_out case as suggested by Linus.
2) while holding the lock, recheck whether the thread was woken up
after its time out has reached.
Note that (2) is different from Linus' original suggestion: It do not set
"eavail = ep_events_available(ep)" to avoid unnecessary contention (when
there are too many timed-out threads and a small number of events), as
well as races mentioned in the discussion thread.
This is the first patch in the series so that the backport to stable
releases is straightforward.
Link: https://lkml.kernel.org/r/20201106231635.3528496-1-soheil.kdev@gmail.com
Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
Link: https://lkml.kernel.org/r/20201106231635.3528496-2-soheil.kdev@gmail.com
Fixes: abc610e01c ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Guantao Liu <guantaol@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Guantao Liu <guantaol@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit df9923f96717d0aebb0a73adbcf6285fa79e38cb ]
io_uring_cancel_files() cancels all request that match files regardless
of task. There is no real need in that, cancel only requests of the
specified task. That also handles SQPOLL case as it already changes task
to it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c6c75deda81344c3a95d1d1f606d5cee109e5d54 ]
Commit 1fde6f21d9 ("proc: fix /proc/net/* after setns(2)") only forced
revalidation of regular files under /proc/net/
However, /proc/net/ is unusual in the sense of /proc/net/foo handlers
take netns pointer from parent directory which is old netns.
Steps to reproduce:
(void)open("/proc/net/sctp/snmp", O_RDONLY);
unshare(CLONE_NEWNET);
int fd = open("/proc/net/sctp/snmp", O_RDONLY);
read(fd, &c, 1);
Read will read wrong data from original netns.
Patch forces lookup on every directory under /proc/net .
Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain
Fixes: 1da4d377f9 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3cded66330591cfd2554a3fd5edca8859ea365a2 ]
Fix to return PTR_ERR() error code from the error handling case where
ubifs_hash_get_desc() failed instead of 0 in ubifs_init_authentication(),
as done elsewhere in this function.
Fixes: 49525e5eec ("ubifs: Add helper functions for authentication support")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 88149082bb8ef31b289673669e080ec6a00c2e59 ]
If generic_drop_inode() returns true, it means iput_final() can evict
this inode regardless of whether it is dirty or not. If we check
I_DONTCACHE in generic_drop_inode(), any inode with this bit set will be
evicted unconditionally. This is not the desired behavior because
I_DONTCACHE only means the inode shouldn't be cached on the LRU list.
As for whether we need to evict this inode, this is what
generic_drop_inode() should do. This patch corrects the usage of
I_DONTCACHE.
This patch was proposed in [1].
[1]: https://lore.kernel.org/linux-fsdevel/20200831003407.GE12096@dread.disaster.area/
Fixes: dae2f8ed79 ("fs: Lift XFS_IDONTCACHE to the VFS layer")
Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ca9364dde50daba93eff711b4b945fd08beafcc2 ]
Since commit b4868b44c5 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.
Fix by modifying nfs4_init_cp_state to return the stateid with seqid 1
instead of 0. This is also to conform with section 4.8 of RFC 7862.
Here is the relevant paragraph from section 4.8 of RFC 7862:
A copy offload stateid's seqid MUST NOT be zero. In the context of a
copy offload operation, it is inappropriate to indicate "the most
recent copy offload operation" using a stateid with a seqid of zero
(see Section 8.2.2 of [RFC5661]). It is inappropriate because the
stateid refers to internal state in the server and there may be
several asynchronous COPY operations being performed in parallel on
the same file by the server. Therefore, a copy offload stateid with
a seqid of zero MUST be considered invalid.
Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4420440c57892779f265108f46f83832a88ca795 ]
The warning message from nfsd terminating normally
can confuse system adminstrators or monitoring software.
Though it's not exactly fair to pin-point a commit where it
originated, the current form in the current place started
to appear in:
Fixes: e096bbc648 ("knfsd: remove special handling for SIGHUP")
Signed-off-by: kazuo ito <kzpn200@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 89ff6005039a878afac87889fee748fa3f957c3a ]
In case of retrying fill_super with skip_recovery,
s_encoding for casefold would not be loaded again even though it's
already been freed because it's not NULL.
Set NULL after free to prevent double freeing when unmount.
Fixes: eca4873ee1 ("f2fs: Use generic casefolding support")
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bf701b765eaa82dd164d65edc5747ec7288bb5c3 ]
nfsiod is currently a concurrency-managed workqueue (CMWQ).
This means that workitems scheduled to nfsiod on a given CPU are queued
behind all other work items queued on any CMWQ on the same CPU. This
can introduce unexpected latency.
Occaionally nfsiod can even cause excessive latency. If the work item
to complete a CLOSE request calls the final iput() on an inode, the
address_space of that inode will be dismantled. This takes time
proportional to the number of in-memory pages, which on a large host
working on large files (e.g.. 5TB), can be a large number of pages
resulting in a noticable number of seconds.
We can avoid these latency problems by switching nfsiod to WQ_UNBOUND.
This causes each concurrent work item to gets a dedicated thread which
can be scheduled to an idle CPU.
There is precedent for this as several other filesystems use WQ_UNBOUND
workqueue for handling various async events.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ada609ee2a ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9b82d88d5976e5f2b8015d58913654856576ace5 ]
NLM uses an interval-based rebinding, i.e. it clears the transport's
binding under certain conditions if more than 60 seconds have elapsed
since the connection was last bound.
This rebinding is not necessary for an autobind RPC client over a
connection-oriented protocol like TCP.
It can also cause problems: it is possible for nlm_bind_host() to clear
XPRT_BOUND whilst a connection worker is in the middle of trying to
reconnect, after it had already been checked in xprt_connect().
When the connection worker notices that XPRT_BOUND has been cleared
under it, in xs_tcp_finish_connecting(), that results in:
xs_tcp_setup_socket: connect returned unhandled error -107
Worse, it's possible that the two can get into lockstep, resulting in
the same behaviour repeated indefinitely, with the above error every
300 seconds, without ever recovering, and the connection never being
established. This has been seen in practice, with a large number of NLM
client tasks, following a server restart.
The existing callers of nlm_bind_host & nlm_rebind_host should not need
to force the rebind, for TCP, so restrict the interval-based rebinding
to UDP only.
For TCP, we will still rebind when needed, e.g. on timeout, and connection
error (including closure), since connection-related errors on an existing
connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(),
already unconditionally clear XPRT_BOUND.
To avoid having to add the fix, and explanation, to both nlm_bind_host()
and nlm_rebind_host(), remove the duplicate code from the former, and
have it call the latter.
Drop the dprintk, which adds no value over a trace.
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Fixes: 35f5a422ce ("SUNRPC: new interface to force an RPC rebind")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 046e5ccb4198b990190e11fb52fd9cfd264402eb ]
We can fit the device_addr4 opaque data padding in the pages.
Fixes: cf500bac8f ("SUNRPC: Introduce rpc_prepare_reply_pages()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 05ad917561fca39a03338cb21fe9622f998b0f9c ]
Currently, the client will always ask for security_labels if the server
returns that it supports that feature regardless of any LSM modules
(such as Selinux) enforcing security policy. This adds performance
penalty to the READDIR operation.
Client adjusts superblock's support of the security_label based on
the server's support but also current client's configuration of the
LSM modules. Thus, prior to using the default bitmask in READDIR,
this patch checks the server's capabilities and then instructs
READDIR to remove FATTR4_WORD2_SECURITY_LABEL from the bitmask.
v5: fixing silly mistakes of the rushed v4
v4: simplifying logic
v3: changing label's initialization per Ondrej's comment
v2: dropping selinux hook and using the sb cap.
Suggested-by: Ondrej Mosnacek <omosnace@redhat.com>
Suggested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 2b0143b5c9 ("VFS: normal filesystems (and lustre): d_inode() annotations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 66ab33bf6d4341574f88b511e856a73f6f2a921e ]
This can be triggered for example by adding the "-omand" mount option,
which will be rejected and virtio_fs_fill_super() will return an error.
In such a case the allocations for fuse_conn and fuse_mount will leak due
to s_root not yet being set and so ->put_super() not being called.
Fixes: a62a8ef9d9 ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3acc4522d89e0a326db69e9d0afaad8cf763a54c ]
When running fault injection test, if we don't stop checkpoint, some stale
NAT entries were flushed which breaks consistency.
Fixes: 86f33603f8 ("f2fs: handle errors of f2fs_get_meta_page_nofail")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit e51d68e76d604c6d5d1eb13ae1d6da7f6c8c0dfc upstream.
When dquot_resume() was last updated, the argument that got passed
to vfs_cleanup_quota_inode was incorrectly set.
If type = -1 and dquot_load_quota_sb() returns a negative value,
then vfs_cleanup_quota_inode() gets called with -1 passed as an
argument, and this leads to an array-index-out-of-bounds bug.
Fix this issue by correctly passing the arguments.
Fixes: ae45f07d47 ("quota: Simplify dquot_resume()")
Link: https://lore.kernel.org/r/20201208194338.7064-1-anant.thazhemadam@gmail.com
Reported-by: syzbot+2643e825238d7aabb37f@syzkaller.appspotmail.com
Tested-by: syzbot+2643e825238d7aabb37f@syzkaller.appspotmail.com
CC: stable@vger.kernel.org
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>