commit fd57a98d6f0c98fa295813087f13afb26c224e73 upstream.
When we have smack enabled, during the creation of a directory smack may
attempt to add a "smack transmute" xattr on the inode, which results in
the following warning and trace:
WARNING: CPU: 3 PID: 2548 at fs/btrfs/transaction.c:537 start_transaction+0x489/0x4f0
Modules linked in: nft_objref nf_conntrack_netbios_ns (...)
CPU: 3 PID: 2548 Comm: mkdir Not tainted 5.9.0-rc2smack+ #81
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:start_transaction+0x489/0x4f0
Code: e9 be fc ff ff (...)
RSP: 0018:ffffc90001887d10 EFLAGS: 00010202
RAX: ffff88816f1e0000 RBX: 0000000000000201 RCX: 0000000000000003
RDX: 0000000000000201 RSI: 0000000000000002 RDI: ffff888177849000
RBP: ffff888177849000 R08: 0000000000000001 R09: 0000000000000004
R10: ffffffff825e8f7a R11: 0000000000000003 R12: ffffffffffffffe2
R13: 0000000000000000 R14: ffff88803d884270 R15: ffff8881680d8000
FS: 00007f67317b8440(0000) GS:ffff88817bcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f67247a22a8 CR3: 000000004bfbc002 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? slab_free_freelist_hook+0xea/0x1b0
? trace_hardirqs_on+0x1c/0xe0
btrfs_setxattr_trans+0x3c/0xf0
__vfs_setxattr+0x63/0x80
smack_d_instantiate+0x2d3/0x360
security_d_instantiate+0x29/0x40
d_instantiate_new+0x38/0x90
btrfs_mkdir+0x1cf/0x1e0
vfs_mkdir+0x14f/0x200
do_mkdirat+0x6d/0x110
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f673196ae6b
Code: 8b 05 11 (...)
RSP: 002b:00007ffc3c679b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 00000000000001ff RCX: 00007f673196ae6b
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc3c67a30d
RBP: 00007ffc3c67a30d R08: 00000000000001ff R09: 0000000000000000
R10: 000055d3e39fe930 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc3c679cd8 R14: 00007ffc3c67a30d R15: 00007ffc3c679ce0
irq event stamp: 11029
hardirqs last enabled at (11037): [<ffffffff81153fe6>] console_unlock+0x486/0x670
hardirqs last disabled at (11044): [<ffffffff81153c01>] console_unlock+0xa1/0x670
softirqs last enabled at (8864): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20
softirqs last disabled at (8851): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20
This happens because at btrfs_mkdir() we call d_instantiate_new() while
holding a transaction handle, which results in the following call chain:
btrfs_mkdir()
trans = btrfs_start_transaction(root, 5);
d_instantiate_new()
smack_d_instantiate()
__vfs_setxattr()
btrfs_setxattr_trans()
btrfs_start_transaction()
start_transaction()
WARN_ON()
--> a tansaction start has TRANS_EXTWRITERS
set in its type
h->orig_rsv = h->block_rsv
h->block_rsv = NULL
btrfs_end_transaction(trans)
Besides the warning triggered at start_transaction, we set the handle's
block_rsv to NULL which may cause some surprises later on.
So fix this by making btrfs_setxattr_trans() not start a transaction when
we already have a handle on one, stored in current->journal_info, and use
that handle. We are good to use the handle because at btrfs_mkdir() we did
reserve space for the xattr and the inode item.
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
CC: stable@vger.kernel.org # 5.4+
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Link: https://lore.kernel.org/linux-btrfs/434d856f-bd7b-4889-a6ec-e81aaebfa735@schaufler-ca.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4f6a49de64fd1b1dba5229c02047376da7cf24fd upstream.
If btrfs_qgroup_reserve_data returns an error (i.e quota limit reached)
the handling logic directly goes to the 'out' label without first
unlocking the extent range between lockstart, lockend. This results in
deadlocks as other processes try to lock the same extent.
Fixes: a7f8b1c2ac ("btrfs: file: reserve qgroup space after the hole punch range is locked")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0f9c03d824f6f522d3bc43629635c9765546ebc5 upstream.
Following commit f218ea6c47 ("btrfs: delayed-inode: Remove wrong
qgroup meta reservation calls") this function now reserves num_bytes,
rather than the fixed amount of nodesize. As such this requires the
same amount to be freed in case of failure. Fix this by adjusting
the amount we are freeing.
Fixes: f218ea6c47 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5011c5a663b9c6d6aff3d394f11049b371199627 upstream.
The problem is we're copying "inherit" from user space but we don't
necessarily know that we're copying enough data for a 64 byte
struct. Then the next problem is that 'inherit' has a variable size
array at the end, and we have to verify that array is the size we
expected.
Fixes: 6f72c7e20d ("Btrfs: add qgroup inheritance")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3c17916510428dbccdf657de050c34e208347089 upstream.
During allocation the allocator will try to allocate an extent using
cluster policy. Once the current cluster is exhausted it will remove the
entry under btrfs_free_cluster::lock and subsequently acquire
btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry
and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem
because there exists a race condition between removing the entry under
one lock and doing the necessary accounting holding a different lock
since extent freeing only uses the 2nd lock. This can result in the
following situation:
T1: T2:
btrfs_alloc_from_cluster insert_into_bitmap <holds tree_lock>
if (entry->bytes == 0) if (block_group && !list_empty(&block_group->cluster_list)) {
rb_erase(entry)
spin_unlock(&cluster->lock);
(total_bitmaps is still 4) spin_lock(&cluster->lock);
<doesn't find entry in cluster->root>
spin_lock(&ctl->tree_lock); <goes to new_bitmap label, adds
<blocked since T2 holds tree_lock> <a new entry and calls add_new_bitmap>
recalculate_thresholds <crashes,
due to total_bitmaps
becoming 5 and triggering
an ASSERT>
To fix this ensure that once depleted, the cluster entry is deleted when
both cluster lock and tree locks are held in the allocator (T1), this
ensures that even if there is a race with a concurrent
insert_into_bitmap call it will correctly find the entry in the cluster
and add the new space to it.
CC: <stable@vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3660d0bcdb82807d434da9d2e57d88b37331182d upstream.
When using the NO_HOLES feature, if we clone a file range that spans only
a hole into a range that is at or beyond the current i_size of the
destination file, we end up not setting the full sync runtime flag on the
inode. As a result, if we then fsync the destination file and have a power
failure, after log replay we can end up exposing stale data instead of
having a hole for that range.
The conditions for this to happen are the following:
1) We have a file with a size of, for example, 1280K;
2) There is a written (non-prealloc) extent for the file range from 1024K
to 1280K with a length of 256K;
3) This particular file extent layout is durably persisted, so that the
existing superblock persisted on disk points to a subvolume root where
the file has that exact file extent layout and state;
4) The file is truncated to a smaller size, to an offset lower than the
start offset of its last extent, for example to 800K. The truncate sets
the full sync runtime flag on the inode;
6) Fsync the file to log it and clear the full sync runtime flag;
7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
into the file with a destination offset that starts at or beyond the
256K file extent item we had - for example to offset 1024K;
8) Since the clone operation does not find extents in the source range,
we end up in the if branch at the bottom of btrfs_clone() where we
punch a hole for the file range starting at offset 1024K by calling
btrfs_replace_file_extents(). There we end up not setting the full
sync flag on the inode, because we don't know we are being called in
a clone context (and not fallocate's punch hole operation), and
neither do we create an extent map to represent a hole because the
requested range is beyond eof;
9) A further fsync to the file will be a fast fsync, since the clone
operation did not set the full sync flag, and therefore it relies on
modified extent maps to correctly log the file layout. But since
it does not find any extent map marking the range from 1024K (the
previous eof) to the new eof, it does not log a file extent item
for that range representing the hole;
10) After a power failure no hole for the range starting at 1024K is
punched and we end up exposing stale data from the old 256K extent.
Turning this into exact steps:
$ mkfs.btrfs -f -O no-holes /dev/sdi
$ mount /dev/sdi /mnt
# Create our test file with 3 extents of 256K and a 256K hole at offset
# 256K. The file has a size of 1280K.
$ xfs_io -f -s \
-c "pwrite -S 0xab -b 256K 0 256K" \
-c "pwrite -S 0xcd -b 256K 512K 256K" \
-c "pwrite -S 0xef -b 256K 768K 256K" \
-c "pwrite -S 0x73 -b 256K 1024K 256K" \
/mnt/sdi/foobar
# Make sure it's durably persisted. We want the last committed super
# block to point to this particular file extent layout.
sync
# Now truncate our file to a smaller size, falling within a position of
# the second extent. This sets the full sync runtime flag on the inode.
# Then fsync the file to log it and clear the full sync flag from the
# inode. The third extent is no longer part of the file and therefore
# it is not logged.
$ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
# Now do a clone operation that only clones the hole and sets back the
# file size to match the size it had before the truncate operation
# (1280K).
$ xfs_io \
-c "reflink /mnt/foobar 256K 1024K 256K" \
-c "fsync" \
/mnt/foobar
# File data before power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1310720
<power fail>
# Mount the fs again to replay the log tree.
$ mount /dev/sdi /mnt
# File data after power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
*
1310720
The range from 1024K to 1280K should correspond to a hole but instead it
points to stale data, to the 256K extent that should not exist after the
truncate operation.
The issue does not exists when not using NO_HOLES, because for that case
we use file extent items to represent holes, these are found and copied
during the loop that iterates over extents at btrfs_clone(), and that
causes btrfs_replace_file_extents() to be called with a non-NULL
extent_info argument and therefore set the full sync runtime flag on the
inode.
So fix this by making the code that deals with a trailing hole during
cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
range starts at or beyond the current i_size.
A test case for fstests will follow soon.
Backporting notes: for kernel 5.4 the change goes to ioctl.c into
btrfs_clone before the last call to btrfs_punch_hole_range.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dd0734f2a866f9d619d4abf97c3d71bcdee40ea9 upstream.
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.
However this is racy because when a task started activation of a swap
file, another task might have started already snapshot creation and might
have seen the counter for the number of swap files as zero. This means
that after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back into COW mode, which should never
happen for active swap files.
Basically what can happen is:
1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
There it sees that root->nr_swapfiles has a value of 0 so it continues;
2) Task B enters btrfs_swap_activate(). It is not aware that another task
started snapshot creation but it did not finish yet. It increments
root->nr_swapfiles from 0 to 1;
3) Task B checks that the file meets all requirements to be an active
swap file - it has NOCOW set, there are no snapshots for the inode's
root at the moment, no file holes, no reflinked extents, etc;
4) Task B returns success and now the file is an active swap file;
5) Task A commits the transaction to create the snapshot and finishes.
The swap file's extents are now shared between the original root and
the snapshot;
6) A write into an extent of the swap file is attempted - there is a
snapshot of the file's root, so we fall back to COW mode and therefore
the physical location of the extent changes on disk.
So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.
Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 195a49eaf655eb914896c92cecd96bc863c9feb3 upstream.
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d70cef0d46729808dc53f145372c02b145c92604 upstream.
When a qstripe is required an extra page is allocated and mapped. There
were 3 problems:
1) There is no corresponding call of kunmap() for the qstripe page.
2) There is no reason to map the qstripe page more than once if the
number of bits set in rbio->dbitmap is greater than one.
3) There is no reason to map the parity page and unmap it each time
through the loop.
The page memory can continue to be reused with a single mapping on each
iteration by raid6_call.gen_syndrome() without remapping. So map the
page for the duration of the loop.
Similarly, improve the algorithm by mapping the parity page just 1 time.
Fixes: 5a6ac9eacb ("Btrfs, raid56: support parity scrub on raid56")
CC: stable@vger.kernel.org # 4.4.x: c17af96554: btrfs: raid56: simplify tracking of Q stripe presence
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 95c85fba1f64c3249c67f0078a29f8a125078189 upstream.
It's wrong calling btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different than the block group the cluster represents. As this means the
cluster doesn't have a reference to the passed block group. This results
in double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes: fa9c0d795f ("Btrfs: rework allocation clustering")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4f4317c13a40194940acf4a71670179c4faca2b5 ]
While doing error injection I would sometimes get a corrupt file system.
This is because I was injecting errors at btrfs_search_slot, but would
only do it one time per stack. This uncovered a problem in
commit_fs_roots, where if we get an error we would just break. However
we're in a nested loop, the first loop being a loop to find all the
dirty fs roots, and then subsequent root updates would succeed clearing
the error value.
This isn't likely to happen in real scenarios, however we could
potentially get a random ENOMEM once and then not again, and we'd end up
with a corrupted file system. Fix this by moving the error checking
around a bit to the main loop, as this is the only place where something
will fail, and return the error as soon as it occurs.
With this patch my reproducer no longer corrupts the file system.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e36cffed20a324e116f329a94061ae30dd26fb51 ]
Most callers check for non-zero return, and assume it's -ECHILD (which
it always will be). One caller uses the actual error return. Clean this
up and make it fully consistent, by having unlazy_walk() return a bool
instead. Rename it to try_to_unlazy() and return true on success, and
failure on error. That's easier to read.
No functional changes in this patch.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 88a9e03beef22cc5fabea344f54b9a0dfe63de08 upstream.
An assert failure is triggered by syzkaller test due to
ATTR_KILL_PRIV is not cleared before xfs_setattr_size.
As ATTR_KILL_PRIV is not checked/used by xfs_setattr_size,
just remove it from the assert.
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3bef198f1b17d1bb89260bad947ef084c0a2d1a6 upstream.
syzbot is feeding invalid superblock data to JFS for mount testing.
JFS does not check several of the fields -- just assumes that they
are good since the JFS_MAGIC and version fields are good.
In this case (syzbot reproducer), we have s_l2bsize == 0xda0c,
pad == 0xf045, and s_state == 0x50, all of which are invalid IMO.
Having s_l2bsize == 0xda0c causes this UBSAN warning:
UBSAN: shift-out-of-bounds in fs/jfs/jfs_mount.c:373:25
shift exponent -9716 is negative
s_l2bsize can be tested for correctness. pad can be tested for non-0
and punted. s_state can be tested for its valid values and punted.
Do those 3 tests and if any of them fails, report the superblock as
invalid/corrupt and let fsck handle it.
With this patch, chkSuper() says this when JFS_DEBUG is enabled:
jfs_mount: Mount Failure: superblock is corrupt!
Mount JFS Failure: -22
jfs_mount failed w/return code = -22
The obvious problem with this method is that next week there could
be another syzbot test that uses different fields for invalid values,
this making this like a game of whack-a-mole.
syzkaller link: https://syzkaller.appspot.com/bug?extid=36315852ece4132ec193
Reported-by: syzbot+36315852ece4132ec193@syzkaller.appspotmail.com
Reported-by: kernel test robot <lkp@intel.com> # v2
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: jfs-discussion@lists.sourceforge.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7009fa9cd9a5262944b30eb7efb1f0561d074b68 upstream.
When starting an iomap write, gfs2_quota_lock_check -> gfs2_quota_lock
-> gfs2_quota_hold is called from gfs2_iomap_begin. At the end of the
write, before unlocking the quotas, punch_hole -> gfs2_quota_hold can be
called again in gfs2_iomap_end, which is incorrect and leads to a failed
assertion. Instead, move the call to gfs2_quota_unlock before the call
to punch_hole to fix that.
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 834ec3e1ee65029029225a86c12337a6cd385af7 upstream.
In gfs2_recover_one, fix a sd_log_flush_lock imbalance when a recovery
pass fails.
Fixes: c9ebc4b737 ("gfs2: allow journal replay to hold sd_log_flush_lock")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 78178ca844f0eb88f21f31c7fde969384be4c901 upstream.
Patch fb6791d100 was designed to allow gfs2 to unmount quicker by
skipping the step where it tells dlm to unlock glocks in EX with lvbs.
This was done because when gfs2 unmounts a file system, it destroys the
dlm lockspace shortly after it destroys the glocks so it doesn't need to
unlock them all: the unlock is implied when the lockspace is destroyed
by dlm.
However, that patch introduced a use-after-free in dlm: as part of its
normal dlm_recoverd process, it can call ls_recovery to recover dead
locks. In so doing, it can call recover_rsbs which calls recover_lvb for
any mastered rsbs. Func recover_lvb runs through the list of lkbs queued
to the given rsb (if the glock is cached but unlocked, it will still be
queued to the lkb, but in NL--Unlocked--mode) and if it has an lvb,
copies it to the rsb, thus trying to preserve the lkb. However, when
gfs2 skips the dlm unlock step, it frees the glock and its lvb, which
means dlm's function recover_lvb references the now freed lvb pointer,
copying the freed lvb memory to the rsb.
This patch changes the check in gdlm_put_lock so that it calls
dlm_unlock for all glocks that contain an lvb pointer.
Fixes: fb6791d100 ("GFS2: skip dlm_unlock calls in unmount")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f5f02fde9f52b2d769c1c2ddfd3d9c4a1fe739a7 upstream.
If go_free is defined, function signal_our_withdraw is supposed to
synchronize on the GLF_FREEING flag of the inode glock, but it
accidentally does that on the live glock. Fix that and disambiguate
the glock variables.
Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b0ff4fe746fd028eef920ddc8c7b0361c1ede6ec upstream.
During checkpoint=disable period, f2fs bypasses all the synchronous IOs such as
sync and fsync. So, when enabling it back, we must flush all of them in order
to keep the data persistent. Otherwise, suddern power-cut right after enabling
checkpoint will cause data loss.
Fixes: 4354994f09 ("f2fs: checkpoint disabling")
Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e0fcd01510ad025c9bbce704c5c2579294056141 upstream.
This patch ports commit 02b016ca7f ("ext4: enforce the immutable
flag on open files") to f2fs.
According to the chattr man page, "a file with the 'i' attribute
cannot be modified..." Historically, this was only enforced when the
file was opened, per the rest of the description, "... and the file
can not be opened in write mode".
There is general agreement that we should standardize all file systems
to prevent modifications even for files that were opened at the time
the immutable flag is set. Eventually, a change to enforce this at
the VFS layer should be landing in mainline.
Cc: stable@kernel.org
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0d4370cfe36b7f1719123b621a4ec4d9c7a25f89 upstream.
If this is attempted by an io-wq kthread, then return -EOPNOTSUPP as we
don't currently support that. Once we can get task_pid_ptr() doing the
right thing, then this can go away again.
Use PF_IO_WORKER for this to speciically target the io_uring workers.
Modify the /proc/self/ check to use PF_IO_WORKER as well.
Cc: stable@vger.kernel.org
Fixes: 8d4c3e76e3 ("proc: don't allow async path resolution of /proc/self components")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bfe3911a91047557eb0e620f95a370aee6a248c7 upstream.
Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fd (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.
Rasmus Villemoes also pointed out that systemd uses SYS_kcmp to
deduplicate the per-service file descriptor store.
Note that some distributions such as Ubuntu are already enabling
CHECKPOINT_RESTORE in their configs and so, by extension, SYS_kcmp.
References: https://gitlab.freedesktop.org/drm/intel/-/issues/3046
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: stable@vger.kernel.org
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> # DRM depends on kcmp
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> # systemd uses kcmp
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210205220012.1983-1-chris@chris-wilson.co.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 059c01039c0185dbee7ed080f1f2bd22cb1e4dab upstream.
Per ZBC/ZAC/ZNS specifications, write pointers may not have valid values
when zones are in full condition. However, when zonefs mounts a zoned
block device, zonefs refers write pointers to set file size even when
the zones are in full condition. This results in wrong file size. To fix
this, refer maximum file size in place of write pointers for zones in
full condition.
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Cc: <stable@vger.kernel.org> # 5.6+
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 78c276f5495aa53a8beebb627e5bf6a54f0af34f upstream.
syzbot reported a warning which could cause shift-out-of-bounds issue.
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x183/0x22e lib/dump_stack.c:120
ubsan_epilogue lib/ubsan.c:148 [inline]
__ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
exfat_read_boot_sector fs/exfat/super.c:471 [inline]
__exfat_fill_super fs/exfat/super.c:556 [inline]
exfat_fill_super+0x2acb/0x2d00 fs/exfat/super.c:624
get_tree_bdev+0x406/0x630 fs/super.c:1291
vfs_get_tree+0x86/0x270 fs/super.c:1496
do_new_mount fs/namespace.c:2881 [inline]
path_mount+0x1937/0x2c50 fs/namespace.c:3211
do_mount fs/namespace.c:3224 [inline]
__do_sys_mount fs/namespace.c:3432 [inline]
__se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3409
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
exfat specification describe sect_per_clus_bits field of boot sector
could be at most 25 - sect_size_bits and at least 0. And sect_size_bits
can also affect this calculation, It also needs validation.
This patch add validation for sect_per_clus_bits and sect_size_bits
field of boot sector.
Fixes: 719c1e1829 ("exfat: add super block operations")
Cc: stable@vger.kernel.org # v5.9+
Reported-by: syzbot+da4fe66aaadd3c2e2d1c@syzkaller.appspotmail.com
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 70779b897395b330ba5a47bed84f94178da599f9 upstream.
The reference count of the old buffer head should be decremented on path
that fails to get the new buffer head.
Fixes: 6b4657667b ("fs/affs: add rename exchange")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 19d8e9149c27b689c6224f5c84b96a159342195a upstream.
Both pstore_compress() and decompress_record() use a mistyped config
option name ("PSTORE_COMPRESSION" instead of "PSTORE_COMPRESS"). As
a result compression and decompression of pstore records was always
disabled.
Use the correct config option name.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: fd49e03280 ("pstore: Fix linking when crypto API disabled")
Acked-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210218111547.johvp5klpv3xrpnn@dwarf.suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 72c9925f87c8b74f36f8e75a4cd93d964538d3ca upstream.
At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().
So fix that by unlocking the extent buffer and dropping our reference on
it before returning.
Fixes: be20aa9dba ("Btrfs: Add mount option to turn off data cow")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 81e75ac74ecba929d1e922bf93f9fc467232e39f upstream.
My recent patch set "A variety of lock contention fixes", found here
https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/
(Tracked in https://github.com/btrfs/linux/issues/86)
that reduce lock contention on the extent root by running delayed refs
less often resulted in a regression in generic/371. This test
fallocate()'s the fs until it's full, deletes all the files, and then
tries to fallocate() until full again.
Before these patches we would run all of the delayed refs during
flushing, and then would commit the transaction because we had plenty of
pinned space to recover in order to allocate. However my patches made
it so we weren't running the delayed refs as aggressively, which meant
that we appeared to have less pinned space when we were deciding to
commit the transaction.
We use the space_info->total_bytes_pinned to approximate how much space
we have pinned. It's approximate because if we remove a reference to an
extent we may free it, but there may be more references to it than we
know of at that point, but we account it as pinned at the creation time,
and then it's properly accounted when the delayed ref runs.
The way we account for pinned space is if the
delayed_ref_head->total_ref_mod is < 0, because that is clearly a
freeing option. However there is another case, and that is where
->total_ref_mod == 0 && ->must_insert_reserved == 1.
When we allocate a new extent, we have ->total_ref_mod == 1 and we have
->must_insert_reserved == 1. This is used to indicate that it is a
brand new extent and will need to have its extent entry added before we
modify any references on the delayed ref head. But if we subsequently
remove that extent reference, our ->total_ref_mod will be 0, and that
space will be pinned and freed. Accounting for this case properly
allows for generic/371 to pass with my delayed refs patches applied.
It's important to note that this problem exists without the referenced
patches, it just was uncovered by them.
CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2187374f35fe9cadbddaa9fcf0c4121365d914e8 upstream.
Currently we pass things around to figure out if we maybe freeing data
based on the state of the delayed refs head. This makes the accounting
sort of confusing and hard to follow, as it's distinctly separate from
the delayed ref heads stuff, but also depends on it entirely.
Fix this by explicitly adjusting the space_info->total_bytes_pinned in
the delayed refs code. We now have two places where we modify this
counter, once where we create the delayed and destroy the delayed refs,
and once when we pin and unpin the extents. This means there is a
slight overlap between delayed refs and the pin/unpin mechanisms, but
this is simply used by the ENOSPC infrastructure to determine if we need
to commit the transaction, so there's no adverse affect from this, we
might simply commit thinking it will give us enough space when it might
not.
CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 938fcbfb0cbcf532a1869efab58e6009446b1ced upstream.
While doing error injection testing with my relocation patches I hit the
following assert:
assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3357!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 PID: 24351 Comm: umount Tainted: G W 5.10.0-rc3+ #193
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:assertfail.constprop.0+0x18/0x1a
RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282
RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000
RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050
RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000
R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100
R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100
FS: 00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0
Call Trace:
btrfs_free_block_groups.cold+0x55/0x55
close_ctree+0x2c5/0x306
? fsnotify_destroy_marks+0x14/0x100
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x36/0xa0
cleanup_mnt+0x12d/0x190
task_work_run+0x5c/0xa0
exit_to_user_mode_prepare+0x1b1/0x1d0
syscall_exit_to_user_mode+0x54/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happened because I injected an error in btrfs_cow_block() while
running the dirty block groups. When we run the dirty block groups, we
splice the list onto a local list to process. However if an error
occurs, we only cleanup the transactions dirty block group list, not any
pending block groups we have on our locally spliced list.
In fact if we fail to allocate a path in this function we'll also fail
to clean up the splice list.
Fix this by splicing the list back onto the transaction dirty block
group list so that the block groups are cleaned up. Then add a 'out'
label and have the error conditions jump to out so that the errors are
handled properly. This also has the side-effect of fixing a problem
where we would clear 'ret' on error because we unconditionally ran
btrfs_run_delayed_refs().
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c78a10aebb275c38d0cfccae129a803fe622e305 upstream.
When recovering a relocation, if we run into a reloc root that has 0
refs we simply add it to the reloc_control->reloc_roots list, and then
clean it up later. The problem with this is __del_reloc_root() doesn't
do anything if the root isn't in the radix tree, which in this case it
won't be because we never call __add_reloc_root() on the reloc_root.
This exit condition simply isn't correct really. During normal
operation we can remove ourselves from the rb tree and then we're meant
to clean up later at merge_reloc_roots() time, and this happens
correctly. During recovery we're depending on free_reloc_roots() to
drop our references, but we're short-circuiting.
Fix this by continuing to check if we're on the list and dropping
ourselves from the reloc_control root list and dropping our reference
appropriately. Change the corresponding BUG_ON() to an ASSERT() that
does the correct thing if we aren't in the rb tree.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 867ed321f90d06aaba84e2c91de51cd3038825ef upstream.
While testing my error handling patches, I added a error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing. However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace:
btrfs_inc_extent_ref
btrfs_copy_root
create_reloc_root
btrfs_init_reloc_root
btrfs_record_root_in_trans
btrfs_start_transaction
btrfs_update_inode
btrfs_update_time
touch_atime
file_accessed
btrfs_file_mmap
This is because we do not catch the error from btrfs_inc_extent_ref,
which in practice would be ENOMEM, which means we lose the extent
references for a root that has already been allocated and inserted,
which is the problem. Fix this by aborting the transaction if we fail
to do the reference modification.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit eddda68d97732ce05ca145f8e85e8a447f65cdad upstream.
A weird KASAN problem that Zygo reported could have been easily caught
if we checked for basic things in our backref freeing code. We have two
methods of freeing a backref node
- btrfs_backref_free_node: this just is kfree() essentially.
- btrfs_backref_drop_node: this actually unlinks the node and cleans up
everything and then calls btrfs_backref_free_node().
We should mostly be using btrfs_backref_drop_node(), to make sure the
node is properly unlinked from the backref cache, and only use
btrfs_backref_free_node() when we know the node isn't actually linked to
the backref cache. We made a mistake here and thus got the KASAN splat.
Make this style of issue easier to find by adding some ASSERT()'s to
btrfs_backref_free_node() and adjusting our deletion stuff to properly
init the list so we can rely on list_empty() checks working properly.
BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
Read of size 8 at addr ffff888112402950 by task btrfs/28836
CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0xbc/0xf9
? btrfs_backref_cleanup_node+0x18a/0x420
print_address_description.constprop.8+0x21/0x210
? record_print_text.cold.34+0x11/0x11
? btrfs_backref_cleanup_node+0x18a/0x420
? btrfs_backref_cleanup_node+0x18a/0x420
kasan_report.cold.10+0x20/0x37
? btrfs_backref_cleanup_node+0x18a/0x420
__asan_load8+0x69/0x90
btrfs_backref_cleanup_node+0x18a/0x420
btrfs_backref_release_cache+0x83/0x1b0
relocate_block_group+0x394/0x780
? merge_reloc_roots+0x4a0/0x4a0
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
? check_flags.part.50+0x6c/0x1e0
? btrfs_relocate_chunk+0x120/0x120
? kmem_cache_alloc_trace+0xa06/0xcb0
? _copy_from_user+0x83/0xc0
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4c4bdfe427
RSP: 002b:00007fff33ee6df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fff33ee6e98 RCX: 00007f4c4bdfe427
RDX: 00007fff33ee6e98 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000001
R13: 0000000000000000 R14: 00007fff33ee8a34 R15: 0000000000000001
Allocated by task 28836:
kasan_save_stack+0x21/0x50
__kasan_kmalloc.constprop.18+0xbe/0xd0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x410/0xcb0
btrfs_backref_alloc_node+0x46/0xf0
btrfs_backref_add_tree_node+0x60d/0x11d0
build_backref_tree+0xc5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 28836:
kasan_save_stack+0x21/0x50
kasan_set_track+0x20/0x30
kasan_set_free_info+0x1f/0x30
__kasan_slab_free+0xf3/0x140
kasan_slab_free+0xe/0x10
kfree+0xde/0x200
btrfs_backref_error_cleanup+0x452/0x530
build_backref_tree+0x1a5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff888112402900
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 80 bytes inside of
128-byte region [ffff888112402900, ffff888112402980)
The buggy address belongs to the page:
page:0000000028b1cd08 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888131c810c0 pfn:0x112402
flags: 0x17ffe0000000200(slab)
raw: 017ffe0000000200 ffffea000424f308 ffffea0007d572c8 ffff888100040440
raw: ffff888131c810c0 ffff888112402000 0000000100000009 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888112402800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888112402880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888112402900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888112402980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888112402a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Link: https://lore.kernel.org/linux-btrfs/20201208194607.GI31381@hungrycats.org/
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f78743fbdae1bb31bc9c9233c3590a5048782381 upstream.
The backref code is looking for a reloc_root that corresponds to the
given fs root. However any number of things could have gone wrong while
initializing that reloc_root, like ENOMEM while trying to allocate the
root itself, or EIO while trying to write the root item. This would
result in no corresponding reloc_root being in the reloc root cache, and
thus would return NULL when we do the find_reloc_root() call.
Because of this we do not want to WARN_ON(). This presumably was meant
to catch developer errors, cases where we messed up adding the reloc
root. However we can easily hit this case with error injection, and
thus should not do a WARN_ON().
CC: stable@vger.kernel.org # 5.10+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7e2a870a599d4699a626ec26430c7a1ab14a2a49 upstream.
Zygo reported the following panic when testing my error handling patches
for relocation:
kernel BUG at fs/btrfs/backref.c:2545!
invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G W 14
Hardware name: QEMU Standard PC (i440FX + PIIX,
Call Trace:
btrfs_backref_error_cleanup+0x4df/0x530
build_backref_tree+0x1a5/0x700
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
? free_extent_buffer.part.52+0xd7/0x140
relocate_tree_blocks+0x2a6/0xb60
? kasan_unpoison_shadow+0x35/0x50
? do_relocation+0xc10/0xc10
? kasan_kmalloc+0x9/0x10
? kmem_cache_alloc_trace+0x6a3/0xcb0
? free_extent_buffer.part.52+0xd7/0x140
? rb_insert_color+0x342/0x360
? add_tree_block.isra.36+0x236/0x2b0
relocate_block_group+0x2eb/0x780
? merge_reloc_roots+0x470/0x470
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x18f0
? pvclock_clocksource_read+0xeb/0x190
? btrfs_relocate_chunk+0x120/0x120
? lock_contended+0x620/0x6e0
? do_raw_spin_lock+0x1e0/0x1e0
? do_raw_spin_unlock+0xa8/0x140
btrfs_ioctl_balance+0x1f9/0x460
btrfs_ioctl+0x24c8/0x4380
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurs because of this check
if (RB_EMPTY_NODE(&upper->rb_node))
BUG_ON(!list_empty(&node->upper));
As we are dropping the backref node, if we discover that our upper node
in the edge we just cleaned up isn't linked into the cache that we are
now done with this node, thus the BUG_ON().
However this is an erroneous assumption, as we will look up all the
references for a node first, and then process the pending edges. All of
the 'upper' nodes in our pending edges won't be in the cache's rb_tree
yet, because they haven't been processed. We could very well have many
edges still left to cleanup on this node.
The fact is we simply do not need this check, we can just process all of
the edges only for this node, because below this check we do the
following
if (list_empty(&upper->lower)) {
list_add_tail(&upper->lower, &cache->leaves);
upper->lowest = 1;
}
If the upper node truly isn't used yet, then we add it to the
cache->leaves list to be cleaned up later. If it is still used then the
last child node that has it linked into its node will add it to the
leaves list and then it will be cleaned up.
Fix this problem by dropping this logic altogether. With this fix I no
longer see the panic when testing with error injection in the backref
code.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ce063129181312f8781a047a50be439c5859747b upstream.
Currently, although set_bit() & test_bit() pairs are used as a fast-
path for initialized configurations. However, these atomic ops are
actually relaxed forms. Instead, load-acquire & store-release form is
needed to make sure uninitialized fields won't be observed in advance
here (yet no such corresponding bitops so use full barriers instead.)
Link: https://lore.kernel.org/r/20210209130618.15838-1-hsiangkao@aol.com
Fixes: 62dc45979f ("staging: erofs: fix race of initializing xattrs of a inode at the same time")
Fixes: 152a333a58 ("staging: erofs: add compacted compression indexes support")
Cc: <stable@vger.kernel.org> # 5.3+
Reported-by: Huang Jianan <huangjianan@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4508943794efdd94171549c0bd52810e2f4ad9fe ]
Since
sysctl: pass kernel pointers to ->proc_handler
we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace. The problem is this
just blindly kzalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib. Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order, so
use kvzalloc for these buffers.
[willy@infradead.org: changelog tweaks]
Link: https://lkml.kernel.org/r/6345270a2c1160b89dd5e6715461f388176899d1.1612972413.git.josef@toxicpanda.com
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
CC: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 45901a231723a5a513ff08477983f3a274a6a910 ]
We don't want to ask for the ACL in a WRITE reply, since we don't have
a preallocated buffer.
Instead of checking NFS_INO_INVALID_ACCESS, which is really about
managing the access cache, we should look at the value of
NFS_INO_INVALID_OTHER. Also ensure we assign the mode, owner and
owner_group flags to the correct bit mask.
Finally, fix up the check for NFS_INO_INVALID_CTIME to retrieve the
ctime, and add a check for NFS_INO_INVALID_CHANGE.
Fixes: 76bd5c016e ("NFSv4: make cache consistency bitmask dynamic")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c57d117f2b2f2a19b570c36f2819ef8d8210af20 ]
The error handling in this function frees "reg" but it is still on the
"o2hb_all_regions" list so it will lead to a use after freew. Joseph Qi
points out that we need to clear the bit in the "o2hb_region_bitmap" as
well
Link: https://lkml.kernel.org/r/YBk4M6HUG8jB/jc7@mwanda
Fixes: 1cf257f511 ("ocfs2: fix memory leak")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 302fdadeafe4be539f247abf25f61822e4a5a577 ]
EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
user may not want to enable. Fix this by making the test depend on
EXT4_FS instead.
Fixes: 1cbeab1b24 ("ext4: add kunit test for decoding extended timestamps")
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20210122110234.2825685-1-geert@linux-m68k.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b5776e7524afbd4569978ff790864755c438bba7 ]
In the case where we need to do an interior node split, and
immediately afterwards, we are unable to allocate a new directory leaf
block due to ENOSPC, the directory index checksum's will not be filled
in correctly (and indeed, will not be correctly journalled).
This looks like a bug that was introduced when we added largedir
support. The original code doesn't make any sense (and should have
been caught in code review), but it was hidden because most of the
time, the index node checksum will be set by do_split(). But if
do_split bails out due to ENOSPC, then ext4_handle_dirty_dx_node()
won't get called, and so the directory index checksum field will not
get set, leading to:
EXT4-fs error (device sdb): dx_probe:858: inode #6635543: block 4022: comm nfsd: Directory index failed checksum
Google-Bug-Id: 176345532
Fixes: e08ac99fa2 ("ext4: add largedir feature")
Cc: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 64f36da5625f7f9853b86750eaa89d499d16a2e9 ]
A primary reason for skipping ceph_check_caps after putting the
references was to avoid the locking in ceph_check_caps during a
reconnect. __ceph_put_cap_refs can still call ceph_flush_snaps in that
case though, and that takes many of the same inconvenient locks.
Fix the logic in __ceph_put_cap_refs to skip flushing snaps when the
skip_checking_caps flag is set.
Fixes: e64f44a884 ("ceph: skip checking caps when session reconnecting and releasing reqs")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bd5ae9288d6451bd346a1b4a59d4fe7e62ba29b7 ]
These pernet operations may depend on stuff set up or torn down in the
module init/exit functions. And they may be called at any time in
between. So it makes more sense for them to be the last to be
registered in the init function, and the first to be unregistered in the
exit function.
In particular, without this, the drc slab is being destroyed before all
the per-net drcs are shut down, resulting in an "Objects remaining in
nfsd_drc on __kmem_cache_shutdown()" warning in exit_nfsd.
Reported-by: Zhi Li <yieli@redhat.com>
Fixes: 3ba75830ce "nfsd4: drc containerization"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0a6dc67a6aa45f19bd4ff89b4f468fc50c4b8daa ]
Release the buffer_head before returning error code in
do_isofs_readdir() and isofs_find_entry().
Fixes: 2deb1acc65 ("isofs: fix access to unallocated memory when reading corrupted filesystem")
Link: https://lore.kernel.org/r/20210118120455.118955-1-bianpan2016@163.com
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a4db1072e1a3bd7a8d9c356e1902b13ac5deb8ef ]
When checking corrupted quota file we can bail out and leak allocated
info structure. Properly free info structure on error return.
Reported-by: syzbot+77779c9b52ab78154b08@syzkaller.appspotmail.com
Fixes: 11c514a99bb9 ("quota: Sanity-check quota file headers on load")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>