When doing an incremental send and a file has extents shared with itself
at different file offsets, it's possible for send to emit clone operations
that will fail at the destination because the source range goes beyond the
file's current size. This happens when the file size has increased in the
send snapshot, there is a hole between the shared extents and both shared
extents are at file offsets which are greater the file's size in the
parent snapshot.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/1.snap /mnt/sdb/base
# Create a 320K extent at file offset 512K.
$ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
# Clone part of that 320K extent into a lower file offset (192K).
# This file offset is greater than the file's size in the parent
# snapshot (64K). Also the clone range is a bit behind the offset of
# the 320K extent so that we leave a hole between the shared extents.
$ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/1.snap /mnt/sdc
$ btrfs receive -f /tmp/2.snap /mnt/sdc
ERROR: failed to clone extents to foobar: Invalid argument
The problem is that after processing the extent at file offset 256K, which
refers to the first 128K of the 320K extent created by the buffered write
operations, we have 'cur_inode_next_write_offset' set to 384K, which
corresponds to the end offset of the partially shared extent (256K + 128K)
and to the current file size in the receiver. Then when we process the
extent at offset 512K, we do extent backreference iteration to figure out
if we can clone the extent from some other inode or from the same inode,
and we consider the extent at offset 256K of the same inode as a valid
source for a clone operation, which is not correct because at that point
the current file size in the receiver is 384K, which corresponds to the
end of last processed extent (at file offset 256K), so using a clone
source range from 256K to 256K + 320K is invalid because that goes past
the current size of the file (384K) - this makes the receiver get an
-EINVAL error when attempting the clone operation.
So fix this by excluding clone sources that have a range that goes beyond
the current file size in the receiver when iterating extent backreferences.
A test case for fstests follows soon.
Fixes: 11f2069c11 ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We ran into a deadlock in production with the fixup worker. The stack
traces were as follows:
Thread responsible for the writeout, waiting on the page lock
[<0>] io_schedule+0x12/0x40
[<0>] __lock_page+0x109/0x1e0
[<0>] extent_write_cache_pages+0x206/0x360
[<0>] extent_writepages+0x40/0x60
[<0>] do_writepages+0x31/0xb0
[<0>] __writeback_single_inode+0x3d/0x350
[<0>] writeback_sb_inodes+0x19d/0x3c0
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x231/0x2c0
[<0>] wb_workfn+0x308/0x3c0
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
Thread of the fixup worker who is holding the page lock
[<0>] start_delalloc_inodes+0x241/0x2d0
[<0>] btrfs_start_delalloc_roots+0x179/0x230
[<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
[<0>] btrfs_check_data_free_space+0x53/0xa0
[<0>] btrfs_delalloc_reserve_space+0x20/0x70
[<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
[<0>] normal_work_helper+0x11c/0x360
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
Thankfully the stars have to align just right to hit this. First you
have to end up in the fixup worker, which is tricky by itself (my
reproducer does DIO reads into a MMAP'ed region, so not a common
operation). Then you have to have less than a page size of free data
space and 0 unallocated space so you go down the "commit the transaction
to free up pinned space" path. This was accomplished by a random
balance that was running on the host. Then you get this deadlock.
I'm still in the process of trying to force the deadlock to happen on
demand, but I've hit other issues. I can still trigger the fixup worker
path itself so this patch has been tested in that regard, so the normal
case is fine.
Fixes: 87826df0ec ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
do not need this -EBUSY case.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For COW, btrfs expects pages dirty pages to have been through a few setup
steps. This includes reserving space for the new block allocations and marking
the range in the state tree for delayed allocation.
A few places outside btrfs will dirty pages directly, especially when unmapping
mmap'd pages. In order for these to properly go through COW, we run them
through a fixup worker to wait for stable pages, and do the delalloc prep.
87826df0ec added a window where the dirty pages were cleaned, but pending
more action from the fixup worker. We clear_page_dirty_for_io() before
we call into writepage, so the page is no longer dirty. The commit
changed it so now we leave the page clean between unlocking it here and
the fixup worker starting at some point in the future.
During this window, page migration can jump in and relocate the page. Once our
fixup work actually starts, it finds page->mapping is NULL and we end up
freeing the page without ever writing it.
This leads to crc errors and other exciting problems, since it screws up the
whole statemachine for waiting for ordered extents. The fix here is to keep
the page dirty while we're waiting for the fixup worker to get to work.
This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
if we queued the page up for fixup, which will cause the writepage
function to redirty the page.
Because we now expect the page to be dirty once it gets to the fixup
worker we must adjust the error cases to call clear_page_dirty_for_io()
on the page. That is the bulk of the patch, but it is not the fix, the
fix is the -EAGAIN from btrfs_writepage_cow_fixup. We cannot separate
these two changes out because the error conditions change with the new
expectations.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
inc_block_group_ro does a calculation to see if we have enough room left
over if we mark this block group as read only in order to see if it's ok
to mark the block group as read only.
The problem is this calculation _only_ works for data, where our used is
always less than our total. For metadata we will overcommit, so this
will almost always fail for metadata.
Fix this by exporting btrfs_can_overcommit, and then see if we have
enough space to remove the remaining free space in the block group we
are trying to mark read only. If we do then we can mark this block
group as read only.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For some reason we've translated the do_chunk_alloc that goes into
btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are
two different things.
force for inc_block_group_ro is used when we are forcing the block group
read only no matter what, for example when the underlying chunk is
marked read only. We need to not do the space check here as this block
group needs to be read only.
btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
we need to pre-allocate a chunk before marking the block group read
only. This has nothing to do with forcing, and in fact we _always_ want
to do the space check in this case, so unconditionally pass false for
force in this case.
Then fixup inc_block_group_ro to honor force as it's expected and
documented to do.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Raviu reported that running his regular fs_trim segfaulted with the
following backtrace:
[ 237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
[ 237.525984] ------------[ cut here ]------------
[ 237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
[ 237.525992] invalid opcode: 0000 [#1] SMP PTI
[ 237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G U OE 5.4.14-8-vanilla #1
[ 237.526001] Hardware name: ASUSTeK COMPUTER INC.
[ 237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
[ 237.526079] Call Trace:
[ 237.526120] find_first_clear_extent_bit+0x13d/0x150 [btrfs]
[ 237.526148] btrfs_trim_fs+0x211/0x3f0 [btrfs]
[ 237.526184] btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
[ 237.526219] btrfs_ioctl+0x129a/0x2ed0 [btrfs]
[ 237.526227] ? filemap_map_pages+0x190/0x3d0
[ 237.526232] ? do_filp_open+0xaf/0x110
[ 237.526238] ? _copy_to_user+0x22/0x30
[ 237.526242] ? cp_new_stat+0x150/0x180
[ 237.526247] ? do_vfs_ioctl+0xa4/0x640
[ 237.526278] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 237.526283] do_vfs_ioctl+0xa4/0x640
[ 237.526288] ? __do_sys_newfstat+0x3c/0x60
[ 237.526292] ksys_ioctl+0x70/0x80
[ 237.526297] __x64_sys_ioctl+0x16/0x20
[ 237.526303] do_syscall_64+0x5a/0x1c0
[ 237.526310] entry_SYSCALL_64_after_hwframe+0x49/0xbe
That was due to btrfs_fs_device::aloc_tree being empty. Initially I
thought this wasn't possible and as a percaution have put the assert in
find_first_clear_extent_bit. Turns out this is indeed possible and could
happen when a file system with SINGLE data/metadata profile has a 2nd
device added. Until balance is run or a new chunk is allocated on this
device it will be completely empty.
In this case find_first_clear_extent_bit should return the full range
[0, -1ULL] and let the caller handle this i.e for trim the end will be
capped at the size of actual device.
Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
Fixes: 45bfcfc168 ("btrfs: Implement find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There exists a deadlock with range_cyclic that has existed forever. If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out. The task traces are as follows
PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5"
#0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
#1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
#2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
#3 [ffffc900297bb708] __lock_page at ffffffff811f145b
#4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
#5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
#6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
#7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
#8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
#9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND:
"aio-dio-invalid"
#0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
#1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
#2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
#3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
#4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
#5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
#6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
#7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
#8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
#9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
I used drgn to find the respective pages we were stuck on
page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148. aio-dio-invalid has 7680, and the kworker epd looks like the
following
crash> struct extent_page_data ffffc900297bbbb0
struct extent_page_data {
bio = 0xffff889f747ed830,
tree = 0xffff889eed6ba448,
extent_locked = 0,
sync_io = 0
}
Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).
Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on
bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
for i in range(0, bio.bi_vcnt.value_()):
bv = bio.bi_io_vec[i]
if bv.bv_page.value_() == 0xffffea00fbfc7500:
print("FOUND IT")
which validated what I suspected.
The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.
Fixes: b293f02e14 ("Btrfs: Add writepages support")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.
Consider the following example that explains how/why the problems happens:
1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;
2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';
3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;
4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;
5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;
6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;
7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.
This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:
[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4vEM0ACgkQxWXV+ddt
WDvWFA//eHQZjWQ4FNby9AZwkDS4NIq2v6XGg7Xx9Ioxf1NenuBdyt1uNda2I3Mw
4/0tEnrd3XL3POldx4TTZXlg0itfXeY2un8pGREPwbqGL67296wmENUT8kwXU8aX
4zMuqO8veO7jzqeibX7T0VntnAC4Q+yWUd8Tb07KXaNH6Jeowo2rbU7n2CdI8T91
IZdc/AxnmomVQ3Y0lmQe5fL5z9KRkwud7toXek8yVXyheX1JgnqkzVjCgLyk3ai7
0VP0jfOPtC1qoMPR+buD2VuxaDTrLGkmsaG9l4XZghmmSnAvTMOxIosFA/SLqHwp
LPDZ5RABbxMj1MHTAr2Bq+9uokqSIIdbSWFiMazeJsL/JWOiOQk0BekNJudeL7Ow
PtGIZkKC+c3qKIjLymnd+u56klCe/O4vhx7wOHjoGxC0j+qRN473LK8dmGmgM9Yv
XH5YfWqDpjt5h7SSei94S7JrqaJ29+gY8bMtMXJhx46jHh6AdXAkszU83kLr8Biz
mvV49ECib12RuI4EVB4I3XedCftzMRH5ocN/Y3B2qGn/bU3g9nuExAc24FRBt6QK
e/ucc4jQyRpbvpDwsmdE+kSxUvENuIRnlyke9XdOsN9Mth0X85gvYOS6aFS+4qjq
QiXborlp+fZY00INYjG6Ck9UyfpdSEUJg452xVkGBhd/emAAIDo=
=FGXQ
-----END PGP SIGNATURE-----
Merge tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull fs deduplication fix from David Sterba:
"This is a fix for deduplication bug: the last block of two files is
allowed to deduplicated. This got broken in 5.1 by lifting some
generic checks to VFS layer. The affected filesystems are btrfs and
xfs.
The patches are marked for stable as the bug decreases deduplication
effectivity"
* tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: make deduplication with range including the last block work
fs: allow deduplication of eof block into the end of the destination file
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4vDYkACgkQxWXV+ddt
WDsNJQ//WJEcYoRpN5Y7oOIk/vo5ulF68P3kUh3hl206A13xpaHorvTvZKAD5s2o
C6xACJk839sGEhMdDRWvdeBDCHTedMk7EXjiZ6kJD+7EPpWmDllI5O6DTolT7SR2
b9zId4KCO+m8LiLZccRsxCJbdkJ7nJnz2c5+063TjsS3uq1BFudctRUjW/XnFCCZ
JIE5iOkdXrA+bFqc+l2zKTwgByQyJg+hVKRTZEJBT0QZsyNQvHKzXAmXxGopW8bO
SeuzFkiFTA0raK8xBz6mUwaZbk40Qlzm9v9AitFZx0x2nvQnMu447N3xyaiuyDWd
Li1aMN0uFZNgSz+AemuLfG0Wj70x1HrQisEj958XKzn4cPpUuMcc3lr1PZ2NIX+C
p6pSgaLOEq8Rc0U78/euZX6oyiLJPAmQO1TdkVMHrcMi36esBI6uG11rds+U+xeK
XoP20qXLFVYLLrl3wH9F4yIzydfMYu66Us1AeRPRB14NSSa7tbCOG//aCafOoLM6
518sJCazSWlv1kDewK8dtLiXc8eM6XJN+KI4NygFZrUj2Rq376q5oovUUKKkn3iN
pdHtF/7gAxIx6bZ+jY/gyt/Xe5AdPi7sKggahvrSOL3X+LLINwC4r+vAnnpd6yh4
NfJj5fobvc/mO9PEVMwgJ8PmHw5uNqeMlORGjk7stQs7Oez3tCw=
=4OkE
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features, highlights:
- async discard
- "mount -o discard=async" to enable it
- freed extents are not discarded immediatelly, but grouped
together and trimmed later, with IO rate limiting
- the "sync" mode submits short extents that could have been
ignored completely by the device, for SATA prior to 3.1 the
requests are unqueued and have a big impact on performance
- the actual discard IO requests have been moved out of
transaction commit to a worker thread, improving commit latency
- IO rate and request size can be tuned by sysfs files, for now
enabled only with CONFIG_BTRFS_DEBUG as we might need to
add/delete the files and don't have a stable-ish ABI for
general use, defaults are conservative
- export device state info in sysfs, eg. missing, writeable
- no discard of extents known to be untouched on disk (eg. after
reservation)
- device stats reset is logged with process name and PID that called
the ioctl
Fixes:
- fix missing hole after hole punching and fsync when using NO_HOLES
- writeback: range cyclic mode could miss some dirty pages and lead
to OOM
- two more corner cases for metadata_uuid change after power loss
during the change
- fix infinite loop during fsync after mix of rename operations
Core changes:
- qgroup assign returns ENOTCONN when quotas not enabled, used to
return EINVAL that was confusing
- device closing does not need to allocate memory anymore
- snapshot aware code got removed, disabled for years due to
performance problems, reimplmentation will allow to select wheter
defrag breaks or does not break COW on shared extents
- tree-checker:
- check leaf chunk item size, cross check against number of
stripes
- verify location keys for DIR_ITEM, DIR_INDEX and XATTR items
- new self test for physical -> logical mapping code, used for super
block range exclusion
- assertion helpers/macros updated to avoid objtool "unreachable
code" reports on older compilers or config option combinations"
* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
btrfs: free block groups after free'ing fs trees
btrfs: Fix split-brain handling when changing FSID to metadata uuid
btrfs: Handle another split brain scenario with metadata uuid feature
btrfs: Factor out metadata_uuid code from find_fsid.
btrfs: Call find_fsid from find_fsid_inprogress
Btrfs: fix infinite loop during fsync after rename operations
btrfs: set trans->drity in btrfs_commit_transaction
btrfs: drop log root for dropped roots
btrfs: sysfs, add devid/dev_state kobject and device attributes
btrfs: Refactor btrfs_rmap_block to improve readability
btrfs: Add self-tests for btrfs_rmap_block
btrfs: selftests: Add support for dummy devices
btrfs: Move and unexport btrfs_rmap_block
btrfs: separate definition of assertion failure handlers
btrfs: device stats, log when stats are zeroed
btrfs: fix improper setting of scanned for range cyclic write cache pages
btrfs: safely advance counter when looking up bio csums
btrfs: remove unused member btrfs_device::work
btrfs: remove unnecessary wrapper get_alloc_profile
btrfs: add correction to handle -1 edge case in async discard
...
Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4sLasACgkQxWXV+ddt
WDsegg/8CBQ1/pGj+8mvf+ws6f71Av8jspY2Ebr+HCjaGhD2MG3HI1kA5gC9Qnbb
fQVd12M5ma2BTrIcszxwm+VMIMlDotRFzfAp8uuFJtW0aAEGMCboX6VRYWa/4I0o
SmgJg0RYh926VL73qSe3S72pfIYjar30RwjVIVTmsHxL/D/lEkrHg6IGKRCe/MaN
eQipth3iuFtcWmGm1+DxEySsOs7AMPg3wL8KVnQcYoDI2kg3BXFH9a4wTE6VmWsU
ZjonJBA/Rl8oA2YOVDum4mL5j2c5RulWEymdVKyo1oH+8kLDOQ8snd7Bxp3qtJ1C
gdVbS8gi7gT5/C+yex+ZWlAdfmCSGWj7dr7jjiELZhTrsBhtS7y+GM52GivSrJ3z
TciNQtF/Y0SrZGprPMgVGAHuIKWWwSmWJPmkRB4zv/5efFFdKg8/UmcRmh6dMo83
IF4VPEBQgJLj3ja9Wns5yvW9asKNcynGeFK7aV+BlGW/wuvBW9o017c4Q04dXSAK
iFpipJaR/6ZGmXlRQLa1uyKWVHNIfSFT47WJqa6Dbo6iWRE/S/MhfkZU42z2A3H9
O2qMWmZikZnPCkha6fWyNJEDxF3imC+/LBsYoEuVPR7kZ/irDnI1cJNsTocOlyj1
kgFtL5MnCBHCop9/tPGiVdin9ilHJs3q2kAkR5BNCSEqhC8mo4g=
=IPUk
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Here's a last minute fix for a regression introduced in this
development cycle.
There's a small chance of a silent corruption when device replace and
NOCOW data writes happen at the same time in one block group. Metadata
or COW data writes are unaffected.
The extra fixup patch is there to silence an unnecessary warning"
* tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: dev-replace: remove warning for unknown return codes when finished
btrfs: scrub: Require mandatory block group RO for dev-replace
The fstests btrfs/011 triggered a warning at the end of device replace,
[ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
[ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
[ 1892.059993] ------------[ cut here ]------------
[ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
[ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
[ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
[ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
[ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
[ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
[ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
[ 1892.108642] FS: 00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
[ 1892.111558] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
[ 1892.115814] Call Trace:
[ 1892.116896] btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
[ 1892.118962] btrfs_ioctl+0x1d62/0x2550 [btrfs]
caused by the previous patch ("btrfs: scrub: Require mandatory block
group RO for dev-replace"). Hitting ENOSPC is possible and could happen
when the block group is set read-only, preventing NOCOW writes to the
area that's being accessed by dev-replace.
This has happend with scratch devices of size 12G but not with 5G and
20G, so this is depends on timing and other activity on the filesystem.
The whole replace operation is restartable, the space state should be
examined by the user in any case.
The error code is propagated back to the ioctl caller so the kernel
warning is causing false alerts.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.
The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.
The bug is observable after commit b12de52896 ("btrfs: scrub: Don't
check free space before marking a block group RO")
[CAUSE]
Dev-replace has two source of writes:
- Write duplication
All writes to source device will also be duplicated to target device.
Content: Not yet persisted data/meta
- Scrub copy
Dev-replace reused scrub code to iterate through existing extents, and
copy the verified data to target device.
Content: Previously persisted data and metadata
The difference in contents makes the following race possible:
Regular Writer | Dev-replace
-----------------------------------------------------------------
^ |
| Preallocate one data extent |
| at bytenr X, len 1M |
v |
^ Commit transaction |
| Now extent [X, X+1M) is in |
v commit root |
================== Dev replace starts =========================
| ^
| | Scrub extent [X, X+1M)
| | Read [X, X+1M)
| | (The content are mostly garbage
| | since it's preallocated)
^ | v
| Write back happens for |
| extent [X, X+512K) |
| New data writes to both |
| source and target dev. |
v |
| ^
| | Scrub writes back extent [X, X+1M)
| | to target device.
| | This will over write the new data in
| | [X, X+512K)
| v
This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).
This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.
*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
-f collapse=0 -f punch=0 -f resvsp=0"
I didn't expect resvsp ioctl will fallback to fallocate in VFS...
[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().
This patch will mostly revert commit 76a8efa171 ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.
The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171 ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since btrfs was migrated to use the generic VFS helpers for clone and
deduplication, it stopped allowing for the last block of a file to be
deduplicated when the source file size is not sector size aligned (when
eof is somewhere in the middle of the last block). There are two reasons
for that:
1) The generic code always rounds down, to a multiple of the block size,
the range's length for deduplications. This means we end up never
deduplicating the last block when the eof is not block size aligned,
even for the safe case where the destination range's end offset matches
the destination file's size. That rounding down operation is done at
generic_remap_check_len();
2) Because of that, the btrfs specific code does not expect anymore any
non-aligned range length's for deduplication and therefore does not
work if such nona-aligned length is given.
This patch addresses that second part, and it depends on a patch that
fixes generic_remap_check_len(), in the VFS, which was submitted ealier
and has the following subject:
"fs: allow deduplication of eof block into the end of the destination file"
These two patches address reports from users that started seeing lower
deduplication rates due to the last block never being deduplicated when
the file size is not aligned to the filesystem's block size.
Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sometimes when running generic/475 we would trip the
WARN_ON(cache->reserved) check when free'ing the block groups on umount.
This is because sometimes we don't commit the transaction because of IO
errors and thus do not cleanup the tree logs until at umount time.
These blocks are still reserved until they are cleaned up, but they
aren't cleaned up until _after_ we do the free block groups work. Fix
this by moving the free after free'ing the fs roots, that way all of the
tree logs are cleaned up and we have a properly cleaned fs. A bunch of
loops of generic/475 confirmed this fixes the problem.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current code doesn't correctly handle the situation which arises when
a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID
changed to the one in metadata uuid. This causes the incompat flag to
disappear.
In case of a power failure we could end up in a situation where part of
the disks in a multi-disk filesystem are correctly reverted to
METADATA_UUID_INCOMPAT flag unset state, while others have
METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.
This patch corrects the behavior required to handle the case where a
disk of the second type is scanned first, creating the necessary
btrfs_fs_devices. Subsequently, when a disk which has already completed
the transition is scanned it should overwrite the data in
btrfs_fs_devices.
Reported-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is one more cases which isn't handled by the original metadata
uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
the user decides to change the FSID to the original one e.g. have
metadata_uuid and fsid match. In case of power failure while this
operation is in progress we could end up in a situation where some of
the disks have the incompat bit removed and the other half have both
METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.
This patch handles the case where a disk that has successfully changed
its FSID such that it equals METADATA_UUID is scanned first.
Subsequently when a disk with both
METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
This is done by extending find_fsid_changed to correctly find
btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
the metadata_uuid of the currently scanned device.
Fixes: cc5de4e702 ("btrfs: Handle final split-brain possibility during fsid change")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reported-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_fsid became rather hairy with the introduction of metadata uuid
changing feature. Alleviate this by factoring out the metadata uuid
specific code in a dedicated function which deals with finding
correct fsid for a device with changed uuid.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since find_fsid_inprogress should also handle the case in which an fs
didn't change its FSID make it call find_fsid directly. This makes the
code in device_list_add simpler by eliminating a conditional call of
find_fsid. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recently fsstress (from fstests) sporadically started to trigger an
infinite loop during fsync operations. This turned out to be because
support for the rename exchange and whiteout operations was added to
fsstress in fstests. These operations, unlike any others in fsstress,
cause file names to be reused, whence triggering this issue. However
it's not necessary to use rename exchange and rename whiteout operations
trigger this issue, simple rename operations and file creations are
enough to trigger the issue.
The issue boils down to when we are logging inodes that conflict (that
had the name of any inode we need to log during the fsync operation), we
keep logging them even if they were already logged before, and after
that we check if there's any other inode that conflicts with them and
then add it again to the list of inodes to log. Skipping already logged
inodes fixes the issue.
Consider the following example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir # inode 257
$ touch /mnt/testdir/zz # inode 258
$ ln /mnt/testdir/zz /mnt/testdir/zz_link
$ touch /mnt/testdir/a # inode 259
$ sync
# The following 3 renames achieve the same result as a rename exchange
# operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).
$ mv /mnt/testdir/a /mnt/testdir/a/tmp
$ mv /mnt/testdir/zz_link /mnt/testdir/a
$ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link
# The following rename and file creation give the same result as a
# rename whiteout operation (<rename_whiteout> zz to a2).
$ mv /mnt/testdir/zz /mnt/testdir/a2
$ touch /mnt/testdir/zz # inode 260
$ xfs_io -c fsync /mnt/testdir/zz
--> results in the infinite loop
The following steps happen:
1) When logging inode 260, we find that its reference named "zz" was
used by inode 258 in the previous transaction (through the commit
root), so inode 258 is added to the list of conflicting indoes that
need to be logged;
2) After logging inode 258, we find that its reference named "a" was
used by inode 259 in the previous transaction, and therefore we add
inode 259 to the list of conflicting inodes to be logged;
3) After logging inode 259, we find that its reference named "zz_link"
was used by inode 258 in the previous transaction - we add inode 258
to the list of conflicting inodes to log, again - we had already
logged it before at step 3. After logging it again, we find again
that inode 259 conflicts with him, and we add again 259 to the list,
etc - we end up repeating all the previous steps.
So fix this by skipping logging of conflicting inodes that were already
logged.
Fixes: 6b5fc433a7 ("Btrfs: fix fsync after succession of renames of different files")
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we abort a transaction we have the following sequence
if (!trans->dirty && list_empty(&trans->new_bgs))
return;
WRITE_ONCE(trans->transaction->aborted, err);
The idea being if we didn't modify anything with our trans handle then
we don't really need to abort the whole transaction, maybe the other
trans handles are fine and we can carry on.
However in the case of create_snapshot we add a pending_snapshot object
to our transaction and then commit the transaction. We don't actually
modify anything. sync() behaves the same way, attach to an existing
transaction and commit it. This means that if we have an IO error in
the right places we could abort the committing transaction with our
trans->dirty being not set and thus not set transaction->aborted.
This is a problem because in the create_snapshot() case we depend on
pending->error being set to something, or btrfs_commit_transaction
returning an error.
If we are not the trans handle that gets to commit the transaction, and
we're waiting on the commit to happen we get our return value from
cur_trans->aborted. If this was not set to anything because sync() hit
an error in the transaction commit before it could modify anything then
cur_trans->aborted would be 0. Thus we'd return 0 from
btrfs_commit_transaction() in create_snapshot.
This is a problem because we then try to do things with
pending_snapshot->snap, which will be NULL because we didn't create the
snapshot, and then we'll get a NULL pointer dereference like the
following
"BUG: kernel NULL pointer dereference, address: 00000000000001f0"
RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
Call Trace:
? btrfs_mksubvol.isra.31+0x3f2/0x510
btrfs_mksubvol.isra.31+0x4bc/0x510
? __sb_start_write+0xfa/0x200
? mnt_want_write_file+0x24/0x50
btrfs_ioctl_snap_create_transid+0x16c/0x1a0
btrfs_ioctl_snap_create_v2+0x11e/0x1a0
btrfs_ioctl+0x1534/0x2c10
? free_debug_processing+0x262/0x2a3
do_vfs_ioctl+0xa6/0x6b0
? do_sys_open+0x188/0x220
? syscall_trace_enter+0x1f8/0x330
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x4a/0x1b0
In order to fix this we need to make sure anybody who calls
commit_transaction has trans->dirty set so that they properly set the
trans->transaction->aborted value properly so any waiters know bad
things happened.
This was found while I was running generic/475 with my modified
fsstress, it reproduced within a few runs. I ran with this patch all
night and didn't see the problem again.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fsync on a subvolume and create a log root for that volume, and
then later delete that subvolume we'll never clean up its log root. Fix
this by making switch_commit_roots free the log for any dropped roots we
encounter. The extra churn is because we need a btrfs_trans_handle, not
the btrfs_transaction.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
New sysfs attributes that track the filesystem status of devices, stored
in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
a directory for each device, with name corresponding to the numerical
device id.
in_fs_metadata - device is in the list of fs metadata
missing - device is missing (no device node or block device)
replace_target - device is target of replace
writeable - writes from fs are allowed
These attributes reflect the state of the device::dev_state and created
at mount time.
Sample output:
$ pwd
/sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
$ ls
in_fs_metadata missing replace_target writeable
$ cat missing
0
The output from these attributes are 0 or 1. 0 indicates unset and 1
indicates set. These attributes are readonly.
It is observed that the device delete thread and sysfs read thread will
not race because the delete thread calls sysfs kobject_put() which in
turn waits for existing sysfs read to complete.
Note for device replace devid swap:
During the replace the target device temporarily assumes devid 0 before
assigning the devid of the soruce device.
In btrfs_dev_replace_finishing() we remove source sysfs devid using the
function btrfs_sysfs_remove_devices_attr(), so after that call
kobject_rename() to update the devid in the sysfs. This adds and calls
btrfs_sysfs_update_devid() helper function to update the device id.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move variables to appropriate scope. Remove last BUG_ON in the function
and rework error handling accordingly. Make the duplicate detection code
more straightforward. Use in_range macro. And give variables more
descriptive name by explicitly distinguishing between IO stripe size
(size recorded in the chunk item) and data stripe size (the size of
an actual stripe, constituting a logical chunk/block group).
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add RAID1 and single testcases to verify that data stripes are excluded
from super block locations and that the address mapping is valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add basic infrastructure to create and link dummy btrfs_devices. This
will be used in the pending btrfs_rmap_block test which deals with
the block groups.
Calling btrfs_alloc_dummy_device will link the newly created device to
the passed fs_info and the test framework will free them once the test
is finished.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only during initial block group reading to map physical
address of super block to a list of logical ones. Make it private to
block-group.c, add proper kernel doc and ensure it's exported only for
tests.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a report where objtool detects unreachable instructions, eg.:
fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction
This seems to be a false positive due to compiler version. The cause is
in the ASSERT macro implementation that does the conditional check as
IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.
To avoid that, use the ifdefs directly.
There are still 2 reports that aren't fixed:
fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
We had a report indicating that some read errors aren't reported by the
device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.
For example:
BTRFS info (device sdc): device stats zeroed by btrfs(9223)
Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We noticed that we were having regular CG OOM kills in cases where there
was still enough dirty pages to avoid OOM'ing. It turned out there's
this corner case in btrfs's handling of range_cyclic where files that
were being redirtied were not getting fully written out because of how
we do range_cyclic writeback.
We unconditionally were setting scanned = 1; the first time we found any
pages in the inode. This isn't actually what we want, we want it to be
set if we've scanned the entire file. For range_cyclic we could be
starting in the middle or towards the end of the file, so we could write
one page and then not write any of the other dirty pages in the file
because we set scanned = 1.
Fix this by not setting scanned = 1 if we find pages. The rules for
setting scanned should be
1) !range_cyclic. In this case we have a specified range to write out.
2) range_cyclic && index == 0. In this case we've started at the
beginning and there is no need to loop around a second time.
3) range_cyclic && we started at index > 0 and we've reached the end of
the file without satisfying our nr_to_write.
This patch fixes both of our writepages implementations to make sure
these rules hold true. This fixed our over zealous CG OOMs in
production.
Fixes: d1310b2e0c ("Btrfs: Split the extent_map code into two parts")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Dan's smatch tool reports
fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums()
warn: should this be 'count == -1'
which points to the while (count--) loop. With count == 0 the check
itself could decrement it to -1. There's a WARN_ON a few lines below
that has never been seen in practice though.
It turns out that the value of page_bytes_left matches the count (by
sectorsize multiples). The loop never reaches the state where count
would go to -1, because page_bytes_left == 0 is found first and this
breaks out.
For clarity, use only plain check on count (and only for positive
value), decrement safely inside the loop. Any other discrepancy after
the whole bio list processing should be reported by the exising
WARN_ON_ONCE as well.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a leftover from recently removed bio scheduling framework.
Fixes: ba8a9d0795 ("Btrfs: delete the entire async bio submission framework")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_get_alloc_profile() is a simple wrapper over get_alloc_profile().
The only difference is btrfs_get_alloc_profile() is visible to other
functions in btrfs while get_alloc_profile() is static and thus only
visible to functions in block-group.c.
Let's just fold get_alloc_profile() into btrfs_get_alloc_profile() to
get rid of the unnecessary second function.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
From Dave's testing described below, it's possible to drive a file
system to have bogus values of discardable_extents and _bytes. As
btrfs_discard_calc_delay() is the only user of discardable_extents, we
can correct here for any negative discardable_extents/discardable_bytes.
The problem is not reliably reproducible. The workload that created it
was based on linux git tree, switching between release tags, then
everytihng deleted followed by a full rebalance. At this state the
values of discardable_bytes was 16K and discardable_extents was -1,
expected values 0 and 0.
Repeating the workload again did not correct the bogus values so the
offset seems to be stable once it happens.
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Most callers of free_bitmap() only call it if bitmap_info->bytes is 0.
However, there are certain cases where we may free the free space cache
via __btrfs_remove_free_space_cache(). This exposes a path where
free_bitmap() is called regardless. This may result in a bad accounting
situation for discardable_bytes and discardable_extents. So, remove the
stats and call btrfs_discard_update_discardable().
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
It's less than ideal for small extents to eat into our extent budget, so
force extents <= 32KB into the bitmaps save for the first handful.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, there is no way for the free space cache to recover from
being serviced by purely bitmaps because the extent threshold is set to
0 in recalculate_thresholds() when we surpass the metadata allowance.
This adds a recovery mechanism by keeping large extents out of the
bitmaps and increases the metadata upper bound to 64KB. The recovery
mechanism bypasses this upper bound, thus making it a soft upper bound.
But, with the bypass being 1MB or greater, it shouldn't add unbounded
overhead.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Give a brief overview for how async discard is implemented.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Keep track of how much we are discarding and how often we are reusing
with async discard. The discard_*_bytes values don't need any special
protection because the work item provides the single threaded access.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As mentioned earlier, discarding data can be done either by issuing an
explicit discard or implicitly by reusing the LBA. Metadata block_groups
see much more frequent reuse due to well it being metadata. So instead
of explicitly discarding metadata block_groups, just leave them be and
let the latter implicit discarding be done for them.
For mixed block_groups, block_groups which contain both metadata and
data, we let them be as higher fragmentation is expected.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Non-block group destruction discarding currently only had a single list
with no minimum discard length. This can lead to caravaning more
meaningful discards behind a heavily fragmented block group.
This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote block groups back up when we
exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support
only 2 lists with filters of 1MB and 32KB respectively.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Expose max_discard_size as a tunable via sysfs and switch the current
fixed maximum to the default value.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Throttle the maximum size of a discard so that we can provide an upper
bound for the rate of async discard. While the block layer is able to
split discards into the appropriate sized discards, we want to be able
to account more accurately the rate at which we are consuming NCQ slots
as well as limit the upper bound of work for a discard.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Provide the ability to rate limit based on kbps in addition to iops as
additional guides for the target discard rate. The delay used ends up
being max(kbps_delay, iops_delay).
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
An earlier patch keeps track of discardable_extents. These are
undiscarded extents managed by the free space cache. Here, we will use
this to dynamically calculate the discard delay interval.
There are 3 rate to consider. The first is the target convergence rate,
the rate to discard all discardable_extents over the
BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower
limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper
limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every
transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Keep track of this metric so that we can understand how ahead or behind
we are in discarding rate. This uses the same accounting method as
discardable_extents, deltas between previous/current values and
propagating them up.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The number of discardable extents will serve as the rate limiting metric
for how often we should discard. This keeps track of discardable extents
in the free space caches by maintaining deltas and propagating them to
the global count.
The deltas are calculated from 2 values stored in PREV and CURR entries,
then propagated up to the global discard ctl. The current counter value
becomes the previous counter value after update.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Setup base sysfs directory for discard stats + tunables.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs only allowed attributes to be exposed in debug/. Let's let other
groups be created by making debug its own kobject.
This also makes the per-fs debug options separate from the global
features mount attributes. This seems to be needed as
sysfs_create_files() requires const struct attribute * while
sysfs_create_group() can take struct attribute *. This seems nicer as
per file system, you'll probably use to_fs_info().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We probably should call sysfs_remove_group() on debug/.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The prior two patches added discarding via a background workqueue. This
just piggybacked off of the fstrim code to trim the whole block at once.
Well inevitably this is worse performance wise and will aggressively
overtrim. But it was nice to plumb the other infrastructure to keep the
patches easier to review.
This adds the real goal of this series which is discarding slowly (ie. a
slow long running fstrim). The discarding is split into two phases,
extents and then bitmaps. The reason for this is two fold. First, the
bitmap regions overlap the extent regions. Second, discarding the
extents first will let the newly trimmed bitmaps have the highest chance
of coalescing when being readded to the free space cache.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.
This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.
For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.
On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.
discard=sync:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
---------------------------------------------------------------
Drive A | 434 | 1170
Drive B | 880 | 2330
Drive C | 2943 | 3920
Drive D | 4763 | 5701
discard=async:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
--------------------------------------------------------------
Drive A | 134 | 956
Drive B | 64 | 1972
Drive C | 59 | 1032
Drive D | 62 | 1200
While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a cap in btrfs in the amount of free extents that a block group
can have. When it surpasses that threshold, future extents are placed
into bitmaps. Instead of keeping track of if a certain bit is trimmed or
not in a second bitmap, keep track of the relative state of the bitmap.
With async discard, trimming bitmaps becomes a more frequent operation.
As a trade off with simplicity, we keep track of if discarding a bitmap
is in progress. If we fully scan a bitmap and trim as necessary, the
bitmap is marked clean. This has some caveats as the min block size may
skip over regions deemed too small. But this should be a reasonable
trade off rather than keeping a second bitmap and making allocation
paths more complex. The downside is we may overtrim, but ideally the min
block size should prevent us from doing that too often and getting stuck
trimming pathological cases.
BTRFS_TRIM_STATE_TRIMMING is added to indicate a bitmap is in the
process of being trimmed. If additional free space is added to that
bitmap, the bit is cleared. A bitmap will be marked
BTRFS_TRIM_STATE_TRIMMED if the trimming code was able to reach the end
of it and the former is still set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Async discard will use the free space cache as backing knowledge for
which extents to discard. This patch plumbs knowledge about which
extents need to be discarded into the free space cache from
unpin_extent_range().
An untrimmed extent can merge with everything as this is a new region.
Absorbing trimmed extents is a tradeoff to for greater coalescing which
makes life better for find_free_extent(). Additionally, it seems the
size of a trim isn't as problematic as the trim io itself.
When reading in the free space cache from disk, if sync is set, mark all
extents as trimmed. The current code ensures at transaction commit that
all free space is trimmed when sync is set, so this reflects that.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This series introduces async discard which will use the flag
DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
synchronously done in transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
There is a user report in the mail list, showing the following corrupted
tree blocks:
item 62 key (486836 DIR_ITEM 2543451757) itemoff 6273 itemsize 74
location key (4065004 INODE_ITEM 1073741824) type FILE
transid 21397 data_len 0 name_len 44
name: FILENAME
Note that location key, its offset should be 0 for all INODE_ITEMS.
This caused failed lookup of the inode.
[CAUSE]
That offending value, 1073741824, is 0x40000000. So this looks like a
memory bit flip.
[FIX]
This patch will enhance tree-checker to check location key of
DIR_INDEX/DIR_ITEM/XATTR_ITEM.
There are several different combinations needs to check:
- item_key.type == DIR_INDEX/DIR_ITEM
* location_key.type == BTRFS_INODE_ITEM_KEY
This location_key should follow the check in inode_item check.
* location_key.type == BTRFS_ROOT_ITEM_KEY
Despite the existing check, DIR_INDEX/DIR_ITEM can only points to
subvolume trees.
* All other keys are not allowed.
- item_key.type == XATTR_ITEM
location_key should be all 0.
Reported-by: Mike Gilbert <floppymaster@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ROOT_ITEM key check itself is not as simple as single line check, and
will be reused for both ROOT_ITEM and DIR_ITEM/DIR_INDEX location key
check, so refactor such check into check_root_key().
Also since we are here, fix a comment error about ROOT_ITEM offset,
which is transid of snapshot creation, not some "older kernel behavior".
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inode key check is not as easy as several lines, and it will be called
in more than one location (INODE_ITEM check and
DIR_ITEM/DIR_INDEX/XATTR_ITEM location key check).
So here refactor such check into check_inode_key(). And add extra
checks for XATTR_ITEM.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @fs_info parameter can be extracted from extent_buffer structure,
and there are already some wrappers getting rid of the @fs_info
parameter.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inspired by btrfs-progs github issue #208, where chunk item in chunk
tree has invalid num_stripes (0).
Although that can already be caught by current btrfs_check_chunk_valid(),
that function doesn't really check item size as it needs to handle chunk
item in super block sys_chunk_array().
This patch will add two extra checks for chunk items in chunk tree:
- Basic chunk item size
If the item is smaller than btrfs_chunk (which already contains one
stripe), exit right now as reading num_stripes may even go beyond
eb boundary.
- Item size check against num_stripes
If item size doesn't match with calculated chunk size, then either the
item size or the num_stripes is corrupted. Error out anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This hasn't been used since it was first introduced in commit
b4bd745d12 ("btrfs: Introduce find_free_extent_ctl structure for later
rework"). Passing that to btrfs_add_reserved_bytes in find_free_extent
is not strictly necessary and using the local ram_bytes instead seems
cleaner.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 7087a9d8db ("btrfs: Remove
extent_io_ops::writepage_end_io_hook") left this logic in a confusing
state. Simplify it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.
While we're here, let's document btrfs_get_extent().
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __extent_writepage_io(), we check whether
i_size <= page_offset(page).
Note that if i_size < page_offset(page), then
i_size >> PAGE_SHIFT < page->index.
If i_size == page_offset(page), then
i_size >> PAGE_SHIFT == page->index && offset_in_page(i_size) == 0.
__extent_writepage() already has a check for these cases that
returns without calling __extent_writepage_io():
end_index = i_size >> PAGE_SHIFT
pg_offset = offset_in_page(i_size);
if (page->index > end_index ||
(page->index == end_index && !pg_offset)) {
page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
unlock_page(page);
return 0;
}
Get rid of the one in __extent_writepage_io(), which was obsoleted in
211c17f51f ("Fix corners in writepage and btrfs_truncate_page").
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since 40f765805f ("Btrfs: split up __extent_writepage to lower stack
usage"), done_unlocked is simply a return 0. Get rid of it.
Mid-statement block returns don seem to make the code less readable here.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're initializing pg_offset to 0, setting it immediately, then
reassigning it to 0 again after. The former became unnecessary in
211c17f51f ("Fix corners in writepage and btrfs_truncate_page"). The
latter is a leftover that should've been removed in 40f765805f
("Btrfs: split up __extent_writepage to lower stack usage"). Remove
both.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ordered->start, ordered->len, and ordered->disk_len correspond to
fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively.
It's confusing to translate between the two naming schemes. Since a
btrfs_ordered_extent is basically a pending btrfs_file_extent_item,
let's make the former use the naming from the latter.
Note that I didn't touch the names in tracepoints just in case there are
scripts depending on the current naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Snapshot-aware defrag has been disabled since commit 8101c8dbf6
("Btrfs: disable snapshot aware defrag for now") almost 6 years ago.
Let's remove the dead code. If someone is up to the task of bringing it
back, they can dig it up from git.
This is logically a revert of commit 38c227d87c ("Btrfs:
snapshot-aware defrag") except that now we have to clear the
EXTENT_DEFRAG bit to avoid need_force_cow() returning true forever.
The reasons to disable were caused by runtime problems (like long stalls
or memory consumption) on heavily referenced extents (eg. thousands of
snapshots). There were attempts to fix that but never finished.
Current defrag breaks the extent references and some users prefer that
behaviour over the one implemented by snapshot aware (ie. keeping links
for defragmentation). To enable both usecases we'd need to extend
defrag ioctl but let's do that properly from scratch.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance ]
Signed-off-by: David Sterba <dsterba@suse.com>
We can encode this in the offset parameter: -1 means use the page
offsets, anything else is a valid offset.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we have two wrappers for __btrfs_lookup_bio_sums():
btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and
btrfs_lookup_bio_sums(), which is used everywhere else. The only
difference is that the _dio variant looks up csums starting at the given
offset instead of using the page index, which isn't actually direct
I/O-specific. Let's clean up the signature and return value of
__btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get
rid of the trivial helpers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When closing a device, btrfs_close_one_device() first allocates a new
device, copies the device to close's name, replaces it in the dev_list
with the copy and then finally frees it.
This involves two memory allocation, which can potentially fail. As this
code path is tricky to unwind, the allocation failures where handled by
BUG_ON()s.
But this copying isn't strictly needed, all that is needed is resetting
the device in question to it's state it had after the allocation.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_close_one_device we're decrementing the number of open devices
before we're calling btrfs_close_bdev().
As there is no intermediate exit between these points in this function it
is technically OK to do so, but it makes the code a bit harder to understand.
Move both operations closer together and move the decrement step after
btrfs_close_bdev().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When you snapshot a subvolume containing a subvolume, you get a
placeholder directory where the subvolume would be. These directories
have their own btrfs_dir_ro_inode_operations.
Al pointed out [1] that these directories can use simple_lookup()
instead of btrfs_lookup(), as they are always empty. Furthermore, they
can use the default generic_permission() instead of btrfs_permission();
the additional checks in the latter don't matter because we can't write
to the directory anyways. Finally, they can use the default
generic_update_time() instead of btrfs_update_time(), as the inode
doesn't exist on disk and doesn't need any special handling.
All together, this means that we can get rid of
btrfs_dir_ro_inode_operations and use simple_dir_inode_operations
instead.
1: https://lore.kernel.org/linux-btrfs/20190929052934.GY26530@ZenIV.linux.org.uk/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
We have a user report, that cppcheck is complaining about a possible
NULL-pointer dereference in btrfs_destroy_dev_replace_tgtdev().
We're first dereferencing the 'tgtdev' variable and the later check for
the validity of the pointer with a WARN_ON(!tgtdev);
But all callers of btrfs_destroy_dev_replace_tgtdev() either explicitly
check if 'tgtdev' is non-NULL or directly allocate 'tgtdev', so the
WARN_ON() is impossible to hit. Just remove it to silence the checker's
complains.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfsic_process_superblock() BUG_ON()s if 'state' is NULL. But this can
never happen as the only caller from btrfsic_process_superblock() is
btrfsic_mount() which allocates 'state' some lines above calling
btrfsic_process_superblock() and checks for the allocation to succeed.
Let's just remove the impossible to hit BUG_ON().
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A user reports a possible NULL-pointer dereference in
btrfsic_process_superblock(). We are assigning state->fs_info to a local
fs_info variable and afterwards checking for the presence of state.
While we would BUG_ON() a NULL state anyways, we can also just remove
the local fs_info copy, as fs_info is only used once as the first
argument for btrfs_num_copies(). There we can just pass in
state->fs_info as well.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a relic from a time before we had a proper reservation mechanism
and you could end up with really full chunks at chunk allocation time.
This doesn't make sense anymore, so just kill it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have the space_info, we can just check its flags to see if it's the
system chunk space info.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's a simple wrapper over btrfs_panic and is called only once. Just
open code it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two relocation stages but both print the same message. Add the
description of the stage. This can help debugging or provides
informative message to users.
BTRFS info (device dm-5): balance: start -d -m -s
BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup
BTRFS info (device dm-5): found 2 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 22020096 flags system|dup
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 13631488 flags data
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): found 1 extents, stage: update data pointers
BTRFS info (device dm-5): balance: ended with status: 0
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The condition '!ret2' is always true. commit 717beb96d9 ("Btrfs: fix
regression in btrfs_page_mkwrite() from vm_fault_t conversion") left
behind the check after moving this code out of the goto, so remove the
unused condition check.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
level <0 and level >= BTRFS_MAX_LEVEL are already performed upon
extent buffer read by tree checker in btrfs_check_node.
go. As far as 'level <= 0' we are guaranteed that level is '> 0'
because the value of level _before_ reading 'next' is larger than 1
(otherwise we wouldn't have executed that code at all) this in turn
guarantees that 'level' after btrfs_read_buffer is 'level - 1' since
we verify this invariant in:
btrfs_read_buffer
btree_read_extent_buffer_pages
btrfs_verify_level_key
This guarantees that level can never be '<= 0' so the warn on is
never triggered.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The log_root passed to walk_log_tree is guaranteed to have its
root_key.objectid always be BTRFS_TREE_LOG_OBJECTID. This is by
merit that all log roots of an ordinary root are allocated in
alloc_log_tree which hard-codes objectid to be BTRFS_TREE_LOG_OBJECTID.
In case walk_log_tree is called for a log tree found by btrfs_read_fs_root
in btrfs_recover_log_trees, that function already ensures
found_key.objectid is BTRFS_TREE_LOG_OBJECTID.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_free_reserved_extent now performs the actions of
btrfs_free_and_pin_reserved_extent. But this name is a bit of a
misnomer, since the extent is not really freed but just pinned. Reflect
this in the new name. No semantics changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_free_reserved_extent performs 2 entirely different operations
depending on whether its 'pin' argument is true or false. This patch
lifts the 2nd case (pin is false) into it's sole caller
btrfs_free_reserved_extent. No semantics changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of btrfs_free_reserved_extent (respectively
__btrfs_free_reserved_extent with in set to 0) pass in extents which
have only been reserved but not yet written to. Namely,
* in cow_file_range that function is called only if create_io_em fails
or btrfs_add_ordered_extent fail, both of which happen _before_ any IO
is submitted to the newly reserved range
* in submit_compressed_extents the code flow is similar -
out_free_reserve can be called only before
btrfs_submit_compressed_write which is where any writes to the range
could occur
* btrfs_new_extent_direct also calls btrfs_free_reserved_extent only
if extent_map fails, before any IO is issued
* __btrfs_prealloc_file_range also calls btrfs_free_reserved_extent
in case insertion of the metadata fails
* btrfs_alloc_tree_block again can only be called in case in-memory
operations fail, before any IO is submitted
* btrfs_finish_ordered_io - this is the only caller where discarding
the extent could have a material effect, since it can be called for
an extent which was partially written.
With this change the submission of discards is optimised since discards
are now not being created for extents which are known to not have been
touched on disk.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
qgroup create/remove code is currently returning EINVAL when the user
tries to create a qgroup on a subvolume without quota enabled. EINVAL is
already being used for too many error scenarios so that is hard to
depict what is the problem.
[FIX]
Currently scrub and balance code return -ENOTCONN when the user tries to
cancel/pause and no scrub or balance is currently running for the
desired subvolume. Do the same here by returning -ENOTCONN when a user
tries to create/delete/assing/list a qgroup on a subvolume without quota
enabled.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove some variables that are set only to be checked later, and never
used.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge btrfs_sysfs_add_fsid() and btrfs_sysfs_add_devices_kobj() functions
as these two are small and they are called one after the other.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_sysfs_add_device() creates the directory
/sys/fs/btrfs/UUID/devices but its function name is misleading. Rename
it to btrfs_sysfs_add_devices_kobj() instead. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 24bd69cb ("Btrfs: sysfs: add support to add parent for fsid")
added parent argument in preparation to show the seed fsid under the
sprout fsid as in the patch [1] in the mailing list.
[1] Btrfs: sysfs: support seed devices in the sysfs layout
But later this idea was superseded by another idea to rename the fsid as
in the commit f93c39970b ("btrfs: factor out sysfs code for updating
sprout fsid").
So we don't need parent argument anymore.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The struct member btrfs_device::device_dir_kobj holds the kobj of the
sysfs directory /sys/fs/btrfs/UUID/devices, so rename it from
device_dir_kobj to devices_kobj. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature, if we punch a hole into a file and then
fsync it, there are cases where a subsequent fsync will miss the fact that
a hole was punched, resulting in the holes not existing after replaying
the log tree.
Essentially these cases all imply that, tree-log.c:copy_items(), is not
invoked for the leafs that delimit holes, because nothing changed those
leafs in the current transaction. And it's precisely copy_items() where
we currenly detect and log holes, which works as long as the holes are
between file extent items in the input leaf or between the beginning of
input leaf and the previous leaf or between the last item in the leaf
and the next leaf.
First example where we miss a hole:
*) The extent items of the inode span multiple leafs;
*) The punched hole covers a range that affects only the extent items of
the first leaf;
*) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
is set in the inode's runtime flags).
That results in the hole not existing after replaying the log tree.
For example, if the fs/subvolume tree has the following layout for a
particular inode:
Leaf N, generation 10:
[ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
Leaf N + 1, generation 10:
[ EXTENT_ITEM (128K 64K) ... ]
If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
up dropping the two extent items from leaf N, but we don't touch the other
leaf, so we end up in the following state:
Leaf N, generation 11:
[ ... INODE_ITEM INODE_REF ]
Leaf N + 1, generation 10:
[ EXTENT_ITEM (128K 64K) ... ]
A full fsync after punching the hole will only process leaf N because it
was modified in the current transaction, but not leaf N + 1, since it
was not modified in the current transaction (generation 10 and not 11).
As a result the fsync will not log any holes, because it didn't process
any leaf with extent items.
Second example where we will miss a hole:
*) An inode as its items spanning 5 (or more) leafs;
*) A hole is punched and it covers only the extents items of the 3rd
leaf. This resulsts in deleting the entire leaf and not touching any
of the other leafs.
So the only leaf that is modified in the current transaction, when
punching the hole, is the first leaf, which contains the inode item.
During the full fsync, the only leaf that is passed to copy_items()
is that first leaf, and that's not enough for the hole detection
code in copy_items() to determine there's a hole between the last
file extent item in the 2nd leaf and the first file extent item in
the 3rd leaf (which was the 4th leaf before punching the hole).
Fix this by scanning all leafs and punch holes as necessary when doing a
full fsync (less common than a non-full fsync) when the NO_HOLES feature
is enabled. The lack of explicit file extent items to mark holes makes it
necessary to scan existing extents to determine if holes exist.
A test case for fstests follows soon.
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4h7wwACgkQxWXV+ddt
WDtFgxAAqanZ3wqq8xNqybfWmmFNrKtdkakErRuQFWe/+kZH59HuBifTbYeVMD4Z
hFgveFJpBIqo7uMUxUItSTtHcr3qx7TP9ejGYOaQO997oNPxPQXuEY8Lq5ebDBVB
89Gn+Eg/Q+uPvCJSctxx4dblSiGZKb3iOEh+lJuWJV4bj8beekcTrqsg01ZchPRO
Ygk1ltW7Vpf0wVkdts4FKiKiwX02M2C9zxh9NQjpNwH1DMow4XtBPsbqHbiHzRym
SoD4+0dbhfdnKkNnBTFEJBbjbZcYwM9EQnfiyVL+/hDMHX4XTetqeFN1G8usfXXX
2kxvwttPUtluJqlQXQnUU4mQEA4p5ORTgGgw1WBF3h+Aezumkql+27Bd6aiDKGZz
SPc9sveft60R23TxorlrYVqfADgyZKEaZ+2wEM99Xoz4OdvP7jkqDentJW9us1Xh
Xmfovq5xcRY17f9tdhiwqH5vgwxrLgmjBvTm/kcGX3ImhU8Yxk8xKw1JoV0P9cjW
7awK4l8pyPbOUhekdT8hYqWXlL/DXhAMHraV1zfBKIbu1omlGByeg23jNM2iS/0B
YtRkEEen0tRHpuKLB08twTKCak94wObBamKNFE6Snt1cDudLwGDpUosazM9l4uPR
2D3SHs7UWNgtvTRCfq2LVoRMRSR2BA1b19EkylUig4ay7khmZ2k=
=2e9q
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that have been in the works during last twp weeks.
All have a user visible effect and are stable material:
- scrub: properly update progress after calling cancel ioctl, calling
'resume' would start from the beginning otherwise
- fix subvolume reference removal, after moving out of the original
path the reference is not recognized and will lead to transaction
abort
- fix reloc root lifetime checks, could lead to crashes when there's
subvolume cleaning running in parallel
- fix memory leak when quotas get disabled in the middle of extent
accounting
- fix transaction abort in case of balance being started on degraded
mount on eg. RAID1"
* tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check rw_devices, not num_devices for balance
Btrfs: always copy scrub arguments back to user space
btrfs: relocation: fix reloc_root lifespan and access
btrfs: fix memory leak in qgroup accounting
btrfs: do not delete mismatched root refs
btrfs: fix invalid removal of root ref
btrfs: rework arguments of btrfs_unlink_subvol
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
If scrub returns an error we are not copying back the scrub arguments
structure to user space. This prevents user space to know how much
progress scrub has done if an error happened - this includes -ECANCELED
which is returned when users ask for scrub to stop. A particular use
case, which is used in btrfs-progs, is to resume scrub after it is
canceled, in that case it relies on checking the progress from the scrub
arguments structure and then use that progress in a call to resume
scrub.
So fix this by always copying the scrub arguments structure to user
space, overwriting the value returned to user space with -EFAULT only if
copying the structure failed to let user space know that either that
copying did not happen, and therefore the structure is stale, or it
happened partially and the structure is probably not valid and corrupt
due to the partial copy.
Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
Fixes: 06fe39ab15 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There are several different KASAN reports for balance + snapshot
workloads. Involved call paths include:
should_ignore_root+0x54/0xb0 [btrfs]
build_backref_tree+0x11af/0x2280 [btrfs]
relocate_tree_blocks+0x391/0xb80 [btrfs]
relocate_block_group+0x3e5/0xa00 [btrfs]
btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
btrfs_relocate_chunk+0x53/0xf0 [btrfs]
btrfs_balance+0xc91/0x1840 [btrfs]
btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
btrfs_ioctl+0x8af/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
create_reloc_root+0x9f/0x460 [btrfs]
btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
create_pending_snapshot+0xa9b/0x15f0 [btrfs]
create_pending_snapshots+0x111/0x140 [btrfs]
btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
btrfs_mksubvol+0x915/0x960 [btrfs]
btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
btrfs_ioctl+0x241b/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
create_pending_snapshot+0x209/0x15f0 [btrfs]
create_pending_snapshots+0x111/0x140 [btrfs]
btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
btrfs_mksubvol+0x915/0x960 [btrfs]
btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
btrfs_ioctl+0x241b/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
[CAUSE]
All these call sites are only relying on root->reloc_root, which can
undergo btrfs_drop_snapshot(), and since we don't have real refcount
based protection to reloc roots, we can reach already dropped reloc
root, triggering KASAN.
[FIX]
To avoid such access to unstable root->reloc_root, we should check
BTRFS_ROOT_DEAD_RELOC_TREE bit first.
This patch introduces wrappers that provide the correct way to check the
bit with memory barriers protection.
Most callers don't distinguish merged reloc tree and no reloc tree. The
only exception is should_ignore_root(), as merged reloc tree can be
ignored, while no reloc tree shouldn't.
[CRITICAL SECTION ANALYSIS]
Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
DEAD_RELOC_TREE bit has extra help from transaction as a higher level
barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
NULL: reloc_root is NULL PTR: reloc_root is not NULL
0: DEAD_RELOC_ROOT bit not set DEAD: DEAD_RELOC_ROOT bit set
(NULL, 0) Initial state __
| /\ Section A
btrfs_init_reloc_root() \/
| __
(PTR, 0) reloc_root initialized /\
| |
btrfs_update_reloc_root() | Section B
| |
(PTR, DEAD) reloc_root has been merged \/
| __
=== btrfs_commit_transaction() ====================
| /\
clean_dirty_subvols() |
| | Section C
(NULL, DEAD) reloc_root cleanup starts \/
| __
btrfs_drop_snapshot() /\
| | Section D
(NULL, 0) Back to initial state \/
Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds
transaction handle, so none of such caller can cross transaction boundary.
In Section A, every caller just found no DEAD bit, and grab reloc_root.
In the cross section A-B, caller may get no DEAD bit, but since reloc_root
is still completely valid thus accessing reloc_root is completely safe.
No test_bit() caller can cross the boundary of Section B and Section C.
In Section C, every caller found the DEAD bit, so no one will access
reloc_root.
In the cross section C-D, either caller gets the DEAD bit set, avoiding
access reloc_root no matter if it's safe or not. Or caller get the DEAD
bit cleared, then access reloc_root, which is already NULL, nothing will
be wrong.
The memory write barriers are between the reloc_root updates and bit
set/clear, the pairing read side is before test_bit.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ barriers ]
Signed-off-by: David Sterba <dsterba@suse.com>
When running xfstests on the current btrfs I get the following splat from
kmemleak:
unreferenced object 0xffff88821b2404e0 (size 32):
comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff ...........&....
10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff ...&.... ..&....
backtrace:
[<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs]
[<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs]
[<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs]
[<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs]
[<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs]
[<00000000f0188930>] process_one_work+0x1cf/0x350
[<00000000af5f2f8e>] worker_thread+0x28/0x3c0
[<00000000b55a1add>] kthread+0x109/0x120
[<00000000f88cbd17>] ret_from_fork+0x35/0x40
This corresponds to:
(gdb) l *(btrfs_find_all_roots_safe+0x41)
0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413).
1408
1409 tmp = ulist_alloc(GFP_NOFS);
1410 if (!tmp)
1411 return -ENOMEM;
1412 *roots = ulist_alloc(GFP_NOFS);
1413 if (!*roots) {
1414 ulist_free(tmp);
1415 return -ENOMEM;
1416 }
1417
Following the lifetime of the allocated 'roots' ulist, it gets freed
again in btrfs_qgroup_account_extent().
But this does not happen if the function is called with the
'BTRFS_FS_QUOTA_ENABLED' flag cleared, then btrfs_qgroup_account_extent()
does a short leave and directly returns.
Instead of directly returning we should jump to the 'out_free' in order to
free all resources as expected.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in
any way, and then continue to delete the reference. This shouldn't
happen, we have these values because there's more to the reference than
the original root and the sub root. If any of these checks fail, return
-ENOENT.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have the following sequence of events
btrfs sub create A
btrfs sub create A/B
btrfs sub snap A C
mkdir C/foo
mv A/B C/foo
rm -rf *
We will end up with a transaction abort.
The reason for this is because we create a root ref for B pointing to A.
When we create a snapshot of C we still have B in our tree, but because
the root ref points to A and not C we will make it appear to be empty.
The problem happens when we move B into C. This removes the root ref
for B pointing to A and adds a ref of B pointing to C. When we rmdir C
we'll see that we have a ref to our root and remove the root ref,
despite not actually matching our reference name.
Now btrfs_del_root_ref() allowing this to work is a bug as well, however
we know that this inode does not actually point to a root ref in the
first place, so we shouldn't be calling btrfs_del_root_ref() in the
first place and instead simply look up our dir index for this item and
do the rest of the removal.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_unlink_subvol takes the name of the dentry and the root objectid
based on what kind of inode this is, either a real subvolume link or a
empty one that we inherited as a snapshot. We need to fix how we unlink
in the case for BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework
btrfs_unlink_subvol to just take the dentry and handle getting the right
objectid given the type of inode this is. There is no functional change
here, simply pushing the work into btrfs_unlink_subvol() proper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4PenMACgkQxWXV+ddt
WDtsyg/6A86Da1zpLQKHiQtfhH629oeL/lFI0Cz+52xcgKSAUX8pHrWzKawSVpTT
OTgksskHREPDjyDQytxAUKJnuplFEF7sCFNhzhsjQ7cvhC0lLDmW0VQr1Wob1/c4
miW18YXN2DrYFQqdCeypvq1FItglxuBmsa9eITYzytRZcATY+3b506FhT/d8NIiH
f3nnChUiNuyfAiez1SC4FavDpVma1O6SpPMk29wUN/+/Dnd4aBrfvhuygy7xyIOK
rzotRR5xagQDpei+99lT2hys4Pv0yEOajoGmgbgpaZNP0vgQcBLaF5UTeMv/Ib2j
i+muu1rWi4R6lS3hh30kRSifKmPF7I9JB1dbd5gPfZiERlDQk+qZWrPFv9l0lf7M
R3jn04EaPVNr0dtftRFF2VvpIlUQYgIvwZHx4Td6Oy9XO0X4n5vlKxYg9aHmybJ2
Vni13ad2oWNLo2Dd1eUYrEJ/QwGc1pq7JU9HiOST7yEJv1FN++AF4qTYTyvc7Yl4
HrN349/2k2J9vU88BIlXIxYG73AL9lIpGF2ROiEw+xn4oOevDP/5HCP8ZvELoNVM
NqhJGoURu+AhZ7Rm3RYAYySShjQPF+qw/Dnz8sowQsEFUQ990E1wUxWZwTOP6+JQ
cdwzYiXYIExTghfQdUDdwNlpCXUbSuJxFOGjKmhBuzA7HHwGf1Q=
=jGvW
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes for btrfs:
- blkcg accounting problem with compression that could stall writes
- setting up blkcg bio for compression crashes due to NULL bdev
pointer
- fix possible infinite loop in writeback for nocow files (here
possible means almost impossible, 13 things that need to happen to
trigger it)"
* tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix infinite loop during nocow writeback due to race
btrfs: fix compressed write bio blkcg attribution
btrfs: punt all bios created in btrfs_submit_compressed_write()
When starting writeback for a range that covers part of a preallocated
extent, due to a race with writeback for another range that also covers
another part of the same preallocated extent, we can end up in an infinite
loop.
Consider the following example where for inode 280 we have two dirty
ranges:
range A, from 294912 to 303103, 8192 bytes
range B, from 348160 to 438271, 90112 bytes
and we have the following file extent item layout for our inode:
leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5
(...)
item 27 key (280 108 200704) itemoff 14598 itemsize 53
extent data disk bytenr 0 nr 0 type 1 (regular)
extent data offset 0 nr 94208 ram 94208
item 28 key (280 108 294912) itemoff 14545 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc)
extent data offset 0 nr 81920 ram 81920
Then the following happens:
1) Writeback starts for range B (from 348160 to 438271), execution of
run_delalloc_nocow() starts;
2) The first iteration of run_delalloc_nocow()'s whil loop leaves us at
the extent item at slot 28, pointing to the prealloc extent item
covering the range from 294912 to 376831. This extent covers part of
our range;
3) An ordered extent is created against that extent, covering the file
range from 348160 to 376831 (28672 bytes);
4) We adjust 'cur_offset' to 376832 and move on to the next iteration of
the while loop;
5) The call to btrfs_lookup_file_extent() leaves us at the same leaf,
pointing to slot 29, 1 slot after the last item (the extent item
we processed in the previous iteration);
6) Because we are a slot beyond the last item, we call btrfs_next_leaf(),
which releases the search path before doing a another search for the
last key of the leaf (280 108 294912);
7) Right after btrfs_next_leaf() released the path, and before it did
another search for the last key of the leaf, writeback for the range
A (from 294912 to 303103) completes (it was previously started at
some point);
8) Upon completion of the ordered extent for range A, the prealloc extent
we previously found got split into two extent items, one covering the
range from 294912 to 303103 (8192 bytes), with a type of regular extent
(and no longer prealloc) and another covering the range from 303104 to
376831 (73728 bytes), with a type of prealloc and an offset of 8192
bytes. So our leaf now has the following layout:
leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5
(...)
item 27 key (280 108 200704) itemoff 14598 itemsize 53
extent data disk bytenr 0 nr 0 type 1
extent data offset 0 nr 8192 ram 94208
item 28 key (280 108 208896) itemoff 14545 itemsize 53
extent data disk bytenr 10433142784 nr 86016 type 1
extent data offset 0 nr 86016 ram 86016
item 29 key (280 108 294912) itemoff 14492 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 1
extent data offset 0 nr 8192 ram 81920
item 30 key (280 108 303104) itemoff 14439 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 2
extent data offset 8192 nr 73728 ram 81920
9) After btrfs_next_leaf() returns, we have our path pointing to that same
leaf and at slot 30, since it has a key we didn't have before and it's
the first key greater then the key that was previously the last key of
the leaf (key (280 108 294912));
10) The extent item at slot 30 covers the range from 303104 to 376831
which is in our target range, so we process it, despite having already
created an ordered extent against this extent for the file range from
348160 to 376831. This is because we skip to the next extent item only
if its end is less than or equals to the start of our delalloc range,
and not less than or equals to the current offset ('cur_offset');
11) As a result we compute 'num_bytes' as:
num_bytes = min(end + 1, extent_end) - cur_offset;
= min(438271 + 1, 376832) - 376832 = 0
12) We then call create_io_em() for a 0 bytes range starting at offset
376832;
13) Then create_io_em() enters an infinite loop because its calls to
btrfs_drop_extent_cache() do nothing due to the 0 length range
passed to it. So no existing extent maps that cover the offset
376832 get removed, and therefore calls to add_extent_mapping()
return -EEXIST, resulting in an infinite loop. This loop from
create_io_em() is the following:
do {
btrfs_drop_extent_cache(BTRFS_I(inode), em->start,
em->start + em->len - 1, 0);
write_lock(&em_tree->lock);
ret = add_extent_mapping(em_tree, em, 1);
write_unlock(&em_tree->lock);
/*
* The caller has taken lock_extent(), who could race with us
* to add em?
*/
} while (ret == -EEXIST);
Also, each call to btrfs_drop_extent_cache() triggers a warning because
the start offset passed to it (376832) is smaller then the end offset
(376832 - 1) passed to it by -1, due to the 0 length:
[258532.052621] ------------[ cut here ]------------
[258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
(...)
[258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G W 5.4.0-rc7-btrfs-next-64 #1
[258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
(...)
[258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287
[258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000
[258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308
[258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001
[258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000
[258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308
[258532.052705] FS: 00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000
[258532.052707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0
[258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[258532.052717] Call Trace:
[258532.052718] ? preempt_schedule_common+0x32/0x70
[258532.052722] ? ___preempt_schedule+0x16/0x20
[258532.052741] create_io_em+0xff/0x180 [btrfs]
[258532.052767] run_delalloc_nocow+0x942/0xb10 [btrfs]
[258532.052791] btrfs_run_delalloc_range+0x30b/0x520 [btrfs]
[258532.052812] ? find_lock_delalloc_range+0x221/0x250 [btrfs]
[258532.052834] writepage_delalloc+0xe4/0x140 [btrfs]
[258532.052855] __extent_writepage+0x110/0x4e0 [btrfs]
[258532.052876] extent_write_cache_pages+0x21c/0x480 [btrfs]
[258532.052906] extent_writepages+0x52/0xb0 [btrfs]
[258532.052911] do_writepages+0x23/0x80
[258532.052915] __filemap_fdatawrite_range+0xd2/0x110
[258532.052938] btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
[258532.052954] start_ordered_ops+0x57/0xa0 [btrfs]
[258532.052973] ? btrfs_sync_file+0x225/0x490 [btrfs]
[258532.052988] btrfs_sync_file+0x225/0x490 [btrfs]
[258532.052997] __x64_sys_msync+0x199/0x200
[258532.053004] do_syscall_64+0x5c/0x250
[258532.053007] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[258532.053010] RIP: 0033:0x7f86a7dfd760
(...)
[258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
[258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760
[258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000
[258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c
[258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3
[258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001
[258532.053032] irq event stamp: 1653450494
[258532.053035] hardirqs last enabled at (1653450493): [<ffffffff9dec69f9>] _raw_spin_unlock_irq+0x29/0x50
[258532.053037] hardirqs last disabled at (1653450494): [<ffffffff9d4048ea>] trace_hardirqs_off_thunk+0x1a/0x20
[258532.053039] softirqs last enabled at (1653449852): [<ffffffff9e200466>] __do_softirq+0x466/0x6bd
[258532.053042] softirqs last disabled at (1653449845): [<ffffffff9d4c8a0c>] irq_exit+0xec/0x120
[258532.053043] ---[ end trace 8476fce13d9ce20a ]---
Which results in flooding dmesg/syslog since btrfs_drop_extent_cache()
uses WARN_ON() and not WARN_ON_ONCE().
So fix this issue by changing run_delalloc_nocow()'s loop to move to the
next extent item when the current extent item ends at at offset less than
or equals to the current offset instead of the start offset.
Fixes: 80ff385665 ("Btrfs: update nodatacow code v2")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bio attribution is handled at bio_set_dev() as once we have a device, we
have a corresponding request_queue and then can derive the current css.
In special cases, we want to attribute to bio to someone else. This can
be done by calling bio_associate_blkg_from_css() or
kthread_associate_blkcg() depending on the scenario. Btrfs does this for
compressed writeback as they are handled by kworkers, so the latter can
be done here.
Commit 1a41802701 ("btrfs: drop bio_set_dev where not needed") removes
early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the
above assumption that we'll have a request_queue when we are doing
association. To fix this, switch to using kthread_associate_blkcg().
Without this, we crash in btrfs/024:
[ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510
[ 3052.107013] #PF: supervisor read access in kernel mode
[ 3052.107014] #PF: error_code(0x0000) - not-present page
[ 3052.107015] PGD 0 P4D 0
[ 3052.107021] Oops: 0000 [#1] SMP
[ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712
[ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
[ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper
[ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0
[ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282
[ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000
[ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0
[ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400
[ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0
[ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8
[ 3052.293367] FS: 0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000
[ 3052.293368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0
[ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3052.293372] PKRU: 55555554
[ 3052.293376] Call Trace:
[ 3052.402552] btrfs_submit_compressed_write+0x137/0x390
[ 3052.402558] submit_compressed_extents+0x40f/0x4c0
[ 3052.422401] btrfs_work_helper+0x246/0x5a0
[ 3052.422408] process_one_work+0x200/0x570
[ 3052.438601] ? process_one_work+0x180/0x570
[ 3052.438605] worker_thread+0x4c/0x3e0
[ 3052.438614] kthread+0x103/0x140
[ 3052.460735] ? process_one_work+0x570/0x570
[ 3052.460737] ? kthread_mod_delayed_work+0xc0/0xc0
[ 3052.460744] ret_from_fork+0x24/0x30
Fixes: 1a41802701 ("btrfs: drop bio_set_dev where not needed")
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Compressed writes happen in the background via kworkers. However, this
causes bios to be attributed to root bypassing any cgroup limits from
the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will
punt the bio to an appropriate cgroup specific workqueue and attribute
the IO properly. However, if btrfs_submit_compressed_write() creates a
new bio, we don't tag it the same way. Add the appropriate tagging for
subsequent bios.
Fixes: ec39f7696c ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios")
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl35CAYACgkQxWXV+ddt
WDur/w//S98RvSZYMW5y2u+bPGe8sCpXwu5Sr87hTd14We8cBWj8684npUmSk7Dz
rTRSjcf9EQe5dGoiHOzpKU0HcsLKy9DVTPigvVbmsWZfT9mqS6Y8wAKMw/7UUvyy
n7aZk/yQGRow3gZ/Z/aF23JypRoDJK7DPbSMKUW164BnD5rCCyr+VdA8V+CwHgVh
UN6UG0KMDbDKS4501DsX8418pcJN+a+Jo4oBGwN/guKRjK1oNcrhj34DNhvXlaOV
Rlu7HcVtfHNDS/xD3DZS9mDIiycJ6qHkvC3hUsEmlKRoPEm1leVxTDLDf78oEy9H
TrvH71hbvYjxaOU4YQbJG8ky+VwFfiV0Vrj73GgdEeRRDuMbYwUyFI5gYQOji8fS
DuYdJGyslOqQovpii+jrPiT1TPG+97R4+qKH2DfOW1xUChYsbQHt7FOfzUbLe0JE
dev9zV6MRqZ1qf70+Wt2LuWYFefpg9KVnsn8mcjoBwz9s9uImzLgpI90+DMPKOaU
TizwJK3W5K3YLhqPHwPLvqVxKwVOzu00v01xl/bjTuyp982oPSCj3fj+FprGV1la
OkqOYbKe2ZqEkQpINDu8I58oydTKywZGVsUl4ldJlcSY1hEDFCyeoAFmixaJbRbQ
IdBcQnjD7qgvu9E4cA0kL8Ma1op2+1zw8sUOdXKFIDiNEqL5FPs=
=AnHL
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A mix of regression fixes and regular fixes for stable trees:
- fix swapped error messages for qgroup enable/rescan
- fixes for NO_HOLES feature with clone range
- fix deadlock between iget/srcu lock/synchronize srcu while freeing
an inode
- fix double lock on subvolume cross-rename
- tree log fixes
* fix missing data checksums after replaying a log tree
* also teach tree-checker about this problem
* skip log replay on orphaned roots
- fix maximum devices constraints for RAID1C -3 and -4
- send: don't print warning on read-only mount regarding orphan
cleanup
- error handling fixes"
* tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: remove WARN_ON for readonly mount
btrfs: do not leak reloc root if we fail to read the fs root
btrfs: skip log replay on orphaned roots
btrfs: handle ENOENT in btrfs_uuid_tree_iterate
btrfs: abort transaction after failed inode updates in create_subvol
Btrfs: fix hole extent items with a zero size after range cloning
Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
Btrfs: make tree checker detect checksum items with overlapping ranges
Btrfs: fix missing data checksums after replaying a log tree
btrfs: return error pointer from alloc_test_extent_buffer
btrfs: fix devs_max constraints for raid1c3 and raid1c4
btrfs: tree-checker: Fix error format string for size_t
btrfs: don't double lock the subvol_sem for rename exchange
btrfs: handle error in btrfs_cache_block_group
btrfs: do not call synchronize_srcu() in inode_tree_del
Btrfs: fix cloning range with a hole when using the NO_HOLES feature
btrfs: Fix error messages in qgroup_rescan_init
We log warning if root::orphan_cleanup_state is not set to
ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
mounted as readonly we skip the orphan item cleanup during the lookup
and root::orphan_cleanup_state remains at the init state 0 instead of
ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
warning as below.
WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);
WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
Call Trace:
::
_btrfs_ioctl_send+0x7b/0x110 [btrfs]
btrfs_ioctl+0x150a/0x2b00 [btrfs]
::
do_vfs_ioctl+0xa9/0x620
? __fget+0xac/0xe0
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x49/0x130
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reproducer:
mkfs.btrfs -fq /dev/sdb
mount /dev/sdb /btrfs
btrfs subvolume create /btrfs/sv1
btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
umount /btrfs
mount -o ro /dev/sdb /btrfs
btrfs send /btrfs/ss1 -f /tmp/f
The warning exists because having orphan inodes could confuse send and
cause it to fail or produce incorrect streams. The two cases that would
cause such send failures, which are already fixed are:
1) Inodes that were unlinked - these are orphanized and remain with a
link count of 0. These caused send operations to fail because it
expected to always find at least one path for an inode. However this
is no longer a problem since send is now able to deal with such
inodes since commit 46b2f4590a ("Btrfs: fix send failure when root
has deleted files still open") and treats them as having been
completely removed (the state after an orphan cleanup is performed).
2) Inodes that were in the process of being truncated. These resulted in
send not knowing about the truncation and potentially issue write
operations full of zeroes for the range from the new file size to the
old file size. This is no longer a problem because we no longer
create orphan items for truncation since commit f7e9e8fc79 ("Btrfs:
stop creating orphan items for truncate").
As such before these commits, the WARN_ON here provided a clue in case
something went wrong. Instead of being a warning against the
root::orphan_cleanup_state value, it could have been more accurate by
checking if there were actually any orphan items, and then issue a
warning only if any exists, but that would be more expensive to check.
Since orphanized inodes no longer cause problems for send, just remove
the warning.
Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
CC: stable@vger.kernel.org # 4.19+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root. Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root. We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached. Fix this by simply skipping
replaying the log. We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item(). However we've
done a btrfs_release_path() at this point and no longer have a valid
path. So increment the key and go back and do a normal search.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Normally when cloning a file range if we find an implicit hole at the end
of the range we assume it is because the NO_HOLES feature is enabled.
However that is not always the case. One well known case [1] is when we
have a power failure after mixing buffered and direct IO writes against
the same file.
In such cases we need to punch a hole in the destination file, and if
the NO_HOLES feature is not enabled, we need to insert explicit file
extent items to represent the hole. After commit 690a5dbfc5
("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning
extents"), we started to insert file extent items representing the hole
with an item size of 0, which is invalid and should be 53 bytes (the size
of a btrfs_file_extent_item structure), resulting in all sorts of
corruptions and invalid memory accesses. This is detected by the tree
checker when we attempt to write a leaf to disk.
The problem can be sporadically triggered by test case generic/561 from
fstests. That test case does not exercise power failure and creates a new
filesystem when it starts, so it does not use a filesystem created by any
previous test that tests power failure. However the test does both
buffered and direct IO writes (through fsstress) and it's precisely that
which is creating the implicit holes in files. That happens even before
the commit mentioned earlier. I need to investigate why we get those
implicit holes to check if there is a real problem or not. For now this
change fixes the regression of introducing file extent items with an item
size of 0 bytes.
Fix the issue by calling btrfs_punch_hole_range() without passing a
btrfs_clone_extent_info structure, which ensures file extent items are
inserted to represent the hole with a correct item size. We were passing
a btrfs_clone_extent_info with a value of 0 for its 'item_size' field,
which was causing the insertion of file extent items with an item size
of 0.
[1] https://www.spinics.net/lists/linux-btrfs/msg75350.html
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:
1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
tree mod log;
2) The task which got assigned the sequence number 1 calls
btrfs_put_tree_mod_seq();
3) Sequence number 1 is deleted from the list of sequence numbers;
4) The current minimum sequence number is computed to be the sequence
number 2;
5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
a pointer to one of its elements from the red black tree through
a call to tree_mod_log_search();
6) The task with sequence number 1 iterates the red black tree of tree
modification elements and deletes (and frees) all elements with a
sequence number less then or equals to 2 (the computed minimum sequence
number) - it ends up only leaving elements with sequence numbers of 3
and 4;
7) The task with sequence number 2 now uses the pointer to its element,
already freed by the other task, at __tree_mod_log_rewind(), resulting
in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
a trace like the following:
[16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1
[16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[16804.548666] RIP: 0010:rb_next+0x16/0x50
(...)
[16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
[16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
[16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
[16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
[16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
[16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
[16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
[16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
[16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16804.557583] Call Trace:
[16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs]
[16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
[16804.560700] find_parent_nodes+0x388/0x1120 [btrfs]
[16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs]
[16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs]
[16804.563112] ? __might_fault+0x11/0x90
[16804.563706] do_vfs_ioctl+0x45a/0x700
[16804.564299] ksys_ioctl+0x70/0x80
[16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20
[16804.565461] __x64_sys_ioctl+0x16/0x20
[16804.566020] do_syscall_64+0x5c/0x250
[16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[16804.567153] RIP: 0033:0x7f4b1ba2add7
(...)
[16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
[16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
[16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
[16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
[16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
(...)
[16804.575623] ---[ end trace 87317359aad4ba50 ]---
Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Having checksum items, either on the checksums tree or in a log tree, that
represent ranges that overlap each other is a sign of a corruption. Such
case confuses the checksum lookup code and can result in not being able to
find checksums or find stale checksums.
So add a check for such case.
This is motivated by a recent fix for a case where a log tree had checksum
items covering ranges that overlap each other due to extent cloning, and
resulted in missing checksums after replaying the log tree. It also helps
detect past issues such as stale and outdated checksums due to overlapping,
commit 27b9a8122f ("Btrfs: fix csum tree corruption, duplicate and
outdated checksums").
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Callers of alloc_test_extent_buffer have not correctly interpreted the
return value as error pointer, as alloc_test_extent_buffer should behave
as alloc_extent_buffer. The self-tests were unaffected but
btrfs_find_create_tree_block could call both functions and that would
cause problems up in the call chain.
Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value 0 for devs_max means to spread the allocated chunks over all
available devices, eg. stripe for RAID0 or RAID5. This got mistakenly
copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and
4 copies respectively.
Fixes: 47e6f7423b ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087 ("btrfs: add support for 4-copy replication (raid1c4)")
Signed-off-by: David Sterba <dsterba@suse.com>
Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(),
which returns type size_t, so we need %zu instead of %lu.
This fixes a build warning on 32-bit ARM:
../fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=]
230 | "invalid item size, have %u expect [%lu, %u)",
| ~~^
| long unsigned int
| %u
Fixes: 153a6d2999 ("btrfs: tree-checker: Check item size before reading file extent type")
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andreas Färber <afaerber@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're rename exchanging two subvols we'll try to lock this lock
twice, which is bad. Just lock once if either of the ino's are subvols.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a BUG_ON(ret < 0) in find_free_extent from
btrfs_cache_block_group. If we fail to allocate our ctl we'll just
panic, which is not good. Instead just go on to another block group.
If we fail to find a block group we don't want to return ENOSPC, because
really we got a ENOMEM and that's the root of the problem. Save our
return from btrfs_cache_block_group(), and then if we still fail to make
our allocation return that ret so we get the right error back.
Tested with inject-error.py from bcc.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.
Process A
unlink
-> final iput
-> inode_tree_del
-> synchronize_srcu(subvol_srcu)
Process B
btrfs_lookup <- srcu_read_lock() acquired here
-> btrfs_iget
-> find inode that has I_FREEING set
-> __wait_on_freeing_inode()
We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing. However because the free'ing process is doing a
synchronize_srcu() we deadlock.
Fix this by dropping the synchronize_srcu() in inode_tree_del(). We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.
A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.
Fixes: 76dda93c6a ("Btrfs: add snapshot/subvolume destroy ioctl")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature if we clone a range that contains a hole
and a temporary ENOSPC happens while dropping extents from the target
inode's range, we can end up failing and aborting the transaction with
-EEXIST or with a corrupt file extent item, that has a length greater
than it should and overlaps with other extents. For example when cloning
the following range from inode A to inode B:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ]
0 1MB 5MB 6MB
Range to clone: [1MB, 6MB)
Inode B:
extent B1 extent B2 extent B3 extent B4
[ ---------- ] [ --------- ] [ ---------- ] [ ---------- ]
0 1MB 1MB 2MB 2MB 5MB 5MB 6MB
Target range: [1MB, 6MB) (same as source, to make it easier to explain)
The following can happen:
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
1MB;
4) We then attempt to insert a file extent item at inode B with a file
offset of 5MB, which is the value of clone_info->file_offset. This
fails with error -EEXIST because there's already an extent at that
offset (extent B4);
5) We abort the current transaction with -EEXIST and return that error
to user space as well.
Another example, for extent corruption:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ]
0 1MB 11MB 12MB
Inode B:
extent B1 extent B2
[ ----------- ] [ --------- ] [ ----------------------------- ]
0 1MB 1MB 5MB 5MB 12MB
Target range: [1MB, 12MB) (same as source, to make it easier to explain)
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
4MB;
4) We then insert a file extent item at inode B with a file offset of 11MB
which is the value of clone_info->file_offset, and a length of 4MB (the
value of 'clone_len'). So we get 2 extents items with ranges that
overlap and an extent length of 4MB, larger then the extent A2 from
inode A (1MB length);
5) After that we end the transaction, balance the btree dirty pages and
then start another or join the previous transaction. It might happen
that the transaction which inserted the incorrect extent was committed
by another task so we end up with extent corruption if a power failure
happens.
So fix this by making sure we attempt to insert the extent to clone at
the destination inode only if we are past dropping the sub-range that
corresponds to a hole.
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The branch of qgroup_rescan_init which is executed from the mount
path prints wrong errors messages. The textual print out in case
BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not
set are transposed. Fix it by exchanging their place.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Because the BLAKE2B code went through a different tree, it was not
available at the time the btrfs part was merged. Now that the Kconfig
symbol exists, add it to the list.
Signed-off-by: David Sterba <dsterba@suse.com>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.
Switch the btrfs_device_set_…() macro over to use CONFIG_PREEMPTION.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-btrfs@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-25-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need support
for time64_t.
In completely unrelated work, I spent time on cleaning up parts of this
file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the rest
of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which is
the scsi ioctl handlers. I have patches for those as well, but they need
more testing or possibly a rewrite.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
+ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
bN65armTm7bFFR32Avnu
=lgCl
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.
In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"
* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...
After previous patches removing bdev being passed around to set it to
bio, it has become unused in submit_extent_page. So it now has "only" 13
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.
Signed-off-by: David Sterba <dsterba@suse.com>
bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
also changing some state bits if there was a different bdev set before.
This is one thing that's not needed.
Another thing is that setting a bdev at bio allocation time is too early
and actually does not work with plain redundancy profiles, where each
time we submit a bio to a device, the bdev is set correctly.
In many places the bio bdev is set to latest_bdev that seems to serve as
a stub pointer "just to put something to bio". But we don't have to do
that.
Where do we know which bdev to set:
* for regular IO: submit_stripe_bio that's called by btrfs_map_bio
* repair IO: repair_io_failure, read or write from specific device
* super block write (using buffer_heads but uses raw bdev) and barriers
* scrub: this does not use all regular IO paths as it needs to reach all
copies, verify and fixup eventually, and for that all bdev management
is independent
* raid56: rbio_add_io_page, for the RMW write
* integrity-checker: does it's own low-level block tracking
Signed-off-by: David Sterba <dsterba@suse.com>
This is preparatory patch to remove @bdev parameter from
submit_extent_page. It can't be removed completely, because the cgroups
need it for wbc when initializing the bio
wbc_init_bio
bio_associate_blkg_from_css
dereference bdev->bi_disk->queue
The bdev pointer is the same as latest_bdev, thus no functional change.
We can retrieve it from fs_devices that's reachable through several
dereferences. The local variable shadows the parameter, but that's only
temporary.
Signed-off-by: David Sterba <dsterba@suse.com>
Testing with the new fsstress support for subvolumes uncovered a pretty
bad problem with rename exchange on subvolumes. We're modifying two
different subvolumes, but we only start the transaction on one of them,
so the other one is not added to the dirty root list. This is caught by
btrfs_cow_block() with a warning because the root has not been updated,
however if we do not modify this root again we'll end up pointing at an
invalid root because the root item is never updated.
Fix this by making sure we add the destination root to the trans list,
the same as we do with normal renames. This fixes the corruption.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
set the block group to RO mode and then wait for any ongoing writes into
extents of the block group to complete. While doing that wait we overwrite
the value of the variable 'ret' and can break out of the loop if an error
happens without turning the block group back into RW mode. So what happens
is the following:
1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
to RO mode (its ->ro field set to 1 or incremented to some value > 1);
2) Then btrfs_wait_ordered_roots() returns a value > 0;
3) Then if either joining or committing the transaction fails, we break
out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
the block group in RO mode forever.
To fix this, just remove the code that waits for ongoing writes to extents
of the block group, since it's not needed because in the initial setup
phase of a device replace operation, before starting to find all chunks
and their extents, we set the target device for replace while holding
fs_info->dev_replace->rwsem, which ensures that after releasing that
semaphore, any writes into the source device are made to the target device
as well (__btrfs_map_block() guarantees that). So while at
scrub_enumerate_chunks() we only need to worry about finding and copying
extents (from the source device to the target device) that were written
before we started the device replace operation.
Fixes: f0e9b7d640 ("Btrfs: fix race setting block group readonly during device replace")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.
Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.
Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For read_one_block_group(), its only caller has already got the item key
to search next block group item.
So we can use that key directly without doing our own convertion on
stack.
Also, since that key used in btrfs_read_block_groups() is vital for
block group item search, add 'const' keyword for that parameter to
prevent read_one_block_group() to modify it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the work inside the loop of btrfs_read_block_groups() into one
separate function, read_one_block_group().
This allows read_one_block_group to be reused for later BG_TREE feature.
The refactor does the following extra fix:
- Use btrfs_fs_incompat() to replace open-coded feature check
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
once policies can be found here
https://lwn.net/Articles/799218/#Access-Marking%20Policies .
The locked and unlocked access to eb::blocking_writers should be
annotated accordingly, following this:
Writes:
- locked write must use ONCE, may use plain read
- unlocked write must use ONCE
Reads:
- unlocked read must use ONCE
- locked read may use plain read iff not mixed with unlocked read
- unlocked read then locked must use ONCE
There's one difference on the assembly level, where
btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
value and did not reevaluate it after taking the lock. This could have
missed some opportunities to take the lock in case blocking writers
changed between the calls, but the window is just a few instructions
long. As this is in try-lock, the callers handle that.
Signed-off-by: David Sterba <dsterba@suse.com>
The increment and decrement was inherited from previous version that
used atomics, switched in commit 06297d8cef ("btrfs: switch
extent_buffer blocking_writers from atomic to int"). The only possible
values are 0 and 1 so we can set them directly.
The generated assembly (gcc 9.x) did the direct value assignment in
btrfs_set_lock_blocking_write (asm diff after change in 06297d8cef):
5d: test %eax,%eax
5f: je 62 <btrfs_set_lock_blocking_write+0x22>
61: retq
- 62: lock incl 0x44(%rdi)
- 66: add $0x50,%rdi
- 6a: jmpq 6f <btrfs_set_lock_blocking_write+0x2f>
+ 62: movl $0x1,0x44(%rdi)
+ 69: add $0x50,%rdi
+ 6d: jmpq 72 <btrfs_set_lock_blocking_write+0x32>
The part in btrfs_tree_unlock did a decrement because
BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
otherwise the output looks safe:
- lock decl 0x44(%rdi)
+ sub $0x1,%eax
+ mov %eax,0x44(%rdi)
Signed-off-by: David Sterba <dsterba@suse.com>
There are two ifs that use eb::blocking_writers. As this is a variable
modified inside and outside of locks, we could minimize number of
accesses to avoid problems with getting different results at different
times.
The access here is locked so this can only race with btrfs_tree_unlock
that sets blocking_writers to 0 without lock and unsets the lock owner.
The first branch is taken only if the same thread already holds the
lock, the second if checks for blocking writers. Here we'd either unlock
and wait, or proceed. Both are valid states of the locking protocol.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
When there are no raid1c3 or raid1c4 block groups left after balance
(either convert or with other filters applied), remove the incompat bit.
This is already done for RAID56, do the same for RAID1C34.
Signed-off-by: David Sterba <dsterba@suse.com>
The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.
The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.
The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.
Signed-off-by: David Sterba <dsterba@suse.com>
In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
cow_file_range_async gained wbc as a parameter and this makes passing
write flags redundant. Set it inside the function and remove the
parameter.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage reads write flags from wbc and passes both to
__extent_writepage_io. This makes write_flags redundant and we can
remove it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use
too much memory when extents have many references. This change simply
skips backreference walking when an extent has more than 64 references,
in which case we fallback to a write operation instead of a clone
operation. This limit is conservative and in practice I observed no
signicant slowdown with up to 100 references and still low memory usage
up to that limit.
This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.
Reported-by: Atemu <atemu.main@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For send we currently skip clone operations when the source and
destination files are the same. This is so because clone didn't support
this case in its early days, but support for it was added back in May
2013 by commit a96fbc7288 ("Btrfs: allow file data clone within a
file"). This change adds support for it.
Example:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
$ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar
$ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap
$ mkfs.btrfs -f /dev/sde
$ mount /dev/sde /mnt/sde
$ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde
Without this change file foobar at the destination has a single 128Kb
extent:
$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof
/mnt/sde/snap/foobar: 1 extent found
With this we get a single 64Kb extent that is shared at file offsets 0
and 64K, just like in the source filesystem:
$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 15: 3328.. 3343: 16: shared
1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof
/mnt/sde/snap/foobar: 2 extents found
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.
- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old code goes:
offset = logical - em->start;
length = min_t(u64, em->len - offset, length);
Where @length calculation is dependent on offset, it can take reader
several more seconds to find it's just the same code as:
offset = logical - em->start;
length = min_t(u64, em->start + em->len - logical, length);
Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In check_extent_data_item(), we read file extent type without verifying
if the item size is valid.
Add such check to ensure the file extent type we read is correct.
The check is not as accurate as we need to cover both inline and regular
extents, so it only checks if the item size is larger or equal to inline
header.
So the existing size checks on inline/regular extents are still needed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
the same lock but Smatch is not clever enough to figure that out so it
leads to static checker warnings. It's better to use it consistently
anyway.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The backup_root_index member stores the index at which the backup root
should be saved upon next transaction commit. However, there is a
small deviation from this behavior in the form of a check in
backup_super_roots which checks if current root generation equals to the
generation of the previous root. This can trigger in the following
scenario:
slot0: gen-2
slot1: gen-1
slot2: gen
slot3: unused
Now suppose slot3 (which is also the root specified in the super block)
is corrupted hence init_tree_roots chooses to use the backup root at
slot2, meaning read_backup_root will read slot2 and assign the
superblock generation to gen-1. Despite this backup_root_index will
point at slot3 because its init happens in init_backup_root_slot, long
before any parsing of the backup roots occur. Then on next transaction
start, gen-1 will be incremented by 1 making the root's generation
equal gen. Subsequently, on transaction commit the following check
triggers:
if (btrfs_backup_tree_root_gen(root_backup) ==
btrfs_header_generation(info->tree_root->node))
This causes the 'next_backup', which is the index at which the backup is
going to be written to, to set to last_backup, which will be slot2.
All of this is a very confusing way of expressing the following
invariant:
Always write a backup root at the index following the last used backup
root.
This commit streamlines this logic by setting backup_root_index to the
next index after the one used for mount.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old name name was an awful misnomer because it didn't really find
the oldest super backup per-se but rather its slot. For example if we
have:
slot0: gen - 2
slot1: gen - 1
slot2: gen
slot3: empty
init_backup_root_slot will return slot3 and not slot0.
The new name is more appropriate since the function doesn't care whether
there is a valid backup in the returned slot or not.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function has been superseded by previous commits and is no longer
used so just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the filesystem is not well formed and no trees are loaded it's
pointless holding the objectid_mutex. Just remove its usage.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code responsible for reading and initializing tree roots is
scattered in open_ctree among 2 labels, emulating a loop. This is rather
confusing to reason about. Instead, factor the code to a new function,
init_tree_roots which implements the same logical flow.
There are a couple of notable differences, namely:
* Instead of using next_backup_root it's using the newly introduced
read_backup_root.
* If read_backup_root returns an error init_tree_roots propagates the
error and there is no special handling of that case e.g. the code jumps
straight to 'fail_tree_roots' label. The old code, however, was
(erroneously) jumping to 'fail_block_groups' label if next_backup_root
did fail, this was unnecessary since the tree roots init logic doesn't
modify the state of block groups.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function will replace next_root_backup with a much saner/cleaner
interface.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's no longer needed following cleanups around find_newest_backup_root
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Backup roots are always written in a circular manner. By definition we
can only ever have 1 backup root whose generation equals to that of the
superblock. Hence, the 'if' in the for loop will trigger at most once.
This is sufficient to return the newest backup root.
Furthermore the newest_gen parameter is always set to the generation of
the superblock. This value can be obtained from the fs_info.
This patch removes the unnecessary code dealing with the wraparound
case and makes 'newest_gen' a local variable.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode delalloc mutex was added a long time ago by commit f248679e86
("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
the reason for its introduction is not very clear from the change log. It
claims it solves bogus warnings from lockdep, however it lacks an example
report/warning from lockdep, or any explanation.
Since we have enough concurrentcy protection from the locks of the space
info and block reserve objects, and such lockdep warnings don't seem to
exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
the size of the btrfs_inode structure. With some quick fio tests doing
direct IO and mmap writes I couldn't observe any significant performance
increase either (direct IO writes that don't increase the file's size
don't hold the inode's lock for their entire duration and mmap writes
don't hold the inode's lock at all), which are the only type of writes
that could see any performance gain due to less serialization.
Review feedback from Josef:
The problem was taking the i_mutex in mmap, which is how I was
protecting delalloc reservations originally. The delalloc mutex didn't
come with all of the other dependencies. That's what the lockdep
messages were about, removing the lock isn't going to make them appear
again.
We _had_ to lock around this because we used to do tricks to keep from
over-reserving, and if we didn't serialize delalloc reservations we'd
end up with ugly accounting problems when we tried to clean things up.
However with my recentish changes this isn't the case anymore. Every
operation is responsible for reserving its space, and then adding it to
the inode. Then cleaning up is straightforward and can't be mucked up
by other users. So we no longer need the delalloc mutex to safe us from
ourselves.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is not used anymore since commit 957780eb27 ("Btrfs: introduce
ticketed enospc infrastructure"), so just remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_statfs() we cache fs_info::space_info in a local variable only
to use it once in a list_for_each_rcu() statement.
Not only is the local variable unnecessary it even makes the code harder
to follow as it's not clear which list it is iterating.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes thing harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.
Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key. When the item is searched or written, a
local variable for key is used.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and
manipulate the items there, there's no special prefix required. The
block group accessors had _disk_ because previously the names were
occupied by the on-stack accessors. As this has been addressed in the
previous patch, we can now unify the naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the
name, the block group ones were not following that scheme, so let's
switch them.
Signed-off-by: David Sterba <dsterba@suse.com>
The members ::used and ::flags are now in the block group cache
structure, the last one is chunk_objectid, but that's set to a fixed
value and otherwise unused. The item is constructed from a local
variable before write, so we can remove the embedded one from block
group.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The flags are read from the item that's embedded to block group struct,
but the item will be removed. Use the ::flags after read and before
write.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.
The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last user of btrfs_bio::flags was removed in commit 326e1dbb57
("block: remove management of bi_remaining when restoring original
bi_end_io"), remove it.
(Tagged for stable as the structure is heavily used and space savings
are desirable.)
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently all the checksum algorithms generate a fixed size digest size
and we use it. The on-disk format can hold up to BTRFS_CSUM_SIZE bytes
and BLAKE2b produces digest of 512 bits by default. We can't do that and
will use the blake2b-256, this needs to be passed to the crypto API.
Separate that from the base algorithm name and add a member to request
specific driver, in this case with the digest size.
The only place that uses the driver name is the crypto API setup.
Signed-off-by: David Sterba <dsterba@suse.com>
Show the used driver for the checksum algorithm for the filesystem in
sysfs file /sys/fs/btrfs/UUID/features/checksum, eg.
crc32c (crc32c-generic)
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export supported checksum algorithms via sysfs in the list of static
features:
/sys/fs/btrfs/features/supported_checksums
Space spearated list of checksum algorithm names.
Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Add sha256 to the list of possible checksumming algorithms used by BTRFS.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add xxhash64 to the list of possible checksumming algorithms used by
BTRFS.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To remove use of extent_map::bdev we need to find a replacement, and the
latest_bdev is the only one we can use here, because inode::i_bdev and
superblock::s_bdev are NULL.
The DIO code uses bdev in two places:
* to read blocksize to perform alignment checks in
do_blockdev_direct_IO, but we do them in btrfs code before any call to
DIO
* in the following call chain:
do_direct_IO
get_more_blocks
sdio->get_block() <-- this is btrfs_get_blocks_direct
subsequently the map_bh->b_dev member is used in clean_bdev_aliases
and dio_new_bio to set the bio's bdev to that of the buffer_head.
However, because we have provided a submit function dio_bio_submit
calls our submission function and ignores the bdev.
So it's safe to pass any valid bdev that's used within the filesystem.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparatory patch for removing extent_map::bdev. There's some
history behind the code so this is only precaution to catch if things
break before the actual removal happens.
Logically, comparing a raw low-level block device (bdev) does not make
sense for extent maps (high-level objects). This had no effect in
practice but was quite confusing in the code. The lookup_map is set iff
EXTENT_FLAG_FS_MAPPING is set.
The two pointers were stored in the same bytes and used potentially in
two meanings. Now they're split, so the asserts are in place to check
that the condition will not change.
The lookup map pointer misused bdev, this has been changed in commit
95617d6932 ("btrfs: cleanup, stop casting for extent_map->lookup
everywhere") to the explicit type. But the semantics hasn't changed and
bdev was not actually used to decide if maps are mergeable.
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and
then process it we could just bail out early if the read disk key wasn't
a BTRFS_CHUNK_ITEM_KEY.
This removes a level of indentation and makes the code nicer to read.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of
type BTRFS_BLOCK_GROUP_DATA and if it is we process it.
Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk
we can negate the check and bail out early if it isn't.
This makes the code a bit more readable.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In lock_stripe_add() we're caching the bucket for the stripe hash table
just for a single call to dereference the stripe hash.
If we just directly call rbio_bucket() we can safe the pointless local
variable.
Also move the dereferencing of the stripe hash outside of the variable
declaration block to not break over the 80 characters limit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In lock_stripe_add() we're traversing the stripe hash list and check if
the current list element's raid_map equals is equal to the raid bio's
raid_map. If both are equal we continue processing.
If we'd check for inequality instead of equality we can reduce one level
of indentation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using an input pointer parameter as the return value and have
an int as the return type of find_desired_extent, rework the function to
directly return the found offset. Doing that the 'ret' variable in
btrfs_llseek_file can be removed. Additional (subjective) benefit is
that btrfs' llseek function now resemebles those of the other major
filesystems.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
returning from generic_file_llseek. This makes the 'out' label
redundant. Finally return directly the vale from vfs_setpos. No
semantic changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Modifying the file position is done on a per-file basis. This renders
holding the inode lock for writing useless and makes the performance of
concurrent llseek's abysmal.
Fix this by holding the inode for read. This provides protection against
concurrent truncates and find_desired_extent already includes proper
extent locking for the range which ensures proper locking against
concurrent writes. SEEK_CUR and SEEK_END can be done lockessly.
The former is synchronized by file::f_lock spinlock. SEEK_END is not
synchronized but atomic, but that's OK since there is not guarantee that
SEEK_END will always be at the end of the file in the face of tail
modifications.
This change brings ~82% performance improvement when doing a lot of
parallel fseeks. The workload essentially does:
for (d=0; d<num_seek_read; d++)
{
/* offset %= 16777216; */
fseek (f, 256 * d % 16777216, SEEK_SET);
fread (buffer, 64, 1, f);
}
Without patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m41.412s
user 0m28.777s
sys 2m16.510s
With patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m11.479s
user 0m27.629s
sys 0m21.040s
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the ops from the type that is now passed to all functions
that would need it, this makes workspace_manager::ops redundant and can
be removed.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace indirect calls to free_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for free_workspace.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace indirect calls to alloc_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for alloc_workspace.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to get_workspace, majority of the callbacks is trivial, we don't
gain anything by the indirection, so replace them by a switch function.
Trivial callback implementations use the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Majority of the callbacks is trivial, we don't gain anything by the
indirection, so replace them by a switch function.
ZLIB needs to adjust level in the callback and ZSTD workspace management
is complex, the rest is call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The indirect calls will be replaced by a switch in compression.c.
(Switch is faster than indirect calls with when Spectre mitigations are
enabled).
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace loop calling to all algos with a list of direct calls to the
cleanup manager callback. When that becomes trivial it is replaced by
direct call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager cleanup helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace loop calling to all algos with a list of direct calls to the
init manager callback. When that becomes trivial it is replaced by
direct call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager init helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a lot of indirection when the generic code calls into
algo-specific callbacks to reach the private workspace manager structure
and back to the generic code.
To simplify that, export the workspace manager for heuristic, LZO and
ZLIB, while ZSTD is going to use it's own manager.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The indirect calls bring some overhead due to spectre vulnerability
mitigations. The number of cases is small and below the threshold
(10-20) where indirect call would be better.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export compress_pages, decompress_bio and decompress callbacks for all
compression algos. The indirect calls will be replaced by a switch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When free'ing extents in a block group we check to see if the block
group is not cached, and then cache it if we need to. However we'll
just carry on as long as we're loading the cache. This is problematic
because we are dirtying the block group here. If we are fast enough we
could do a transaction commit and clear the free space cache while we're
still loading the space cache in another thread. This truncates the
free space inode, which will keep it from loading the space cache.
Fix this by using the btrfs_block_group_cache_done helper so that we try
to load the space cache unconditionally here, which will result in the
caller waiting for the fast caching to complete and keep us from
truncating the free space inode.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While testing 5.2 we ran into the following panic
[52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
[52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
[52238.304051] Call Trace:
[52238.308958] try_to_free_buffers+0x15b/0x1b0
[52238.317503] shrink_page_list+0x1164/0x1780
[52238.325877] shrink_inactive_list+0x18f/0x3b0
[52238.334596] shrink_node_memcg+0x23e/0x7d0
[52238.342790] ? do_shrink_slab+0x4f/0x290
[52238.350648] shrink_node+0xce/0x4a0
[52238.357628] balance_pgdat+0x2c7/0x510
[52238.365135] kswapd+0x216/0x3e0
[52238.371425] ? wait_woken+0x80/0x80
[52238.378412] ? balance_pgdat+0x510/0x510
[52238.386265] kthread+0x111/0x130
[52238.392727] ? kthread_create_on_node+0x60/0x60
[52238.401782] ret_from_fork+0x1f/0x30
The page we were trying to drop had a page->private, but had no
page->mapping and so called drop_buffers, assuming that we had a
buffer_head on the page, and then panic'ed trying to deref 1, which is
our page->private for data pages.
This is happening because we're truncating the free space cache while
we're trying to load the free space cache. This isn't supposed to
happen, and I'll fix that in a followup patch. However we still
shouldn't allow those sort of mistakes to result in messing with pages
that do not belong to us. So add the page->mapping check to verify that
we still own this page after dropping and re-acquiring the page lock.
This page being unlocked as:
btrfs_readpage
extent_read_full_page
__extent_read_full_page
__do_readpage
if (!nr)
unlock_page <-- nr can be 0 only if submit_extent_page
returns an error
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add callchain ]
Signed-off-by: David Sterba <dsterba@suse.com>
In the fixup worker, if we fail to mark the range as delalloc in the io
tree, we must release the previously reserved metadata, as well as update
the outstanding extents counter for the inode, otherwise we leak metadata
space.
In pratice we can't return an error from btrfs_set_extent_delalloc(),
which is just a wrapper around __set_extent_bit(), as for most errors
__set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as
well) and returning an -EEXIST error doesn't happen in this case since
the exclusive bits parameter always has a value of 0 through this code
path. Nevertheless, just fix the error handling in the fixup worker,
in case one day __set_extent_bit() can return an error to this code
path.
Fixes: f3038ee3a3 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a buffered write it's possible to leave the subv_writers
counter of the root, used for synchronization between buffered nocow
writers and snapshotting. This happens in an exceptional case like the
following:
1) We fail to allocate data space for the write, since there's not
enough available data space nor enough unallocated space for allocating
a new data block group;
2) Because of that failure, we try to go to NOCOW mode, which succeeds
and therefore we set the local variable 'only_release_metadata' to true
and set the root's sub_writers counter to 1 through the call to
btrfs_start_write_no_snapshotting() made by check_can_nocow();
3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
to happen but not impossible;
4) No pages are copied because btrfs_copy_from_user() returned zero;
5) We call btrfs_end_write_no_snapshotting() which decrements the root's
subv_writers counter to 0;
6) We don't set 'only_release_metadata' back to 'false' because we do
it only if 'copied', the value returned by btrfs_copy_from_user(), is
greater than zero;
7) On the next iteration of the while loop, which processes the same
page range, we are now able to allocate data space for the write (we
got enough data space released in the meanwhile);
8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
now there isn't enough free metadata space, or in some other place
further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
btrfs_dirty_pages()), we break out of the while loop with
'only_release_metadata' having a value of 'true';
9) Because 'only_release_metadata' is 'true' we end up decrementing the
root's subv_writers counter to -1 (through a call to
btrfs_end_write_no_snapshotting()), and we also end up not releasing the
data space previously reserved through btrfs_check_data_free_space().
As a consequence the mechanism for synchronizing NOCOW buffered writes
with snapshotting gets broken.
Fix this by always setting 'only_release_metadata' to false at the start
of each iteration.
Fixes: 8257b2dc3c ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
Fixes: 7ee9e4405f ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some functions are doing some unnecessary indirection to reach the
btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info
struct instead of a *file.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need int argument bool shall do in free_root_pointers(). And
rename the argument as it confused two people.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression type upper limit constant is the same as the last value
and this is confusing. In order to keep coding style consistent, use
BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of
'NR' being one more than the last value.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use enum to replace macro definitions of extent types.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
DEFINE_HASHTABLE itself has already included initialization code,
we don't have to call hash_init() again, so remove it.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is used only during the final phase of freespace cache
writeout. This is necessary since using the plain btrfs_join_transaction
api is deadlock prone. The deadlock looks like:
T1:
btrfs_commit_transaction
commit_cowonly_roots
btrfs_write_dirty_block_groups
btrfs_wait_cache_io
__btrfs_wait_cache_io
btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
inode and blocks transaction commit
until freespace cache writeout
T2: <-- after T1 has triggered the writeout
finish_ordered_fn
btrfs_finish_ordered_io
btrfs_join_transaction <--- this would block waiting for current
transaction to commit, but since trans
commit is waiting for this writeout to
finish
The special purpose functions prevents it by simply skipping the "wait
for writeout" since it's guaranteed the transaction won't proceed until
we are done.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using an ASSERT in btrfs_pin_extent allows to more stringently observe
whether the function is called under a transaction or not.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper is trivial and we can understand what the atomic_inc on
something named refs does.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a cyclic writeback, extent_write_cache_pages() uses done_index
to update the writeback_index after the current run is over. However,
instead of current index + 1, it gets to to the current index itself.
Unfortunately, this, combined with returning on EOF instead of looping
back, can lead to the following pathlogical behavior.
1. There is a single file which has accumulated enough dirty pages to
trigger balance_dirty_pages() and the writer appending to the file
with a series of short writes.
2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
3. Writeback kicks in and the cursor is on the last page of the dirty
file. Writeback is started or skipped if already in progress. As
it's EOF, extent_write_cache_pages() returns and the cursor is set
to done_index which is pointing to the last page.
4. Writeback is done. Nothing happens till balance_dirty_pages
finishes, at which point we go back to #1.
This can almost completely stall out writing back of the file and keep
the system over dirty threshold for a long time which can mess up the
whole system. We encountered this issue in production with a package
handling application which can reliably reproduce the issue when
running under tight memory limits.
Reading the comment in the error handling section, this seems to be to
avoid accidentally skipping a page in case the write attempt on the
page doesn't succeed. However, this concern seems bogus.
On each page, the code either:
* Skips and moves onto the next page.
* Fails issue and sets done_index to index + 1.
* Successfully issues and continue to the next page if budget allows
and not EOF.
IOW, as long as it's not EOF and there's budget, the code never
retries writing back the same page. Only when a page happens to be
the last page of a particular run, we end up retrying the page, which
can't possibly guarantee anything data integrity related. Besides,
cyclic writes are only used for non-syncing writebacks meaning that
there's no data integrity implication to begin with.
Fix it by always setting done_index past the current page being
processed.
Note that this problem exists in other writepages too.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 4617ea3a52 (" Btrfs: fix necessary chunk tree space calculation
when allocating a chunk") removed the is_allocation argument from
check_system_chunk, since the formula for reserving the necessary space
for allocation or removing a chunk would be the same.
So, rework the comment by removing the mention of is_allocation
argument.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike read time tree checker errors, write time error can't be
inspected by "btrfs inspect dump-tree", so we need extra information to
determine what's going wrong.
The patch will add the following output for write time tree checker
error:
- The content of the offending tree block
To help determining if it's a false alert.
- Kernel WARN_ON() for debug build
This is helpful for us to detect unexpected write time tree checker
error, especially fstests could catch the dmesg.
Since the WARN_ON() is only triggered for write time tree checker,
test cases utilizing dm-error won't trigger this WARN_ON(), thus no
extra noise.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the check for prev_key->objectid of the following key types
into one function, check_prev_ino():
- EXTENT_DATA
- INODE_REF
- DIR_INDEX
- DIR_ITEM
- XATTR_ITEM
Also add the check of prev_key for INODE_REF.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression. It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Async CRCs and compression submit IO through helper threads, which means
they have IO priority inversions when cgroup IO controllers are in use.
This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.
For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.
For the async CRCs, the bio already has the correct css, we just need to
tell the block layer to use REQ_CGROUP_PUNT.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing. When compression is on, we allocate one async_chunk
structure for every 512K, and then run those pages through the
compression code for IO submission.
writepages starts all of this off with a single page, locked by the
original call to extent_write_cache_pages(), and it's important to keep
track of this page because it has already been through
clear_page_dirty_for_io().
The btrfs async_chunk struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_chunk
struct really owns that page.
But, this is racey. If a given delalloc range is broken up into two
async_chunks (chunkA and chunkB), we can end up with something like
this:
compress_file_range(chunkA)
submit_compress_extents(chunkA)
submit compressed bios(chunkA)
put_page(locked_page)
compress_file_range(chunkB)
...
Or:
async_cow_submit
submit_compressed_extents <--- falls back to buffered writeout
cow_file_range
extent_clear_unlock_delalloc
__process_pages_contig
put_page(locked_pages)
async_cow_submit
The end result is that chunkA is completed and cleaned up before chunkB
even starts processing. This means we can free locked_page() and reuse
it elsewhere. If we get really lucky, it'll have the same page->index
in its new home as it did before.
While we're processing chunkB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobufers() on chunkB->locked_page.
Without cgroups in use, this creates as a phantom dirty page, which
isn't great but isn't the end of the world. What can happen, it can go
through the fixup worker and the whole COW machinery again:
in submit_compressed_extents():
while (async extents) {
...
cow_file_range
if (!page_started ...)
extent_write_locked_range
else if (...)
unlock_page
continue;
This hasn't been observed in practice but is still possible.
With cgroups in use, we might crash in the accounting code because
page->mapping->i_wb isn't set.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
IP: percpu_counter_add_batch+0x11/0x70
PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 16 PID: 2172 Comm: rm Not tainted
RIP: 0010:percpu_counter_add_batch+0x11/0x70
RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
account_page_cleaned+0x15b/0x1f0
__cancel_dirty_page+0x146/0x200
truncate_cleanup_page+0x92/0xb0
truncate_inode_pages_range+0x202/0x7d0
btrfs_evict_inode+0x92/0x5a0
evict+0xc1/0x190
do_unlinkat+0x176/0x280
do_syscall_64+0x63/0x1a0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
The fix here is to make asyc_chunk->locked_page NULL everywhere but the
one async_chunk struct that's allowed to do things to the locked page.
Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
[ update changelog from mail thread discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call. This has been used to make sure async crc and
compression helpers don't get stuck on IO submission. To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.
All of this cost at least one extra context switch during IO submission,
and doesn't fit well with the modern blkmq IO stack. So, this commit stops
using btrfs_schedule_bio(). We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The attribute is more relaxed than const and the functions could
dereference pointers, as long as the observable state is not changed. We
do have such functions, based on -Wsuggest-attribute=pure .
The visible effects of this patch are negligible, there are differences
in the assembly but hard to summarize.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For some reason the attribute is called __attribute_const__ and not
__const, marks functions that have no observable effects on program
state, IOW not reading pointers, just the arguments and calculating a
value. Allows the compiler to do some optimizations, based on
-Wsuggest-attribute=const . The effects are rather small, though, about
60 bytes decrese of btrfs.ko.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The attribute can mark functions supposed to be called rarely if at all
and the text can be moved to sections far from the other code. The
attribute has been added to several functions already, this patch is
based on hints given by gcc -Wsuggest-attribute=cold.
The net effect of this patch is decrease of btrfs.ko by 1000-1300,
depending on the config options.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit the following warning while running down a different problem
[ 6197.175850] ------------[ cut here ]------------
[ 6197.185082] refcount_t: underflow; use-after-free.
[ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
[ 6197.521792] Call Trace:
[ 6197.526687] __btrfs_release_delayed_node+0x76/0x1c0
[ 6197.536615] btrfs_kill_all_delayed_nodes+0xec/0x130
[ 6197.546532] ? __btrfs_btree_balance_dirty+0x60/0x60
[ 6197.556482] btrfs_clean_one_deleted_snapshot+0x71/0xd0
[ 6197.566910] cleaner_kthread+0xfa/0x120
[ 6197.574573] kthread+0x111/0x130
[ 6197.581022] ? kthread_create_on_node+0x60/0x60
[ 6197.590086] ret_from_fork+0x1f/0x30
[ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---
This is because the free side drops the ref without the lock, and then
takes the lock if our refcount is 0. So you can have nodes on the tree
that have a refcount of 0. Fix this by zero'ing out that element in our
temporary array so we don't try to kill it again.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Its very helpful if we had logged the device scanner process name to
debug the race condition between the systemd-udevd scan and the user
initiated device forget command.
This patch adds process name and pid to the scan message.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add pid to the message ]
Signed-off-by: David Sterba <dsterba@suse.com>
That function adds unnecessary indirection between backref_in_log and
the caller. Furthermore it also "downgrades" backref_in_log's return
value to a boolean, when in fact it could very well be an error.
Rectify the situation by simply opencoding name_in_log_ref in
replay_one_name and properly handling possible return codes from
backref_in_log.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function can return a negative error value if btrfs_search_slot
errors for whatever reason or if btrfs_alloc_path runs out of memory.
This is currently problemattic because backref_in_log is treated by its
callers as if it returns boolean.
Fix this by adding proper error handling in callers. That also enables
the function to return the direct error code from btrfs_search_slot.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Direct replacement, though note that the inside of the loop in
btrfs_find_name_in_backref is organized in a slightly different way but
is equvalent.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The state was introduced in commit 4a9d8bdee3 ("Btrfs: make the state
of the transaction more readable"), then in commit 302167c50b
("btrfs: don't end the transaction for delayed refs in throttle") the
state is completely removed.
So we can just clean up the state since it's only compared but never
set.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an overview of the basic btrfs transaction transitions, including
the following states:
- No transaction states
- Transaction N [[TRANS_STATE_RUNNING]]
- Transaction N [[TRANS_STATE_COMMIT_START]]
- Transaction N [[TRANS_STATE_COMMIT_DOING]]
- Transaction N [[TRANS_STATE_UNBLOCKED]]
- Transaction N [[TRANS_STATE_COMPLETED]]
For each state, the comment will include:
- Basic explaination about current state
- How to go next stage
- What will happen if we call various start_transaction() functions
- Relationship to transaction N+1
This doesn't provide tech details, but serves as a cheat sheet for
reader to get into the code a little easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace is_power_of_2 with the helper that is self-documenting and
remove the open coded call in alloc_profile_is_valid.
Signed-off-by: David Sterba <dsterba@suse.com>
As is_power_of_two takes unsigned long, it's not safe on 32bit
architectures, but we could pass any u64 value in seveal places. Add a
separate helper and also an alias that better expresses the purpose for
which the helper is used.
Signed-off-by: David Sterba <dsterba@suse.com>
When balance reduces the number of copies of metadata, it reduces the
redundancy, use the term redundancy instead of integrity.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_assert_tree_locked is used outside of the locking
code so it is exported, however we can make it static inine as it's
fairly trivial.
This is the only locking assertion used in release builds, inlining
improves the text size by 174 bytes and reduces stack consumption in the
callers.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I've noticed that none of the btrfs_assert_*lock* debugging helpers is
inlined, despite they're short and mostly a value update. Making them
inline shaves 67 from the text size, reduces stack consumption and
perhaps also slightly improves the performance due to avoiding
unnecessary calls.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit ac0c7cf8be ("btrfs: fix crash when tracepoint arguments are
freed by wq callbacks") added a void pointer, wtag, which is passed into
trace_btrfs_all_work_done() instead of the freed work item. This is
silly for a few reasons:
1. The freed work item still has the same address.
2. work is still in scope after it's freed, so assigning wtag doesn't
stop anyone from using it.
3. The tracepoint has always taken a void * argument, so assigning wtag
doesn't actually make things any more type-safe. (Note that the
original bug in commit bc074524e1 ("btrfs: prefix fsid to all trace
events") was that the void * was implicitly casted when it was passed
to btrfs_work_owner() in the trace point itself).
Instead, let's add some clearer warnings as comments.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9e0af23764 ("Btrfs: fix task hang under heavy compressed
write") worked around the issue that a recycled work item could get a
false dependency on the original work item due to how the workqueue code
guarantees non-reentrancy. It did so by giving different work functions
to different types of work.
However, the fixes in the previous few patches are more complete, as
they prevent a work item from being recycled at all (except for a tiny
window that the kernel workqueue code handles for us). This obsoletes
the previous fix, so we don't need the unique helpers for correctness.
The only other reason to keep them would be so they show up in stack
traces, but they always seem to be optimized to a tail call, so they
don't show up anyways. So, let's just get rid of the extra indirection.
While we're here, rename normal_work_helper() to the more informative
btrfs_work_helper().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, scrub_missing_raid56_worker() puts and potentially frees
sblock (which embeds the work item) and then submits a bio through
scrub_wr_submit(). This is another potential instance of the bug in
"btrfs: don't prematurely free work in run_ordered_work()". Fix it by
dropping the reference after we submit the bio.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, reada_start_machine_worker() frees the reada_machine_work and
then calls __reada_start_machine() to do readahead. This is another
potential instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".
There _might_ already be a deadlock here: reada_start_machine_worker()
can depend on itself through stacked filesystems (__read_start_machine()
-> reada_start_machine_dev() -> reada_tree_block_flagged() ->
read_extent_buffer_pages() -> submit_one_bio() ->
btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() ->
submit_bio() onto a loop device can trigger readahead on the lower
filesystem).
Either way, let's fix it by freeing the work at the end.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
the work item) and then calls bio_endio(). This is another potential
instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".
In particular, the endio call may depend on other work items. For
example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
__btrfs_correct_data_nocsum() -> dio_read_error() ->
submit_dio_repair_bio(), which submits a bio that is also completed
through a end_workqueue_fn() work item. However,
__btrfs_correct_data_nocsum() waits for the newly submitted bio to
complete, thus it depends on another work item.
This example currently usually works because we use different workqueue
helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
However, it may deadlock with stacked filesystems and is fragile
overall. The proper fix is to free the work item at the very end of the
work function, so let's do that.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:
1. The top (loop device) filesystem queues an async_cow work item from
cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
(run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
(work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
thread A, so it schedules X to be executed _after_ worker thread A
finishes (see the find_worker_executing_work() call in
process_one_work()).
Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.
This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.
P.S.:
One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.
Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit fc97fab0ea ("btrfs: Replace fs_info->qgroup_rescan_worker
workqueue with btrfs_workqueue.") converted qgroup_rescan_work to be
initialized with btrfs_init_work(), but it left behind an unnecessary
memset(). Get rid of the memset().
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This needs to be cleaned up in the future, but for now it belongs to the
extent-io-tree stuff since it uses the internal tree search code.
Needed to export get_state_failrec and set_state_failrec as well since
we're not going to move the actual IO part of the failrec stuff out at
this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This utilizes internal stuff to the extent_io_tree, so we need to export
it before we move it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_io.c/h are huge, encompassing a bunch of different things. The
extent_io_tree code can live on its own, so separate this out.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are moving extent_io_tree into it's on file, so separate out the
extent_state init stuff from extent_io_tree_init().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We check both extent buffer and extent state leaks in the same function,
separate these two functions out so we can move them around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following comment shows up in btrfs_search_slot() with out much
sense:
/*
* setup the path here so we can release it under lock
* contention with the cow code
*/
if (cow) {
/* code touching path->lock[] is far away from here */
}
This comment hasn't been cleaned up after the relevant code has been
removed.
The original code is introduced in commit 65b51a009e
("btrfs_search_slot: reduce lock contention by cowing in two stages"):
+
+ /*
+ * setup the path here so we can release it under lock
+ * contention with the cow code
+ */
+ p->nodes[level] = b;
+ if (!p->skip_locking)
+ p->locks[level] = 1;
+
But in current code, we have different timing for modifying path lock,
so just remove the comment.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to btrfs_search_slot() done in previous patch, make a shortcut
for the level 0 case and allow to reduce indentation for the remaining
case.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_search_slot(), we something like:
if (level != 0) {
/* Do search inside tree nodes*/
} else {
/* Do search inside tree leaves */
goto done;
}
This caused extra indent for tree node search code. Change it to
something like:
if (level == 0) {
/* Do search inside tree leaves */
goto done'
}
/* Do search inside tree nodes */
So we have more space to maneuver our code, this is especially useful as
the tree nodes search code is more complex than the leaves search code.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For INODE_REF we will check:
- Objectid (ino) against previous key
To detect missing INODE_ITEM.
- No overflow/padding in the data payload
Much like DIR_ITEM, but with less members to check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the following items, key->objectid is inode number:
- DIR_ITEM
- DIR_INDEX
- XATTR_ITEM
- EXTENT_DATA
- INODE_REF
So in the subvolume tree, such items must have its previous item share the
same objectid, e.g.:
(257 INODE_ITEM 0)
(257 DIR_INDEX xxx)
(257 DIR_ITEM xxx)
(258 INODE_ITEM 0)
(258 INODE_REF 0)
(258 XATTR_ITEM 0)
(258 EXTENT_DATA 0)
But if we have the following sequence, then there is definitely
something wrong, normally some INODE_ITEM is missing, like:
(257 INODE_ITEM 0)
(257 DIR_INDEX xxx)
(257 DIR_ITEM xxx)
(258 XATTR_ITEM 0) <<< objecitd suddenly changed to 258
(258 EXTENT_DATA 0)
So just by checking the previous key for above inode based key types, we
can detect a missing inode item.
For INODE_REF key type, the check will be added along with INODE_REF
checker.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used ouside of transaction.c
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A recent patch to btrfs showed that there was at least 1 case where a
nested transaction was committed. Nested transaction in this case means
a code which has a transaction handle calls some function which in turn
obtains a copy of the same transaction handle. In such cases the correct
thing to do is for the lower callee to call btrfs_end_transaction which
contains appropriate checks so as to not commit the transaction which
will result in stale trans handler for the caller.
To catch such cases add an assert in btrfs_commit_transaction ensuring
btrfs_trans_handle::use_count is always 1.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is similar to 942491c9e6 ("xfs: fix AIM7 regression"). Apparently
our current rwsem code doesn't like doing the trylock, then lock for
real scheme. This causes extra contention on the lock and can be
measured eg. by AIM7 benchmark. So change our read/write methods to
just do the trylock for the RWF_NOWAIT case.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3MUkoACgkQxWXV+ddt
WDvAEQ//fEHZ51NbIMwJNltqF4mr6Oao0M5u0wejEgiXzmR9E1IuHUgVK+KDQmSu
wZl/y+RTlQC0TiURyStaVFBreXEqiuB79my9u4iDeNv/4UJQB42qpmYB4EuviDgB
mxb9bFpWTLkO6Oc+vMrGF3BOmVsQKlq2nOua25g8VFtApQ6uiEfbwBOslCcC8kQB
ZpNBl6x74xz/VWNWZnRStBfwYjRitKNDVU6dyIyRuLj8cktqfGBxGtx7/w0wDiZT
kPR1bNtdpy3Ndke6H/0G6plRWi9kENqcN43hvrz54IKh2l+Jd2/as51j4Qq2tJU9
KaAnJzRaSePxc2m0SqtgZTvc2BYSOg7dqaCyHxBB0CUBdTdJdz2TVZ2KM9MiLns4
1haHBLo4l8o8zeYZpW05ac6OXKY4f8qsjWPEGshn4FDbq0TrHQzYxAF3c0X3hPag
SnuvilgYUuYal+n87qinePg/ZmVrrBXPRycpQnn7FxqezJbf/2WUEojUVQnreU05
mdp8mulxQxyFhgEvO7K1uDtlP8bqW69IO9M//6IWzGNKTDK2SRI08ULplghqgyna
8SG0+y9w26r8UIWDhuvPbdfUMSG3kEH8yLFK84AFDMVJJxOnfznE3sC8sGOiP5q9
OUkl8l7bhDkyAdWZY57gGUobebdPfnLxRV9A+LZQ2El1kSOEK18=
=xzXs
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix for an older bug that has started to show up during testing
(because of an updated test for rename exchange).
It's an in-memory corruption caused by local variable leaking out of
the function scope"
* tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix log context list corruption after rename exchange operation
During rename exchange we might have successfully log the new name in the
source root's log tree, in which case we leave our log context (allocated
on stack) in the root's list of log contextes. However we might fail to
log the new name in the destination root, in which case we fallback to
a transaction commit later and never sync the log of the source root,
which causes the source root log context to remain in the list of log
contextes. This later causes invalid memory accesses because the context
was allocated on stack and after rename exchange finishes the stack gets
reused and overwritten for other purposes.
The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
detect this and report something like the following:
[ 691.489929] ------------[ cut here ]------------
[ 691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
[ 691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
(...)
[ 691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
[ 691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
(...)
[ 691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
[ 691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
[ 691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
[ 691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
[ 691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
[ 691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
[ 691.490019] FS: 00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
[ 691.490021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
[ 691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 691.490029] Call Trace:
[ 691.490058] btrfs_log_inode_parent+0x667/0x2730 [btrfs]
[ 691.490083] ? join_transaction+0x24a/0xce0 [btrfs]
[ 691.490107] ? btrfs_end_log_trans+0x80/0x80 [btrfs]
[ 691.490111] ? dget_parent+0xb8/0x460
[ 691.490116] ? lock_downgrade+0x6b0/0x6b0
[ 691.490121] ? rwlock_bug.part.0+0x90/0x90
[ 691.490127] ? do_raw_spin_unlock+0x142/0x220
[ 691.490151] btrfs_log_dentry_safe+0x65/0x90 [btrfs]
[ 691.490172] btrfs_sync_file+0x9f1/0xc00 [btrfs]
[ 691.490195] ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
[ 691.490198] ? rcu_read_lock_any_held.part.11+0x20/0x20
[ 691.490204] ? __do_sys_newstat+0x88/0xd0
[ 691.490207] ? cp_new_stat+0x5d0/0x5d0
[ 691.490218] ? do_fsync+0x38/0x60
[ 691.490220] do_fsync+0x38/0x60
[ 691.490224] __x64_sys_fdatasync+0x32/0x40
[ 691.490228] do_syscall_64+0x9f/0x540
[ 691.490233] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 691.490235] RIP: 0033:0x7f4b253ad5f0
(...)
[ 691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
[ 691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
[ 691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
[ 691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
[ 691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
This started happening recently when running some test cases from fstests
like btrfs/004 for example, because support for rename exchange was added
last week to fsstress from fstests.
So fix this by deleting the log context for the source root from the list
if we have logged the new name in the source root.
Reported-by: Su Yue <Damenly_Su@gmx.com>
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Tested-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3GwvEACgkQxWXV+ddt
WDvsfQ//UQ/TND1roMW3EmN+PLTRTUQS75Hb737r5Q66j0j94gFnBxB03R+tU8lP
bH5XVpateb1wDmmdscqVQ1WM9O82bQdDNiYeLDr86+kzpLgy61rZHswZfNlDtl5M
wDwyaxsrd7HndDeUZnIuaakYI9MXz9WIaNXkt0o8hSHctt0N18y23DSBFTVSh/4q
T3cn4odkoBKtQ4mIn2OSmQvkl69nWRBVpjPJZIvsNszKjOo9aZTuGHrOWUV5RPiE
Ho9UBkr+IjEDo8OH88vXAsHeYFIoYhEeUltjLHyF6Agwk1/Ajwp5sxXSubbfHUMQ
l1YdmrTZf+l1Dxdj0sCdyK1npcgGI5IuZmIICpNUEAny9AbTtpSE3GNwtnIHAEAr
cpki+1Z3lfaVSwNMYUz9Esbsb72C+f08WJHGHBMaOhjrBIwQUeWeYzTx6N7uDwNg
GjaDRxjqFV2o7373isQaz7WOTatUMNtL1xvhkeceONI9NBXQRjW5rq5zECmr1ix7
lSTKDzn7yAd6eGBW0T6iYdRl4Bta/zxPiDEF5KDNPugvv23yx17LAOdS4qAiHzbx
oglra/kz9D4xmKVqH7hpFaaHrzL38G4mz6aMdwBu0M7dkn9AwsXiXWSDb0n7jnK0
o96gkUBZjUSFBseUFD2s34MikFtrDynEJzPq96JHuWLguaByMcc=
=CZxY
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few regressions and fixes for stable.
Regressions:
- fix a race leading to metadata space leak after task received a
signal
- un-deprecate 2 ioctls, marked as deprecated by mistake
Fixes:
- fix limit check for number of devices during chunk allocation
- fix a race due to double evaluation of i_size_read inside max()
macro, can cause a crash
- remove wrong device id check in tree-checker"
* tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
Btrfs: fix race leading to metadata space leak after task received signal
btrfs: tree-checker: Fix wrong check on max devid
btrfs: Consider system chunk array size for new SYSTEM chunks
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
deprecated and scheduled for removal but we actualy do use them for
'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351e5
should have been just the async flag for subvolume creation.
The deprecation has been added in this development cycle, remove it
until it's time.
Fixes: ebc87351e5 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
Signed-off-by: David Sterba <dsterba@suse.com>
We hit a regression while rolling out 5.2 internally where we were
hitting the following panic
kernel BUG at mm/page-writeback.c:2659!
RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
Call Trace:
__process_pages_contig+0x25a/0x350
? extent_clear_unlock_delalloc+0x43/0x70
submit_compressed_extents+0x359/0x4d0
normal_work_helper+0x15a/0x330
process_one_work+0x1f5/0x3f0
worker_thread+0x2d/0x3d0
? rescuer_thread+0x340/0x340
kthread+0x111/0x130
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This is happening because the page is not locked when doing
clear_page_dirty_for_io. Looking at the core dump it was because our
async_extent had a ram_size of 24576 but our async_chunk range only
spanned 20480, so we had a whole extra page in our ram_size for our
async_extent.
This happened because we try not to compress pages outside of our
i_size, however a cleanup patch changed us to do
actual_end = min_t(u64, i_size_read(inode), end + 1);
which is problematic because i_size_read() can evaluate to different
values in between checking and assigning. So either an expanding
truncate or a fallocate could increase our i_size while we're doing
writeout and actual_end would end up being past the range we have
locked.
I confirmed this was what was happening by installing a debug kernel
that had
actual_end = min_t(u64, i_size_read(inode), end + 1);
if (actual_end > end + 1) {
printk(KERN_ERR "KABOOM\n");
actual_end = end + 1;
}
and installing it onto 500 boxes of the tier that had been seeing the
problem regularly. Last night I got my debug message and no panic,
confirming what I expected.
[ dsterba: the assembly confirms a tiny race window:
mov 0x20(%rsp),%rax
cmp %rax,0x48(%r15) # read
movl $0x0,0x18(%rsp)
mov %rax,%r12
mov %r14,%rax
cmovbe 0x48(%r15),%r12 # eval
Where r15 is inode and 0x48 is offset of i_size.
The original fix was to revert 62b3762271 that would do an
intermediate assignment and this would also avoid the doulble
evaluation but is not future-proof, should the compiler merge the
stores and call i_size_read anyway.
There's a patch adding READ_ONCE to i_size_read but that's not being
applied at the moment and we need to fix the bug. Instead, emulate
READ_ONCE by two barrier()s that's what effectively happens. The
assembly confirms single evaluation:
mov 0x48(%rbp),%rax # read once
mov 0x20(%rsp),%rcx
mov $0x20,%edx
cmp %rax,%rcx
cmovbe %rcx,%rax
mov %rax,(%rsp)
mov %rax,%rcx
mov %r14,%rax
Where 0x48(%rbp) is inode->i_size stored to %eax.
]
Fixes: 62b3762271 ("btrfs: Remove isize local variable in compress_file_range")
CC: stable@vger.kernel.org # v5.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).
Here's the sequence of steps leading to the space leak:
1) A task tries to create a file for example, so it ends up trying to
start a transaction at btrfs_create();
2) The filesystem is currently in a state where there is not enough
metadata free space to satisfy the transaction's needs. So at
space-info.c:__reserve_metadata_bytes() we create a ticket and
add it to the list of tickets of the space info object. Also,
because the metadata async reclaim job is not running, we queue
a job ro run metadata reclaim;
3) In the meanwhile the task receives a signal (like SIGTERM from
a kill command for example);
4) After queing the async reclaim job, at __reserve_metadata_bytes(),
we unlock the metadata space info and call handle_reserve_ticket();
5) That last function calls wait_reserve_ticket(), which acquires the
lock from the metadata space info. Then in the first iteration of
its while loop, it calls prepare_to_wait_event(), which returns
-ERESTARTSYS because the task has a pending signal. As a result,
we set the error field of the ticket to -EINTR and exit the while
loop without deleting the ticket from the list of tickets (in the
space info object). After exiting the loop we unlock the space info;
6) The async reclaim job is able to release enough metadata, acquires
the metadata space info's lock and then reserves space for the ticket,
since the ticket is still in the list of (non-priority) tickets. The
space reservation happens at btrfs_try_granting_tickets(), called from
maybe_fail_all_tickets(). This increments the bytes_may_use counter
from the metadata space info object, sets the ticket's bytes field to
zero (meaning success, that space was reserved) and removes it from
the list of tickets;
7) wait_reserve_ticket() returns, with the error field of the ticket
set to -EINTR. Then handle_reserve_ticket() just propagates that error
to the caller. Because an error was returned, the caller does not
release the reserved space, since the expectation is that any error
means no space was reserved.
Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.
Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.
This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.
When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x1ad/0x390 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x70
cleanup_mnt+0xb4/0x160
task_work_run+0x7e/0xc0
exit_to_usermode_loop+0xfa/0x100
do_syscall_64+0x1cb/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4274d2cb37
(...)
RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bcf4b235461b26f6 ]---
BTRFS info (device sdb): space_info 4 has 19116032 free, is full
BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
Fixes: 374bf9c5cd ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script will cause false alert on devid check.
#!/bin/bash
dev1=/dev/test/test
dev2=/dev/test/scratch1
mnt=/mnt/btrfs
umount $dev1 &> /dev/null
umount $dev2 &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev1
mount $dev1 $mnt
_fail()
{
echo "!!! FAILED !!!"
exit 1
}
for ((i = 0; i < 4096; i++)); do
btrfs dev add -f $dev2 $mnt || _fail
btrfs dev del $dev1 $mnt || _fail
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
[CAUSE]
Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
upper limit for devid. But we can have devid holes just like above
script.
So the check for devid is incorrect and could cause false alert.
[FIX]
Just remove the whole devid check. We don't have any hard requirement
for devid assignment.
Furthermore, even devid could get corrupted by a bitflip, we still have
dev extents verification at mount time, so corrupted data won't sneak
in.
This fixes fstests btrfs/194.
Reported-by: Anand Jain <anand.jain@oracle.com>
Fixes: ab4ba2e133 ("btrfs: tree-checker: Verify dev item")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For SYSTEM chunks, despite the regular chunk item size limit, there is
another limit due to system chunk array size.
The extra limit was removed in a refactoring, so add it back.
Fixes: e3ecdb3fde ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The .ioctl and .compat_ioctl file operations have the same prototype so
they can both point to the same function, which works great almost all
the time when all the commands are compatible.
One exception is the s390 architecture, where a compat pointer is only
31 bit wide, and converting it into a 64-bit pointer requires calling
compat_ptr(). Most drivers here will never run in s390, but since we now
have a generic helper for it, it's easy enough to use it consistently.
I double-checked all these drivers to ensure that all ioctl arguments
are used as pointers or are ignored, but are not interpreted as integer
values.
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2vBPgACgkQxWXV+ddt
WDvaGQ/+JbV05ML7QjufNFKjzojNPg8dOwvLqF58askMEAqzyqOrzu1YsIOjchqv
eZPP+wxh6j6i8LFnbaMYRhSVMlgpK9XOR3tNStp73QlSox6/6VKu7XR4wFXAoip3
l4XI8UqO5XqDG55UTtjKyo2VNq8vgq1gRUWD6hPtfzDP6WYj4JZuXOd+dT6PbP32
VHd91RoB8Z8qmypLF9Sju6IBWNhKl91TjlRqVdfLywaK8azqaxE1UttPf5DVbE6+
MlTuXhuxkNc4ddzj//oJ1s3ZP/nXtFmIZ75+Sd6P5DfqRNeIfjkKqXvffsjqoYVI
1Wv9sDiezxRB1RZjfInhtqvdvsqcsrXrM7x6BRVqI7IZDUaH5em8IoozQT62ze/4
MJBrRIbVodx2I7EVDMNGkx2TaIDAfnW4Z3UC+YSHLoy6jht1+SA5KLQDR1G/4NeR
dWht5wOXwhp8P3DoaczTUpk0DLtAtygj04fH8CG277EILLCpuWwW8iRkPttkrIM2
HrtRKrKFJNyEpq9vHSFVvpQJkzgzqBFs1UnqnEuYh6qNgWCrS4PJDu9geQ8aPGr2
pA5aEqg5b+jjWzIYDP93PF0u7kF4mFVAozn6xMf95FKmM1OupYYQg5BRe7n/DfDu
yh3Ms0Mmd9+snpNJ9lDr40cobHC5CDvfvO2SRERyBeXxARainN0=
=Ft+G
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fixes of error handling cleanup of metadata accounting with qgroups
enabled
- fix swapped values for qgroup tracepoints
- fix race when handling full sync flag
- don't start unused worker thread, functionality removed already
* tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: check for the full sync flag while holding the inode lock during fsync
Btrfs: fix qgroup double free after failure to reserve metadata for delalloc
btrfs: tracepoints: Fix bad entry members of qgroup events
btrfs: tracepoints: Fix wrong parameter order for qgroup events
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
btrfs: don't needlessly create extent-refs kernel thread
btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group()
Btrfs: add missing extents release on file extent cluster relocation error
We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that that time it might not be set but
after we acquire the inode lock some other task set it. One case where
this can happen is on a system low on memory and some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.
A consequence of missing the full fsync flag set is hitting the problems
fixed by commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
of weird inconsistencies after replaying a log due to file extents items
representing ranges that overlap.
So just move the check such that it's done after locking the inode and
before starting writeback again.
Fixes: 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to reserve metadata for delalloc operations we end up releasing
the previously reserved qgroup amount twice, once explicitly under the
'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
value of 'true' for its 'qgroup_free' argument, which results in
btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
a double free.
Also if we fail to reserve the necessary qgroup amount, we jump to the
label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns
calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
reserve any qgroup amount. So we freed some amount we never reserved.
So fix this by removing the call to btrfs_inode_rsv_release() in the
failure path, since it's not necessary at all as we haven't changed the
inode's block reserve in any way at this point.
Fixes: c8eaeac7b7 ("btrfs: reserve delalloc metadata differently")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For btrfs:qgroup_meta_reserve event, the trace event can output garbage:
qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2
The diff should always be alinged to sector size (4k), so there is
definitely something wrong.
[CAUSE]
For the wrong @diff, it's caused by wrong parameter order.
The correct parameters are:
struct btrfs_root, s64 diff, int type.
However the parameters used are:
struct btrfs_root, int type, s64 diff.
Fixes: 4ee0d8832c ("btrfs: qgroup: Update trace events for metadata reservation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The patch 32b593bfcb ("Btrfs: remove no longer used function to run
delayed refs asynchronously") removed the async delayed refs but the
thread has been created, without any use. Remove it to avoid resource
consumption.
Fixes: 32b593bfcb ("Btrfs: remove no longer used function to run delayed refs asynchronously")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_read_block_groups(), if we have an invalid block group which
has mixed type (DATA|METADATA) while the fs doesn't have MIXED_GROUPS
feature, we error out without freeing the block group cache.
This patch will add the missing btrfs_put_block_group() to prevent
memory leak.
Note for stable backports: the file to patch in versions <= 5.3 is
fs/btrfs/extent-tree.c
Fixes: 49303381f1 ("Btrfs: bail out if block group has different mixed flag")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we error out when finding a page at relocate_file_extent_cluster(), we
need to release the outstanding extents counter on the relocation inode,
set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise
the inode's block reserve size can never decrease to zero and metadata
space is leaked. Therefore add a call to btrfs_delalloc_release_extents()
in case we can't find the target page.
Fixes: 8b62f87bad ("Btrfs: rework outstanding_extents")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2fQkoACgkQxWXV+ddt
WDsXgA/+OpORcroqswa5V/AB5NgSMv08QfBtL7en7cUA2cGbrTt0lAoixIesCeGA
7y4umlx7zWDhrG0NFv8E21xxkMzK72/YQnN0C6ZwRCi+ZbPSDlpMCwSNV9b6oVt4
t9mJQ2kNZ80wj9W7jtoyiiWZ2OBccWywKBYr+BXybha5BliSd1XbpWMv90lODQHc
1oH1FOphPAf+nSEtW4g0BV8UHy1n1+7TqjEHDilj0TuHlO0MHsR2FQcsRnMBv5H/
CcEH3+870pUOjvm/l1OAdIrzrnH6UsXKHnOmJpYZIAQdG4xtHq/d6O9sfQMZK/Af
VMXuju558kGOvrYdgffYa0HuW3P6c8q5BeEtGAZwjGsQS5GxmW63et/x+HIEl4RU
4SWMfD592GK1/yQn31NWGav7Olkr4L0Eh9iqfKk2nWayTV+pqs8NgkGumkoNB66n
WJs2m2WwKvZzj1Ys9ilRzo17qt2U/m4eLQtbut3m5eYCpzvo38PDrqOVqKTZG/dc
nG1JBJNk+yjV+RkOO0gnQ8/zzooEGvMhVIx5iq52tWjvxnvOGSVv9QAAkKrbMoOX
kn/b3heL57Z+kilUU3Zd0GvVif+awIgOj+PmAevR+w3MLoJ2gLfWb7qcONdHyFPk
jK7rsuP737fyHpZ1tvJyfTt4Lk/u1UbApKrO4QIku1r9IJmmu20=
=zpDj
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more stabitly fixes, one build warning fix.
- fix inode allocation under NOFS context
- fix leak in fiemap due to concurrent append writes
- fix log-root tree updates
- fix balance convert of single profile on 32bit architectures
- silence false positive warning on old GCCs (code moved in rc1)"
* tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: silence maybe-uninitialized warning in clone_range
btrfs: fix uninitialized ret in ref-verify
btrfs: allocate new inode in NOFS context
btrfs: fix balance convert to single on 32-bit host CPUs
btrfs: fix incorrect updating of log root tree
Btrfs: fix memory leak due to concurrent append writes with fiemap
GCC throws warning message as below:
‘clone_src_i_size’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
^
fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
u64 clone_src_i_size;
^
The clone_src_i_size is only used as call-by-reference
in a call to get_inode_info().
Silence the warning by initializing clone_src_i_size to 0.
Note that the warning is a false positive and reported by older versions
of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the
patch is applied. Setting clone_src_i_size to 0 does not otherwise make
sense and would not do any action in case the code changes in the future.
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Coverity caught a case where we could return with a uninitialized value
in ret in process_leaf. This is actually pretty likely because we could
very easily run into a block group item key and have a garbage value in
ret and think there was an errror. Fix this by initializing ret to 0.
Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the command:
btrfs balance start -dconvert=single,soft .
on a Raspberry Pi produces the following kernel message:
BTRFS error (device mmcblk0p2): balance: invalid convert data profile single
This fails because we use is_power_of_2(unsigned long) to validate
the new data profile, the constant for 'single' profile uses bit 48,
and there are only 32 bits in a long on ARM.
Fix by open-coding the check using u64 variables.
Tested by completing the original balance command on several Raspberry
Pis.
Fixes: 818255feec ("btrfs: use common helper instead of open coding a bit test")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've historically had reports of being unable to mount file systems
because the tree log root couldn't be read. Usually this is the "parent
transid failure", but could be any of the related errors, including
"fsid mismatch" or "bad tree block", depending on which block got
allocated.
The modification of the individual log root items are serialized on the
per-log root root_mutex. This means that any modification to the
per-subvol log root_item is completely protected.
However we update the root item in the log root tree outside of the log
root tree log_mutex. We do this in order to allow multiple subvolumes
to be updated in each log transaction.
This is problematic however because when we are writing the log root
tree out we update the super block with the _current_ log root node
information. Since these two operations happen independently of each
other, you can end up updating the log root tree in between writing out
the dirty blocks and setting the super block to point at the current
root.
This means we'll point at the new root node that hasn't been written
out, instead of the one we should be pointing at. Thus whatever garbage
or old block we end up pointing at complains when we mount the file
system later and try to replay the log.
Fix this by copying the log's root item into a local root item copy.
Then once we're safely under the log_root_tree->log_mutex we update the
root item in the log_root_tree. This way we do not modify the
log_root_tree while we're committing it, fixing the problem.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have a buffered write that starts at an offset greater than or
equals to the file's size happening concurrently with a full ranged
fiemap, we can end up leaking an extent state structure.
Suppose we have a file with a size of 1Mb, and before the buffered write
and fiemap are performed, it has a single extent state in its io tree
representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
The following sequence diagram shows how the memory leak happens if a
fiemap a buffered write, starting at offset 1Mb and with a length of
4Kb, are performed concurrently.
CPU 1 CPU 2
extent_fiemap()
--> it's a full ranged fiemap
range from 0 to LLONG_MAX - 1
(9223372036854775807)
--> locks range in the inode's
io tree
--> after this we have 2 extent
states in the io tree:
--> 1 for range [0, 1Mb[ with
the bits EXTENT_LOCKED and
EXTENT_DELALLOC_BITS set
--> 1 for the range
[1Mb, LLONG_MAX[ with
the EXTENT_LOCKED bit set
--> start buffered write at offset
1Mb with a length of 4Kb
btrfs_file_write_iter()
btrfs_buffered_write()
--> cached_state is NULL
lock_and_cleanup_extent_if_need()
--> returns 0 and does not lock
range because it starts
at current i_size / eof
--> cached_state remains NULL
btrfs_dirty_pages()
btrfs_set_extent_delalloc()
(...)
__set_extent_bit()
--> splits extent state for range
[1Mb, LLONG_MAX[ and now we
have 2 extent states:
--> one for the range
[1Mb, 1Mb + 4Kb[ with
EXTENT_LOCKED set
--> another one for the range
[1Mb + 4Kb, LLONG_MAX[ with
EXTENT_LOCKED set as well
--> sets EXTENT_DELALLOC on the
extent state for the range
[1Mb, 1Mb + 4Kb[
--> caches extent state
[1Mb, 1Mb + 4Kb[ into
@cached_state because it has
the bit EXTENT_LOCKED set
--> btrfs_buffered_write() ends up
with a non-NULL cached_state and
never calls anything to release its
reference on it, resulting in a
memory leak
Fix this by calling free_extent_state() on cached_state if the range was
not locked by lock_and_cleanup_extent_if_need().
The same issue can happen if anything else other than fiemap locks a range
that covers eof and beyond.
This could be triggered, sporadically, by test case generic/561 from the
fstests suite, which makes duperemove run concurrently with fsstress, and
duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
the leak is reported in dmesg/syslog when removing the btrfs module with
a message like the following:
[77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
CC: stable@vger.kernel.org # 4.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2SDbMACgkQxWXV+ddt
WDsUhw/9HcRsT6SlrwA2R5leHxCR5UMwT2Zmbxpfft37ANF0SC1UINHBfnmquM97
xX6fdRSR9RUjF9DrdLPfLBnJDQ/MnHl1ruIVBFhJm6cJ9TJwf9E0TiJBQt+08JWg
vy5hZBWvsPWWRBJ94XPMe4LtakK/isW4Cz5W9AdrC2Siqw69j6eZzms2AnIjyBjA
BoKg4se2Ay2rMxLZWXIOj9374PU+N1cnRnqgh77ZxLku5WdCzrDfB5safE7UmoTG
/MWJuuIgzOk0iQpQORRtEZDS1dNe5KT9m4xXkUbrZbQROwqnXrT1SVIsuqNAvlPk
uaymR1W8nshepzpMlSxVydLv/mKWZNUGnDxOJ23ooow8Yd7ndppXEtFuGwCYqIFc
xQqxuTLREvJ9+jpSv11bmDpk/ULRqpV+2PjUqGaWlGwFArJ+qFRLVGYx31eXmDPj
t2mrPOcXGzY0pKtIpbkuUGleY/jeI+BNsvD4+QPs+jnp0nmfvH0/Rmp7grGqx2FI
rQM8Gn4a5i3nuEDWLp8nN2wcKC3ePwy96Vp2tqfsl6TVTPx4EFzGLkWogHR2yiqI
0LAj8YWFmWuChSv71wYOjX79CppjcbNwOakSwtDjV30jkwoh2f/0D3OwOpua2xe8
75KQMaSB0kesGZz7ZkL1kMqA5m5w7MGZom6XZoBJ+bq2HPLB2jo=
=2UM7
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A bunch of fixes that accumulated in recent weeks, mostly material for
stable.
Summary:
- fix for regression from 5.3 that prevents to use balance convert
with single profile
- qgroup fixes: rescan race, accounting leak with multiple writers,
potential leak after io failure recovery
- fix for use after free in relocation (reported by KASAN)
- other error handling fixups"
* tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
btrfs: Fix a regression which we can't convert to SINGLE profile
btrfs: relocation: fix use-after-free on dead relocation roots
Btrfs: fix race setting up and completing qgroup rescan workers
Btrfs: fix missing error return if writeback for extent buffer never started
btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer
Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
[BUG]
The following script can cause btrfs qgroup data space leak:
mkfs.btrfs -f $dev
mount $dev -o nospace_cache $mnt
btrfs subv create $mnt/subv
btrfs quota en $mnt
btrfs quota rescan -w $mnt
btrfs qgroup limit 128m $mnt/subv
for (( i = 0; i < 3; i++)); do
# Create 3 64M holes for latter fallocate to fail
truncate -s 192m $mnt/subv/file
xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
sync
# it's supposed to fail, and each failure will leak at least 64M
# data space
xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
rm $mnt/subv/file
sync
done
# Shouldn't fail after we removed the file
xfs_io -f -c "falloc 0 64m" $mnt/subv/file
[CAUSE]
Btrfs qgroup data reserve code allow multiple reservations to happen on
a single extent_changeset:
E.g:
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
Btrfs qgroup code has its internal tracking to make sure we don't
double-reserve in above example.
The only pattern utilizing this feature is in the main while loop of
btrfs_fallocate() function.
However btrfs_qgroup_reserve_data()'s error handling has a bug in that
on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
flag but doesn't free previously reserved bytes.
This bug has a two fold effect:
- Clearing EXTENT_QGROUP_RESERVED ranges
This is the correct behavior, but it prevents
btrfs_qgroup_check_reserved_leak() to catch the leakage as the
detector is purely EXTENT_QGROUP_RESERVED flag based.
- Leak the previously reserved data bytes.
The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
the last one fails, leaking space reserved in the previous ones.
[FIX]
Also free previously reserved data bytes when btrfs_qgroup_reserve_data
fails.
Fixes: 5247255370 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Under the following case with qgroup enabled, if some error happened
after we have reserved delalloc space, then in error handling path, we
could cause qgroup data space leakage:
From btrfs_truncate_block() in inode.c:
ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
block_start, blocksize);
if (ret)
goto out;
again:
page = find_or_create_page(mapping, index, mask);
if (!page) {
btrfs_delalloc_release_space(inode, data_reserved,
block_start, blocksize, true);
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
ret = -ENOMEM;
goto out;
}
[CAUSE]
In the above case, btrfs_delalloc_reserve_space() will call
btrfs_qgroup_reserve_data() and mark the io_tree range with
EXTENT_QGROUP_RESERVED flag.
In the error handling path, we have the following call stack:
btrfs_delalloc_release_space()
|- btrfs_free_reserved_data_space()
|- btrsf_qgroup_free_data()
|- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
|- qgroup_free_reserved_data(reserved=@reserved)
|- clear_record_extent_bits();
|- freed += changeset.bytes_changed;
However due to a completion bug, qgroup_free_reserved_data() will clear
EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
than the correct BTRFS_I(inode)->io_tree.
Since io_failure_tree is never marked with that flag,
btrfs_qgroup_free_data() will not free any data reserved space at all,
causing a leakage.
This type of error handling can only be triggered by errors outside of
qgroup code. So EDQUOT error from qgroup can't trigger it.
[FIX]
Fix the wrong target io_tree.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: bc42bda223 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With v5.3 kernel, we can't convert to SINGLE profile:
# btrfs balance start -f -dconvert=single $mnt
ERROR: error during balancing '/mnt/btrfs': Invalid argument
# dmesg -t | tail
validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
BTRFS error (device dm-3): balance: invalid convert data profile single
[CAUSE]
With the extra debug output added, it shows that the @allowed bit is
lacking the special in-memory only SINGLE profile bit.
Thus we fail at that (profile & ~allowed) check.
This regression is caused by commit 081db89b13 ("btrfs: use raid_attr
to get allowed profiles for balance conversion") and the fact that we
don't use any bit to indicate SINGLE profile on-disk, but uses special
in-memory only bit to help distinguish different profiles.
[FIX]
Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
the same as it was and fix the regression.
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 081db89b13 ("btrfs: use raid_attr to get allowed profiles for balance conversion")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
One user reported a reproducible KASAN report about use-after-free:
BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
==================================================================
BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579
CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P OE 5.2.11-arch1-1-kasan #4
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Call Trace:
dump_stack+0x7b/0xba
print_address_description+0x6c/0x22e
? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
__kasan_report.cold+0x1b/0x3b
? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
kasan_report+0x12/0x17
__asan_report_store8_noabort+0x17/0x20
btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
record_root_in_trans+0x2a0/0x370 [btrfs]
btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
start_transaction+0x1ab/0xe90 [btrfs]
btrfs_join_transaction+0x1d/0x20 [btrfs]
btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
? lock_repin_lock+0x400/0x400
? __kmem_cache_shutdown.cold+0x140/0x1ad
? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
finish_ordered_fn+0x15/0x20 [btrfs]
normal_work_helper+0x1bd/0xca0 [btrfs]
? process_one_work+0x819/0x1720
? kasan_check_read+0x11/0x20
btrfs_endio_write_helper+0x12/0x20 [btrfs]
process_one_work+0x8c9/0x1720
? pwq_dec_nr_in_flight+0x2f0/0x2f0
? worker_thread+0x1d9/0x1030
worker_thread+0x98/0x1030
kthread+0x2bb/0x3b0
? process_one_work+0x1720/0x1720
? kthread_park+0x120/0x120
ret_from_fork+0x35/0x40
Allocated by task 369692:
__kasan_kmalloc.part.0+0x44/0xc0
__kasan_kmalloc.constprop.0+0xba/0xc0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x138/0x260
btrfs_read_tree_root+0x92/0x360 [btrfs]
btrfs_read_fs_root+0x10/0xb0 [btrfs]
create_reloc_root+0x47d/0xa10 [btrfs]
btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
record_root_in_trans+0x2a0/0x370 [btrfs]
btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
start_transaction+0x1ab/0xe90 [btrfs]
btrfs_start_transaction+0x1e/0x20 [btrfs]
__btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
btrfs_prealloc_file_range+0x13/0x20 [btrfs]
prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
relocate_file_extent_cluster+0x193/0xc30 [btrfs]
relocate_data_extent+0x1f8/0x490 [btrfs]
relocate_block_group+0x600/0x1060 [btrfs]
btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
btrfs_relocate_chunk+0x9e/0x180 [btrfs]
btrfs_balance+0x14e4/0x2fc0 [btrfs]
btrfs_ioctl_balance+0x47f/0x640 [btrfs]
btrfs_ioctl+0x119d/0x8380 [btrfs]
do_vfs_ioctl+0x9f5/0x1060
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0xa5/0x370
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 369692:
__kasan_slab_free+0x14f/0x210
kasan_slab_free+0xe/0x10
kfree+0xd8/0x270
btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
clean_dirty_subvols+0x227/0x340 [btrfs]
relocate_block_group+0x972/0x1060 [btrfs]
btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
btrfs_relocate_chunk+0x9e/0x180 [btrfs]
btrfs_balance+0x14e4/0x2fc0 [btrfs]
btrfs_ioctl_balance+0x47f/0x640 [btrfs]
btrfs_ioctl+0x119d/0x8380 [btrfs]
do_vfs_ioctl+0x9f5/0x1060
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0xa5/0x370
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff88856f671100
which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 1552 bytes inside of
4096-byte region [ffff88856f671100, ffff88856f672100)
The buggy address belongs to the page:
page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
flags: 0x2ffff0000010200(slab|head)
raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
BTRFS info (device sdi1): 1 enospc errors during balance
BTRFS info (device sdi1): balance: ended with status: -28
[CAUSE]
The problem happens when finish_ordered_io() get called with balance
still running, while the reloc root of that subvolume is already dead.
(Tree is swap already done, but tree not yet deleted for possible qgroup
usage.)
That means root->reloc_root still exists, but that reloc_root can be
under btrfs_drop_snapshot(), thus we shouldn't access it.
The following race could cause the use-after-free problem:
CPU1 | CPU2
--------------------------------------------------------------------------
| relocate_block_group()
| |- unset_reloc_control(rc)
| |- btrfs_commit_transaction()
btrfs_finish_ordered_io() | |- clean_dirty_subvols()
|- btrfs_join_transaction() | |
|- record_root_in_trans() | |
|- btrfs_init_reloc_root() | |
|- if (root->reloc_root) | |
| | |- root->reloc_root = NULL
| | |- btrfs_drop_snapshot(reloc_root);
|- reloc_root->last_trans|
= trans->transid |
^^^^^^^^^^^^^^^^^^^^^^
Use after free
[FIX]
Fix it by the following modifications:
- Test if the root has dead reloc tree before accessing root->reloc_root
If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
create or update root->reloc_tree
- Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped
reloc tree
To co-operate with above modification, so as long as
BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
reloc tree at record_root_in_trans().
Reported-by: Cebtenzzre <cebtenzzre@gmail.com>
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a race between setting up a qgroup rescan worker and completing
a qgroup rescan worker that can lead to callers of the qgroup rescan wait
ioctl to either not wait for the rescan worker to complete or to hang
forever due to missing wake ups. The following diagram shows a sequence
of steps that illustrates the race.
CPU 1 CPU 2 CPU 3
btrfs_ioctl_quota_rescan()
btrfs_qgroup_rescan()
qgroup_rescan_init()
mutex_lock(&fs_info->qgroup_rescan_lock)
spin_lock(&fs_info->qgroup_lock)
fs_info->qgroup_flags |=
BTRFS_QGROUP_STATUS_FLAG_RESCAN
init_completion(
&fs_info->qgroup_rescan_completion)
fs_info->qgroup_rescan_running = true
mutex_unlock(&fs_info->qgroup_rescan_lock)
spin_unlock(&fs_info->qgroup_lock)
btrfs_init_work()
--> starts the worker
btrfs_qgroup_rescan_worker()
mutex_lock(&fs_info->qgroup_rescan_lock)
fs_info->qgroup_flags &=
~BTRFS_QGROUP_STATUS_FLAG_RESCAN
mutex_unlock(&fs_info->qgroup_rescan_lock)
starts transaction, updates qgroup status
item, etc
btrfs_ioctl_quota_rescan()
btrfs_qgroup_rescan()
qgroup_rescan_init()
mutex_lock(&fs_info->qgroup_rescan_lock)
spin_lock(&fs_info->qgroup_lock)
fs_info->qgroup_flags |=
BTRFS_QGROUP_STATUS_FLAG_RESCAN
init_completion(
&fs_info->qgroup_rescan_completion)
fs_info->qgroup_rescan_running = true
mutex_unlock(&fs_info->qgroup_rescan_lock)
spin_unlock(&fs_info->qgroup_lock)
btrfs_init_work()
--> starts another worker
mutex_lock(&fs_info->qgroup_rescan_lock)
fs_info->qgroup_rescan_running = false
mutex_unlock(&fs_info->qgroup_rescan_lock)
complete_all(&fs_info->qgroup_rescan_completion)
Before the rescan worker started by the task at CPU 3 completes, if
another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
fs_info->qgroup_flags, which is expected and correct behaviour.
However if other task calls btrfs_ioctl_quota_rescan_wait() before the
rescan worker started by the task at CPU 3 completes, it will return
immediately without waiting for the new rescan worker to complete,
because fs_info->qgroup_rescan_running is set to false by CPU 2.
This race is making test case btrfs/171 (from fstests) to fail often:
btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
--- tests/btrfs/171.out 2018-09-16 21:30:48.505104287 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad 2019-09-19 02:01:36.938486039 +0100
@@ -1,2 +1,3 @@
QA output created by 171
+ERROR: quota rescan failed: Operation now in progress
Silence is golden
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad' to see the entire diff)
That is because the test calls the btrfs-progs commands "qgroup quota
rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
commands 'qgroup assign' and 'qgroup remove' often call the rescan start
ioctl after calling the qgroup assign ioctl,
btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
for a rescan worker to complete.
Another problem the race can cause is missing wake ups for waiters,
since the call to complete_all() happens outside a critical section and
after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
diagram above, if we have a waiter for the first rescan task (executed
by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
before it calls complete_all() against
fs_info->qgroup_rescan_completion, the task at CPU 3 calls
init_completion() against fs_info->qgroup_rescan_completion which
re-initilizes its wait queue to an empty queue, therefore causing the
rescan worker at CPU 2 to call complete_all() against an empty queue,
never waking up the task waiting for that rescan worker.
Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
fs_info->qgroup_rescan_running to false in the same critical section,
delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
call to complete_all() in that same critical section. This gives the
protection needed to avoid rescan wait ioctl callers not waiting for a
running rescan worker and the lost wake ups problem, since setting that
rescan flag and boolean as well as initializing the wait queue is done
already in a critical section delimited by that mutex (at
qgroup_rescan_init()).
Fixes: 57254b6ebc ("Btrfs: add ioctl to wait for qgroup rescan completion")
Fixes: d2c609b834 ("btrfs: properly track when rescan worker is running")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If lock_extent_buffer_for_io() fails, it returns a negative value, but its
caller btree_write_cache_pages() ignores such error. This means that a
call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
failed. We should make btree_write_cache_pages() notice such error values
and stop immediatelly, making sure filemap_fdatawrite_range() returns an
error to the transaction commit path. A failure from flush_write_bio()
should also result in the endio callback end_bio_extent_buffer_writepage()
being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
there's no risk a transaction or log commit doesn't catch a writeback
failure.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before, if a eb failed to write out, we would end up triggering a
BUG_ON(). As of f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in
flush_write_bio() one level up"), we no longer BUG_ON(), so we should
make life consistent and add back the unwritten bytes to
dirty_metadata_bytes.
Fixes: f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Filipe Manana <fdmanana@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>