It's no longer used so let's remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't call readpage_end_io_hook for the btree inode. Instead of relying
on indirect calls to implement metadata buffer validation simply check
if the inode whose page we are processing equals the btree inode. If it
does call the necessary function.
This is an improvement in 2 directions:
1. We aren't paying the penalty of indirect calls in a post-speculation
attacks world.
2. The function is now named more explicitly so it's obvious what's
going on
This is in preparation to removing struct extent_io_ops altogether.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During an incremental send, when an inode has multiple new references we
might end up emitting rename operations for orphanizations that have a
source path that is no longer valid due to a previous orphanization of
some directory inode. This causes the receiver to fail since it tries
to rename a path that does not exists.
Example reproducer:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/f1
touch /mnt/sdi/f2
mkdir /mnt/sdi/d1
mkdir /mnt/sdi/d1/d2
# Filesystem looks like:
#
# . (ino 256)
# |----- f1 (ino 257)
# |----- f2 (ino 258)
# |----- d1/ (ino 259)
# |----- d2/ (ino 260)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now do a series of changes such that:
#
# *) inode 258 has one new hardlink and the previous name changed
#
# *) both names conflict with the old names of two other inodes:
#
# 1) the new name "d1" conflicts with the old name of inode 259,
# under directory inode 256 (root)
#
# 2) the new name "d2" conflicts with the old name of inode 260
# under directory inode 259
#
# *) inodes 259 and 260 now have the old names of inode 258
#
# *) inode 257 is now located under inode 260 - an inode with a number
# smaller than the inode (258) for which we created a second hard
# link and swapped its names with inodes 259 and 260
#
ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
# Swap d1 and f2.
mv /mnt/sdi/d1 /mnt/sdi/tmp
mv /mnt/sdi/f2 /mnt/sdi/d1
mv /mnt/sdi/tmp /mnt/sdi/f2
# Swap d2 and f2_link
mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
# Filesystem now looks like:
#
# . (ino 256)
# |----- d1 (ino 258)
# |----- f2/ (ino 259)
# |----- f2_link/ (ino 260)
# | |----- f1 (ino 257)
# |
# |----- d2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When executed the receive of the incremental stream fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
This happens because:
1) When processing inode 257 we end up computing the name for inode 259
because it is an ancestor in the send snapshot, and at that point it
still has its old name, "d1", from the parent snapshot because inode
259 was not yet processed. We then cache that name, which is valid
until we start processing inode 259 (or set the progress to 260 after
processing its references);
2) Later we start processing inode 258 and collecting all its new
references into the list sctx->new_refs. The first reference in the
list happens to be the reference for name "d1" while the reference for
name "d2" is next (the last element of the list).
We compute the full path "d1/d2" for this second reference and store
it in the reference (its ->full_path member). The path used for the
new parent directory was "d1" and not "f2" because inode 259, the
new parent, was not yet processed;
3) When we start processing the new references at process_recorded_refs()
we start with the first reference in the list, for the new name "d1".
Because there is a conflicting inode that was not yet processed, which
is directory inode 259, we orphanize it, renaming it from "d1" to
"o259-6-0";
4) Then we start processing the new reference for name "d2", and we
realize it conflicts with the reference of inode 260 in the parent
snapshot. So we issue an orphanization operation for inode 260 by
emitting a rename operation with a destination path of "o260-6-0"
and a source path of "d1/d2" - this source path is the value we
stored in the reference earlier at step 2), corresponding to the
->full_path member of the reference, however that path is no longer
valid due to the orphanization of the directory inode 259 in step 3).
This makes the receiver fail since the path does not exists, it should
have been "o259-6-0/d2".
Fix this by recomputing the full path of a reference before emitting an
orphanization if we previously orphanized any directory, since that
directory could be a parent in the new path. This is a rare scenario so
keeping it simple and not checking if that previously orphanized directory
is in fact an ancestor of the inode we are trying to orphanize.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.
The following reproducer triggers such scenario:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/a
touch /mnt/sdi/b
mkdir /mnt/sdi/testdir
# We want "a" to have a lower inode number then "testdir" (257 vs 259).
mv /mnt/sdi/a /mnt/sdi/testdir/a
# Filesystem looks like:
#
# . (ino 256)
# |----- testdir/ (ino 259)
# | |----- a (ino 257)
# |
# |----- b (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now rename 259 to "testdir_2", then change the name of 257 to
# "testdir" and make it a direct descendant of the root inode (256).
# Also create a new link for inode 257 with the old name of inode 258.
# By swapping the names and location of several inodes and create a
# nasty dependency chain of rename and link operations.
mv /mnt/sdi/testdir/a /mnt/sdi/a2
touch /mnt/sdi/testdir/a
mv /mnt/sdi/b /mnt/sdi/b2
ln /mnt/sdi/a2 /mnt/sdi/b
mv /mnt/sdi/testdir /mnt/sdi/testdir_2
mv /mnt/sdi/a2 /mnt/sdi/testdir
# Filesystem now looks like:
#
# . (ino 256)
# |----- testdir_2/ (ino 259)
# | |----- a (ino 260)
# |
# |----- testdir (ino 257)
# |----- b (ino 257)
# |----- b2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When running the reproducer, the receive of the incremental send stream
fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: link b -> o259-6-0/a failed: No such file or directory
The problem happens because of the following:
1) Before we start iterating the list of new references for inode 257,
we generate its current path and store it at @valid_path, done at
the very beginning of process_recorded_refs(). The generated path
is "o259-6-0/a", containing the orphanized name for inode 259;
2) Then we iterate over the list of new references, which has the
references "b" and "testdir" in that specific order;
3) We process reference "b" first, because it is in the list before
reference "testdir". We then issue a link operation to create
the new reference "b" using a target path corresponding to the
content at @valid_path, which corresponds to "o259-6-0/a".
However we haven't yet orphanized inode 259, its name is still
"testdir", and not "o259-6-0". The orphanization of 259 did not
happen yet because we will process the reference named "testdir"
for inode 257 only in the next iteration of the loop that goes
over the list of new references.
Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 259ee7754b ("btrfs: tree-checker: Add ROOT_ITEM check")
introduced btrfs root item size check, however btrfs root item has two
versions, the legacy one which just ends before generation_v2 member, is
smaller than current btrfs root item size.
This caused btrfs kernel to reject valid but old tree root leaves.
Fix this problem by also allowing legacy root item, since kernel can
already handle them pretty well and upgrade to newer root item format
when needed.
Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Fixes: 259ee7754b ("btrfs: tree-checker: Add ROOT_ITEM check")
CC: stable@vger.kernel.org # 5.4+
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
pointer assignment but we should use the helpers for unaligned access
for clarity. It hasn't been a problem so far because of the natural
alignment.
Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
from stack that has an aligned start but some members may not be aligned
due to packing. This as well hasn't caused problems so far.
Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.
Signed-off-by: David Sterba <dsterba@suse.com>
The free space inode stores the tracking data, checksums etc, using the
io_ctl structure and moving the pointers. The data are generally aligned
to at least 4 bytes (u32 for CRC) so it's not completely unaligned but
for clarity we should use the proper helpers whenever a struct is
initialized from io_ctl->cur pointer.
Signed-off-by: David Sterba <dsterba@suse.com>
The header is mapped onto the send buffer and thus its members may be
potentially unaligned so use the helpers instead of directly assigning
the pointers. This has worked so far but let's use the helpers to make
that clear.
Signed-off-by: David Sterba <dsterba@suse.com>
Btree inode is special compared to all other inode extent io_trees,
although it has a btrfs inode, it doesn't have the track_uptodate bit at
all.
This means a lot of things like extent locking doesn't even need to be
applied to btree io tree.
Since it's so special, adds a new owner value for it to make debuging a
little easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace kvzalloc() call with kvcalloc() that also checks the size
internally. There's a standalone overflow check in the function so we
can return invalid parameter combination. Use array_size() helper to
compute the memory size for clone_sources_tmp.
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
The code was accidentally replaced with kzalloc() call [1]. Restore
the original code by using kvzalloc() to allocate sctx->clone_roots.
[1] https://patchwork.kernel.org/patch/9757891/#20529627
Fixes: 818e010bf9 ("btrfs: replace opencoded kvzalloc with the helper")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The passed in ordered_extent struct is always well-formed and contains
the inode making the explicit argument redundant.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used to reference the csum root which can be done from the trans
handle as well. Simplify the signature and while at it also remove the
noinline attribute as the function uses only at most 16 bytes of stack
space.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This makes reading the code a tad easier by decreasing the level of
indirection by one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 by the 2 callers so move it inside __do_readpage.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 by its sole caller - btrfs_readpage. Simply remove
it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 from the sole caller - btrfs_readpage.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_readpage is the only caller of extent_read_full_page the
latter can be open coded in the former. Use the occassion to rename
__extent_read_full_page to extent_read_full_page. To facillitate this
change submit_one_bio has to be exported as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's called only from btrfs_readpage which always passes 0 so just sink
the argument into extent_read_full_page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that this function is only responsible for reading data pages it's
no longer necessary to pass get_extent_t parameter across several
layers of functions. This patch removes this parameter from multiple
functions: __get_extent_map/__do_readpage/__extent_read_full_page/
extent_read_full_page and simply calls btrfs_get_extent directly in
__get_extent_map.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The sole purpose of this function was to satisfy the requirements of
__do_readpage. Since that function is no longer used to read metadata
pages the need to keep btree_get_extent around has also disappeared.
Simply remove it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Metadata pages currently use __do_readpage to read metadata pages,
unfortunately this function is also used to deal with ordinary data
pages. This makes the metadata pages reading code to go through multiple
hoops in order to adhere to __do_readpage invariants. Most of these are
necessary for data pages which could be compressed. For metadata it's
enough to simply build a bio and submit it.
To this effect simply call submit_extent_page directly from
read_extent_buffer_pages which is the only callpath used to populate
extent_buffers with data. This in turn enables further cleanups.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no way for this function to be called as ->readpage() since
it's called from
generic_file_buffered_read/filemap_fault/do_read_cache_page/readhead
code. BTRFS doesn't utilize the first 3 for the btree inode and
implements it's owon readhead mechanism. So simply remove the function.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Logging directories with many entries can take a significant amount of
time, and in some cases monopolize a cpu/core for a long time if the
logging task doesn't happen to block often enough.
Johannes and Lu Fengqi reported test case generic/041 triggering a soft
lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
case we log an inode with 3002 hard links, and because the test removed
one hard link before fsyncing the file, the inode logging causes the
parent directory do be logged as well, which has 6004 directory items to
log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
so it can take a significant amount of time and trigger the soft lockup.
So just make tree-log.c:log_dir_items() reschedule when necessary,
releasing the current search path before doing so and then resume from
where it was before the reschedule.
The stack trace produced when the soft lockup happens is the following:
[10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172]
[10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...)
[10480.284915] irq event stamp: 29646366
[10480.285987] hardirqs last enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60
[10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50
[10480.290856] softirqs last enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c
[10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20
[10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248
[10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60
[10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...)
[10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282
[10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006
[10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66
[10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001
[10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000
[10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70
[10480.321104] FS: 00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000
[10480.323559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0
[10480.327259] Call Trace:
[10480.328286] ? overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.329784] __kmalloc+0x831/0xa20
[10480.331009] ? btrfs_get_32+0xb0/0x1d0 [btrfs]
[10480.332464] overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.333948] log_dir_items+0x2ee/0x570 [btrfs]
[10480.335413] log_directory_changes+0x82/0xd0 [btrfs]
[10480.336926] btrfs_log_inode+0xc9b/0xda0 [btrfs]
[10480.338374] ? init_once+0x20/0x20 [btrfs]
[10480.339711] btrfs_log_inode_parent+0x8d3/0xd10 [btrfs]
[10480.341257] ? dget_parent+0x97/0x2e0
[10480.342480] btrfs_log_dentry_safe+0x3a/0x50 [btrfs]
[10480.343977] btrfs_sync_file+0x24b/0x5e0 [btrfs]
[10480.345381] do_fsync+0x38/0x70
[10480.346483] __x64_sys_fsync+0x10/0x20
[10480.347703] do_syscall_64+0x2d/0x70
[10480.348891] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10480.350444] RIP: 0033:0x7fa1dc80970b
[10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...)
[10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b
[10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003
[10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40
[10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001
[10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60
Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
CC: stable@vger.kernel.org # 4.4+
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While running xfstests btrfs/177 I got the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #5 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_dir_ns+0x7a/0xb0
sysfs_create_dir_ns+0x60/0xb0
kobject_add_internal+0xc0/0x2c0
kobject_add+0x6e/0x90
btrfs_sysfs_add_block_group_type+0x102/0x160
btrfs_make_block_group+0x167/0x230
btrfs_alloc_chunk+0x54f/0xb80
btrfs_chunk_alloc+0x18e/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_new_inode+0x225/0x730
btrfs_create+0xab/0x1f0
lookup_open.isra.0+0x52d/0x690
path_openat+0x2a7/0x9e0
do_filp_open+0x75/0x100
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
do_unlinkat+0x1a9/0x2b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because when we link in a block group with a new raid index
type we'll create the corresponding sysfs entries for it. This is
problematic because while restriping we're holding the chunk_mutex, and
while mounting we're holding the tree locks.
Fixing this isn't pretty, we move the call to the sysfs stuff into the
btrfs_create_pending_block_groups() work, where we're not holding any
locks. This creates a slight race where other threads could see that
there's no sysfs kobj for that raid type, and race to create the
sysfs dir. Fix this by wrapping the creation in space_info->lock, so we
only get one thread calling kobject_add() for the new directory. We
don't worry about the lock on cleanup as it only gets deleted on
unmount.
On mount it's more straightforward, we loop through the space_infos
already, just check every raid index in each space_info and added the
sysfs entries for the corresponding block groups.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this thing wrapped in an RCU lock, but it's really not needed.
We create all the space_info's on mount, and we destroy them on unmount.
The list never changes and we're protected from messing with it by the
normal mount/umount path, so kill the RCU stuff around it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reword and update formats to match variable types.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update formats ]
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That parameter can easily be derived based on the "data_size" and "nr"
parameters exploit this fact to simply the function's signature. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of this argument can be derived from the total_data as it's
simply the value of the data size + size of btrfs_items being touched.
Move the parameter calculation inside the function. This results in a
simpler interface and also a minor size reduction:
./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
Function old new delta
btrfs_duplicate_item 260 259 -1
setup_items_for_insert 1200 1190 -10
btrfs_insert_empty_items 177 154 -23
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rearrange statements calculating the offset of the newly added items so
that the calculation has to be done only once. No functional change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reports the latest send stream version supported by the kernel as
the feature in /sys/fs/btrfs/features/send_stream_version .
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
send_write_or_clone() basically has an open-coded copy of
btrfs_file_extent_end() except that it (incorrectly) aligns to PAGE_SIZE
instead of sectorsize. Fix and simplify the code by using
btrfs_file_extent_end().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
send_write() currently copies from the page cache to sctx->read_buf, and
then from sctx->read_buf to sctx->send_buf. Similarly, send_hole()
zeroes sctx->read_buf and then copies from sctx->read_buf to
sctx->send_buf. However, if we write the TLV header manually, we can
copy to sctx->send_buf directly and get rid of sctx->read_buf.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
send_write()/fill_read_buf() have some logic for avoiding reading past
i_size. However, everywhere that we call
send_write()/send_extent_data(), we've already clamped the length down
to i_size. Get rid of the i_size handling, which simplifies the next
change.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we use the same mechanism to replace all the extents in a file
range with either a hole, an existing extent (when cloning) or a new
extent (when using fallocate), the name of btrfs_insert_clone_extent()
no longer reflects its genericity.
So rename it to btrfs_insert_replace_extent(), since what it does is
to either insert an existing extent or a new extent into a file range.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_punch_hole_range() is now used to replace all the file
extents in a given file range with an extent described in the given struct
btrfs_replace_extent_info argument. This extent can either be an existing
extent that is being cloned or it can be a new extent (namely a prealloc
extent). When that argument is NULL it only punches a hole (drops all the
existing extents) in the file range.
So rename the function to btrfs_replace_file_extents().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we can use btrfs_clone_extent_info to convey information for a
new prealloc extent as well, and not just for existing extents that are
being cloned, rename it to btrfs_replace_extent_info, which reflects the
fact that this is now more generic and it is used to replace all existing
extents in a file range with the extent described by the structure.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of item_size of struct btrfs_clone_extent_info is always set to
the size of a non-inline file extent item, and in fact the infrastructure
that uses this structure (btrfs_punch_hole_range()) does not work with
inline file extents at all (and it is not supposed to).
So just remove that field from the structure and use directly
sizeof(struct btrfs_file_extent_item) instead. Also assert that the
file extent type is not inline at btrfs_insert_clone_extent().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an fallocate(), specially a zero range operation, we assume
that reserving 3 units of metadata space is enough, that at most we touch
one leaf in subvolume/fs tree for removing existing file extent items and
inserting a new file extent item. This assumption is generally true for
most common use cases. However when we end up needing to remove file extent
items from multiple leaves, we can end up failing with -ENOSPC and abort
the current transaction, turning the filesystem to RO mode. When this
happens a stack trace like the following is dumped in dmesg/syslog:
[ 1500.620934] ------------[ cut here ]------------
[ 1500.620938] BTRFS: Transaction aborted (error -28)
[ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
[ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1
[ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.621026] Code: 8b 40 50 f0 48 (...)
[ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
[ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
[ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
[ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
[ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
[ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
[ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
[ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1500.621049] Call Trace:
[ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs]
[ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs]
[ 1500.621108] vfs_fallocate+0x14d/0x290
[ 1500.621112] ksys_fallocate+0x3a/0x70
[ 1500.621117] __x64_sys_fallocate+0x1a/0x20
[ 1500.621120] do_syscall_64+0x33/0x80
[ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1500.621126] RIP: 0033:0x7fb5b248c477
[ 1500.621128] Code: 89 7c 24 08 (...)
[ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
[ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
[ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
[ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
[ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
[ 1500.621151] irq event stamp: 1026217
[ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
[ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
[ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
[ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
[ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
[ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
When we use fallocate() internally, for reserving an extent for a space
cache, inode cache or relocation, we can't hit this problem since either
there aren't any file extent items to remove from the subvolume tree or
there is at most one.
When using plain fallocate() it's very unlikely, since that would require
having many file extent items representing holes for the target range and
crossing multiple leafs - we attempt to increase the range (merge) of such
file extent items when punching holes, so at most we end up with 2 file
extent items for holes at leaf boundaries.
However when using the zero range operation of fallocate() for a large
range (100+ MiB for example) that's fairly easy to trigger. The following
example reproducer triggers the issue:
$ cat reproducer.sh
#!/bin/bash
umount /dev/sdj &> /dev/null
mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
mount /dev/sdj /mnt/sdj
# Create a 100M file with many file extent items. Punch a hole every 8K
# just to speedup the file creation - we could do 4K sequential writes
# followed by fsync (or O_SYNC) as well, but that takes a lot of time.
file_size=$((100 * 1024 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
for ((i = 0; i < $file_size; i += 8192)); do
xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
done
# Force a transaction commit, so the zero range operation will be forced
# to COW all metadata extents it need to touch.
sync
xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
umount /mnt/sdj
$ ./reproducer.sh
wrote 104857600/104857600 bytes at offset 0
100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
fallocate: No space left on device
$ dmesg
<shows the same stack trace pasted before>
To fix this use the existing infrastructure that hole punching and
extent cloning use for replacing a file range with another extent. This
deals with doing the removal of file extent items and inserting the new
one using an incremental approach, reserving more space when needed and
always ensuring we don't leave an implicit hole in the range in case
we need to do multiple iterations and a crash happens between iterations.
A test case for fstests will follow up soon.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is not used since commit 0096420adb ("btrfs: do not
account global reserve in can_overcommit").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is short and simple, we can get rid of the declaration as
it's not necessary for a static function. Move it before its first
caller. No functional changes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not have a common exit block and returns immediatelly
so there's no point having the goto. Remove the two cases.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can check the argument value directly, no need for the temporary
variable.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the function btrfs_init_dev_replace_tgtdev(), the local variable
devices is used only once, we can remove it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a mounted sprout filesystem, all threads now are using the
sprout::device_list_mutex, and this is the only code using the
seed::device_list_mutex. This patch converts to use the sprouts
fs_info->fs_devices->device_list_mutex.
The same reasoning holds true here, that device delete is holding
the sprout::device_list_mutex.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>