linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-19 13:46:46 +07:00

Author	SHA1	Message	Date
Qu Wenruo	71f572a9e8	btrfs: reloc: use btrfs_backref_iter infrastructure In the core function of relocation, build_backref_tree, it needs to iterate all backref items of one tree block. Use btrfs_backref_iter infrastructure to do the loop and make the code more readable. The backref items look would be much more easier to read: ret = btrfs_backref_iter_start(iter, cur->bytenr); for (; ret == 0; ret = btrfs_backref_iter_next(iter)) { /* The really important work */ } Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:16 +02:00
Qu Wenruo	c39c2ddc67	btrfs: backref: implement btrfs_backref_iter_next() This function will go to the next inline/keyed backref for btrfs_backref_iter infrastructure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:16 +02:00
Qu Wenruo	a37f232b7b	btrfs: backref: introduce the skeleton of btrfs_backref_iter Due to the complex nature of btrfs extent tree, when we want to iterate all backrefs of one extent, this involves quite a lot of work, like searching the EXTENT_ITEM/METADATA_ITEM, iteration through inline and keyed backrefs. Normally this would result in a complex code, something like: btrfs_search_slot() /* Ensure we are at EXTENT_ITEM/METADATA_ITEM / while (1) { / Loop for extent tree items / while (ptr < end) { / Loop for inlined items / / Real work here / } next: ret = btrfs_next_item() / Ensure we're still at keyed item for specified bytenr / } The idea of btrfs_backref_iter is to avoid such complex and hard to read code structure, but something like the following: iter = btrfs_backref_iter_alloc(); ret = btrfs_backref_iter_start(iter, bytenr); if (ret < 0) goto out; for (; ; ret = btrfs_backref_iter_next(iter)) { / Real work here */ } out: btrfs_backref_iter_free(iter); This patch is just the skeleton + btrfs_backref_iter_start() code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:16 +02:00
Jules Irenge	78d933c79c	btrfs: add missing annotation for btrfs_tree_lock() Sparse reports a warning at btrfs_tree_lock() warning: context imbalance in btrfs_tree_lock() - wrong count at exit The root cause is the missing annotation at btrfs_tree_lock() Add the missing __acquires(&eb->lock) annotation Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:16 +02:00
Jules Irenge	c142c6a449	btrfs: add missing annotation for btrfs_lock_cluster() Sparse reports a warning at btrfs_lock_cluster() warning: context imbalance in btrfs_lock_cluster() - wrong count The root cause is the missing annotation at btrfs_lock_cluster() Add the missing __acquires(&cluster->refill_lock) annotation. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-05-25 11:25:16 +02:00
Arnd Bergmann	9c6c723f48	btrfs: fix gcc-4.8 build warning for struct initializer Some older compilers like gcc-4.8 warn about mismatched curly braces in a initializer: fs/btrfs/backref.c: In function 'is_shared_data_backref': fs/btrfs/backref.c:394:9: error: missing braces around initializer [-Werror=missing-braces] struct prelim_ref target = {0}; ^ fs/btrfs/backref.c:394:9: error: (near initialization for 'target.rbnode') [-Werror=missing-braces] Use the GNU empty initializer extension to avoid this. Fixes: `ed58f2e66e` ("btrfs: backref, don't add refs from shared block when resolving normal backref") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-30 12:17:49 +02:00
Qu Wenruo	fcc99734d1	btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info [BUG] One run of btrfs/063 triggered the following lockdep warning: ============================================ WARNING: possible recursive locking detected 5.6.0-rc7-custom+ #48 Not tainted -------------------------------------------- kworker/u24:0/7 is trying to acquire lock: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] but task is already holding lock: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sb_internal#2); lock(sb_internal#2); * DEADLOCK * May be due to missing lock nesting notation 4 locks held by kworker/u24:0/7: #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80 #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80 #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs] #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs] stack backtrace: CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] Call Trace: dump_stack+0xc2/0x11a __lock_acquire.cold+0xce/0x214 lock_acquire+0xe6/0x210 __sb_start_write+0x14e/0x290 start_transaction+0x66c/0x890 [btrfs] btrfs_join_transaction+0x1d/0x20 [btrfs] find_free_extent+0x1504/0x1a50 [btrfs] btrfs_reserve_extent+0xd5/0x1f0 [btrfs] btrfs_alloc_tree_block+0x1ac/0x570 [btrfs] btrfs_copy_root+0x213/0x580 [btrfs] create_reloc_root+0x3bd/0x470 [btrfs] btrfs_init_reloc_root+0x2d2/0x310 [btrfs] record_root_in_trans+0x191/0x1d0 [btrfs] btrfs_record_root_in_trans+0x90/0xd0 [btrfs] start_transaction+0x16e/0x890 [btrfs] btrfs_join_transaction+0x1d/0x20 [btrfs] btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs] finish_ordered_fn+0x15/0x20 [btrfs] btrfs_work_helper+0x116/0x9a0 [btrfs] process_one_work+0x632/0xb80 worker_thread+0x80/0x690 kthread+0x1a3/0x1f0 ret_from_fork+0x27/0x50 It's pretty hard to reproduce, only one hit so far. [CAUSE] This is because we're calling btrfs_join_transaction() without re-using the current running one: btrfs_finish_ordered_io() \|- btrfs_join_transaction() <<< Call #1 \|- btrfs_record_root_in_trans() \|- btrfs_reserve_extent() \|- btrfs_join_transaction() <<< Call #2 Normally such btrfs_join_transaction() call should re-use the existing one, without trying to re-start a transaction. But the problem is, in btrfs_join_transaction() call #1, we call btrfs_record_root_in_trans() before initializing current::journal_info. And in btrfs_join_transaction() call #2, we're relying on current::journal_info to avoid such deadlock. [FIX] Call btrfs_record_root_in_trans() after we have initialized current::journal_info. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-27 17:16:07 +02:00
Filipe Manana	f135cea30d	btrfs: fix partial loss of prealloc extent past i_size after fsync When we have an inode with a prealloc extent that starts at an offset lower than the i_size and there is another prealloc extent that starts at an offset beyond i_size, we can end up losing part of the first prealloc extent (the part that starts at i_size) and have an implicit hole if we fsync the file and then have a power failure. Consider the following example with comments explaining how and why it happens. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt # Create our test file with 2 consecutive prealloc extents, each with a # size of 128Kb, and covering the range from 0 to 256Kb, with a file # size of 0. $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo $ xfs_io -c "falloc -k 128K 128K" /mnt/foo # Fsync the file to record both extents in the log tree. $ xfs_io -c "fsync" /mnt/foo # Now do a redudant extent allocation for the range from 0 to 64Kb. # This will merely increase the file size from 0 to 64Kb. Instead we # could also do a truncate to set the file size to 64Kb. $ xfs_io -c "falloc 0 64K" /mnt/foo # Fsync the file, so we update the inode item in the log tree with the # new file size (64Kb). This also ends up setting the number of bytes # for the first prealloc extent to 64Kb. This is done by the truncation # at btrfs_log_prealloc_extents(). # This means that if a power failure happens after this, a write into # the file range 64Kb to 128Kb will not use the prealloc extent and # will result in allocation of a new extent. $ xfs_io -c "fsync" /mnt/foo # Now set the file size to 256K with a truncate and then fsync the file. # Since no changes happened to the extents, the fsync only updates the # i_size in the inode item at the log tree. This results in an implicit # hole for the file range from 64Kb to 128Kb, something which fsck will # complain when not using the NO_HOLES feature if we replay the log # after a power failure. $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo So instead of always truncating the log to the inode's current i_size at btrfs_log_prealloc_extents(), check first if there's a prealloc extent that starts at an offset lower than the i_size and with a length that crosses the i_size - if there is one, just make sure we truncate to a size that corresponds to the end offset of that prealloc extent, so that we don't lose the part of that extent that starts at i_size if a power failure happens. A test case for fstests follows soon. Fixes: `31d11b83b9` ("Btrfs: fix duplicate extents after fsync of file with prealloc extents") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-27 17:16:07 +02:00
Xiyu Yang	1402d17dfd	btrfs: fix transaction leak in btrfs_recover_relocation btrfs_recover_relocation() invokes btrfs_join_transaction(), which joins a btrfs_trans_handle object into transactions and returns a reference of it with increased refcount to "trans". When btrfs_recover_relocation() returns, "trans" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in one exception handling path of btrfs_recover_relocation(). When read_fs_root() failed, the refcnt increased by btrfs_join_transaction() is not decreased, causing a refcnt leak. Fix this issue by calling btrfs_end_transaction() on this error path when read_fs_root() failed. Fixes: `79787eaab4` ("btrfs: replace many BUG_ONs with proper error handling") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-23 17:24:56 +02:00
Xiyu Yang	f6033c5e33	btrfs: fix block group leak when removing fails btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which returns a local reference of the block group that contains the given bytenr to "block_group" with increased refcount. When btrfs_remove_block_group() returns, "block_group" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in several exception handling paths of btrfs_remove_block_group(). When those error scenarios occur such as btrfs_alloc_path() returns NULL, the function forgets to decrease its refcnt increased by btrfs_lookup_block_group() and will cause a refcnt leak. Fix this issue by jumping to "out_put_group" label and calling btrfs_put_block_group() when those error scenarios occur. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-23 17:24:52 +02:00
Josef Bacik	ef67963dac	btrfs: drop logs when we've aborted a transaction Dave reported a problem where we were panicing with generic/475 with misc-5.7. This is because we were doing IO after we had stopped all of the worker threads, because we do the log tree cleanup on roots at drop time. Cleaning up the log tree will always need to do reads if we happened to have evicted the blocks from memory. Because of this simply add a helper to btrfs_cleanup_transaction() that will go through and drop all of the log roots. This gets run before we do the close_ctree() work, and thus we are allowed to do any reads that we would need. I ran this through many iterations of generic/475 with constrained memory and I did not see the issue. general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 12359 Comm: umount Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_queue_work+0x33/0x1c0 [btrfs] RSP: 0018:ffff9cfb015937d8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8eb5e339ed80 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff8eb5eb33b770 RDI: ffff8eb5e37a0460 RBP: ffff8eb5eb33b770 R08: 000000000000020c R09: ffffffff9fc09ac0 R10: 0000000000000007 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b R13: ffff9cfb00229040 R14: 0000000000000008 R15: ffff8eb5d3868000 FS: 00007f167ea022c0(0000) GS:ffff8eb5fae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167e5e0cb1 CR3: 0000000138c18004 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_end_bio+0x81/0x130 [btrfs] __split_and_process_bio+0xaf/0x4e0 [dm_mod] ? percpu_counter_add_batch+0xa3/0x120 dm_process_bio+0x98/0x290 [dm_mod] ? generic_make_request+0xfb/0x410 dm_make_request+0x4d/0x120 [dm_mod] ? generic_make_request+0xfb/0x410 generic_make_request+0x12a/0x410 ? submit_bio+0x38/0x160 submit_bio+0x38/0x160 ? percpu_counter_add_batch+0xa3/0x120 btrfs_map_bio+0x289/0x570 [btrfs] ? kmem_cache_alloc+0x24d/0x300 btree_submit_bio_hook+0x79/0xc0 [btrfs] submit_one_bio+0x31/0x50 [btrfs] read_extent_buffer_pages+0x2fe/0x450 [btrfs] btree_read_extent_buffer_pages+0x7e/0x170 [btrfs] walk_down_log_tree+0x343/0x690 [btrfs] ? walk_log_tree+0x3d/0x380 [btrfs] walk_log_tree+0xf7/0x380 [btrfs] ? plist_requeue+0xf0/0xf0 ? delete_node+0x4b/0x230 free_log_tree+0x4c/0x130 [btrfs] ? wait_log_commit+0x140/0x140 [btrfs] btrfs_free_log+0x17/0x30 [btrfs] btrfs_drop_and_free_fs_root+0xb0/0xd0 [btrfs] btrfs_free_fs_roots+0x10c/0x190 [btrfs] ? do_raw_spin_unlock+0x49/0xc0 ? _raw_spin_unlock+0x29/0x40 ? release_extent_buffer+0x121/0x170 [btrfs] close_ctree+0x289/0x2e6 [btrfs] generic_shutdown_super+0x6c/0x110 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x3a/0x70 Reported-by: David Sterba <dsterba@suse.com> Fixes: `8c38938c7b` ("btrfs: move the root freeing stuff into btrfs_put_root") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-23 17:23:11 +02:00
Filipe Manana	5150bf1963	btrfs: fix memory leak of transaction when deleting unused block group When cleaning pinned extents right before deleting an unused block group, we check if there's still a previous transaction running and if so we increment its reference count before using it for cleaning pinned ranges in its pinned extents iotree. However we ended up never decrementing the reference count after using the transaction, resulting in a memory leak. Fix it by decrementing the reference count. Fixes: `fe119a6eeb` ("btrfs: switch to per-transaction pinned extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-23 17:22:45 +02:00
Nishad Kamdar	317ddf3715	btrfs: discard: Use the correct style for SPDX License Identifier This patch corrects the SPDX License Identifier style in header file related to Btrfs File System support. For C header files Documentation/process/license-rules.rst mandates C-like comments (opposed to C source files where C++ style should be used). Changes made by using a script provided by Joe Perches here: https://lkml.org/lkml/2019/2/7/46. Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-20 17:43:42 +02:00
Josef Bacik	aec7db3b13	btrfs: fix setting last_trans for reloc roots I made a mistake with my previous fix, I assumed that we didn't need to mess with the reloc roots once we were out of the part of relocation where we are actually moving the extents. The subtle thing that I missed is that btrfs_init_reloc_root() also updates the last_trans for the reloc root when we do btrfs_record_root_in_trans() for the corresponding fs_root. I've added a comment to make sure future me doesn't make this mistake again. This showed up as a WARN_ON() in btrfs_copy_root() because our last_trans didn't == the current transid. This could happen if we snapshotted a fs root with a reloc root after we set rc->create_reloc_tree = 0, but before we actually merge the reloc root. Worth mentioning that the regression produced the following warning when running snapshot creation and balance in parallel: BTRFS info (device sdc): relocating block group 30408704 flags metadata\|dup ------------[ cut here ]------------ WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs] CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs] RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202 RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48 RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818 RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002 R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818 R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00 FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? create_reloc_root+0x49/0x2b0 [btrfs] ? kmem_cache_alloc_trace+0xe5/0x200 create_reloc_root+0x8b/0x2b0 [btrfs] btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs] create_pending_snapshot+0x610/0x1010 [btrfs] create_pending_snapshots+0xa8/0xd0 [btrfs] btrfs_commit_transaction+0x4c7/0xc50 [btrfs] ? btrfs_mksubvol+0x3cd/0x560 [btrfs] btrfs_mksubvol+0x455/0x560 [btrfs] __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs] btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs] ? mem_cgroup_commit_charge+0x6e/0x540 btrfs_ioctl+0x12d8/0x3760 [btrfs] ? do_raw_spin_unlock+0x49/0xc0 ? _raw_spin_unlock+0x29/0x40 ? __handle_mm_fault+0x11b3/0x14b0 ? ksys_ioctl+0x92/0xb0 ksys_ioctl+0x92/0xb0 ? trace_hardirqs_off_thunk+0x1a/0x1c __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fdeabd3bdd7 Fixes: `2abc726ab4` ("btrfs: do not init a reloc root if we aren't relocating") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-17 15:20:08 +02:00
Filipe Manana	d611add48b	btrfs: fix reclaim counter leak of space_info objects Whenever we add a ticket to a space_info object we increment the object's reclaim_size counter witht the ticket's bytes, and we decrement it with the corresponding amount only when we are able to grant the requested space to the ticket. When we are not able to grant the space to a ticket, or when the ticket is removed due to a signal (e.g. an application has received sigterm from the terminal) we never decrement the counter with the corresponding bytes from the ticket. This leak can result in the space reclaim code to later do much more work than necessary. So fix it by decrementing the counter when those two cases happen as well. Fixes: `db161806dc` ("btrfs: account ticket size at add/delete time") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-08 19:11:19 +02:00
Filipe Manana	7af597433d	btrfs: make full fsyncs always operate on the entire file again This is a revert of commit `0a8068a3dd` ("btrfs: make ranged full fsyncs more efficient"), with updated comment in btrfs_sync_file. Commit `0a8068a3dd` ("btrfs: make ranged full fsyncs more efficient") made full fsyncs operate on the given range only as it assumed it was safe when using the NO_HOLES feature, since the hole detection was simplified some time ago and no longer was a source for races with ordered extent completion of adjacent file ranges. However it's still not safe to have a full fsync only operate on the given range, because extent maps for new extents might not be present in memory due to inode eviction or extent cloning. Consider the following example: 1) We are currently at transaction N; 2) We write to the file range [0, 1MiB); 3) Writeback finishes for the whole range and ordered extents complete, while we are still at transaction N; 4) The inode is evicted; 5) We open the file for writing, causing the inode to be loaded to memory again, which sets the 'full sync' bit on its flags. At this point the inode's list of modified extent maps is empty (figuring out which extents were created in the current transaction and were not yet logged by an fsync is expensive, that's why we set the 'full sync' bit when loading an inode); 6) We write to the file range [512KiB, 768KiB); 7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB). This correctly flushes this range and logs its extent into the log tree. When the writeback started an extent map for range [512KiB, 768KiB) was added to the inode's list of modified extents, and when the fsync() finishes logging it removes that extent map from the list of modified extent maps. This fsync also clears the 'full sync' bit; 8) We do a regular fsync() (full ranged). This fsync() ends up doing nothing because the inode's list of modified extents is empty and no other changes happened since the previous ranged fsync(), so it just returns success (0) and we end up never logging extents for the file ranges [0, 512KiB) and [768KiB, 1MiB). Another scenario where this can happen is if we replace steps 2 to 4 with cloning from another file into our test file, as that sets the 'full sync' bit in our inode's flags and does not populate its list of modified extent maps. This was causing test case generic/457 to fail sporadically when using the NO_HOLES feature, as it exercised this later case where the inode has the 'full sync' bit set and has no extent maps in memory to represent the new extents due to extent cloning. Fix this by reverting commit `0a8068a3dd` ("btrfs: make ranged full fsyncs more efficient") since there is no easy way to work around it. Fixes: `0a8068a3dd` ("btrfs: make ranged full fsyncs more efficient") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-08 19:10:52 +02:00
Filipe Manana	4fdb688c70	btrfs: fix lost i_size update after cloning inline extent When not using the NO_HOLES feature we were not marking the destination's file range as written after cloning an inline extent into it. This can lead to a data loss if the current destination file size is smaller than the source file's size. Example: $ mkfs.btrfs -f -O ^no-holes /dev/sdc $ mount /mnt/sdc /mnt $ echo "hello world" > /mnt/foo $ cp --reflink=always /mnt/foo /mnt/bar $ rm -f /mnt/foo $ umount /mnt $ mount /mnt/sdc /mnt $ cat /mnt/bar $ $ stat -c %s /mnt/bar 0 # -> the file is empty, since we deleted foo, the data lost is forever Fix that by calling btrfs_inode_set_file_extent_range() after cloning an inline extent. A test case for fstests will follow soon. Link: https://lore.kernel.org/linux-btrfs/20200404193846.GA432065@latitude/ Reported-by: Johannes Hirte <johannes.hirte@datenkhaos.de> Fixes: `9ddc959e80` ("btrfs: use the file extent tree infrastructure") Tested-by: Johannes Hirte <johannes.hirte@datenkhaos.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-08 19:10:34 +02:00
Josef Bacik	4d4225fc22	btrfs: check commit root generation in should_ignore_root Previously we would set the reloc root's last snapshot to transid - 1. However there was a problem with doing this, and we changed it to setting the last snapshot to the generation of the commit node of the fs root. This however broke should_ignore_root(). The assumption is that if we are in a generation newer than when the reloc root was created, then we would find the reloc root through normal backref lookups, and thus can ignore any fs roots we find with an old enough reloc root. Now that the last snapshot could be considerably further in the past than before, we'd end up incorrectly ignoring an fs root. Thus we'd find no nodes for the bytenr we were searching for, and we'd fail to relocate anything. We'd loop through the relocate code again and see that there were still used space in that block group, attempt to relocate those bytenr's again, fail in the same way, and just loop like this forever. This is tricky in that we have to not modify the fs root at all during this time, so we need to have a block group that has data in this fs root that is not shared by any other root, which is why this has been difficult to reproduce. Fixes: `054570a1dc` ("Btrfs: fix relocation incorrectly dropping data references") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-04-08 19:10:31 +02:00
Robbie Ko	6ff06729c2	btrfs: fix missing semaphore unlock in btrfs_sync_file Ordered ops are started twice in sync file, once outside of inode mutex and once inside, taking the dio semaphore. There was one error path missing the semaphore unlock. Fixes: `aab15e8ec2` ("Btrfs: fix rare chances for data loss when doing a fast fsync") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ add changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-25 16:29:16 +01:00
Josef Bacik	351cbf6e44	btrfs: use nofs allocations for running delayed items Zygo reported the following lockdep splat while testing the balance patches ====================================================== WARNING: possible circular locking dependency detected 5.6.0-c6f0579d496a+ #53 Not tainted ------------------------------------------------------ kswapd0/1133 is trying to acquire lock: ffff888092f622c0 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node+0x7c/0x5b0 but task is already holding lock: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}: fs_reclaim_acquire.part.91+0x29/0x30 fs_reclaim_acquire+0x19/0x20 kmem_cache_alloc_trace+0x32/0x740 add_block_entry+0x45/0x260 btrfs_ref_tree_mod+0x6e2/0x8b0 btrfs_alloc_tree_block+0x789/0x880 alloc_tree_block_no_bg_flush+0xc6/0xf0 __btrfs_cow_block+0x270/0x940 btrfs_cow_block+0x1ba/0x3a0 btrfs_search_slot+0x999/0x1030 btrfs_insert_empty_items+0x81/0xe0 btrfs_insert_delayed_items+0x128/0x7d0 __btrfs_run_delayed_items+0xf4/0x2a0 btrfs_run_delayed_items+0x13/0x20 btrfs_commit_transaction+0x5cc/0x1390 insert_balance_item.isra.39+0x6b2/0x6e0 btrfs_balance+0x72d/0x18d0 btrfs_ioctl_balance+0x3de/0x4c0 btrfs_ioctl+0x30ab/0x44a0 ksys_ioctl+0xa1/0xe0 __x64_sys_ioctl+0x43/0x50 do_syscall_64+0x77/0x2c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&delayed_node->mutex){+.+.}: __lock_acquire+0x197e/0x2550 lock_acquire+0x103/0x220 __mutex_lock+0x13d/0xce0 mutex_lock_nested+0x1b/0x20 __btrfs_release_delayed_node+0x7c/0x5b0 btrfs_remove_delayed_node+0x49/0x50 btrfs_evict_inode+0x6fc/0x900 evict+0x19a/0x2c0 dispose_list+0xa0/0xe0 prune_icache_sb+0xbd/0xf0 super_cache_scan+0x1b5/0x250 do_shrink_slab+0x1f6/0x530 shrink_slab+0x32e/0x410 shrink_node+0x2a5/0xba0 balance_pgdat+0x4bd/0x8a0 kswapd+0x35a/0x800 kthread+0x1e9/0x210 ret_from_fork+0x3a/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&delayed_node->mutex); lock(fs_reclaim); lock(&delayed_node->mutex); * DEADLOCK * 3 locks held by kswapd0/1133: #0: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8fc380d8 (shrinker_rwsem){++++}, at: shrink_slab+0x1e8/0x410 #2: ffff8881e0e6c0e8 (&type->s_umount_key#42){++++}, at: trylock_super+0x1b/0x70 stack backtrace: CPU: 2 PID: 1133 Comm: kswapd0 Not tainted 5.6.0-c6f0579d496a+ #53 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xc1/0x11a print_circular_bug.isra.38.cold.57+0x145/0x14a check_noncircular+0x2a9/0x2f0 ? print_circular_bug.isra.38+0x130/0x130 ? stack_trace_consume_entry+0x90/0x90 ? save_trace+0x3cc/0x420 __lock_acquire+0x197e/0x2550 ? btrfs_inode_clear_file_extent_range+0x9b/0xb0 ? register_lock_class+0x960/0x960 lock_acquire+0x103/0x220 ? __btrfs_release_delayed_node+0x7c/0x5b0 __mutex_lock+0x13d/0xce0 ? __btrfs_release_delayed_node+0x7c/0x5b0 ? __asan_loadN+0xf/0x20 ? pvclock_clocksource_read+0xeb/0x190 ? __btrfs_release_delayed_node+0x7c/0x5b0 ? mutex_lock_io_nested+0xc20/0xc20 ? __kasan_check_read+0x11/0x20 ? check_chain_key+0x1e6/0x2e0 mutex_lock_nested+0x1b/0x20 ? mutex_lock_nested+0x1b/0x20 __btrfs_release_delayed_node+0x7c/0x5b0 btrfs_remove_delayed_node+0x49/0x50 btrfs_evict_inode+0x6fc/0x900 ? btrfs_setattr+0x840/0x840 ? do_raw_spin_unlock+0xa8/0x140 evict+0x19a/0x2c0 dispose_list+0xa0/0xe0 prune_icache_sb+0xbd/0xf0 ? invalidate_inodes+0x310/0x310 super_cache_scan+0x1b5/0x250 do_shrink_slab+0x1f6/0x530 shrink_slab+0x32e/0x410 ? do_shrink_slab+0x530/0x530 ? do_shrink_slab+0x530/0x530 ? __kasan_check_read+0x11/0x20 ? mem_cgroup_protected+0x13d/0x260 shrink_node+0x2a5/0xba0 balance_pgdat+0x4bd/0x8a0 ? mem_cgroup_shrink_node+0x490/0x490 ? _raw_spin_unlock_irq+0x27/0x40 ? finish_task_switch+0xce/0x390 ? rcu_read_lock_bh_held+0xb0/0xb0 kswapd+0x35a/0x800 ? _raw_spin_unlock_irqrestore+0x4c/0x60 ? balance_pgdat+0x8a0/0x8a0 ? finish_wait+0x110/0x110 ? __kasan_check_read+0x11/0x20 ? __kthread_parkme+0xc6/0xe0 ? balance_pgdat+0x8a0/0x8a0 kthread+0x1e9/0x210 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x50 This is because we hold that delayed node's mutex while doing tree operations. Fix this by just wrapping the searches in nofs. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-25 16:26:00 +01:00
Takashi Iwai	abdd9feb45	btrfs: sysfs: Use scnprintf() instead of snprintf() snprintf() is a hard-to-use function, and it's especially difficult to use it properly for concatenating substrings in a buffer with a limited size. Since snprintf() returns the would-be-output size, not the actual size, the subsequent use of snprintf() may point to the incorrect position easily. Also, returning the value from snprintf() directly to sysfs show function would pass a bogus value that is higher than the actually truncated string. That said, although the current code doesn't actually overflow the buffer with PAGE_SIZE, it's a usage that shouldn't be done. Or it's worse; this gives a wrong confidence as if it were doing safe operations. This patch replaces such snprintf() calls with a safer version, scnprintf(). It returns the actual output size, hence it's more intuitive and the code does what's expected. Signed-off-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 18:14:47 +01:00
Josef Bacik	39dba8739c	btrfs: do not resolve backrefs for roots that are being deleted Zygo reported a deadlock where a task was stuck in the inode logical resolve code. The deadlock looks like this Task 1 btrfs_ioctl_logical_to_ino ->iterate_inodes_from_logical ->iterate_extent_inodes ->path->search_commit_root isn't set, so a transaction is started ->resolve_indirect_ref for a root that's being deleted ->search for our key, attempt to lock a node, DEADLOCK Task 2 btrfs_drop_snapshot ->walk down to a leaf, lock it, walk up, lock node ->end transaction ->start transaction -> wait_cur_trans Task 3 btrfs_commit_transaction ->wait_event(cur_trans->write_wait, num_writers == 1) DEADLOCK We are holding a transaction open in btrfs_ioctl_logical_to_ino while we try to resolve our references. btrfs_drop_snapshot() holds onto its locks while it stops and starts transaction handles, because it assumes nobody is going to touch the root now. Commit just does what commit does, waiting for the writers to finish, blocking any new trans handles from starting. Fix this by making the backref code not try to resolve backrefs of roots that are currently being deleted. This will keep us from walking into a snapshot that's currently being deleted. This problem was harder to hit before because we rarely broke out of the snapshot delete halfway through, but with my delayed ref throttling code it happened much more often. However we've always been able to do this, so it's not a new problem. Fixes: `8da6d5815c` ("Btrfs: added btrfs_find_all_roots()") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:51 +01:00
Josef Bacik	ea287ab157	btrfs: track reloc roots based on their commit root bytenr We always search the commit root of the extent tree for looking up back references, however we track the reloc roots based on their current bytenr. This is wrong, if we commit the transaction between relocating tree blocks we could end up in this code in build_backref_tree if (key.objectid == key.offset) { /* * Only root blocks of reloc trees use backref * pointing to itself. */ root = find_reloc_root(rc, cur->bytenr); ASSERT(root); cur->root = root; break; } find_reloc_root() is looking based on the bytenr we had in the commit root, but if we've COWed this reloc root we will not find that bytenr, and we will trip over the ASSERT(root). Fix this by using the commit_root->start bytenr for indexing the commit root. Then we change the __update_reloc_root() caller to be used when we switch the commit root for the reloc root during commit. This fixes the panic I was seeing when we started throttling relocation for delayed refs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:51 +01:00
Josef Bacik	50dbbb71c7	btrfs: restart relocate_tree_blocks properly There are two bugs here, but fixing them independently would just result in pain if you happened to bisect between the two patches. First is how we handle the -EAGAIN from relocate_tree_block(). We don't set error, unless we happen to be the first node, which makes no sense, I have no idea what the code was trying to accomplish here. We in fact _do_ want err set here so that we know we need to restart in relocate_block_group(). Also we need finish_pending_nodes() to not actually call link_to_upper(), because we didn't actually relocate the block. And then if we do get -EAGAIN we do not want to set our backref cache last_trans to the one before ours. This would force us to update our backref cache if we didn't cross transaction ids, which would mean we'd have some nodes updated to their new_bytenr, but still able to find their old bytenr because we're searching the same commit root as the last time we went through relocate_tree_blocks. Fixing these two things keeps us from panicing when we start breaking out of relocate_tree_blocks() either for delayed ref flushing or enospc. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:51 +01:00
Josef Bacik	5f6b2e5cd6	btrfs: reloc: reorder reservation before root selection Since we're not only checking for metadata reservations but also if we need to throttle our delayed ref generation, reorder reserve_metadata_space() above the select_one_root() call in relocate_tree_block(). The reason we want this is because select_reloc_root() will mess with the backref cache, and if we're going to bail we want to be able to cleanly remove this node from the backref cache and come back along to regenerate it. Move it up so this is the first thing we do to make restarting cleaner. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:50 +01:00
Josef Bacik	d7ff00f608	btrfs: do not readahead in build_backref_tree Here we are just searching down to the bytenr we're building the backref tree for, and all of it's paths to the roots. These bytenrs are not guaranteed to be anywhere near each other, so readahead just generates extra latency. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:50 +01:00
Josef Bacik	cd22a51c66	btrfs: do not use readahead for running delayed refs Readahead will generate a lot of extra reads for adjacent nodes, but when running delayed refs we have no idea if the next ref is going to be adjacent or not, so this potentially just generates a lot of extra IO. To make matters worse each ref is truly just looking for one item, it doesn't generally search forward, so we simply don't need it here. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:50 +01:00
Nikolay Borisov	9babda9f33	btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot With BTRFS_SUBVOL_CREATE_ASYNC support remove it's no longer required to pass the async_transid parameter so remove it and any code using it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:02:00 +01:00
Nikolay Borisov	5d54c67ecc	btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid btrfs_ioctl_snap_create_transid no longer takes a transid argument, so remove it and rename the function to __btrfs_ioctl_snap_create to reflect it's an internal, worker function. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:02:00 +01:00
Nikolay Borisov	9c1036fdb1	btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support This functionality was deprecated in kernel 5.4. Since no one has complained of the impending removal it's time we did so. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:02:00 +01:00
Josef Bacik	c75e839414	btrfs: kill the subvol_srcu Now that we have proper root ref counting everywhere we can kill the subvol_srcu. * removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes * the refcount_t used for the references checks for accidental 0->1 in cases where the root lifetime would not be properly protected * there's a leak detector for roots to catch unfreed roots at umount time * SRCU served us well over the years but is was not a proper synchronization mechanism for some cases Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:02:00 +01:00
Josef Bacik	efc3453494	btrfs: make btrfs_cleanup_fs_roots use the radix tree lock The radix root is primarily protected by the fs_roots_radix_lock, so use that to lookup and get a ref on all of our fs roots in btrfs_cleanup_fs_roots. The tree reference is taken in the protected section as before. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	4785e24fa5	btrfs: don't take an extra root ref at allocation time Now that all the users of roots take references for them we can drop the extra root ref we've been taking. Before we had roots at 2 refs for the life of the file system, one for the radix tree, and one simply for existing. Now that we have proper ref accounting in all places that use roots we can drop this extra ref simply for existing as we no longer need it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	dc9492c14c	btrfs: hold a ref on the root on the dead roots list At the point we add a root to the dead roots list we have no open inodes for that root, so we need to hold a ref on that root to keep it from disappearing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	5c8fd99fec	btrfs: make inodes hold a ref on their roots If we make sure all the inodes have refs on their root we don't have to worry about the root disappearing while we have open inodes. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	8c38938c7b	btrfs: move the root freeing stuff into btrfs_put_root There are a few different ways to free roots, either you allocated them yourself and you just do free_extent_buffer(root->node); free_extent_buffer(root->commit_node); btrfs_put_root(root); Which is the pattern for log roots. Or for snapshots/subvolumes that are being dropped you simply call btrfs_free_fs_root() which does all the cleanup for you. Unify this all into btrfs_put_root(), so that we don't free up things associated with the root until the last reference is dropped. This makes the root freeing code much more significant. The only caveat is at close_ctree() time we have to free the extent buffers for all of our main roots (extent_root, chunk_root, etc) because we have to drop the btree_inode and we'll run into issues if we hold onto those nodes until ->kill_sb() time. This will be addressed in the future when we kill the btree_inode. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	0e996e7fcf	btrfs: move ino_cache_inode dropping out of btrfs_free_fs_root We are going to make root life be controlled soley by refcounting, and inodes will be one of the things that hold a ref on the root. This means we need to handle dropping the ino_cache_inode outside of the root freeing logic, so move it into btrfs_drop_and_free_fs_root() so it is cleaned up properly on unmount. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	3fd6372758	btrfs: make the extent buffer leak check per fs info I'm going to make the entire destruction of btrfs_root's controlled by their refcount, so it will be helpful to notice if we're leaking their eb's on umount. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	7b7b74315b	btrfs: remove a BUG_ON() from merge_reloc_roots() This was pretty subtle, we default to reloc roots having 0 root refs, so if we crash in the middle of the relocation they can just be deleted. If we successfully complete the relocation operations we'll set our root refs to 1 in prepare_to_merge() and then go on to merge_reloc_roots(). At prepare_to_merge() time if any of the reloc roots have a 0 reference still, we will remove that reloc root from our reloc root rb tree, and then clean it up later. However this only happens if we successfully start a transaction. If we've aborted previously we will skip this step completely, and only have reloc roots with a reference count of 0, but were never properly removed from the reloc control's rb tree. This isn't a problem per-se, our references are held by the list the reloc roots are on, and by the original root the reloc root belongs to. If we end up in this situation all the reloc roots will be added to the dirty_reloc_list, and then properly dropped at that point. The reloc control will be free'd and the rb tree is no longer used. There were two options when fixing this, one was to remove the BUG_ON(), the other was to make prepare_to_merge() handle the case where we couldn't start a trans handle. IMO this is the cleaner solution. I started with handling the error in prepare_to_merge(), but it turned out super ugly. And in the end this BUG_ON() simply doesn't matter, the cleanup was happening properly, we were just panicing because this BUG_ON() only matters in the success case. So I've opted to just remove it and add a comment where it was. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	f44deb7442	btrfs: hold a ref on the root->reloc_root We previously were relying on root->reloc_root to be cleaned up by the drop snapshot, or the error handling. However if btrfs_drop_snapshot() failed it wouldn't drop the ref for the root. Also we sort of depend on the right thing to happen with moving reloc roots between lists and the fs root they belong to, which makes it hard to figure out who owns the reference. Fix this by explicitly holding a reference on the reloc root for roo->reloc_root. This means that we hold two references on reloc roots, one for whichever reloc_roots list it's attached to, and the root->reloc_root we're on. This makes it easier to reason out who owns a reference on the root, and when it needs to be dropped. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	f28de8d8fd	btrfs: clear DEAD_RELOC_TREE before dropping the reloc root The DEAD_RELOC_TREE flag is in place in order to avoid a use after free in init_reloc_root, tracking the presence of reloc_root. However adding the explicit tree references in previous patches makes the use after free impossible because at this point we no longer have a reloc_control set on the fs_info and thus cannot enter the function. So move this to be coupled with clearing the root->reloc_root so we're consistent with all other operations of the reloc root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	1a0afa0ecf	btrfs: free the reloc_control in a consistent way If we have an error while processing the reloc roots we could leak roots that were added to rc->reloc_roots before we hit the error. We could have also not removed the reloc tree mapping from our rb_tree, so clean up any remaining nodes in the reloc root rb_tree. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use rbtree_postorder_for_each_entry_safe ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	2abc726ab4	btrfs: do not init a reloc root if we aren't relocating We previously were checking if the root had a dead root before accessing root->reloc_root in order to avoid a use-after-free type bug. However this scenario happens after we've unset the reloc control, so we would have been saved if we'd simply checked for fs_info->reloc_control. At this point during relocation we no longer need to be creating new reloc roots, so simply move this check above the reloc_root checks to avoid any future races and confusion. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	6217b0fadd	btrfs: reloc: clean dirty subvols if we fail to start a transaction If we do merge_reloc_roots() we could insert a few roots onto the dirty subvol roots list, where we hold a ref on them. If we fail to start the transaction we need to run clean_dirty_subvols() in order to cleanup the refs. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	fb2d83eefe	btrfs: unset reloc control if we fail to recover If we fail to load an fs root, or fail to start a transaction we can bail without unsetting the reloc control, which leads to problems later when we free the reloc control but still have it attached to the file system. In the normal path we'll end up calling unset_reloc_control() twice, but all it does is set fs_info->reloc_control = NULL, and we can only have one balance at a time so it's not racey. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	8e19c9732a	btrfs: drop block from cache on error in relocation If we have an error while building the backref tree in relocation we'll process all the pending edges and then free the node. However if we integrated some edges into the cache we'll lose our link to those edges by simply freeing this node, which means we'll leak memory and references to any roots that we've found. Instead we need to use remove_backref_node(), which walks through all of the edges that are still linked to this node and free's them up and drops any root references we may be holding. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Qu Wenruo	19b546d7a1	btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves In relocation, we need to locate all parent tree leaves referring to one data extent, thus we have a complex mechanism to iterate throught extent tree and subvolume trees to locate the related leaves. However this is already done in backref.c, we have btrfs_find_all_leafs(), which can return a ulist containing all leaves referring to that data extent. Use btrfs_find_all_leafs() to replace find_data_references(). There is a special handling for v1 space cache data extents, where we need to delete the v1 space cache data extents, to avoid those data extents to hang the data relocation. In this patch, the special handling is done by re-iterating the root tree leaf. Although it's a little less efficient than the old handling, considering we can reuse a lot of code, it should be acceptable. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	b39c8f5a39	btrfs: fix ref-verify to catch operations on 0 ref extents While debugging I noticed I wasn't getting ref verify errors before everything blew up. Turns out it's because we don't warn when we try to add a normal ref via btrfs_inc_ref() if the block entry exists but has 0 references. This is incorrect, we should never be doing anything other than adding a new extent once a block entry drops to 0 references. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Filipe Manana	0a8068a3dd	btrfs: make ranged full fsyncs more efficient Commit `0c713cbab6` ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges") fixed a bug where we could end up with file extent items in a log tree that represent file ranges that overlap due to a race between the hole detection of a ranged full fsync and writeback for a different file range. The problem was solved by forcing any ranged full fsync to become a non-ranged full fsync - setting the range start to 0 and the end offset to LLONG_MAX. This was a simple solution because the code that detected and marked holes was very complex, it used to be done at copy_items() and implied several searches on the fs/subvolume tree. The drawback of that solution was that we started to flush delalloc for the entire file and wait for all the ordered extents to complete for ranged full fsyncs (including ordered extents covering ranges completely outside the given range). Fortunatelly ranged full fsyncs are not the most common case (hopefully for most workloads). However a later fix for detecting and marking holes was made by commit `0e56315ca1` ("Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES") and it simplified a lot the detection of holes, and now copy_items() no longer does it and we do it in a much more simple way at btrfs_log_holes(). This makes it now possible to simply make the code that detects holes to operate only on the initial range and no longer need to operate on the whole file, while also avoiding the need to flush delalloc for the entire file and wait for ordered extents that cover ranges that don't overlap the given range. Another special care is that we must skip file extent items that fall entirely outside the fsync range when copying inode items from the fs/subvolume tree into the log tree - this is to avoid races with ordered extent completion for extents falling outside the fsync range, which could cause us to end up with file extent items in the log tree that have overlapping ranges - for example if the fsync range is [1Mb, 2Mb], when we copy inode items we could copy an extent item for the range [0, 512K], then release the search path and before moving to the next leaf, an ordered extent for a range of [256Kb, 512Kb] completes - this would cause us to copy the new extent item for range [256Kb, 512Kb] into the log tree after we have copied one for the range [0, 512Kb] - the extents overlap, resulting in a corruption. So this change just does these steps: 1) When the NO_HOLES feature is enabled it leaves the initial range intact - no longer sets it to [0, LLONG_MAX] when the full sync bit is set in the inode. If NO_HOLES is not enabled, always set the range to a full, just like before this change, to avoid missing file extent items representing holes after replaying the log (for both full and fast fsyncs); 2) Make the hole detection code to operate only on the fsync range; 3) Make the code that copies items from the fs/subvolume tree to skip copying file extent items that cover a range completely outside the range of the fsync. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Filipe Manana	da447009a2	btrfs: factor out inode items copy loop from btrfs_log_inode() The function btrfs_log_inode() is quite large and so is its loop which iterates the inode items from the fs/subvolume tree and copies them into a log tree. Because this is a large loop inside a very large function and because an upcoming patch in this series needs to add some more logic inside that loop, move the loop into a helper function to make it a bit more manageable. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Filipe Manana	a5eeb3d17b	btrfs: add helper to get the end offset of a file extent item Getting the end offset for a file extent item requires a bit of code since the extent can be either inline or regular/prealloc. There are some places all over the code base that open code this logic and in another patch later in this series it will be needed again. Therefore encapsulate this logic in a helper function and use it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Filipe Manana	95418ed1d1	btrfs: fix missing file extent item for hole after ranged fsync When doing a fast fsync for a range that starts at an offset greater than zero, we can end up with a log that when replayed causes the respective inode miss a file extent item representing a hole if we are not using the NO_HOLES feature. This is because for fast fsyncs we don't log any extents that cover a range different from the one requested in the fsync. Example scenario to trigger it: $ mkfs.btrfs -O ^no-holes -f /dev/sdd $ mount /dev/sdd /mnt # Create a file with a single 256K and fsync it to clear to full sync # bit in the inode - we want the msync below to trigger a fast fsync. $ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo # Force a transaction commit and wipe out the log tree. $ sync # Dirty 768K of data, increasing the file size to 1Mb, and flush only # the range from 256K to 512K without updating the log tree # (sync_file_range() does not trigger fsync, it only starts writeback # and waits for it to finish). $ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo $ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo # Now dirty the range from 768K to 1M again and sync that range. $ xfs_io -c "mmap -w 768K 256K" \ -c "mwrite -S 0xef 768K 256K" \ -c "msync -s 768K 256K" \ -c "munmap" \ /mnt/foo <power fail> # Mount to replay the log. $ mount /dev/sdd /mnt $ umount /mnt $ btrfs check /dev/sdd Opening filesystem to check... Checking filesystem on /dev/sdd UUID: 482fb574-b288-478e-a190-a9c44a78fca6 [1/7] checking root items [2/7] checking extents [3/7] checking free space cache [4/7] checking fs roots root 5 inode 257 errors 100, file extent discount Found file extent holes: start: 262144, len: 524288 ERROR: errors found in fs roots found 720896 bytes used, error(s) found total csum bytes: 512 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123514 file data blocks allocated: 589824 referenced 589824 Fix this issue by setting the range to full (0 to LLONG_MAX) when the NO_HOLES feature is not enabled. This results in extra work being done but it gives the guarantee we don't end up with missing holes after replaying the log. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Nikolay Borisov	db161806dc	btrfs: account ticket size at add/delete time Instead of iterating all pending tickets on the normal/priority list to sum their total size the cost can be amortized across ticket addition/ removal. This turns O(n) + O(m) (where n is the size of the normal list and m of the priority list) into O(1). This will mostly have effect in workloads that experience heavy flushing. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Roman Gushchin	f8e6608180	btrfs: implement migratepage callback for data pages Currently btrfs doesn't provide a migratepage callback for data pages. It means that fallback_migrate_page() is used to migrate btrfs pages. fallback_migrate_page() cannot move dirty pages, instead it tries to flush them (in sync mode) or just fails (in async mode). In the sync mode pages which are scheduled to be processed by btrfs_writepage_fixup_worker() can't be effectively flushed by the migration code, because there is no established way to wait for the completion of the delayed work. It all leads to page migration failures. To fix it the patch implements a btrs-specific migratepage callback, which is similar to iomap_migrate_page() used by some other fs, except it does take care of the PagePrivate2 flag which is used for data ordering purposes. Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Nikolay Borisov	0078a9f941	btrfs: Remove block_rsv parameter from btrfs_drop_snapshot It's no longer used following `30d40577e3` ("btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()"), so just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Nikolay Borisov	63f018be57	btrfs: Remove __ prefix from btrfs_block_rsv_release Currently the non-prefixed version is a simple wrapper used to hide the 4th argument of the prefixed version. This doesn't bring much value in practice and only makes the code harder to follow by adding another level of indirection. Rectify this by removing the __ prefix and have only one public function to release bytes from a block reservation. No semantic changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Qu Wenruo	f31ea0888c	btrfs: relocation: Check cancel request after each extent found When relocating data block groups with tons of small extents, or large metadata block groups, there can be over 200,000 extents. We will iterate all extents of such block group in relocate_block_group(), where iteration itself can be kinda time-consuming. So when user want to cancel the balance, the extent iteration loop can be another target. This patch will add the cancelling check in the extent iteration loop of relocate_block_group() to make balance cancelling faster. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Qu Wenruo	7f913c7cfe	btrfs: relocation: Check cancel request after each data page read When relocating a data extents with large large data extents, we spend most of our time in relocate_file_extent_cluster() at stage "moving data extents": 1) \| btrfs_relocate_block_group [btrfs]() { 1) \| relocate_file_extent_cluster [btrfs]() { 1) $ 6586769 us \| } 1) + 18.260 us \| relocate_file_extent_cluster [btrfs](); 1) + 15.770 us \| relocate_file_extent_cluster [btrfs](); 1) $ 8916340 us \| } 1) \| btrfs_relocate_block_group [btrfs]() { 1) \| relocate_file_extent_cluster [btrfs]() { 1) $ 11611586 us \| } 1) + 16.930 us \| relocate_file_extent_cluster [btrfs](); 1) + 15.870 us \| relocate_file_extent_cluster [btrfs](); 1) $ 14986130 us \| } To make data relocation cancelling quicker, add extra balance cancelling check after each page read in relocate_file_extent_cluster(). Cleanup and error handling uses the same mechanism as if the whole process finished Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Qu Wenruo	726a342120	btrfs: relocation: add error injection points for cancelling balance Introduce a new error injection point, should_cancel_balance(). It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but allows us to override the return value. Currently there are only one locations using this function: - btrfs_balance() It checks cancel before each block group. There are other locations checking fs_info->balance_cancel_req, but they are not used as an indicator to exit, so there is no need to use the wrapper. But there will be more locations coming, and some locations can cause kernel panic if not handled properly. So introduce this error injection to provide better test interface. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Filipe Manana	05a5a7621c	Btrfs: implement full reflink support for inline extents There are a few cases where we don't allow cloning an inline extent into the destination inode, returning -EOPNOTSUPP to user space. This was done to prevent several types of file corruption and because it's not very straightforward to deal with these cases, as they can't rely on simply copying the inline extent between leaves. Such cases require copying the inline extent's data into the respective page of the destination inode. Not supporting these cases makes it harder and more cumbersome to write applications/libraries that work on any filesystem with reflink support, since all these cases for which btrfs fails with -EOPNOTSUPP work just fine on xfs for example. These unsupported cases are also not documented anywhere and explaining which exact cases fail require a bit of too technical understanding of btrfs's internal (inline extents and when and where can they exist in a file), so it's not really user friendly. Also some test cases from fstests that use fsx, such as generic/522 for example, can sporadically fail because they trigger one of these cases, and fsx expects all operations to succeed. This change adds supports for cloning all these cases by copying the inline extent's data into the respective page of the destination inode. With this change test case btrfs/112 from fstests fails because it expects some clone operations to fail, so it will be updated. Also a new test case that exercises all these previously unsupported cases will be added to fstests. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Filipe Manana	a61e1e0df9	Btrfs: simplify inline extent handling when doing reflinks We can not reflink parts of an inline extent, we must always reflink the whole inline extent. We know that inline extents always start at file offset 0 and that can never represent an amount of data larger then the filesystem's sector size (both compressed and uncompressed). We also have had the constraints that reflink operations must have a start offset that is aligned to the sector size and an end offset that is also aligned or it ends the inode's i_size, so there's no way for user space to be able to do a reflink operation that will refer to only a part of an inline extent. Initially there was a bug in the inlining code that could allow compressed inline extents that encoded more than 1 page, but that was fixed in 2008 by commit `70b99e6959` ("Btrfs: Compression corner fixes") since that was problematic. So remove all the extent cloning code that deals with the possibility of cloning only partial inline extents. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Filipe Manana	6a17738100	Btrfs: move all reflink implementation code into its own file The reflink code is quite large and has been living in ioctl.c since ever. It has grown over the years after many bug fixes and improvements, and since I'm planning on making some further improvements on it, it's time to get it better organized by moving into its own file, reflink.c (similar to what xfs does for example). This change only moves the code out of ioctl.c into the new file, it doesn't do any other change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Gustavo A. R. Silva	a8753ee3a8	btrfs: scrub: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero." [1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Gustavo A. R. Silva	7593f4c53c	btrfs: rcu-string: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero." [1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Gustavo A. R. Silva	17b238acf7	btrfs: delayed-inode: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero." [1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Madhuparna Bhowmik	29566c9c77	btrfs: add RCU locks around block group initialization The space_info list is normally RCU protected and should be traversed with rcu_read_lock held. There's a warning [29.104756] WARNING: suspicious RCU usage [29.105046] 5.6.0-rc4-next-20200305 #1 Not tainted [29.105231] ----------------------------- [29.105401] fs/btrfs/block-group.c:2011 RCU-list traversed in non-reader section!! pointing out that the locking is missing in btrfs_read_block_groups. However this is not necessary as the list traversal happens at mount time when there's no other thread potentially accessing the list. To fix the warning and for consistency let's add the RCU lock/unlock, the code won't be affected much as it's doing some lightweight operations. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Nikolay Borisov	65cd6d9e30	btrfs: Open code insert_extent_backref No need to add a level of indirection for hiding a simple 'if'. Open code insert_extent_backref in its sole caller. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Nikolay Borisov	c6600d9ac6	btrfs: Remove impossible BUG_ON in get_tree_block_key relocate_tree_blocks calls get_tree_block_key for a block iff that block has its ->key_ready equal false. Thus the BUG_ON in the latter function cannot ever be triggered so remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
David Sterba	5ba366c399	btrfs: balance: factor out convert profile validation The validation follows the same steps for all three block group types, the existing helper validate_convert_profile can be enhanced and do more of the common things. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
David Sterba	c67b38925b	btrfs: return void from csum_tree_block Now that csum_tree_block is not returning any errors, we can make csum_tree_block return void and simplify callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
David Sterba	e9be5a303d	btrfs: simplify tree block checksumming loop Thw whole point of csum_tree_block is to iterate over all extent buffer pages and pass it to checksumming functions. The bytes where checksum is stored must be skipped, thus map_private_extent_buffer. This complicates further offset calculations. As the first page will be always present, checksum the relevant bytes unconditionally and then do a simple iteration over the remaining pages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
David Sterba	59a0fcdb48	btrfs: inline checksum name and driver definitions There's an unnecessary indirection in the checksum definition table, pointer and the string itself. The strings are short and the overall size of one entry is now 24 bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
Nikolay Borisov	11c67b1a40	btrfs: Rename __btrfs_alloc_chunk to btrfs_alloc_chunk Having btrfs_alloc_chunk doesn't bring any value since it encapsulates a lockdep assert and a call to find_next_chunk. Simply rename the internal __btrfs_alloc_chunk function to the public one and remove it's 2nd parameter as all callers always pass the return value of find_next_chunk. Finally, migrate the call to lockdep_assert_held so as to not lose the check. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
Josef Bacik	fa121a26b2	btrfs: fix btrfs_calc_reclaim_metadata_size calculation I noticed while running my snapshot torture test that we were getting a lot of metadata chunks allocated with very little actually used. Digging into this we would commit the transaction, still not have enough space, and then force a chunk allocation. I noticed that we were barely flushing any delalloc at all, despite the fact that we had around 13gib of outstanding delalloc reservations. It turns out this is because of our btrfs_calc_reclaim_metadata_size() calculation. It _only_ takes into account the outstanding ticket sizes, which isn't the whole story. In this particular workload we're slowly filling up the disk, which means our overcommit space will suddenly become a lot less, and our outstanding reservations will be well more than what we can handle. However we are only flushing based on our ticket size, which is much less than we need to actually reclaim. So fix btrfs_calc_reclaim_metadata_size() to take into account the overage in the case that we've gotten less available space suddenly. This makes it so we attempt to reclaim a lot more delalloc space, which allows us to make our reservations and we no longer are allocating a bunch of needless metadata chunks. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
Filipe Manana	f0cc2cd701	Btrfs: fix crash during unmount due to race with delayed inode workers During unmount we can have a job from the delayed inode items work queue still running, that can lead to at least two bad things: 1) A crash, because the worker can try to create a transaction just after the fs roots were freed; 2) A transaction leak, because the worker can create a transaction before the fs roots are freed and just after we committed the last transaction and after we stopped the transaction kthread. A stack trace example of the crash: [79011.691214] kernel BUG at lib/radix-tree.c:982! [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2 (...) [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs] [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170 (...) [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293 [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0 [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0 [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000 [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001 [79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000 [79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0 [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [79011.715448] Call Trace: [79011.715925] record_root_in_trans+0x72/0xf0 [btrfs] [79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs] [79011.717925] start_transaction+0xdd/0x5c0 [btrfs] [79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs] [79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs] [79011.720773] process_one_work+0x26d/0x6a0 [79011.721497] worker_thread+0x4f/0x3e0 [79011.722153] ? process_one_work+0x6a0/0x6a0 [79011.722901] kthread+0x103/0x140 [79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70 [79011.724379] ret_from_fork+0x3a/0x50 (...) The following diagram shows a sequence of steps that lead to the crash during ummount of the filesystem: CPU 1 CPU 2 CPU 3 btrfs_punch_hole() btrfs_btree_balance_dirty() btrfs_balance_delayed_items() --> sees fs_info->delayed_root->items with value 200, which is greater than BTRFS_DELAYED_BACKGROUND (128) and smaller than BTRFS_DELAYED_WRITEBACK (512) btrfs_wq_run_delayed_node() --> queues a job for fs_info->delayed_workers to run btrfs_async_run_delayed_root() btrfs_async_run_delayed_root() --> job queued by CPU 1 --> starts picking and running delayed nodes from the prepare_list list close_ctree() btrfs_delete_unused_bgs() btrfs_commit_super() btrfs_join_transaction() --> gets transaction N btrfs_commit_transaction(N) --> set transaction state to TRANTS_STATE_COMMIT_START btrfs_first_prepared_delayed_node() --> picks delayed node X through the prepared_list list btrfs_run_delayed_items() btrfs_first_delayed_node() --> also picks delayed node X but through the node_list list __btrfs_commit_inode_delayed_items() --> runs all delayed items from this node and drops the node's item count to 0 through call to btrfs_release_delayed_inode() --> finishes running any remaining delayed nodes --> finishes transaction commit --> stops cleaner and transaction threads btrfs_free_fs_roots() --> frees all roots and removes them from the radix tree fs_info->fs_roots_radix btrfs_join_transaction() start_transaction() btrfs_record_root_in_trans() record_root_in_trans() radix_tree_tag_set() --> crashes because the root is not in the radix tree anymore If the worker is able to call btrfs_join_transaction() before the unmount task frees the fs roots, we end up leaking a transaction and all its resources, since after the call to btrfs_commit_super() and stopping the transaction kthread, we don't expect to have any transaction open anymore. When this situation happens the worker has a delayed node that has no more items to run, since the task calling btrfs_run_delayed_items(), which is doing a transaction commit, picks the same node and runs all its items first. We can not wait for the worker to complete when running delayed items through btrfs_run_delayed_items(), because we call that function in several phases of a transaction commit, and that could cause a deadlock because the worker calls btrfs_join_transaction() and the task doing the transaction commit may have already set the transaction state to TRANS_STATE_COMMIT_DOING. Also it's not possible to get into a situation where only some of the items of a delayed node are added to the fs/subvolume tree in the current transaction and the remaining ones in the next transaction, because when running the items of a delayed inode we lock its mutex, effectively waiting for the worker if the worker is running the items of the delayed node already. Since this can only cause issues when unmounting a filesystem, fix it in a simple way by waiting for any jobs on the delayed workers queue before calling btrfs_commit_supper() at close_ctree(). This works because at this point no one can call btrfs_btree_balance_dirty() or btrfs_balance_delayed_items(), and if we end up waiting for any worker to complete, btrfs_commit_super() will commit the transaction created by the worker. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	7e89540942	btrfs: factor out prepare_allocation() for extent allocation This function finally factor out prepare_allocation() form find_free_extent(). This function is called before the allocation loop and a specific allocator function like prepare_allocation_clustered() should initialize their private information and can set proper hint_byte to indicate where to start the allocation with. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	45d8e033b2	btrfs: skip LOOP_NO_EMPTY_SIZE if not clustered allocation LOOP_NO_EMPTY_SIZE is solely dedicated for clustered allocation. So, we can skip this stage and give up the allocation. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	c70e2139dc	btrfs: factor out chunk_allocation_failed() for extent allocation Factor out chunk_allocation_failed() from find_free_extent_update_loop(). This function is called when it failed to allocate a chunk. The function can modify "ffe_ctl->loop" and return 0 to continue with the next stage. Or, it can return -ENOSPC to give up here. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	15b7ee6584	btrfs: drop unnecessary arguments from find_free_extent_update_loop() Now that, we don't use last_ptr and use_cluster in the function. Drop these arguments from it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	0ab9724bf5	btrfs: factor out found_extent() for extent allocation Factor out found_extent() from find_free_extent_update_loop(). This function is called when a proper extent is found and before returning from find_free_extent(). Hook functions like found_extent_clustered() should save information for a next allocation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	baba50624f	btrfs: factor out release_block_group() Factor out release_block_group() from find_free_extent(). This function is called when it gives up an allocation from a block group. Each allocation policy should reset its information for an allocation in the next block group. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	897cae7948	btrfs: drop unnecessary arguments from clustered allocation functions Now that, find_free_extent_clustered() and find_free_extent_unclustered() can access "last_ptr" from the "clustered" variable, we can drop it from the arguments. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	c668690dc0	btrfs: factor out do_allocation() for extent allocation Factor out do_allocation() from find_free_extent(). This function do an actual extent allocation in a given block group. The ffe_ctl->policy is used to determine the actual allocator function to use. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	c10859be9b	btrfs: move variables for clustered allocation into find_free_extent_ctl Move "last_ptr" and "use_cluster" into struct find_free_extent_ctl, so that hook functions for clustered allocator can use these variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	ea544149a4	btrfs: move hint_byte into find_free_extent_ctl This commit moves hint_byte into find_free_extent_ctl, so that we can modify the hint_byte in the other functions. This will help us split find_free_extent further. This commit also renames the function argument "hint_byte" to "hint_byte_orig" to avoid misuse. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	cb2f96f8ab	btrfs: introduce extent allocation policy This commit introduces extent allocation policy for btrfs. This policy controls how btrfs allocate an extents from block groups. There is no functional change introduced with this commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	6aafb30384	btrfs: parameterize dev_extent_min for chunk allocation Currently, we ignore a device whose available space is less than "BTRFS_STRIPE_LEN * dev_stripes". This is a lower limit for current allocation policy (to maximize the number of stripes). This commit parameterizes dev_extent_min, so that other policies can set their own lower limitat to ignore a device with insufficient space. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	dce580ca40	btrfs: factor out create_chunk() Factor out create_chunk() from __btrfs_alloc_chunk(). This function finally creates a chunk. There is no functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	5badf512ec	btrfs: factor out decide_stripe_size() Factor out decide_stripe_size() from __btrfs_alloc_chunk(). This function calculates the actual stripe size to allocate. decide_stripe_size() handles the common case to round down the 'ndevs' to 'devs_increment' and check the upper and lower limitation of 'ndevs'. decide_stripe_size_regular() decides the size of a stripe and the size of a chunk. The policy is to maximize the number of stripes. This commit has no functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	560156cb25	btrfs: factor out gather_device_info() Factor out gather_device_info() from __btrfs_alloc_chunk(). This function iterates over devices list and gather information about devices. This commit also introduces "max_avail" and "dev_extent_min" to fold the same calculation to one variable. This commit has no functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	27c314d5ca	btrfs: factor out init_alloc_chunk_ctl Factor out init_alloc_chunk_ctl() from __btrfs_alloc_chunk(). This function initialises parameters of "struct alloc_chunk_ctl" for allocation. init_alloc_chunk_ctl() handles a common part of the initialisation to load the RAID parameters from btrfs_raid_array. init_alloc_chunk_ctl_policy_regular() decides some parameters for its allocation. The last "else" case in the original code is moved to __btrfs_alloc_chunk() to handle the error case in the common code. Replace the BUG_ON with ASSERT() and error return at the same time. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	4f2bafe8a4	btrfs: introduce alloc_chunk_ctl Introduce "struct alloc_chunk_ctl" to wrap needed parameters for the chunk allocation. This will be used to split __btrfs_alloc_chunk() into smaller functions. This commit folds a number of local variables in __btrfs_alloc_chunk() into one "struct alloc_chunk_ctl ctl". There is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	3b4ffa4088	btrfs: refactor find_free_dev_extent_start() Factor out two functions from find_free_dev_extent_start(). dev_extent_search_start() decides the starting position of the search. dev_extent_hole_check() checks if a hole found is suitable for device extent allocation. These functions also have the switch-cases to change the allocation behavior depending on the policy. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	c4a816c67c	btrfs: introduce chunk allocation policy Introduce chunk allocation policy for btrfs. This policy controls how chunks and device extents are allocated from devices. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	b25c19f49e	btrfs: handle invalid profile in chunk allocation Do not BUG_ON() when an invalid profile is passed to __btrfs_alloc_chunk(). Instead return -EINVAL with ASSERT() to catch a bug in the development stage. Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	52d40aba68	btrfs: change full_search to bool in find_free_extent_update_loop While the "full_search" variable defined in find_free_extent() is bool, but the full_search argument of find_free_extent_update_loop() is defined as int. Let's trivially fix the argument type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00
Qu Wenruo	daf475c915	btrfs: qgroup: Remove the unnecesaary spin lock for qgroup_rescan_running After the previous patch, qgroup_rescan_running is protected by btrfs_fs_info::qgroup_rescan_lock, thus no need for the extra spinlock. Suggested-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00
Qu Wenruo	d61acbbf54	btrfs: qgroup: ensure qgroup_rescan_running is only set when the worker is at least queued [BUG] There are some reports about btrfs wait forever to unmount itself, with the following call trace: INFO: task umount:4631 blocked for more than 491 seconds. Tainted: G X 5.3.8-2-default #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0 4631 3337 0x00000000 Call Trace: ([<00000000174adf7a>] __schedule+0x342/0x748) [<00000000174ae3ca>] schedule+0x4a/0xd8 [<00000000174b1f08>] schedule_timeout+0x218/0x420 [<00000000174af10c>] wait_for_common+0x104/0x1d8 [<000003ff804d6994>] btrfs_qgroup_wait_for_completion+0x84/0xb0 [btrfs] [<000003ff8044a616>] close_ctree+0x4e/0x380 [btrfs] [<0000000016fa3136>] generic_shutdown_super+0x8e/0x158 [<0000000016fa34d6>] kill_anon_super+0x26/0x40 [<000003ff8041ba88>] btrfs_kill_super+0x28/0xc8 [btrfs] [<0000000016fa39f8>] deactivate_locked_super+0x68/0x98 [<0000000016fcb198>] cleanup_mnt+0xc0/0x140 [<0000000016d6a846>] task_work_run+0xc6/0x110 [<0000000016d04f76>] do_notify_resume+0xae/0xb8 [<00000000174b30ae>] system_call+0xe2/0x2c8 [CAUSE] The problem happens when we have called qgroup_rescan_init(), but not queued the worker. It can be caused mostly by error handling. Qgroup ioctl thread \| Unmount thread ----------------------------------------+----------------------------------- \| btrfs_qgroup_rescan() \| \|- qgroup_rescan_init() \| \| \|- qgroup_rescan_running = true; \| \| \| \|- trans = btrfs_join_transaction() \| \| Some error happened \| \| \| \|- btrfs_qgroup_rescan() returns error \| But qgroup_rescan_running == true; \| \| close_ctree() \| \|- btrfs_qgroup_wait_for_completion() \| \|- running == true; \| \|- wait_for_completion(); btrfs_qgroup_rescan_worker is never queued, thus no one is going to wake up close_ctree() and we get a deadlock. All involved qgroup_rescan_init() callers are: - btrfs_qgroup_rescan() The example above. It's possible to trigger the deadlock when error happened. - btrfs_quota_enable() Not possible. Just after qgroup_rescan_init() we queue the work. - btrfs_read_qgroup_config() It's possible to trigger the deadlock. It only init the work, the work queueing happens in btrfs_qgroup_rescan_resume(). Thus if error happened in between, deadlock is possible. We shouldn't set fs_info->qgroup_rescan_running just in qgroup_rescan_init(), as at that stage we haven't yet queued qgroup rescan worker to run. [FIX] Set qgroup_rescan_running before queueing the work, so that we ensure the rescan work is queued when we wait for it. Fixes: `8d9eddad19` ("Btrfs: fix qgroup rescan worker initialization") Signed-off-by: Jeff Mahoney <jeffm@suse.com> [ Change subject and cause analyse, use a smaller fix ] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00
Andy Shevchenko	807fc790aa	btrfs: switch to use new generic UUID API There are new types and helpers that are supposed to be used in new code. As a preparation to get rid of legacy types and API functions do the conversion here. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00
Qu Wenruo	b3ff8f1d38	btrfs: Don't submit any btree write bio if the fs has errors [BUG] There is a fuzzed image which could cause KASAN report at unmount time. BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390 Read of size 8 at addr ffff888067cf6848 by task umount/1922 CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x5b/0x8b print_address_description+0x70/0x280 kasan_report+0x13a/0x19b btrfs_queue_work+0x2c1/0x390 btrfs_wq_submit_bio+0x1cd/0x240 btree_submit_bio_hook+0x18c/0x2a0 submit_one_bio+0x1be/0x320 flush_write_bio.isra.41+0x2c/0x70 btree_write_cache_pages+0x3bb/0x7f0 do_writepages+0x5c/0x130 __writeback_single_inode+0xa3/0x9a0 writeback_single_inode+0x23d/0x390 write_inode_now+0x1b5/0x280 iput+0x2ef/0x600 close_ctree+0x341/0x750 generic_shutdown_super+0x126/0x370 kill_anon_super+0x31/0x50 btrfs_kill_super+0x36/0x2b0 deactivate_locked_super+0x80/0xc0 deactivate_super+0x13c/0x150 cleanup_mnt+0x9a/0x130 task_work_run+0x11a/0x1b0 exit_to_usermode_loop+0x107/0x130 do_syscall_64+0x1e5/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 [CAUSE] The fuzzed image has a completely screwd up extent tree: leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938 item 0 key (12587008 168 4096) itemoff 3942 itemsize 53 extent refs 1 gen 9 flags 1 ref#0: extent data backref root 5 objectid 259 offset 0 count 1 item 1 key (12591104 168 8192) itemoff 3889 itemsize 53 extent refs 1 gen 9 flags 1 ref#0: extent data backref root 5 objectid 271 offset 0 count 1 item 2 key (12599296 168 4096) itemoff 3836 itemsize 53 extent refs 1 gen 9 flags 1 ref#0: extent data backref root 5 objectid 259 offset 4096 count 1 item 3 key (29360128 169 0) itemoff 3803 itemsize 33 extent refs 1 gen 9 flags 2 ref#0: tree block backref root 5 item 4 key (29368320 169 1) itemoff 3770 itemsize 33 extent refs 1 gen 9 flags 2 ref#0: tree block backref root 5 item 5 key (29372416 169 0) itemoff 3737 itemsize 33 extent refs 1 gen 9 flags 2 ref#0: tree block backref root 5 Note that leaf 29421568 doesn't have its backref in the extent tree. Thus extent allocator can re-allocate leaf 29421568 for other trees. In short, the bug is caused by: - Existing tree block gets allocated to log tree This got its generation bumped. - Log tree balance cleaned dirty bit of offending tree block It will not be written back to disk, thus no WRITTEN flag. - Original owner of the tree block gets COWed Since the tree block has higher transid, no WRITTEN flag, it's reused, and not traced by transaction::dirty_pages. - Transaction aborted Tree blocks get cleaned according to transaction::dirty_pages. But the offending tree block is not recorded at all. - Filesystem unmount All pages are assumed to be are clean, destroying all workqueue, then call iput(btree_inode). But offending tree block is still dirty, which triggers writeback, and causes use-after-free bug. The detailed sequence looks like this: - Initial status eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8, not traced by any dirty extent_iot_tree. - New tree block is allocated Since there is no backref for 29421568, it's re-allocated as new tree block. Keep in mind that tree block 29421568 is still referred by extent tree. - Tree block 29421568 is filled for log tree eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped) traced by btrfs_root::dirty_log_pages - Some log tree operations Since the fs is using node size 4096, the log tree can easily go a level higher. - Log tree needs balance Tree block 29421568 gets all its content pushed to right, thus now it is empty, and we don't need it. btrfs_clean_tree_block() from __push_leaf_right() get called. eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9 traced by btrfs_root::dirty_log_pages - Log tree write back btree_write_cache_pages() goes through dirty pages ranges, but since page of tree block 29421568 gets cleaned already, it's not written back to disk. Thus it doesn't have WRITTEN bit set. But ranges in dirty_log_pages are cleared. eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9 not traced by any dirty extent_iot_tree. - Extent tree update when committing transaction Since tree block 29421568 has transid equal to running trans, and has no WRITTEN bit, should_cow_block() will use it directly without adding it to btrfs_transaction::dirty_pages. eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 not traced by any dirty extent_iot_tree. At this stage, we're doomed. We have a dirty eb not tracked by any extent io tree. - Transaction gets aborted due to corrupted extent tree Btrfs cleans up dirty pages according to transaction::dirty_pages and btrfs_root::dirty_log_pages. But since tree block 29421568 is not tracked by neither of them, it's still dirty. eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 not traced by any dirty extent_iot_tree. - Filesystem unmount Since all cleanup is assumed to be done, all workqueus are destroyed. Then iput(btree_inode) is called, expecting no dirty pages. But tree 29421568 is still dirty, thus triggering writeback. Since all workqueues are already freed, we cause use-after-free. This shows us that, log tree blocks + bad extent tree can cause wild dirty pages. [FIX] To fix the problem, don't submit any btree write bio if the filesytem has any error. This is the last safe net, just in case other cleanup haven't caught catch it. Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377 CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:46 +01:00

1 2 3 4 5 ...

8892 Commits