linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-11-24 20:30:53 +07:00

Author	SHA1	Message	Date
Nikolay Borisov	c7ee1819dc	btrfs: make btrfs_submit_compressed_write take btrfs_inode Majority of its uses are for btrfs_inode so take it as an argument directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	4cc612090b	btrfs: make btrfs_add_ordered_extent_compress take btrfs_inode It simpy forwards its inode argument to __btrfs_add_ordered_extent which already takes btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	6e26c44223	btrfs: make cow_file_range take btrfs_inode All its children functions take btrfs_inode so convert it to taking btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	e7fbf60453	btrfs: make btrfs_add_ordered_extent take btrfs_inode Preparation to converting its callers to taking btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:31 +02:00
Nikolay Borisov	a0349401c1	btrfs: make cow_file_range_inline take btrfs_inode It has only 2 uses for the vfs_inode - insert_inline_extent and i_size_read. On the flipside it will allow converting its callers to btrfs_inode, so convert it to taking btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	8b8a979f1f	btrfs: make btrfs_qgroup_free_data take btrfs_inode It passes btrfs_inode to its callee so change the interface. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	8769af96cf	btrfs: make __btrfs_qgroup_release_data take btrfs_inode It uses vfs_inode only for a tracepoint so convert its interface to take btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	df2cfd131f	btrfs: make qgroup_free_reserved_data take btrfs_inode It only uses btrfs_inode so can just as easily take it as an argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	f0cdd15c21	btrfs: tracepoints: convert flush states to using EM macros Only 6 out of all flush states were being printed correctly since only they were exported via the TRACE_DEFINE_ENUM macro. This patch converts all flush states to use the newly introduced EM macro so that they can all be printed correctly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	c92bb3046f	btrfs: tracepoints: switch extent_io_tree_owner to using EM macro This fixes correct pint out of the extent io tree owner in btrfs_set_extent_bit/btrfs_clear_extent_bit/btrfs_convert_extent_bit tracepoints. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:30 +02:00
Nikolay Borisov	1cb1f0b248	btrfs: tracepoints: fix qgroup reservation type printing Since qgroup's reservation types are define in a macro they must be exported to user space in order for user space tools to convert raw binary data to symbolic names. Currently trace-cmd report produces the following output: kworker/u8:2-459 [003] 1208.543587: qgroup_update_reserve: 2b742cae-e0e5-4def-9ef7-28a9b34a951e: qgid=5 type=0x2 cur_reserved=54870016 diff=-32768 With this fix the output is: kworker/u8:2-459 [003] 1208.543587: qgroup_update_reserve: 2b742cae-e0e5-4def-9ef7-28a9b34a951e: qgid=5 type=BTRFS_QGROUP_RSV_META_PREALLOC cur_reserved=54870016 diff=-32768 Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:29 +02:00
Nikolay Borisov	5bca2c952c	btrfs: tracepoints: move FLUSH_ACTIONS define Since all enums used in btrfs' tracepoints are going to be redefined to allow proper parsing of their values by userspace tools let's rearrange when they are defined. This will allow to use only a single set of #define EM/#undef EM sequence. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:29 +02:00
Nikolay Borisov	0840dd28b5	btrfs: tracepoints: fix extent type symbolic name print extent's type is an enum and this requires that the enum values be exported to user space so that user space tools can correctly map raw binary data to the symbolic name. Currently tracepoints using btrfs__file_extent_item_regular or btrfs__file_extent_item_inline result in the following output: fio-443 [002] 586.609450: btrfs_get_extent_show_fi_regular: f0c3bf8e-0174-4bcc-92aa-6c2d62430420:i root=5(FS_TREE) inode=258 size=2136457216 disk_isize=0 file extent range=[2126946304 2136457216] (num_bytes=9510912 ram_bytes=9510912 disk_bytenr=0 disk_num_bytes=0 extent_offset=0 type=0x1 compression=0 E.g type is 0x1 . With this patch applie the output is: <ommitted for brevity> disk_bytenr=141348864 disk_num_bytes=4096 extent_offset=0 type=REG compression=0 Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:29 +02:00
Nikolay Borisov	45e31869cc	btrfs: tracepoints: fix btrfs_trigger_flush symbolic string for flags When tracepoints use __print_symbolic to print textual representation of a value that comes from an ENUM each enum value needs to be exported to user space so that user space tools can convert the binary value data to the trings as user space does not know what those enums are about. Doing a trace-cmd record && trace-cmd report currently results in: kworker/u8:1-61 [000] 66.299527: btrfs_flush_space: 5302ee13-c65e-45bb-98ef-8fe3835bd943: state=3(0x3) flags=4(METADATA) num_bytes=2621440 ret=0 I.e state is not translated to its symbolic counterpart. With this patch applied the output is: fio-370 [002] 56.762402: btrfs_trigger_flush: d04cd7ac-38e2-452f-a7f5-8157529fd5f0: preempt: flush=3(BTRFS_RESERVE_FLUSH_ALL) flags=4(METADATA) bytes=655360 See also `190f0b76ca` ("mm: tracing: Export enums in tracepoints to user space"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:29 +02:00
David Sterba	3502a8c0dc	btrfs: allow use of global block reserve for balance item deletion On a filesystem with exhausted metadata, but still enough to start balance, it's possible to hit this error: [324402.053842] BTRFS info (device loop0): 1 enospc errors during balance [324402.060769] BTRFS info (device loop0): balance: ended with status: -28 [324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left It fails inside reset_balance_state and turns the filesystem to read-only, which is unnecessary and should be fixed too, but the problem is caused by lack for space when the balance item is deleted. This is a one-time operation and from the same rank as unlink that is allowed to use the global block reserve. So do the same for the balance item. Status of the filesystem (100GiB) just after the balance fails: $ btrfs fi df mnt Data, single: total=80.01GiB, used=38.58GiB System, single: total=4.00MiB, used=16.00KiB Metadata, single: total=19.99GiB, used=19.48GiB GlobalReserve, single: total=512.00MiB, used=50.11MiB CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:29 +02:00
Qu Wenruo	38d37aa9c3	btrfs: refactor btrfs_check_can_nocow() into two variants The function btrfs_check_can_nocow() now has two completely different call patterns. For nowait variant, callers don't need to do any cleanup. While for wait variant, callers need to release the lock if they can do nocow write. This is somehow confusing, and is already a problem for the exported btrfs_check_can_nocow(). So this patch will separate the different patterns into different functions. For nowait variant, the function will be called check_nocow_nolock(). For wait variant, the function pair will be btrfs_check_nocow_lock() btrfs_check_nocow_unlock(). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
Qu Wenruo	e4ecaf90bc	btrfs: add comments for btrfs_check_can_nocow() and can_nocow_extent() These two functions have extra conditions that their callers need to meet, and some not-that-common parameters used for return value. So adding some comments may save reviewers some time. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
Qu Wenruo	6d4572a9d7	btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation [BUG] When the data space is exhausted, even if the inode has NOCOW attribute, we will still refuse to truncate unaligned range due to ENOSPC. The following script can reproduce it pretty easily: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev -b 1G mount -o nospace_cache $dev $mnt touch $mnt/foobar chattr +C $mnt/foobar xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null sync xfs_io -c "fpunch 0 2k" $mnt/foobar umount $mnt Currently this will fail at the fpunch part. [CAUSE] Because btrfs_truncate_block() always reserves space without checking the NOCOW attribute. Since the writeback path follows NOCOW bit, we only need to bother the space reservation code in btrfs_truncate_block(). [FIX] Make btrfs_truncate_block() follow btrfs_buffered_write() to try to reserve data space first, and fall back to NOCOW check only when we don't have enough space. Such always-try-reserve is an optimization introduced in btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call. This patch will export check_can_nocow() as btrfs_check_can_nocow(), and use it in btrfs_truncate_block() to fix the problem. Reported-by: Martin Doucha <martin.doucha@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
David Sterba	b547a88ea5	btrfs: start deprecation of mount option inode_cache Estimated time of removal of the functionality is 5.11, the option will be still parsed but will have no effect. Reasons for deprecation and removal: - very poor naming choice of the mount option, it's supposed to cache and reuse the inode _numbers_, but it sounds a some generic cache for inodes - the only known usecase where this option would make sense is on a 32bit architecture where inode numbers in one subvolume would be exhausted due to 32bit inode::i_ino - the cache is stored on disk, consumes space, needs to be loaded and written back - new inode number allocation is slower due to lookups into the cache (compared to a simple increment which is the default) - uses the free-space-cache code that is going to be deprecated as well in the future Known problems: - since 2011, returning EEXIST when there's not enough space in a page to store all checksums, see commit `4b9465cb9e` ("Btrfs: add mount -o inode_cache") Remaining issues: - if the option was enabled, new inodes created, the option disabled again, the cache is still stored on the devices and there's currently no way to remove it Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
David Sterba	a2570ef330	btrfs: remove unused btrfs_root::defrag_trans_start Last touched in 2013 by commit `de78b51a28` ("btrfs: remove cache only arguments from defrag path") that was the only code that used the value. Now it's only set but never used for anything, so we can remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:28 +02:00
David Sterba	bab16e21e8	btrfs: don't use UAPI types for fiemap callback The fiemap callback is not part of UAPI interface and the prototypes don't have the __u64 types either. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Denis Efremov	5af9d6ef3f	btrfs: tests: remove if duplicate in __check_free_space_extents() num_extents is already checked in the next if condition and can be safely removed. Signed-off-by: Denis Efremov <efremov@linux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Johannes Thumshirn	923eb52365	btrfs: use free_root_extent_buffer to free root In btrfs_put_root() we're freeing a btrfs_root's 'node' and 'commit_root' extent buffers manually via kfree(), while we're using free_root_extent_buffers() in the free_root_pointers() function above. free_root_extent_buffers() also NULLs the pointers after freeing, which mitigates potential double frees. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Nikolay Borisov	4e9d0d0109	btrfs: use for loop in prealloc_file_extent_cluster This function iterates all extents in the extent cluster, make this intention obvious by using a for loop. No functional chanes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Nikolay Borisov	214e61d07e	btrfs: perform data management operations outside of inode lock btrfs_alloc_data_chunk_ondemand and btrfs_free_reserved_data_space_noquota don't really use the guts of the inodes being passed to them. This implies it's not required to call them under extent lock. Move code around in prealloc_file_extent_cluster to do the heavy, data alloc/free operations outside of the lock. This also makes the 'out' label unnecessary, so remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Nikolay Borisov	c171edd5c8	btrfs: remove hole check in prealloc_file_extent_cluster Extents in the extent cluster are guaranteed to be contiguous as such the hole check inside the loop can never trigger. In fact this check was never functional since it was added in `18513091af` ("btrfs: update btrfs_space_info's bytes_may_use timely") which came after the commit introducing clustered/contiguous extents `0257bb82d2` ("Btrfs: relocate file extents in clusters"). Let's just remove it as it adds noise to the source. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:27 +02:00
Nikolay Borisov	906c448c3d	btrfs: make __btrfs_drop_extents take btrfs_inode It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the interface with the non __ prefixed version. Will also makes converting its callers to btrfs_inode easier. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	bd242a08a6	btrfs: make btrfs_csum_one_bio takae btrfs_inode Will enable converting btrfs_submit_compressed_write to btrfs_inode more easily. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	ad7ff17b65	btrfs: make extent_clear_unlock_delalloc take btrfs_inode It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode interface will allow seamless conversion of its callers. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	4b67c11dd1	btrfs: make create_io_em take btrfs_inode It really wants a btrfs_inode and will allow submit_compressed_extents to be completely converted to btrfs_inode in follow up patches. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	7bfa953501	btrfs: make btrfs_reloc_clone_csums take btrfs_inode It really wants btrfs_inode and not a vfs inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:26 +02:00
Nikolay Borisov	c350437269	btrfs: make btrfs_lookup_ordered_extent take btrfs_inode It doesn't use the generic vfs inode for anything use btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
Nikolay Borisov	43c69849ae	btrfs: make get_extent_allocation_hint take btrfs_inode It doesn't use the vfs inode for anything, can just as easily take btrfs_inode. Follow up patches will convert callers as well. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
Nikolay Borisov	da69fea9f7	btrfs: make __btrfs_add_ordered_extent take struct btrfs_inode This is internal btrfs function what really needs the vfs_inode only for igrab and a tracepoint. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
Filipe Manana	3ef64143a7	btrfs: remove no longer used trans_list member of struct btrfs_ordered_extent The 'trans_list' member of an ordered extent was used to keep track of the ordered extents for which a transaction commit had to wait. These were ordered extents that were started and logged by an fsync. However we don't do that anymore and before we stopped doing it we changed the approach to wait for the ordered extents in commit `161c3549b4` ("Btrfs: change how we wait for pending ordered extents"), which stopped using that list and therefore the 'trans_list' member is not used anymore since that commit. So just remove it since it's doing nothing and making each ordered extent structure waste memory (2 pointers). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
Filipe Manana	cd8d39f4ae	btrfs: remove no longer used log_list member of struct btrfs_ordered_extent The 'log_list' member of an ordered extent was used keep track of which ordered extents we needed to wait after logging metadata, but is not used anymore since commit `5636cf7d6d` ("btrfs: remove the logged extents infrastructure"), as we now always wait on ordered extent completion before logging metadata. So just remove it since it's doing nothing and making each ordered extent structure waste more memory (2 pointers). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:25 +02:00
David Sterba	ce6ef5abe6	btrfs: add little-endian optimized key helpers The CPU and on-disk keys are mapped to two different structures because of the endianness. There's an intermediate buffer used to do the conversion, but this is not necessary when CPU and on-disk endianness match. Add optimized versions of helpers that take disk_key and use the buffer directly for CPU keys or drop the intermediate buffer and conversion. This saves a lot of stack space accross many functions and removes about 6K of generated binary code: text data bss dec hex filename 1090439 17468 14912 1122819 112203 pre/btrfs.ko 1084613 17456 14912 1116981 110b35 post/btrfs.ko Delta: -5826 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	5958253cf6	btrfs: qgroup: catch reserved space leaks at unmount time Before this patch, qgroup completely relies on per-inode extent io tree to detect reserved data space leak. However previous bug has already shown how release page before btrfs_finish_ordered_io() could lead to leak, and since it's QGROUP_RESERVED bit cleared without triggering qgroup rsv, it can't be detected by per-inode extent io tree. So this patch adds another (and hopefully the final) safety net to catch qgroup data reserved space leak. At least the new safety net catches all the leaks during development, so it should be pretty useful in the real world. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	7dbeaad0af	btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak [BUG] The following simple workload from fsstress can lead to qgroup reserved data space leak: 0/0: creat f0 x:0 0 0 0/0: creat add id=0,parent=-1 0/1: write f0[259 1 0 0 0 0] [600030,27288] 0 0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat() 0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0 This would cause btrfs qgroup to leak 20480 bytes for data reserved space. If btrfs qgroup limit is enabled, such leak can lead to unexpected early EDQUOT and unusable space. [CAUSE] When doing direct IO, kernel will try to writeback existing buffered page cache, then invalidate them: generic_file_direct_write() \|- filemap_write_and_wait_range(); \|- invalidate_inode_pages2_range(); However for btrfs, the bi_end_io hook doesn't finish all its heavy work right after bio ends. In fact, it delays its work further: submit_extent_page(end_io_func=end_bio_extent_writepage); end_bio_extent_writepage() \|- btrfs_writepage_endio_finish_ordered() \|- btrfs_init_work(finish_ordered_fn); <<< Work queue execution >>> finish_ordered_fn() \|- btrfs_finish_ordered_io(); \|- Clear qgroup bits This means, when filemap_write_and_wait_range() returns, btrfs_finish_ordered_io() is not guaranteed to be executed, thus the qgroup bits for related range are not cleared. Now into how the leak happens, this will only focus on the overlapping part of buffered and direct IO part. 1. After buffered write The inode had the following range with QGROUP_RESERVED bit: 596 616K \|///////////////\| Qgroup reserved data space: 20K 2. Writeback part for range [596K, 616K) Write back finished, but btrfs_finish_ordered_io() not get called yet. So we still have: 596K 616K \|///////////////\| Qgroup reserved data space: 20K 3. Pages for range [596K, 616K) get released This will clear all qgroup bits, but don't update the reserved data space. So we have: 596K 616K \| \| Qgroup reserved data space: 20K That number doesn't match the qgroup bit range anymore. 4. Dio prepare space for range [596K, 700K) Qgroup reserved data space for that range, we got: 596K 616K 700K \|///////////////\|///////////////////////\| Qgroup reserved data space: 20K + 104K = 124K 5. btrfs_finish_ordered_range() gets executed for range [596K, 616K) Qgroup free reserved space for that range, we got: 596K 616K 700K \| \|///////////////////////\| We need to free that range of reserved space. Qgroup reserved data space: 124K - 20K = 104K 6. btrfs_finish_ordered_range() gets executed for range [596K, 700K) However qgroup bit for range [596K, 616K) is already cleared in previous step, so we only free 84K for qgroup reserved space. 596K 616K 700K \| \| \| We need to free that range of reserved space. Qgroup reserved data space: 104K - 84K = 20K Now there is no way to release that 20K unless disabling qgroup or unmounting the fs. [FIX] This patch will change the timing of btrfs_qgroup_release/free_data() call. Here it uses buffered COW write as an example. The new timing \| The old timing ----------------------------------------+--------------------------------------- btrfs_buffered_write() \| btrfs_buffered_write() \|- btrfs_qgroup_reserve_data() \| \|- btrfs_qgroup_reserve_data() \| btrfs_run_delalloc_range() \| btrfs_run_delalloc_range() \|- btrfs_add_ordered_extent() \| \|- btrfs_qgroup_release_data() \| The reserved is passed into \| btrfs_ordered_extent structure \| \| btrfs_finish_ordered_io() \| btrfs_finish_ordered_io() \|- The reserved space is passed to \| \|- btrfs_qgroup_release_data() btrfs_qgroup_record \| The resereved space is passed \| to btrfs_qgroup_recrod \| btrfs_qgroup_account_extents() \| btrfs_qgroup_account_extents() \|- btrfs_qgroup_free_refroot() \| \|- btrfs_qgroup_free_refroot() The point of such change is to ensure, when ordered extents are submitted, the qgroup reserved space is already released, to keep the timing aligned with file_write_and_wait_range(). So that qgroup data reserved space is all bound to btrfs_ordered_extent and solve the timing mismatch. Fixes: `f695fdcef8` ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space") Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	a7f8b1c2ac	btrfs: file: reserve qgroup space after the hole punch range is locked The incoming qgroup reserved space timing will move the data reservation to ordered extent completely. However in btrfs_punch_hole_lock_range() will call btrfs_invalidate_page(), which will clear QGROUP_RESERVED bit for the range. In current stage it's OK, but if we're making ordered extents handle the reserved space, then btrfs_punch_hole_lock_range() can clear the QGROUP_RESERVED bit before we submit ordered extent, leading to qgroup reserved space leakage. So here change the timing to make reserve data space after btrfs_punch_hole_lock_range(). The new timing is fine for either current code or the new code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	9729f10a60	btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent() This is to prepare for the incoming timing change of qgroup reserved data space and ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:24 +02:00
Qu Wenruo	203f44c519	btrfs: inode: refactor the parameters of insert_reserved_file_extent() Function insert_reserved_file_extent() takes a long list of parameters, which are all for btrfs_file_extent_item, even including two reserved members, encryption and other_encoding. This makes the parameter list unnecessary long for a function which only gets called twice. This patch will refactor the parameter list, by using btrfs_file_extent_item as parameter directly to hugely reduce the number of parameters. Also, since there are only two callers, one in btrfs_finish_ordered_io() which inserts file extent for ordered extent, and one __btrfs_prealloc_file_range(). These two call sites have completely different context, where ordered extent can be compressed, but will always be regular extent, while the preallocated one is never going to be compressed and always has PREALLOC type. So use two small wrapper for these two different call sites to improve readability. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	100aa5d9f9	btrfs: scrub: clean up temporary page variables in scrub_checksum_tree_block Add proper variable for the scrub page and use it instead of repeatedly dereferencing the other structures. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	521e102227	btrfs: scrub: simplify tree block checksum calculation Use a simpler iteration over tree block pages, same what csum_tree_block does: first page always exists, loop over the rest. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	d41ebef200	btrfs: scrub: clean up temporary page variables in scrub_checksum_data Add proper variable for the scrub page and use it instead of repeatedly dereferencing the other structures. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	771aba0d12	btrfs: scrub: simplify data block checksum calculation We have sectorsize same as PAGE_SIZE, the checksum can be calculated in one go. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	c746054109	btrfs: scrub: clean up temporary page variables in scrub_checksum_super Add proper variable for the scrub page and use it instead of repeatedly dereferencing the other structures. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:23 +02:00
David Sterba	74710cf1fb	btrfs: scrub: remove temporary csum array in scrub_checksum_super The page contents with the checksum is available during the entire function so we don't need to make a copy. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:22 +02:00
David Sterba	83cf6d5eae	btrfs: scrub: simplify superblock checksum calculation BTRFS_SUPER_INFO_SIZE is 4096, and fits to a page on all supported architectures, so we can calculate the checksum in one go. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:22 +02:00
David Sterba	b04852520e	btrfs: scrub: unify naming of page address variables As the page mapping has been removed, rename the variables to 'kaddr' that we use everywhere else. The type is changed to 'char *' so pointer arithmetic works without casts. Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-27 12:55:22 +02:00

1 2 3 4 5 ...

935094 Commits