linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-27 23:56:44 +07:00

Author	SHA1	Message	Date
Filipe Manana	3c850b4511	Btrfs: incremental send, fix emission of invalid clone operations When doing an incremental send we can now issue clone operations with a source range that ends at the source's file eof and with a destination range that ends at an offset smaller than the destination's file eof. If the eof of the source file is not aligned to the sector size of the filesystem, the receiver will get a -EINVAL error when trying to do the operation or, on older kernels, silently corrupt the destination file. The corruption happens on kernels without commit `ac765f83f1` ("Btrfs: fix data corruption due to cloning of eof block"), while the failure to clone happens on kernels with that commit. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt/sdb $ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo $ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar $ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz $ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base $ btrfs send -f /tmp/base.send /mnt/sdb/base $ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar $ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo $ xfs_io -c "truncate 550K" /mnt/sdb/bar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr $ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ btrfs receive -f /tmp/base.send /mnt/sdc $ btrfs receive -vv -f /tmp/incr.send /mnt/sdc (...) truncate bar size=563200 utimes bar clone zoo - source=bar source offset=512000 offset=0 length=51200 ERROR: failed to clone extents to zoo Invalid argument The failure happens because the clone source range ends at the eof of file bar, 563200, which is not aligned to the filesystem's sector size (4Kb in this case), and the destination range ends at offset 0 + 51200, which is less than the size of the file zoo (2Mb). 
So fix this by detecting such a case and, instead of issuing a clone operation for the whole range, do a clone operation for a smaller range that is sector size aligned followed by a write operation for the block containing the eof. Here we will always be pessimistic and assume the destination filesystem of the send stream has the largest possible sector size (64Kb), since we have no way of determining it. This fixes a recent regression introduced in kernel 5.2-rc1. Fixes: `040ee6120c` ("Btrfs: send, improve clone range") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-28 18:54:10 +02:00
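The split described in commit 3c850b4511 can be illustrated with a small userspace sketch. This is not the kernel's code: the function and struct names are hypothetical, and it assumes the start offsets are already suitably aligned, so only the length is trimmed down to a multiple of the pessimistic 64Kb sector size.

```c
#include <assert.h>
#include <stdint.h>

/* Pessimistic sector size assumed for the receiving filesystem (64Kb). */
#define ASSUMED_MAX_SECTOR_SIZE (64u * 1024u)

/* Hypothetical result type: how much to clone and how much to write. */
struct clone_split {
    uint64_t clone_len; /* bytes sent as a clone operation (may be 0) */
    uint64_t write_len; /* trailing bytes sent as a regular write */
};

/* Round value down to a multiple of a power-of-two alignment. */
static uint64_t align_down(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return value & ~(alignment - 1);
}

/*
 * Decide how to emit a range of len bytes taken from src_offset in the
 * source file and placed at dst_offset in the destination file.
 */
struct clone_split split_clone_range(uint64_t src_offset, uint64_t len,
                                     uint64_t src_size,
                                     uint64_t dst_offset, uint64_t dst_size)
{
    struct clone_split s = { .clone_len = len, .write_len = 0 };
    uint64_t src_end = src_offset + len;

    /*
     * Problem case from the commit message: the source range ends at an
     * unaligned eof while the destination range ends before the
     * destination's eof.
     */
    if (src_end == src_size &&
        src_end % ASSUMED_MAX_SECTOR_SIZE != 0 &&
        dst_offset + len < dst_size) {
        s.clone_len = align_down(len, ASSUMED_MAX_SECTOR_SIZE);
        s.write_len = len - s.clone_len;
    }
    return s;
}
```

With the reproducer's values (source offset 512000, length 51200, source size 563200, destination offset 0, destination size 2Mb), this sketch clones nothing and sends all 51200 bytes as a write, because the range is shorter than one assumed sector.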
Filipe Manana	6b1f72e5b8	Btrfs: incremental send, fix file corruption when no-holes feature is enabled When using the no-holes feature, if we have a file with prealloc extents with a start offset beyond the file's eof, doing an incremental send can cause corruption of the file due to incorrect hole detection. Such case requires that the prealloc extent(s) exist in both the parent and send snapshots, and that a hole is punched into the file that covers all its extents that do not cross the eof boundary. Example reproducer: $ mkfs.btrfs -f -O no-holes /dev/sdb $ mount /dev/sdb /mnt/sdb $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base $ btrfs send -f /tmp/base.snap /mnt/sdb/base $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr $ md5sum /mnt/sdb/incr/foobar 816df6f64deba63b029ca19d880ee10a /mnt/sdb/incr/foobar $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ btrfs receive -f /tmp/base.snap /mnt/sdc $ btrfs receive -f /tmp/incr.snap /mnt/sdc $ md5sum /mnt/sdc/incr/foobar cf2ef71f4a9e90c2f6013ba3b2257ed2 /mnt/sdc/incr/foobar --> Different checksum, because the prealloc extent beyond the file's eof confused the hole detection code and it assumed a hole starting at offset 0 and ending at the offset of the prealloc extent (1200Kb) instead of ending at the offset 500Kb (the file's size). Fix this by ensuring we never cross the file's size when issuing the write operations for a hole. Fixes: `16e7549f04` ("Btrfs: incompatible format change to remove hole extents") CC: stable@vger.kernel.org # 3.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-28 18:54:10 +02:00
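The fix in commit 6b1f72e5b8 ("never cross the file's size when issuing the write operations for a hole") amounts to clamping the hole range to the file size. A minimal sketch, with a hypothetical helper name and not taken from the kernel source:

```c
#include <stdint.h>

/*
 * Return the number of zero bytes to write for a detected hole
 * [hole_start, hole_end), never crossing file_size.
 */
uint64_t clamp_hole_len(uint64_t hole_start, uint64_t hole_end,
                        uint64_t file_size)
{
    /* Nothing to write when the hole starts at or beyond eof. */
    if (hole_start >= file_size)
        return 0;
    /* Trim the part of the hole that lies beyond eof. */
    if (hole_end > file_size)
        hole_end = file_size;
    /* Guard against an empty or inverted range. */
    if (hole_end <= hole_start)
        return 0;
    return hole_end - hole_start;
}
```

For the reproducer above, a hole detected from offset 0 to 1200Kb in a 500Kb file is trimmed to 512000 bytes.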
Filipe Manana	62d54f3a7f	Btrfs: fix race between send and deduplication that lead to failures and crashes Send operates on read only trees and expects them to never change while it is using them. This is part of its initial design, and this expectation is due to two different reasons: 1) When it was introduced, no operations were allowed to modify read-only subvolumes/snapshots (including defrag for example). 2) It keeps send from having an impact on other filesystem operations. Namely send does not need to keep locks on the trees nor needs to hold on to transaction handles and delay transaction commits. This ends up being a consequence of the former reason. However the deduplication feature was introduced later (in September 2013, while send was introduced in July 2012) and it allowed for deduplication with destination files that belong to read-only trees (subvolumes and snapshots). That means that having a send operation (either full or incremental) running in parallel with a deduplication that has the destination inode in one of the trees used by the send operation, can result in tree nodes and leaves getting freed and reused while send is using them. This problem is similar to the problem solved for the root nodes getting freed and reused when a snapshot is made against one tree that is currently being used by a send operation, fixed in commits [1] and [2]. These commits explain in detail how the problem happens and the explanation is valid for any node or leaf that is not the root of a tree as well. This problem was also discussed and explained recently in a thread [3]. The problem is very easy to reproduce when using send with large trees (snapshots) and just a few concurrent deduplication operations that target files in the trees used by send. A stress test case is being sent for fstests that triggers the issue easily. 
The most common error to hit is the send ioctl return -EIO with the following messages in dmesg/syslog: [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27 The first one is very easy to hit while the second one happens much less frequently, except for very large trees (in that test case, snapshots with 100000 files having large xattrs to get deep and wide trees). Less frequently, at least one BUG_ON can be hit: [1631742.130080] ------------[ cut here ]------------ [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806! [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) 
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000 [1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1631742.141272] Call Trace: [1631742.141826] ? do_raw_spin_unlock+0x49/0xc0 [1631742.142390] tree_advance+0x173/0x1d0 [btrfs] [1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs] [1631742.143533] ? process_extent+0x1070/0x1070 [btrfs] [1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs] [1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0 [1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs] [1631742.146179] ? account_entity_enqueue+0xd3/0x100 [1631742.146662] ? reweight_entity+0x154/0x1a0 [1631742.147135] ? update_curr+0x20/0x2a0 [1631742.147593] ? check_preempt_wakeup+0x103/0x250 [1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0 [1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [1631742.148942] do_vfs_ioctl+0xa2/0x6f0 [1631742.149361] ? __fget+0x113/0x200 [1631742.149767] ksys_ioctl+0x70/0x80 [1631742.150159] __x64_sys_ioctl+0x16/0x20 [1631742.150543] do_syscall_64+0x60/0x1b0 [1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe [1631742.151326] RIP: 0033:0x7f4cd9f5add7 (...) 
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001 (...) [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]--- That BUG_ON happens because while send is using a node, that node is COWed by a concurrent deduplication, gets freed and gets reused as a leaf (because a transaction commit happened in between), so when it attempts to read a slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer contents were wiped out and it now matches a leaf (which can even belong to some other tree now), hitting the BUG_ON(level == 0). Fix this concurrency issue by not allowing send and deduplication to run in parallel if both operate on the same readonly trees, returning EAGAIN to user space and logging an exlicit warning in dmesg/syslog. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:52 +02:00
Filipe Manana	9f89d5de86	Btrfs: send, flush dellaloc in order to avoid data loss When we set a subvolume to read-only mode we do not flush delalloc for any of its inodes (except if the filesystem is mounted with -o flushoncommit), since it does not affect correctness for any subsequent operations - except for a future send operation. The send operation will not be able to see the delalloc data since the respective file extent items, inode item updates, backreferences, etc, have not yet hit the subvolume and extent trees. Effectively this means data loss, since the send stream will not contain any data from existing delalloc. Another problem from this is that if the writeback starts and finishes while the send operation is in progress, we have the subvolume tree being modified concurrently which can result in send failing unexpectedly with EIO or hitting runtime errors, assertion failures or hitting BUG_ONs, etc. Simple reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ btrfs subvolume create /mnt/sv $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo $ btrfs property set /mnt/sv ro true $ btrfs send -f /tmp/send.stream /mnt/sv $ od -t x1 -A d /mnt/sv/foo 0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea * 0110592 $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive -f /tmp/send.stream /mnt $ echo $? 0 $ od -t x1 -A d /mnt/sv/foo 0000000 # ---> empty file Since this is a problem that affects send only, fix it in send by flushing delalloc for all the roots used by the send operation before send starts to process the commit roots. 
This is a problem that affects send since it was introduced (commit `31db9f7c23` ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive")) but backporting it to older kernels has some dependencies: - For kernels between 3.19 and 4.20, it depends on commit `3cd24c6980` ("btrfs: use tagged writepage to mitigate livelock of snapshot") because the function btrfs_start_delalloc_snapshot() does not exist before that commit. So one has to either pick that commit or replace the calls to btrfs_start_delalloc_snapshot() in this patch with calls to btrfs_start_delalloc_inodes(). - For kernels older than 3.19 it also requires commit `e5fa8f865b` ("Btrfs: ensure send always works on roots without orphans") because it depends on the function ensure_commit_roots_uptodate() which that commit introduced. - No dependencies for 5.0+ kernels. A test case for fstests follows soon. CC: stable@vger.kernel.org # 3.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:51 +02:00
Robbie Ko	040ee6120c	Btrfs: send, improve clone range Improve clone_range in two scenarios. 1. Remove the limit of inode size when find clone inodes We can do partial clone, so there is no need to limit the size of the candidate inode. When clone a range, we clone the legal range only by bytenr, offset, len, inode size. 2. In the scenarios of rewrite or clone_range, data_offset rarely matches exactly, so the chance of a clone is missed. e.g. 1. Write a 1M file dd if=/dev/zero of=1M bs=1M count=1 2. Clone 1M file cp --reflink 1M clone 3. Rewrite 4k on the clone file dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc The disk layout is as follows: item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53 extent data disk byte 1103101952 nr 1048576 extent data offset 0 nr 1048576 ram 1048576 extent compression(none) ... item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53 extent data disk byte 1104150528 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression(none) item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53 extent data disk byte 1103101952 nr 1048576 extent data offset 4096 nr 1044480 ram 1048576 extent compression(none) When send, inode 258 file offset 4096~1048576 (item 23) has a chance to clone_range, but because data_offset does not match inode 257 (item 16), it causes missed clone and can only transfer actual data. Improve the problem by judging whether the current data_offset has overlap with the file extent item, and if so, adjusting offset and extent_len so that we can clone correctly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:35 +02:00
Linus Torvalds	96d4f267e4	Remove 'type' argument from access_ok() function Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted in this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-01-03 18:57:57 -08:00
Andrea Gelmini	52042d8e82	btrfs: Fix typos in comments and strings The typos accumulate over time so once in a while they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:50 +01:00
Johannes Thumshirn	7073017aeb	btrfs: use offset_in_page instead of open-coding it Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an offset into a page. So replace them by the offset_in_page() macro instead of open-coding it if they're not used as an alignment check. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-12-17 14:51:45 +01:00
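The equivalence that commit 7073017aeb relies on can be checked in userspace. The sketch below uses hypothetical SKETCH_-prefixed names and assumes a 4096-byte page, since the kernel's PAGE_SIZE, PAGE_MASK and offset_in_page() are not available outside the kernel:

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's page macros (4096-byte page assumed). */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))
#define sketch_offset_in_page(p) ((unsigned long)(p) & ~SKETCH_PAGE_MASK)

/* Open-coded form 1: var & (PAGE_SIZE - 1) */
unsigned long offset_open_coded_size(unsigned long var)
{
    return var & (SKETCH_PAGE_SIZE - 1);
}

/* Open-coded form 2: var & ~PAGE_MASK */
unsigned long offset_open_coded_mask(unsigned long var)
{
    return var & ~SKETCH_PAGE_MASK;
}

/* The helper form that replaces both open-coded variants. */
unsigned long offset_with_helper(unsigned long var)
{
    return sketch_offset_in_page(var);
}
```

All three functions return the same value for any input, for example 0x345 for 0x12345. The helper only states the intent (an offset into a page) more clearly.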
Robbie Ko	a4390aee72	Btrfs: send, fix infinite loop due to directory rename dependencies When doing an incremental send, due to the need of delaying directory move (rename) operations we can end up in infinite loop at apply_children_dir_moves(). An example scenario that triggers this problem is described below, where directory names correspond to the numbers of their respective inodes. Parent snapshot: . |--- 261/ |--- 271/ |--- 266/ |--- 259/ |--- 260/ | |--- 267 | |--- 264/ | |--- 258/ | |--- 257/ | |--- 265/ |--- 268/ |--- 269/ | |--- 262/ | |--- 270/ |--- 272/ | |--- 263/ | |--- 275/ | |--- 274/ |--- 273/ Send snapshot: . |-- 275/ |-- 274/ |-- 273/ |-- 262/ |-- 269/ |-- 258/ |-- 271/ |-- 268/ |-- 267/ |-- 270/ |-- 259/ | |-- 265/ | |-- 272/ |-- 257/ |-- 260/ |-- 264/ |-- 263/ |-- 261/ |-- 266/ When processing inode 257 we delay its move (rename) operation because its new parent in the send snapshot, inode 272, was not yet processed. Then when processing inode 272, we delay the move operation for that inode because inode 274 is its ancestor in the send snapshot. Finally we delay the move operation for inode 274 when processing it because inode 275 is its new parent in the send snapshot and was not yet moved. 
When finishing processing inode 275, we start to do the move operations that were previously delayed (at apply_children_dir_moves()), resulting in the following iterations: 1) We issue the move operation for inode 274; 2) Because inode 262 depended on the move operation of inode 274 (it was delayed because 274 is its ancestor in the send snapshot), we issue the move operation for inode 262; 3) We issue the move operation for inode 272, because it was delayed by inode 274 too (ancestor of 272 in the send snapshot); 4) We issue the move operation for inode 269 (it was delayed by 262); 5) We issue the move operation for inode 257 (it was delayed by 272); 6) We issue the move operation for inode 260 (it was delayed by 272); 7) We issue the move operation for inode 258 (it was delayed by 269); 8) We issue the move operation for inode 264 (it was delayed by 257); 9) We issue the move operation for inode 271 (it was delayed by 258); 10) We issue the move operation for inode 263 (it was delayed by 264); 11) We issue the move operation for inode 268 (it was delayed by 271); 12) We verify if we can issue the move operation for inode 270 (it was delayed by 271). We detect a path loop in the current state, because inode 267 needs to be moved first before we can issue the move operation for inode 270. So we delay again the move operation for inode 270, this time we will attempt to do it after inode 267 is moved; 13) We issue the move operation for inode 261 (it was delayed by 263); 14) We verify if we can issue the move operation for inode 266 (it was delayed by 263). We detect a path loop in the current state, because inode 270 needs to be moved first before we can issue the move operation for inode 266. 
So we delay again the move operation for inode 266, this time we will attempt to do it after inode 270 is moved (its move operation was delayed in step 12); 15) We issue the move operation for inode 267 (it was delayed by 268); 16) We verify if we can issue the move operation for inode 266 (it was delayed by 270). We detect a path loop in the current state, because inode 270 needs to be moved first before we can issue the move operation for inode 266. So we delay again the move operation for inode 266, this time we will attempt to do it after inode 270 is moved (its move operation was delayed in step 12). So here we added again the same delayed move operation that we added in step 14; 17) We attempt again to see if we can issue the move operation for inode 266, and as in step 16, we realize we cannot because of a path loop in the current state caused by a dependency on inode 270. Again we delay inode 266's rename to happen after inode 270's move operation, adding the same dependency to the empty stack that we did in steps 14 and 16. The next iteration will pick the same move dependency from the stack (the only entry), realize again there is still a path loop, and then add the same dependency to the stack again, over and over, resulting in an infinite loop. So fix this by preventing the same move dependency entries from being added to the stack, by removing each pending move record from the red black tree of pending moves. This way the next call to get_pending_dir_moves() will not return anything for the current parent inode. A test case for fstests, with this reproducer, follows soon. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Wrote changelog with example and more clear explanation] Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-11-21 17:03:50 +01:00
Liu Bo	3cf5068f3d	Btrfs: unify error handling of btrfs_lookup_dir_item Unify the error handling of directory item lookups using IS_ERR_OR_NULL. No functional changes. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:30 +02:00
Misono Tomohiro	4fd786e6c3	btrfs: Remove 'objectid' member from struct btrfs_root There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objectid; ... } and not changed to another value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objectid' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-10-15 17:23:25 +02:00
Filipe Manana	22d3151c2c	Btrfs: send, fix incorrect file layout after hole punching beyond eof When doing an incremental send, if we have a file in the parent snapshot that has prealloc extents beyond EOF and in the send snapshot it got a hole punch that partially covers the prealloc extents, the send stream, when replayed by a receiver, can result in a file that has a size bigger than it should and filled with zeroes past the correct EOF. For example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "falloc -k 0 4M" /mnt/foobar $ xfs_io -c "pwrite -S 0xea 0 1M" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send -f /tmp/1.send /mnt/snap1 $ xfs_io -c "fpunch 1M 2M" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -f /tmp/2.send -p /mnt/snap1 /mnt/snap2 $ stat --format %s /mnt/snap2/foobar 1048576 $ md5sum /mnt/snap2/foobar d31659e82e87798acd4669a1e0a19d4f /mnt/snap2/foobar $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive -f /tmp/1.send /mnt $ btrfs receive -f /tmp/2.send /mnt $ stat --format %s /mnt/snap2/foobar 3145728 # --> should be 1Mb and not 3Mb (which was the end offset of hole # punch operation) $ md5sum /mnt/snap2/foobar 117baf295297c2a995f92da725b0b651 /mnt/snap2/foobar # --> should be d31659e82e87798acd4669a1e0a19d4f as in the original fs This issue actually happens only since commit `ffa7c4296e` ("Btrfs: send, do not issue unnecessary truncate operations"), but before that commit we were issuing a write operation full of zeroes (to "punch" a hole) which was extending the file size beyond the correct value and then immediately issue a truncate operation to the correct size and undoing the previous write operation. Since the send protocol does not support fallocate, for extent preallocation and hole punching, fix this by not even attempting to send a "hole" (regular write full of zeroes) if it starts at an offset greater than or equal to the file's size. 
This approach, besides being much simpler than making send issue the truncate operation, adds the benefit of avoiding the useless pair of write of zeroes and truncate operations, saving time and IO at the receiver and reducing the size of the send stream. A test case for fstests follows soon. Fixes: `ffa7c4296e` ("Btrfs: send, do not issue unnecessary truncate operations") CC: stable@vger.kernel.org # 4.17+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:13:03 +02:00
Filipe Manana	46b2f4590a	Btrfs: fix send failure when root has deleted files still open The more common use case of send involves creating a RO snapshot and then use it for a send operation. In this case it's not possible to have inodes in the snapshot that have a link count of zero (inode with an orphan item) since during snapshot creation we do the orphan cleanup. However, other less common use cases for send can end up seeing inodes with a link count of zero and in this case the send operation fails with an ENOENT error because any attempt to generate a path for the inode, with the purpose of creating it or updating it at the receiver, fails since there are no inode reference items. One use case is to use a regular subvolume for a send operation after turning it to RO mode or turning a RW snapshot into RO mode and then using it for a send operation. In both cases, if a file gets all its hard links deleted while there is an open file descriptor before turning the subvolume/snapshot into RO mode, the send operation will encounter an inode with a link count of zero and then fail with errno ENOENT. Example using a full send with a subvolume: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ btrfs subvolume create /mnt/sv1 $ touch /mnt/sv1/foo $ touch /mnt/sv1/bar # keep an open file descriptor on file bar $ exec 73</mnt/sv1/bar $ unlink /mnt/sv1/bar # Turn the subvolume to RO mode and use it for a full send, while # holding the open file descriptor. 
$ btrfs property set /mnt/sv1 ro true $ btrfs send -f /tmp/full.send /mnt/sv1 At subvol /mnt/sv1 ERROR: send ioctl failed with -2: No such file or directory Example using an incremental send with snapshots: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ btrfs subvolume create /mnt/sv1 $ touch /mnt/sv1/foo $ touch /mnt/sv1/bar $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1 $ echo "hello world" >> /mnt/sv1/bar $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap2 # Turn the second snapshot to RW mode and delete file foo while # holding an open file descriptor on it. $ btrfs property set /mnt/snap2 ro false $ exec 73</mnt/snap2/foo $ unlink /mnt/snap2/foo # Set the second snapshot back to RO mode and do an incremental send. $ btrfs property set /mnt/snap2 ro true $ btrfs send -f /tmp/inc.send -p /mnt/snap1 /mnt/snap2 At subvol /mnt/snap2 ERROR: send ioctl failed with -2: No such file or directory So fix this by ignoring inodes with a link count of zero if we are either doing a full send or if they do not exist in the parent snapshot (they are new in the send snapshot), and unlink all paths found in the parent snapshot when doing an incremental send (and ignoring all other inode items, such as xattrs and extents). A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Reported-by: Martin Wilck <martin.wilck@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:59 +02:00
Filipe Manana	ca5d2ba1ae	Btrfs: remove unused key assignment when doing a full send At send.c:full_send_tree() we were setting the 'key' variable in the loop while never using it later. We were also using two btrfs_key variables to store the initial key for search and the key found in every iteration of the loop. So remove this useless key assignment and use the same btrfs_key variable to store the initial search key and the key found in each iteration. This was introduced in the initial send commit but was never used (commit `31db9f7c23` ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive")). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:56 +02:00
Qu Wenruo	e41ca58974	btrfs: Get rid of the confusing btrfs_file_extent_inline_len We used to call btrfs_file_extent_inline_len() to get the uncompressed data size of an inlined extent. However this function is hiding evil, for compressed extent, it has no choice but to directly read out ram_bytes from btrfs_file_extent_item. While for uncompressed extent, it uses item size to calculate the real data size, and ignoring ram_bytes completely. In fact, for corrupted ram_bytes, due to above behavior kernel btrfs_print_leaf() can't even print correct ram_bytes to expose the bug. Since we have the tree-checker to verify all EXTENT_DATA, such mismatch can be detected pretty easily, thus we can trust ram_bytes without the evil btrfs_file_extent_inline_len(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-08-06 13:12:38 +02:00
Robbie Ko	0f96f517dc	btrfs: incremental send, improve rmdir performance for large directory Currently when checking if a directory can be deleted, we always check if all its children have been processed. Example: A directory with 2,000,000 files was deleted original: 1994m57.071s patch: 1m38.554s [FIX] Instead of checking all children on all calls to can_rmdir(), we keep track of the directory index offset of the child last checked in the last call to can_rmdir(), and then use it as the starting point for future calls to can_rmdir(). Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
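The resume-point idea behind commit 0f96f517dc can be sketched in userspace. All names here are hypothetical. The kernel tracks a directory index offset and searches the tree from it, whereas this sketch uses an array cursor to stand in for that offset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One directory entry; 'processed' says whether send already handled it. */
struct child {
    uint64_t dir_index;
    bool processed;
};

/* Per-directory state kept between calls. */
struct dir_state {
    const struct child *children; /* ordered by dir_index */
    size_t nr_children;
    size_t next_child;            /* resume point: first entry not yet verified */
    size_t entries_examined;      /* instrumentation only */
};

/*
 * Return true once every child has been processed. Entries verified by an
 * earlier call are not examined again, because a processed child stays
 * processed.
 */
bool can_rmdir_sketch(struct dir_state *d)
{
    while (d->next_child < d->nr_children) {
        d->entries_examined++;
        if (!d->children[d->next_child].processed)
            return false; /* resume from this entry on the next call */
        d->next_child++;
    }
    return true;
}
```

With five children of which the first three are processed, the first call examines four entries and fails. Once the remaining two are processed, the second call examines only those two, instead of rescanning all five.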
Robbie Ko	35c8eda12f	btrfs: incremental send, move allocation until it's needed in orphan_dir_info Move the allocation after the search when it's clear that the new entry will be added. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Colin Ian King	f5686e3acd	btrfs: send: fix spelling mistake: "send_in_progres" -> "send_in_progress" Trivial fix to spelling mistake of function name in btrfs_err message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Filipe Manana	a6aa10c70b	Btrfs: send, fix missing truncate for inode with prealloc extent past eof An incremental send operation can miss a truncate operation when an inode has an increased size in the send snapshot and a prealloc extent beyond its size. Consider the following scenario where a necessary truncate operation is missing in the incremental send stream: 1) In the parent snapshot an inode has a size of 1282957 bytes and it has no prealloc extents beyond its size; 2) In the send snapshot it has a size of 5738496 bytes and has a new extent at offset 1884160 (length of 106496 bytes) and a prealloc extent beyond eof at offset 6729728 (and a length of 339968 bytes); 3) When processing the prealloc extent, at offset 6729728, we end up at send.c:send_write_or_clone() and set the @len variable to a value of 18446744073708560384 because @offset plus the original @len value is larger than the inode's size (6729728 + 339968 > 5738496). We then call send_extent_data(), with that @offset and @len, which in turn calls send_write(), and then the latter calls fill_read_buf(). Because the offset passed to fill_read_buf() is greater than the inode's i_size, this function returns 0 immediately, which makes send_write() and send_extent_data() do nothing and return immediately as well. When we get back to send.c:send_write_or_clone() we adjust the value of sctx->cur_inode_next_write_offset to @offset plus @len, which corresponds to 6729728 + 18446744073708560384 = 5738496, which is precisely the size of the inode in the send snapshot; 4) Later when at send.c:finish_inode_if_needed() we determine that we don't need to issue a truncate operation because the value of sctx->cur_inode_next_write_offset corresponds to the inode's new size, 5738496 bytes. 
This is wrong because the last write operation that was issued started at offset 1884160 with a length of 106496 bytes, so the correct value for sctx->cur_inode_next_write_offset should be 1990656 (1884160 + 106496), so that a truncate operation with a value of 5738496 bytes would have been sent to insert a trailing hole at the destination. So fix the issue by making send.c:send_write_or_clone() not attempt to send write or clone operations for extents that start beyond the inode's size, since such attempts do nothing but waste time by calling helper functions and allocating path structures, and send currently has no fallocate command in order to create prealloc extents at the destination (either beyond a file's eof or not). The issue was found running the test btrfs/007 from fstests using a seed value of 1524346151 for fsstress. Reported-by: Gu, Jinxiang <gujx@cn.fujitsu.com> Fixes: `ffa7c4296e` ("Btrfs: send, do not issue unnecessary truncate operations") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-02 11:55:29 +02:00
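The arithmetic in the scenario above can be checked with a short sketch. Python integers do not wrap, so unsigned 64-bit behaviour is emulated with a mask; the helper names `u64()` and `should_skip_extent()` are hypothetical, and the skip condition is only an assumption about how "extents that start beyond the inode's size" might be tested.

```python
U64_MASK = (1 << 64) - 1


def u64(value):
    """Emulate unsigned 64-bit wrap-around arithmetic."""
    return value & U64_MASK


# Numbers taken from the scenario in the commit message.
inode_size = 5738496
prealloc_offset = 6729728
prealloc_len = 339968

# The extent ends past the inode's size, so the length is clamped to
# "size - offset". The offset is already past the size, so this wraps.
assert prealloc_offset + prealloc_len > inode_size
clamped_len = u64(inode_size - prealloc_offset)

# Adding the wrapped length back to the offset lands exactly on the size,
# which is why the truncate operation looked unnecessary.
next_write_offset = u64(prealloc_offset + clamped_len)

# What the next write offset should have been: end of the last real write.
expected_next_write_offset = 1884160 + 106496


def should_skip_extent(offset, size):
    """Assumed form of the fix: ignore extents that start at or past the size."""
    return offset >= size


if __name__ == "__main__":
    print(clamped_len)                  # wrapped @len value
    print(next_write_offset)            # equals the inode size
    print(expected_next_write_offset)   # end of the last write actually issued
    print(should_skip_extent(prealloc_offset, inode_size))
```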
David Sterba	c1d7c514f7	btrfs: replace GPL boilerplate by SPDX -- sources Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>	2018-04-12 16:29:51 +02:00
Liu Bo	895a72be41	Btrfs: send: fix typo in TLV_PUT According to tlv_put()'s prototype, data and attrlen need to be exchanged in the macro, but it seems all callers are already aware of this misorder and are therefore not affected. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:42 +02:00
Filipe Manana	ffa7c4296e	Btrfs: send, do not issue unnecessary truncate operations When send finishes processing an inode representing a regular file, it always issues a truncate operation for that file, even if its size did not change or the last write sets the file size correctly. In the most common cases, the issued write operations set the file to correct size (either full or incremental sends) or the file size did not change (for incremental sends), so the only case where a truncate operation is needed is when a file size becomes smaller in the send snapshot when compared to the parent snapshot. By not issuing unnecessary truncate operations we reduce the stream size and save time in the receiver. Currently truncating a file to the same size triggers writeback of its last page (if it's dirty) and waits for it to complete (only if the file size is not aligned with the filesystem's sector size). This is being fixed by another patch and is independent of this change (that patch's title is "Btrfs: skip writeback of last page when truncating file to same size"). The following script was used to measure time spent by a receiver without this change applied, with this change applied, and without this change and with the truncate fix applied (the fix to not make it start and wait for writeback to complete). $ cat test_send.sh #!/bin/bash SRC_DEV=/dev/sdc DST_DEV=/dev/sdd SRC_MNT=/mnt/sdc DST_MNT=/mnt/sdd mkfs.btrfs -f $SRC_DEV >/dev/null mkfs.btrfs -f $DST_DEV >/dev/null mount $SRC_DEV $SRC_MNT mount $DST_DEV $DST_MNT echo "Creating source filesystem" for ((t = 0; t < 10; t++)); do ( for ((i = 1; i <= 20000; i++)); do xfs_io -f -c "pwrite -S 0xab 0 5000" \ $SRC_MNT/file_$i > /dev/null done ) & worker_pids[$t]=$! 
done wait ${worker_pids[@]} echo "Creating and sending snapshot" btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null /usr/bin/time -f "send took %e seconds" \ btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1 /usr/bin/time -f "receive took %e seconds" \ btrfs receive -f $SRC_MNT/send_file $DST_MNT umount $SRC_MNT umount $DST_MNT The results, which are averages for 5 runs for each case, were the following: * Without this change average receive time was 26.49 seconds standard deviation of 2.53 seconds * Without this change and with the truncate fix average receive time was 12.51 seconds standard deviation of 0.32 seconds * With this change and without the truncate fix average receive time was 10.02 seconds standard deviation of 1.11 seconds Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:41 +02:00
David Sterba	e67c718b5b	btrfs: add more __cold annotations The __cold functions are placed to a special section, as they're expected to be called rarely. This could help i-cache prefetches or help compiler to decide which branches are more/less likely to be taken without any other annotations needed. Though we can't add more __exit annotations, it's still possible to add __cold (that's also added with __exit). That way the following function categories are tagged: - printf wrappers, error messages - exit helpers Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:39 +02:00
Nikolay Borisov	9678c54388	btrfs: Remove custom crc32c init code The custom crc32 init code was introduced in `14a958e678` ("Btrfs: fix btrfs boot when compiled as built-in") to enable using btrfs as a built-in. However, later as pointed out by `60efa5eb2e` ("Btrfs: use late_initcall instead of module_init") this wasn't enough and finally btrfs was switched to late_initcall which comes after the generic crc32c implementation is initialised. The latter commit superseded the former. Now that we don't have to maintain our own code let's just remove it and switch to using the generic implementation. Despite touching a lot of files the patch is really simple. Here is the gist of the changes: 1. Select LIBCRC32C rather than the low-level modules. 2. s/btrfs_crc32c/crc32c/g 3. replace hash.h with linux/crc32c.h 4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly. I've tested this with btrfs being both a module and a built-in and xfstest doesn't complain. Does seem to fix the longstanding problem of not automatically selecting the crc32c module when btrfs is used. Possibly there is a workaround in dracut. The modinfo confirms that now all the module dependencies are there: before: depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate after: depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add more info to changelog from mails ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:39 +02:00
Filipe Manana	d4dfc0f4d3	Btrfs: send, fix issuing write op when processing hole in no data mode When doing an incremental send of a filesystem with the no-holes feature enabled, we end up issuing a write operation when using the no data mode send flag, instead of issuing an update extent operation. Fix this by issuing the update extent operation instead. Trivial reproducer: $ mkfs.btrfs -f -O no-holes /dev/sdc $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdc /mnt/sdc $ mount /dev/sdd /mnt/sdd $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1 $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2 $ btrfs send /mnt/sdc/snap1 \| btrfs receive /mnt/sdd $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \ \| btrfs receive -vv /mnt/sdd Before this change the output of the second receive command is: receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447... utimes write foobar, offset 8192, len 8192 utimes foobar BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-... After this change it is: receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64... utimes update_extent foobar: offset=8192, len=8192 utimes foobar BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64... Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-01 16:18:07 +01:00
Qu Wenruo	bae15d95e2	btrfs: Cleanup existing name_len checks Since tree-checker has verified leaf when reading from disk, we don't need the existing verify_dir_item() or btrfs_is_name_len_valid() checks. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-01-22 16:08:12 +01:00
Filipe Manana	ea37d5998b	Btrfs: incremental send, fix wrong unlink path after renaming file Under some circumstances, an incremental send operation can issue wrong paths for unlink commands related to files that have multiple hard links and some (or all) of those links were renamed between the parent and send snapshots. Consider the following example: Parent snapshot . (ino 256) \|---- a/ (ino 257) \| \|---- b/ (ino 259) \| \| \|---- c/ (ino 260) \| \| \|---- f2 (ino 261) \| \| \| \|---- f2l1 (ino 261) \| \|---- d/ (ino 262) \|---- f1l1_2 (ino 258) \|---- f2l2 (ino 261) \|---- f1_2 (ino 258) Send snapshot . (ino 256) \|---- a/ (ino 257) \| \|---- f2l1/ (ino 263) \| \|---- b2/ (ino 259) \| \|---- c/ (ino 260) \| \| \|---- d3 (ino 262) \| \| \|---- f1l1_2 (ino 258) \| \| \|---- f2l2_2 (ino 261) \| \| \|---- f1_2 (ino 258) \| \| \| \|---- f2 (ino 261) \| \|---- f1l2 (ino 258) \| \|---- d (ino 261) When computing the incremental send stream the following steps happen: 1) When processing inode 261, a rename operation is issued that renames inode 262, which currently has a path of "d", to an orphan name of "o262-7-0". This is done because in the send snapshot, inode 261 has one of its hard links with a path of "d" as well. 2) Two link operations are issued that create the new hard links for inode 261, whose names are "d" and "f2l2_2", at paths "/" and "o262-7-0/" respectively. 3) Still while processing inode 261, unlink operations are issued to remove the old hard links of inode 261, with names "f2l1" and "f2l2", at paths "a/" and "d/". However path "d/" does not correspond anymore to the directory inode 262 but corresponds instead to a hard link of inode 261 (link command issued in the previous step). This makes the receiver fail with a ENOTDIR error when attempting the unlink operation. 
The problem happens because before sending the unlink operation, we failed to detect that inode 262 was one of the ancestors of inode 261 in the parent snapshot, and therefore we didn't recompute the path for inode 262 before issuing the unlink operation for the link named "f2l2" of inode 262. The detection failed because the function "is_ancestor()" only follows the first hard link it finds for an inode instead of all of its hard links (as it was originally created for being used with directories only, for which only one hard link exists). So fix this by making "is_ancestor()" follow all hard links of the input inode. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-11-28 17:15:30 +01:00
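The difference between following only the first hard link and following all of them can be sketched on a toy parent map. This is not the kernel's is_ancestor(); the function names and the `parents` dictionary are hypothetical, and the inode numbers are only loosely modeled on the example above (a regular file, inode 261, with one link under directory 260 and another under directory 262).

```python
ROOT = 256

# Hypothetical parent map: inode -> list of directory inodes holding a link to it.
# Directories have exactly one entry; the regular file 261 has two hard links.
parents = {
    257: [256],
    259: [257],
    260: [259],
    262: [257],
    261: [260, 262],
}


def is_ancestor_first_link(parents, ino, ancestor):
    """Old behaviour: walk up through the first hard link only."""
    cur = ino
    while cur != ROOT:
        cur = parents[cur][0]
        if cur == ancestor:
            return True
    return False


def is_ancestor_all_links(parents, ino, ancestor):
    """Fixed behaviour: walk up through every hard link of every inode."""
    stack = [ino]
    seen = set()
    while stack:
        cur = stack.pop()
        for parent in parents.get(cur, []):
            if parent == ancestor:
                return True
            if parent != ROOT and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


if __name__ == "__main__":
    # Directory 262 is only reachable through the second link of inode 261.
    print(is_ancestor_first_link(parents, 261, 262))
    print(is_ancestor_all_links(parents, 261, 262))
```

For directories both functions agree, since a directory has a single link; they diverge only for inodes with multiple hard links.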
Zygo Blaxell	c995ab3cda	btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and offset (encoded as a single logical address) to a list of extent refs. LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping (extent ref -> extent bytenr and offset, or logical address). These are useful capabilities for programs that manipulate extents and extent references from userspace (e.g. dedup and defrag utilities). When the extents are uncompressed (and not encrypted and not other), check_extent_in_eb performs filtering of the extent refs to remove any extent refs which do not contain the same extent offset as the 'logical' parameter's extent offset. This prevents LOGICAL_INO from returning references to more than a single block. To find the set of extent references to an uncompressed extent from [a, b), userspace has to run a loop like this pseudocode: for (i = a; i < b; ++i) extent_ref_set += LOGICAL_INO(i); At each iteration of the loop (up to 32768 iterations for a 128M extent), data we are interested in is collected in the kernel, then deleted by the filter in check_extent_in_eb. When the extents are compressed (or encrypted or other), the 'logical' parameter must be an extent bytenr (the 'a' parameter in the loop). No filtering by extent offset is done (or possible?) so the result is the complete set of extent refs for the entire extent. This removes the need for the loop, since we get all the extent refs in one call. Add an 'ignore_offset' argument to iterate_inodes_from_logical, [...several levels of function call graph...], and check_extent_in_eb, so that we can disable the extent offset filtering for uncompressed extents. This flag can be set by an improved version of the LOGICAL_INO ioctl to get either behavior as desired. There is no functional change in this patch. The new flag is always false. 
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: David Sterba <dsterba@suse.com> [ minor coding style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-11-01 20:45:34 +01:00
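The difference between the per-block loop and a single lookup with offset filtering disabled can be illustrated with a toy model. This is only a sketch: `logical_ino()` here is a hypothetical Python stand-in for the ioctl, not its real interface, references are modeled as `(inode, extent_offset, num_bytes)` tuples, and a 4K block size is assumed.

```python
BLOCK_SIZE = 4096  # assumed block size


def logical_ino(extent_bytenr, refs, logical, ignore_offset=False):
    """Toy lookup: return the refs that cover 'logical' within the extent.

    With ignore_offset=True no filtering by extent offset is done, so every
    reference to the extent is returned by a single call.
    """
    if ignore_offset:
        return set(refs)
    wanted = logical - extent_bytenr
    return {ref for ref in refs if ref[1] <= wanted < ref[1] + ref[2]}


def refs_by_loop(extent_bytenr, extent_len, refs):
    """Collect all refs the way the pseudocode loop does, one block at a time."""
    found = set()
    calls = 0
    for logical in range(extent_bytenr, extent_bytenr + extent_len, BLOCK_SIZE):
        found |= logical_ino(extent_bytenr, refs, logical)
        calls += 1
    return found, calls


if __name__ == "__main__":
    bytenr = 1 << 20
    length = 16 * BLOCK_SIZE
    refs = [(257, 0, 8192), (258, 8192, 4096), (259, 32768, 32768)]
    looped, calls = refs_by_loop(bytenr, length, refs)
    single = logical_ino(bytenr, refs, bytenr, ignore_offset=True)
    print(calls, looped == single)
    # Number of loop iterations for a 128M extent with 4K blocks.
    print((128 * 1024 * 1024) // BLOCK_SIZE)
```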
Nikolay Borisov	eb7b9d6a46	btrfs: send: remove unused code This code was first introduced in `31db9f7c23` ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive") and it was not functional, then it got slightly refactored in `e938c8ad54` ("Btrfs: code cleanups for send/receive"), alas it was still dead. So let's remove it for good! Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-11-01 20:45:34 +01:00
Josef Bacik	2351f431f7	btrfs: fix send ioctl on 32bit with 64bit kernel We pass in a pointer in our send arg struct, this means the struct size doesn't match with 32bit user space and 64bit kernel space. Fix this by adding a compat mode and doing the appropriate conversion. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ move structure to the beginning, next to receive 32bit compat ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-10-30 12:27:59 +01:00
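The size mismatch can be made concrete with Python's `struct` module. The field order used below (a signed 64-bit fd, a count, a user pointer, a parent root id, flags, and four reserved 64-bit words) is an assumption about the send argument structure, and the names `NATIVE_64`, `COMPAT_32`, `widen_compat_args()` and `ioc_size()` are hypothetical helpers for illustration.

```python
import struct

# Assumed layout, little-endian and without padding:
#   s64 send_fd, u64 clone_sources_count, <pointer> clone_sources,
#   u64 parent_root, u64 flags, u64 reserved[4]
NATIVE_64 = struct.Struct("<qQQQQ4Q")   # 8-byte pointer (64-bit user space)
COMPAT_32 = struct.Struct("<qQIQQ4Q")   # 4-byte pointer (32-bit user space)


def widen_compat_args(raw):
    """Convert a 32-bit-layout argument blob into the 64-bit layout."""
    fields = COMPAT_32.unpack(raw)
    # The 32-bit pointer value is simply zero-extended to 64 bits.
    return NATIVE_64.pack(*fields)


def ioc_size(cmd):
    """Extract the argument size encoded in a Linux ioctl number (bits 16..29)."""
    return (cmd >> 16) & 0x3FFF


if __name__ == "__main__":
    print(NATIVE_64.size, COMPAT_32.size)
    raw32 = COMPAT_32.pack(3, 1, 0x1000, 256, 0, 0, 0, 0, 0)
    print(len(widen_compat_args(raw32)))
    # Size field of the ioctl number seen in a 64-bit oops trace in this log.
    print(ioc_size(0x40489426))
```

Because the argument size is encoded in the ioctl number itself, a 32-bit program ends up issuing a different ioctl number from the one a 64-bit kernel expects, which is why a separate compat path is needed.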
Kuanling Huang	eef16ba269	Btrfs: send, apply asynchronous page cache readahead to enhance page read By analyzing the perf data of btrfs send, we found that it spends a large amount of CPU time in page_cache_sync_readahead. This overhead can be reduced by switching to the asynchronous variant. The overall performance gains on HDD and SSD were 9 and 15 percent, respectively, when simply sending a large file. Signed-off-by: Kuanling Huang <peterh@synology.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-10-30 12:27:57 +01:00
Nikolay Borisov	ee8c494f88	btrfs: Remove unused arguments from btrfs_changed_cb_t btrfs_changed_cb_t represents the signature of the callback being passed to btrfs_compare_trees. Currently there is only one such callback, namely changed_cb in send.c. This function doesn't really use the first 2 parameters, i.e. the roots. Since there are no other functions implementing the btrfs_changed_cb_t let's remove the unused parameters from the prototype and implementation. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-10-30 12:27:56 +01:00
Nikolay Borisov	a0357511f2	btrfs: Remove unused parameters from various functions iterate_dir_item:found_key - introduced in `31db9f7c23` ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"), yet never used. record_ref:num - ditto This is a first pass with the low-hanging fruit. There are still quite a few unused parameters in some functions which have to abide by a callback interface. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-10-30 12:27:55 +01:00
Linus Torvalds	5ba88cd6e9	Merge branch 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've collected a bunch of isolated fixes, for crashes, user-visible behaviour or missing bits from other subsystem cleanups from the past. The overall number is not small but I was not able to make it significantly smaller. Most of the patches are supposed to go to stable" * 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: log csums for all modified extents Btrfs: fix unexpected result when dio reading corrupted blocks btrfs: Report error on removing qgroup if del_qgroup_item fails Btrfs: skip checksum when reading compressed data if some IO have failed Btrfs: fix kernel oops while reading compressed data Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block Btrfs: do not backup tree roots when fsync btrfs: remove BTRFS_FS_QUOTA_DISABLING flag btrfs: propagate error to btrfs_cmp_data_prepare caller btrfs: prevent to set invalid default subvolid Btrfs: send: fix error number for unknown inode types btrfs: fix NULL pointer dereference from free_reloc_roots() btrfs: finish ordered extent cleaning if no progress is found btrfs: clear ordered flag on cleaning up ordered extents Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO Btrfs: do not reset bio->bi_ops while writing bio Btrfs: use the new helper wbc_to_write_flags	2017-09-29 12:57:35 -07:00
Tsutomu Itoh	ca6842bf01	Btrfs: send: fix error number for unknown inode types ENOTSUPP should not be returned to the user program. (cf. include/linux/errno.h) Therefore, EOPNOTSUPP is used instead of ENOTSUPP. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-09-26 14:52:06 +02:00
Linus Torvalds	581bfce969	Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more set_fs removal from Al Viro: "Christoph's 'use kernel_read and friends rather than open-coding set_fs()' series" * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: unexport vfs_readv and vfs_writev fs: unexport vfs_read and vfs_write fs: unexport __vfs_read/__vfs_write lustre: switch to kernel_write gadget/f_mass_storage: stop messing with the address limit mconsole: switch to kernel_read btrfs: switch write_buf to kernel_write net/9p: switch p9_fd_read to kernel_write mm/nommu: switch do_mmap_private to kernel_read serial2002: switch serial2002_tty_write to kernel_{read/write} fs: make the buf argument to __kernel_write a void pointer fs: fix kernel_write prototype fs: fix kernel_read prototype fs: move kernel_read to fs/read_write.c fs: move kernel_write to fs/read_write.c autofs4: switch autofs4_write to __kernel_write ashmem: switch to ->read_iter	2017-09-14 18:13:32 -07:00
Christoph Hellwig	8e93157bdd	btrfs: switch write_buf to kernel_write Instead of playing with the addressing limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-09-04 19:05:16 -04:00
Filipe Manana	72610b1b40	Btrfs: incremental send, fix emission of invalid clone operations When doing an incremental send it's possible that the computed send stream contains clone operations that will fail on the receiver if the receiver has compression enabled and the clone operations target a sector sized extent that starts at a zero file offset, is not compressed on the source filesystem but ends up being compressed and inlined at the destination filesystem. Example scenario: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt # By doing a direct IO write, the data is not compressed. $ xfs_io -f -d -c "pwrite -S 0xab 0 4K" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1 $ xfs_io -c "reflink /mnt/foobar 0 8K 4K" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2 $ btrfs send -f /tmp/1.snap /mnt/mysnap1 $ btrfs send -f /tmp/2.snap -p /mnt/mysnap1 /mnt/mysnap2 $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount -o compress /dev/sdc /mnt $ btrfs receive -f /tmp/1.snap /mnt $ btrfs receive -f /tmp/2.snap /mnt ERROR: failed to clone extents to foobar Operation not supported The same could be achieved by mounting the source filesystem without compression and doing a buffered IO write instead of a direct IO one, and mounting the destination filesystem with compression enabled. So fix this by issuing regular write operations in the send stream instead of clone operations when the source offset is zero and the range has a length matching the sector size. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-21 17:47:42 +02:00
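The rule described in the fix above amounts to a small decision. The sketch below is a hypothetical Python restatement of it, not the kernel code; `choose_send_op()` and the default 4K sector size are assumptions for illustration.

```python
def choose_send_op(source_offset, length, sector_size=4096):
    """Pick the stream operation for a range that could otherwise be cloned.

    A sector-sized range starting at file offset zero may end up compressed
    and inlined at the destination, where cloning it fails, so a regular
    write is issued for that case instead.
    """
    if source_offset == 0 and length == sector_size:
        return "write"
    return "clone"


if __name__ == "__main__":
    # The 4K extent at offset 0 of foobar from the example scenario.
    print(choose_send_op(0, 4096))
    # Larger ranges, or ranges not starting at offset 0, can still be cloned.
    print(choose_send_op(0, 8192))
    print(choose_send_op(4096, 4096))
```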
David Sterba	d3c0bab563	btrfs: remove trivial wrapper btrfs_force_ra It's a simple call page_cache_sync_readahead, same arguments in the same order. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
Filipe Manana	24e52b11e0	Btrfs: incremental send, fix invalid memory access When doing an incremental send, while processing an extent that changed between the parent and send snapshots and that extent was an inline extent in the parent snapshot, it's possible to access a memory region beyond the end of leaf if the inline extent is very small and it is the first item in a leaf. An example scenario is described below. The send snapshot has the following leaf: leaf 33865728 items 33 free space 773 generation 46 owner 5 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f (...) item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53 generation 36 type 1 (regular) extent data disk byte 12791808 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53 generation 36 type 1 (regular) extent data disk byte 138170368 nr 225280 extent data offset 0 nr 225280 ram 225280 extent compression 0 (none) (...) And the parent snapshot has the following leaf: leaf 31272960 items 17 free space 17 generation 31 owner 5 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44 generation 31 type 0 (inline) inline extent data size 23 ram_bytes 613 compression 1 (zlib) (...) When computing the send stream, it is detected that the extent of inode 335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we grab the leaf from the parent snapshot and access the inline extent item. However, before jumping to the 'out' label, we access the 'offset' and 'disk_bytenr' fields of the extent item, which should not be done for inline extents since the inlined data starts at the offset of the 'disk_bytenr' field and can be very small. 
For example accessing the 'offset' field of the file extent item results in the following trace: [ 599.705368] general protection fault: 0000 [#1] PREEMPT SMP [ 599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$ [ 599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1 [ 599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000 [ 599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs] [ 599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286 [ 599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001 [ 599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f [ 599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098 [ 599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7 [ 599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088 [ 599.709340] FS: 00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000 [ 599.709340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0 [ 599.709340] Call Trace: [ 599.709340] btrfs_get_token_64+0x93/0xce [btrfs] [ 599.709340] ? printk+0x48/0x50 [ 599.709340] btrfs_get_64+0xb/0xd [btrfs] [ 599.709340] process_extent+0x3a1/0x1106 [btrfs] [ 599.709340] ? btree_read_extent_buffer_pages+0x5/0xef [btrfs] [ 599.709340] changed_cb+0xb03/0xb3d [btrfs] [ 599.709340] ? btrfs_get_token_32+0x7a/0xcc [btrfs] [ 599.709340] btrfs_compare_trees+0x432/0x53d [btrfs] [ 599.709340] ? process_extent+0x1106/0x1106 [btrfs] [ 599.709340] btrfs_ioctl_send+0x960/0xe26 [btrfs] [ 599.709340] btrfs_ioctl+0x181b/0x1fed [btrfs] [ 599.709340] ? 
trace_hardirqs_on_caller+0x150/0x1ac [ 599.709340] vfs_ioctl+0x21/0x38 [ 599.709340] ? vfs_ioctl+0x21/0x38 [ 599.709340] do_vfs_ioctl+0x611/0x645 [ 599.709340] ? rcu_read_unlock+0x5b/0x5d [ 599.709340] ? __fget+0x6d/0x79 [ 599.709340] SyS_ioctl+0x57/0x7b [ 599.709340] entry_SYSCALL_64_fastpath+0x18/0xad [ 599.709340] RIP: 0033:0x7f51945eec47 [ 599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47 [ 599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004 [ 599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700 [ 599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046 [ 599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040 [ 599.709340] ? trace_hardirqs_off_caller+0x43/0xb1 [ 599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$ [ 599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00 [ 599.762057] ---[ end trace fe00d7af61b9f49e ]--- This is because the 'offset' field starts at an offset of 37 bytes (offsetof(struct btrfs_file_extent_item, offset)), has a length of 8 bytes and therefore attemping to read it causes a 1 byte access beyond the end of the leaf, as the first item's content in a leaf is located at the tail of the leaf, the item size is 44 bytes and the offset of that field plus its length (37 + 8 = 45) goes beyond the item's size by 1 byte. So fix this by accessing the 'offset' and 'disk_bytenr' fields after jumping to the 'out' label if we are processing an inline extent. We move the reading operation of the 'disk_bytenr' field too because we have the same problem as for the 'offset' field explained above when the inline data is less then 8 bytes. 
The access to the 'generation' field is also moved but just for the sake of grouping access to all the fields. Fixes: `e1cbfd7bf6` ("Btrfs: send, fix file hole not being preserved due to inline extent") Cc: <stable@vger.kernel.org> # v4.12+ Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-07-06 23:02:30 +01:00
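The byte arithmetic in the explanation above can be reproduced with a short sketch. The field list below is an assumption about the packed on-disk layout of the file extent item, chosen so that it agrees with the numbers quoted in the message (a 53-byte regular item, the 'offset' field at byte 37, inline data starting where 'disk_bytenr' would be); `field_offsets()` and `overrun()` are hypothetical helpers.

```python
import struct

# Assumed packed layout: (field name, struct format code).
FIELDS = [
    ("generation", "Q"),
    ("ram_bytes", "Q"),
    ("compression", "B"),
    ("encryption", "B"),
    ("other_encoding", "H"),
    ("type", "B"),
    # For inline extents the data starts here instead of the fields below.
    ("disk_bytenr", "Q"),
    ("disk_num_bytes", "Q"),
    ("offset", "Q"),
    ("num_bytes", "Q"),
]


def field_offsets():
    """Return ({name: (offset, size)}, total_size) for the packed layout."""
    offsets = {}
    pos = 0
    for name, code in FIELDS:
        size = struct.calcsize("<" + code)
        offsets[name] = (pos, size)
        pos += size
    return offsets, pos


def overrun(field, item_size):
    """Bytes read past the end of an item when accessing 'field'."""
    start, size = field_offsets()[0][field]
    return max(0, start + size - item_size)


if __name__ == "__main__":
    offsets, total = field_offsets()
    print(total, offsets["disk_bytenr"], offsets["offset"])
    inline_item_size = offsets["disk_bytenr"][0] + 23   # 23 bytes of inline data
    print(inline_item_size, overrun("offset", inline_item_size))
    # With fewer than 8 bytes of inline data even 'disk_bytenr' overruns.
    print(overrun("disk_bytenr", offsets["disk_bytenr"][0] + 7))
```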
Filipe Manana	f59627810e	Btrfs: incremental send, fix invalid path for link commands In some scenarios an incremental send stream can contain link commands with an invalid target path. Such scenarios happen after moving some directory inode A, renaming a regular file inode B into the old name of inode A and finally creating a new hard link for inode B at directory inode A. Consider the following example scenario where this issue happens. Parent snapshot: . (ino 256) \| \|--- dir1/ (ino 257) \| \|--- dir2/ (ino 258) \| \|--- dir3/ (ino 259) \| \|--- file1 (ino 261) \| \|--- dir4/ (ino 262) \| \|--- dir5/ (ino 260) Send snapshot: . (ino 256) \| \|--- dir1/ (ino 257) \|--- dir2/ (ino 258) \| \|--- dir3/ (ino 259) \| \|--- dir4 (ino 261) \| \|--- dir6/ (ino 263) \|--- dir44/ (ino 262) \|--- file11 (ino 261) \|--- dir55/ (ino 260) When attempting to apply the corresponding incremental send stream, a link command contains an invalid target path which makes the receiver fail. The following is the verbose output of the btrfs receive command: receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...) utimes utimes dir1 utimes dir1/dir2/dir3 utimes rename dir1/dir2/dir3/dir4 -> o262-7-0 link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1 link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory The following steps happen during the computation of the incremental send stream that lead to this issue: 1) When processing inode 261, we orphanize inode 262 due to a name/location collision with one of the new hard links for inode 261 (created in the second step below). 2) We create one of the 2 new hard links for inode 261, the one whose location is at "dir1/dir2/dir3/dir4". 3) We then attempt to create the other new hard link for inode 261, which has inode 262 as its parent directory. 
Because the path for this new hard link was computed before we started processing the new references (hard links), it reflects the old name/location of inode 262, that is, it does not account for the orphanization step that happened when we started processing the new references for inode 261, whence it is no longer valid, causing the receiver to fail. So fix this issue by recomputing the full path of new references if we ended up orphanizing other inodes which are directories. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-07-06 23:02:18 +01:00
Su Yue	59b0a7f2c7	btrfs: Check name_len before read in iterate_dir_item Since iterate_dir_item checks name_len in its own way, use btrfs_is_name_len_valid rather than 'verify_dir_item' to make a stricter name_len check. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ switched ENAMETOOLONG to EIO ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Filipe Manana	fdb1388994	Btrfs: incremental send, fix invalid path for unlink commands An incremental send can contain unlink operations with an invalid target path when we rename some directory inode A, then rename some file inode B to the old name of inode A and directory inode A is an ancestor of inode B in the parent snapshot (but not anymore in the send snapshot). Consider the following example scenario where this issue happens. Parent snapshot: . (ino 256) | |--- dir1/ (ino 257) |--- dir2/ (ino 258) | |--- file1 (ino 259) | |--- file3 (ino 261) | |--- dir3/ (ino 262) |--- file22 (ino 260) |--- dir4/ (ino 263) Send snapshot: . (ino 256) | |--- dir1/ (ino 257) |--- dir2/ (ino 258) |--- dir3 (ino 260) |--- file3/ (ino 262) |--- dir4/ (ino 263) |--- file11 (ino 269) |--- file33 (ino 261) When attempting to apply the corresponding incremental send stream, an unlink operation contains an invalid path which makes the receiver fail. The following is verbose output of the btrfs receive command: receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...) utimes utimes dir1 utimes dir1/dir2 link dir1/dir3/dir4/file11 -> dir1/dir2/file1 unlink dir1/dir2/file1 utimes dir1/dir2 truncate dir1/dir3/dir4/file11 size=0 utimes dir1/dir3/dir4/file11 rename dir1/dir3 -> o262-7-0 link dir1/dir3 -> o262-7-0/file22 unlink dir1/dir3/file22 ERROR: unlink dir1/dir3/file22 failed. Not a directory The following steps happen during the computation of the incremental send stream that lead to this issue: 1) Before we start processing the new and deleted references for inode 260, we compute the full path of the deleted reference ("dir1/dir3/file22") and cache it in the list of deleted references for our inode. 2) We then start processing the new references for inode 260, for which there is only one new, located at "dir1/dir3". 
When processing this new reference, we check that inode 262, which was not yet processed, collides with the new reference and because of that we orphanize inode 262 so its new full path becomes "o262-7-0". 3) After the orphanization of inode 262, we create the new reference for inode 260 by issuing a link command with a target path of "dir1/dir3" and a source path of "o262-7-0/file22". 4) We then start processing the deleted references for inode 260, for which there is only one with the base name of "file22", and issue an unlink operation containing the target path computed at step 1, which is wrong because that path no longer exists and should be replaced with "o262-7-0/file22". So fix this issue by recomputing the full path of deleted references if when we processed the new references for an inode we ended up orphanizing any other inode that is an ancestor of our inode in the parent snapshot. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> [ adjusted after prev patch removed fs_path::dir_path and dir_path_len ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 16:53:10 +02:00
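The stale-path problem described in steps 1 to 4 of the commit above can be modelled in a few lines. This is only an illustrative sketch in Python, not the kernel's send code: the dictionary-based namespace and the helpers `full_path`, `orphanize` and `unlink_path_for_file22` are hypothetical names, and the orphan name format `o<ino>-<gen>-<seq>` is assumed from the `o262-7-0` string in the receive output.

```python
ROOT = 256  # inode number of the subvolume root in the example


def full_path(ns, parent, name):
    """Build the path of entry `name` inside directory inode `parent`.

    `ns` maps a directory inode number to (parent inode, directory name).
    """
    parts = [name]
    while parent != ROOT:
        parent, dirname = ns[parent]
        parts.append(dirname)
    return "/".join(reversed(parts))


def orphanize(ns, ino, gen=7, seq=0):
    """Move a directory to a temporary orphan name at the top level.

    The name format is an assumption based on "o262-7-0" in the log.
    """
    ns[ino] = (ROOT, f"o{ino}-{gen}-{seq}")


def unlink_path_for_file22(recompute):
    """Return the path an unlink for inode 260's old reference would use."""
    # Parent snapshot: dir1 (257) holds dir3 (262), which holds file22 (260).
    ns = {257: (ROOT, "dir1"), 262: (257, "dir3")}
    deleted_ref = (262, "file22")

    cached = full_path(ns, *deleted_ref)  # step 1: path cached up front
    orphanize(ns, 262)                    # step 2: collision orphanizes 262

    if recompute:
        # Behaviour after the fix: rebuild the path after orphanization.
        return full_path(ns, *deleted_ref)
    return cached                         # step 4 before the fix: stale path
```

Calling `unlink_path_for_file22(False)` reproduces the stale path from the error message, while `unlink_path_for_file22(True)` yields the path under the orphan name that the commit message says should be used.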
Filipe Manana	72c3668fed	Btrfs: send, fix invalid path after renaming and linking file Currently an incremental snapshot can generate link operations which contain an invalid target path. Such case happens when in the send snapshot a file was renamed, a new hard link added for it and some other inode (with a lower number) got renamed to the former name of that file. Example: Parent snapshot . (ino 256) | |--- f1 (ino 257) |--- f2 (ino 258) |--- f3 (ino 259) Send snapshot . (ino 256) | |--- f2 (ino 257) |--- f3 (ino 258) |--- f4 (ino 259) |--- f5 (ino 258) The following steps happen when computing the incremental send stream: 1) When processing inode 257, inode 258 is orphanized (renamed to "o258-7-0"), because its current reference has the same name as the new reference for inode 257; 2) When processing inode 258, we iterate over all its new references, which have the names "f3" and "f5". The first iteration sees name "f5" and renames the inode from its orphan name ("o258-7-0") to "f5", while the second iteration sees the name "f3" and, incorrectly, issues a link operation with a target name matching the orphan name, which no longer exists. The first iteration had reset the current valid path of the inode to "f5", but in the second iteration we lost it because we found another inode, with a higher number of 259, which has a reference named "f3" as well, so we orphanized inode 259 and recomputed the current valid path of inode 258 to its old orphan name because inode 259 could be an ancestor of inode 258 and therefore the current valid path could contain the pre-orphanization name of inode 259. However in this case inode 259 is not an ancestor of inode 258 so the current valid path should not be recomputed. 
This makes the receiver fail with the following error: ERROR: link f3 -> o258-7-0 failed: No such file or directory So fix this by not recomputing the current valid path for an inode whenever we find a colliding reference from some not yet processed inode (inode number higher than the one currently being processed), unless that other inode is an ancestor of the one we are currently processing. A test case for fstests will follow soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 16:53:03 +02:00
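The rule stated in the fix above (recompute the current valid path only when the orphanized inode is an ancestor of the inode being processed) can be sketched as follows. This is a simplified Python model rather than the kernel implementation; `is_ancestor`, `valid_path_after_collision` and the `parent_of` mapping are hypothetical names introduced for illustration.

```python
ROOT = 256  # inode number of the subvolume root in the example


def is_ancestor(parent_of, maybe_ancestor, ino):
    """Return True if `maybe_ancestor` is a directory somewhere above `ino`.

    `parent_of` maps an inode number to the inode number of its parent
    directory; inodes missing from the map are treated as children of ROOT.
    """
    cur = parent_of.get(ino, ROOT)
    while cur != ROOT:
        if cur == maybe_ancestor:
            return True
        cur = parent_of.get(cur, ROOT)
    return False


def valid_path_after_collision(parent_of, cur_ino, cur_path,
                               orphaned_ino, recomputed_path):
    """Pick the current valid path after `orphaned_ino` was orphanized.

    Only an orphanized ancestor can invalidate `cur_path`, so the path is
    replaced by `recomputed_path` in that case alone.
    """
    if is_ancestor(parent_of, orphaned_ino, cur_ino):
        return recomputed_path
    return cur_path


# Scenario from the commit message: inodes 257, 258 and 259 all live
# directly under the root, and inode 258 was already renamed to "f5".
parent_of = {257: ROOT, 258: ROOT, 259: ROOT}
kept = valid_path_after_collision(parent_of, 258, "f5", 259, "o258-7-0")
```

In this scenario inode 259 is not an ancestor of inode 258, so `kept` stays `"f5"` and a later link operation would not reference the orphan name that no longer exists.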
David Sterba	f11f74416a	btrfs: send: use kvmalloc in iterate_dir_item We use a growing buffer for xattrs larger than a page size, at some point vmalloc is unconditionally used for larger buffers. We can still try to avoid it using the kvmalloc helper. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:02 +02:00
David Sterba	818e010bf9	btrfs: replace opencoded kvzalloc with the helper The logic of kmalloc and vmalloc fallback is opencoded in several places, we can now use the existing helper. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:01 +02:00
David Sterba	ee4ea69852	btrfs: remove unused members dir_path from recorded_ref The two members do not seem to be used since the initial commit. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Linus Torvalds	1176032cb1	Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has fixes and cleanups Dave Sterba collected for the merge window. The biggest functional fixes are between btrfs raid5/6 and scrub, and raid5/6 and device replacement. Some of our pending qgroup fixes are included as well while I bash on the rest in testing. We also have the usual set of cleanups, including one that makes __btrfs_map_block() much more maintainable, and conversions from atomic_t to refcount_t" * 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits) btrfs: fix the gfp_mask for the reada_zones radix tree Btrfs: fix reported number of inode blocks Btrfs: send, fix file hole not being preserved due to inline extent Btrfs: fix extent map leak during fallocate error path Btrfs: fix incorrect space accounting after failure to insert inline extent Btrfs: fix invalid attempt to free reserved space on failure to cow range btrfs: Handle delalloc error correctly to avoid ordered extent hang btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error btrfs: check if the device is flush capable btrfs: delete unused member nobarriers btrfs: scrub: Fix RAID56 recovery race condition btrfs: scrub: Introduce full stripe lock for RAID56 btrfs: Use ktime_get_real_ts for root ctime Btrfs: handle only applicable errors returned by btrfs_get_extent btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option btrfs: use q which is already obtained from bdev_get_queue Btrfs: switch to div64_u64 if with a u64 divisor Btrfs: update scrub_parity to use u64 stripe_len Btrfs: enable repair during read for raid56 profile btrfs: use clear_page where appropriate ...	2017-05-10 08:33:17 -07:00
Michal Hocko	752ade68cb	treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. 
Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Filipe Manana	e1cbfd7bf6	Btrfs: send, fix file hole not being preserved due to inline extent Normally we don't have inline extents followed by regular extents, but there's currently at least one harmless case where this happens. For example, when the page size is 4Kb and compression is enabled: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar In this case we get a compressed inline extent, representing 4Kb of data, followed by a hole extent and then a regular data extent. The inline extent was not expanded/converted to a regular extent exactly because it represents 4Kb of data. This does not cause any apparent problem (such as the issue solved by commit `e1699d2d7b` ("btrfs: add missing memset while reading compressed inline extents")) except trigger an unexpected case in the incremental send code path that makes us issue an operation to write a hole when it's not needed, resulting in more writes at the receiver and wasting space at the receiver. So teach the incremental send code to deal with this particular case. The issue can be currently triggered by running fstests btrfs/137 with compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-04-26 16:27:25 +01:00

236 Commits