Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation. The differences between the two can
be made in the prep functions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAluRLa8ACgkQxWXV+ddt
WDvc+BAAqxTMVngZ60WfktXzsS56OB6fu/R3DORgYcSZ0BCD4zTwoDlCjLhrCK6E
cmC+BMj+AspDQYiYESwGyFcN10sK0X7w7fa3wypTc4GNWxpkRm0Z6zT/kCvLUhdI
NlkMqAfsZ9N6iIXcR0qOxI7G55e3mpXPZGdFTk5rmDTv/9TqU0TMp9s8Zw5scn6R
ctdE+iE0lpRfNjF8ZDH1BtYIV4g2X81sZF/fkGz621HQfMTCjjPHFdlz+jlirBaf
BrYR4w4zjVuMKd3ZC5FHffVchbkvt29h6fAr4sEpJTwFJwd8pjI7GuPYWDQ918NB
TGX6EUP6usQqDK2zD405jCS6MbMshJm3uh5kmEpeNgK/tKJTln8Sbef/Xs93yIn2
+k9BMKOIcUHHBiv6PgCaZomcWCpii2S2u6vncqCnNuI4wK1RN3gHJc5YPhJArlrB
NUFJiTCQE6LWYOP2Hw+rggcrtBxli0bX7Mqp5FYFVdh5KBvolJE1o3B/JS8qpqRF
u0dPwbLHtTpTpXM5EfmM8a45S+DxuxTDBh3vdoAOM9LN/ivpeqqnFbHrIGmrTMjo
pQJ8aTrCwYMEMNu6oCV1cniFrOYRZ439hYjg524MjVXYCRyxhzAdVmVTEBaLjWCW
9GlGqEC7YZY2wLi5lPEGqxsIaVVELpettJB9KbBKmYB47VFWEf0=
=fu93
-----END PGP SIGNATURE-----
Merge tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix for improper fsync after hardlink
- fix for a corruption during file deduplication
- use after free fixes
- RCU warning fix
- fix for buffered write to nodatacow file
* tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
btrfs: use after free in btrfs_quota_enable
btrfs: btrfs_shrink_device should call commit transaction at the end
btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
Btrfs: fix data corruption when deduplicating between different files
Btrfs: sync log after logging new name
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
If we deduplicate extents between two different files we can end up
corrupting data if the source range ends at the size of the source file,
the source file's size is not aligned to the filesystem's block size
and the destination range does not go past the size of the destination
file size.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
# The first byte with a value of 0xae starts at an offset (2518890)
# which is not a multiple of the sector size.
$ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo
# Confirm the file content is full of bytes with values 0x6b and 0xae.
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# Create a second file with a length not aligned to the sector size,
# whose bytes all have the value 0x6b, so that its extent(s) can be
# deduplicated with the first file.
$ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar
# Now deduplicate the entire second file into a range of the first file
# that also has all bytes with the value 0x6b. The destination range's
# end offset must not be aligned to the sector size and must be less
# then the offset of the first byte with the value 0xae (byte at offset
# 2518890).
$ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo
# The bytes in the range starting at offset 2515659 (end of the
# deduplication range) and ending at offset 2519040 (start offset
# rounded up to the block size) must all have the value 0xae (and not
# replaced with 0x00 values). In other words, we should have exactly
# the same data we had before we asked for deduplication.
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# Unmount the filesystem and mount it again. This guarantees any file
# data in the page cache is dropped.
$ umount /dev/sdb
$ mount /dev/sdb /mnt
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
# value of 0xae, data corruption happened due to the deduplication
# operation.
So fix this by rounding down, to the sector size, the length used for the
deduplication when the following conditions are met:
1) Source file's range ends at its i_size;
2) Source file's i_size is not aligned to the sector size;
3) Destination range does not cross the i_size of the destination file.
Fixes: e1d227a42e ("btrfs: Handle unaligned length in extent_same")
CC: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
Commit e9894fd3e3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.
The steps leading to this problem are:
1. When it's not possible to allocate data space for a write, the
buffered write path checks if a NOCOW write is possible. If it is,
it will not reserve space and success (0) is returned to user space.
2. Then when a snapshot is created, the root's will_be_snapshotted
atomic is incremented and writeback is triggered for all inode's that
belong to the root being snapshotted. Incrementing that atomic forces
all previous writes to fallback to COW during writeback (running
delalloc).
3. This results in the writeback for the inodes to fail and therefore
setting the ENOSPC error in their mappings, so that a subsequent
fsync on them will report the error to user space. So it's not a
completely silent data loss (since fsync will report ENOSPC) but it's
a very unexpected and undesirable behaviour, because if a clean
shutdown/unmount of the filesystem happens without previous calls to
fsync, it is expected to have the data present in the files after
mounting the filesystem again.
So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:
1. It is incremented when we start to create a snapshot after triggering
writeback and before waiting for writeback to finish.
2. This new atomic is now what is used by writeback (running delalloc)
to decide whether we need to fallback to COW or not. Because we
incremented this new atomic after triggering writeback in the
snapshot creation ioctl, we ensure that all buffered writes that
happened before snapshot creation will succeed and not fallback to
COW (which would make them fail with ENOSPC).
3. The existing atomic, will_be_snapshotted, is kept because it is used
to force new buffered writes, that start after we started
snapshotting, to reserve data space even when NOCOW is possible.
This makes these writes fail early with ENOSPC when there's no
available space to allocate, preventing the unexpected behaviour of
writeback later failing with ENOSPC due to a fallback to COW mode.
Fixes: e9894fd3e3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Requiring a read-write descriptor conflicts both ways with exec,
returning ETXTBSY whenever you try to defrag a program that's currently
being run, or causing intermittent exec failures on a live system being
defragged.
As defrag doesn't change the file's contents in any way, there's no
reason to consider it a rw operation. Thus, let's check only whether
the file could have been opened rw. Such access control is still needed
as currently defrag can use extra disk space, and might trigger bugs.
We return EINVAL when the request is invalid; here it's ok but merely
the user has insufficient privileges. Thus, the EPERM return value
reflects the error better -- as discussed in the identical case for
dedupe.
According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the EPERM patch from Adam ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening:
- before the commit: it would happen after transaction commit had occured
- after the commit: it might happen before the transaction was committed
This results in failure to correctly rescan the quota since there could
be data which is still not committed on disk.
This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows to schedule the quota commit
after the transaction has been committed.
Fixes: 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a new extent buffer is allocated there are a few mandatory fields
which need to be set in order for the buffer to be sane: level,
generation, bytenr, backref_rev, owner and FSID/UUID. Currently this
is open coded in the callers of btrfs_alloc_tree_block, meaning it's
fairly high in the abstraction hierarchy of operations. This patch
solves this by simply moving this init code in btrfs_init_new_buffer,
since this is the function which initializes a newly allocated
extent buffer. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch avoids that building the BTRFS source code with smatch
triggers complaints about inconsistent indenting.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_cmp_data_free() puts cmp's src_pages and dst_pages, but leaves
their page address intact. Now, if you hit "goto again" in
btrfs_extent_same_range() and hit some error in
btrfs_cmp_data_prepare(), you'll try to unlock/put already put pages.
This is simple fix to reset the address to avoid use-after-free.
Fixes: 67b07bd4be ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Clean up f_op->dedupe_file_range() interface.
1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsvtAIACgkQxWXV+ddt
WDvPJQ//eEr+ACNxG5n42e63TaNpLOeoGAsUChwekP6DAIc2n/r78SHX+T/woDqd
7/dc2eqlYF5Fmn6MPQ1ufL2xlfw0t3OnemK8T5+4sxZDdzeAH9V+kHaqoaLUPExL
4r6lK5Ywwa2cWghC7WvQg3+bWLX18OEExG63SlEhLo3YJM5uUqVhGVPi6ARrbxNM
GJvXcQsxjXLqukm4gYvHC6Zra9Awv65uiAU+VCm2y96j1fEJ0yjK/pC1RtoFGCqS
48Jjuzfq/V3nxy0Wjr3DvpQVEQcKyGha6i/eazZISdRhGSjrYdvIwpUn7gZeoo2Q
hT8VVergLbVYgIeaOwgwubQzNaG2C/ZTsEjPQrNrA7a/AGsh5C/ommYE9MSyS6L0
PG0NLUNDXFmEj8WdI97So+1Sm2OCb04DPbPhHbbkhw5L/MPE1TaLN5aUWguj8laB
NnyPRdVP9jCJAI0OhJY7nPDmsPKe2jogVVsRheTcob+V5G+vIgzDXlGfsW/88Seg
dHubMaC0nz6u8Cj4dIviitiLXuustyz0JkVdljTLawWWJ/Hs7NlsSf3Q3nj2Kvia
e8QMID0vLphQyL1hqC0n7M0g2ohq9NUGT1nhLTPSpFl0l8bIA9PQehOx9Q5vx8yp
tJF+d0qiNfgvadA+KhyvQ8puQb2+zZQ8+Pqfjwd4ySD7SI5dcA4=
=fCdm
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes and an incorrect error value propagation fix from
'rename exchange'"
* tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix return value on rename exchange failure
btrfs: fix invalid-free in btrfs_extent_same
Btrfs: fix physical offset reported by fiemap for inline extents
If this condition ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM))
is hit, we will go to free the uninitialized cmp.src_pages and
cmp.dst_pages.
Fixes: 67b07bd4be ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
The patch introducing the ioctl was not the latest version at the time
of merging to the mainline and needs a fixup from this patch.
Fixes: ba637a252d30 ("btrfs: Check error of btrfs_iget() in btrfs_search_path_in_tree_user")
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER
to allow normal users to call "btrfs subvolume list/show" etc. in
combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF.
This can be used like BTRFS_IOC_INO_LOOKUP but the argument is
different. This is because it always searches the fs/file tree
correspoinding to the fd with which this ioctl is called and also
returns the name of bottom subvolume.
The main differences from original ino_lookup ioctl are:
1. Read + Exec permission will be checked using inode_permission()
during path construction. -EACCES will be returned in case
of failure.
2. Path construction will be stopped at the inode number which
corresponds to the fd with which this ioctl is called. If
constructed path does not exist under fd's inode, -EACCES
will be returned.
3. The name of bottom subvolume is also searched and filled.
Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1)
bytes than ino_lookup ioctl because of space of subvolume's name.
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns
ROOT_REF information of the subvolume containing this inode except the
subvolume name (this is because to prevent potential name leak). The
subvolume name will be gained by user version of ino_lookup ioctl
(BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check.
The min id of root ref's subvolume to be searched is specified by
@min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search
ends, @min_id is set to the last searched root ref's subvolid + 1. Also,
if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM,
-EOVERFLOW is returned. Therefore the caller can just call this ioctl
again without changing the argument to continue search.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes and struct item renames ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns
the information of subvolume containing this inode.
(i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.)
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ minor style fixes, update struct comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 7775c8184e ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by caller of function btrfs_subvolume_reserve_metadata. So remove it.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ rename the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have invalid flags set, when we error out we must drop our writer
counter and free the buffer we allocated for the arguments. This bug is
trivially reproduced with the following program on 4.7+:
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>
int main(int argc, char **argv)
{
struct btrfs_ioctl_vol_args_v2 vol_args = {
.flags = UINT64_MAX,
};
int ret;
int fd;
if (argc != 2) {
fprintf(stderr, "usage: %s PATH\n", argv[0]);
return EXIT_FAILURE;
}
fd = open(argv[1], O_WRONLY);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args);
if (ret == -1)
perror("ioctl");
close(fd);
return EXIT_SUCCESS;
}
When unmounting the filesystem, we'll hit the
WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the
filesystem to be remounted read-only as the writer count will stay
lifted.
Fixes: 6b526ed70c ("btrfs: introduce device delete by devid")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_clone_files(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check and we can end up with a party
checksummed file.
The race window is only a few instructions in size, between the if and
the locks which is:
3834 if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
3835 return -EISDIR;
where the setflags must be run and toggle the NODATASUM flag (provided
the file size is 0). The clone will block on the inode lock, segflags
takes the inode lock, changes flags, releases log and clone continues.
Not impossible but still needs a lot of bad luck to hit unintentionally.
Fixes: 0e7b824c4e ("Btrfs: don't make a file partly checksummed through file clone")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_read_fs_root_no_name() may return ERR_PTR(-ENOENT) or
ERR_PTR(-ENOMEM) and therefore search_ioctl() and
btrfs_search_path_in_tree() should use PTR_ERR() instead of -ENOENT,
which all other callers of btrfs_read_fs_root_no_name() do.
Drop the error message as it would be confusing, the caller of ioctl
will likely interpret the error code and not look into the syslog.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The dedupe range is 16 MiB, with 4 KiB pages and 8 byte pointers, the
arrays can be 32KiB large. To avoid allocation failures due to
fragmented memory, use the allocation with fallback to vmalloc.
The arrays are allocated and freed only inside btrfs_extent_same and
reused for all the ranges.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We support big dedup requests by splitting range to smaller parts, and
call dedupe logic on each of them.
Instead of repeated allocation and deallocation, allocate once at the
beginning and reuse in the iteration.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_dedupe_file_range silently restricts the dedupe range to
to 16MiB to limit locking and working memory size and is documented in
manual page as implementation specific.
Let's remove that restriction by iterating over the dedup range in 16MiB
steps. This is backward compatible and will not change anything for
requests smaller then 16MiB.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Split btrfs_extent_same() to two parts where one is the main EXTENT_SAME
entry and a helper that can be repeatedly called on a range. This will
be used in following patches.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* The simple 'flags' refer to the btrfs inode
* ... that's in 'binode
* the FS_*_FL variables are 'fsflags'
* the old copies of the variable are prefixed by 'old_'
* Struct inode flags contain 'i_flags'.
Signed-off-by: David Sterba <dsterba@suse.com>
The new ioctl is an extension to the FS_IOC_SETFLAGS and adds new
flags and is extensible. Don't get fooled by the XATTR in the name, it
does not have anything in common with the extended attributes,
incidentally also abbreviated as XATTRs.
This patch allows to set the xflags portion of the fsxattr structure,
other items have no meaning and non-zero values will result in
EOPNOTSUPP.
Currently supported xflags:
- APPEND
- IMMUTABLE
- NOATIME
- NODUMP
- SYNC
The structure of btrfs_ioctl_fssetxattr copies btrfs_ioctl_setflags but
is simpler on the flag setting side.
The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent.
Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new ioctl is an extension to the FS_IOC_GETFLAGS and adds new
flags and is extensible. This patch allows to return the xflags portion
of the fsxattr structure, other items have no meaning for btrfs or can
be added later.
The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent. Several cleanups were necessary
to avoid confusion with other ioctls, as we have another flavor of
flags.
Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The FS_*_FL flags cannot be easily identified by a prefix but we still
need to recognize them so the 'fsflags' should be closer to the naming
scheme but again the 'fs' part sounds like it's a filesystem flag. I
don't have a better idea for now.
Signed-off-by: David Sterba <dsterba@suse.com>
The FS_*_FL flags cannot be easily identified by a variable name prefix
but we still need to recognize them so the 'fsflags' should be closer to
the naming scheme but again the 'fs' part sounds like it's a filesystem
flag. I don't have a better idea for now.
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info is always available from the context so we don't need to
store it in the structure.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>