We should account trimmed block number from __wait_all_discard_cmd
in __issue_discard_cmd_range, otherwise trimmed blocks returned
by f2fs_trim_fs will be wrong, this patch fixes it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is millions of discard entries cached in rb tree, each
sanity check of it can cause very long latency as held cmd_lock
blocking other lock grabbers.
In other aspect, we have enabled the check very long time, as
we see, there is no such inconsistent condition caused by bugs.
But still we do not choose to kill it directly, instead, adding
an flag to disable the check now, if there is related code change,
we can reuse it to detect bugs.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces verify_blkaddr to check meta/data block address
with valid range to detect bug earlier.
In addition, once we encounter an invalid blkaddr, notice user to run
fsck to fix, and let the kernel panic.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The on-disk representation and the vfs both use 64-bit tv_sec values,
so let's change the last missing piece in the middle.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In error path of f2fs_move_rehashed_dirents, inode page could be writeback
state, so we should wait on inode page writeback before updating it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Jens reported, we'd better assign REQ_RAHEAD to bio by the fact that
->readpages is called only from read-ahead.
In Documentation/filesystems/vfs.txt,
readpages: called by the VM to read pages associated with the address_space
object. This is essentially just a vector version of
readpage. Instead of just one page, several pages are
requested.
readpages is only used for read-ahead, so read errors are
ignored. If anything goes wrong, feel free to give up.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fix hungtask problem which can be reproduced as follow:
Thread 0~3:
while true
do
touch /xxx/test/file_xxx
done
Thread 4 write a new checkpoint every three seconds.
In the meantime, fio start 16 threads for randwrite.
With my debug info, cycles num will exceed 1000 in function
f2fs_sync_dirty_inodes, and most of cycle will be dropped
into congestion_wait() and sleep more than 20ms. Cycles num
reduced to 3 with this patch.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
"ret" can be uninitialized on the success path when "in ==
F2FS_GOING_DOWN_FULLSYNC".
Fixes: 60b2b4ee2b ("f2fs: Fix deadlock in shutdown ioctl")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Actually, we don't need to issue discard commands, if discard is on, as
mentioned in the comment.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Enable in-memory inode checksum to protect metadata blocks from
in-memory scribbles when checking consistency, which has no
performance requirements.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In fill_super, if root inode's attribute is incorrect, we need to
call f2fs_destroy_stats to release stats memory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
readdir_ra is sysfs configuration instead of mount option, so it should
not be initialized in default_options(), otherwise after remount, it can
be reset to be enabled which may not as user wish, so let's move it to
f2fs_tuning_parameters().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let default_options() initialize s_res{u,g}id with default value like
other options.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During orphan inode recovery, checkpoint should never succeed due to
SBI_POR_DOING flag, so we don't need acquire orphan ino which only be
used by checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Once we shutdown f2fs, we have to flush stale pages in order to unmount
the system. In order to make stable, we need to stop fault injection as well.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It turns out losing meta pages in shutdown period makes f2fs very unstable
so that I could see many unexpected error conditions.
Let's keep meta pages for fault injection and sudden power-off tests.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When unmounting f2fs in force mode, we can get it stuck by io_schedule()
by some pending IOs in meta_inode.
io_schedule+0xd/0x30
wait_on_page_bit_common+0xc6/0x130
__filemap_fdatawait_range+0xbd/0x100
filemap_fdatawait_keep_errors+0x15/0x40
sync_inodes_sb+0x1cf/0x240
sync_filesystem+0x52/0x90
generic_shutdown_super+0x1d/0x110
kill_f2fs_super+0x28/0x80 [f2fs]
deactivate_locked_super+0x35/0x60
cleanup_mnt+0x36/0x70
task_work_run+0x79/0xa0
exit_to_usermode_loop+0x62/0x70
do_syscall_64+0xdb/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
0xffffffffffffffff
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's flush journal nat entries for speed up in the next run.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
sgid directories have special semantics, making newly created files in
the directory belong to the group of the directory, and newly created
subdirectories will also become sgid. This is historically used for
group-shared directories.
But group directories writable by non-group members should not imply
that such non-group members can magically join the group, so make sure
to clear the sgid bit on non-directories for non-members (but remember
that sgid without group execute means "mandatory locking", just to
confuse things even more).
Reported-by: Jann Horn <jannh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turns out that systemd has a bug: it wants to load the autofs module
early because of some initialization ordering with udev, and it doesn't
do that correctly. Everywhere else it does the proper "look up module
name" that does the proper alias resolution, but in that early code, it
just uses a hardcoded "autofs4" for the module name.
The result of that is that as of commit a2225d931f ("autofs: remove
left-over autofs4 stubs"), you get
systemd[1]: Failed to insert module 'autofs4': No such file or directory
in the system logs, and a lack of module loading. All this despite the
fact that we had very clearly marked 'autofs4' as an alias for this
module.
What's so ridiculous about this is that literally everything else does
the module alias handling correctly, including really old versions of
systemd (that just used 'modprobe' to do this), and even all the other
systemd module loading code.
Only that special systemd early module load code is broken, hardcoding
the module names for not just 'autofs4', but also "ipv6", "unix",
"ip_tables" and "virtio_rng". Very annoying.
Instead of creating an _additional_ separate compatibility 'autofs4'
module, just rely on the fact that everybody else gets this right, and
just call the module 'autofs4' for compatibility reasons, with 'autofs'
as the alias name.
That will allow the systemd people to fix their bugs, adding the proper
alias handling, and maybe even fix the name of the module to be just
"autofs" (so that they can _test_ the alias handling). And eventually,
we can revert this silly compatibility hack.
See also
https://github.com/systemd/systemd/issues/9501https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=902946
for the systemd bug reports upstream and in the Debian bug tracker
respectively.
Fixes: a2225d931f ("autofs: remove left-over autofs4 stubs")
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: Michael Biebl <biebl@debian.org>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use huge_ptep_get() to translate huge ptes to normal ptes so we can
check them with the huge_pte_* functions. Otherwise some architectures
will check the wrong values and will not wait for userspace to bring in
the memory.
Link: http://lkml.kernel.org/r/20180626132421.78084-1-frankja@linux.ibm.com
Fixes: 369cd2121b ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges")
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAls4zz0ACgkQxWXV+ddt
WDs0ZhAAplEAcN1BP986BS7GpjrG20vQtP9AnHlnSEJJJmsnykpspBylOcLRkKjF
LKBfBPCKqIo7kn5ebKT1Kk7zJPkOOEfmxGW7hffVN/oa/oMtmgJbbHDUgl2TDgdu
rky1O+Bj+S37s5rhiXAJ4oU9ekdpWIlN30GczfynjiPqGigKh/cINsEEhQIIAiJG
PRDQfSIJeh67x1AP0KE8sJAYSsaeFxT+kHrT/NPs1NFDSzrQSa/QWPFVjGVVuI/Y
w84Mo0EqdRV7tap7D3QyWyYea6zdP00PG8TyLl0Kba+LckFbzpNN5hP3SUxleBzL
0ZBJi7/tOqnrMV3YaGm40dLfgD4B+CFt8zDyg2JvWUxxEzfQfYif7KIT2IV8fSqS
QrVw2NrzQC7EZ4Zu98wCN7dyyOE8yhqbq805YdG3Nj+zT6DqRu01TBo4Yr/Ek8ux
+ITAtQVbaOZmTIt/qh/Oxc5jRsurAno1FP3XRH+1hfSlS7xc3LfI1CUbX3jAKzXN
edxdM4/h+d4nekvROnKBH4EheS6+ZVfgzYlYUW9c2rjcJ1RHhDElbh14+IoM6LKJ
nJ+Cp+744F6W5jaG3oWElJrdhlY31mWUjiZaj2CHl16EcH3MToytrxKMX+OWo95W
gChnKicrtpO6+9nbED3Tdhp7SkbDysun6jvEpSgdlm8+2H5Kwrw=
=KWL7
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We have a few regression fixes for qgroup rescan status tracking and
the vm_fault_t conversion that mixed up the error values"
* tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix mount failure when qgroup rescan is in progress
Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
Pull vfs fix from Al Viro:
"Followup to procfs-seq_file series this window"
This fixes a memory leak by making sure that proc seq files release any
private data on close. The 'proc_seq_open' has to be properly paired
with 'proc_seq_release' that releases the extra private data.
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
proc: add proc_seq_release
The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.
Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.
But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.
[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a power failure happens while the qgroup rescan kthread is running,
the next mount operation will always fail. This is because of a recent
regression that makes qgroup_rescan_init() incorrectly return -EINVAL
when we are mounting the filesystem (through btrfs_read_qgroup_config()).
This causes the -EINVAL error to be returned regardless of any qgroup
flags being set instead of returning the error only when neither of
the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON
are set.
A test case for fstests follows up soon.
Fixes: 9593bf4967 ("btrfs: qgroup: show more meaningful qgroup_rescan_init error message")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The vm_fault_t conversion commit introduced a ret2 variable for tracking
the integer return values from internal btrfs functions. It was
sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
and had been removed from the radix. Something like this:
ret2 = btrfs_delalloc_reserve_space() // returns zero on success
lock_page(page)
if (page->mapping != inode->i_mapping)
goto out_unlock;
...
out_unlock:
if (!ret2) {
...
return VM_FAULT_LOCKED;
}
This ends up triggering this WARNING in btrfs_destroy_inode()
WARN_ON(BTRFS_I(inode)->block_rsv.size);
xfstests generic/095 was able to reliably reproduce the errors.
Since out_unlock: is only used for errors, this fix moves it below the
if (!ret2) check we use to return VM_FAULT_LOCKED for success.
Fixes: a528a24150 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit ff3d27a048 ("btrfs: qgroup: Finish rescan when hit the last leaf
of extent tree") added a new exit for rescan finish.
However after finishing quota rescan, we set
fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the
original exit path.
While we missed that assignment of (u64)-1 in the new exit path.
The end result is, the quota status item doesn't have the same value.
(-1 vs the last bytenr + 1)
Although it doesn't affect quota accounting, it's still better to keep
the original behavior.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Fixes: ff3d27a048 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kmemleak reported some memory leak on reading proc files. After adding
some debug lines, find that proc_seq_fops is using seq_release as
release handler, which won't handle the free of 'private' field of
seq_file, while in fact the open handler proc_seq_open could create
the private data with __seq_open_private when state_size is greater
than zero. So after reading files created with proc_create_seq_private,
such as /proc/timer_list and /proc/vmallocinfo, the private mem of a
seq_file is not freed. Fix it by adding the paired proc_seq_release
as the default release handler of proc_seq_ops instead of seq_release.
Fixes: 44414d82cf ("proc: introduce proc_create_seq_private")
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- More metadata validation strengthening to prevent crashes.
- Fix extent offset overflow problem when insert_range on a 512b block fs
- Fix some off-by-one errors in the realtime fsmap code
- Fix some math errors in the default resblks calculation when free space
is low
- Fix a problem where stale page contents are exposed via mmap read
after a zero_range at eof
- Fix accounting problems with per-ag reservations causing statfs
reports to vary incorrectly
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsv6mQACgkQ+H93GTRK
tOv3ZA//W9QVf6aGv3fefg7ZkleRpKPhpvrdgy+LnnhGdkToKGsOIJGavFjBYfBh
cFSOXrKamNiSZiJzzQ8NiFaZ8wrD2NAhOybatp7jAGKGb+a3VTtoz7v+4VwJ5Psw
gPdZ+zADqsbbUg4Xx08TDqdQDHlIXz6kTvrM+MGBq5TynICAIUgS2o9fJT1HKf12
xfbC2dh2cK/7TnN0MNCLzekrPoiLfuEPsMkrnWv/PBre0AhfcCJLbNhQi5+mCWcs
e71vttXAc3kkSly6G4D9SSb+5Mn1Q1SkBsORn82fJdI59eHjFn1ujLvkfPM1w1Iu
oDjJw8PCyrG7NFZiTh46MuJhvH/Tc6ZZvop/YJyJJApHOYYx6K64enpJrWcg+CKp
Rcl5gQocvlIfMyzBvyozpHctlIuOjkQsSmyEXXx8PAsr7zU+FehXS5ENzCOtdg5U
LivcsXmh0bvSU1PdzA3oiANcIPAfVX6+RoM4rQpDjT+IS7Ud9F2hELD0Yi/pmk4l
oC91+e4oPJ4HsoCDmxePLl29Gqm4n1lY9i3l1PsZe8yWeYxs991R/hJuxKvRQLul
sxkgKk5nvdG2PjSmAKx9XYRrHEf/3uK+TEtrQtrm06i1QEI881qqSEMHpaoRj0qo
awESWimVodllaag+ugFLi+u+UAh8LA0jTAmijBqczLtspU1Cg0c=
=Aw39
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.18-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are some patches for 4.18 to fix regressions, accounting
problems, overflow problems, and to strengthen metadata validation to
prevent corruption.
This series has been run through a full xfstests run over the weekend
and through a quick xfstests run against this morning's master, with
no major failures reported.
Changes since last update:
- more metadata validation strengthening to prevent crashes.
- fix extent offset overflow problem when insert_range on a 512b
block fs
- fix some off-by-one errors in the realtime fsmap code
- fix some math errors in the default resblks calculation when free
space is low
- fix a problem where stale page contents are exposed via mmap read
after a zero_range at eof
- fix accounting problems with per-ag reservations causing statfs
reports to vary incorrectly"
* tag 'xfs-4.18-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation
xfs: ensure post-EOF zeroing happens after zeroing part of a file
xfs: fix off-by-one error in xfs_rtalloc_query_range
xfs: fix uninitialized field in rtbitmap fsmap backend
xfs: recheck reflink state after grabbing ILOCK_SHARED for a write
xfs: don't allow insert-range to shift extents past the maximum offset
xfs: don't trip over negative free space in xfs_reserve_blocks
xfs: allow empty transactions while frozen
xfs: xfs_iflush_abort() can be called twice on cluster writeback failure
xfs: More robust inode extent count validation
xfs: simplify xfs_bmap_punch_delalloc_range
In any case, d_splice_alias() does not drop reference of original
dentry.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsvtAIACgkQxWXV+ddt
WDvPJQ//eEr+ACNxG5n42e63TaNpLOeoGAsUChwekP6DAIc2n/r78SHX+T/woDqd
7/dc2eqlYF5Fmn6MPQ1ufL2xlfw0t3OnemK8T5+4sxZDdzeAH9V+kHaqoaLUPExL
4r6lK5Ywwa2cWghC7WvQg3+bWLX18OEExG63SlEhLo3YJM5uUqVhGVPi6ARrbxNM
GJvXcQsxjXLqukm4gYvHC6Zra9Awv65uiAU+VCm2y96j1fEJ0yjK/pC1RtoFGCqS
48Jjuzfq/V3nxy0Wjr3DvpQVEQcKyGha6i/eazZISdRhGSjrYdvIwpUn7gZeoo2Q
hT8VVergLbVYgIeaOwgwubQzNaG2C/ZTsEjPQrNrA7a/AGsh5C/ommYE9MSyS6L0
PG0NLUNDXFmEj8WdI97So+1Sm2OCb04DPbPhHbbkhw5L/MPE1TaLN5aUWguj8laB
NnyPRdVP9jCJAI0OhJY7nPDmsPKe2jogVVsRheTcob+V5G+vIgzDXlGfsW/88Seg
dHubMaC0nz6u8Cj4dIviitiLXuustyz0JkVdljTLawWWJ/Hs7NlsSf3Q3nj2Kvia
e8QMID0vLphQyL1hqC0n7M0g2ohq9NUGT1nhLTPSpFl0l8bIA9PQehOx9Q5vx8yp
tJF+d0qiNfgvadA+KhyvQ8puQb2+zZQ8+Pqfjwd4ySD7SI5dcA4=
=fCdm
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes and an incorrect error value propagation fix from
'rename exchange'"
* tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix return value on rename exchange failure
btrfs: fix invalid-free in btrfs_extent_same
Btrfs: fix physical offset reported by fiemap for inline extents
In __xfs_ag_resv_init we incorrectly calculate the amount by which to
decrease fdblocks when reserving blocks for the rmapbt. Because rmapbt
allocations do not decrease fdblocks, we must decrease fdblocks by the
entire size of the requested reservation in order to achieve our goal of
always having enough free blocks to satisfy an rmapbt expansion.
This is in contrast to the refcountbt/finobt, which /do/ subtract from
fdblocks whenever they allocate a block. For this allocation type we
preserve the existing behavior where we decrease fdblocks only by the
requested reservation minus the size of the existing tree.
This fixes the problem where the available block counts reported by
statfs change across a remount if there had been an rmapbt size change
since mount time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
If a user asks us to zero_range part of a file, the end of the range is
EOF, and not aligned to a page boundary, invoke writeback of the EOF
page to ensure that the post-EOF part of the page is zeroed. This
ensures that we don't expose stale memory contents via mmap, if in a
clumsy manner.
Found by running generic/127 when it runs zero_range and mapread at EOF
one after the other.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
In commit 8ad560d256 ("xfs: strengthen rtalloc query range checks")
we strengthened the input parameter checks in the rtbitmap range query
function, but introduced an off-by-one error in the process. The call
to xfs_rtfind_forw deals with the high key being rextents, but we clamp
the high key to rextents - 1. This causes the returned results to stop
one block short of the end of the rtdev, which is incorrect.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Initialize the extent count field of the high key so that when we use
the high key to synthesize an 'unknown owner' record (i.e. used space
record) at the end of the queried range we have a field with which to
compute rm_blockcount. This is not strictly necessary because the
synthesizer never uses the rm_blockcount field, but we can shut up the
static code analysis anyway.
Coverity-id: 1437358
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Zorro Lang reports that generic/485 blows an assert on a filesystem with
512 byte blocks. The test tries to fallocate a post-eof extent at the
maximum file size and calls insert range to shift the extents right by
two blocks. On a 512b block filesystem this causes startoff to overflow
the 54-bit startoff field, leading to the assert.
Therefore, always check the rightmost extent to see if it would overflow
prior to invoking the insert range machinery.
Reported-by: zlang@redhat.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200137
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we somehow end up with a filesystem that has fewer free blocks than
the blocks set aside to avoid ENOSPC deadlocks, it's possible that the
free space calculation in xfs_reserve_blocks will spit out a negative
number (because percpu_counter_sum returns s64). We fail to notice
this negative number and set fdblks_delta to it. Now we increment
fdblocks(!) and the unsigned type of m_resblks means that we end up
setting a ridiculously huge m_resblks reservation.
Avoid this comedy of errors by detecting the negative free space and
returning -ENOSPC.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In commit e89c041338 ("xfs: implement the GETFSMAP ioctl") we
created the ability to obtain empty transactions. These transactions
have no log or block reservations and therefore can't modify anything.
Since they're also NO_WRITECOUNT they can run while the fs is frozen,
so we don't need to WARN_ON about that usage.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we failed during a rename exchange operation after starting/joining a
transaction, we would end up replacing the return value, stored in the
local 'ret' variable, with the return value from btrfs_end_transaction().
So this could end up returning 0 (success) to user space despite the
operation having failed and aborted the transaction, because if there are
multiple tasks having a reference on the transaction at the time
btrfs_end_transaction() is called by the rename exchange, that function
returns 0 (otherwise it returns -EIO and not the original error value).
So fix this by not overwriting the return value on error after getting
a transaction handle.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlssqQcACgkQnJ2qBz9k
QNl2XwgAhtb2LvBhWKh3OkBVczICgc9eQho03KiHJkEcR1OCkgjOJhVBDrQmJbBa
tAuMmUcnTpLJG8CI5TyR0G4hNbttE2LuTu+6RV67hXOjhiYAQ5P9wYLyUqfZf/01
myrPEewr9qqx3h2htqufiLQIyO1M4FeM37VqdH7vZhQOb+B+FUw7JB9a/HbCpNh/
7NUDf2GbLtLjK+2Xh0ttXvfWjgbLjC4wMmaPaa5+Nabn+URtvX9aHgQ/dYrOjQ16
7oD5K5x3k3sOT6Ix7xRGLHgE6Xl6MlTtxpyt5ldb96RwWO/GAuMhSCBd14nS2hmx
fEx5N2vbs3v8Ux+ZdMauzAKZI6Fz2g==
=oFo+
-----END PGP SIGNATURE-----
Merge tag 'for_v4.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull udf, quota, ext2 fixes from Jan Kara:
"UDF:
- fix an oops due to corrupted disk image
- two small cleanups
quota:
- a fixfor lru handling
- cleanup
ext2:
- a warning about a deprecated mount option"
* tag 'for_v4.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Drop unused arguments of udf_delete_aext()
udf: Provide function for calculating dir entry length
udf: Detect incorrect directory size
ext2: add warning when specifying nocheck option
quota: Cleanup list iteration in dqcache_shrink_scan()
quota: reclaim least recently used dquots
When a corrupt inode is detected during xfs_iflush_cluster, we can
get a shutdown ASSERT failure like this:
XFS (pmem1): Metadata corruption detected at xfs_symlink_shortform_verify+0x5c/0xa0, inode 0x86627 data fork
XFS (pmem1): Unmount and run xfs_repair
XFS (pmem1): xfs_do_force_shutdown(0x8) called from line 3372 of file fs/xfs/xfs_inode.c. Return address = ffffffff814f4116
XFS (pmem1): Corruption of in-memory data detected. Shutting down filesystem
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c. Return address = ffffffff814a8a88
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c. Return address = ffffffff814a8ef9
XFS (pmem1): Please umount the filesystem and rectify the problem(s)
XFS: Assertion failed: xfs_isiflocked(ip), file: fs/xfs/xfs_inode.h, line: 258
.....
Call Trace:
xfs_iflush_abort+0x10a/0x110
xfs_iflush+0xf3/0x390
xfs_inode_item_push+0x126/0x1e0
xfsaild+0x2c5/0x890
kthread+0x11c/0x140
ret_from_fork+0x24/0x30
Essentially, xfs_iflush_abort() has been called twice on the
original inode that that was flushed. This happens because the
inode has been flushed to teh buffer successfully via
xfs_iflush_int(), and so when another inode is detected as corrupt
in xfs_iflush_cluster, the buffer is marked stale and EIO, and
iodone callbacks are run on it.
Running the iodone callbacks walks across the original inode and
calls xfs_iflush_abort() on it. When xfs_iflush_cluster() returns
to xfs_iflush(), it runs the error path for that function, and that
calls xfs_iflush_abort() on the inode a second time, leading to the
above assert failure as the inode is not flush locked anymore.
This bug has been there a long time.
The simple fix would be to just avoid calling xfs_iflush_abort() in
xfs_iflush() if we've got a failure from xfs_iflush_cluster().
However, xfs_iflush_cluster() has magic delwri buffer handling that
means it may or may not have run IO completion on the buffer, and
hence sometimes we have to call xfs_iflush_abort() from
xfs_iflush(), and sometimes we shouldn't.
After reading through all the error paths and the delwri buffer
code, it's clear that the error handling in xfs_iflush_cluster() is
unnecessary. If the buffer is delwri, it leaves it on the delwri
list so that when the delwri list is submitted it sees a shutdown
fliesystem in xfs_buf_submit() and that marks the buffer stale, EIO
and runs IO completion. i.e. exactly what xfs+iflush_cluster() does
when it's not a delwri buffer. Further, marking a buffer stale
clears the _XBF_DELWRI_Q flag on the buffer, which means when
submission of the buffer occurs, it just skips over it and releases
it.
IOWs, the error handling in xfs_iflush_cluster doesn't need to care
if the buffer is already on a the delwri queue or not - it just
needs to mark the buffer stale, EIO and run completions. That means
we can just use the easy fix for xfs_iflush() to avoid the double
abort.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When the inode is in extent format, it can't have more extents that
fit in the inode fork. We don't currenty check this, and so this
corruption goes unnoticed by the inode verifiers. This can lead to
crashes operating on invalid in-memory structures.
Attempts to access such a inode will now error out in the verifier
rather than allowing modification operations to proceed.
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a typedef, add some braces and breaks to shut up compiler warnings]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>