In order to avoid the below overflow issue, we should have checked the
boundaries in superblock before reaching out to allocation. As Linus suggested,
the right place should be sanity_check_raw_super().
Dr Silvio Cesare of InfoSect reported:
There are integer overflows with using the cp_payload superblock field in the
f2fs filesystem potentially leading to memory corruption.
include/linux/f2fs_fs.h
struct f2fs_super_block {
...
__le32 cp_payload;
fs/f2fs/f2fs.h
typedef u32 block_t; /*
* should not change u32, since it is the on-disk block
* address format, __le32.
*/
...
static inline block_t __cp_payload(struct f2fs_sb_info *sbi)
{
return le32_to_cpu(F2FS_RAW_SUPER(sbi)->cp_payload);
}
fs/f2fs/checkpoint.c
block_t start_blk, orphan_blocks, i, j;
...
start_blk = __start_cp_addr(sbi) + 1 + __cp_payload(sbi);
orphan_blocks = __start_sum_addr(sbi) - 1 - __cp_payload(sbi);
+++ integer overflows
...
unsigned int cp_blks = 1 + __cp_payload(sbi);
...
sbi->ckpt = kzalloc(cp_blks * blk_size, GFP_KERNEL);
+++ integer overflow leading to incorrect heap allocation.
int cp_payload_blks = __cp_payload(sbi);
...
ckpt->cp_pack_start_sum = cpu_to_le32(1 + cp_payload_blks +
orphan_blocks);
+++ sign bug and integer overflow
...
for (i = 1; i < 1 + cp_payload_blks; i++)
+++ integer overflow
...
sbi->max_orphans = (sbi->blocks_per_seg - F2FS_CP_PACKS -
NR_CURSEG_TYPE - __cp_payload(sbi)) *
F2FS_ORPHANS_PER_BLOCK;
+++ integer overflow
Reported-by: Greg KH <greg@kroah.com>
Reported-by: Silvio Cesare <silvio.cesare@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Volatile file's data will be updated oftenly, so it'd better to place
its data into hot data segment.
In addition, for atomic file, we change to check FI_ATOMIC_FILE instead
of FI_HOT_DATA to make code readability better.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce release_discard_addr() to include common codes for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Fengguang Wu: declare static function, reported by kbuild test robot]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In build_sit_entries(), if valid_blocks in SIT block is smaller than
valid_blocks in journal, for below calculation:
sbi->discard_blks += old_valid_blocks - se->valid_blocks;
There will be two times potential overflow:
- old_valid_blocks - se->valid_blocks will overflow, and be a very
large number.
- sbi->discard_blks += result will overflow again, comes out a correct
result accidently.
Anyway, it should be fixed.
Fixes: d600af236d ("f2fs: avoid unneeded loop in build_sit_entries")
Fixes: 1f43e2ad7b ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
RW semphore dio_rwsem in struct f2fs_inode_info is introduced to avoid
race between dio and data gc, but now, it is more wildly used to avoid
foreground operation vs data gc. So rename it to i_gc_rwsem to improve
its readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch move mnt_want_write_file after range check,
it's needless to check arguments with it.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fix missing clear FI_NO_PREALLOC in some error case
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is to give a option for user to be able to recover B/foo in the below
case.
mkdir A
sync()
rename(A, B)
creat (B/foo)
fsync (B/foo)
---crash---
Sugessted-by: Velayudhan Pillai <vijay@cs.utexas.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The file is called "features", not "feature".
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves error handling from commit_inmem_pages() into
__commit_inmem_page() for cleanup, no logic change.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Only dir may have F2FS_INLINE_DOTS flag, so there is no need to check
the flag in recover flow.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes duplicated dquot_initialize in recover_orphan_inode(),
and fix the error handling if dquot_initialize fails.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
dquot_initialize() can fail due to any exception inside quota subsystem,
f2fs needs to be aware of it, and return correct return value to caller.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
v4->v5: move data corruption check to __submit_discard_cmd, in order to
control discard io submitted more accurately, besides, increase async
thread wait time if data corruption detected.
This patch stop async thread and umount process to issue discard
if something wrong with f2fs, which is similar to fstrim.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_ioc_commit_atomic_write, if file is volatile, return -EINVAL to
indicate that commit failure.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a file not set type as hot, has dirty pages more than
threshold 64 before starting atomic write, may be lose hot
flag.
v1->v2: move set FI_ATOMIC_FILE flag behind flush dirty pages too,
in case of dirty pages before starting atomic use atomic mode to
write back.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
`cur' will never be NULL, we should check inmem_pages list instead.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thread GC thread
- f2fs_ioc_start_atomic_write
- get_dirty_pages
- filemap_write_and_wait_range
- f2fs_gc
- do_garbage_collect
- gc_data_segment
- move_data_page
- f2fs_is_atomic_file
- set_page_dirty
- set_inode_flag(, FI_ATOMIC_FILE)
Dirty data page can still be generated by GC in race condition as
above call stack.
This patch adds fi->dio_rwsem[WRITE] in f2fs_ioc_start_atomic_write
to avoid such race.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use new return type vm_fault_t for page_mkwrite
and fault handler.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the structure of f2fs_inode, i_extra_size's type is __le16,
so we should keep type consistent when using it.
Fixes: 704956ecf5 ("f2fs: support inode checksum")
Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should check valid_map_mir and block count to ensure
the flushed raw_sit is correct.
Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Correct return value in two cases:
- return EINVAL if end boundary is out-of-range.
- return EIO if fs needs off-line check.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to show missing encrypt/inline_data flag in
FS_IOC_GETFLAGS like ext4 does.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now F2FS_FL_USER_VISIBLE and F2FS_FL_USER_MODIFIABLE has included
F2FS_PROJINHERIT_FL, so remove unneeded F2FS_PROJINHERIT_FL when
using visible/modifiable flag macro.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Related to https://lkml.org/lkml/2018/4/8/661
Sometimes, we need to write meta data to new allocated block address,
then we will allocate a zeroed page in inner inode's address space, and
fill partial data in it, and leave other place with zero value which means
some fields are initial status.
There are two inner inodes (meta inode and node inode) setting __GFP_ZERO,
I have just checked them, for both of them, we can avoid using __GFP_ZERO,
and do initialization by ourselves to avoid unneeded/redundant zeroing
from mm.
Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch modify max_requests to UINT_MAX, to issue
all big range discards in umount.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For buffered IO, we don't need to use block plug to cache bio,
for direct IO, generic f2fs_direct_IO has already added block
plug, so let's remove redundant one in .write_iter.
As Yunlei described in his patch:
-f2fs_file_write_iter
-blk_start_plug
-__generic_file_write_iter
...
-do_blockdev_direct_IO
-blk_start_plug
...
-blk_finish_plug
...
-blk_finish_plug
which may conduct performance decrease in our platform
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since the layout of regular dentry block is different from inline dentry
block, zero_user_segment starting from MAX_INLINE_DATA(dir) is not
correct for regular dentry block, besides, bitmap is already copied and
used, so there is no necessary to zero page at all, so just remove the
zero_user_segment is OK.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we use generic FS_*_FL defined by vfs to indicate inode status
for each bit of i_flags, so f2fs's flag status definition is tied to vfs'
one, it will be hard for f2fs to reuse bits f2fs never used to indicate
new status..
In order to solve this issue, we introduce private inode status mapping,
Note, for these bits have already been persisted into disk, we should
never change their definition, for other ones, we can remap them for
later new coming status.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to wait for whole bunch of discard candidates in fstrim, since
runtime discard will issue them in idle time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to avoid interfering normal r/w IO, let's turn down IO
priority of discard issued from background.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, we issue discard asynchronously in separated thread instead of in
checkpoint, after that, we won't encounter long latency in checkpoint
due to huge number of synchronous discard command handling, so, we don't
need to split checkpoint to do trim in batch, merge it and obsolete
related sysfs entry.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the high utilization like over 80%, we don't expect huge # of large discard
commands, but do many small pending discards which affects FTL GCs a lot.
Let's issue them in that case.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For non-atomic files, this patch adds an option to give nobarrier which
doesn't issue flush commands to the device.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The fstrim gathers huge number of large discard commands, and tries to issue
without IO awareness, which results in long user-perceive IO latencies on
READ, WRITE, and FLUSH in UFS. We've observed some of commands take several
seconds due to long discard latency.
This patch limits the maximum size to 2MB per candidate, and check IO congestion
when issuing them to disk.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
pageout() in MM traslates EAGAIN, so calls handle_write_error()
-> mapping_set_error() -> set_bit(AS_EIO, ...).
file_write_and_wait_range() will see EIO error, which is critical
to return value of fsync() followed by atomic_write failure to user.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch clears PageError in some pages tagged by read path, but when we
write the pages with valid contents, writepage should clear the bit likewise
ext4.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes the rule to check cap_resource for data blocks, not inode
or node blocks in order to avoid selinux denial.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch reverts copied f2fs_set_page_dirty_nobuffer to use generic function
for stability.
This reverts commit fe76b796fc.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
xfstest generic/429 sometimes hangs on f2fs, caused by a thread being
unable to take a directory's i_rwsem for write in vfs_rmdir(). In the
test, one thread repeatedly creates and removes a directory, and other
threads repeatedly look up a file in the directory. The bug is that
f2fs_mkdir() calls d_instantiate() before unlock_new_inode(), resulting
in the directory inode being exposed to lookups before it has been fully
initialized. And with CONFIG_DEBUG_LOCK_ALLOC, unlock_new_inode()
reinitializes ->i_rwsem, corrupting its state when it is already held.
Fix it by calling unlock_new_inode() before d_instantiate(). This
matches what other filesystems do.
Fixes: 57397d86c6 ("f2fs: add inode operations for special inodes")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently f2fs's ->readpage() and ->readpages() assume that either the
data undergoes no postprocessing, or decryption only. But with
fs-verity, there will be an additional authenticity verification step,
and it may be needed either by itself, or combined with decryption.
To support this, store a 'struct bio_post_read_ctx' in ->bi_private
which contains a work struct, a bitmask of postprocessing steps that are
enabled, and an indicator of the current step. The bio completion
routine, if there was no I/O error, enqueues the first postprocessing
step. When that completes, it continues to the next step. Pages that
fail any postprocessing step have PageError set. Once all steps have
completed, pages without PageError set are set Uptodate, and all pages
are unlocked.
Also replace f2fs_encrypted_file() with a new function
f2fs_post_read_required() in places like direct I/O and garbage
collection that really should be testing whether the file needs special
I/O processing, not whether it is encrypted specifically.
This may also be useful for other future f2fs features such as
compression.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, fscrypt provides fscrypt_decrypt_bio_pages() which decrypts a
bio's pages asynchronously, then unlocks them afterwards. But, this
assumes that decryption is the last "postprocessing step" for the bio,
so it's incompatible with additional postprocessing steps such as
authenticity verification after decryption.
Therefore, rename the existing fscrypt_decrypt_bio_pages() to
fscrypt_enqueue_decrypt_bio(). Then, add fscrypt_decrypt_bio() which
decrypts the pages in the bio synchronously without unlocking the pages,
nor setting them Uptodate; and add fscrypt_enqueue_decrypt_work(), which
enqueues work on the fscrypt_read_workqueue. The new functions will be
used by filesystems that support both fscrypt and fs-verity.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull hexagon fixes from Richard Kuo:
"Some small fixes for module compilation"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernel:
hexagon: export csum_partial_copy_nocheck
hexagon: add memset_io() helper
This is needed to link ipv6 as a loadable module, which in turn happens
in allmodconfig.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
We already have memcpy_toio(), but not memset_io(), so let's
add the obvious version to allow building an allmodconfig kernel
without errors like
drivers/gpu/drm/ttm/ttm_bo_util.c: In function 'ttm_bo_move_memcpy':
drivers/gpu/drm/ttm/ttm_bo_util.c:390:3: error: implicit declaration of function 'memset_io' [-Werror=implicit-function-declaration]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
- Enhance inode fork verifiers to prevent loading of corrupted metadata.
- Fix a crash when we try to convert extents format inodes to btree
format, we run out of space, but forget to revert the in-core state
changes.
- Fix file size checks when doing INSERT_RANGE that could cause files
to end up negative size if there previously was an extent mapped at
s_maxbytes.
- Fix a bug when doing a remove-then-add ATTR_REPLACE xattr update where
we forget to clear ATTR_REPLACE after the remove, which causes the
attr to be lost and the fs to shut down due to (what it thinks is)
inconsistent in-core state.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJa2gs1AAoJEPh/dxk0SrTru50P+wZuY2DRfBSCH20Oq3qTbO9M
dQGBnOIqRUC+9IDkpRExB4BYVZoK9/v3fnW3zttRMlEmTQfHtMsoEUXpByoETY5b
/bTZh+c7fMhj7eNZspfc+8xsHvq0k3hSxrITe4zjL2rSy72KUrsYtDoIu2UvXyZK
nJsqCiyFOdFgMi6IBRLOAVBPzs/q8sIBfl/axjyvokLL/6ki/TfvCAtLkdT4FRIt
UHzE8ly/Z99honciPQW4axZ9TobAVd6g2d11XJpbhku3ijTL/vHyflRCFgmM/2T3
VyJ3tTH/w3rCAXOEEj3H8TvKAlHiKpB+g9VwhsTZrP0B14/ljkClm/yEFnCOLmb1
/26t+A++fkqy71PoqeQvXGJGEAxC/1TGo7dxq6Gn5SESc1yr8CjXXMiWTMBBWgfF
xIsBNA6Ok0F5+OkEhfSUtfIBziPeUIjbmNGTnMV/EiL9stpHwrQS+JLAK5+POHvQ
ZoH5RTY69mK5rJtzN6USGyPe4J0f+S7YTf9fjKTcjjpWHLjJYEYJcGvQb3ynZSV+
5Wu1TqaUkl/Mp3mkZ1KFMq+MXBgOnarOZug80fbhkv3Vyzbelzpebuc5fMJnzofp
0x6BQUC+bwOxB2KJX8TC+9ajen8O+oxrE2RyehVtCZL0O830NvZ4hQg3yxjKUbwv
8Y0Mq6Q2CQEmsBJkIHH2
=lFf0
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are a few more bug fixes for xfs for 4.17-rc4. Most of them are
fixes for bad behavior.
This series has been run through a full xfstests run during LSF and
through a quick xfstests run against this morning's master, with no
major failures reported.
Summary:
- Enhance inode fork verifiers to prevent loading of corrupted
metadata.
- Fix a crash when we try to convert extents format inodes to btree
format, we run out of space, but forget to revert the in-core state
changes.
- Fix file size checks when doing INSERT_RANGE that could cause files
to end up negative size if there previously was an extent mapped at
s_maxbytes.
- Fix a bug when doing a remove-then-add ATTR_REPLACE xattr update
where we forget to clear ATTR_REPLACE after the remove, which
causes the attr to be lost and the fs to shut down due to (what it
thinks is) inconsistent in-core state"
* tag 'xfs-4.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE
xfs: prevent creating negative-sized file via INSERT_RANGE
xfs: set format back to extents if xfs_bmap_extents_to_btree
xfs: enhance dinode verifier