Commit Graph

67697 Commits

Author SHA1 Message Date
Josef Bacik
5aa2717b6b btrfs: fix error handling in commit_fs_roots
[ Upstream commit 4f4317c13a40194940acf4a71670179c4faca2b5 ]

While doing error injection I would sometimes get a corrupt file system.
This is because I was injecting errors at btrfs_search_slot, but would
only do it one time per stack.  This uncovered a problem in
commit_fs_roots, where if we get an error we would just break.  However
we're in a nested loop, the first loop being a loop to find all the
dirty fs roots, and then subsequent root updates would succeed clearing
the error value.

This isn't likely to happen in real scenarios, however we could
potentially get a random ENOMEM once and then not again, and we'd end up
with a corrupted file system.  Fix this by moving the error checking
around a bit to the main loop, as this is the only place where something
will fail, and return the error as soon as it occurs.

With this patch my reproducer no longer corrupts the file system.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07 12:34:13 +01:00
Chao Yu
0d2d6857db f2fs: fix to set/clear I_LINKABLE under i_lock
[ Upstream commit 46085f37fc9e12d5c3539fb768b5ad7951e72acf ]

fsstress + fault injection test case reports a warning message as
below:

WARNING: CPU: 13 PID: 6226 at fs/inode.c:361 inc_nlink+0x32/0x40
Call Trace:
 f2fs_init_inode_metadata+0x25c/0x4a0 [f2fs]
 f2fs_add_inline_entry+0x153/0x3b0 [f2fs]
 f2fs_add_dentry+0x75/0x80 [f2fs]
 f2fs_do_add_link+0x108/0x160 [f2fs]
 f2fs_rename2+0x6ab/0x14f0 [f2fs]
 vfs_rename+0x70c/0x940
 do_renameat2+0x4d8/0x4f0
 __x64_sys_renameat2+0x4b/0x60
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Following race case can cause this:
Thread A				Kworker
- f2fs_rename
 - f2fs_create_whiteout
  - __f2fs_tmpfile
   - f2fs_i_links_write
    - f2fs_mark_inode_dirty_sync
     - mark_inode_dirty_sync
					- writeback_single_inode
					 - __writeback_single_inode
					  - spin_lock(&inode->i_lock)
   - inode->i_state |= I_LINKABLE
					  - inode->i_state &= ~dirty
					  - spin_unlock(&inode->i_lock)
 - f2fs_add_link
  - f2fs_do_add_link
   - f2fs_add_dentry
    - f2fs_add_inline_entry
     - f2fs_init_inode_metadata
      - f2fs_i_links_write
       - inc_nlink
        - WARN_ON(!(inode->i_state & I_LINKABLE))

Fix to add i_lock to avoid i_state update race condition.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07 12:34:12 +01:00
Jaegeuk Kim
c86df2b84b f2fs: handle unallocated section and zone on pinned/atgc
[ Upstream commit 632faca72938f9f63049e48a8c438913828ac7a9 ]

If we have large section/zone, unallocated segment makes them corrupted.

E.g.,

  - Pinned file:       -1 119304647 119304647
  - ATGC   data:       -1 119304647 119304647

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07 12:34:12 +01:00
Jens Axboe
01fd84a436 fs: make unlazy_walk() error handling consistent
[ Upstream commit e36cffed20a324e116f329a94061ae30dd26fb51 ]

Most callers check for non-zero return, and assume it's -ECHILD (which
it always will be). One caller uses the actual error return. Clean this
up and make it fully consistent, by having unlazy_walk() return a bool
instead. Rename it to try_to_unlazy() and return true on success, and
failure on error. That's easier to read.

No functional changes in this patch.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07 12:34:11 +01:00
Yumei Huang
c57ba68e73 xfs: Fix assert failure in xfs_setattr_size()
commit 88a9e03beef22cc5fabea344f54b9a0dfe63de08 upstream.

An assert failure is triggered by syzkaller test due to
ATTR_KILL_PRIV is not cleared before xfs_setattr_size.
As ATTR_KILL_PRIV is not checked/used by xfs_setattr_size,
just remove it from the assert.

Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-07 12:34:05 +01:00
Gao Xiang
5e0068a4fb erofs: fix shift-out-of-bounds of blkszbits
commit bde545295b710bdd13a0fcd4b9fddd2383eeeb3a upstream.

syzbot generated a crafted bitszbits which can be shifted
out-of-bounds[1]. So directly print unsupported blkszbits
instead of blksize.

[1] https://lore.kernel.org/r/000000000000c72ddd05b9444d2f@google.com

Link: https://lore.kernel.org/r/20210120013016.14071-1-hsiangkao@aol.com
Reported-by: syzbot+c68f467cd7c45860e8d4@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-07 12:34:05 +01:00
Randy Dunlap
40f6090d6e JFS: more checks for invalid superblock
commit 3bef198f1b17d1bb89260bad947ef084c0a2d1a6 upstream.

syzbot is feeding invalid superblock data to JFS for mount testing.
JFS does not check several of the fields -- just assumes that they
are good since the JFS_MAGIC and version fields are good.

In this case (syzbot reproducer), we have s_l2bsize == 0xda0c,
pad == 0xf045, and s_state == 0x50, all of which are invalid IMO.
Having s_l2bsize == 0xda0c causes this UBSAN warning:
  UBSAN: shift-out-of-bounds in fs/jfs/jfs_mount.c:373:25
  shift exponent -9716 is negative

s_l2bsize can be tested for correctness. pad can be tested for non-0
and punted. s_state can be tested for its valid values and punted.

Do those 3 tests and if any of them fails, report the superblock as
invalid/corrupt and let fsck handle it.

With this patch, chkSuper() says this when JFS_DEBUG is enabled:
  jfs_mount: Mount Failure: superblock is corrupt!
  Mount JFS Failure: -22
  jfs_mount failed w/return code = -22

The obvious problem with this method is that next week there could
be another syzbot test that uses different fields for invalid values,
this making this like a game of whack-a-mole.

syzkaller link: https://syzkaller.appspot.com/bug?extid=36315852ece4132ec193

Reported-by: syzbot+36315852ece4132ec193@syzkaller.appspotmail.com
Reported-by: kernel test robot <lkp@intel.com> # v2
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: jfs-discussion@lists.sourceforge.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-07 12:34:04 +01:00
Andreas Gruenbacher
eb8128c5bb gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end
commit 7009fa9cd9a5262944b30eb7efb1f0561d074b68 upstream.

When starting an iomap write, gfs2_quota_lock_check -> gfs2_quota_lock
-> gfs2_quota_hold is called from gfs2_iomap_begin.  At the end of the
write, before unlocking the quotas, punch_hole -> gfs2_quota_hold can be
called again in gfs2_iomap_end, which is incorrect and leads to a failed
assertion.  Instead, move the call to gfs2_quota_unlock before the call
to punch_hole to fix that.

Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:44 +01:00
Andreas Gruenbacher
a646a3164b gfs2: Lock imbalance on error path in gfs2_recover_one
commit 834ec3e1ee65029029225a86c12337a6cd385af7 upstream.

In gfs2_recover_one, fix a sd_log_flush_lock imbalance when a recovery
pass fails.

Fixes: c9ebc4b737 ("gfs2: allow journal replay to hold sd_log_flush_lock")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:44 +01:00
Bob Peterson
42fd500353 gfs2: Don't skip dlm unlock if glock has an lvb
commit 78178ca844f0eb88f21f31c7fde969384be4c901 upstream.

Patch fb6791d100 was designed to allow gfs2 to unmount quicker by
skipping the step where it tells dlm to unlock glocks in EX with lvbs.
This was done because when gfs2 unmounts a file system, it destroys the
dlm lockspace shortly after it destroys the glocks so it doesn't need to
unlock them all: the unlock is implied when the lockspace is destroyed
by dlm.

However, that patch introduced a use-after-free in dlm: as part of its
normal dlm_recoverd process, it can call ls_recovery to recover dead
locks. In so doing, it can call recover_rsbs which calls recover_lvb for
any mastered rsbs. Func recover_lvb runs through the list of lkbs queued
to the given rsb (if the glock is cached but unlocked, it will still be
queued to the lkb, but in NL--Unlocked--mode) and if it has an lvb,
copies it to the rsb, thus trying to preserve the lkb. However, when
gfs2 skips the dlm unlock step, it frees the glock and its lvb, which
means dlm's function recover_lvb references the now freed lvb pointer,
copying the freed lvb memory to the rsb.

This patch changes the check in gdlm_put_lock so that it calls
dlm_unlock for all glocks that contain an lvb pointer.

Fixes: fb6791d100 ("GFS2: skip dlm_unlock calls in unmount")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:43 +01:00
Bob Peterson
fc82ab4bb5 gfs2: fix glock confusion in function signal_our_withdraw
commit f5f02fde9f52b2d769c1c2ddfd3d9c4a1fe739a7 upstream.

If go_free is defined, function signal_our_withdraw is supposed to
synchronize on the GLF_FREEING flag of the inode glock, but it
accidentally does that on the live glock. Fix that and disambiguate
the glock variables.

Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:43 +01:00
Jaegeuk Kim
f98be16898 f2fs: flush data when enabling checkpoint back
commit b0ff4fe746fd028eef920ddc8c7b0361c1ede6ec upstream.

During checkpoint=disable period, f2fs bypasses all the synchronous IOs such as
sync and fsync. So, when enabling it back, we must flush all of them in order
to keep the data persistent. Otherwise, suddern power-cut right after enabling
checkpoint will cause data loss.

Fixes: 4354994f09 ("f2fs: checkpoint disabling")
Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:43 +01:00
Chao Yu
04a495780f f2fs: enforce the immutable flag on open files
commit e0fcd01510ad025c9bbce704c5c2579294056141 upstream.

This patch ports commit 02b016ca7f ("ext4: enforce the immutable
flag on open files") to f2fs.

According to the chattr man page, "a file with the 'i' attribute
cannot be modified..."  Historically, this was only enforced when the
file was opened, per the rest of the description, "... and the file
can not be opened in write mode".

There is general agreement that we should standardize all file systems
to prevent modifications even for files that were opened at the time
the immutable flag is set.  Eventually, a change to enforce this at
the VFS layer should be landing in mainline.

Cc: stable@kernel.org
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:43 +01:00
Chao Yu
e391239dcd f2fs: fix out-of-repair __setattr_copy()
commit 2562515f0ad7342bde6456602c491b64c63fe950 upstream.

__setattr_copy() was copied from setattr_copy() in fs/attr.c, there is
two missing patches doesn't cover this inner function, fix it.

Commit 7fa294c899 ("userns: Allow chown and setgid preservation")
Commit 23adbe12ef ("fs,userns: Change inode_capable to capable_wrt_inode_uidgid")

Fixes: fbfa2cc58d ("f2fs: add file operations")
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:42 +01:00
Jens Axboe
13fb0e1ecf proc: don't allow async path resolution of /proc/thread-self components
commit 0d4370cfe36b7f1719123b621a4ec4d9c7a25f89 upstream.

If this is attempted by an io-wq kthread, then return -EOPNOTSUPP as we
don't currently support that. Once we can get task_pid_ptr() doing the
right thing, then this can go away again.

Use PF_IO_WORKER for this to speciically target the io_uring workers.
Modify the /proc/self/ check to use PF_IO_WORKER as well.

Cc: stable@vger.kernel.org
Fixes: 8d4c3e76e3 ("proc: don't allow async path resolution of /proc/self components")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:42 +01:00
Chris Wilson
1ea3602095 kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE
commit bfe3911a91047557eb0e620f95a370aee6a248c7 upstream.

Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fd (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.

Rasmus Villemoes also pointed out that systemd uses SYS_kcmp to
deduplicate the per-service file descriptor store.

Note that some distributions such as Ubuntu are already enabling
CHECKPOINT_RESTORE in their configs and so, by extension, SYS_kcmp.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/3046
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: stable@vger.kernel.org
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> # DRM depends on kcmp
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> # systemd uses kcmp
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210205220012.1983-1-chris@chris-wilson.co.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:41 +01:00
Shin'ichiro Kawasaki
834c7ec6af zonefs: Fix file size of zones in full condition
commit 059c01039c0185dbee7ed080f1f2bd22cb1e4dab upstream.

Per ZBC/ZAC/ZNS specifications, write pointers may not have valid values
when zones are in full condition. However, when zonefs mounts a zoned
block device, zonefs refers write pointers to set file size even when
the zones are in full condition. This results in wrong file size. To fix
this, refer maximum file size in place of write pointers for zones in
full condition.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Cc: <stable@vger.kernel.org> # 5.6+
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:41 +01:00
Namjae Jeon
4e6e00704f exfat: fix shift-out-of-bounds in exfat_fill_super()
commit 78c276f5495aa53a8beebb627e5bf6a54f0af34f upstream.

syzbot reported a warning which could cause shift-out-of-bounds issue.

Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 exfat_read_boot_sector fs/exfat/super.c:471 [inline]
 __exfat_fill_super fs/exfat/super.c:556 [inline]
 exfat_fill_super+0x2acb/0x2d00 fs/exfat/super.c:624
 get_tree_bdev+0x406/0x630 fs/super.c:1291
 vfs_get_tree+0x86/0x270 fs/super.c:1496
 do_new_mount fs/namespace.c:2881 [inline]
 path_mount+0x1937/0x2c50 fs/namespace.c:3211
 do_mount fs/namespace.c:3224 [inline]
 __do_sys_mount fs/namespace.c:3432 [inline]
 __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3409
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

exfat specification describe sect_per_clus_bits field of boot sector
could be at most 25 - sect_size_bits and at least 0. And sect_size_bits
can also affect this calculation, It also needs validation.
This patch add validation for sect_per_clus_bits and sect_size_bits
field of boot sector.

Fixes: 719c1e1829 ("exfat: add super block operations")
Cc: stable@vger.kernel.org # v5.9+
Reported-by: syzbot+da4fe66aaadd3c2e2d1c@syzkaller.appspotmail.com
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:41 +01:00
Pan Bian
885a2d24c2 fs/affs: release old buffer head on error path
commit 70779b897395b330ba5a47bed84f94178da599f9 upstream.

The reference count of the old buffer head should be decremented on path
that fails to get the new buffer head.

Fixes: 6b4657667b ("fs/affs: add rename exchange")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:37 +01:00
Jiri Bohac
edadcf211a pstore: Fix typo in compression option name
commit 19d8e9149c27b689c6224f5c84b96a159342195a upstream.

Both pstore_compress() and decompress_record() use a mistyped config
option name ("PSTORE_COMPRESSION" instead of "PSTORE_COMPRESS"). As
a result compression and decompression of pstore records was always
disabled.

Use the correct config option name.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: fd49e03280 ("pstore: Fix linking when crypto API disabled")
Acked-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210218111547.johvp5klpv3xrpnn@dwarf.suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:33 +01:00
Filipe Manana
de3ea5be51 btrfs: fix extent buffer leak on failure to copy root
commit 72c9925f87c8b74f36f8e75a4cd93d964538d3ca upstream.

At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().

So fix that by unlocking the extent buffer and dropping our reference on
it before returning.

Fixes: be20aa9dba ("Btrfs: Add mount option to turn off data cow")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:30 +01:00
Josef Bacik
9a739917ef btrfs: account for new extents being deleted in total_bytes_pinned
commit 81e75ac74ecba929d1e922bf93f9fc467232e39f upstream.

My recent patch set "A variety of lock contention fixes", found here

https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/
(Tracked in https://github.com/btrfs/linux/issues/86)

that reduce lock contention on the extent root by running delayed refs
less often resulted in a regression in generic/371.  This test
fallocate()'s the fs until it's full, deletes all the files, and then
tries to fallocate() until full again.

Before these patches we would run all of the delayed refs during
flushing, and then would commit the transaction because we had plenty of
pinned space to recover in order to allocate.  However my patches made
it so we weren't running the delayed refs as aggressively, which meant
that we appeared to have less pinned space when we were deciding to
commit the transaction.

We use the space_info->total_bytes_pinned to approximate how much space
we have pinned.  It's approximate because if we remove a reference to an
extent we may free it, but there may be more references to it than we
know of at that point, but we account it as pinned at the creation time,
and then it's properly accounted when the delayed ref runs.

The way we account for pinned space is if the
delayed_ref_head->total_ref_mod is < 0, because that is clearly a
freeing option.  However there is another case, and that is where
->total_ref_mod == 0 && ->must_insert_reserved == 1.

When we allocate a new extent, we have ->total_ref_mod == 1 and we have
->must_insert_reserved == 1.  This is used to indicate that it is a
brand new extent and will need to have its extent entry added before we
modify any references on the delayed ref head.  But if we subsequently
remove that extent reference, our ->total_ref_mod will be 0, and that
space will be pinned and freed.  Accounting for this case properly
allows for generic/371 to pass with my delayed refs patches applied.

It's important to note that this problem exists without the referenced
patches, it just was uncovered by them.

CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:30 +01:00
Josef Bacik
7ec1536e80 btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself
commit 2187374f35fe9cadbddaa9fcf0c4121365d914e8 upstream.

Currently we pass things around to figure out if we maybe freeing data
based on the state of the delayed refs head.  This makes the accounting
sort of confusing and hard to follow, as it's distinctly separate from
the delayed ref heads stuff, but also depends on it entirely.

Fix this by explicitly adjusting the space_info->total_bytes_pinned in
the delayed refs code.  We now have two places where we modify this
counter, once where we create the delayed and destroy the delayed refs,
and once when we pin and unpin the extents.  This means there is a
slight overlap between delayed refs and the pin/unpin mechanisms, but
this is simply used by the ENOSPC infrastructure to determine if we need
to commit the transaction, so there's no adverse affect from this, we
might simply commit thinking it will give us enough space when it might
not.

CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:30 +01:00
Josef Bacik
acaeedb193 btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
commit 938fcbfb0cbcf532a1869efab58e6009446b1ced upstream.

While doing error injection testing with my relocation patches I hit the
following assert:

  assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3357!
  invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 24351 Comm: umount Tainted: G        W         5.10.0-rc3+ #193
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:assertfail.constprop.0+0x18/0x1a
  RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282
  RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000
  RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050
  RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000
  R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100
  R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100
  FS:  00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0
  Call Trace:
   btrfs_free_block_groups.cold+0x55/0x55
   close_ctree+0x2c5/0x306
   ? fsnotify_destroy_marks+0x14/0x100
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20
   deactivate_locked_super+0x36/0xa0
   cleanup_mnt+0x12d/0x190
   task_work_run+0x5c/0xa0
   exit_to_user_mode_prepare+0x1b1/0x1d0
   syscall_exit_to_user_mode+0x54/0x280
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happened because I injected an error in btrfs_cow_block() while
running the dirty block groups.  When we run the dirty block groups, we
splice the list onto a local list to process.  However if an error
occurs, we only cleanup the transactions dirty block group list, not any
pending block groups we have on our locally spliced list.

In fact if we fail to allocate a path in this function we'll also fail
to clean up the splice list.

Fix this by splicing the list back onto the transaction dirty block
group list so that the block groups are cleaned up.  Then add a 'out'
label and have the error conditions jump to out so that the errors are
handled properly.  This also has the side-effect of fixing a problem
where we would clear 'ret' on error because we unconditionally ran
btrfs_run_delayed_refs().

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:30 +01:00
Josef Bacik
c717ca57a4 btrfs: fix reloc root leak with 0 ref reloc roots on recovery
commit c78a10aebb275c38d0cfccae129a803fe622e305 upstream.

When recovering a relocation, if we run into a reloc root that has 0
refs we simply add it to the reloc_control->reloc_roots list, and then
clean it up later.  The problem with this is __del_reloc_root() doesn't
do anything if the root isn't in the radix tree, which in this case it
won't be because we never call __add_reloc_root() on the reloc_root.

This exit condition simply isn't correct really.  During normal
operation we can remove ourselves from the rb tree and then we're meant
to clean up later at merge_reloc_roots() time, and this happens
correctly.  During recovery we're depending on free_reloc_roots() to
drop our references, but we're short-circuiting.

Fix this by continuing to check if we're on the list and dropping
ourselves from the reloc_control root list and dropping our reference
appropriately.  Change the corresponding BUG_ON() to an ASSERT() that
does the correct thing if we aren't in the rb tree.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:30 +01:00
Josef Bacik
4d3edf72d6 btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
commit 867ed321f90d06aaba84e2c91de51cd3038825ef upstream.

While testing my error handling patches, I added a error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing.  However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace:

btrfs_inc_extent_ref
  btrfs_copy_root
    create_reloc_root
      btrfs_init_reloc_root
	btrfs_record_root_in_trans
	  btrfs_start_transaction
	    btrfs_update_inode
	      btrfs_update_time
		touch_atime
		  file_accessed
		    btrfs_file_mmap

This is because we do not catch the error from btrfs_inc_extent_ref,
which in practice would be ENOMEM, which means we lose the extent
references for a root that has already been allocated and inserted,
which is the problem.  Fix this by aborting the transaction if we fail
to do the reference modification.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:29 +01:00
Josef Bacik
a1a5cc2548 btrfs: add asserts for deleting backref cache nodes
commit eddda68d97732ce05ca145f8e85e8a447f65cdad upstream.

A weird KASAN problem that Zygo reported could have been easily caught
if we checked for basic things in our backref freeing code.  We have two
methods of freeing a backref node

- btrfs_backref_free_node: this just is kfree() essentially.
- btrfs_backref_drop_node: this actually unlinks the node and cleans up
  everything and then calls btrfs_backref_free_node().

We should mostly be using btrfs_backref_drop_node(), to make sure the
node is properly unlinked from the backref cache, and only use
btrfs_backref_free_node() when we know the node isn't actually linked to
the backref cache.  We made a mistake here and thus got the KASAN splat.

Make this style of issue easier to find by adding some ASSERT()'s to
btrfs_backref_free_node() and adjusting our deletion stuff to properly
init the list so we can rely on list_empty() checks working properly.

  BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
  Read of size 8 at addr ffff888112402950 by task btrfs/28836

  CPU: 0 PID: 28836 Comm: btrfs Tainted: G        W         5.10.0-e35f27394290-for-next+ #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  Call Trace:
   dump_stack+0xbc/0xf9
   ? btrfs_backref_cleanup_node+0x18a/0x420
   print_address_description.constprop.8+0x21/0x210
   ? record_print_text.cold.34+0x11/0x11
   ? btrfs_backref_cleanup_node+0x18a/0x420
   ? btrfs_backref_cleanup_node+0x18a/0x420
   kasan_report.cold.10+0x20/0x37
   ? btrfs_backref_cleanup_node+0x18a/0x420
   __asan_load8+0x69/0x90
   btrfs_backref_cleanup_node+0x18a/0x420
   btrfs_backref_release_cache+0x83/0x1b0
   relocate_block_group+0x394/0x780
   ? merge_reloc_roots+0x4a0/0x4a0
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   ? check_flags.part.50+0x6c/0x1e0
   ? btrfs_relocate_chunk+0x120/0x120
   ? kmem_cache_alloc_trace+0xa06/0xcb0
   ? _copy_from_user+0x83/0xc0
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f4c4bdfe427
  RSP: 002b:00007fff33ee6df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fff33ee6e98 RCX: 00007f4c4bdfe427
  RDX: 00007fff33ee6e98 RSI: 00000000c4009420 RDI: 0000000000000003
  RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
  R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000001
  R13: 0000000000000000 R14: 00007fff33ee8a34 R15: 0000000000000001

  Allocated by task 28836:
   kasan_save_stack+0x21/0x50
   __kasan_kmalloc.constprop.18+0xbe/0xd0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x410/0xcb0
   btrfs_backref_alloc_node+0x46/0xf0
   btrfs_backref_add_tree_node+0x60d/0x11d0
   build_backref_tree+0xc5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 28836:
   kasan_save_stack+0x21/0x50
   kasan_set_track+0x20/0x30
   kasan_set_free_info+0x1f/0x30
   __kasan_slab_free+0xf3/0x140
   kasan_slab_free+0xe/0x10
   kfree+0xde/0x200
   btrfs_backref_error_cleanup+0x452/0x530
   build_backref_tree+0x1a5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff888112402900
   which belongs to the cache kmalloc-128 of size 128
  The buggy address is located 80 bytes inside of
   128-byte region [ffff888112402900, ffff888112402980)
  The buggy address belongs to the page:
  page:0000000028b1cd08 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888131c810c0 pfn:0x112402
  flags: 0x17ffe0000000200(slab)
  raw: 017ffe0000000200 ffffea000424f308 ffffea0007d572c8 ffff888100040440
  raw: ffff888131c810c0 ffff888112402000 0000000100000009 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888112402800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff888112402880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  >ffff888112402900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                   ^
   ffff888112402980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   ffff888112402a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Link: https://lore.kernel.org/linux-btrfs/20201208194607.GI31381@hungrycats.org/
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:29 +01:00
Josef Bacik
52f93e5ee7 btrfs: do not warn if we can't find the reloc root when looking up backref
commit f78743fbdae1bb31bc9c9233c3590a5048782381 upstream.

The backref code is looking for a reloc_root that corresponds to the
given fs root.  However any number of things could have gone wrong while
initializing that reloc_root, like ENOMEM while trying to allocate the
root itself, or EIO while trying to write the root item.  This would
result in no corresponding reloc_root being in the reloc root cache, and
thus would return NULL when we do the find_reloc_root() call.

Because of this we do not want to WARN_ON().  This presumably was meant
to catch developer errors, cases where we messed up adding the reloc
root.  However we can easily hit this case with error injection, and
thus should not do a WARN_ON().

CC: stable@vger.kernel.org # 5.10+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:29 +01:00
Josef Bacik
02785bae77 btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node
commit 7e2a870a599d4699a626ec26430c7a1ab14a2a49 upstream.

Zygo reported the following panic when testing my error handling patches
for relocation:

  kernel BUG at fs/btrfs/backref.c:2545!
  invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G        W 14
  Hardware name: QEMU Standard PC (i440FX + PIIX,

  Call Trace:
   btrfs_backref_error_cleanup+0x4df/0x530
   build_backref_tree+0x1a5/0x700
   ? _raw_spin_unlock+0x22/0x30
   ? release_extent_buffer+0x225/0x280
   ? free_extent_buffer.part.52+0xd7/0x140
   relocate_tree_blocks+0x2a6/0xb60
   ? kasan_unpoison_shadow+0x35/0x50
   ? do_relocation+0xc10/0xc10
   ? kasan_kmalloc+0x9/0x10
   ? kmem_cache_alloc_trace+0x6a3/0xcb0
   ? free_extent_buffer.part.52+0xd7/0x140
   ? rb_insert_color+0x342/0x360
   ? add_tree_block.isra.36+0x236/0x2b0
   relocate_block_group+0x2eb/0x780
   ? merge_reloc_roots+0x470/0x470
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x18f0
   ? pvclock_clocksource_read+0xeb/0x190
   ? btrfs_relocate_chunk+0x120/0x120
   ? lock_contended+0x620/0x6e0
   ? do_raw_spin_lock+0x1e0/0x1e0
   ? do_raw_spin_unlock+0xa8/0x140
   btrfs_ioctl_balance+0x1f9/0x460
   btrfs_ioctl+0x24c8/0x4380
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This occurs because of this check

  if (RB_EMPTY_NODE(&upper->rb_node))
	  BUG_ON(!list_empty(&node->upper));

As we are dropping the backref node, if we discover that our upper node
in the edge we just cleaned up isn't linked into the cache that we are
now done with this node, thus the BUG_ON().

However this is an erroneous assumption, as we will look up all the
references for a node first, and then process the pending edges.  All of
the 'upper' nodes in our pending edges won't be in the cache's rb_tree
yet, because they haven't been processed.  We could very well have many
edges still left to cleanup on this node.

The fact is we simply do not need this check, we can just process all of
the edges only for this node, because below this check we do the
following

  if (list_empty(&upper->lower)) {
	  list_add_tail(&upper->lower, &cache->leaves);
	  upper->lowest = 1;
  }

If the upper node truly isn't used yet, then we add it to the
cache->leaves list to be cleaned up later.  If it is still used then the
last child node that has it linked into its node will add it to the
leaves list and then it will be cleaned up.

Fix this problem by dropping this logic altogether.  With this fix I no
longer see the panic when testing with error injection in the backref
code.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:29 +01:00
Gao Xiang
03c9bf033c erofs: initialized fields can only be observed after bit is set
commit ce063129181312f8781a047a50be439c5859747b upstream.

Currently, although set_bit() & test_bit() pairs are used as a fast-
path for initialized configurations. However, these atomic ops are
actually relaxed forms. Instead, load-acquire & store-release form is
needed to make sure uninitialized fields won't be observed in advance
here (yet no such corresponding bitops so use full barriers instead.)

Link: https://lore.kernel.org/r/20210209130618.15838-1-hsiangkao@aol.com
Fixes: 62dc45979f ("staging: erofs: fix race of initializing xattrs of a inode at the same time")
Fixes: 152a333a58 ("staging: erofs: add compacted compression indexes support")
Cc: <stable@vger.kernel.org> # 5.3+
Reported-by: Huang Jianan <huangjianan@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:28 +01:00
Josef Bacik
b7925acd82 proc: use kvzalloc for our kernel buffer
[ Upstream commit 4508943794efdd94171549c0bd52810e2f4ad9fe ]

Since

  sysctl: pass kernel pointers to ->proc_handler

we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace.  The problem is this
just blindly kzalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib.  Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order, so
use kvzalloc for these buffers.

[willy@infradead.org: changelog tweaks]

Link: https://lkml.kernel.org/r/6345270a2c1160b89dd5e6715461f388176899d1.1612972413.git.josef@toxicpanda.com
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
CC: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:21 +01:00
Trond Myklebust
9468ab8a27 NFSv4: Fixes for nfs4_bitmask_adjust()
[ Upstream commit 45901a231723a5a513ff08477983f3a274a6a910 ]

We don't want to ask for the ACL in a WRITE reply, since we don't have
a preallocated buffer.

Instead of checking NFS_INO_INVALID_ACCESS, which is really about
managing the access cache, we should look at the value of
NFS_INO_INVALID_OTHER. Also ensure we assign the mode, owner and
owner_group flags to the correct bit mask.

Finally, fix up the check for NFS_INO_INVALID_CTIME to retrieve the
ctime, and add a check for NFS_INO_INVALID_CHANGE.

Fixes: 76bd5c016e ("NFSv4: make cache consistency bitmask dynamic")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:20 +01:00
Dan Carpenter
b761fd2821 ocfs2: fix a use after free on error
[ Upstream commit c57d117f2b2f2a19b570c36f2819ef8d8210af20 ]

The error handling in this function frees "reg" but it is still on the
"o2hb_all_regions" list so it will lead to a use after freew.  Joseph Qi
points out that we need to clear the bit in the "o2hb_region_bitmap" as
well

Link: https://lkml.kernel.org/r/YBk4M6HUG8jB/jc7@mwanda
Fixes: 1cf257f511 ("ocfs2: fix memory leak")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:18 +01:00
Geert Uytterhoeven
d167a7367d ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
[ Upstream commit 302fdadeafe4be539f247abf25f61822e4a5a577 ]

EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
user may not want to enable.  Fix this by making the test depend on
EXT4_FS instead.

Fixes: 1cbeab1b24 ("ext4: add kunit test for decoding extended timestamps")
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20210122110234.2825685-1-geert@linux-m68k.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:15 +01:00
Theodore Ts'o
858d343c78 ext4: fix potential htree index checksum corruption
[ Upstream commit b5776e7524afbd4569978ff790864755c438bba7 ]

In the case where we need to do an interior node split, and
immediately afterwards, we are unable to allocate a new directory leaf
block due to ENOSPC, the directory index checksum's will not be filled
in correctly (and indeed, will not be correctly journalled).

This looks like a bug that was introduced when we added largedir
support.  The original code doesn't make any sense (and should have
been caught in code review), but it was hidden because most of the
time, the index node checksum will be set by do_split().  But if
do_split bails out due to ENOSPC, then ext4_handle_dirty_dx_node()
won't get called, and so the directory index checksum field will not
get set, leading to:

EXT4-fs error (device sdb): dx_probe:858: inode #6635543: block 4022: comm nfsd: Directory index failed checksum

Google-Bug-Id: 176345532
Fixes: e08ac99fa2 ("ext4: add largedir feature")
Cc: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:13 +01:00
Jeff Layton
95f432c0a3 ceph: fix flush_snap logic after putting caps
[ Upstream commit 64f36da5625f7f9853b86750eaa89d499d16a2e9 ]

A primary reason for skipping ceph_check_caps after putting the
references was to avoid the locking in ceph_check_caps during a
reconnect. __ceph_put_cap_refs can still call ceph_flush_snaps in that
case though, and that takes many of the same inconvenient locks.

Fix the logic in __ceph_put_cap_refs to skip flushing snaps when the
skip_checking_caps flag is set.

Fixes: e64f44a884 ("ceph: skip checking caps when session reconnecting and releasing reqs")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:08 +01:00
J. Bruce Fields
7c7cb07d4a nfsd: register pernet ops last, unregister first
[ Upstream commit bd5ae9288d6451bd346a1b4a59d4fe7e62ba29b7 ]

These pernet operations may depend on stuff set up or torn down in the
module init/exit functions.  And they may be called at any time in
between.  So it makes more sense for them to be the last to be
registered in the init function, and the first to be unregistered in the
exit function.

In particular, without this, the drc slab is being destroyed before all
the per-net drcs are shut down, resulting in an "Objects remaining in
nfsd_drc on __kmem_cache_shutdown()" warning in exit_nfsd.

Reported-by: Zhi Li <yieli@redhat.com>
Fixes: 3ba75830ce "nfsd4: drc containerization"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:07 +01:00
Pan Bian
ffc6e686f4 isofs: release buffer head before return
[ Upstream commit 0a6dc67a6aa45f19bd4ff89b4f468fc50c4b8daa ]

Release the buffer_head before returning error code in
do_isofs_readdir() and isofs_find_entry().

Fixes: 2deb1acc65 ("isofs: fix access to unallocated memory when reading corrupted filesystem")
Link: https://lore.kernel.org/r/20210118120455.118955-1-bianpan2016@163.com
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:00 +01:00
Jan Kara
8584d4f31a quota: Fix memory leak when handling corrupted quota file
[ Upstream commit a4db1072e1a3bd7a8d9c356e1902b13ac5deb8ef ]

When checking corrupted quota file we can bail out and leak allocated
info structure. Properly free info structure on error return.

Reported-by: syzbot+77779c9b52ab78154b08@syzkaller.appspotmail.com
Fixes: 11c514a99bb9 ("quota: Sanity-check quota file headers on load")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:53 +01:00
Wang ShaoBo
24386143cb ubifs: Fix error return code in alloc_wbufs()
[ Upstream commit 42119dbe571eb419dae99b81dd20fa42f47464e1 ]

Fix to return PTR_ERR() error code from the error handling case instead
fo 0 in function alloc_wbufs(), as done elsewhere in this function.

Fixes: 6a98bc4614 ("ubifs: Add authentication nodes to journal")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:51 +01:00
Arnd Bergmann
3818158df1 ubifs: replay: Fix high stack usage, again
[ Upstream commit 410b6de702ef84fea6e7abcb6620ef8bfc112fae ]

An earlier commit moved out some functions to not be inlined by gcc, but
after some other rework to remove one of those, clang started inlining
the other one and ran into the same problem as gcc did before:

fs/ubifs/replay.c:1174:5: error: stack frame size of 1152 bytes in function 'ubifs_replay_journal' [-Werror,-Wframe-larger-than=]

Mark the function as noinline_for_stack to ensure it doesn't happen
again.

Fixes: f80df38512 ("ubifs: use crypto_shash_tfm_digest()")
Fixes: eb66eff663 ("ubifs: replay: Fix high stack usage")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:51 +01:00
Dinghao Liu
bdb176a0c8 ubifs: Fix memleak in ubifs_init_authentication
[ Upstream commit 11b8ab3836454a2600e396f34731e491b661f9d5 ]

When crypto_shash_digestsize() fails, c->hmac_tfm
has not been freed before returning, which leads
to memleak.

Fixes: 49525e5eec ("ubifs: Add helper functions for authentication support")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:51 +01:00
Tom Rix
bdf9437459 jffs2: fix use after free in jffs2_sum_write_data()
[ Upstream commit 19646447ad3a680d2ab08c097585b7d96a66126b ]

clang static analysis reports this problem

fs/jffs2/summary.c:794:31: warning: Use of memory after it is freed
                c->summary->sum_list_head = temp->u.next;
                                            ^~~~~~~~~~~~

In jffs2_sum_write_data(), in a loop summary data is handles a node at
a time.  When it has written out the node it is removed the summary list,
and the node is deleted.  In the corner case when a
JFFS2_FEATURE_RWCOMPAT_COPY is seen, a call is made to
jffs2_sum_disable_collecting().  jffs2_sum_disable_collecting() deletes
the whole list which conflicts with the loop's deleting the list by parts.

To preserve the old behavior of stopping the write midway, bail out of
the loop after disabling summary collection.

Fixes: 6171586a7a ("[JFFS2] Correct handling of JFFS2_FEATURE_RWCOMPAT_COPY nodes.")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:51 +01:00
Colin Ian King
a6b56338a9 fs/jfs: fix potential integer overflow on shift of a int
[ Upstream commit 4208c398aae4c2290864ba15c3dab7111f32bec1 ]

The left shift of int 32 bit integer constant 1 is evaluated using 32 bit
arithmetic and then assigned to a signed 64 bit integer. In the case where
l2nb is 32 or more this can lead to an overflow.  Avoid this by shifting
the value 1LL instead.

Addresses-Coverity: ("Uninitentional integer overflow")
Fixes: b40c2e665c ("fs/jfs: TRIM support for JFS Filesystem")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:51 +01:00
Hao Xu
81dfee4731 io_uring: fix possible deadlock in io_uring_poll
[ Upstream commit ed670c3f90a67d9e16ab6d8893be6f072d79cd4c ]

Abaci reported follow issue:

[   30.615891] ======================================================
[   30.616648] WARNING: possible circular locking dependency detected
[   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
[   30.618035] ------------------------------------------------------
[   30.618914] a.out/1128 is trying to acquire lock:
[   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
[   30.620505]
[   30.620505] but task is already holding lock:
[   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.622349]
[   30.622349] which lock already depends on the new lock.
[   30.622349]
[   30.623289]
[   30.623289] the existing dependency chain (in reverse order) is:
[   30.624243]
[   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
[   30.625263]        lock_acquire+0x2c7/0x390
[   30.625868]        __mutex_lock+0xae/0x9f0
[   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
[   30.627278]        io_uring_poll+0xcb/0xd0
[   30.627890]        ep_item_poll.isra.14+0x4e/0x90
[   30.628531]        do_epoll_ctl+0xb7e/0x1120
[   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
[   30.629770]        do_syscall_64+0x2d/0x40
[   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.631187]
[   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
[   30.631985]        check_prevs_add+0x226/0xb00
[   30.632584]        __lock_acquire+0x1237/0x13a0
[   30.633207]        lock_acquire+0x2c7/0x390
[   30.633740]        __mutex_lock+0xae/0x9f0
[   30.634258]        __ep_eventpoll_poll+0x9f/0x220
[   30.634879]        __io_arm_poll_handler+0xbf/0x220
[   30.635462]        io_issue_sqe+0xa6b/0x13e0
[   30.635982]        __io_queue_sqe+0x10b/0x550
[   30.636648]        io_queue_sqe+0x235/0x470
[   30.637281]        io_submit_sqes+0xcce/0xf10
[   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.638465]        do_syscall_64+0x2d/0x40
[   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.639643]
[   30.639643] other info that might help us debug this:
[   30.639643]
[   30.640618]  Possible unsafe locking scenario:
[   30.640618]
[   30.641402]        CPU0                    CPU1
[   30.641938]        ----                    ----
[   30.642664]   lock(&ctx->uring_lock);
[   30.643425]                                lock(&ep->mtx);
[   30.644498]                                lock(&ctx->uring_lock);
[   30.645668]   lock(&ep->mtx);
[   30.646321]
[   30.646321]  *** DEADLOCK ***
[   30.646321]
[   30.647642] 1 lock held by a.out/1128:
[   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.649954]
[   30.649954] stack backtrace:
[   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
[   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   30.652290] Call Trace:
[   30.652688]  dump_stack+0xac/0xe3
[   30.653164]  check_noncircular+0x11e/0x130
[   30.653747]  ? check_prevs_add+0x226/0xb00
[   30.654303]  check_prevs_add+0x226/0xb00
[   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
[   30.655564]  __lock_acquire+0x1237/0x13a0
[   30.656262]  lock_acquire+0x2c7/0x390
[   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.658014]  __mutex_lock+0xae/0x9f0
[   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.659112]  ? mark_held_locks+0x5a/0x80
[   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
[   30.660885]  ? trace_hardirqs_on+0x46/0x110
[   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.662696]  __ep_eventpoll_poll+0x9f/0x220
[   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
[   30.663875]  __io_arm_poll_handler+0xbf/0x220
[   30.664463]  io_issue_sqe+0xa6b/0x13e0
[   30.664984]  ? __lock_acquire+0x782/0x13a0
[   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.666170]  ? __io_queue_sqe+0x10b/0x550
[   30.666725]  __io_queue_sqe+0x10b/0x550
[   30.667252]  ? __fget_files+0x131/0x260
[   30.667791]  ? io_req_prep+0xd8/0x1090
[   30.668316]  ? io_queue_sqe+0x235/0x470
[   30.668868]  io_queue_sqe+0x235/0x470
[   30.669398]  io_submit_sqes+0xcce/0xf10
[   30.669931]  ? xa_load+0xe4/0x1c0
[   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
[   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
[   30.672380]  do_syscall_64+0x2d/0x40
[   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.673503] RIP: 0033:0x7fd89c813239
[   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
[   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
[   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
[   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
[   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
[   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000

This might happen if we do epoll_wait on a uring fd while reading/writing
the former epoll fd in a sqe in the former uring instance.
So let's don't flush cqring overflow list, just do a simple check.

Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:49 +01:00
Qu Wenruo
6a402b937e btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge
[ Upstream commit 951c80f83d61bd4b21794c8aba829c3c1a45c2d0 ]

Commit dbfdb6d1b3 ("Btrfs: Search for all ordered extents that could
span across a page") make btrfs_invalidapage() to search all ordered
extents.

The offending code looks like this:

  again:
	  start = page_start;
	  ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
	  if (ordred) {
		  end = min(page_end,
			    ordered->file_offset + ordered->num_bytes - 1);

		  /* Do the cleanup */

		  start = end + 1;
		  if (start < page_end)
			  goto again;
	  }

The behavior is indeed necessary for the incoming subpage support, but
when it iterates through all the ordered extents, it also resets the
search range @start.

This means, for the following cases, we can double account the ordered
extents, causing its bytes_left underflow:

	Page offset
	0		16K		32K
	|<--- OE 1  --->|<--- OE 2 ---->|

As the first iteration will find ordered extent (OE) 1, which doesn't
cover the full page, thus after cleanup code, we need to retry again.
But again label will reset start to page_start, and we got OE 1 again,
which causes double accounting on OE 1, and cause OE 1's byte_left to
underflow.

This problem can only happen for subpage case, as for regular sectorsize
== PAGE_SIZE case, we will always find a OE ends at or after page end,
thus no way to trigger the problem.

Move the again label after start = page_start.  There will be more
comprehensive rework to convert the open coded loop to a proper while
loop for subpage support.

Fixes: dbfdb6d1b3 ("Btrfs: Search for all ordered extents that could span across a page")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:48 +01:00
Zhihao Cheng
006ef266c2 btrfs: clarify error returns values in __load_free_space_cache
[ Upstream commit 3cc64e7ebfb0d7faaba2438334c43466955a96e8 ]

Return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures and 0 is returned instead.
This is not a problem for the caller load_free_space_cache because only
value 1 is considered as 'cache loaded' but for clarity it's better
to set the errors accordingly.

Fixes: a67509c300 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:47 +01:00
Dehe Gu
5dc2ee02fd f2fs: fix a wrong condition in __submit_bio
[ Upstream commit 39f71b7e40e21805d6b15fc7750bdd9cab6a5010 ]

We should use !F2FS_IO_ALIGNED() to check and submit_io directly.

Fixes: 8223ecc456 ("f2fs: fix to add missing F2FS_IO_ALIGNED() condition")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Dehe Gu <gudehe@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:46 +01:00
Yi Chen
a9553ae64d f2fs: fix to avoid inconsistent quota data
[ Upstream commit 25fb04dbce6a0e165d28fd1fa8a1d7018c637fe8 ]

Occasionally, quota data may be corrupted detected by fsck:

Info: checkpoint state = 45 :  crc compacted_summary unmount
[QUOTA WARNING] Usage inconsistent for ID 0:actual (1543036928, 762) != expected (1543032832, 762)
[ASSERT] (fsck_chk_quota_files:1986)  --> Quota file is missing or invalid quota file content found.
[QUOTA WARNING] Usage inconsistent for ID 0:actual (1352478720, 344) != expected (1352474624, 344)
[ASSERT] (fsck_chk_quota_files:1986)  --> Quota file is missing or invalid quota file content found.

[FSCK] Unreachable nat entries                        [Ok..] [0x0]
[FSCK] SIT valid block bitmap checking                [Ok..]
[FSCK] Hard link checking for regular file            [Ok..] [0x0]
[FSCK] valid_block_count matching with CP             [Ok..] [0xdf299]
[FSCK] valid_node_count matcing with CP (de lookup)   [Ok..] [0x2b01]
[FSCK] valid_node_count matcing with CP (nat lookup)  [Ok..] [0x2b01]
[FSCK] valid_inode_count matched with CP              [Ok..] [0x2665]
[FSCK] free segment_count matched with CP             [Ok..] [0xcb04]
[FSCK] next block offset is free                      [Ok..]
[FSCK] fixing SIT types
[FSCK] other corrupted bugs                           [Fail]

The root cause is:
If we open file w/ readonly flag, disk quota info won't be initialized
for this file, however, following mmap() will force to convert inline
inode via f2fs_convert_inline_inode(), which may increase block usage
for this inode w/o updating quota data, it causes inconsistent disk quota
info.

The issue will happen in following stack:
open(file, O_RDONLY)
mmap(file)
- f2fs_convert_inline_inode
 - f2fs_convert_inline_page
  - f2fs_reserve_block
   - f2fs_reserve_new_block
    - f2fs_reserve_new_blocks
     - f2fs_i_blocks_write
      - dquot_claim_block
inode->i_blocks increase, but the dqb_curspace keep the size for the dquots
is NULL.

To fix this issue, let's call dquot_initialize() anyway in both
f2fs_truncate() and f2fs_convert_inline_inode() functions to avoid potential
inconsistent quota data issue.

Fixes: 0abd675e97 ("f2fs: support plain user/group quota")
Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com>
Signed-off-by: Dehe Gu <gudehe@huawei.com>
Signed-off-by: Junchao Jiang <jiangjunchao1@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Signed-off-by: Yi Chen <chenyi77@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:46 +01:00
Will Deacon
61a1f0ad45 mm: proc: Invalidate TLB after clearing soft-dirty page state
[ Upstream commit 912efa17e5121693dfbadae29768f4144a3f9e62 ]

Since commit 0758cd8304 ("asm-generic/tlb: avoid potential double
flush"), TLB invalidation is elided in tlb_finish_mmu() if no entries
were batched via the tlb_remove_*() functions. Consequently, the
page-table modifications performed by clear_refs_write() in response to
a write to /proc/<pid>/clear_refs do not perform TLB invalidation.
Although this is fine when simply aging the ptes, in the case of
clearing the "soft-dirty" state we can end up with entries where
pte_write() is false, yet a writable mapping remains in the TLB.

Fix this by avoiding the mmu_gather API altogether: managing both the
'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the sort-dirty path, much like mprotect() does already.

Fixes: 0758cd8304 ("asm-generic/tlb: avoid potential double flush”)
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-2-will@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:45 +01:00