linux_dsm_epyc7002/fs
Filipe Manana 54f290efac btrfs: fix race between marking inode needs to be logged and log syncing
commit bc0939fcfab0d7efb2ed12896b1af3d819954a14 upstream.

We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.

1) We are at transaction N;

2) Inode I was previously fsynced in the current transaction so it has:

    inode->logged_trans set to N;

3) The inode's root currently has:

   root->log_transid set to 1
   root->last_log_commit set to 0

   Which means only one log transaction was committed to far, log
   transaction 0. When a log tree is created we set ->log_transid and
   ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());

4) One more range of pages is dirtied in inode I;

5) Some task A starts an fsync against some other inode J (same root), and
   so it joins log transaction 1.

   Before task A calls btrfs_sync_log()...

6) Task B starts an fsync against inode I, which currently has the full
   sync flag set, so it starts delalloc and waits for the ordered extent
   to complete before calling btrfs_inode_in_log() at btrfs_sync_file();

7) During ordered extent completion we have btrfs_update_inode() called
   against inode I, which in turn calls btrfs_set_inode_last_trans(),
   which does the following:

     spin_lock(&inode->lock);
     inode->last_trans = trans->transaction->transid;
     inode->last_sub_trans = inode->root->log_transid;
     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   So ->last_trans is set to N and ->last_sub_trans set to 1.
   But before setting ->last_log_commit...

8) Task A is at btrfs_sync_log():

   - it increments root->log_transid to 2
   - starts writeback for all log tree extent buffers
   - waits for the writeback to complete
   - writes the super blocks
   - updates root->last_log_commit to 1

   It's a lot of slow steps between updating root->log_transid and
   root->last_log_commit;

9) The task doing the ordered extent completion, currently at
   btrfs_set_inode_last_trans(), then finally runs:

     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   Which results in inode->last_log_commit being set to 1.
   The ordered extent completes;

10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
    true because we have all the following conditions met:

    inode->logged_trans == N which matches fs_info->generation &&
    inode->last_subtrans (1) <= inode->last_log_commit (1) &&
    inode->last_subtrans (1) <= root->last_log_commit (1) &&
    list inode->extent_tree.modified_extents is empty

    And as a consequence we return without logging the inode, so the
    existing logged version of the inode does not point to the extent
    that was written after the previous fsync.

It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.

However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:

  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
  {
     (...)
     BTRFS_I(inode)->last_trans = fs_info->generation;
     BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
     BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
     (...)

So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.

So fix this in two different ways:

1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
   value of ->last_sub_trans minus 1;

2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
   like we do for buffered and direct writes at btrfs_file_write_iter(),
   which is all we need to make sure multiple writes and fsyncs to an
   inode in the same transaction never result in an fsync missing that
   the inode changed and needs to be logged. Turn this into a helper
   function and use it both at btrfs_page_mkwrite() and at
   btrfs_file_write_iter() - this also fixes the problem that at
   btrfs_page_mkwrite() we were setting those fields without the
   protection of the inode's spinlock.

This is an extremely unlikely race to happen in practice.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-05 19:00:50 +02:00
..
9p fs: 9p: add generic splice_write file operation 2020-12-01 21:40:47 +01:00
adfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
affs fs/affs: release old buffer head on error path 2021-03-04 11:38:37 +01:00
afs afs: Fix tracepoint string placement with built-in AFS 2021-07-28 14:35:41 +02:00
aufs init: add dsm gpl source 2024-07-05 18:00:04 +02:00
autofs autofs: harden ioctl table 2020-10-16 11:11:22 -07:00
befs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
bfs bfs: don't use WARNING: string when it's just info. 2021-01-06 14:56:52 +01:00
btrfs btrfs: fix race between marking inode needs to be logged and log syncing 2024-07-05 19:00:50 +02:00
cachefiles fs/cachefiles: Remove wait_bit_key layout dependency 2021-03-30 14:32:07 +02:00
ceph init: add dsm gpl source 2024-07-05 18:00:04 +02:00
cifs cifs: create sd context must be a multiple of 8 2024-07-05 18:53:59 +02:00
coda
configfs init: add dsm gpl source 2024-07-05 18:00:04 +02:00
cramfs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
crypto fscrypt: fix derivation of SipHash keys on big endian CPUs 2021-07-14 16:56:53 +02:00
debugfs init: add dsm gpl source 2024-07-05 18:00:04 +02:00
devpts
dlm fs: dlm: fix memory leak when fenced 2021-07-14 16:55:59 +02:00
ecryptfs init: add dsm gpl source 2024-07-05 18:00:04 +02:00
efivarfs efivarfs: revert "fix memory leak in efivarfs_create()" 2020-11-25 16:55:02 +01:00
efs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
erofs erofs: fix error return code in erofs_read_superblock() 2021-07-14 16:56:53 +02:00
exfat init: add dsm gpl source 2024-07-05 18:00:04 +02:00
exportfs init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ext2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
ext4 ext4: fix potential htree corruption when growing large_dir directories 2024-07-05 18:52:29 +02:00
f2fs f2fs: Show casefolding support only when supported 2021-07-25 14:36:17 +02:00
fat init: add dsm gpl source 2024-07-05 18:00:04 +02:00
freevxfs
fscache
fuse init: add dsm gpl source 2024-07-05 18:00:04 +02:00
gfs2 gfs2: Fix error handling in init_statfs 2021-07-14 16:55:38 +02:00
hfs hfs: add lock nesting notation to hfs_find_init 2021-07-31 08:16:12 +02:00
hfsplus init: add dsm gpl source 2024-07-05 18:00:04 +02:00
hostfs hostfs: fix memory handling in follow_link() 2021-04-14 08:42:06 +02:00
hpfs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
hugetlbfs hugetlbfs: fix mount mode command line processing 2021-07-28 14:35:46 +02:00
iomap init: add dsm gpl source 2024-07-05 18:00:04 +02:00
isofs init: add dsm gpl source 2024-07-05 18:00:04 +02:00
jbd2 ext4: fix debug format string warning 2021-05-19 10:13:19 +02:00
jffs2 jffs2: check the validity of dstlen in jffs2_zlib_compress() 2021-05-11 14:47:36 +02:00
jfs fs/jfs: Fix missing error code in lmLogInit() 2021-07-20 16:05:40 +02:00
kernfs kernfs: wire up ->splice_read and ->splice_write 2021-01-27 11:55:29 +01:00
lockd init: add dsm gpl source 2024-07-05 18:00:04 +02:00
minix [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
nfs init: add dsm gpl source 2024-07-05 18:00:04 +02:00
nfs_common nfs_common: need lock during iterate through the list 2020-12-30 11:53:45 +01:00
nfsd init: add dsm gpl source 2024-07-05 18:00:04 +02:00
nilfs2 nilfs2: fix memory leak in nilfs_sysfs_delete_device_group 2021-06-30 08:47:24 -04:00
nls
notify init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ntfs ntfs: fix validity check for file name attribute 2021-07-14 16:55:38 +02:00
ocfs2 ocfs2: issue zeroout to EOF blocks 2024-07-05 18:03:15 +02:00
omfs fs: omfs: use kmemdup() rather than kmalloc+memcpy 2020-09-22 23:39:45 -04:00
openpromfs
orangefs orangefs: fix orangefs df output. 2021-07-20 16:05:48 +02:00
overlayfs ovl: fix uninitialized pointer read in ovl_lookup_real_one() 2024-07-05 18:56:55 +02:00
proc init: add dsm gpl source 2024-07-05 18:00:04 +02:00
pstore init: add dsm gpl source 2024-07-05 18:00:04 +02:00
qnx4 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
qnx6 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
quota init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ramfs ramfs: fix nommu mmap with gaps in the page cache 2020-10-16 11:11:22 -07:00
reiserfs reiserfs: check directory items on read from disk 2024-07-05 18:52:32 +02:00
romfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
squashfs squashfs: fix divide error in calculate_skip() 2021-05-19 10:13:10 +02:00
sysfs sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output 2020-10-02 12:02:30 +02:00
sysv [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
tracefs
ubifs ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inode 2021-07-20 16:05:51 +02:00
udf init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ufs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
unicode unicode: Add utf8_casefold_hash 2020-09-10 14:03:31 -07:00
vboxsf vboxsf: Add support for the atomic_open directory-inode op 2024-07-05 18:54:41 +02:00
verity fs-verity: use smp_load_acquire() for ->i_verity_info 2020-07-21 16:02:41 -07:00
xfs xfs: fix return of uninitialized value in variable error 2021-05-14 09:50:34 +02:00
zonefs zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone() 2021-03-25 09:04:05 +01:00
aio.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
anon_inodes.c
attr.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot 2020-10-16 11:11:21 -07:00
binfmt_elf.c fs: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
binfmt_em86.c
binfmt_flat.c binfmt_flat: revert "binfmt_flat: don't offset the data start" 2020-08-24 08:49:13 +10:00
binfmt_misc.c binfmt_misc: fix possible deadlock in bm_register_write 2021-03-17 17:06:35 +01:00
binfmt_script.c
block_dev.c block: fix a race between del_gendisk and BLKRRPART 2021-06-03 09:00:45 +02:00
buffer.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
char_dev.c
compat_binfmt_elf.c
coredump.c coredump: fix core_pattern parse error 2020-12-06 10:19:07 -08:00
d_path.c fs: fix NULL dereference due to data race in prepend_path() 2020-10-14 14:54:45 -07:00
dax.c dax: fix ENOMEM handling in grab_mapping_entry() 2021-07-14 16:56:13 +02:00
dcache.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
dcookies.c
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-14 08:41:58 +02:00
drop_caches.c
eventfd.c
eventpoll.c fs/epoll: restore waking from ep_done_scan() 2021-05-11 14:47:12 +02:00
exec.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
fcntl.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
fhandle.c
file_table.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
file.c kernel/io_uring: cancel io_uring before task works 2021-01-30 13:55:18 +01:00
filesystems.c
fs_context.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
fs_parser.c fs_parse: mark fs_param_bad_value() as static 2020-10-13 18:38:27 -07:00
fs_pin.c
fs_struct.c vfs: Use sequence counter with associated spinlock 2020-07-29 16:14:27 +02:00
fs_types.c
fs-writeback.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
fsopen.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
init.c init: add an init_dup helper 2020-08-04 21:02:38 -04:00
inode.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
internal.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
io_uring.c io_uring: only assign io_uring_enter() SQPOLL error in actual error case 2024-07-05 18:56:00 +02:00
io-wq.c io_uring: fix false WARN_ONCE 2021-07-19 09:44:51 +02:00
io-wq.h io_uring: always batch cancel in *cancel_files() 2021-02-13 13:54:56 +01:00
ioctl.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
Kconfig init: add dsm gpl source 2024-07-05 18:00:04 +02:00
Kconfig.binfmt
kernel_read_file.c fs/kernel_file_read: Add "offset" arg for partial reads 2020-10-05 13:37:04 +02:00
libfs.c libfs: fix error cast of negative value in simple_attr_write() 2020-11-22 10:48:22 -08:00
locks.c Revert "nfsd4: a client's own opens needn't prevent delegations" 2021-03-20 10:43:44 +01:00
Makefile init: add dsm gpl source 2024-07-05 18:00:04 +02:00
mbcache.c
mount.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
mpage.c
namei.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
namespace.c fs: warn about impending deprecation of mandatory locks 2024-07-05 18:56:00 +02:00
no-block.c
nsfs.c
open.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
pipe.c pipe: increase minimum default pipe size to 2 pages 2024-07-05 18:52:29 +02:00
pnode.c
pnode.h mount: fix mounting of detached mounts onto targets that reside on shared mounts 2021-03-17 17:06:13 +01:00
posix_acl.c
proc_namespace.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
read_write.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-21 13:00:54 +02:00
remap_range.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
select.c kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data() 2021-03-25 09:04:16 +01:00
seq_file.c seq_file: disallow extremely large seq buffer allocations 2021-07-20 16:05:59 +02:00
signalfd.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
splice.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
stack.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
stat.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
statfs.c Add a "nosymfollow" mount option. 2020-08-27 16:06:47 -04:00
super.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
sync.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno_acl_api.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno_acl.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno_acl.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
timerfd.c
userfaultfd.c userfaultfd: do not untag user pointers 2021-07-28 14:35:46 +02:00
utimes.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
xattr.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00