linux_dsm_epyc7002/fs/btrfs
Filipe Manana 54f290efac btrfs: fix race between marking inode needs to be logged and log syncing
commit bc0939fcfab0d7efb2ed12896b1af3d819954a14 upstream.

We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and
btrfs_sync_log(). The following steps describe how the race happens.

1) We are at transaction N;

2) Inode I was previously fsynced in the current transaction so it has:

    inode->logged_trans set to N;

3) The inode's root currently has:

   root->log_transid set to 1
   root->last_log_commit set to 0

   Which means only one log transaction was committed so far, log
   transaction 0. When a log tree is created we set ->log_transid and
   ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());

4) One more range of pages is dirtied in inode I;

5) Some task A starts an fsync against some other inode J (same root), and
   so it joins log transaction 1.

   Before task A calls btrfs_sync_log()...

6) Task B starts an fsync against inode I, which currently has the full
   sync flag set, so it starts delalloc and waits for the ordered extent
   to complete before calling btrfs_inode_in_log() at btrfs_sync_file();

7) During ordered extent completion we have btrfs_update_inode() called
   against inode I, which in turn calls btrfs_set_inode_last_trans(),
   which does the following:

     spin_lock(&inode->lock);
     inode->last_trans = trans->transaction->transid;
     inode->last_sub_trans = inode->root->log_transid;
     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   So ->last_trans is set to N and ->last_sub_trans set to 1.
   But before setting ->last_log_commit...

8) Task A is at btrfs_sync_log():

   - it increments root->log_transid to 2
   - starts writeback for all log tree extent buffers
   - waits for the writeback to complete
   - writes the super blocks
   - updates root->last_log_commit to 1

   There are a lot of slow steps between updating root->log_transid and
   root->last_log_commit;

9) The task doing the ordered extent completion, currently at
   btrfs_set_inode_last_trans(), then finally runs:

     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   Which results in inode->last_log_commit being set to 1.
   The ordered extent completes;

10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
    true because we have all the following conditions met:

    inode->logged_trans == N which matches fs_info->generation &&
    inode->last_sub_trans (1) <= inode->last_log_commit (1) &&
    inode->last_sub_trans (1) <= root->last_log_commit (1) &&
    list inode->extent_tree.modified_extents is empty

    And as a consequence we return without logging the inode, so the
    existing logged version of the inode does not point to the extent
    that was written after the previous fsync.
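
The interleaving in steps 7 to 10 can be replayed deterministically in a small userspace model. This is only a sketch: the `model_*` structs and functions are hypothetical stand-ins that keep just the fields named above, the transaction id 100 is arbitrary, and the check is reduced to the four conditions listed in step 10. It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins holding only the fields named in the steps above. */
struct model_root {
	int log_transid;
	int last_log_commit;
};

struct model_inode {
	uint64_t logged_trans;
	uint64_t last_trans;
	int last_sub_trans;
	int last_log_commit;
	bool has_modified_extents;
	struct model_root *root;
};

/* The four conditions from step 10, in the same order. */
bool model_inode_in_log(const struct model_inode *inode, uint64_t generation)
{
	return inode->logged_trans == generation &&
	       inode->last_sub_trans <= inode->last_log_commit &&
	       inode->last_sub_trans <= inode->root->last_log_commit &&
	       !inode->has_modified_extents;
}

/* Step 8, with the slow writeback and super block writes left out. */
void model_sync_log(struct model_root *root)
{
	int committed = root->log_transid;

	root->log_transid++;
	/* ... writeback, wait, write super blocks ... */
	root->last_log_commit = committed;
}

/*
 * Replay steps 7 to 10. With racy == true the log sync lands between the
 * two assignments of step 7; otherwise it runs after both of them.
 * Returns what the "inode already in log" check reports to task B.
 */
bool model_replay(bool racy)
{
	const uint64_t transid = 100; /* arbitrary value for transaction N */
	struct model_root root = { .log_transid = 1, .last_log_commit = 0 };
	struct model_inode inode = { .logged_trans = transid, .root = &root };

	/* Step 7: first two assignments. */
	inode.last_trans = transid;
	inode.last_sub_trans = root.log_transid;

	if (racy)
		model_sync_log(&root); /* Step 8 runs in the window. */

	/* Step 9: the delayed third assignment. */
	inode.last_log_commit = root.last_log_commit;

	if (!racy)
		model_sync_log(&root);

	/* Step 10: task B decides whether the inode still needs logging. */
	assert(inode.last_sub_trans == 1);
	return model_inode_in_log(&inode, transid);
}
```

Calling `model_replay(true)` reports the inode as already logged even though it was modified after the previous fsync, while `model_replay(false)` reports that it still needs logging.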

It should be impossible in practice for one task to be able to make so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.

However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:

  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
  {
     (...)
     BTRFS_I(inode)->last_trans = fs_info->generation;
     BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
     BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
     (...)

So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
make enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.

So fix this in two different ways:

1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
   value of ->last_sub_trans minus 1;

2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
   like we do for buffered and direct writes at btrfs_file_write_iter(),
   which is all we need to make sure multiple writes and fsyncs to an
   inode in the same transaction never result in an fsync missing that
   the inode changed and needs to be logged. Turn this into a helper
   function and use it both at btrfs_page_mkwrite() and at
   btrfs_file_write_iter() - this also fixes the problem that at
   btrfs_page_mkwrite() we were setting those fields without the
   protection of the inode's spinlock.
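
The two fixes can be sketched in the same kind of userspace model. Everything here is an assumption made for illustration: the `model_*` names are hypothetical, a C11 `atomic_flag` stands in for the inode's spinlock, and `model_needs_logging()` keeps only the inode-local comparison from the check in step 10. The actual patch may differ in naming and detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct model_root {
	int log_transid;
	int last_log_commit;
};

struct model_inode {
	atomic_flag lock; /* stands in for the inode's spinlock */
	uint64_t last_trans;
	int last_sub_trans;
	int last_log_commit;
	struct model_root *root;
};

static void model_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set(lock))
		; /* spin until the flag is released */
}

static void model_unlock(atomic_flag *lock)
{
	atomic_flag_clear(lock);
}

void model_inode_init(struct model_inode *inode, struct model_root *root)
{
	atomic_flag_clear(&inode->lock);
	inode->last_trans = 0;
	inode->last_sub_trans = 0;
	inode->last_log_commit = 0;
	inode->root = root;
}

/*
 * Fix 1: derive ->last_log_commit from ->last_sub_trans instead of reading
 * the root a second time, so a concurrent log sync cannot make them equal.
 */
void model_set_inode_last_trans(struct model_inode *inode, uint64_t transid)
{
	model_lock(&inode->lock);
	inode->last_trans = transid;
	inode->last_sub_trans = inode->root->log_transid;
	inode->last_log_commit = inode->last_sub_trans - 1;
	model_unlock(&inode->lock);
}

/*
 * Fix 2: a helper (hypothetical name) for the write paths that updates
 * only ->last_sub_trans, under the lock.
 */
void model_set_inode_last_sub_trans(struct model_inode *inode)
{
	model_lock(&inode->lock);
	inode->last_sub_trans = inode->root->log_transid;
	model_unlock(&inode->lock);
}

/* Simplified: true when the inode changed after its last recorded log commit. */
bool model_needs_logging(const struct model_inode *inode)
{
	return inode->last_sub_trans > inode->last_log_commit;
}
```

With either function, a log sync that advances the root's counters after the call leaves `last_sub_trans` greater than `last_log_commit` on the inode, so a later fsync in the same transaction still logs it.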

This is an extremely unlikely race to happen in practice.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-05 19:00:50 +02:00
tests init: add dsm gpl source 2024-07-05 18:00:04 +02:00
acl.c btrfs: cleanup btrfs_setxattr_trans and drop transaction parameter 2019-04-29 19:02:44 +02:00
async-thread.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
async-thread.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
backref.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
backref.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
block-group.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
block-group.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
block-rsv.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
block-rsv.h btrfs: Remove __ prefix from btrfs_block_rsv_release 2020-03-23 17:01:55 +01:00
btrfs_inode.h btrfs: fix race between marking inode needs to be logged and log syncing 2024-07-05 19:00:50 +02:00
check-integrity.c btrfs: check-integrity: remove unnecessary failure messages during memory allocation 2020-07-27 12:55:21 +02:00
check-integrity.h btrfs: remove btrfsic_submit_bh() 2020-03-23 17:01:39 +01:00
compression.c btrfs: mark compressed range uptodate only if all bio succeed 2024-07-05 18:02:46 +02:00
compression.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ctree.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ctree.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
delalloc-space.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
delalloc-space.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
delayed-inode.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
delayed-inode.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
delayed-ref.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
delayed-ref.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
dev-replace.c btrfs: fix deadlock when cloning inline extent and low on free metadata space 2021-01-17 14:16:54 +01:00
dev-replace.h btrfs: add __pure attribute to functions 2019-11-18 12:46:52 +01:00
dir-item.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
discard.c btrfs: merge critical sections of discard lock in workfn 2021-01-19 18:27:24 +01:00
discard.h btrfs: discard: Use the correct style for SPDX License Identifier 2020-04-20 17:43:42 +02:00
disk-io.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
disk-io.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
export.c btrfs: simplify iget helpers 2020-05-25 11:25:37 +02:00
export.h btrfs: export helpers for subvolume name/id resolution 2020-03-23 17:01:42 +01:00
extent_io.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
extent_io.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
extent_map.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
extent_map.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
extent-io-tree.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
extent-tree.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
file-item.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
file.c btrfs: fix race between marking inode needs to be logged and log syncing 2024-07-05 19:00:50 +02:00
free-space-analyze.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
free-space-cache.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
free-space-cache.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
free-space-tree.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
free-space-tree.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
inode-item.c btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref 2019-09-09 14:59:16 +02:00
inode-map.c btrfs: make btrfs_delalloc_reserve_space take btrfs_inode 2020-07-27 12:55:36 +02:00
inode-map.h btrfs: replace GPL boilerplate by SPDX -- headers 2018-04-12 16:29:46 +02:00
inode.c btrfs: fix race between marking inode needs to be logged and log syncing 2024-07-05 19:00:50 +02:00
ioctl.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
Kconfig btrfs: disable build on platforms having page size 256K 2021-07-14 16:55:56 +02:00
locking.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
locking.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
lzo.c btrfs: compression: inline free_workspace 2019-11-18 12:46:59 +01:00
Makefile init: add dsm gpl source 2024-07-05 18:00:04 +02:00
misc.h btrfs: rename tree_entry to rb_simple_node and export it 2020-05-25 11:25:19 +02:00
ordered-data.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ordered-data.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
orphan.c btrfs: replace GPL boilerplate by SPDX -- sources 2018-04-12 16:29:51 +02:00
print-tree.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
print-tree.h btrfs: print the actual offset in btrfs_root_name 2021-01-27 11:55:06 +01:00
props.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
props.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
qgroup.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
qgroup.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
raid56.c btrfs: fix raid6 qstripe kmap 2021-03-09 11:11:10 +01:00
raid56.h btrfs: constify map parameter for nr_parity_stripes and nr_data_stripes 2019-07-01 13:34:58 +02:00
rcu-string.h btrfs: rcu-string: Replace zero-length array with flexible-array member 2020-03-23 17:01:53 +01:00
reada.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ref-verify.c btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod 2020-11-05 13:03:39 +01:00
ref-verify.h btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod() 2019-04-29 19:02:49 +02:00
reflink.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
reflink.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
relocation.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
root-tree.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
scrub.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
send.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
send.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
snapshot-size-query.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
space-info.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
space-info.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
struct-funcs.c btrfs: use unaligned helpers for stack and header set/get helpers 2020-10-07 12:13:23 +02:00
super.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno_acl.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno_acl.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno-extent-usage.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno-feat-tree.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno-feat-tree.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno-locker.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno-rbd-meta.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
syno-rbd-meta.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
sysfs.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
sysfs.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
transaction.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
transaction.h btrfs: fix race between marking inode needs to be logged and log syncing 2024-07-05 19:00:50 +02:00
tree-checker.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
tree-checker.h btrfs: get fs_info from eb in btrfs_check_chunk_valid 2019-04-29 19:02:39 +02:00
tree-defrag.c btrfs: remove unused btrfs_root::defrag_trans_start 2020-07-27 12:55:28 +02:00
tree-log.c btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction 2024-07-05 18:07:24 +02:00
tree-log.h btrfs: make fast fsyncs wait only for writeback 2020-10-07 12:06:56 +02:00
ulist.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
ulist.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
usrquota.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
uuid-tree.c btrfs: simplify root lookup by id 2020-05-25 11:25:36 +02:00
volumes.c btrfs: fix rw device counting in __btrfs_free_extra_devids 2024-07-05 18:02:41 +02:00
volumes.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
xattr.c init: add dsm gpl source 2024-07-05 18:00:04 +02:00
xattr.h init: add dsm gpl source 2024-07-05 18:00:04 +02:00
zlib.c btrfs: use larger zlib buffer for s390 hardware compression 2020-01-31 10:30:40 -08:00
zstd.c btrfs: compression: inline free_workspace 2019-11-18 12:46:59 +01:00