linux_dsm_epyc7002/fs/btrfs
Filipe Manana b84b8390d6 Btrfs: fix file read corruption after extent cloning and fsync
If we partially clone one extent of a file into a lower offset of the
file, fsync the file, power fail and then mount the fs to trigger log
replay, we can get multiple checksum items in the csum tree that overlap
each other and result in checksum lookup failures later. Those failures
can make file data read requests assume a checksum value of 0, but they
will not return an error (-EIO for example) to userspace exactly because
the expected checksum value 0 is a special value that makes the read bio
endio callback return success and set all the bytes of the corresponding
page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
From a userspace perspective this is equivalent to file corruption
because we are not returning what was written to the file.

Details about how this can happen, and why, are included inline in the
following reproducer test case for fstests and the comment added to
tree-log.c.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with a single 100K extent starting at file
  # offset 800K. We fsync the file here to make the fsync log tree gets
  # a single csum item that covers the whole 100K extent, which causes
  # the second fsync, done after the cloning operation below, to not
  # leave in the log tree two csum items covering two sub-ranges
  # ([0, 20K[ and [20K, 100K[)) of our extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                  -c "fsync"                     \
                   $SCRATCH_MNT/foo | _filter_xfs_io

  # Now clone part of our extent into file offset 400K. This adds a file
  # extent item to our inode's metadata that points to the 100K extent
  # we created before, using a data offset of 20K and a data length of
  # 20K, so that it refers to the sub-range [20K, 40K[ of our original
  # extent.
  $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
      -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # Now fsync our file to make sure the extent cloning is durably
  # persisted. This fsync will not add a second csum item to the log
  # tree containing the checksums for the blocks in the sub-range
  # [20K, 40K[ of our extent, because there was already a csum item in
  # the log tree covering the whole extent, added by the first fsync
  # we did before.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and ummount to simulate a crash/power
  # failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file
  # contents.
  # The fsync log replay first processes the file extent item
  # corresponding to the file offset 400K (the one which refers to the
  # [20K, 40K[ sub-range of our 100K extent) and then processes the file
  # extent item for file offset 800K. It used to happen that when
  # processing the later, it erroneously left in the csum tree 2 csum
  # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
  # 1 for the whole range of our extent. This introduced a problem where
  # subsequent lookups for the checksums of blocks within the range
  # [40K, 100K[ of our extent would not find anything because lookups in
  # the csum tree ended up looking only at the smaller csum item, the
  # one covering the subrange [20K, 40K[. This made read requests assume
  # an expected checksum with a value of 0 for those blocks, which caused
  # checksum verification failure when the read operations finished.
  # However those checksum failure did not result in read requests
  # returning an error to user space (like -EIO for e.g.) because the
  # expected checksum value had the special value 0, and in that case
  # btrfs set all bytes of the corresponding pages with the value 0x01
  # and produce the following warning in dmesg/syslog:
  #
  #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
  #   1322675045 expected csum 0"
  #
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  # Must match the same digest he had after cloning the extent and
  # before the power failure happened.
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:27:46 -07:00
..
tests btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism. 2015-06-10 09:26:05 -07:00
acl.c btrfs: remove useless ACL check 2014-06-09 17:20:42 -07:00
async-thread.c btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() 2015-06-10 07:04:52 -07:00
async-thread.h btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() 2015-06-10 07:04:52 -07:00
backref.c Btrfs: fix warning in backref walking 2015-08-09 07:33:50 -07:00
backref.h btrfs: cleanup, remove inode_item_info helper 2015-01-14 19:23:47 +01:00
btrfs_inode.h Btrfs: fix warning of bytes_may_use 2015-07-01 17:17:21 -07:00
check-integrity.c Merge branch 'cleanups-post-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 2015-03-25 10:52:48 -07:00
check-integrity.h
compression.c Merge branch 'cleanups-post-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 2015-03-25 10:52:48 -07:00
compression.h btrfs: constify structs with op functions or static definitions 2015-02-16 18:48:44 +01:00
ctree.c btrfs: abort transaction on btrfs_reloc_cow_block() 2015-08-09 07:07:14 -07:00
ctree.h Merge branch 'jeffm-discard-4.3' into for-linus-4.3 2015-08-09 07:35:33 -07:00
delayed-inode.c Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. 2015-04-26 06:27:03 -07:00
delayed-inode.h
delayed-ref.c btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref() 2015-06-24 12:28:03 -07:00
delayed-ref.h btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots. 2015-06-10 09:26:23 -07:00
dev-replace.c btrfs: its btrfs_err() instead of btrfs_error() 2015-07-22 18:20:53 -07:00
dev-replace.h
dir-item.c Btrfs: make xattr replace operations atomic 2014-11-20 17:20:07 -08:00
disk-io.c Merge branch 'jeffm-discard-4.3' into for-linus-4.3 2015-08-09 07:35:33 -07:00
disk-io.h Btrfs: disk-io: replace root args iff only fs_info used 2015-02-16 18:48:43 +01:00
export.c VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
export.h
extent_io.c btrfs: Prevent from early transaction abort 2015-08-19 14:25:15 -07:00
extent_io.h btrfs: constify structs with op functions or static definitions 2015-02-16 18:48:44 +01:00
extent_map.c Btrfs: do not move em to modified list when unpinning 2014-11-21 11:59:54 -08:00
extent_map.h Btrfs: fix NULL pointer crash when running balance and scrub concurrently 2014-06-19 14:20:55 -07:00
extent-tree.c Merge branch 'jeffm-discard-4.3' into for-linus-4.3 2015-08-09 07:35:33 -07:00
extent-tree.h btrfs: qgroup: Add new qgroup calculation function 2015-06-10 09:25:49 -07:00
file-item.c Merge branch 'cleanups-post-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 2015-03-25 10:52:48 -07:00
file.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-07-04 19:36:06 -07:00
free-space-cache.c btrfs: add missing discards when unpinning extents with -o discard 2015-07-29 08:15:29 -07:00
free-space-cache.h Btrfs: allow block group cache writeout outside critical section in commit 2015-04-10 14:07:22 -07:00
hash.c btrfs: LLVMLinux: Remove VLAIS 2014-10-14 10:51:22 +02:00
hash.h
inode-item.c Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs 2015-01-21 18:02:05 -08:00
inode-map.c Btrfs: fix race between caching kthread and returning inode to inode cache 2015-06-30 14:36:46 -07:00
inode-map.h
inode.c Btrfs: add support for blkio controllers 2015-08-09 07:35:06 -07:00
ioctl.c btrfs: fix clone / extent-same deadlocks 2015-08-09 07:34:25 -07:00
Kconfig rcu: Make SRCU optional by using CONFIG_SRCU 2015-01-06 11:04:29 -08:00
locking.c btrfs: Add WARN_ON() for double lock in btrfs_tree_lock() 2015-08-09 07:07:14 -07:00
locking.h btrfs: fix lockups from btrfs_clear_path_blocking 2014-11-19 10:34:35 -08:00
lzo.c btrfs: constify structs with op functions or static definitions 2015-02-16 18:48:44 +01:00
Makefile Btrfs: add sanity tests for new qgroup accounting code 2014-06-09 17:20:49 -07:00
math.h btrfs: cleanup 64bit/32bit divs, compile time constants 2015-03-03 17:23:57 +01:00
ordered-data.c Btrfs: fix memory corruption on failure to submit bio for direct IO 2015-07-01 17:17:18 -07:00
ordered-data.h Btrfs: avoid syncing log in the fast fsync path when not necessary 2015-06-10 07:02:43 -07:00
orphan.c btrfs: kill the key type accessor helpers 2014-09-17 13:37:12 -07:00
print-tree.c btrfs: remove parameter blocksize from read_tree_block 2014-10-02 17:14:50 +02:00
print-tree.h
props.c btrfs: constify structs with op functions or static definitions 2015-02-16 18:48:44 +01:00
props.h
qgroup.c btrfs: qgroup: allow user to clear the limitation on qgroup 2015-06-30 13:20:00 -07:00
qgroup.h btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. 2015-06-10 09:26:11 -07:00
raid56.c Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation 2015-08-09 07:34:26 -07:00
raid56.h Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation 2015-08-09 07:34:26 -07:00
rcu-string.h
reada.c Btrfs: count devices correctly in readahead during RAID 5/6 replace 2015-08-09 07:34:26 -07:00
relocation.c btrfs: Remove unnecessary variants in relocation.c 2015-08-09 07:07:14 -07:00
root-tree.c Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root 2014-06-09 17:20:40 -07:00
scrub.c Btrfs: fix parity scrub of RAID 5/6 with missing device 2015-08-09 07:34:26 -07:00
send.c Btrfs: use received_uuid of parent during send 2015-06-12 13:20:38 -07:00
send.h
struct-funcs.c
super.c Merge branch 'jeffm-discard-4.3' into for-linus-4.3 2015-08-09 07:35:33 -07:00
sysfs.c Btrfs: Check if kobject is initialized before put 2015-06-22 14:43:31 +02:00
sysfs.h Btrfs: sysfs: btrfs_sysfs_remove_fsid() make it non static 2015-05-27 12:27:22 +02:00
transaction.c Btrfs: check if previous transaction aborted to avoid fs corruption 2015-08-19 14:27:31 -07:00
transaction.h btrfs: add missing discards when unpinning extents with -o discard 2015-07-29 08:15:29 -07:00
tree-defrag.c btrfs: let tree defrag work in SSD mode 2015-06-02 19:34:33 -07:00
tree-log.c Btrfs: fix file read corruption after extent cloning and fsync 2015-08-19 14:27:46 -07:00
tree-log.h Btrfs: fix metadata inconsistencies after directory fsync 2015-03-26 17:56:23 -07:00
ulist.c btrfs: ulist: Add ulist_del() function. 2015-06-10 09:26:17 -07:00
ulist.h btrfs: ulist: Add ulist_del() function. 2015-06-10 09:26:17 -07:00
uuid-tree.c Btrfs: make btrfs_search_forward return with nodes unlocked 2014-09-17 13:38:02 -07:00
volumes.c btrfs: use __GFP_NOFAIL in alloc_btrfs_bio 2015-08-19 14:25:15 -07:00
volumes.h Merge branch 'jeffm-discard-4.3' into for-linus-4.3 2015-08-09 07:35:33 -07:00
xattr.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
xattr.h
zlib.c btrfs: constify structs with op functions or static definitions 2015-02-16 18:48:44 +01:00