linux_dsm_epyc7002/fs/btrfs
Filipe Manana f0e9b7d640 Btrfs: fix race setting block group readonly during device replace
When we do a device replace, for each device extent we find from the
source device, we set the corresponding block group to readonly mode to
prevent writes into it from happening while we are copying the device
extent from the source to the target device. However, just before we set
the block group to readonly mode, some concurrent task might have already
allocated an extent from it, or decided it could perform a nocow write
into one of its extents. This can make the device replace process miss
copying an extent, because it uses the extent tree's commit root to
search for extents, and once it finishes searching for all extents
belonging to the block group it sets the left cursor to the logical end
address of the block group. This is a problem if the respective ordered
extents finish while we are searching for extents using the extent
tree's commit root and no transaction commit happens while we are
iterating the tree, since it is the delayed references created by the
ordered extents (when they complete) that insert the extent items into
the extent tree (using the non-commit root, of course).
Example:

          CPU 1                                            CPU 2

 btrfs_dev_replace_start()
   btrfs_scrub_dev()
     scrub_enumerate_chunks()
       --> finds device extent belonging
           to block group X

                               <transaction N starts>

                                                      starts buffered write
                                                      against some inode

                                                      writepages is run against
                                                      that inode forcing delalloc
                                                      to run

                                                      btrfs_writepages()
                                                        extent_writepages()
                                                          extent_write_cache_pages()
                                                            __extent_writepage()
                                                              writepage_delalloc()
                                                                run_delalloc_range()
                                                                  cow_file_range()
                                                                    btrfs_reserve_extent()
                                                                      --> allocates an extent
                                                                          from block group X
                                                                          (which is not yet
                                                                           in RO mode)
                                                                    btrfs_add_ordered_extent()
                                                                      --> creates ordered extent Y
                                                        flush_epd_write_bio()
                                                          --> bio against the extent from
                                                              block group X is submitted

       btrfs_inc_block_group_ro(bg X)
         --> sets block group X to readonly

       scrub_chunk(bg X)
         scrub_stripe(device extent from srcdev)
           --> keeps searching for extent items
               belonging to the block group using
               the extent tree's commit root
           --> it never blocks due to
               fs_info->scrub_pause_req as no
               one tries to commit transaction N
           --> copies all extents found from the
               source device into the target device
           --> finishes search loop

                                                        bio completes

                                                        ordered extent Y completes
                                                        and creates delayed data
                                                        reference which will add an
                                                        extent item to the extent
                                                        tree when run (typically
                                                        at transaction commit time)

                                                          --> so the task doing the
                                                              scrub/device replace
                                                              at CPU 1 misses this
                                                              and does not copy this
                                                              extent into the new/target
                                                              device

       btrfs_dec_block_group_ro(bg X)
         --> turns block group X back to RW mode

       dev_replace->cursor_left is set to the
       logical end offset of block group X

So fix this by waiting for all cow and nocow writes after setting a block
group to readonly mode.
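
The ordering problem and the effect of the wait can be illustrated with a
small toy model. This is only a sketch and not kernel code: ToyBlockGroup,
copy_block_group and simulate are hypothetical names, sets of labels stand
in for extents, and the transaction commit done after the wait is an
assumption of the model (the description above only mentions waiting for
the writes).

```python
from dataclasses import dataclass, field


@dataclass
class ToyBlockGroup:
    """Toy stand-in for one block group; none of this is kernel code."""
    commit_root: set = field(default_factory=set)    # extent items the scrub can see
    in_flight: set = field(default_factory=set)      # ordered extents not yet complete
    delayed_refs: set = field(default_factory=set)   # completed, not yet committed
    read_only: bool = False

    def start_write(self, extent):
        """Allocate an extent; refused once the block group is read-only."""
        if self.read_only:
            return False
        self.in_flight.add(extent)
        return True

    def complete_ordered_extents(self):
        """Ordered extents finish and leave delayed references behind."""
        self.delayed_refs |= self.in_flight
        self.in_flight.clear()

    def commit_transaction(self):
        """Running the delayed references makes the items visible in the commit root."""
        self.commit_root |= self.delayed_refs
        self.delayed_refs.clear()


def copy_block_group(bg, wait_for_writes):
    """Return the set of extents a commit-root scan would copy."""
    bg.read_only = True
    if wait_for_writes:
        # The fix: wait for writes that started before the RO switch.
        bg.complete_ordered_extents()
        # Assumption of this model: commit so the commit root catches up.
        bg.commit_transaction()
    copied = set(bg.commit_root)
    bg.read_only = False
    return copied


def simulate(wait_for_writes):
    bg = ToyBlockGroup(commit_root={"A"})
    bg.start_write("Y")                  # CPU 2 allocates before the RO switch
    copied = copy_block_group(bg, wait_for_writes)
    bg.complete_ordered_extents()        # without the wait, Y finishes only now
    bg.commit_transaction()
    return copied, bg.commit_root


print(sorted(simulate(False)[0]))  # ['A']
print(sorted(simulate(True)[0]))   # ['A', 'Y']
```

Without the wait, extent Y ends up in the extent tree but is never copied;
with the wait, the scan sees it before the block group goes back to RW.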

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:21 +01:00
..
tests btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
acl.c Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2016-01-18 12:44:40 -08:00
async-thread.c btrfs: async-thread: Fix a use-after-free error for trace 2016-01-25 16:50:26 -08:00
async-thread.h
backref.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
backref.h
btrfs_inode.h Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
check-integrity.c btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
check-integrity.h
compression.c btrfs: make find_workspace warn if there are no workspaces 2016-05-10 09:46:16 +02:00
compression.h btrfs: move btrfs_compression_type to compression.h 2016-03-11 17:12:46 +01:00
ctree.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
ctree.h Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
delayed-inode.c btrfs: GFP_NOFS does not GFP_HIGHMEM 2016-05-10 09:44:21 +02:00
delayed-inode.h btrfs: properly set the termination value of ctx->pos in readdir 2016-02-11 07:01:59 -08:00
delayed-ref.c btrfs: drop null testing before destroy functions 2016-02-18 11:46:03 +01:00
delayed-ref.h btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
dev-replace.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
dev-replace.h btrfs: refactor btrfs_dev_replace_start for reuse 2016-04-28 10:59:13 +02:00
dir-item.c
disk-io.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
disk-io.h Merge branch 'misc-cleanups-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-11 06:08:37 -08:00
export.c
export.h
extent_io.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
extent_io.h Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
extent_map.c btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
extent_map.h btrfs: cleanup, stop casting for extent_map->lookup everywhere 2016-01-15 19:22:28 +01:00
extent-tree.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
file-item.c btrfs: sink gfp parameter to set_extent_bits 2016-04-29 11:01:47 +02:00
file.c Btrfs: fix handling of faults from btrfs_copy_from_user 2016-05-26 13:23:59 -07:00
free-space-cache.c btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
free-space-cache.h btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
free-space-tree.c Revert "btrfs: synchronize incompat feature bits with sysfs files" 2016-01-29 08:19:37 -08:00
free-space-tree.h
hash.c
hash.h
inode-item.c btrfs: rename btrfs_std_error to btrfs_handle_fs_error 2016-04-28 10:36:54 +02:00
inode-map.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
inode-map.h Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots 2016-01-15 19:25:02 +01:00
inode.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
ioctl.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
Kconfig
locking.c btrfs: cleanup, remove stray return statements 2016-01-07 14:30:52 +01:00
locking.h
lzo.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
Makefile
math.h
ordered-data.c Btrfs: fix race setting block group readonly during device replace 2016-05-30 12:58:21 +01:00
ordered-data.h Btrfs: fix race setting block group readonly during device replace 2016-05-30 12:58:21 +01:00
orphan.c
print-tree.c btrfs: teach print_leaf about temporary item subtypes 2016-02-11 16:15:43 +01:00
print-tree.h
props.c btrfs: move btrfs_compression_type to compression.h 2016-03-11 17:12:46 +01:00
props.h
qgroup.c btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
qgroup.h
raid56.c btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
raid56.h
rcu-string.h
reada.c Btrfs: fix race between readahead and device replace/removal 2016-05-30 12:58:18 +01:00
relocation.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
root-tree.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
scrub.c Btrfs: fix race setting block group readonly during device replace 2016-05-30 12:58:21 +01:00
send.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
send.h Btrfs: use linux/sizes.h to represent constants 2016-01-07 14:38:02 +01:00
struct-funcs.c btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
super.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
sysfs.c btrfs: sysfs: protect reading label by lock 2016-05-06 15:22:49 +02:00
sysfs.h btrfs: sysfs: introduce helper for syncing bits with sysfs files 2016-01-21 18:50:40 +01:00
transaction.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
transaction.h btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
tree-defrag.c
tree-log.c Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
tree-log.h Btrfs: fix unreplayable log after snapshot delete + parent dir fsync 2016-03-01 08:23:25 -08:00
ulist.c btrfs: fix string and comment grammatical issues and typos 2016-05-25 22:35:14 +02:00
ulist.h
uuid-tree.c
volumes.c Btrfs: fix race between device replace and block group removal 2016-05-30 12:58:19 +01:00
volumes.h Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-20160516 2016-05-16 15:46:29 +02:00
xattr.c Btrfs: fix listxattrs not listing all xattrs packed in the same item 2016-03-01 08:23:41 -08:00
xattr.h
zlib.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00