We usually call btrfs_put_bbio() when btrfs_map_block() failed,
btrfs_put_bbio() works right whether bbio is a valid value, or NULL.
But there is a exception, in some case, btrfs_map_block() will return
fail without touching *bbio(keeping its original value), and if bbio
was not initialized yet, invalid memory accessing will happened.
Above case is in scrub_missing_raid56_pages(), and similar case in
scrub_raid56_parity().
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was never
freed anywhere.
Signed-off-by: Scott Talbert <scott.talbert@hgst.com>
Signed-off-by: David Sterba <dsterba@suse.com>
pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK.
page_index should be checked with the same to trigger BUG_ON().
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The struct 'map_lookup' uses type int for @stripe_len, while
btrfs_chunk_stripe_len() can return a u64 value, and it may end up with
@stripe_len being undefined value and it can lead to 'divide error' in
__btrfs_map_block().
This changes 'map_lookup' to use type u64 for stripe_len, also right now
we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for
BTRFS_STRIPE_LEN.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ folded division fix to scrub_raid56_parity ]
Signed-off-by: David Sterba <dsterba@suse.com>
The key variable occupies 17 bytes, the key_start is used once, we can
simply reuse existing 'key' for that purpose. As the key is not a simple
type, compiler doest not do it on itself.
Signed-off-by: David Sterba <dsterba@suse.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's basically harmless if "ref_level" isn't initialized since it's only
used for an error message, but it causes a static checker warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Xfstests btrfs/011 complains about a deadlock warning,
[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955] CPU0 CPU1
[ 1226.652955] ---- ----
[ 1226.652955] lock(&fs_info->dev_replace.lock);
[ 1226.652955] local_irq_disable();
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955] lock(&found->groups_sem);
[ 1226.652955] <Interrupt>
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***
Commit 084b6e7c76 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.
The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock
However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to
[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.
To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.
With this, btrfs/011 no more produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Scrub is not on the critical writeback path we don't need to use
GFP_NOFS for all allocations. The failures are handled and stats passed
back to userspace.
Let's use GFP_KERNEL on the paths where everything is ok, ie. setup the
global structures and the IO submission paths.
Functions that do the repair and fixups still use GFP_NOFS as we might
want to skip any other filesystem activity if we encounter an error.
This could turn out to be unnecessary, but requires more review compared
to the easy cases in this patch.
Signed-off-by: David Sterba <dsterba@suse.com>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Current code is trying to calculate rbio->dbitmap's size to make it
align to sizeof(long), but implement haven't achived this object,
it is align to sizeof(char) instead.
This patch fixed above calculation, and use sizeof(long) instead of
fixed "8" to increate compatibility.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the integers by enums for better readability. The value 2 does
not have any meaning since a717531942
"Btrfs: do less aggressive btree readahead" (2009-01-22).
Signed-off-by: David Sterba <dsterba@suse.com>
Currently scrub can race with the cleaner kthread when the later attempts
to delete an unused block group, and the result is preventing the cleaner
kthread from ever deleting later the block group - unless the block group
becomes used and unused again. The following diagram illustrates that
race:
CPU 1 CPU 2
cleaner kthread
btrfs_delete_unused_bgs()
gets block group X from
fs_info->unused_bgs and
removes it from that list
scrub_enumerate_chunks()
searches device tree using
its commit root
finds device extent for
block group X
gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())
sets bg X to RO
sees the block group is
already RO and therefore
doesn't delete it nor adds
it back to unused list
So fix this by making scrub add the block group again to the list of
unused block groups if the block group is still unused when it finished
scrubbing it and it hasn't been removed already.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Scrub can race with the cleaner kthread deleting block groups that are
unused (and with relocation too) leading to a failure with error -EINVAL
that gets returned to user space.
The following diagram illustrates how it happens:
CPU 1 CPU 2
cleaner kthread
btrfs_delete_unused_bgs()
gets block group X from
fs_info->unused_bgs
sets block group to RO
btrfs_remove_chunk(bg X)
deletes device extents
scrub_enumerate_chunks()
searches device tree using
its commit root
finds device extent for
block group X
gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())
sets bg X to RO (again)
btrfs_remove_block_group(bg X)
deletes block group from
fs_info->block_group_cache_tree
removes extent map from
fs_info->mapping_tree
scrub_chunk(offset X)
searches fs_info->mapping_tree
for extent map starting at
offset X
--> doesn't find any such
extent map
--> returns -EINVAL and scrub
errors out to userspace
with -EINVAL
Fix this by dealing with an extent map lookup failure as an indicator of
block group deletion.
Issue reproduced with fstest btrfs/071.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We don't need pass so many arguments for recheck sblock now,
this patch cleans them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can use existing scrub_checksum_data() and scrub_checksum_tree_block()
for scrub_recheck_block_checksum(), instead of write duplicated code.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should reset sblock->xxx_error stats before calling
scrub_recheck_block_checksum().
Current code run correctly because all sblock are allocated by
k[cz]alloc(), and the error stats are not got changed.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
scrub_setup_recheck_block() isn't setup all necessary fields for
sblock_to_check because history reason.
So current code need more arguments in severial functions,
and more local variables, just to passing these lacked values to
necessary place.
This patch setup above fields to sblock_to_check in
scrub_setup_recheck_block(), for:
1: more cleanup for function arg, local variable
2: to make sblock_to_check complete, then we can use sblock_to_check
without concern about some uninitialized member.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is better to show error stats to user when we found tree block
spanning stripes.
On a btrfs created by old version of btrfs-convert:
Before patch:
# btrfs scrub start -B /dev/vdh
scrub done for 8b342d35-2904-41ab-b3cb-2f929709cf47
scrub started at Tue Aug 25 21:19:09 2015 and finished after 00:00:00
total bytes scrubbed: 53.54MiB with 0 errors
# dmesg
...
[ 128.711434] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
[ 128.712744] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
...
After patch:
# btrfs scrub start -B /dev/vdh
scrub done for ff7f844b-7a4e-4b1a-88a9-8252ab25be1b
scrub started at Tue Aug 25 21:42:29 2015 and finished after 00:00:00
total bytes scrubbed: 53.60MiB with 2 errors
error details:
corrected errors: 0, uncorrectable errors: 2, unverified errors: 0
ERROR: There are uncorrectable errors.
# dmesg
...omit...
#
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs cleanups and fixes from Chris Mason:
"These are small cleanups, and also some fixes for our async worker
thread initialization.
I was having some trouble testing these, but it ended up being a
combination of changing around my test servers and a shiny new
schedule while atomic from the new start/finish_plug in
writeback_sb_inodes().
That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
changing a bunch of things in my test setup at once it would have been
really clear. Fix for writeback_sb_inodes() on the way as well"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
btrfs: async_thread: Fix workqueue 'max_active' value when initializing
btrfs: Add raid56 support for updating num_tolerated_disk_barrier_failures in btrfs_balance
btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
btrfs: Update out-of-date "skip parity stripe" comment
Pull btrfs updates from Chris Mason:
"This has Jeff Mahoney's long standing trim patch that fixes corners
where trims were missing. Omar has some raid5/6 fixes, especially for
using scrub and device replace when devices are missing.
Zhao Lie continues cleaning and fixing things, this series fixes some
really hard to hit corners in xfstests. I had to pull it last merge
window due to some deadlocks, but those are now resolved.
I added support for Tejun's new blkio controllers. It seems to work
well for single devices, we'll expand to multi-device as well"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
btrfs: fix compile when block cgroups are not enabled
Btrfs: fix file read corruption after extent cloning and fsync
Btrfs: check if previous transaction aborted to avoid fs corruption
btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
btrfs: Prevent from early transaction abort
btrfs: Remove unused arguments in tree-log.c
btrfs: Remove useless condition in start_log_trans()
Btrfs: add support for blkio controllers
Btrfs: remove unused mutex from struct 'btrfs_fs_info'
Btrfs: fix parity scrub of RAID 5/6 with missing device
Btrfs: fix device replace of a missing RAID 5/6 device
Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
Btrfs: count devices correctly in readahead during RAID 5/6 replace
Btrfs: remove misleading handling of missing device scrub
btrfs: fix clone / extent-same deadlocks
Btrfs: fix defrag to merge tail file extent
Btrfs: fix warning in backref walking
btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
btrfs: Remove root argument in extent_data_ref_count()
btrfs: Fix wrong comment of btrfs_alloc_tree_block()
...
These variables are not used from introduced version, remove them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When testing the previous patch, Zhao Lei reported a similar bug when
attempting to scrub a degraded RAID 5/6 filesystem with a missing
device, leading to NULL pointer dereferences from the RAID 5/6 parity
scrubbing code.
The first cause was the same as in the previous patch: attempting to
call bio_add_page() on a missing block device. To fix this,
scrub_extent_for_parity() can just mark the sectors on the missing
device as errors instead of attempting to read from it.
Additionally, the code uses scrub_remap_extent() to map the extent of
the corresponding data stripe, but the extent wasn't already mapped. If
scrub_remap_extent() finds a missing block device, it doesn't initialize
extent_dev, so we're left with a NULL struct btrfs_device. The solution
is to use btrfs_map_block() directly.
Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The original implementation of device replace on RAID 5/6 seems to have
missed support for replacing a missing device. When this is attempted,
we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
crashes when we try to dereference it. This happens because
btrfs_map_block() has no choice but to return us the missing device
because RAID 5/6 don't have any alternate mirrors to read from, and a
missing device has a NULL bdev.
The idea implemented here is to handle the missing device case
separately, which better only happen when we're replacing a missing RAID
5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
reconstruct the data from parity, check it with
scrub_recheck_block_checksum(), and write it out with
scrub_write_block_to_dev_replace().
Reported-by: Philip <bugzilla@philip-seeger.de>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The current RAID 5/6 recovery code isn't quite prepared to handle
missing devices. In particular, it expects a bio that we previously
attempted to use in the read path, meaning that it has valid pages
allocated. However, missing devices have a NULL blkdev, and we can't
call bio_add_page() on a bio with a NULL blkdev. We could do manual
manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
a separate path that allows us to manually add pages to the rbio.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
scrub_submit() claims that it can handle a bio with a NULL block device,
but this is misleading, as calling bio_add_page() on a bio with a NULL
->bi_bdev would've already crashed. Delete this, as we're about to
properly handle a missing block device.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
xfstests btrfs/070 sometimes failed.
In my test machine, its fail rate is about 30%.
In another vm(vmware), its fail rate is about 50%.
Reason:
btrfs/070 do replace and defrag with fsstress simultaneously,
after above operation, checksum error is found by scrub.
Actually, it have no relationship with defrag operation, only
replace with fsstress can trigger this bug.
New data writen to target device have possibility rewrited by
old data from source device by replace code in debug, to avoid
above problem, we can set target block group to readonly in
replace period, so new data requested by other operation will
not write to same place with replace code.
Before patch(4.1-rc3):
30% failed in 100 xfstests.
After patch:
0% failed in 300 xfstests.
It also happened in btrfs/071 as it's another scrub with IO load tests.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use new intruduced scrub_pause_on/off() can make this code block
clean and more readable.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It can reduce current duplicated code which is similar to
scrub_blocked_if_needed() but can not call it because little
different.
It also used by my next patch which is in same case.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we access extent_root in scrub_stripe() and
scrub_raid56_parity(), we need bypass unrelated tree item firstly
before using its contents to do other condition.
It is not a bug fix, only making code sequence in logic.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We need not load csum of whole strip in scrub because strip is trimed
before use, it is to say, what we really need to calculate csum is
data between [extent_logical, extent_len).
This patch changed to use above segment for btrfs_lookup_csums_range()
in scrub_stripe()
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For example, in scrub_raid56_parity(), following lines are used
to judge is all data processed:
place1: if (key.objectid > logic_end) ...
place2: if (logic_start >= logic_end) ...
...
(place2 is typo, is should be ">", it is copied from other
place, where logic_end's meaning is different, long story...)
We can fix above typo directly, but the root reason is ambiguous
meaning of logic_end in scrub raid56 parity.
In other place, XXX_end is pointed to data which is not included,
and we need to process segment of [XXX_start, XXX_end).
But for scrub raid56 parity, logic_end is pointed to lattest data
need to process, and introduced many "+ 1" and "- 1" in code as
below:
length = sparity->logic_end - sparity->logic_start + 1
logic_end - logic_start + 1
stripe_logical + increment - 1
This patch changed logic_end's meaning to make it in normal understanding
in raid56 parity functions and data struct alone with above bugfix.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When scrub_extent() failed, we need to free previois created
checksum list.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Old code checking cancel and pause request inside scrub stripe
operation, like:
loop() {
if (parity) {
scrub_parity_stripe();
continue;
}
check_cancel_and_pause()
scrub_normal_stripe();
}
Reason is when introduce raid56 stripe scrub, new code is inserted
simplely to front of loop.
Better to:
loop() {
check_cancel_and_pause()
if (parity)
scrub_parity_stripe();
else
scrub_normal_stripe();
}
This patch adjusted code place to realize above sequence.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Scrub panic in following operation:
mkfs.ext4 /dev/vdh
btrfs-convert /dev/vdh
mount /dev/vdh /mnt/tmp1
btrfs scrub start -B /dev/vdh
(panic)
Reason:
1: In some case, leaf created by btrfs-convert was splited into 2
strips.
2: Scrub bypassed part of above wrong leaf data, but remain data
caused panic in scrub_checksum_tree_block().
For reason 1:
we can get following information after some simple operation.
a. mkfs.ext4 /dev/vdh
btrfs-convert /dev/vdh
b. btrfs-debug-tree /dev/vdh
we can see following item in extent tree:
item 25 key (27054080 METADATA_ITEM 0) itemoff 15083 itemsize 33
Its logical address is [27054080, 27070464)
and acrossed 2 strips:
[27000832, 27066368)
[27066368, 27131904)
Will be fixed in btrfs-progs(btrfs-convert, btrfsck, ...)
For reason 2:
Scrub is trying to do a "bypass" in this case, but the result is
"panic", because current code lacks of some condition in bypass,
and let some wrong leaf data escaped.
This patch fixed above scrub code.
Before patch:
# btrfs scrub start -B /dev/vdh
(panic)
After patch:
# btrfs scrub start -B /dev/vdh
scrub done for 353cec8f-da31-4a94-aa35-be72d997b06e
...
# dmesg
...
[ 59.088697] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
[ 59.089929] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
#
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Although it is a rare case, we'd better free previous allocated
memory on error.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
lockdep report following warning in test:
[25176.843958] =================================
[25176.844519] [ INFO: inconsistent lock state ]
[25176.845047] 4.1.0-rc3 #22 Tainted: G W
[25176.845591] ---------------------------------
[25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
[25176.847246] (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
[25176.847838] {SOFTIRQ-ON-W} state was registered at:
[25176.848396] [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
[25176.848955] [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
[25176.849491] [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
[25176.850029] [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
[25176.850575] [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
[25176.851110] [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
[25176.851660] [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
[25176.852189] [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
[25176.852771] [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
[25176.853315] [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
[25176.853868] [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
[25176.854406] [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
[25176.854935] irq event stamp: 51506
[25176.855511] hardirqs last enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
[25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
[25176.856642] softirqs last enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
[25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
[25176.857746]
other info that might help us debug this:
[25176.858845] Possible unsafe locking scenario:
[25176.859981] CPU0
[25176.860537] ----
[25176.861059] lock(&wr_ctx->wr_lock);
[25176.861705] <Interrupt>
[25176.862272] lock(&wr_ctx->wr_lock);
[25176.862881]
*** DEADLOCK ***
Reason:
Above warning is caused by:
Interrupt
-> bio_endio()
-> ...
-> scrub_put_ctx()
-> scrub_free_ctx() *1
-> ...
-> mutex_lock(&wr_ctx->wr_lock);
scrub_put_ctx() is allowed to be called in end_bio interrupt, but
in code design, it will never call scrub_free_ctx(sctx) in interrupe
context(above *1), because btrfs_scrub_dev() get one additional
reference of sctx->refs, which makes scrub_free_ctx() only called
withine btrfs_scrub_dev().
Now the code runs out of our wish, because free sequence in
scrub_pending_bio_dec() have a gap.
Current code:
-----------------------------------+-----------------------------------
scrub_pending_bio_dec() | btrfs_scrub_dev
-----------------------------------+-----------------------------------
atomic_dec(&sctx->bios_in_flight); |
wake_up(&sctx->list_wait); |
| scrub_put_ctx()
| -> atomic_dec_and_test(&sctx->refs)
scrub_put_ctx(sctx); |
-> atomic_dec_and_test(&sctx->refs)|
-> scrub_free_ctx() |
-----------------------------------+-----------------------------------
We expected:
-----------------------------------+-----------------------------------
scrub_pending_bio_dec() | btrfs_scrub_dev
-----------------------------------+-----------------------------------
atomic_dec(&sctx->bios_in_flight); |
wake_up(&sctx->list_wait); |
scrub_put_ctx(sctx); |
-> atomic_dec_and_test(&sctx->refs)|
| scrub_put_ctx()
| -> atomic_dec_and_test(&sctx->refs)
| -> scrub_free_ctx()
-----------------------------------+-----------------------------------
Fix:
Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
in interrupt context.
Tested by check tracelog in debug.
Changelog v1->v2:
Use workqueue instead of adjust function call sequence in v1,
because v1 will introduce a bug pointed out by:
Filipe David Manana <fdmanana@gmail.com>
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type.
Get rid of a few more do_div instances.
Signed-off-by: David Sterba <dsterba@suse.cz>
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.
Signed-off-by: David Sterba <dsterba@suse.cz>
Pull btrfs updates from Chris Mason:
"This pull is mostly cleanups and fixes:
- The raid5/6 cleanups from Zhao Lei fixup some long standing warts
in the code and add improvements on top of the scrubbing support
from 3.19.
- Josef has round one of our ENOSPC fixes coming from large btrfs
clusters here at FB.
- Dave Sterba continues a long series of cleanups (thanks Dave), and
Filipe continues hammering on corner cases in fsync and others
This all was held up a little trying to track down a use-after-free in
btrfs raid5/6. It's not clear yet if this is just made easier to
trigger with this pull or if its a new bug from the raid5/6 cleanups.
Dave Sterba is the only one to trigger it so far, but he has a
consistent way to reproduce, so we'll get it nailed shortly"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
Btrfs: don't remove extents and xattrs when logging new names
Btrfs: fix fsync data loss after adding hard link to inode
Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
Btrfs: account for large extents with enospc
Btrfs: don't set and clear delalloc for O_DIRECT writes
Btrfs: only adjust outstanding_extents when we do a short write
btrfs: Fix out-of-space bug
Btrfs: scrub, fix sleep in atomic context
Btrfs: fix scheduler warning when syncing log
Btrfs: Remove unnecessary placeholder in btrfs_err_code
btrfs: cleanup init for list in free-space-cache
btrfs: delete chunk allocation attemp when setting block group ro
btrfs: clear bio reference after submit_one_bio()
Btrfs: fix scrub race leading to use-after-free
Btrfs: add missing cleanup on sysfs init failure
Btrfs: fix race between transaction commit and empty block group removal
btrfs: add more checks to btrfs_read_sys_array
btrfs: cleanup, rename a few variables in btrfs_read_sys_array
btrfs: add checks for sys_chunk_array sizes
btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
...
My previous patch "Btrfs: fix scrub race leading to use-after-free"
introduced the possibility to sleep in an atomic context, which happens
when the scrub_lock mutex is held at the time scrub_pending_bio_dec()
is called - this function can be called under an atomic context.
Chris ran into this in a debug kernel which gave the following trace:
[ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621
[ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress
[ 1928.981324] INFO: lockdep is turned off.
[ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G W 3.19.0-rc7-mason+ #41
[ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012
[ 1929.022207] ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78
[ 1929.037267] ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8
[ 1929.052315] 0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8
[ 1929.067381] Call Trace:
[ 1929.072344] <IRQ> [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e
[ 1929.083968] [<ffffffff810856c4>] ___might_sleep+0x174/0x230
[ 1929.095352] [<ffffffff810857d2>] __might_sleep+0x52/0x90
[ 1929.106223] [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0
[ 1929.117951] [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10
[ 1929.129708] [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs]
[ 1929.143370] [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs]
[ 1929.157191] [<ffffffff812fa603>] bio_endio+0x53/0xa0
[ 1929.167382] [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs]
[ 1929.180161] [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs]
[ 1929.194318] [<ffffffff812fa603>] bio_endio+0x53/0xa0
[ 1929.204496] [<ffffffff8130401b>] blk_update_request+0x1eb/0x450
[ 1929.216569] [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500
[ 1929.229176] [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0
[ 1929.240740] [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0
[ 1929.252654] [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150
[ 1929.264725] [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170
[ 1929.276635] [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0
[ 1929.288014] [<ffffffff8105d92e>] __do_softirq+0xde/0x600
[ 1929.298885] [<ffffffff8105df6d>] irq_exit+0xbd/0xd0
(...)
Fix this by using a reference count on the scrub context structure
instead of locking the scrub_lock mutex.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The xfstests btrfs/072 reports uncorrectable read errors in dmesg,
because scrub forgets to use commit_root for parity scrub routine
and scrub attempts to scrub those extents items whose contents are
not fully on disk.
To fix it, we just add the @search_commit_root flag back.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
refs is better than ref_count to record a struct's ref count.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Suggested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
So we can check raid56 with:
(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
instead of long:
(map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Corrent code use many kinds of "clever" way to determine operation
target's raid type, as:
raid_map != NULL
or
raid_map[MAX_NR] == RAID[56]_Q_STRIPE
To make code easy to maintenance, this patch put raid type into
bbio, and we can always get raid type from bbio with a "stupid"
way.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
scrub_setup_recheck_block() have many arguments but most of them
can be get from one of them, we can remove them to make code clean.
Some other cleanup for that function also included in this patch.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The code are similar, combine them to make code clean and easy to maintenance.
Some lost condition are also completed with benefit of this combination.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
In corrent code, code of finding-right-mirror and writing-to-target
are mixed in logic, if we find a right mirror but failed in writing
to target, it will treat as "hadn't found right block", and fill the
target with sblock_bad.
Actually, "failed in writing to target" does not mean "source
block is wrong", this patch separate above two condition in logic,
and do some cleanup to make code clean.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use break instead of useless loop should be more suitable in this
case.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is always 1 in this place, because !1 case was already jumped
out in previous code.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
if (sctx->is_dev_replace && !is_metadata && !have_csum) {
...
goto nodatasum_case;
}
...
nodatasum_case:
WARN_ON(sctx->is_dev_replace);
In above code, nodatasum_case marker should be moved after
WARN_ON().
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
forced data type checking in compile.
Changelog v1->v2:
Rename following by David Sterba's suggestion:
put_btrfs_bio() -> btrfs_put_bio()
get_btrfs_bio() -> btrfs_get_bio()
bbio->ref_count -> bbio->refs
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It can make code more simple and clear, we need not care about
free bbio and raid_map together.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We add the number of stripes on target devices into bbio->num_stripes
if we are under device replacement, and we just sort the raid_map of
those stripes that not on the target devices, so if when we need
real raid_map, we need skip the stripes on the target devices.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The address that should be freed is not 'ppath' but 'path'.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper
for simple btrfs_search_slot.
After we've removed all such callers, passing a NULL key is not valid
anymore.
Signed-off-by: David Sterba <dsterba@suse.cz>
The only way that "ret" is set is when we call scrub_pages_for_parity()
so the skip to "if (ret) " test doesn't make sense and causes a static
checker warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.
The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The implementation is:
- Read and check all the data with checksum in the same stripe.
All the data which has checksum is COW data, and we are sure
that it is not changed though we don't lock the stripe. because
the space of that data just can be reclaimed after the current
transction is committed, and then the fs can use it to store the
other data, but when doing scrub, we hold the current transaction,
that is that data can not be recovered, it is safe that read and check
it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
The data without checksum and the parity may be changed if we don't
lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
is not right
- Unlock the stripe
If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.
And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This can be reproduced by fstests: btrfs/070
The scenario is like the following:
replace worker thread defrag thread
--------------------- -------------
copy_nocow_pages_worker btrfs_defrag_file
copy_nocow_pages_for_inode ...
btrfs_writepages
|A| lock_extent_bits extent_write_cache_pages
|B| lock_page
__extent_writepage
... writepage_delalloc
find_lock_delalloc_range
|B| lock_extent_bits
find_or_create_page
pagecache_get_page
|A| lock_page
This leads to an ABBA pattern deadlock. To fix it,
o we just change it to an AABB pattern which means to @unlock_extent_bits()
before we @lock_page(), and in this way the @extent_read_full_page_nolock()
is no longer in an locked context, so change it back to @extent_read_full_page()
to regain protection.
o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
the extent may not really point at the physical block we want, so we
have to check it before write.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The super block generation of the seed devices is not the same as the
filesystem which sprouted from them because we don't update the super
block on the seed devices when we change that new filesystem. So we
should not use the generation of that new filesystem to check the super
block generation on the seed devices, Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
All the metadata in the seed devices has the same fsid as the fsid
of the seed filesystem which is on the seed device, so we should check
them by the current filesystem. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.
Shaves a few bytes from .text:
text data bss dec hex filename
852418 24560 23112 900090 dbbfa btrfs.ko.before
851074 24584 23112 898770 db6d2 btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.
Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --
normal_work_helper(arg)
work = container_of(arg, struct btrfs_work, normal_work);
work->func() <---- (we name it work X)
for ordered_work in wq->ordered_list
ordered_work->ordered_func()
ordered_work->ordered_free()
The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will
file a readahead request
btrfs_readpages()
for page that is not in page cache
__do_readpage()
submit_extent_page()
btrfs_submit_bio_hook()
btrfs_bio_wq_end_io()
submit_bio()
end_workqueue_bio() <--(ret by the 1st endio)
queue a work(named work Y) for the 2nd
also the real endio()
So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.
A bit more explanation,
A,B,C -- struct btrfs_work
arg -- struct work_struct
kthread:
worker_thread()
pick up a work_struct from @worklist
process_one_work(arg)
worker->current_work = arg; <-- arg is A->normal_work
worker->current_func(arg)
normal_work_helper(arg)
A = container_of(arg, struct btrfs_work, normal_work);
A->func()
A->ordered_func()
A->ordered_free() <-- A gets freed
B->ordered_func()
submit_compressed_extents()
find_free_extent()
load_free_space_inode()
... <-- (the above readhead stack)
end_workqueue_bio()
btrfs_queue_work(work C)
B->ordered_free()
As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.
Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).
When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.
So the situation is that our kthread is waiting forever on work C.
Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.
With this patch, I no long hit the above hang.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should not write data into a readonly device especially seed device when
doing scrub, skip those devices.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>