Revert commit 11367799f3 ("md: Prevent IO hold during accessing to faulty
raid5 array") as it doesn't comply with commit c3cce6cda1 ("md/raid5:
ensure device failure recorded before write request returns."). That change
is not required anymore as the problem is resolved by commit 16f889499a
("md: report 'write_pending' state when array in sync") - read request is
stuck as array state is not reported correctly via sysfs attribute.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull MD updates from Shaohua Li:
"This update includes:
- new AVX512 instruction based raid6 gen/recovery algorithm
- a couple of md-cluster related bug fixes
- fix a potential deadlock
- set nonrotational bit for raid array with SSD
- set correct max_hw_sectors for raid5/6, which hopefuly can improve
performance a little bit
- other minor fixes"
* tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md: set rotational bit
raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
raid5: handle register_shrinker failure
raid5: fix to detect failure of register_shrinker
md: fix a potential deadlock
md/bitmap: fix wrong cleanup
raid5: allow arbitrary max_hw_sectors
lib/raid6: Add AVX512 optimized xor_syndrome functions
lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
lib/raid6: Add AVX512 optimized recovery functions
lib/raid6: Add AVX512 optimized gen_syndrome functions
md-cluster: make resync lock also could be interruptted
md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
md-cluster: convert the completion to wait queue
md-cluster: protect md_find_rdev_nr_rcu with rcu lock
md-cluster: clean related infos of cluster
md: changes for MD_STILL_CLOSED flag
md-cluster: remove some unnecessary dlm_unlock_sync
md-cluster: use FORCEUNLOCK in lockres_free
md-cluster: call md_kick_rdev_from_array once ack failed
Pull CPU hotplug updates from Thomas Gleixner:
"Yet another batch of cpu hotplug core updates and conversions:
- Provide core infrastructure for multi instance drivers so the
drivers do not have to keep custom lists.
- Convert custom lists to the new infrastructure. The block-mq custom
list conversion comes through the block tree and makes the diffstat
tip over to more lines removed than added.
- Handle unbalanced hotplug enable/disable calls more gracefully.
- Remove the obsolete CPU_STARTING/DYING notifier support.
- Convert another batch of notifier users.
The relayfs changes which conflicted with the conversion have been
shipped to me by Andrew.
The remaining lot is targeted for 4.10 so that we finally can remove
the rest of the notifiers"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
cpufreq: Fix up conversion to hotplug state machine
blk/mq: Reserve hotplug states for block multiqueue
x86/apic/uv: Convert to hotplug state machine
s390/mm/pfault: Convert to hotplug state machine
mips/loongson/smp: Convert to hotplug state machine
mips/octeon/smp: Convert to hotplug state machine
fault-injection/cpu: Convert to hotplug state machine
padata: Convert to hotplug state machine
cpufreq: Convert to hotplug state machine
ACPI/processor: Convert to hotplug state machine
virtio scsi: Convert to hotplug state machine
oprofile/timer: Convert to hotplug state machine
block/softirq: Convert to hotplug state machine
lib/irq_poll: Convert to hotplug state machine
x86/microcode: Convert to hotplug state machine
sh/SH-X3 SMP: Convert to hotplug state machine
ia64/mca: Convert to hotplug state machine
ARM/OMAP/wakeupgen: Convert to hotplug state machine
ARM/shmobile: Convert to hotplug state machine
arm64/FP/SIMD: Convert to hotplug state machine
...
register_shrinker() now can fail. When it happens, shrinker.nr_deferred is
null. We use it to determine if unregister_shrinker is required.
Signed-off-by: Shaohua Li <shli@fb.com>
register_shrinker can fail after commit 1d3d4437ea ("vmscan: per-node
deferred work"), we should detect the failure of it, otherwise we may
fail to register shrinker after raid5 configuration was setup successfully.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
raid5 will split bio to proper size internally, there is no point to use
underlayer disk's max_hw_sectors. In my qemu system, without the change,
the raid5 only receives 128k size bio, which reduces the chance of bio
merge sending to underlayer disks.
Signed-off-by: Shaohua Li <shli@fb.com>
commit 5f9d1fde7d54a5(raid5: fix memory leak of bio integrity data)
moves bio_reset to bio_endio. But it introduces a small race condition.
It does bio_reset after raid5_release_stripe, which could make the
stripe reusable and hence reuse the bio just before bio_reset. Moving
bio_reset before raid5_release_stripe is called should fix the race.
Reported-and-tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Signed-off-by: Shaohua Li <shli@fb.com>
If there aren't enough stripes, reshape will hang. We have a check for
this in new reshape, but miss it for reshape resume, hence we could see
hang in reshape resume. This patch forces enough stripes existed if
reshape resumes.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Pull MD fixes from Shaohua Li:
"This includes several bug fixes:
- Alexey Obitotskiy fixed a hang for faulty raid5 array with external
management
- Song Liu fixed two raid5 journal related bugs
- Tomasz Majchrzak fixed a bad block recording issue and an
accounting issue for raid10
- ZhengYuan Liu fixed an accounting issue for raid5
- I fixed a potential race condition and memory leak with DIF/DIX
enabled
- other trival fixes"
* tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
raid5: avoid unnecessary bio data set
raid5: fix memory leak of bio integrity data
raid10: record correct address of bad block
md-cluster: fix error return code in join()
r5cache: set MD_JOURNAL_CLEAN correctly
md: don't print the same repeated messages about delayed sync operation
md: remove obsolete ret in md_start_sync
md: do not count journal as spare in GET_ARRAY_INFO
md: Prevent IO hold during accessing to faulty raid5 array
MD: hold mddev lock to change bitmap location
raid5: fix incorrectly counter of conf->empty_inactive_list_nr
raid10: increment write counter after bio is split
bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to
set them every time. bi_private will be set before the bio is
dispatched.
Signed-off-by: Shaohua Li <shli@fb.com>
Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bio, instead it reuses bios. There are two issues in
current code:
1. the code calls bio_init (from
init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
The bio is reused, so likely there is integrity data attached. bio_init
will clear a pointer to integrity data and makes bio_reset can't release
the data
2. bio_reset is called before dispatching bio. After bio is finished,
it's possible we don't free bio's integrity data (eg, we don't call
bio_reset again)
Both issues will cause memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This will fix the two issues.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.
With this patch, the MD_JOURNAL_CLEAN is only set when the journal
device presents.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
After array enters in faulty state (e.g. number of failed drives
becomes more then accepted for raid5 level) it sets error flags
(one of this flags is MD_CHANGE_PENDING). For internal metadata
arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for
external metadata arrays. MD_CHANGE_PENDING flag set prevents to
finish all new or non-finished IOs to array and hold them in
pending state. In some cases this can leads to deadlock situation.
For example, we have faulty array (2 of 4 drives failed) and
udev handle array state changes and blkid started (or other
userspace application that used array to read/write) but unable
to finish reads due to IO hold. At the same time we unable to get
exclusive access to array (to stop array in our case) because
another external application still use this array.
Fix makes possible to return IO with errors immediately.
So external application can finish working with array and
give exclusive access to other applications to perform
required management actions with array.
Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The counter conf->empty_inactive_list_nr is only used for determine if the
raid5 is congested which is deal with in function raid5_congested().
It was increased in get_free_stripe() when conf->inactive_list got to be
empty and decreased in release_inactive_stripe_list() when splice
temp_inactive_list to conf->inactive_list. However, this may have a
problem when raid5_get_active_stripe or stripe_add_to_batch_list was called,
because these two functions may call list_del_init(&sh->lru) to delete sh from
"conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to
be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be
done at these two point and increase empty_inactive_list_nr accordingly.
Otherwise the counter may get to be negative number which would influence
async readahead from VFS.
Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
which can delay several milliseconds in some case.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices are removed all at once - these delays can add up to be very inconcenient.
As failure is not reversible we can check for that first, setting a
separate flag if it is found, and then all synchronize_rcu() once for
all the flagged devices. Then ->hot_remove_disk() function can skip the
synchronize_rcu() step if the flag is set.
fix build error(Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
It is important that we never increment rdev->nr_pending on a Faulty
device as ->hot_remove_disk() assumes that once the Faulty flag is visible
no code will take a new reference.
Some places take a new reference after only check In_sync. This should
be safe as the two are changed together. However to make the code more
obviously safe, add checks for 'Faulty' as well.
Note: the actual rule is:
Never increment nr_pending if Faulty is set and Blocked is clear,
never clear Faulty, and never set Blocked without holding a reference
through nr_pending.
fix build error (Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Being in the middle of resync is no longer protection against failed
rdevs disappearing. So add rcu protection.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The rdev could be freed while handle_failed_sync is running, so
rcu protection is needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In current handle_stripe_dirtying, the code prefers rmw with
PARITY_ENABLE_RMW; while prefers rcw with PARITY_PREFER_RMW.
This patch reverses this behavior.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Some code waits for a metadata update by:
1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared
If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.
So make sure all places that set MD_CHANGE_PENDING are atomicial, and
bit_clear_unless (suggested by Neil) is introduced for the purpose.
Cc: Martin Kepplinger <martink@posteo.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: <linux-kernel@vger.kernel.org>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
In case md runs underneath the dm-raid target, the mddev does not have
a request queue or gendisk, thus avoid accesses.
This patch adds a missing conditional to the raid5 personality.
Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If device has R5_LOCKED set, it's legit device has R5_SkipCopy set and page !=
orig_page. After R5_LOCKED is clear, handle_stripe_clean_event will clear the
SkipCopy flag and set page to orig_page. So the warning is unnecessary.
Reported-by: Joey Liao <joeyliao@qnap.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The raid456_cpu_notify() hotplug callback lacks handling of the
CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch
buffer is leaked.
Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions
to free the scratch buffer.
CC: Shaohua Li <shli@kernel.org>
CC: linux-raid@vger.kernel.org
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Neil recently fixed an obscure race in break_stripe_batch_list. Debug would be
quite convenient if we know the stripe state. This is what this patch does.
Signed-off-by: Shaohua Li <shli@fb.com>
break_stripe_batch_list breaks up a batch and copies some flags from
the batch head to the members, preserving others.
It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not
normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
stripe_head is added to a batch, and is not set on stripe_heads
already in a batch.
However there is no locking to ensure one thread doesn't set the flag
after it has just been cleared in another. This does occasionally happen.
md/raid5 maintains a count of the number of stripe_heads with
STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When
break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
this could becomes incorrect and will never again return to zero.
md/raid5 delays the handling of some stripe_heads until
preread_active_stripes becomes zero. So when the above mention race
happens, those stripe_heads become blocked and never progress,
resulting is write to the array handing.
So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
in the members of a batch.
URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
Reported-by: Martin Svec <martin.svec@zoner.cz> (and others)
Tested-by: Tom Weber <linux@junkyard.4t2.com>
Fixes: 1b956f7a8f ("md/raid5: be more selective about distributing flags across batch.")
Cc: stable@vger.kernel.org (v4.1 and later)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Revert commit
e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe)
The problem is raid5_get_active_stripe waits on
conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes
in this order:
- release all stripes with hash 0
- raid5_get_active_stripe still sleeps since active_stripes >
max_nr_stripes * 3 / 4
- release all stripes with hash other than 0. active_stripes becomes 0
- raid5_get_active_stripe still sleeps, since nobody wakes up
wait_for_stripe[0]
The system live locks. The problem is active_stripes isn't a per-hash
count. Revert the patch makes the live lock go away.
Cc: stable@vger.kernel.org (v4.2+)
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
check_reshape() is called from raid5d thread. raid5d thread shouldn't
call mddev_suspend(), because mddev_suspend() waits for all IO finish
but IO is handled in raid5d thread, we could easily deadlock here.
This issue is introduced by
738a273 ("md/raid5: fix allocation of 'scribble' array.")
Cc: stable@vger.kernel.org (v4.1+)
Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
'max_discard_sectors' is in sectors, while 'stripe' is in bytes.
This fixes the problem where DISCARD would get disabled on some larger
RAID5 configurations (6 or more drives in my testing), while it worked
as expected with smaller configurations.
Fixes: 620125f2bf ("MD: raid5 trim support")
Cc: stable@vger.kernel.org v3.7+
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Add support for journal disk hot add/remove. Mostly trival checks in md
part. The raid5 part is a little tricky. For hot-remove, we can't wait
pending write as it's called from raid5d. The wait will cause deadlock.
We simplily fail the hot-remove. A hot-remove retry can success
eventually since if journal disk is faulty all pending write will be
failed and finish. For hot-add, since an array supporting journal but
without journal disk will be marked read-only, we are safe to hot add
journal without stopping IO (should be read IO, while journal only
handles write IO).
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
The stripe_add_to_batch_list() function is called only if
stripe_can_batch() returned true, so there is no need for double check.
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Neil Brown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Two major components to this update.
1/ the clustered-raid1 support from SUSE is nearly
complete. There are a few outstanding issues being
worked on. Maybe half a dozen patches will bring
this to a usable state.
2/ The first stage of journalled-raid5 support from
Facebook makes an appearance. With a journal
device configured (typically NVRAM or SSD), the
"RAID5 write hole" should be closed - a crash
during degraded operations cannot result in data
corruption.
The next stage will be to use the journal as a
write-behind cache so that latency can be reduced
and in some cases throughput increased by
performing more full-stripe writes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWNX9RAAoJEDnsnt1WYoG5bYMP/jI0pV3wcbs7mZQAa8S/V0lU
2l25x4MdwDvqVKMfjIc/C5J08QNgcrgSvhiVPCEOK0w18q395vep9f6gFKbMHhu/
lWU3PLHGw8XBHp5yEnxrpQkN0pRrNjh5NqIdlVMBNyL6u+RZPS2ZuzxJ8wiNAFg1
MypNkgoUu6s+nBp4DWWnMGYhBc+szBR+gTYAzGiZ8vqOH9uiSJ2SsGG5aRVUN/af
oMYvJAf9aA6uj+xSzNlXIaLfWJIrshQYS1jU/W4gTm0DwK9yqbTxvubJaE0SGu/o
73FGU8tmQ6ELYfsp3D/jmfUkE7weiNEQhdVb/4wy1A/SGc+W7Ju9pxfhm8ra57s0
/BCkfwWZXEvx1flegXfK1mC6EMpMIcGAD2FQEhmQbW6wTdDwtNyEhIePDVGJwD/F
rhEThFa+Dg9+xnBGnS6OUK3EpXgml2hAeAC7uA3TVSAnWd/9/Mpim6fZhqrB/v9L
Ik0tZt+H4nxYaheZjKlKhuXUQYcUWGiMb67bGMem/YAlMa4y9C9qF+9mPXxyjVlI
hBsd5SfZNz99DyB/bO8BumQeIWlTfzLeFzWW67eQ864LRKO6k0/VIbPZHCfn2oVG
XvyC2fUhNOIURP3IMxcyHYxOA7Mu6EDsVVDTpuqLVbZQ5IPjDEfQ54yB/BLUvbX/
Gh2/tKn7Xc25HuLAFEbs
=TD5o
-----END PGP SIGNATURE-----
Merge tag 'md/4.4' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"Two major components to this update.
1) The clustered-raid1 support from SUSE is nearly complete. There
are a few outstanding issues being worked on. Maybe half a dozen
patches will bring this to a usable state.
2) The first stage of journalled-raid5 support from Facebook makes an
appearance. With a journal device configured (typically NVRAM or
SSD), the "RAID5 write hole" should be closed - a crash during
degraded operations cannot result in data corruption.
The next stage will be to use the journal as a write-behind cache
so that latency can be reduced and in some cases throughput
increased by performing more full-stripe writes.
* tag 'md/4.4' of git://neil.brown.name/md: (66 commits)
MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
MD: set journal disk ->raid_disk
MD: kick out journal disk if it's not fresh
raid5-cache: start raid5 readonly if journal is missing
MD: add new bit to indicate raid array with journal
raid5-cache: IO error handling
raid5: journal disk can't be removed
raid5-cache: add trim support for log
MD: fix info output for journal disk
raid5-cache: use bio chaining
raid5-cache: small log->seq cleanup
raid5-cache: new helper: r5_reserve_log_entry
raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
raid5-cache: take rdev->data_offset into account early on
raid5-cache: refactor bio allocation
raid5-cache: clean up r5l_get_meta
raid5-cache: simplify state machine when caches flushes are not needed
raid5-cache: factor out a helper to run all stripes for an I/O unit
raid5-cache: rename flushed_ios to finished_ios
raid5-cache: free I/O units earlier
...
Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of
0, because we already have a disk with ->raid_disk 0 and this causes
sysfs entry creation conflict. A lot of places assumes disk with
->raid_disk >=0 is normal raid disk, so we add check for journal disk.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
If raid array is expected to have journal (eg, journal is set in MD
superblock feature map) and the array is started without journal disk,
start the array readonly.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
There are 3 places the raid5-cache dispatches IO. The discard IO error
doesn't matter, so we ignore it. The superblock write IO error can be
handled in MD core. The remaining are log write and flush. When the IO
error happens, we mark log disk faulty and fail all write IO. Read IO is
still allowed to run. Userspace will get a notification too and
corresponding daemon can choose setting raid array readonly for example.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
raid5-cache uses journal disk rdev->bdev, rdev->mddev in several places.
Don't allow journal disk disappear magically. On the other hand, we do
need to update superblock for other disks to bump up ->events, so next
time journal disk will be identified as stale.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Move reclaim stop to quiesce handling, where is safer for this stuff.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
With log enabled, bio is written to raid disks after the bio is settled
down in log disk. The recovery guarantees we can recovery the bio data
from log disk, so we we skip FLUSH IO.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Before we write stripe data to raid disks, we must guarantee stripe data
is settled down in log disk. To do this, we flush log disk cache and
wait the flush finish. That wait introduces sleep time in raid5d thread
and impact performance. This patch moves the log disk cache flush
process to the stripe handling state machine, which can remove the wait
in raid5d.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
If cache(log) support is enabled, don't allow resize/reshape in current
stage. In the future, we can flush all data from cache(log) to raid
before resize/reshape and then allow resize/reshape.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
With log enabled, r5l_write_stripe will add the stripe to log. With
batch, several stripes are linked together. The stripes must be in the
same state. While with log, the log/reclaim unit is stripe, we can't
guarantee the several stripes are in the same state. Disabling batch for
log now.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
After commit 566c09c534 ("raid5: relieve lock contention in get_active_stripe()")
__find_stripe() is called under conf->hash_locks + hash.
But handle_stripe_clean_event() calls remove_hash() under
conf->device_lock.
Under some cirscumstances the hash chain can be circuited,
and we get an infinite loop with disabled interrupts and locked hash
lock in __find_stripe(). This leads to hard lockup on multiple CPUs
and following system crash.
I was able to reproduce this behavior on raid6 over 6 ssd disks.
The devices_handle_discard_safely option should be set to enable trim
support. The following script was used:
for i in `seq 1 32`; do
dd if=/dev/zero of=large$i bs=10M count=100 &
done
neilb: original was against a 3.x kernel. I forward-ported
to 4.3-rc. This verison is suitable for any kernel since
Commit: 59fc630b8b ("RAID5: batch adjacent full stripe write")
(v4.1+). I'll post a version for earlier kernels to stable.
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Fixes: 566c09c534 ("raid5: relieve lock contention in get_active_stripe()")
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org> # 3.13 - 4.2
This is the reclaim support for raid5 log. A stripe write will have
following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused
In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.
It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.
Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.
Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.
There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.
The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.
Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.
The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.
For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.
For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.
flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
When a stripe finishes construction, we write the stripe to raid in
ops_run_io normally. With log, we do a bunch of other operations before
the stripe is written to raid. Mainly write the stripe to log disk,
flush disk cache and so on. The operations are still driven by raid5d
and run in the stripe state machine. We introduce a new state for such
stripe (trapped into log). The stripe is in this state from the time it
first enters ops_run_io (finish construction) to the time it is written
to raid. Since we know the state is only for log, we bypass other
check/operation in handle_stripe.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Next several patches use some raid5 functions, rename them with raid5
prefix and export out.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Suspending the entire device for resync could take too long. Resync
in small chunks.
cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:
1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
list.
bitmap_cond_end_sync is modified to allow to force a sync inorder
to get the curr_resync_completed uptodate with the sector passed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Remove unneeded NULL test.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@ expression x; @@
-if (x != NULL)
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
When need_this_block probably shouldn't be called when there
are more than 2 failed devices, we really don't want it to try
indexing beyond the end of the failed_num[] of fdev[] arrays.
So limit the loops to at most 2 iterations.
Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
handle_failed_stripe() makes the stripe fail, eg, all IO will return
with a failure, but it doesn't update stripe_head_state. Later
handle_stripe() has special handling for raid6 for handle_stripe_fill().
That check before handle_stripe_fill() doesn't skip the failed stripe
and we get a kernel crash in need_this_block. This patch clear the
analysis state to make sure no functions wrongly called after
handle_failed_stripe()
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
When a write to one of the devices of a RAID5/6 fails, the failure is
recorded in the metadata of the other devices so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again (maybe a cable was unplugged).
Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.
Currently there is no interlock between the write request completing
and the metadata update. So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.
This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.
So:
- set MD_CHANGE_PENDING when requesting a metadata update for a
failed device, so we can know with certainty when it completes
- queue requests that completed when MD_CHANGE_PENDING is set to
only be processed after the metadata update completes
- call raid_end_bio_io() on bios in that queue when the time comes.
Signed-off-by: NeilBrown <neilb@suse.com>
It is possible (though unlikely) for a reshape to be
interrupted between the time that end_reshape is called
and the time when raid5_finish_reshape is called.
This can leave conf->reshape_progress set to MaxSector,
but mddev->reshape_position not.
This combination confused reshape_request() when ->reshape_backwards.
As conf->reshape_progress is so high, it seems the reshape hasn't
really begun. But assuming MaxSector is a valid address only
leads to sorrow.
So ensure reshape_position and reshape_progress both agree,
and add an extra check in reshape_request() just in case they don't.
Signed-off-by: NeilBrown <neilb@suse.com>
While it generally shouldn't happen, it is not impossible for
curr_resync_completed to exceed resync_max.
This can particularly happen when reshaping RAID5 - the current
status isn't copied to curr_resync_completed promptly, so when it
is, it can exceed resync_max.
This happens when the reshape is 'frozen', resync_max is set low,
and reshape is re-enabled.
Taking a difference between two unsigned numbers is always dangerous
anyway, so add a test to behave correctly if
curr_resync_completed > resync_max
Signed-off-by: NeilBrown <neilb@suse.com>
This code is calculating:
writepos, which is the furthest along address (device-space) that we
*will* be writing to
readpos, which is the earliest address that we *could* possible read
from, and
safepos, which is the earliest address in the 'old' section that we
might read from after a crash when the reshape position is
recovered from metadata.
The first is a precise calculation, so clipping at zero doesn't
make sense. As the reshape position is now guaranteed to always be
a multiple of reshape_sectors and as we already BUG_ON when
reshape_progress is zero, there is no point in this min_t() call.
The readpos and safepos are worst case - actual value depends on
precise geometry. That worst case could be negative, which is only
a problem because we are storing the value in an unsigned.
So leave the min_t() for those.
Signed-off-by: NeilBrown <neilb@suse.com>
When reshaping, we work in units of the largest chunk size.
If changing from a larger to a smaller chunk size, that means we
reshape more than one stripe at a time. So the required alignment
of reshape_position needs to take into account both the old
and new chunk size.
This means that both 'here_new' and 'here_old' are calculated with
respect to the same (maximum) chunk size, so testing if they are the
same when delta_disks is zero becomes pointless.
Signed-off-by: NeilBrown <neilb@suse.com>
The chunk_sectors and new_chunk_sectors fields of mddev can be changed
any time (via sysfs) that the reconfig mutex can be taken. So raid5
keeps internal copies in 'conf' which are stable except for a short
locked moment when reshape stops/starts.
So any access that does not hold reconfig_mutex should use the 'conf'
values, not the 'mddev' values.
Several don't.
This could result in corruption if new values were written at awkward
times.
Also use min() or max() rather than open-coding.
Signed-off-by: NeilBrown <neilb@suse.com>
These aren't really needed when no reshape is happening,
but it is safer to have them always set to a meaningful value.
The next patch will use ->prev_chunk_sectors without checking
if a reshape is happening (because that makes the code simpler),
and this patch makes that safe.
Signed-off-by: NeilBrown <neilb@suse.com>
md/raid5 only updates ->reshape_position (which is stored in
metadata and is authoritative) occasionally, but particularly
when getting closed to ->resync_max as it must be correct
when ->resync_max is reached.
When mdadm tries to stop an array which is reshaping it will:
- freeze the reshape,
- set resync_max to where the reshape has reached.
- unfreeze the reshape.
When this happens, the reshape is aborted and then restarted.
The restart doesn't check that resync_max is close, and so doesn't
update ->reshape_position like it should.
This results in the reshape stopping, but ->reshape_position being
incorrect.
So on that first call to reshape_request, make sure ->reshape_position
is updated if needed.
Signed-off-by: NeilBrown <neilb@suse.com>
As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now
performed by blk_queue_split() in md_make_request().
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If a read request fits entirely in a chunk, it will be passed directly to the
underlying device (providing it hasn't failed of course). If it doesn't fit,
the slightly less efficient path that uses the stripe_cache is used.
Requests that get to the stripe cache are always completely split up as
necessary.
So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop work,
but could cause it to take the less efficient path more often.
All that is needed to manage this is for 'chunk_aligned_read' do some bio
splitting, much like the RAID0 code does.
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
I have a report of drop_one_stripe() called from
raid5_cache_scan() apparently finding ->max_nr_stripes == 0.
This should not be allowed.
So add a test to keep max_nr_stripes above min_nr_stripes.
Also use a 'mask' rather than a 'mod' in drop_one_stripe
to ensure 'hash' is valid even if max_nr_stripes does reach zero.
Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please release with 2d5b569b66)
Reported-by: Tomas Papan <tomas.papan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Some places use helpers now, others don't. We only have the 'is set'
helper, add helpers for setting and clearing flags too.
It was a bit of a mess of atomic vs non-atomic access. With
BIO_UPTODATE gone, we don't have any risk of concurrent access to the
flags. So relax the restriction and don't make any of them atomic. The
flags that do have serialization issues (reffed and chained), we
already handle those separately.
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This flag is currently never cleared, which can in rare cases
trigger a warn-on if it is still set but the block isn't
InSync.
So clear it when it isn't need, which includes if the replacement
device has failed.
Signed-off-by: NeilBrown <neilb@suse.com>
Cache size can grow or shrink due to various pressures at
any time. So when we resize the cache as part of a 'grow'
operation (i.e. change the size to allow more devices) we need
to blocks that automatic growing/shrinking.
So introduce a mutex. auto grow/shrink uses mutex_trylock()
and just doesn't bother if there is a blockage.
Resizing the whole cache holds the mutex to ensure that
the correct number of new stripes is allocated.
This bug can result in some stripes not being freed when an
array is stopped. This leads to the kmem_cache not being
freed and a subsequent array can try to use the same kmem_cache
and get confused.
Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
Signed-off-by: NeilBrown <neilb@suse.com>
conf->released_stripes list isn't always related to where there are
free stripes pending. Active stripes can be in the list too.
And even free stripes were active very recently.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
I noticed heavy spin lock contention at get_active_stripe() with fsmark
multiple thread write workloads.
Here is how this hot contention comes from. We have limited stripes, and
it's a multiple thread write workload. Hence, those stripes will be taken
soon, which puts later processes to sleep for waiting free stripes. When
enough stripes(>= 1/4 total stripes) are released, all process are woken,
trying to get the lock. But there is one only being able to get this lock
for each hash lock, making other processes spinning out there for acquiring
the lock.
Thus, it's effectiveless to wakeup all processes and let them battle for
a lock that permits one to access only each time. Instead, we could make
it be a exclusive wake up: wake up one process only. That avoids the heavy
spin lock contention naturally.
To do the exclusive wake up, we've to split wait_for_stripe into multiple
wait queues, to make it per hash value, just like the hash lock.
Here are some test results I have got with this patch applied(all test run
3 times):
`fsmark.files_per_sec'
=====================
next-20150317 this patch
------------------------- -------------------------
metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params
------------------------- ------------------------- -------- ------------------------------
25.600 ±0.0 92.700 ±2.5 262.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
25.600 ±0.0 77.800 ±0.6 203.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
32.000 ±0.0 93.800 ±1.7 193.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
32.000 ±0.0 81.233 ±1.7 153.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
48.800 ±14.5 99.667 ±2.0 104.2% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
6.400 ±0.0 12.800 ±0.0 100.0% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
63.133 ±8.2 82.800 ±0.7 31.2% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
245.067 ±0.7 306.567 ±7.9 25.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
17.533 ±0.3 21.000 ±0.8 19.8% ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
188.167 ±1.9 215.033 ±3.1 14.3% ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
254.500 ±1.8 290.733 ±2.4 14.2% ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync
`time.system_time'
=====================
next-20150317 this patch
------------------------- -------------------------
metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params
------------------------- ------------------------- -------- ------------------------------
7235.603 ±1.2 185.163 ±1.9 -97.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
7666.883 ±2.9 202.750 ±1.0 -97.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
14567.893 ±0.7 421.230 ±0.4 -97.1% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
3697.667 ±14.0 148.190 ±1.7 -96.0% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
5572.867 ±3.8 310.717 ±1.4 -94.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
5565.050 ±0.5 313.277 ±1.5 -94.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
2420.707 ±17.1 171.043 ±2.7 -92.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
3743.300 ±4.6 379.827 ±3.5 -89.9% ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
3308.687 ±6.3 363.050 ±2.0 -89.0% ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
Where,
1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark
1t, 64t: where 't' means thread
4M: means the single file size, corresponding to the '-s' option of fsmark
40G, 30G, 120G: means the total test size
4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means
the size of one ramdisk. So, it would be 48G in total. And we made a
raid on those ramdisk
As you can see, though there are no much performance gain for hard disk
workload, the system time is dropped heavily, up to 97%. And as expected,
the performance increased a lot, up to 260%, for fast device(ram disk).
v2: use bits instead of array to note down wait queue need to wake up.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
resync etc finished. However it is possible for raid5_start_reshape
to race and start a reshape before MD_RECOVERY_DONE is cleared. This
can lean to multiple reshapes running at the same time, which isn't
good.
To make sure it is cleared before starting a reshape, and also clear
it when reaping a thread, just to be safe.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that the code in break_stripe_batch_list() is nearly identical
to the end of handle_stripe_clean_event, replace the later
with a function call.
The only remaining difference of any interest is the masking that is
applieds to dev[i].flags copied from head_sh.
R5_WriteError certainly isn't wanted as it is set per-stripe, not
per-patch. R5_Overlap isn't wanted as it is explicitly handled.
Signed-off-by: NeilBrown <neilb@suse.de>
When a batch of stripes is broken up, we keep some of the flags
that were per-stripe, and copy other flags from the head to all
others.
This only happens while a stripe is being handled, so many of the
flags are irrelevant.
The "SYNC_FLAGS" (which I've renamed to make it clear there are
several) and STRIPE_DEGRADED are set per-stripe and so need to be
preserved. STRIPE_INSYNC is the only flag that is set on the head
that needs to be propagated to all others.
For safety, add a WARN_ON if others are set, except:
STRIPE_HANDLE - this is safe and per-stripe and we are going to set
in several cases anyway
STRIPE_INSYNC
STRIPE_IO_STARTED - this is just a hint and doesn't hurt.
STRIPE_ON_PLUG_LIST
STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched
stripe to be on one of these lists, but it can happen
as can be safely ignored.
Signed-off-by: NeilBrown <neilb@suse.de>
When we break a stripe_batch_list we sometimes want to set
STRIPE_HANDLE on the individual stripes, and sometimes not.
So pass a 'handle_flags' arg. If it is zero, always set STRIPE_HANDLE
(on non-head stripes). If not zero, only set it if any of the given
flags are present.
Signed-off-by: NeilBrown <neilb@suse.de>
break_stripe_batch list didn't clear head_sh->batch_head.
This was probably a bug.
Also clear all R5_Overlap flags and if any were cleared, wake up
'wait_for_overlap'.
This isn't always necessary but the worst effect is a little
extra checking for code that is waiting on wait_for_overlap.
Also, don't use wake_up_nr() because that does the wrong thing
if 'nr' is zero, and it number of flags cleared doesn't
strongly correlate with the number of threads to wake.
Signed-off-by: NeilBrown <neilb@suse.de>
handle_stripe_clean_event() contains a chunk of code very
similar to check_break_stripe_batch_list().
If we make the latter more like the former, we can end up
with just one copy of this code.
This first step removed the condition (and the 'check_') part
of the name. This has the added advantage of making it clear
what check is being performed at the point where the function is
called.
Signed-off-by: NeilBrown <neilb@suse.de>
If a stripe is a member of a batch, but not the head, it must
not be handled separately from the rest of the batch.
'clear_batch_ready()' handles this requirement to some
extent but not completely. If a member is passed to handle_stripe()
a second time it returns '0' indicating the stripe can be handled,
which is wrong.
So add an extra test.
Signed-off-by: NeilBrown <neilb@suse.de>
When we add a write to a stripe we need to make sure the bitmap
bit is set. While doing that the stripe is not locked so it could
be added to a batch after which further changes to STRIPE_BIT_DELAY
and ->bm_seq are ineffective.
So we need to hold off adding to a stripe until bitmap_startwrite has
completed at least once, and we need to avoid further changes to
STRIPE_BIT_DELAY once the stripe has been added to a batch.
If a bitmap_startwrite() completes after the stripe was added to a
batch, it will not have set the bit, only incremented a counter, so no
extra delay of the stripe is needed.
Reported-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When we add a stripe to a batch, we need to be sure that
head stripe will wait for the bitmap update required for the new
stripe.
Signed-off-by: NeilBrown <neilb@suse.de>
ops_run_reconstruct6() doesn't correctly chain asyn operations. The tx returned
by async_gen_syndrome should be added as the dependent tx of next stripe.
The issue is introduced by commit 59fc630b8b
RAID5: batch adjacent full stripe write
Reported-and-tested-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There is no need for special handling of stripe-batches when the array
is degraded.
There may be if there is a failure in the batch, but STRIPE_DEGRADED
does not imply an error.
So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
degraded.
This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
check_break_stripe_batch_list() and so the bitmap bit gets cleared
when it shouldn't.
So in check_break_stripe_batch_list(), split the batch up completely -
again STRIPE_DEGRADED isn't meaningful.
Also don't set STRIPE_BATCH_ERR when there is a write error to a
replacement device. This simply removes the replacement device and
requires no extra handling.
Signed-off-by: NeilBrown <neilb@suse.de>
As the new 'scribble' array is sized based on chunk size,
we need to make sure the size matches the largest of 'old'
and 'new' chunk sizes when the array is undergoing reshape.
We also potentially need to resize it even when not resizing
the stripe cache, as chunk size can change without changing
number of devices.
So move the 'resize' code into a separate function, and
consider old and new sizes when allocating.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 46d5b78562 ("raid5: use flex_array for scribble data")
If any memory allocation in resize_stripes fails we will return
-ENOMEM, but in some cases we update conf->pool_size anyway.
This means that if we try again, the allocations will be assumed
to be larger than they are, and badness results.
So only update pool_size if there is no error.
This bug was introduced in 2.6.17 and the patch is suitable for
-stable.
Fixes: ad01c9e375 ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
Cc: stable@vger.kernel.org (v2.6.17+)
Signed-off-by: NeilBrown <neilb@suse.de>
When performing a reconstruct write, we need to read all blocks
that are not being over-written .. except the parity (P and Q) blocks.
The code currently reads these (as they are not being over-written!)
unnecessarily.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ea664c8245 ("md/raid5: need_this_block: tidy/fix last condition.")
It is not incorrect to call handle_stripe_fill() when
a batch of full-stripe writes is active.
It is, however, a BUG if fetch_block() then decides
it needs to actually fetch anything.
So move the 'BUG_ON' to where it belongs.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 59fc630b8b ("RAID5: batch adjacent full stripe write")
The new batch_lock and batch_list fields are being initialized in
grow_one_stripe() but not in resize_stripes(). This causes a crash
on resize.
So separate the core initialization into a new function and call it
from both allocation sites.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 59fc630b8b ("RAID5: batch adjacent full stripe write")
When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.
This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.
Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.
I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
1) optimal, as baseline;
2) degraded;
3) degraded with this patch.
Kernel version is 4.0-rc3.
Each individual test I only did once so there might be some variations,
but we just focus on big trend.
Sequential Read:
bs=(KiB) optimal(MiB/s) degraded(MiB/s) degraded-with-patch (MiB/s)
1024 1608 656 995
512 1624 710 956
256 1635 728 980
128 1636 771 983
64 1612 1119 1000
32 1580 1420 1004
16 1368 688 986
8 768 647 953
4 411 413 850
Random Read:
bs=(KiB) optimal(IOPS) degraded(IOPS) degraded-with-patch (IOPS)
1024 163 160 156
512 274 273 272
256 426 428 424
128 576 592 591
64 726 724 726
32 849 848 837
16 900 970 971
8 927 940 929
4 948 940 955
Some notes:
* In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
* In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
* In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.
If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.
Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.
Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The default setting of 256 stripe_heads is probably
much too small for many configurations. So it is best to make it
auto-configure.
Shrinking the cache under memory pressure is easy. The only
interesting part here is that we put a fairly high cost
('seeks') on shrinking the cache as the cost is greater than
just having to read more data, it reduces parallelism.
Growing the cache on demand needs to be done carefully. If we allow
fast growth, that can upset memory balance as lots of dirty memory can
quickly turn into lots of memory queued in the stripe_cache.
It is important for the raid5 block device to appear congested to
allow write-throttling to work.
So we only add stripes slowly. We set a flag when an allocation
fails because all stripes are in use, allocate at a convenient
time when that flag is set, and don't allow it to be set again
until at least one stripe_head has been released for re-use.
This means that a spurt of requests will only cause one stripe_head
to be allocated, but a steady stream of requests will slowly
increase the cache size - until memory pressure puts it back again.
It could take hours to reach a steady state.
The value written to, and displayed in, stripe_cache_size is
used as a minimum. The cache can grow above this and shrink back
down to it. The actual size is not directly visible, though it can
be deduced to some extent by watching stripe_cache_active.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
succeeds, do it inside the functions.
Also choose the correct hash to handle next inside the functions.
This removes duplication and will help with future new uses of
{grow,drop}_one_stripe.
This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
message always said "0kB".
Signed-off-by: NeilBrown <neilb@suse.de>
Depending on the available coding we allow optimized rmw logic for write
operations. To support easier testing this patch allows manual control
of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.
The configuration can handle three levels of control.
rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
calculation has no implementation path yet to factor in/out chunks of
a syndrome. Enforcing this level can be benefical for slow CPUs with
hardware syndrome support and fast SSDs.
rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
save IOs. This equals the "old" unpatched behaviour and will be the
default.
rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
equal. We might have higher CPU consumption because of calculating the
parity twice but it can be benefical otherwise. E.g. RAID4 with fast
dedicated parity disk/SSD. The option is implemented just to be
forward-looking and will ONLY work with this patch!
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
expansion/resync can grab a stripe when the stripe is in batch list. Since all
stripes in batch list must be in the same state, we can't allow some stripes
run into expansion/resync. So we delay expansion/resync for stripe in batch
list.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.
With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.
Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.
I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Track overwrite disk count, so we can know if a stripe is a full stripe write.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This option is not well justified and testing suggests that
it hardly ever makes any difference.
The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.
So just remove it to simplify reasoning about speed limiting.
This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.
Signed-off-by: NeilBrown <neilb@suse.de>
When we have more than 1 drive failure, it's possible we start
rebuild one drive while leaving another faulty drive in array.
To determine whether array will be optimal after building, current
code only check whether a drive is missing, which could potentially
lead to data corruption. This patch is to add checking Faulty flag.
Signed-off-by: NeilBrown <neilb@suse.de>
Commit a7854487cd:
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.
Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.
So change the condition to only force RCW on non-degraded arrays.
Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487cd
Cc: stable@vger.kernel.org (v3.7+)
Rather than using mddev_lock() to take the reconfig_mutex
when writing to any md sysfs file, we only take mddev_lock()
in the particular _store() functions that require it.
Admittedly this is most, but it isn't all.
This also allows us to remove special-case handling for new_dev_store
(in md_attr_store).
Signed-off-by: NeilBrown <neilb@suse.de>
It is important that mddev->private isn't freed while
a sysfs attribute function is accessing it.
So use mddev->lock to protect the setting of ->private to NULL, and
take that lock when checking ->private for NULL and de-referencing it
in the sysfs access functions.
This only applies to the read ('show') side of access. Write
access will be handled separately.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that the ->stop function only frees the private data,
rename is accordingly.
Also pass in the private pointer as an arg rather than using
mddev->private. This flexibility will be useful in level_store().
Finally, don't clear ->private. It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before ->to_remove was
introduced).
Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.
Signed-off-by: NeilBrown <neilb@suse.de>
Each md personality has a 'stop' operation which does two
things:
1/ it finalizes some aspects of the array to ensure nothing
is accessing the ->private data
2/ it frees the ->private data.
All the steps in '1' can apply to all arrays and so can be
performed in common code.
This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.
So split the 'step 1' functionality out into a new mddev_detach().
Signed-off-by: NeilBrown <neilb@suse.de>
There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.
So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.
Signed-off-by: NeilBrown <neilb@suse.de>
There is currently no locking around calls to the 'congested'
bdi function. If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.
So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.
When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock. Fortunately this is the case.
Signed-off-by: NeilBrown <neilb@suse.de>
That last condition is unclear and over cautious.
There are two related issues here.
If a partial write is destined for a missing device, then
either RMW or RCW can work. We must read all the available
block. Only then can the missing blocks be calculated, and
then the parity update performed.
If RMW is not an option, then there is a complication even
without partial writes. If we would need to read a missing
device to perform the reconstruction, then we must first read every
block so the missing device data can be computed.
This is the case for RAID6 (Which currently does not support
RMW) and for times when we don't trust the parity (after a crash)
and so are in the process of resyncing it.
So make these two cases more clear and separate, and perform
the relevant tests more thoroughly.
Signed-off-by: NeilBrown <neilb@suse.de>
Both the last two cases are only relevant if something has failed and
something needs to be written (but not over-written), and if it is OK
to pre-read blocks at this point. So factor out those tests and
explain them.
Signed-off-by: NeilBrown <neilb@suse.de>
Some of the conditions in need_this_block have very straight
forward motivation. Separate those out and document them.
Signed-off-by: NeilBrown <neilb@suse.de>
fetch_block() has a very large and hard to read 'if' condition.
Separate it into its own function so that it can be
made more readable.
Signed-off-by: NeilBrown <neilb@suse.de>
67f455486d introduced a call to
md_wakeup_thread() when adding to the delayed_list. However the md
thread is woken up unconditionally just below.
Remove the unnecessary wakeup call.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a non-page-aligned write is destined for a device which
is missing/faulty, we can deadlock.
As the target device is missing, a read-modify-write cycle
is not possible.
As the write is not for a full-page, a recontruct-write cycle
is not possible.
This should be handled by logic in fetch_block() which notices
there is a non-R5_OVERWRITE write to a missing device, and so
loads all blocks.
However since commit 67f455486d, that code requires
STRIPE_PREREAD_ACTIVE before it will active, and those circumstances
never set STRIPE_PREREAD_ACTIVE.
So: in handle_stripe_dirtying, if neither rmw or rcw was possible,
set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set
after a suitable delay.
Fixes: 67f455486d
Cc: stable@vger.kernel.org (v3.16+)
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.
Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.
So treat that case just like RAID6, just as we do in
handle_stripe_dirtying. RAID6 always does reconstruct-write.
This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.
Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487cd
Reported-by: Henry Cai <henryplusplus@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
raid5: fix init_stripe() inconsistencies
1) remove_hash() is not necessary. We will only be called right after
get_free_stripe(). There we have already a call to remove_hash().
2) Tracing prints out the sector of the freed stripe and not the sector
that we want to initialize.
Signed-off-by: NeilBrown <neilb@suse.de>
It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint. Some devices in some cases don't do what it
says on the label.
The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero. If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks. If all were zero, this would
be the case. If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.
As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.
As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.
If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
raid456.devices_handle_discard_safely=Y
As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7
Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org (3.7+)
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf
Signed-off-by: NeilBrown <neilb@suse.de>
During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.
If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.
This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.
Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then. In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().
Cc: stable@vger.kernel.org (2.6.32+)
Fixes: 6c0069c0ae
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
If a stripe in a raid6 array received a write to each data block while
the array is degraded, and if any of these writes to a missing device
are not page-aligned, then a live-lock happens.
In this case the P and Q blocks need to be read so that the part of
the missing block which is *not* being updated by the write can be
constructed. Due to a logic error, these blocks are not loaded, so
the update cannot proceed and the stripe is 'handled' repeatedly in an
infinite loop.
This bug is unlikely as most writes are page aligned. However as it
can lead to a livelock it is suitable for -stable. It was introduced
in 3.16.
Cc: stable@vger.kernel.org (v3.16)
Fixed: 67f455486d
Signed-off-by: NeilBrown <neilb@suse.de>
Mostly performance improvements with a few corner-case bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAU5fevznsnt1WYoG5AQKzNxAAgjZPg6LZgl0wEHJi09ykqLK9dSkglz2B
RY5IEWPEMcsGJQbSX7fEME5WSHkqKBIHx43xmdzijPCHyWLnF4vQRubNAi/Wo8PC
yfJyI5hMrG8+0rNKnmXWeG+NLGj7l7j+wkExe27VBlu2+6ATvIUCa34kjOWFDJJx
xAN9/t22XGly776SiuLXTpMhA0pRWH4sqijJhvM5EqPlsEUjGxvQU/URwUPYzhYr
iCbvY64uKB1Crh+vn7yP1IUWc77aDOoUPIxO50m0GXOQFU/wbQta1NAFmMeu8tIo
3QRlGzP+jRJFBV7jLii4nHTMSVxBmD2zMppxVtDoq/egWhyZlJhh2Kd5sUKo2r9V
8bGQm7e1JlsDsRMVXxSlx3TebZDfJhtnFNhYb6JPlpWM5EYjCzXXfJUl3/6LfKX6
BKJ800xvabr5TFkVubF9H80LTQxaO+pQoS6PSs8hnwneqtybdbt9o7V59yluMGao
IwaG/BY4jlY+76YnK32Kbr1eofUNYXBLg45mgSkrhUVqv9HLHe0CDnYAzYs68aur
zG8a1rhA9IJtBmWwql2Gm8lKOnDFM6onvSmaYntsabJ++eahaNDpL5ck+R4kO8mJ
pyI1N6GQDjtLdONk4NmH36P+CuCZje6KwRbVXvsJIBA1042tkZfU0zdqgt5QhFZj
Ma9ZCn9wZFo=
=MaM/
-----END PGP SIGNATURE-----
Merge tag 'md/3.16' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"Assorted md fixes for 3.16
Mostly performance improvements with a few corner-case bug fixes"
* tag 'md/3.16' of git://neil.brown.name/md:
raid5: speedup sync_request processing
md/raid5: deadlock between retry_aligned_read with barrier io
raid5: add an option to avoid copy data from bio to stripe cache
md/bitmap: remove confusing code from filemap_get_page.
raid5: avoid release list until last reference of the stripe
md: md_clear_badblocks should return an error code on failure.
md/raid56: Don't perform reads to support writes until stripe is ready.
md: refuse to change shape of array if it is active but read-only
The raid5 sync_request() processing calls handle_stripe() within the context of
the resync-thread. The resync-thread issues the first set of read requests
and this adds execution latency and slows down the scheduling of the next
sync_request().
The current rebuild/resync speed of raid5 is not much faster than what
rotational HDDs can sustain.
Testing the following patch on a 6-drive array, I can increase the rebuild
speed from 100 MB/s to 175 MB/s.
The sync_request() now just sets STRIPE_HANDLE and releases the stripe. This
creates some more parallelism between the resync-thread and raid5 kernel daemon.
Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A chunk aligned read increases counter active_aligned_reads and
decreases it after sub-device handle it successfully. But when a read
error occurs, the read redispatched by raid5d, and the
active_aligned_reads will not be decreased until we can grab a stripe
head in retry_aligned_read. Now suppose, a barrier io comes, set
conf->quiesce to 2, and wait until both active_stripes and
active_aligned_reads are zero. The retried chunk aligned read gets
stuck at get_active_stripe waiting until conf->quiesce becomes 0.
Retry_aligned_read and barrier io are waiting each other now.
One possible solution is that we ignore conf->quiesce, let the retried
aligned read finish. I reproduced this deadlock and test this patch on
centos6.0
Signed-off-by: NeilBrown <neilb@suse.de>
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The (lockless) release_list reduces lock contention, but there is excessive
queueing and dequeuing of stripes on this list. A stripe will currently be
queued on the release_list with a stripe reference count > 1. This can cause
the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
without doing any other useful processing of the stripe. The are two cases
when the stripe can be put on the release_list multiple times before it is
actually handled by the kernel thread(s).
1) make_request() activates the stripe processing in 4k increments. When a
write request is large enough to span multiple chunks of a stripe_head, the
first 4k chunk adds the stripe to the plug list. The next 4k chunk that is
processed for the same stripe puts the stripe on the release_list with a
refcount=2. This can cause the kernel thread to process and decrement the
stripe before the stripe us unplugged, which again will put it back on the
release_list.
2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
refcount is set to the number of active IO (for each chunk). The stripe is
released as each IO complete, and can be queued and dequeued multiple times
on the release_list, until its refcount finally reached zero.
This simple patch will ensure a stripe is only queued on the release_list when
its refcount=1 and is ready to be handled by the kernel thread(s). I added some
instrumentation to raid5 and counted the number of times striped were queued on
the release_list for a variety of write IO sizes. Without this patch the number
of times stripes got queued on the release_list was 100-500% higher than with
the patch. The excess queuing will increase with the IO size. The patch also
improved throughput by 5-10%.
Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If it is found that we need to pre-read some blocks before a write
can succeed, we normally set STRIPE_DELAYED and don't actually perform
the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
However for a degraded RAID6 we currently perform the reads as soon
as we see that a write is pending. This significantly hurts
throughput.
So:
- when handle_stripe_dirtying find a block that it wants on a device
that is failed, set STRIPE_DELAY, instead of doing nothing, and
- when fetch_block detects that a read might be required to satisfy a
write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
and if we would actually need to read something to complete the write.
This also helps RAID5, though less often as RAID5 supports a
read-modify-write cycle. For RAID5 the read is performed too early
only if the write is not a full 4K aligned write (i.e. no an
R5_OVERWRITE).
Also clean up a couple of horrible bits of formatting.
Reported-by: Patrik Horník <patrik@dsl.sk>
Signed-off-by: NeilBrown <neilb@suse.de>
I hit another BUG_ON with e240c1839d. In __get_priority_stripe(),
stripe count equals to 0 initially. Between atomic_inc and BUG_ON,
get_active_stripe() finds the stripe. So the stripe count isn't 1 any more.
V2: keeps the BUG_ON suggested by Neil.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
For sequential workload (or request size big workload), get_active_stripe can
find cached stripe. In this case, we always hold device_lock, which exposes a
lot of lock contention for such workload. If stripe count isn't 0, we don't
need hold the lock actually, since we just increase its count. And this is the
hot code path for such workload. Unfortunately we must delete the BUG_ON.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
lot of contention for sequential workload (or big request size
workload). For such workload, each bio includes several stripes. So we
can just do prepare_to_wait/finish_wait once for the whold bio instead
of every stripe. This reduces the lock contention completely for such
workload. Random workload might have the similar lock contention too,
but I didn't see it yet, maybe because my stroage is still not fast
enough.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:
register_cpu_notifier(&foobar_cpu_notifier);
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
put_online_cpus();
A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.
So reorganize the code in raid5 this way to fix the deadlock with callback
registration.
Cc: linux-raid@vger.kernel.org
Cc: stable@vger.kernel.org (v2.6.32+)
Fixes: 36d1c6476b
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block IO driver changes from Jens Axboe:
- bcache update from Kent Overstreet.
- two bcache fixes from Nicholas Swenson.
- cciss pci init error fix from Andrew.
- underflow fix in the parallel IDE pg_write code from Dan Carpenter.
I'm sure the 1 (or 0) users of that are now happy.
- two PCI related fixes for sx8 from Jingoo Han.
- floppy init fix for first block read from Jiri Kosina.
- pktcdvd error return miss fix from Julia Lawall.
- removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
Michael Opdenacker.
- comment typo fix for the loop driver from Olaf Hering.
- potential oops fix for null_blk from Raghavendra K T.
- two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
an OOM problem and a problem with handling security locked conditions
* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
mg_disk: Spelling s/finised/finished/
null_blk: Null pointer deference problem in alloc_page_buffers
mtip32xx: Correctly handle security locked condition
mtip32xx: Make SGL container per-command to eliminate high order dma allocation
drivers/block/loop.c: fix comment typo in loop_config_discard
drivers/block/cciss.c:cciss_init_one(): use proper errnos
drivers/block/paride/pg.c: underflow bug in pg_write()
drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
drivers/block/sx8.c: use module_pci_driver()
floppy: bail out in open() if drive is not responding to block0 read
bcache: Fix auxiliary search trees for key size > cacheline size
bcache: Don't return -EINTR when insert finished
bcache: Improve bucket_prio() calculation
bcache: Add bch_bkey_equal_header()
bcache: update bch_bkey_try_merge
bcache: Move insert_fixup() to btree_keys_ops
bcache: Convert sorting to btree_keys
bcache: Convert debug code to btree_keys
bcache: Convert btree_iter to struct btree_keys
bcache: Refactor bset_tree sysfs stats
...
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
As release_stripe and __release_stripe decrement ->count and then
manipulate ->lru both under ->device_lock, it is important that
get_active_stripe() increments ->count and clears ->lru also under
->device_lock.
However we currently list_del_init ->lru under the lock, but increment
the ->count outside the lock. This can lead to races and list
corruption.
So move the atomic_inc(&sh->count) up inside the ->device_lock
protected region.
Note that we still increment ->count without device lock in the case
where get_free_stripe() was called, and in fact don't take
->device_lock at all in that path.
This is safe because if the stripe_head can be found by
get_free_stripe, then the hash lock assures us the no-one else could
possibly be calling release_stripe() at the same time.
Fixes: 566c09c534
Cc: stable@vger.kernel.org (3.13)
Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices. However if the write wasn't fully successful we
should not clear the bit. If the faulty drive is subsequently
re-added, the fact that the bit is still set ensure that we will
re-write the data that is missing.
This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some
devices are failed or missing. But we do *not* set the flag if some
device failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.
So: set the flag when a write fails.
This bug has been present since bitmaps were introduces, so the fix is
suitable for any -stable kernel.
Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>