2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* raid5.c : Multiple Devices driver for Linux
|
|
|
|
* Copyright (C) 1996, 1997 Ingo Molnar, Miguel de Icaza, Gadi Oxman
|
|
|
|
* Copyright (C) 1999, 2000 Ingo Molnar
|
2006-06-26 14:27:38 +07:00
|
|
|
* Copyright (C) 2002, 2003 H. Peter Anvin
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
2006-06-26 14:27:38 +07:00
|
|
|
* RAID-4/5/6 management functions.
|
|
|
|
* Thanks to Penguin Computing for making the RAID-6 development possible
|
|
|
|
* by donating a test server!
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or modify
|
|
|
|
* it under the terms of the GNU General Public License as published by
|
|
|
|
* the Free Software Foundation; either version 2, or (at your option)
|
|
|
|
* any later version.
|
|
|
|
*
|
|
|
|
* You should have received a copy of the GNU General Public License
|
|
|
|
* (for example /usr/src/linux/COPYING); if not, write to the Free
|
|
|
|
* Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
|
|
|
*/
|
|
|
|
|
2006-07-10 18:44:17 +07:00
|
|
|
/*
|
|
|
|
* BITMAP UNPLUGGING:
|
|
|
|
*
|
|
|
|
* The sequencing for updating the bitmap reliably is a little
|
|
|
|
* subtle (and I got it wrong the first time) so it deserves some
|
|
|
|
* explanation.
|
|
|
|
*
|
|
|
|
* We group bitmap updates into batches. Each batch has a number.
|
|
|
|
* We may write out several batches at once, but that isn't very important.
|
2011-04-18 15:25:43 +07:00
|
|
|
* conf->seq_write is the number of the last batch successfully written.
|
|
|
|
* conf->seq_flush is the number of the last batch that was closed to
|
2006-07-10 18:44:17 +07:00
|
|
|
* new additions.
|
|
|
|
* When we discover that we will need to write to any block in a stripe
|
|
|
|
* (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
|
2011-04-18 15:25:43 +07:00
|
|
|
* the number of the batch it will be in. This is seq_flush+1.
|
2006-07-10 18:44:17 +07:00
|
|
|
* When we are ready to do a write, if that batch hasn't been written yet,
|
|
|
|
* we plug the array and queue the stripe for later.
|
|
|
|
* When an unplug happens, we increment bm_flush, thus closing the current
|
|
|
|
* batch.
|
|
|
|
* When we notice that bm_flush > bm_write, we write out all pending updates
|
|
|
|
* to the bitmap, and advance bm_write to where bm_flush was.
|
|
|
|
* This may occasionally write a bit out twice, but is sure never to
|
|
|
|
* miss any bits.
|
|
|
|
*/
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-03-31 10:33:13 +07:00
|
|
|
#include <linux/blkdev.h>
|
2006-03-27 16:18:11 +07:00
|
|
|
#include <linux/kthread.h>
|
2009-03-31 11:09:39 +07:00
|
|
|
#include <linux/raid/pq.h>
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
#include <linux/async_tx.h>
|
2011-07-04 00:58:33 +07:00
|
|
|
#include <linux/module.h>
|
2009-08-30 09:13:13 +07:00
|
|
|
#include <linux/async.h>
|
2009-03-31 10:33:13 +07:00
|
|
|
#include <linux/seq_file.h>
|
2009-07-15 01:48:22 +07:00
|
|
|
#include <linux/cpu.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 15:04:11 +07:00
|
|
|
#include <linux/slab.h>
|
2011-07-27 08:00:36 +07:00
|
|
|
#include <linux/ratelimit.h>
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
#include <linux/nodemask.h>
|
2014-12-15 08:57:02 +07:00
|
|
|
#include <linux/flex_array.h>
|
2017-02-09 00:51:30 +07:00
|
|
|
|
2012-10-31 07:59:09 +07:00
|
|
|
#include <trace/events/block.h>
|
2017-03-04 13:06:12 +07:00
|
|
|
#include <linux/list_sort.h>
|
2012-10-31 07:59:09 +07:00
|
|
|
|
2009-03-31 10:33:13 +07:00
|
|
|
#include "md.h"
|
2009-03-31 10:33:13 +07:00
|
|
|
#include "raid5.h"
|
2010-03-08 12:02:42 +07:00
|
|
|
#include "raid0.h"
|
2017-10-11 04:02:41 +07:00
|
|
|
#include "md-bitmap.h"
|
2017-03-09 15:59:58 +07:00
|
|
|
#include "raid5-log.h"
|
2005-09-10 06:23:54 +07:00
|
|
|
|
2017-01-05 07:10:19 +07:00
|
|
|
#define UNSUPPORTED_MDDEV_FLAGS (1L << MD_FAILFAST_SUPPORTED)
|
|
|
|
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
#define cpu_to_group(cpu) cpu_to_node(cpu)
|
|
|
|
#define ANY_GROUP NUMA_NO_NODE
|
|
|
|
|
2014-10-02 10:45:00 +07:00
|
|
|
static bool devices_handle_discard_safely = false;
|
|
|
|
module_param(devices_handle_discard_safely, bool, 0644);
|
|
|
|
MODULE_PARM_DESC(devices_handle_discard_safely,
|
|
|
|
"Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
static struct workqueue_struct *raid5_wq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect)
|
2011-10-07 10:23:00 +07:00
|
|
|
{
|
|
|
|
int hash = (sect >> STRIPE_SHIFT) & HASH_MASK;
|
|
|
|
return &conf->stripe_hashtbl[hash];
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
static inline int stripe_hash_locks_hash(sector_t sect)
|
|
|
|
{
|
|
|
|
return (sect >> STRIPE_SHIFT) & STRIPE_HASH_LOCKS_MASK;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void lock_device_hash_lock(struct r5conf *conf, int hash)
|
|
|
|
{
|
|
|
|
spin_lock_irq(conf->hash_locks + hash);
|
|
|
|
spin_lock(&conf->device_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void unlock_device_hash_lock(struct r5conf *conf, int hash)
|
|
|
|
{
|
|
|
|
spin_unlock(&conf->device_lock);
|
|
|
|
spin_unlock_irq(conf->hash_locks + hash);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void lock_all_device_hash_locks_irq(struct r5conf *conf)
|
|
|
|
{
|
|
|
|
int i;
|
md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
On mainline, there is no functional difference, just less code, and
symmetric lock/unlock paths.
On PREEMPT_RT builds, this fixes the following warning, seen by
Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G W 4.9.20-rt16-stand6-686 #1
Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
Workqueue: writeback wb_workfn (flush-253:0)
Call Trace:
dump_stack+0x47/0x68
? migrate_enable+0x4a/0xf0
___might_sleep+0x101/0x180
rt_spin_lock+0x17/0x40
add_stripe_bio+0x4e3/0x6c0 [raid456]
? preempt_count_add+0x42/0xb0
raid5_make_request+0x737/0xdd0 [raid456]
Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-29 00:41:02 +07:00
|
|
|
spin_lock_irq(conf->hash_locks);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
for (i = 1; i < NR_STRIPE_HASH_LOCKS; i++)
|
|
|
|
spin_lock_nest_lock(conf->hash_locks + i, conf->hash_locks);
|
|
|
|
spin_lock(&conf->device_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void unlock_all_device_hash_locks_irq(struct r5conf *conf)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
spin_unlock(&conf->device_lock);
|
md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
On mainline, there is no functional difference, just less code, and
symmetric lock/unlock paths.
On PREEMPT_RT builds, this fixes the following warning, seen by
Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G W 4.9.20-rt16-stand6-686 #1
Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
Workqueue: writeback wb_workfn (flush-253:0)
Call Trace:
dump_stack+0x47/0x68
? migrate_enable+0x4a/0xf0
___might_sleep+0x101/0x180
rt_spin_lock+0x17/0x40
add_stripe_bio+0x4e3/0x6c0 [raid456]
? preempt_count_add+0x42/0xb0
raid5_make_request+0x737/0xdd0 [raid456]
Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-29 00:41:02 +07:00
|
|
|
for (i = NR_STRIPE_HASH_LOCKS - 1; i; i--)
|
|
|
|
spin_unlock(conf->hash_locks + i);
|
|
|
|
spin_unlock_irq(conf->hash_locks);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
}
|
|
|
|
|
2009-03-31 10:39:38 +07:00
|
|
|
/* Find first data disk in a raid6 stripe */
|
|
|
|
static inline int raid6_d0(struct stripe_head *sh)
|
|
|
|
{
|
2009-03-31 10:39:38 +07:00
|
|
|
if (sh->ddf_layout)
|
|
|
|
/* ddf always start from first device */
|
|
|
|
return 0;
|
|
|
|
/* md starts just after Q block */
|
2009-03-31 10:39:38 +07:00
|
|
|
if (sh->qd_idx == sh->disks - 1)
|
|
|
|
return 0;
|
|
|
|
else
|
|
|
|
return sh->qd_idx + 1;
|
|
|
|
}
|
2006-06-26 14:27:38 +07:00
|
|
|
static inline int raid6_next_disk(int disk, int raid_disks)
|
|
|
|
{
|
|
|
|
disk++;
|
|
|
|
return (disk < raid_disks) ? disk : 0;
|
|
|
|
}
|
2007-07-10 01:56:43 +07:00
|
|
|
|
2009-03-31 10:39:38 +07:00
|
|
|
/* When walking through the disks in a raid5, starting at raid6_d0,
|
|
|
|
* We need to map each disk to a 'slot', where the data disks are slot
|
|
|
|
* 0 .. raid_disks-3, the parity disk is raid_disks-2 and the Q disk
|
|
|
|
* is raid_disks-1. This help does that mapping.
|
|
|
|
*/
|
2009-03-31 10:39:38 +07:00
|
|
|
static int raid6_idx_to_slot(int idx, struct stripe_head *sh,
|
|
|
|
int *count, int syndrome_disks)
|
2009-03-31 10:39:38 +07:00
|
|
|
{
|
2009-10-20 08:09:32 +07:00
|
|
|
int slot = *count;
|
2009-03-31 10:39:38 +07:00
|
|
|
|
2009-10-16 12:27:34 +07:00
|
|
|
if (sh->ddf_layout)
|
2009-10-20 08:09:32 +07:00
|
|
|
(*count)++;
|
2009-03-31 10:39:38 +07:00
|
|
|
if (idx == sh->pd_idx)
|
2009-03-31 10:39:38 +07:00
|
|
|
return syndrome_disks;
|
2009-03-31 10:39:38 +07:00
|
|
|
if (idx == sh->qd_idx)
|
2009-03-31 10:39:38 +07:00
|
|
|
return syndrome_disks + 1;
|
2009-10-16 12:27:34 +07:00
|
|
|
if (!sh->ddf_layout)
|
2009-10-20 08:09:32 +07:00
|
|
|
(*count)++;
|
2009-03-31 10:39:38 +07:00
|
|
|
return slot;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void print_raid5_conf (struct r5conf *conf);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
static int stripe_operations_active(struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
return sh->check_state || sh->reconstruct_state ||
|
|
|
|
test_bit(STRIPE_BIOFILL_RUN, &sh->state) ||
|
|
|
|
test_bit(STRIPE_COMPUTE_RUN, &sh->state);
|
|
|
|
}
|
|
|
|
|
2017-02-16 10:37:32 +07:00
|
|
|
static bool stripe_is_lowprio(struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
return (test_bit(STRIPE_R5C_FULL_STRIPE, &sh->state) ||
|
|
|
|
test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) &&
|
|
|
|
!test_bit(STRIPE_R5C_CACHING, &sh->state);
|
|
|
|
}
|
|
|
|
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
struct r5conf *conf = sh->raid_conf;
|
|
|
|
struct r5worker_group *group;
|
2013-08-29 14:40:32 +07:00
|
|
|
int thread_cnt;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
int i, cpu = sh->cpu;
|
|
|
|
|
|
|
|
if (!cpu_online(cpu)) {
|
|
|
|
cpu = cpumask_any(cpu_online_mask);
|
|
|
|
sh->cpu = cpu;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (list_empty(&sh->lru)) {
|
|
|
|
struct r5worker_group *group;
|
|
|
|
group = conf->worker_groups + cpu_to_group(cpu);
|
2017-02-16 10:37:32 +07:00
|
|
|
if (stripe_is_lowprio(sh))
|
|
|
|
list_add_tail(&sh->lru, &group->loprio_list);
|
|
|
|
else
|
|
|
|
list_add_tail(&sh->lru, &group->handle_list);
|
2013-08-29 14:40:32 +07:00
|
|
|
group->stripes_cnt++;
|
|
|
|
sh->group = group;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (conf->worker_cnt_per_group == 0) {
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
group = conf->worker_groups + cpu_to_group(sh->cpu);
|
|
|
|
|
2013-08-29 14:40:32 +07:00
|
|
|
group->workers[0].working = true;
|
|
|
|
/* at least one worker should run to avoid race */
|
|
|
|
queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
|
|
|
|
|
|
|
|
thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
|
|
|
|
/* wakeup more workers */
|
|
|
|
for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
|
|
|
|
if (group->workers[i].working == false) {
|
|
|
|
group->workers[i].working = true;
|
|
|
|
queue_work_on(sh->cpu, raid5_wq,
|
|
|
|
&group->workers[i].work);
|
|
|
|
thread_cnt--;
|
|
|
|
}
|
|
|
|
}
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
}
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
|
|
|
|
struct list_head *temp_inactive_list)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
int i;
|
|
|
|
int injournal = 0; /* number of date pages with R5_InJournal */
|
|
|
|
|
2012-07-19 13:01:31 +07:00
|
|
|
BUG_ON(!list_empty(&sh->lru));
|
|
|
|
BUG_ON(atomic_read(&conf->active_stripes)==0);
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
|
|
|
if (r5c_is_writeback(conf->log))
|
|
|
|
for (i = sh->disks; i--; )
|
|
|
|
if (test_bit(R5_InJournal, &sh->dev[i].flags))
|
|
|
|
injournal++;
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
/*
|
2017-05-12 07:03:44 +07:00
|
|
|
* In the following cases, the stripe cannot be released to cached
|
|
|
|
* lists. Therefore, we make the stripe write out and set
|
|
|
|
* STRIPE_HANDLE:
|
|
|
|
* 1. when quiesce in r5c write back;
|
|
|
|
* 2. when resync is requested fot the stripe.
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
*/
|
2017-05-12 07:03:44 +07:00
|
|
|
if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state) ||
|
|
|
|
(conf->quiesce && r5c_is_writeback(conf->log) &&
|
|
|
|
!test_bit(STRIPE_HANDLE, &sh->state) && injournal != 0)) {
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
if (test_bit(STRIPE_R5C_CACHING, &sh->state))
|
|
|
|
r5c_make_stripe_write_out(sh);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
}
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
2012-07-19 13:01:31 +07:00
|
|
|
if (test_bit(STRIPE_HANDLE, &sh->state)) {
|
|
|
|
if (test_bit(STRIPE_DELAYED, &sh->state) &&
|
2015-01-30 00:38:29 +07:00
|
|
|
!test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
2012-07-19 13:01:31 +07:00
|
|
|
list_add_tail(&sh->lru, &conf->delayed_list);
|
2015-01-30 00:38:29 +07:00
|
|
|
else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
|
2012-07-19 13:01:31 +07:00
|
|
|
sh->bm_seq - conf->seq_write > 0)
|
|
|
|
list_add_tail(&sh->lru, &conf->bitmap_list);
|
|
|
|
else {
|
|
|
|
clear_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
clear_bit(STRIPE_BIT_DELAY, &sh->state);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
if (conf->worker_cnt_per_group == 0) {
|
2017-02-16 10:37:32 +07:00
|
|
|
if (stripe_is_lowprio(sh))
|
|
|
|
list_add_tail(&sh->lru,
|
|
|
|
&conf->loprio_list);
|
|
|
|
else
|
|
|
|
list_add_tail(&sh->lru,
|
|
|
|
&conf->handle_list);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
} else {
|
|
|
|
raid5_wakeup_stripe_thread(sh);
|
|
|
|
return;
|
|
|
|
}
|
2012-07-19 13:01:31 +07:00
|
|
|
}
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
|
|
|
} else {
|
|
|
|
BUG_ON(stripe_operations_active(sh));
|
|
|
|
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
|
|
|
if (atomic_dec_return(&conf->preread_active_stripes)
|
|
|
|
< IO_THRESHOLD)
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
|
|
|
atomic_dec(&conf->active_stripes);
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
|
|
|
|
if (!r5c_is_writeback(conf->log))
|
|
|
|
list_add_tail(&sh->lru, temp_inactive_list);
|
|
|
|
else {
|
|
|
|
WARN_ON(test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags));
|
|
|
|
if (injournal == 0)
|
|
|
|
list_add_tail(&sh->lru, temp_inactive_list);
|
|
|
|
else if (injournal == conf->raid_disks - conf->max_degraded) {
|
|
|
|
/* full stripe */
|
|
|
|
if (!test_and_set_bit(STRIPE_R5C_FULL_STRIPE, &sh->state))
|
|
|
|
atomic_inc(&conf->r5c_cached_full_stripes);
|
|
|
|
if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
|
|
|
|
atomic_dec(&conf->r5c_cached_partial_stripes);
|
|
|
|
list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
r5c_check_cached_full_stripe(conf);
|
md/r5cache: enable chunk_aligned_read with write back cache
Chunk aligned read significantly reduces CPU usage of raid456.
However, it is not safe to fully bypass the write back cache.
This patch enables chunk aligned read with write back cache.
For chunk aligned read, we track stripes in write back cache at
a bigger granularity, "big_stripe". Each chunk may contain more
than one stripe (for example, a 256kB chunk contains 64 4kB-page,
so this chunk contain 64 stripes). For chunk_aligned_read, these
stripes are grouped into one big_stripe, so we only need one lookup
for the whole chunk.
For each big_stripe, struct big_stripe_info tracks how many stripes
of this big_stripe are in the write back cache. We count how many
stripes of this big_stripe are in the write back cache. These
counters are tracked in a radix tree (big_stripe_tree).
r5c_tree_index() is used to calculate keys for the radix tree.
chunk_aligned_read() calls r5c_big_stripe_cached() to look up
big_stripe of each chunk in the tree. If this big_stripe is in the
tree, chunk_aligned_read() aborts. This look up is protected by
rcu_read_lock().
It is necessary to remember whether a stripe is counted in
big_stripe_tree. Instead of adding new flag, we reuses existing flags:
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
two flags are set, the stripe is counted in big_stripe_tree. This
requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
r5c_try_caching_write(); and moving clear_bit of
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
r5c_finish_stripe_write_out().
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-12 04:39:14 +07:00
|
|
|
} else
|
|
|
|
/*
|
|
|
|
* STRIPE_R5C_PARTIAL_STRIPE is set in
|
|
|
|
* r5c_try_caching_write(). No need to
|
|
|
|
* set it again.
|
|
|
|
*/
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
list_add_tail(&sh->lru, &conf->r5c_partial_stripe_list);
|
|
|
|
}
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
}
|
2009-03-31 10:39:38 +07:00
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
static void __release_stripe(struct r5conf *conf, struct stripe_head *sh,
|
|
|
|
struct list_head *temp_inactive_list)
|
2012-07-19 13:01:31 +07:00
|
|
|
{
|
|
|
|
if (atomic_dec_and_test(&sh->count))
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
do_release_stripe(conf, sh, temp_inactive_list);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* @hash could be NR_STRIPE_HASH_LOCKS, then we have a list of inactive_list
|
|
|
|
*
|
|
|
|
* Be careful: Only one task can add/delete stripes from temp_inactive_list at
|
|
|
|
* given time. Adding stripes only takes device lock, while deleting stripes
|
|
|
|
* only takes hash lock.
|
|
|
|
*/
|
|
|
|
static void release_inactive_stripe_list(struct r5conf *conf,
|
|
|
|
struct list_head *temp_inactive_list,
|
|
|
|
int hash)
|
|
|
|
{
|
|
|
|
int size;
|
2016-02-26 07:24:42 +07:00
|
|
|
bool do_wakeup = false;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
if (hash == NR_STRIPE_HASH_LOCKS) {
|
|
|
|
size = NR_STRIPE_HASH_LOCKS;
|
|
|
|
hash = NR_STRIPE_HASH_LOCKS - 1;
|
|
|
|
} else
|
|
|
|
size = 1;
|
|
|
|
while (size) {
|
|
|
|
struct list_head *list = &temp_inactive_list[size - 1];
|
|
|
|
|
|
|
|
/*
|
2015-08-14 04:31:57 +07:00
|
|
|
* We don't hold any lock here yet, raid5_get_active_stripe() might
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
* remove stripes from the list
|
|
|
|
*/
|
|
|
|
if (!list_empty_careful(list)) {
|
|
|
|
spin_lock_irqsave(conf->hash_locks + hash, flags);
|
2013-11-14 11:16:17 +07:00
|
|
|
if (list_empty(conf->inactive_list + hash) &&
|
|
|
|
!list_empty(list))
|
|
|
|
atomic_dec(&conf->empty_inactive_list_nr);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
list_splice_tail_init(list, conf->inactive_list + hash);
|
2016-02-26 07:24:42 +07:00
|
|
|
do_wakeup = true;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
spin_unlock_irqrestore(conf->hash_locks + hash, flags);
|
|
|
|
}
|
|
|
|
size--;
|
|
|
|
hash--;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (do_wakeup) {
|
2016-02-26 07:24:42 +07:00
|
|
|
wake_up(&conf->wait_for_stripe);
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 15:19:06 +07:00
|
|
|
if (atomic_read(&conf->active_stripes) == 0)
|
|
|
|
wake_up(&conf->wait_for_quiescent);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
if (conf->retry_read_aligned)
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
|
|
|
}
|
2012-07-19 13:01:31 +07:00
|
|
|
}
|
|
|
|
|
2013-08-27 16:50:39 +07:00
|
|
|
/* should hold conf->device_lock already */
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
static int release_stripe_list(struct r5conf *conf,
|
|
|
|
struct list_head *temp_inactive_list)
|
2013-08-27 16:50:39 +07:00
|
|
|
{
|
2017-02-14 14:26:24 +07:00
|
|
|
struct stripe_head *sh, *t;
|
2013-08-27 16:50:39 +07:00
|
|
|
int count = 0;
|
|
|
|
struct llist_node *head;
|
|
|
|
|
|
|
|
head = llist_del_all(&conf->released_stripes);
|
2013-08-28 13:29:05 +07:00
|
|
|
head = llist_reverse_order(head);
|
2017-02-14 14:26:24 +07:00
|
|
|
llist_for_each_entry_safe(sh, t, head, release_list) {
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int hash;
|
|
|
|
|
2013-08-27 16:50:39 +07:00
|
|
|
/* sh could be readded after STRIPE_ON_RELEASE_LIST is cleard */
|
|
|
|
smp_mb();
|
|
|
|
clear_bit(STRIPE_ON_RELEASE_LIST, &sh->state);
|
|
|
|
/*
|
|
|
|
* Don't worry the bit is set here, because if the bit is set
|
|
|
|
* again, the count is always > 1. This is true for
|
|
|
|
* STRIPE_ON_UNPLUG_LIST bit too.
|
|
|
|
*/
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
hash = sh->hash_lock_index;
|
|
|
|
__release_stripe(conf, sh, &temp_inactive_list[hash]);
|
2013-08-27 16:50:39 +07:00
|
|
|
count++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
void raid5_release_stripe(struct stripe_head *sh)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned long flags;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
struct list_head list;
|
|
|
|
int hash;
|
2013-08-27 16:50:39 +07:00
|
|
|
bool wakeup;
|
2006-06-26 14:27:38 +07:00
|
|
|
|
2014-05-28 10:39:23 +07:00
|
|
|
/* Avoid release_list until the last reference.
|
|
|
|
*/
|
|
|
|
if (atomic_add_unless(&sh->count, -1, 1))
|
|
|
|
return;
|
|
|
|
|
2013-11-14 11:16:15 +07:00
|
|
|
if (unlikely(!conf->mddev->thread) ||
|
|
|
|
test_and_set_bit(STRIPE_ON_RELEASE_LIST, &sh->state))
|
2013-08-27 16:50:39 +07:00
|
|
|
goto slow_path;
|
|
|
|
wakeup = llist_add(&sh->release_list, &conf->released_stripes);
|
|
|
|
if (wakeup)
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
|
|
|
return;
|
|
|
|
slow_path:
|
|
|
|
/* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */
|
2018-07-04 03:01:36 +07:00
|
|
|
if (atomic_dec_and_lock_irqsave(&sh->count, &conf->device_lock, flags)) {
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
INIT_LIST_HEAD(&list);
|
|
|
|
hash = sh->hash_lock_index;
|
|
|
|
do_release_stripe(conf, sh, &list);
|
2018-07-04 03:01:37 +07:00
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
release_inactive_stripe_list(conf, &list, hash);
|
2012-07-19 13:01:31 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2006-01-06 15:20:33 +07:00
|
|
|
static inline void remove_hash(struct stripe_head *sh)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("remove_hash(), stripe %llu\n",
|
|
|
|
(unsigned long long)sh->sector);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-01-06 15:20:33 +07:00
|
|
|
hlist_del_init(&sh->hash);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static inline void insert_hash(struct r5conf *conf, struct stripe_head *sh)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-01-06 15:20:33 +07:00
|
|
|
struct hlist_head *hp = stripe_hash(conf, sh->sector);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("insert_hash(), stripe %llu\n",
|
|
|
|
(unsigned long long)sh->sector);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-01-06 15:20:33 +07:00
|
|
|
hlist_add_head(&sh->hash, hp);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* find an idle stripe, make sure it is unhashed, and return it. */
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
static struct stripe_head *get_free_stripe(struct r5conf *conf, int hash)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *sh = NULL;
|
|
|
|
struct list_head *first;
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
if (list_empty(conf->inactive_list + hash))
|
2005-04-17 05:20:36 +07:00
|
|
|
goto out;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
first = (conf->inactive_list + hash)->next;
|
2005-04-17 05:20:36 +07:00
|
|
|
sh = list_entry(first, struct stripe_head, lru);
|
|
|
|
list_del_init(first);
|
|
|
|
remove_hash(sh);
|
|
|
|
atomic_inc(&conf->active_stripes);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
BUG_ON(hash != sh->hash_lock_index);
|
2013-11-14 11:16:17 +07:00
|
|
|
if (list_empty(conf->inactive_list + hash))
|
|
|
|
atomic_inc(&conf->empty_inactive_list_nr);
|
2005-04-17 05:20:36 +07:00
|
|
|
out:
|
|
|
|
return sh;
|
|
|
|
}
|
|
|
|
|
2010-06-16 13:45:16 +07:00
|
|
|
static void shrink_buffers(struct stripe_head *sh)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
struct page *p;
|
|
|
|
int i;
|
2010-06-16 13:45:16 +07:00
|
|
|
int num = sh->raid_conf->pool_size;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-06-16 13:45:16 +07:00
|
|
|
for (i = 0; i < num ; i++) {
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
WARN_ON(sh->dev[i].page != sh->dev[i].orig_page);
|
2005-04-17 05:20:36 +07:00
|
|
|
p = sh->dev[i].page;
|
|
|
|
if (!p)
|
|
|
|
continue;
|
|
|
|
sh->dev[i].page = NULL;
|
2006-01-06 15:20:31 +07:00
|
|
|
put_page(p);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-02-25 08:02:51 +07:00
|
|
|
static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int i;
|
2010-06-16 13:45:16 +07:00
|
|
|
int num = sh->raid_conf->pool_size;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-06-16 13:45:16 +07:00
|
|
|
for (i = 0; i < num; i++) {
|
2005-04-17 05:20:36 +07:00
|
|
|
struct page *page;
|
|
|
|
|
2015-02-25 08:02:51 +07:00
|
|
|
if (!(page = alloc_page(gfp))) {
|
2005-04-17 05:20:36 +07:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
sh->dev[i].page = page;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
sh->dev[i].orig_page = page;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2017-03-09 15:59:59 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
|
2009-03-31 10:39:38 +07:00
|
|
|
struct stripe_head *sh);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-03-31 10:39:38 +07:00
|
|
|
static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int i, seq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-04-02 18:31:42 +07:00
|
|
|
BUG_ON(atomic_read(&sh->count) != 0);
|
|
|
|
BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
BUG_ON(stripe_operations_active(sh));
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2007-01-03 03:52:30 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("init_stripe called, stripe %llu\n",
|
2014-08-23 17:19:27 +07:00
|
|
|
(unsigned long long)sector);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
retry:
|
|
|
|
seq = read_seqcount_begin(&conf->gen_lock);
|
2009-03-31 11:19:03 +07:00
|
|
|
sh->generation = conf->generation - previous;
|
2009-03-31 10:39:38 +07:00
|
|
|
sh->disks = previous ? conf->previous_raid_disks : conf->raid_disks;
|
2005-04-17 05:20:36 +07:00
|
|
|
sh->sector = sector;
|
2009-03-31 10:39:38 +07:00
|
|
|
stripe_set_idx(sector, conf, previous, sh);
|
2005-04-17 05:20:36 +07:00
|
|
|
sh->state = 0;
|
|
|
|
|
2006-03-27 16:18:08 +07:00
|
|
|
for (i = sh->disks; i--; ) {
|
2005-04-17 05:20:36 +07:00
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
|
2007-01-03 03:52:30 +07:00
|
|
|
if (dev->toread || dev->read || dev->towrite || dev->written ||
|
2005-04-17 05:20:36 +07:00
|
|
|
test_bit(R5_LOCKED, &dev->flags)) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_err("sector=%llx i=%d %p %p %p %p %d\n",
|
2005-04-17 05:20:36 +07:00
|
|
|
(unsigned long long)sh->sector, i, dev->toread,
|
2007-01-03 03:52:30 +07:00
|
|
|
dev->read, dev->towrite, dev->written,
|
2005-04-17 05:20:36 +07:00
|
|
|
test_bit(R5_LOCKED, &dev->flags));
|
2011-07-27 08:00:36 +07:00
|
|
|
WARN_ON(1);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
dev->flags = 0;
|
2017-08-10 15:12:17 +07:00
|
|
|
dev->sector = raid5_compute_blocknr(sh, i, previous);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
if (read_seqcount_retry(&conf->gen_lock, seq))
|
|
|
|
goto retry;
|
2014-12-15 08:57:03 +07:00
|
|
|
sh->overwrite_disks = 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
insert_hash(conf, sh);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
sh->cpu = smp_processor_id();
|
2014-12-15 08:57:03 +07:00
|
|
|
set_bit(STRIPE_BATCH_READY, &sh->state);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
|
2009-03-31 11:19:03 +07:00
|
|
|
short generation)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *sh;
|
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
|
hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28 08:06:00 +07:00
|
|
|
hlist_for_each_entry(sh, stripe_hash(conf, sector), hash)
|
2009-03-31 11:19:03 +07:00
|
|
|
if (sh->sector == sector && sh->generation == generation)
|
2005-04-17 05:20:36 +07:00
|
|
|
return sh;
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
|
2005-04-17 05:20:36 +07:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2010-06-16 14:17:53 +07:00
|
|
|
/*
|
|
|
|
* Need to check if array has failed when deciding whether to:
|
|
|
|
* - start an array
|
|
|
|
* - remove non-faulty devices
|
|
|
|
* - add a spare
|
|
|
|
* - allow a reshape
|
|
|
|
* This determination is simple when no reshape is happening.
|
|
|
|
* However if there is a reshape, we need to carefully check
|
|
|
|
* both the before and after sections.
|
|
|
|
* This is because some failed devices may only affect one
|
|
|
|
* of the two sections, and some non-in_sync devices may
|
|
|
|
* be insync in the section most affected by failed devices.
|
|
|
|
*/
|
2017-01-25 01:45:30 +07:00
|
|
|
int raid5_calc_degraded(struct r5conf *conf)
|
2010-06-16 14:17:53 +07:00
|
|
|
{
|
2011-12-23 06:17:50 +07:00
|
|
|
int degraded, degraded2;
|
2010-06-16 14:17:53 +07:00
|
|
|
int i;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
degraded = 0;
|
|
|
|
for (i = 0; i < conf->previous_raid_disks; i++) {
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev);
|
2012-09-19 09:52:30 +07:00
|
|
|
if (rdev && test_bit(Faulty, &rdev->flags))
|
|
|
|
rdev = rcu_dereference(conf->disks[i].replacement);
|
2010-06-16 14:17:53 +07:00
|
|
|
if (!rdev || test_bit(Faulty, &rdev->flags))
|
|
|
|
degraded++;
|
|
|
|
else if (test_bit(In_sync, &rdev->flags))
|
|
|
|
;
|
|
|
|
else
|
|
|
|
/* not in-sync or faulty.
|
|
|
|
* If the reshape increases the number of devices,
|
|
|
|
* this is being recovered by the reshape, so
|
|
|
|
* this 'previous' section is not in_sync.
|
|
|
|
* If the number of devices is being reduced however,
|
|
|
|
* the device can only be part of the array if
|
|
|
|
* we are reverting a reshape, so this section will
|
|
|
|
* be in-sync.
|
|
|
|
*/
|
|
|
|
if (conf->raid_disks >= conf->previous_raid_disks)
|
|
|
|
degraded++;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2011-12-23 06:17:50 +07:00
|
|
|
if (conf->raid_disks == conf->previous_raid_disks)
|
|
|
|
return degraded;
|
2010-06-16 14:17:53 +07:00
|
|
|
rcu_read_lock();
|
2011-12-23 06:17:50 +07:00
|
|
|
degraded2 = 0;
|
2010-06-16 14:17:53 +07:00
|
|
|
for (i = 0; i < conf->raid_disks; i++) {
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev);
|
2012-09-19 09:52:30 +07:00
|
|
|
if (rdev && test_bit(Faulty, &rdev->flags))
|
|
|
|
rdev = rcu_dereference(conf->disks[i].replacement);
|
2010-06-16 14:17:53 +07:00
|
|
|
if (!rdev || test_bit(Faulty, &rdev->flags))
|
2011-12-23 06:17:50 +07:00
|
|
|
degraded2++;
|
2010-06-16 14:17:53 +07:00
|
|
|
else if (test_bit(In_sync, &rdev->flags))
|
|
|
|
;
|
|
|
|
else
|
|
|
|
/* not in-sync or faulty.
|
|
|
|
* If reshape increases the number of devices, this
|
|
|
|
* section has already been recovered, else it
|
|
|
|
* almost certainly hasn't.
|
|
|
|
*/
|
|
|
|
if (conf->raid_disks <= conf->previous_raid_disks)
|
2011-12-23 06:17:50 +07:00
|
|
|
degraded2++;
|
2010-06-16 14:17:53 +07:00
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2011-12-23 06:17:50 +07:00
|
|
|
if (degraded2 > degraded)
|
|
|
|
return degraded2;
|
|
|
|
return degraded;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int has_failed(struct r5conf *conf)
|
|
|
|
{
|
|
|
|
int degraded;
|
|
|
|
|
|
|
|
if (conf->mddev->reshape_position == MaxSector)
|
|
|
|
return conf->mddev->degraded > conf->max_degraded;
|
|
|
|
|
2017-01-25 01:45:30 +07:00
|
|
|
degraded = raid5_calc_degraded(conf);
|
2010-06-16 14:17:53 +07:00
|
|
|
if (degraded > conf->max_degraded)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
struct stripe_head *
|
|
|
|
raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
|
|
|
|
int previous, int noblock, int noquiesce)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *sh;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int hash = stripe_hash_locks_hash(sector);
|
2016-07-28 13:22:14 +07:00
|
|
|
int inc_empty_inactive_list_flag;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
spin_lock_irq(conf->hash_locks + hash);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
do {
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 15:19:06 +07:00
|
|
|
wait_event_lock_irq(conf->wait_for_quiescent,
|
2009-06-09 11:39:59 +07:00
|
|
|
conf->quiesce == 0 || noquiesce,
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
*(conf->hash_locks + hash));
|
2009-03-31 11:19:03 +07:00
|
|
|
sh = __find_stripe(conf, sector, conf->generation - previous);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (!sh) {
|
2015-02-26 08:47:56 +07:00
|
|
|
if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
sh = get_free_stripe(conf, hash);
|
2015-05-29 07:33:47 +07:00
|
|
|
if (!sh && !test_bit(R5_DID_ALLOC,
|
|
|
|
&conf->cache_state))
|
2015-02-26 08:47:56 +07:00
|
|
|
set_bit(R5_ALLOC_MORE,
|
|
|
|
&conf->cache_state);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
if (noblock && sh == NULL)
|
|
|
|
break;
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
|
|
|
|
r5c_check_stripe_cache_usage(conf);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (!sh) {
|
2015-02-26 08:21:04 +07:00
|
|
|
set_bit(R5_INACTIVE_BLOCKED,
|
|
|
|
&conf->cache_state);
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
r5l_wake_reclaim(conf->log, 0);
|
2016-02-26 07:24:42 +07:00
|
|
|
wait_event_lock_irq(
|
|
|
|
conf->wait_for_stripe,
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
!list_empty(conf->inactive_list + hash) &&
|
|
|
|
(atomic_read(&conf->active_stripes)
|
|
|
|
< (conf->max_nr_stripes * 3 / 4)
|
2015-02-26 08:21:04 +07:00
|
|
|
|| !test_bit(R5_INACTIVE_BLOCKED,
|
|
|
|
&conf->cache_state)),
|
2016-02-26 07:24:42 +07:00
|
|
|
*(conf->hash_locks + hash));
|
2015-02-26 08:21:04 +07:00
|
|
|
clear_bit(R5_INACTIVE_BLOCKED,
|
|
|
|
&conf->cache_state);
|
2014-01-22 07:45:03 +07:00
|
|
|
} else {
|
2009-03-31 10:39:38 +07:00
|
|
|
init_stripe(sh, sector, previous);
|
2014-01-22 07:45:03 +07:00
|
|
|
atomic_inc(&sh->count);
|
|
|
|
}
|
2014-04-09 10:27:42 +07:00
|
|
|
} else if (!atomic_inc_not_zero(&sh->count)) {
|
2013-11-28 06:55:27 +07:00
|
|
|
spin_lock(&conf->device_lock);
|
2014-04-09 10:27:42 +07:00
|
|
|
if (!atomic_read(&sh->count)) {
|
2005-04-17 05:20:36 +07:00
|
|
|
if (!test_bit(STRIPE_HANDLE, &sh->state))
|
|
|
|
atomic_inc(&conf->active_stripes);
|
2014-01-14 11:16:10 +07:00
|
|
|
BUG_ON(list_empty(&sh->lru) &&
|
|
|
|
!test_bit(STRIPE_EXPANDING, &sh->state));
|
2016-07-28 13:22:14 +07:00
|
|
|
inc_empty_inactive_list_flag = 0;
|
|
|
|
if (!list_empty(conf->inactive_list + hash))
|
|
|
|
inc_empty_inactive_list_flag = 1;
|
2006-06-26 14:27:38 +07:00
|
|
|
list_del_init(&sh->lru);
|
2016-07-28 13:22:14 +07:00
|
|
|
if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
|
|
|
|
atomic_inc(&conf->empty_inactive_list_nr);
|
2013-08-29 14:40:32 +07:00
|
|
|
if (sh->group) {
|
|
|
|
sh->group->stripes_cnt--;
|
|
|
|
sh->group = NULL;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2014-01-22 07:45:03 +07:00
|
|
|
atomic_inc(&sh->count);
|
2013-11-28 06:55:27 +07:00
|
|
|
spin_unlock(&conf->device_lock);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
} while (sh == NULL);
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
spin_unlock_irq(conf->hash_locks + hash);
|
2005-04-17 05:20:36 +07:00
|
|
|
return sh;
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
static bool is_full_stripe_write(struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
BUG_ON(sh->overwrite_disks > (sh->disks - sh->raid_conf->max_degraded));
|
|
|
|
return sh->overwrite_disks == (sh->disks - sh->raid_conf->max_degraded);
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
|
|
|
|
{
|
|
|
|
if (sh1 > sh2) {
|
md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
On mainline, there is no functional difference, just less code, and
symmetric lock/unlock paths.
On PREEMPT_RT builds, this fixes the following warning, seen by
Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G W 4.9.20-rt16-stand6-686 #1
Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
Workqueue: writeback wb_workfn (flush-253:0)
Call Trace:
dump_stack+0x47/0x68
? migrate_enable+0x4a/0xf0
___might_sleep+0x101/0x180
rt_spin_lock+0x17/0x40
add_stripe_bio+0x4e3/0x6c0 [raid456]
? preempt_count_add+0x42/0xb0
raid5_make_request+0x737/0xdd0 [raid456]
Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-29 00:41:02 +07:00
|
|
|
spin_lock_irq(&sh2->stripe_lock);
|
2014-12-15 08:57:03 +07:00
|
|
|
spin_lock_nested(&sh1->stripe_lock, 1);
|
|
|
|
} else {
|
md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
On mainline, there is no functional difference, just less code, and
symmetric lock/unlock paths.
On PREEMPT_RT builds, this fixes the following warning, seen by
Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G W 4.9.20-rt16-stand6-686 #1
Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
Workqueue: writeback wb_workfn (flush-253:0)
Call Trace:
dump_stack+0x47/0x68
? migrate_enable+0x4a/0xf0
___might_sleep+0x101/0x180
rt_spin_lock+0x17/0x40
add_stripe_bio+0x4e3/0x6c0 [raid456]
? preempt_count_add+0x42/0xb0
raid5_make_request+0x737/0xdd0 [raid456]
Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-29 00:41:02 +07:00
|
|
|
spin_lock_irq(&sh1->stripe_lock);
|
2014-12-15 08:57:03 +07:00
|
|
|
spin_lock_nested(&sh2->stripe_lock, 1);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2)
|
|
|
|
{
|
|
|
|
spin_unlock(&sh1->stripe_lock);
|
md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
On mainline, there is no functional difference, just less code, and
symmetric lock/unlock paths.
On PREEMPT_RT builds, this fixes the following warning, seen by
Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G W 4.9.20-rt16-stand6-686 #1
Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
Workqueue: writeback wb_workfn (flush-253:0)
Call Trace:
dump_stack+0x47/0x68
? migrate_enable+0x4a/0xf0
___might_sleep+0x101/0x180
rt_spin_lock+0x17/0x40
add_stripe_bio+0x4e3/0x6c0 [raid456]
? preempt_count_add+0x42/0xb0
raid5_make_request+0x737/0xdd0 [raid456]
Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-29 00:41:02 +07:00
|
|
|
spin_unlock_irq(&sh2->stripe_lock);
|
2014-12-15 08:57:03 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Only freshly new full stripe normal write stripe can be added to a batch list */
|
|
|
|
static bool stripe_can_batch(struct stripe_head *sh)
|
|
|
|
{
|
2015-08-14 04:32:02 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
|
|
|
|
2018-08-30 01:05:42 +07:00
|
|
|
if (raid5_has_log(conf) || raid5_has_ppl(conf))
|
2015-08-14 04:32:02 +07:00
|
|
|
return false;
|
2014-12-15 08:57:03 +07:00
|
|
|
return test_bit(STRIPE_BATCH_READY, &sh->state) &&
|
2015-05-27 05:43:45 +07:00
|
|
|
!test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
|
2014-12-15 08:57:03 +07:00
|
|
|
is_full_stripe_write(sh);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* we only do back search */
|
|
|
|
static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
struct stripe_head *head;
|
|
|
|
sector_t head_sector, tmp_sec;
|
|
|
|
int hash;
|
|
|
|
int dd_idx;
|
2016-07-28 13:22:14 +07:00
|
|
|
int inc_empty_inactive_list_flag;
|
2014-12-15 08:57:03 +07:00
|
|
|
|
|
|
|
/* Don't cross chunks, so stripe pd_idx/qd_idx is the same */
|
|
|
|
tmp_sec = sh->sector;
|
|
|
|
if (!sector_div(tmp_sec, conf->chunk_sectors))
|
|
|
|
return;
|
|
|
|
head_sector = sh->sector - STRIPE_SECTORS;
|
|
|
|
|
|
|
|
hash = stripe_hash_locks_hash(head_sector);
|
|
|
|
spin_lock_irq(conf->hash_locks + hash);
|
|
|
|
head = __find_stripe(conf, head_sector, conf->generation);
|
|
|
|
if (head && !atomic_inc_not_zero(&head->count)) {
|
|
|
|
spin_lock(&conf->device_lock);
|
|
|
|
if (!atomic_read(&head->count)) {
|
|
|
|
if (!test_bit(STRIPE_HANDLE, &head->state))
|
|
|
|
atomic_inc(&conf->active_stripes);
|
|
|
|
BUG_ON(list_empty(&head->lru) &&
|
|
|
|
!test_bit(STRIPE_EXPANDING, &head->state));
|
2016-07-28 13:22:14 +07:00
|
|
|
inc_empty_inactive_list_flag = 0;
|
|
|
|
if (!list_empty(conf->inactive_list + hash))
|
|
|
|
inc_empty_inactive_list_flag = 1;
|
2014-12-15 08:57:03 +07:00
|
|
|
list_del_init(&head->lru);
|
2016-07-28 13:22:14 +07:00
|
|
|
if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
|
|
|
|
atomic_inc(&conf->empty_inactive_list_nr);
|
2014-12-15 08:57:03 +07:00
|
|
|
if (head->group) {
|
|
|
|
head->group->stripes_cnt--;
|
|
|
|
head->group = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
atomic_inc(&head->count);
|
|
|
|
spin_unlock(&conf->device_lock);
|
|
|
|
}
|
|
|
|
spin_unlock_irq(conf->hash_locks + hash);
|
|
|
|
|
|
|
|
if (!head)
|
|
|
|
return;
|
|
|
|
if (!stripe_can_batch(head))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
lock_two_stripes(head, sh);
|
|
|
|
/* clear_batch_ready clear the flag */
|
|
|
|
if (!stripe_can_batch(head) || !stripe_can_batch(sh))
|
|
|
|
goto unlock_out;
|
|
|
|
|
|
|
|
if (sh->batch_head)
|
|
|
|
goto unlock_out;
|
|
|
|
|
|
|
|
dd_idx = 0;
|
|
|
|
while (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx)
|
|
|
|
dd_idx++;
|
2016-08-06 04:35:16 +07:00
|
|
|
if (head->dev[dd_idx].towrite->bi_opf != sh->dev[dd_idx].towrite->bi_opf ||
|
2016-06-06 02:32:07 +07:00
|
|
|
bio_op(head->dev[dd_idx].towrite) != bio_op(sh->dev[dd_idx].towrite))
|
2014-12-15 08:57:03 +07:00
|
|
|
goto unlock_out;
|
|
|
|
|
|
|
|
if (head->batch_head) {
|
|
|
|
spin_lock(&head->batch_head->batch_lock);
|
|
|
|
/* This batch list is already running */
|
|
|
|
if (!stripe_can_batch(head)) {
|
|
|
|
spin_unlock(&head->batch_head->batch_lock);
|
|
|
|
goto unlock_out;
|
|
|
|
}
|
2017-08-26 00:40:02 +07:00
|
|
|
/*
|
|
|
|
* We must assign batch_head of this stripe within the
|
|
|
|
* batch_lock, otherwise clear_batch_ready of batch head
|
|
|
|
* stripe could clear BATCH_READY bit of this stripe and
|
|
|
|
* this stripe->batch_head doesn't get assigned, which
|
|
|
|
* could confuse clear_batch_ready for this stripe
|
|
|
|
*/
|
|
|
|
sh->batch_head = head->batch_head;
|
2014-12-15 08:57:03 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* at this point, head's BATCH_READY could be cleared, but we
|
|
|
|
* can still add the stripe to batch list
|
|
|
|
*/
|
|
|
|
list_add(&sh->batch_list, &head->batch_list);
|
|
|
|
spin_unlock(&head->batch_head->batch_lock);
|
|
|
|
} else {
|
|
|
|
head->batch_head = head;
|
|
|
|
sh->batch_head = head->batch_head;
|
|
|
|
spin_lock(&head->batch_lock);
|
|
|
|
list_add_tail(&sh->batch_list, &head->batch_list);
|
|
|
|
spin_unlock(&head->batch_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
|
|
|
if (atomic_dec_return(&conf->preread_active_stripes)
|
|
|
|
< IO_THRESHOLD)
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
|
|
|
|
2015-05-21 12:10:01 +07:00
|
|
|
if (test_and_clear_bit(STRIPE_BIT_DELAY, &sh->state)) {
|
|
|
|
int seq = sh->bm_seq;
|
|
|
|
if (test_bit(STRIPE_BIT_DELAY, &sh->batch_head->state) &&
|
|
|
|
sh->batch_head->bm_seq > seq)
|
|
|
|
seq = sh->batch_head->bm_seq;
|
|
|
|
set_bit(STRIPE_BIT_DELAY, &sh->batch_head->state);
|
|
|
|
sh->batch_head->bm_seq = seq;
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
atomic_inc(&sh->count);
|
|
|
|
unlock_out:
|
|
|
|
unlock_two_stripes(head, sh);
|
|
|
|
out:
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(head);
|
2014-12-15 08:57:03 +07:00
|
|
|
}
|
|
|
|
|
2012-05-21 06:27:00 +07:00
|
|
|
/* Determine if 'data_offset' or 'new_data_offset' should be used
|
|
|
|
* in this stripe_head.
|
|
|
|
*/
|
|
|
|
static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
sector_t progress = conf->reshape_progress;
|
|
|
|
/* Need a memory barrier to make sure we see the value
|
|
|
|
* of conf->generation, or ->data_offset that was set before
|
|
|
|
* reshape_progress was updated.
|
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
if (progress == MaxSector)
|
|
|
|
return 0;
|
|
|
|
if (sh->generation == conf->generation - 1)
|
|
|
|
return 0;
|
|
|
|
/* We are in a reshape, and this is a new-generation stripe,
|
|
|
|
* so use new_data_offset.
|
|
|
|
*/
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2017-03-04 13:06:12 +07:00
|
|
|
static void dispatch_bio_list(struct bio_list *tmp)
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
{
|
|
|
|
struct bio *bio;
|
|
|
|
|
2017-03-04 13:06:12 +07:00
|
|
|
while ((bio = bio_list_pop(tmp)))
|
|
|
|
generic_make_request(bio);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int cmp_stripe(void *priv, struct list_head *a, struct list_head *b)
|
|
|
|
{
|
|
|
|
const struct r5pending_data *da = list_entry(a,
|
|
|
|
struct r5pending_data, sibling);
|
|
|
|
const struct r5pending_data *db = list_entry(b,
|
|
|
|
struct r5pending_data, sibling);
|
|
|
|
if (da->sector > db->sector)
|
|
|
|
return 1;
|
|
|
|
if (da->sector < db->sector)
|
|
|
|
return -1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void dispatch_defer_bios(struct r5conf *conf, int target,
|
|
|
|
struct bio_list *list)
|
|
|
|
{
|
|
|
|
struct r5pending_data *data;
|
|
|
|
struct list_head *first, *next = NULL;
|
|
|
|
int cnt = 0;
|
|
|
|
|
|
|
|
if (conf->pending_data_cnt == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
list_sort(NULL, &conf->pending_list, cmp_stripe);
|
|
|
|
|
|
|
|
first = conf->pending_list.next;
|
|
|
|
|
|
|
|
/* temporarily move the head */
|
|
|
|
if (conf->next_pending_data)
|
|
|
|
list_move_tail(&conf->pending_list,
|
|
|
|
&conf->next_pending_data->sibling);
|
|
|
|
|
|
|
|
while (!list_empty(&conf->pending_list)) {
|
|
|
|
data = list_first_entry(&conf->pending_list,
|
|
|
|
struct r5pending_data, sibling);
|
|
|
|
if (&data->sibling == first)
|
|
|
|
first = data->sibling.next;
|
|
|
|
next = data->sibling.next;
|
|
|
|
|
|
|
|
bio_list_merge(list, &data->bios);
|
|
|
|
list_move(&data->sibling, &conf->free_list);
|
|
|
|
cnt++;
|
|
|
|
if (cnt >= target)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
conf->pending_data_cnt -= cnt;
|
|
|
|
BUG_ON(conf->pending_data_cnt < 0 || cnt < target);
|
|
|
|
|
|
|
|
if (next != &conf->pending_list)
|
|
|
|
conf->next_pending_data = list_entry(next,
|
|
|
|
struct r5pending_data, sibling);
|
|
|
|
else
|
|
|
|
conf->next_pending_data = NULL;
|
|
|
|
/* list isn't empty */
|
|
|
|
if (first != &conf->pending_list)
|
|
|
|
list_move_tail(&conf->pending_list, first);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void flush_deferred_bios(struct r5conf *conf)
|
|
|
|
{
|
|
|
|
struct bio_list tmp = BIO_EMPTY_LIST;
|
|
|
|
|
|
|
|
if (conf->pending_data_cnt == 0)
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
return;
|
|
|
|
|
|
|
|
spin_lock(&conf->pending_bios_lock);
|
2017-03-04 13:06:12 +07:00
|
|
|
dispatch_defer_bios(conf, conf->pending_data_cnt, &tmp);
|
|
|
|
BUG_ON(conf->pending_data_cnt != 0);
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
spin_unlock(&conf->pending_bios_lock);
|
|
|
|
|
2017-03-04 13:06:12 +07:00
|
|
|
dispatch_bio_list(&tmp);
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
}
|
|
|
|
|
2017-03-04 13:06:12 +07:00
|
|
|
static void defer_issue_bios(struct r5conf *conf, sector_t sector,
|
|
|
|
struct bio_list *bios)
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
{
|
2017-03-04 13:06:12 +07:00
|
|
|
struct bio_list tmp = BIO_EMPTY_LIST;
|
|
|
|
struct r5pending_data *ent;
|
|
|
|
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
spin_lock(&conf->pending_bios_lock);
|
2017-03-04 13:06:12 +07:00
|
|
|
ent = list_first_entry(&conf->free_list, struct r5pending_data,
|
|
|
|
sibling);
|
|
|
|
list_move_tail(&ent->sibling, &conf->pending_list);
|
|
|
|
ent->sector = sector;
|
|
|
|
bio_list_init(&ent->bios);
|
|
|
|
bio_list_merge(&ent->bios, bios);
|
|
|
|
conf->pending_data_cnt++;
|
|
|
|
if (conf->pending_data_cnt >= PENDING_IO_MAX)
|
|
|
|
dispatch_defer_bios(conf, PENDING_IO_ONE_FLUSH, &tmp);
|
|
|
|
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
spin_unlock(&conf->pending_bios_lock);
|
2017-03-04 13:06:12 +07:00
|
|
|
|
|
|
|
dispatch_bio_list(&tmp);
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
}
|
|
|
|
|
2007-09-27 17:47:43 +07:00
|
|
|
static void
|
2015-07-20 20:29:37 +07:00
|
|
|
raid5_end_read_request(struct bio *bi);
|
2007-09-27 17:47:43 +07:00
|
|
|
static void
|
2015-07-20 20:29:37 +07:00
|
|
|
raid5_end_write_request(struct bio *bi);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2008-06-28 05:31:53 +07:00
|
|
|
static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
int i, disks = sh->disks;
|
2014-12-15 08:57:03 +07:00
|
|
|
struct stripe_head *head_sh = sh;
|
2017-03-04 13:06:12 +07:00
|
|
|
struct bio_list pending_bios = BIO_EMPTY_LIST;
|
|
|
|
bool should_defer;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
might_sleep();
|
|
|
|
|
2017-03-09 15:59:58 +07:00
|
|
|
if (log_stripe(sh, s) == 0)
|
|
|
|
return;
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
2017-03-04 13:06:12 +07:00
|
|
|
should_defer = conf->batch_bio_dispatch && conf->group_cnt;
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
for (i = disks; i--; ) {
|
2016-06-06 02:32:07 +07:00
|
|
|
int op, op_flags = 0;
|
2011-12-23 06:17:53 +07:00
|
|
|
int replace_only = 0;
|
2011-12-23 06:17:53 +07:00
|
|
|
struct bio *bi, *rbi;
|
|
|
|
struct md_rdev *rdev, *rrdev = NULL;
|
2014-12-15 08:57:03 +07:00
|
|
|
|
|
|
|
sh = head_sh;
|
2010-09-03 16:56:18 +07:00
|
|
|
if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
|
2016-06-06 02:32:07 +07:00
|
|
|
op = REQ_OP_WRITE;
|
2010-09-03 16:56:18 +07:00
|
|
|
if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags))
|
2016-11-01 20:40:10 +07:00
|
|
|
op_flags = REQ_FUA;
|
2012-10-11 09:49:49 +07:00
|
|
|
if (test_bit(R5_Discard, &sh->dev[i].flags))
|
2016-06-06 02:32:07 +07:00
|
|
|
op = REQ_OP_DISCARD;
|
2010-09-03 16:56:18 +07:00
|
|
|
} else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
|
2016-06-06 02:32:07 +07:00
|
|
|
op = REQ_OP_READ;
|
2011-12-23 06:17:53 +07:00
|
|
|
else if (test_and_clear_bit(R5_WantReplace,
|
|
|
|
&sh->dev[i].flags)) {
|
2016-06-06 02:32:07 +07:00
|
|
|
op = REQ_OP_WRITE;
|
2011-12-23 06:17:53 +07:00
|
|
|
replace_only = 1;
|
|
|
|
} else
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
continue;
|
2012-05-22 10:55:05 +07:00
|
|
|
if (test_and_clear_bit(R5_SyncIO, &sh->dev[i].flags))
|
2016-06-06 02:32:07 +07:00
|
|
|
op_flags |= REQ_SYNC;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
again:
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
bi = &sh->dev[i].req;
|
2011-12-23 06:17:53 +07:00
|
|
|
rbi = &sh->dev[i].rreq; /* For writing to replacement */
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2011-12-23 06:17:53 +07:00
|
|
|
rrdev = rcu_dereference(conf->disks[i].replacement);
|
2011-12-23 06:17:53 +07:00
|
|
|
smp_mb(); /* Ensure that if rrdev is NULL, rdev won't be */
|
|
|
|
rdev = rcu_dereference(conf->disks[i].rdev);
|
|
|
|
if (!rdev) {
|
|
|
|
rdev = rrdev;
|
|
|
|
rrdev = NULL;
|
|
|
|
}
|
2016-06-06 02:32:07 +07:00
|
|
|
if (op_is_write(op)) {
|
2011-12-23 06:17:53 +07:00
|
|
|
if (replace_only)
|
|
|
|
rdev = NULL;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (rdev == rrdev)
|
|
|
|
/* We raced and saw duplicates */
|
|
|
|
rrdev = NULL;
|
2011-12-23 06:17:53 +07:00
|
|
|
} else {
|
2014-12-15 08:57:03 +07:00
|
|
|
if (test_bit(R5_ReadRepl, &head_sh->dev[i].flags) && rrdev)
|
2011-12-23 06:17:53 +07:00
|
|
|
rdev = rrdev;
|
|
|
|
rrdev = NULL;
|
|
|
|
}
|
2011-12-23 06:17:53 +07:00
|
|
|
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
if (rdev && test_bit(Faulty, &rdev->flags))
|
|
|
|
rdev = NULL;
|
|
|
|
if (rdev)
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
2011-12-23 06:17:53 +07:00
|
|
|
if (rrdev && test_bit(Faulty, &rrdev->flags))
|
|
|
|
rrdev = NULL;
|
|
|
|
if (rrdev)
|
|
|
|
atomic_inc(&rrdev->nr_pending);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
2011-07-28 08:39:22 +07:00
|
|
|
/* We have already checked bad blocks for reads. Now
|
2011-12-23 06:17:53 +07:00
|
|
|
* need to check for writes. We never accept write errors
|
|
|
|
* on the replacement, so we don't to check rrdev.
|
2011-07-28 08:39:22 +07:00
|
|
|
*/
|
2016-06-06 02:32:07 +07:00
|
|
|
while (op_is_write(op) && rdev &&
|
2011-07-28 08:39:22 +07:00
|
|
|
test_bit(WriteErrorSeen, &rdev->flags)) {
|
|
|
|
sector_t first_bad;
|
|
|
|
int bad_sectors;
|
|
|
|
int bad = is_badblock(rdev, sh->sector, STRIPE_SECTORS,
|
|
|
|
&first_bad, &bad_sectors);
|
|
|
|
if (!bad)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (bad < 0) {
|
|
|
|
set_bit(BlockedBadBlocks, &rdev->flags);
|
|
|
|
if (!conf->mddev->external &&
|
2016-12-09 06:48:19 +07:00
|
|
|
conf->mddev->sb_flags) {
|
2011-07-28 08:39:22 +07:00
|
|
|
/* It is very unlikely, but we might
|
|
|
|
* still need to write out the
|
|
|
|
* bad block log - better give it
|
|
|
|
* a chance*/
|
|
|
|
md_check_recovery(conf->mddev);
|
|
|
|
}
|
2012-07-03 09:11:54 +07:00
|
|
|
/*
|
|
|
|
* Because md_wait_for_blocked_rdev
|
|
|
|
* will dec nr_pending, we must
|
|
|
|
* increment it first.
|
|
|
|
*/
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
2011-07-28 08:39:22 +07:00
|
|
|
md_wait_for_blocked_rdev(rdev, conf->mddev);
|
|
|
|
} else {
|
|
|
|
/* Acknowledged bad block - skip the write */
|
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
|
|
|
rdev = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
if (rdev) {
|
2011-12-23 06:17:53 +07:00
|
|
|
if (s->syncing || s->expanding || s->expanded
|
|
|
|
|| s->replacing)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
md_sync_acct(rdev->bdev, STRIPE_SECTORS);
|
|
|
|
|
2008-06-28 05:31:52 +07:00
|
|
|
set_bit(STRIPE_IO_STARTED, &sh->state);
|
|
|
|
|
2017-08-24 00:10:32 +07:00
|
|
|
bio_set_dev(bi, rdev->bdev);
|
2016-06-06 02:32:07 +07:00
|
|
|
bio_set_op_attrs(bi, op, op_flags);
|
|
|
|
bi->bi_end_io = op_is_write(op)
|
2012-09-12 02:26:38 +07:00
|
|
|
? raid5_end_write_request
|
|
|
|
: raid5_end_read_request;
|
|
|
|
bi->bi_private = sh;
|
|
|
|
|
2016-06-06 02:32:21 +07:00
|
|
|
pr_debug("%s: for %llu schedule op %d on disc %d\n",
|
2008-04-28 16:15:50 +07:00
|
|
|
__func__, (unsigned long long)sh->sector,
|
2016-08-06 04:35:16 +07:00
|
|
|
bi->bi_opf, i);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
atomic_inc(&sh->count);
|
2014-12-15 08:57:03 +07:00
|
|
|
if (sh != head_sh)
|
|
|
|
atomic_inc(&head_sh->count);
|
2012-05-21 06:27:00 +07:00
|
|
|
if (use_new_offset(conf, sh))
|
2013-10-12 05:44:27 +07:00
|
|
|
bi->bi_iter.bi_sector = (sh->sector
|
2012-05-21 06:27:00 +07:00
|
|
|
+ rdev->new_data_offset);
|
|
|
|
else
|
2013-10-12 05:44:27 +07:00
|
|
|
bi->bi_iter.bi_sector = (sh->sector
|
2012-05-21 06:27:00 +07:00
|
|
|
+ rdev->data_offset);
|
2014-12-15 08:57:03 +07:00
|
|
|
if (test_bit(R5_ReadNoMerge, &head_sh->dev[i].flags))
|
2016-08-06 04:35:16 +07:00
|
|
|
bi->bi_opf |= REQ_NOMERGE;
|
2012-07-31 07:04:21 +07:00
|
|
|
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
|
|
|
|
WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
|
2017-01-13 08:22:41 +07:00
|
|
|
|
|
|
|
if (!op_is_write(op) &&
|
|
|
|
test_bit(R5_InJournal, &sh->dev[i].flags))
|
|
|
|
/*
|
|
|
|
* issuing read for a page in journal, this
|
|
|
|
* must be preparing for prexor in rmw; read
|
|
|
|
* the data into orig_page
|
|
|
|
*/
|
|
|
|
sh->dev[i].vec.bv_page = sh->dev[i].orig_page;
|
|
|
|
else
|
|
|
|
sh->dev[i].vec.bv_page = sh->dev[i].page;
|
2013-05-30 13:44:39 +07:00
|
|
|
bi->bi_vcnt = 1;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
|
|
|
|
bi->bi_io_vec[0].bv_offset = 0;
|
2013-10-12 05:44:27 +07:00
|
|
|
bi->bi_iter.bi_size = STRIPE_SIZE;
|
2018-04-20 00:28:10 +07:00
|
|
|
bi->bi_write_hint = sh->dev[i].write_hint;
|
|
|
|
if (!rrdev)
|
|
|
|
sh->dev[i].write_hint = RWF_WRITE_LIFE_NOT_SET;
|
2013-10-19 13:50:28 +07:00
|
|
|
/*
|
|
|
|
* If this is discard request, set bi_vcnt 0. We don't
|
|
|
|
* want to confuse SCSI because SCSI will replace payload
|
|
|
|
*/
|
2016-06-06 02:32:07 +07:00
|
|
|
if (op == REQ_OP_DISCARD)
|
2013-10-19 13:50:28 +07:00
|
|
|
bi->bi_vcnt = 0;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (rrdev)
|
|
|
|
set_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags);
|
2013-03-08 05:22:01 +07:00
|
|
|
|
|
|
|
if (conf->mddev->gendisk)
|
2017-08-24 00:10:32 +07:00
|
|
|
trace_block_bio_remap(bi->bi_disk->queue,
|
2013-03-08 05:22:01 +07:00
|
|
|
bi, disk_devt(conf->mddev->gendisk),
|
|
|
|
sh->dev[i].sector);
|
2017-03-04 13:06:12 +07:00
|
|
|
if (should_defer && op_is_write(op))
|
|
|
|
bio_list_add(&pending_bios, bi);
|
|
|
|
else
|
|
|
|
generic_make_request(bi);
|
2011-12-23 06:17:53 +07:00
|
|
|
}
|
|
|
|
if (rrdev) {
|
2011-12-23 06:17:53 +07:00
|
|
|
if (s->syncing || s->expanding || s->expanded
|
|
|
|
|| s->replacing)
|
2011-12-23 06:17:53 +07:00
|
|
|
md_sync_acct(rrdev->bdev, STRIPE_SECTORS);
|
|
|
|
|
|
|
|
set_bit(STRIPE_IO_STARTED, &sh->state);
|
|
|
|
|
2017-08-24 00:10:32 +07:00
|
|
|
bio_set_dev(rbi, rrdev->bdev);
|
2016-06-06 02:32:07 +07:00
|
|
|
bio_set_op_attrs(rbi, op, op_flags);
|
|
|
|
BUG_ON(!op_is_write(op));
|
2012-09-12 02:26:38 +07:00
|
|
|
rbi->bi_end_io = raid5_end_write_request;
|
|
|
|
rbi->bi_private = sh;
|
|
|
|
|
2016-06-06 02:32:21 +07:00
|
|
|
pr_debug("%s: for %llu schedule op %d on "
|
2011-12-23 06:17:53 +07:00
|
|
|
"replacement disc %d\n",
|
|
|
|
__func__, (unsigned long long)sh->sector,
|
2016-08-06 04:35:16 +07:00
|
|
|
rbi->bi_opf, i);
|
2011-12-23 06:17:53 +07:00
|
|
|
atomic_inc(&sh->count);
|
2014-12-15 08:57:03 +07:00
|
|
|
if (sh != head_sh)
|
|
|
|
atomic_inc(&head_sh->count);
|
2012-05-21 06:27:00 +07:00
|
|
|
if (use_new_offset(conf, sh))
|
2013-10-12 05:44:27 +07:00
|
|
|
rbi->bi_iter.bi_sector = (sh->sector
|
2012-05-21 06:27:00 +07:00
|
|
|
+ rrdev->new_data_offset);
|
|
|
|
else
|
2013-10-12 05:44:27 +07:00
|
|
|
rbi->bi_iter.bi_sector = (sh->sector
|
2012-05-21 06:27:00 +07:00
|
|
|
+ rrdev->data_offset);
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
|
|
|
|
WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
|
|
|
|
sh->dev[i].rvec.bv_page = sh->dev[i].page;
|
2013-05-30 13:44:39 +07:00
|
|
|
rbi->bi_vcnt = 1;
|
2011-12-23 06:17:53 +07:00
|
|
|
rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
|
|
|
|
rbi->bi_io_vec[0].bv_offset = 0;
|
2013-10-12 05:44:27 +07:00
|
|
|
rbi->bi_iter.bi_size = STRIPE_SIZE;
|
2018-04-20 00:28:10 +07:00
|
|
|
rbi->bi_write_hint = sh->dev[i].write_hint;
|
|
|
|
sh->dev[i].write_hint = RWF_WRITE_LIFE_NOT_SET;
|
2013-10-19 13:50:28 +07:00
|
|
|
/*
|
|
|
|
* If this is discard request, set bi_vcnt 0. We don't
|
|
|
|
* want to confuse SCSI because SCSI will replace payload
|
|
|
|
*/
|
2016-06-06 02:32:07 +07:00
|
|
|
if (op == REQ_OP_DISCARD)
|
2013-10-19 13:50:28 +07:00
|
|
|
rbi->bi_vcnt = 0;
|
2013-03-08 05:22:01 +07:00
|
|
|
if (conf->mddev->gendisk)
|
2017-08-24 00:10:32 +07:00
|
|
|
trace_block_bio_remap(rbi->bi_disk->queue,
|
2013-03-08 05:22:01 +07:00
|
|
|
rbi, disk_devt(conf->mddev->gendisk),
|
|
|
|
sh->dev[i].sector);
|
2017-03-04 13:06:12 +07:00
|
|
|
if (should_defer && op_is_write(op))
|
|
|
|
bio_list_add(&pending_bios, rbi);
|
|
|
|
else
|
|
|
|
generic_make_request(rbi);
|
2011-12-23 06:17:53 +07:00
|
|
|
}
|
|
|
|
if (!rdev && !rrdev) {
|
2016-06-06 02:32:07 +07:00
|
|
|
if (op_is_write(op))
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
set_bit(STRIPE_DEGRADED, &sh->state);
|
2016-06-06 02:32:21 +07:00
|
|
|
pr_debug("skip op %d on disc %d for sector %llu\n",
|
2016-08-06 04:35:16 +07:00
|
|
|
bi->bi_opf, i, (unsigned long long)sh->sector);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
clear_bit(R5_LOCKED, &sh->dev[i].flags);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
}
|
2014-12-15 08:57:03 +07:00
|
|
|
|
|
|
|
if (!head_sh->batch_head)
|
|
|
|
continue;
|
|
|
|
sh = list_first_entry(&sh->batch_list, struct stripe_head,
|
|
|
|
batch_list);
|
|
|
|
if (sh != head_sh)
|
|
|
|
goto again;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
2017-03-04 13:06:12 +07:00
|
|
|
|
|
|
|
if (should_defer && !bio_list_empty(&pending_bios))
|
|
|
|
defer_issue_bios(conf, head_sh->sector, &pending_bios);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct dma_async_tx_descriptor *
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
async_copy_data(int frombio, struct bio *bio, struct page **page,
|
|
|
|
sector_t sector, struct dma_async_tx_descriptor *tx,
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
struct stripe_head *sh, int no_skipcopy)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
2013-11-24 08:19:00 +07:00
|
|
|
struct bio_vec bvl;
|
|
|
|
struct bvec_iter iter;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
struct page *bio_page;
|
|
|
|
int page_offset;
|
2009-06-04 01:43:59 +07:00
|
|
|
struct async_submit_ctl submit;
|
2009-09-09 07:42:50 +07:00
|
|
|
enum async_tx_flags flags = 0;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
if (bio->bi_iter.bi_sector >= sector)
|
|
|
|
page_offset = (signed)(bio->bi_iter.bi_sector - sector) * 512;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
else
|
2013-10-12 05:44:27 +07:00
|
|
|
page_offset = (signed)(sector - bio->bi_iter.bi_sector) * -512;
|
2009-06-04 01:43:59 +07:00
|
|
|
|
2009-09-09 07:42:50 +07:00
|
|
|
if (frombio)
|
|
|
|
flags |= ASYNC_TX_FENCE;
|
|
|
|
init_async_submit(&submit, flags, tx, NULL, NULL, NULL);
|
|
|
|
|
2013-11-24 08:19:00 +07:00
|
|
|
bio_for_each_segment(bvl, bio, iter) {
|
|
|
|
int len = bvl.bv_len;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
int clen;
|
|
|
|
int b_offset = 0;
|
|
|
|
|
|
|
|
if (page_offset < 0) {
|
|
|
|
b_offset = -page_offset;
|
|
|
|
page_offset += b_offset;
|
|
|
|
len -= b_offset;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (len > 0 && page_offset + len > STRIPE_SIZE)
|
|
|
|
clen = STRIPE_SIZE - page_offset;
|
|
|
|
else
|
|
|
|
clen = len;
|
|
|
|
|
|
|
|
if (clen > 0) {
|
2013-11-24 08:19:00 +07:00
|
|
|
b_offset += bvl.bv_offset;
|
|
|
|
bio_page = bvl.bv_page;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
if (frombio) {
|
|
|
|
if (sh->raid_conf->skip_copy &&
|
|
|
|
b_offset == 0 && page_offset == 0 &&
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
clen == STRIPE_SIZE &&
|
|
|
|
!no_skipcopy)
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
*page = bio_page;
|
|
|
|
else
|
|
|
|
tx = async_memcpy(*page, bio_page, page_offset,
|
2009-06-04 01:43:59 +07:00
|
|
|
b_offset, clen, &submit);
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
} else
|
|
|
|
tx = async_memcpy(bio_page, *page, b_offset,
|
2009-06-04 01:43:59 +07:00
|
|
|
page_offset, clen, &submit);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
2009-06-04 01:43:59 +07:00
|
|
|
/* chain the operations */
|
|
|
|
submit.depend_tx = tx;
|
|
|
|
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
if (clen < len) /* hit end of page */
|
|
|
|
break;
|
|
|
|
page_offset += len;
|
|
|
|
}
|
|
|
|
|
|
|
|
return tx;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void ops_complete_biofill(void *stripe_head_ref)
|
|
|
|
{
|
|
|
|
struct stripe_head *sh = stripe_head_ref;
|
2007-09-25 00:06:13 +07:00
|
|
|
int i;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
|
|
|
/* clear completed biofills */
|
|
|
|
for (i = sh->disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
|
|
|
|
/* acknowledge completion of a biofill operation */
|
2007-09-25 00:06:13 +07:00
|
|
|
/* and check if we need to reply to a read request,
|
|
|
|
* new R5_Wantfill requests are held off until
|
2008-06-28 05:31:58 +07:00
|
|
|
* !STRIPE_BIOFILL_RUN
|
2007-09-25 00:06:13 +07:00
|
|
|
*/
|
|
|
|
if (test_and_clear_bit(R5_Wantfill, &dev->flags)) {
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
struct bio *rbi, *rbi2;
|
|
|
|
|
|
|
|
BUG_ON(!dev->read);
|
|
|
|
rbi = dev->read;
|
|
|
|
dev->read = NULL;
|
2013-10-12 05:44:27 +07:00
|
|
|
while (rbi && rbi->bi_iter.bi_sector <
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
dev->sector + STRIPE_SECTORS) {
|
|
|
|
rbi2 = r5_next_bio(rbi, dev->sector);
|
2017-03-15 10:05:13 +07:00
|
|
|
bio_endio(rbi);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
rbi = rbi2;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2008-06-28 05:31:58 +07:00
|
|
|
clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2007-09-25 00:06:13 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static void ops_run_biofill(struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
struct dma_async_tx_descriptor *tx = NULL;
|
2009-06-04 01:43:59 +07:00
|
|
|
struct async_submit_ctl submit;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
int i;
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
|
|
|
for (i = sh->disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
if (test_bit(R5_Wantfill, &dev->flags)) {
|
|
|
|
struct bio *rbi;
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 13:01:31 +07:00
|
|
|
spin_lock_irq(&sh->stripe_lock);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
dev->read = rbi = dev->toread;
|
|
|
|
dev->toread = NULL;
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 13:01:31 +07:00
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
2013-10-12 05:44:27 +07:00
|
|
|
while (rbi && rbi->bi_iter.bi_sector <
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
dev->sector + STRIPE_SECTORS) {
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
tx = async_copy_data(0, rbi, &dev->page,
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
dev->sector, tx, sh, 0);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
rbi = r5_next_bio(rbi, dev->sector);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
atomic_inc(&sh->count);
|
2009-06-04 01:43:59 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_biofill, sh, NULL);
|
|
|
|
async_trigger_callback(&submit);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2009-08-30 09:13:11 +07:00
|
|
|
static void mark_target_uptodate(struct stripe_head *sh, int target)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
2009-08-30 09:13:11 +07:00
|
|
|
struct r5dev *tgt;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2009-08-30 09:13:11 +07:00
|
|
|
if (target < 0)
|
|
|
|
return;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2009-08-30 09:13:11 +07:00
|
|
|
tgt = &sh->dev[target];
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
set_bit(R5_UPTODATE, &tgt->flags);
|
|
|
|
BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
|
|
|
|
clear_bit(R5_Wantcompute, &tgt->flags);
|
2009-08-30 09:13:11 +07:00
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
static void ops_complete_compute(void *stripe_head_ref)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *sh = stripe_head_ref;
|
|
|
|
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
/* mark the computed target(s) as uptodate */
|
2009-08-30 09:13:11 +07:00
|
|
|
mark_target_uptodate(sh, sh->ops.target);
|
2009-07-15 03:40:19 +07:00
|
|
|
mark_target_uptodate(sh, sh->ops.target2);
|
2009-08-30 09:13:11 +07:00
|
|
|
|
2008-06-28 05:31:57 +07:00
|
|
|
clear_bit(STRIPE_COMPUTE_RUN, &sh->state);
|
|
|
|
if (sh->check_state == check_state_compute_run)
|
|
|
|
sh->check_state = check_state_compute_result;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2009-07-15 01:50:52 +07:00
|
|
|
/* return a pointer to the address conversion region of the scribble buffer */
|
|
|
|
static addr_conv_t *to_addr_conv(struct stripe_head *sh,
|
2014-12-15 08:57:02 +07:00
|
|
|
struct raid5_percpu *percpu, int i)
|
2009-07-15 01:50:52 +07:00
|
|
|
{
|
2014-12-15 08:57:02 +07:00
|
|
|
void *addr;
|
|
|
|
|
|
|
|
addr = flex_array_get(percpu->scribble, i);
|
|
|
|
return addr + sizeof(struct page *) * (sh->disks + 2);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* return a pointer to the address conversion region of the scribble buffer */
|
|
|
|
static struct page **to_addr_page(struct raid5_percpu *percpu, int i)
|
|
|
|
{
|
|
|
|
void *addr;
|
|
|
|
|
|
|
|
addr = flex_array_get(percpu->scribble, i);
|
|
|
|
return addr;
|
2009-07-15 01:50:52 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct dma_async_tx_descriptor *
|
|
|
|
ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
|
|
|
int disks = sh->disks;
|
2014-12-15 08:57:02 +07:00
|
|
|
struct page **xor_srcs = to_addr_page(percpu, 0);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
int target = sh->ops.target;
|
|
|
|
struct r5dev *tgt = &sh->dev[target];
|
|
|
|
struct page *xor_dest = tgt->page;
|
|
|
|
int count = 0;
|
|
|
|
struct dma_async_tx_descriptor *tx;
|
2009-06-04 01:43:59 +07:00
|
|
|
struct async_submit_ctl submit;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
int i;
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
|
|
|
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
pr_debug("%s: stripe %llu block: %d\n",
|
2008-04-28 16:15:50 +07:00
|
|
|
__func__, (unsigned long long)sh->sector, target);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
|
|
|
|
|
|
|
|
for (i = disks; i--; )
|
|
|
|
if (i != target)
|
|
|
|
xor_srcs[count++] = sh->dev[i].page;
|
|
|
|
|
|
|
|
atomic_inc(&sh->count);
|
|
|
|
|
2009-09-09 07:42:50 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL,
|
2014-12-15 08:57:02 +07:00
|
|
|
ops_complete_compute, sh, to_addr_conv(sh, percpu, 0));
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
if (unlikely(count == 1))
|
2009-06-04 01:43:59 +07:00
|
|
|
tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
else
|
2009-06-04 01:43:59 +07:00
|
|
|
tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
return tx;
|
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
/* set_syndrome_sources - populate source buffers for gen_syndrome
|
|
|
|
* @srcs - (struct page *) array of size sh->disks
|
|
|
|
* @sh - stripe_head to parse
|
|
|
|
*
|
|
|
|
* Populates srcs in proper layout order for the stripe and returns the
|
|
|
|
* 'count' of sources to be used in a call to async_gen_syndrome. The P
|
|
|
|
* destination buffer is recorded in srcs[count] and the Q destination
|
|
|
|
* is recorded in srcs[count+1]].
|
|
|
|
*/
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
static int set_syndrome_sources(struct page **srcs,
|
|
|
|
struct stripe_head *sh,
|
|
|
|
int srctype)
|
2009-07-15 03:40:19 +07:00
|
|
|
{
|
|
|
|
int disks = sh->disks;
|
|
|
|
int syndrome_disks = sh->ddf_layout ? disks : (disks - 2);
|
|
|
|
int d0_idx = raid6_d0(sh);
|
|
|
|
int count;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < disks; i++)
|
2009-10-16 12:40:25 +07:00
|
|
|
srcs[i] = NULL;
|
2009-07-15 03:40:19 +07:00
|
|
|
|
|
|
|
count = 0;
|
|
|
|
i = d0_idx;
|
|
|
|
do {
|
|
|
|
int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
struct r5dev *dev = &sh->dev[i];
|
2009-07-15 03:40:19 +07:00
|
|
|
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
if (i == sh->qd_idx || i == sh->pd_idx ||
|
|
|
|
(srctype == SYNDROME_SRC_ALL) ||
|
|
|
|
(srctype == SYNDROME_SRC_WANT_DRAIN &&
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
(test_bit(R5_Wantdrain, &dev->flags) ||
|
|
|
|
test_bit(R5_InJournal, &dev->flags))) ||
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
(srctype == SYNDROME_SRC_WRITTEN &&
|
2017-03-14 03:44:35 +07:00
|
|
|
(dev->written ||
|
|
|
|
test_bit(R5_InJournal, &dev->flags)))) {
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
if (test_bit(R5_InJournal, &dev->flags))
|
|
|
|
srcs[slot] = sh->dev[i].orig_page;
|
|
|
|
else
|
|
|
|
srcs[slot] = sh->dev[i].page;
|
|
|
|
}
|
2009-07-15 03:40:19 +07:00
|
|
|
i = raid6_next_disk(i, disks);
|
|
|
|
} while (i != d0_idx);
|
|
|
|
|
2009-10-16 12:27:34 +07:00
|
|
|
return syndrome_disks;
|
2009-07-15 03:40:19 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct dma_async_tx_descriptor *
|
|
|
|
ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
|
|
|
|
{
|
|
|
|
int disks = sh->disks;
|
2014-12-15 08:57:02 +07:00
|
|
|
struct page **blocks = to_addr_page(percpu, 0);
|
2009-07-15 03:40:19 +07:00
|
|
|
int target;
|
|
|
|
int qd_idx = sh->qd_idx;
|
|
|
|
struct dma_async_tx_descriptor *tx;
|
|
|
|
struct async_submit_ctl submit;
|
|
|
|
struct r5dev *tgt;
|
|
|
|
struct page *dest;
|
|
|
|
int i;
|
|
|
|
int count;
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2009-07-15 03:40:19 +07:00
|
|
|
if (sh->ops.target < 0)
|
|
|
|
target = sh->ops.target2;
|
|
|
|
else if (sh->ops.target2 < 0)
|
|
|
|
target = sh->ops.target;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
else
|
2009-07-15 03:40:19 +07:00
|
|
|
/* we should only have one valid target */
|
|
|
|
BUG();
|
|
|
|
BUG_ON(target < 0);
|
|
|
|
pr_debug("%s: stripe %llu block: %d\n",
|
|
|
|
__func__, (unsigned long long)sh->sector, target);
|
|
|
|
|
|
|
|
tgt = &sh->dev[target];
|
|
|
|
BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
|
|
|
|
dest = tgt->page;
|
|
|
|
|
|
|
|
atomic_inc(&sh->count);
|
|
|
|
|
|
|
|
if (target == qd_idx) {
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
|
2009-07-15 03:40:19 +07:00
|
|
|
blocks[count] = NULL; /* regenerating p is not necessary */
|
|
|
|
BUG_ON(blocks[count+1] != dest); /* q should already be set */
|
2009-09-09 07:42:50 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
|
|
|
|
ops_complete_compute, sh,
|
2014-12-15 08:57:02 +07:00
|
|
|
to_addr_conv(sh, percpu, 0));
|
2009-07-15 03:40:19 +07:00
|
|
|
tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
|
|
|
|
} else {
|
|
|
|
/* Compute any data- or p-drive using XOR */
|
|
|
|
count = 0;
|
|
|
|
for (i = disks; i-- ; ) {
|
|
|
|
if (i == target || i == qd_idx)
|
|
|
|
continue;
|
|
|
|
blocks[count++] = sh->dev[i].page;
|
|
|
|
}
|
|
|
|
|
2009-09-09 07:42:50 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
|
|
|
|
NULL, ops_complete_compute, sh,
|
2014-12-15 08:57:02 +07:00
|
|
|
to_addr_conv(sh, percpu, 0));
|
2009-07-15 03:40:19 +07:00
|
|
|
tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit);
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
return tx;
|
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
static struct dma_async_tx_descriptor *
|
|
|
|
ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
|
|
|
|
{
|
|
|
|
int i, count, disks = sh->disks;
|
|
|
|
int syndrome_disks = sh->ddf_layout ? disks : disks-2;
|
|
|
|
int d0_idx = raid6_d0(sh);
|
|
|
|
int faila = -1, failb = -1;
|
|
|
|
int target = sh->ops.target;
|
|
|
|
int target2 = sh->ops.target2;
|
|
|
|
struct r5dev *tgt = &sh->dev[target];
|
|
|
|
struct r5dev *tgt2 = &sh->dev[target2];
|
|
|
|
struct dma_async_tx_descriptor *tx;
|
2014-12-15 08:57:02 +07:00
|
|
|
struct page **blocks = to_addr_page(percpu, 0);
|
2009-07-15 03:40:19 +07:00
|
|
|
struct async_submit_ctl submit;
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2009-07-15 03:40:19 +07:00
|
|
|
pr_debug("%s: stripe %llu block1: %d block2: %d\n",
|
|
|
|
__func__, (unsigned long long)sh->sector, target, target2);
|
|
|
|
BUG_ON(target < 0 || target2 < 0);
|
|
|
|
BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
|
|
|
|
BUG_ON(!test_bit(R5_Wantcompute, &tgt2->flags));
|
|
|
|
|
2009-09-17 02:24:54 +07:00
|
|
|
/* we need to open-code set_syndrome_sources to handle the
|
2009-07-15 03:40:19 +07:00
|
|
|
* slot number conversion for 'faila' and 'failb'
|
|
|
|
*/
|
|
|
|
for (i = 0; i < disks ; i++)
|
2009-10-16 12:40:25 +07:00
|
|
|
blocks[i] = NULL;
|
2009-07-15 03:40:19 +07:00
|
|
|
count = 0;
|
|
|
|
i = d0_idx;
|
|
|
|
do {
|
|
|
|
int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);
|
|
|
|
|
|
|
|
blocks[slot] = sh->dev[i].page;
|
|
|
|
|
|
|
|
if (i == target)
|
|
|
|
faila = slot;
|
|
|
|
if (i == target2)
|
|
|
|
failb = slot;
|
|
|
|
i = raid6_next_disk(i, disks);
|
|
|
|
} while (i != d0_idx);
|
|
|
|
|
|
|
|
BUG_ON(faila == failb);
|
|
|
|
if (failb < faila)
|
|
|
|
swap(faila, failb);
|
|
|
|
pr_debug("%s: stripe: %llu faila: %d failb: %d\n",
|
|
|
|
__func__, (unsigned long long)sh->sector, faila, failb);
|
|
|
|
|
|
|
|
atomic_inc(&sh->count);
|
|
|
|
|
|
|
|
if (failb == syndrome_disks+1) {
|
|
|
|
/* Q disk is one of the missing disks */
|
|
|
|
if (faila == syndrome_disks) {
|
|
|
|
/* Missing P+Q, just recompute */
|
2009-09-09 07:42:50 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
|
|
|
|
ops_complete_compute, sh,
|
2014-12-15 08:57:02 +07:00
|
|
|
to_addr_conv(sh, percpu, 0));
|
2009-10-16 12:27:34 +07:00
|
|
|
return async_gen_syndrome(blocks, 0, syndrome_disks+2,
|
2009-07-15 03:40:19 +07:00
|
|
|
STRIPE_SIZE, &submit);
|
|
|
|
} else {
|
|
|
|
struct page *dest;
|
|
|
|
int data_target;
|
|
|
|
int qd_idx = sh->qd_idx;
|
|
|
|
|
|
|
|
/* Missing D+Q: recompute D from P, then recompute Q */
|
|
|
|
if (target == qd_idx)
|
|
|
|
data_target = target2;
|
|
|
|
else
|
|
|
|
data_target = target;
|
|
|
|
|
|
|
|
count = 0;
|
|
|
|
for (i = disks; i-- ; ) {
|
|
|
|
if (i == data_target || i == qd_idx)
|
|
|
|
continue;
|
|
|
|
blocks[count++] = sh->dev[i].page;
|
|
|
|
}
|
|
|
|
dest = sh->dev[data_target].page;
|
2009-09-09 07:42:50 +07:00
|
|
|
init_async_submit(&submit,
|
|
|
|
ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
|
|
|
|
NULL, NULL, NULL,
|
2014-12-15 08:57:02 +07:00
|
|
|
to_addr_conv(sh, percpu, 0));
|
2009-07-15 03:40:19 +07:00
|
|
|
tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE,
|
|
|
|
&submit);
|
|
|
|
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
|
2009-09-09 07:42:50 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_FENCE, tx,
|
|
|
|
ops_complete_compute, sh,
|
2014-12-15 08:57:02 +07:00
|
|
|
to_addr_conv(sh, percpu, 0));
|
2009-07-15 03:40:19 +07:00
|
|
|
return async_gen_syndrome(blocks, 0, count+2,
|
|
|
|
STRIPE_SIZE, &submit);
|
|
|
|
}
|
|
|
|
} else {
|
2009-09-17 02:24:54 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
|
|
|
|
ops_complete_compute, sh,
|
2014-12-15 08:57:02 +07:00
|
|
|
to_addr_conv(sh, percpu, 0));
|
2009-09-17 02:24:54 +07:00
|
|
|
if (failb == syndrome_disks) {
|
|
|
|
/* We're missing D+P. */
|
|
|
|
return async_raid6_datap_recov(syndrome_disks+2,
|
|
|
|
STRIPE_SIZE, faila,
|
|
|
|
blocks, &submit);
|
|
|
|
} else {
|
|
|
|
/* We're missing D+D. */
|
|
|
|
return async_raid6_2data_recov(syndrome_disks+2,
|
|
|
|
STRIPE_SIZE, faila, failb,
|
|
|
|
blocks, &submit);
|
|
|
|
}
|
2009-07-15 03:40:19 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
static void ops_complete_prexor(void *stripe_head_ref)
|
|
|
|
{
|
|
|
|
struct stripe_head *sh = stripe_head_ref;
|
|
|
|
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
|
|
|
if (r5c_is_writeback(sh->raid_conf->log))
|
|
|
|
/*
|
|
|
|
* raid5-cache write back uses orig_page during prexor.
|
|
|
|
* After prexor, it is time to free orig_page
|
|
|
|
*/
|
|
|
|
r5c_release_extra_page(sh);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct dma_async_tx_descriptor *
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
|
|
|
|
struct dma_async_tx_descriptor *tx)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
|
|
|
int disks = sh->disks;
|
2014-12-15 08:57:02 +07:00
|
|
|
struct page **xor_srcs = to_addr_page(percpu, 0);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
int count = 0, pd_idx = sh->pd_idx, i;
|
2009-06-04 01:43:59 +07:00
|
|
|
struct async_submit_ctl submit;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
/* existing parity data subtracted */
|
|
|
|
struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
/* Only process blocks that are known to be uptodate */
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
if (test_bit(R5_InJournal, &dev->flags))
|
|
|
|
xor_srcs[count++] = dev->orig_page;
|
|
|
|
else if (test_bit(R5_Wantdrain, &dev->flags))
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
xor_srcs[count++] = dev->page;
|
|
|
|
}
|
|
|
|
|
2009-09-09 07:42:50 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
|
2014-12-15 08:57:02 +07:00
|
|
|
ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
|
2009-06-04 01:43:59 +07:00
|
|
|
tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
return tx;
|
|
|
|
}
|
|
|
|
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
static struct dma_async_tx_descriptor *
|
|
|
|
ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
|
|
|
|
struct dma_async_tx_descriptor *tx)
|
|
|
|
{
|
|
|
|
struct page **blocks = to_addr_page(percpu, 0);
|
|
|
|
int count;
|
|
|
|
struct async_submit_ctl submit;
|
|
|
|
|
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
|
|
|
count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_WANT_DRAIN);
|
|
|
|
|
|
|
|
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_PQ_XOR_DST, tx,
|
|
|
|
ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
|
|
|
|
tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
|
|
|
|
|
|
|
|
return tx;
|
|
|
|
}
|
|
|
|
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
static struct dma_async_tx_descriptor *
|
2008-06-28 05:32:06 +07:00
|
|
|
ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
int disks = sh->disks;
|
2008-06-28 05:32:06 +07:00
|
|
|
int i;
|
2014-12-15 08:57:03 +07:00
|
|
|
struct stripe_head *head_sh = sh;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
|
|
|
for (i = disks; i--; ) {
|
2014-12-15 08:57:03 +07:00
|
|
|
struct r5dev *dev;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
struct bio *chosen;
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
sh = head_sh;
|
|
|
|
if (test_and_clear_bit(R5_Wantdrain, &head_sh->dev[i].flags)) {
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
struct bio *wbi;
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
again:
|
|
|
|
dev = &sh->dev[i];
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
/*
|
|
|
|
* clear R5_InJournal, so when rewriting a page in
|
|
|
|
* journal, it is not skipped by r5l_log_stripe()
|
|
|
|
*/
|
|
|
|
clear_bit(R5_InJournal, &dev->flags);
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 13:01:31 +07:00
|
|
|
spin_lock_irq(&sh->stripe_lock);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
chosen = dev->towrite;
|
|
|
|
dev->towrite = NULL;
|
2014-12-15 08:57:03 +07:00
|
|
|
sh->overwrite_disks = 0;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
BUG_ON(dev->written);
|
|
|
|
wbi = dev->written = chosen;
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 13:01:31 +07:00
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
WARN_ON(dev->page != dev->orig_page);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
while (wbi && wbi->bi_iter.bi_sector <
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
dev->sector + STRIPE_SECTORS) {
|
2016-08-06 04:35:16 +07:00
|
|
|
if (wbi->bi_opf & REQ_FUA)
|
2010-09-03 16:56:18 +07:00
|
|
|
set_bit(R5_WantFUA, &dev->flags);
|
2016-08-06 04:35:16 +07:00
|
|
|
if (wbi->bi_opf & REQ_SYNC)
|
2012-05-22 10:55:05 +07:00
|
|
|
set_bit(R5_SyncIO, &dev->flags);
|
2016-06-06 02:32:07 +07:00
|
|
|
if (bio_op(wbi) == REQ_OP_DISCARD)
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
set_bit(R5_Discard, &dev->flags);
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
else {
|
|
|
|
tx = async_copy_data(1, wbi, &dev->page,
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
dev->sector, tx, sh,
|
|
|
|
r5c_is_writeback(conf->log));
|
|
|
|
if (dev->page != dev->orig_page &&
|
|
|
|
!r5c_is_writeback(conf->log)) {
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
set_bit(R5_SkipCopy, &dev->flags);
|
|
|
|
clear_bit(R5_UPTODATE, &dev->flags);
|
|
|
|
clear_bit(R5_OVERWRITE, &dev->flags);
|
|
|
|
}
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
wbi = r5_next_bio(wbi, dev->sector);
|
|
|
|
}
|
2014-12-15 08:57:03 +07:00
|
|
|
|
|
|
|
if (head_sh->batch_head) {
|
|
|
|
sh = list_first_entry(&sh->batch_list,
|
|
|
|
struct stripe_head,
|
|
|
|
batch_list);
|
|
|
|
if (sh == head_sh)
|
|
|
|
continue;
|
|
|
|
goto again;
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return tx;
|
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
static void ops_complete_reconstruct(void *stripe_head_ref)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *sh = stripe_head_ref;
|
2009-07-15 03:40:19 +07:00
|
|
|
int disks = sh->disks;
|
|
|
|
int pd_idx = sh->pd_idx;
|
|
|
|
int qd_idx = sh->qd_idx;
|
|
|
|
int i;
|
2012-10-11 09:49:49 +07:00
|
|
|
bool fua = false, sync = false, discard = false;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
2012-05-22 10:55:05 +07:00
|
|
|
for (i = disks; i--; ) {
|
2010-09-03 16:56:18 +07:00
|
|
|
fua |= test_bit(R5_WantFUA, &sh->dev[i].flags);
|
2012-05-22 10:55:05 +07:00
|
|
|
sync |= test_bit(R5_SyncIO, &sh->dev[i].flags);
|
2012-10-11 09:49:49 +07:00
|
|
|
discard |= test_bit(R5_Discard, &sh->dev[i].flags);
|
2012-05-22 10:55:05 +07:00
|
|
|
}
|
2010-09-03 16:56:18 +07:00
|
|
|
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
2009-07-15 03:40:19 +07:00
|
|
|
|
2010-09-03 16:56:18 +07:00
|
|
|
if (dev->written || i == pd_idx || i == qd_idx) {
|
raid5: Set R5_Expanded on parity devices as well as data.
When reshaping a fully degraded raid5/raid6 to a larger
nubmer of devices, the new device(s) are not in-sync
and so that can make the newly grown stripe appear to be
"failed".
To avoid this, we set the R5_Expanded flag to say "Even though
this device is not fully in-sync, this block is safe so
don't treat the device as failed for this stripe".
This flag is set for data devices, not not for parity devices.
Consequently, if you have a RAID6 with two devices that are partly
recovered and a spare, and start a reshape to include the spare,
then when the reshape gets past the point where the recovery was
up to, it will think the stripes are failed and will get into
an infinite loop, failing to make progress.
So when contructing parity on an EXPAND_READY stripe,
set R5_Expanded.
Reported-by: Curt <lightspd@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-10-17 12:18:36 +07:00
|
|
|
if (!discard && !test_bit(R5_SkipCopy, &dev->flags)) {
|
2012-10-11 09:49:49 +07:00
|
|
|
set_bit(R5_UPTODATE, &dev->flags);
|
raid5: Set R5_Expanded on parity devices as well as data.
When reshaping a fully degraded raid5/raid6 to a larger
nubmer of devices, the new device(s) are not in-sync
and so that can make the newly grown stripe appear to be
"failed".
To avoid this, we set the R5_Expanded flag to say "Even though
this device is not fully in-sync, this block is safe so
don't treat the device as failed for this stripe".
This flag is set for data devices, not not for parity devices.
Consequently, if you have a RAID6 with two devices that are partly
recovered and a spare, and start a reshape to include the spare,
then when the reshape gets past the point where the recovery was
up to, it will think the stripes are failed and will get into
an infinite loop, failing to make progress.
So when contructing parity on an EXPAND_READY stripe,
set R5_Expanded.
Reported-by: Curt <lightspd@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-10-17 12:18:36 +07:00
|
|
|
if (test_bit(STRIPE_EXPAND_READY, &sh->state))
|
|
|
|
set_bit(R5_Expanded, &dev->flags);
|
|
|
|
}
|
2010-09-03 16:56:18 +07:00
|
|
|
if (fua)
|
|
|
|
set_bit(R5_WantFUA, &dev->flags);
|
2012-05-22 10:55:05 +07:00
|
|
|
if (sync)
|
|
|
|
set_bit(R5_SyncIO, &dev->flags);
|
2010-09-03 16:56:18 +07:00
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2008-06-28 05:32:06 +07:00
|
|
|
if (sh->reconstruct_state == reconstruct_state_drain_run)
|
|
|
|
sh->reconstruct_state = reconstruct_state_drain_result;
|
|
|
|
else if (sh->reconstruct_state == reconstruct_state_prexor_drain_run)
|
|
|
|
sh->reconstruct_state = reconstruct_state_prexor_drain_result;
|
|
|
|
else {
|
|
|
|
BUG_ON(sh->reconstruct_state != reconstruct_state_run);
|
|
|
|
sh->reconstruct_state = reconstruct_state_result;
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2009-07-15 03:40:19 +07:00
|
|
|
ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
|
|
|
|
struct dma_async_tx_descriptor *tx)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
|
|
|
int disks = sh->disks;
|
2014-12-15 08:57:03 +07:00
|
|
|
struct page **xor_srcs;
|
2009-06-04 01:43:59 +07:00
|
|
|
struct async_submit_ctl submit;
|
2014-12-15 08:57:03 +07:00
|
|
|
int count, pd_idx = sh->pd_idx, i;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
struct page *xor_dest;
|
2008-06-28 05:32:06 +07:00
|
|
|
int prexor = 0;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
unsigned long flags;
|
2014-12-15 08:57:03 +07:00
|
|
|
int j = 0;
|
|
|
|
struct stripe_head *head_sh = sh;
|
|
|
|
int last_stripe;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
for (i = 0; i < sh->disks; i++) {
|
|
|
|
if (pd_idx == i)
|
|
|
|
continue;
|
|
|
|
if (!test_bit(R5_Discard, &sh->dev[i].flags))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (i >= sh->disks) {
|
|
|
|
atomic_inc(&sh->count);
|
|
|
|
set_bit(R5_Discard, &sh->dev[pd_idx].flags);
|
|
|
|
ops_complete_reconstruct(sh);
|
|
|
|
return;
|
|
|
|
}
|
2014-12-15 08:57:03 +07:00
|
|
|
again:
|
|
|
|
count = 0;
|
|
|
|
xor_srcs = to_addr_page(percpu, j);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
/* check if prexor is active which means only process blocks
|
|
|
|
* that are part of a read-modify-write (written)
|
|
|
|
*/
|
2014-12-15 08:57:03 +07:00
|
|
|
if (head_sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
|
2008-06-28 05:32:06 +07:00
|
|
|
prexor = 1;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
|
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
if (head_sh->dev[i].written ||
|
|
|
|
test_bit(R5_InJournal, &head_sh->dev[i].flags))
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
xor_srcs[count++] = dev->page;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
xor_dest = sh->dev[pd_idx].page;
|
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
if (i != pd_idx)
|
|
|
|
xor_srcs[count++] = dev->page;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* 1/ if we prexor'd then the dest is reused as a source
|
|
|
|
* 2/ if we did not prexor then we are redoing the parity
|
|
|
|
* set ASYNC_TX_XOR_DROP_DST and ASYNC_TX_XOR_ZERO_DST
|
|
|
|
* for the synchronous xor case
|
|
|
|
*/
|
2014-12-15 08:57:03 +07:00
|
|
|
last_stripe = !head_sh->batch_head ||
|
|
|
|
list_first_entry(&sh->batch_list,
|
|
|
|
struct stripe_head, batch_list) == head_sh;
|
|
|
|
if (last_stripe) {
|
|
|
|
flags = ASYNC_TX_ACK |
|
|
|
|
(prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST);
|
|
|
|
|
|
|
|
atomic_inc(&head_sh->count);
|
|
|
|
init_async_submit(&submit, flags, tx, ops_complete_reconstruct, head_sh,
|
|
|
|
to_addr_conv(sh, percpu, j));
|
|
|
|
} else {
|
|
|
|
flags = prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST;
|
|
|
|
init_async_submit(&submit, flags, tx, NULL, NULL,
|
|
|
|
to_addr_conv(sh, percpu, j));
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2009-06-04 01:43:59 +07:00
|
|
|
if (unlikely(count == 1))
|
|
|
|
tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
|
|
|
|
else
|
|
|
|
tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
|
2014-12-15 08:57:03 +07:00
|
|
|
if (!last_stripe) {
|
|
|
|
j++;
|
|
|
|
sh = list_first_entry(&sh->batch_list, struct stripe_head,
|
|
|
|
batch_list);
|
|
|
|
goto again;
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
static void
|
|
|
|
ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
|
|
|
|
struct dma_async_tx_descriptor *tx)
|
|
|
|
{
|
|
|
|
struct async_submit_ctl submit;
|
2014-12-15 08:57:03 +07:00
|
|
|
struct page **blocks;
|
|
|
|
int count, i, j = 0;
|
|
|
|
struct stripe_head *head_sh = sh;
|
|
|
|
int last_stripe;
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
int synflags;
|
|
|
|
unsigned long txflags;
|
2009-07-15 03:40:19 +07:00
|
|
|
|
|
|
|
pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
|
|
|
|
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
for (i = 0; i < sh->disks; i++) {
|
|
|
|
if (sh->pd_idx == i || sh->qd_idx == i)
|
|
|
|
continue;
|
|
|
|
if (!test_bit(R5_Discard, &sh->dev[i].flags))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (i >= sh->disks) {
|
|
|
|
atomic_inc(&sh->count);
|
|
|
|
set_bit(R5_Discard, &sh->dev[sh->pd_idx].flags);
|
|
|
|
set_bit(R5_Discard, &sh->dev[sh->qd_idx].flags);
|
|
|
|
ops_complete_reconstruct(sh);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
again:
|
|
|
|
blocks = to_addr_page(percpu, j);
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
|
|
|
|
if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
|
|
|
|
synflags = SYNDROME_SRC_WRITTEN;
|
|
|
|
txflags = ASYNC_TX_ACK | ASYNC_TX_PQ_XOR_DST;
|
|
|
|
} else {
|
|
|
|
synflags = SYNDROME_SRC_ALL;
|
|
|
|
txflags = ASYNC_TX_ACK;
|
|
|
|
}
|
|
|
|
|
|
|
|
count = set_syndrome_sources(blocks, sh, synflags);
|
2014-12-15 08:57:03 +07:00
|
|
|
last_stripe = !head_sh->batch_head ||
|
|
|
|
list_first_entry(&sh->batch_list,
|
|
|
|
struct stripe_head, batch_list) == head_sh;
|
|
|
|
|
|
|
|
if (last_stripe) {
|
|
|
|
atomic_inc(&head_sh->count);
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
init_async_submit(&submit, txflags, tx, ops_complete_reconstruct,
|
2014-12-15 08:57:03 +07:00
|
|
|
head_sh, to_addr_conv(sh, percpu, j));
|
|
|
|
} else
|
|
|
|
init_async_submit(&submit, 0, tx, NULL, NULL,
|
|
|
|
to_addr_conv(sh, percpu, j));
|
2015-05-13 23:30:08 +07:00
|
|
|
tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
|
2014-12-15 08:57:03 +07:00
|
|
|
if (!last_stripe) {
|
|
|
|
j++;
|
|
|
|
sh = list_first_entry(&sh->batch_list, struct stripe_head,
|
|
|
|
batch_list);
|
|
|
|
goto again;
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static void ops_complete_check(void *stripe_head_ref)
|
|
|
|
{
|
|
|
|
struct stripe_head *sh = stripe_head_ref;
|
|
|
|
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
2008-06-28 05:31:57 +07:00
|
|
|
sh->check_state = check_state_check_result;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
|
|
|
int disks = sh->disks;
|
2009-07-15 03:40:19 +07:00
|
|
|
int pd_idx = sh->pd_idx;
|
|
|
|
int qd_idx = sh->qd_idx;
|
|
|
|
struct page *xor_dest;
|
2014-12-15 08:57:02 +07:00
|
|
|
struct page **xor_srcs = to_addr_page(percpu, 0);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
struct dma_async_tx_descriptor *tx;
|
2009-06-04 01:43:59 +07:00
|
|
|
struct async_submit_ctl submit;
|
2009-07-15 03:40:19 +07:00
|
|
|
int count;
|
|
|
|
int i;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2008-04-28 16:15:50 +07:00
|
|
|
pr_debug("%s: stripe %llu\n", __func__,
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2009-07-15 03:40:19 +07:00
|
|
|
count = 0;
|
|
|
|
xor_dest = sh->dev[pd_idx].page;
|
|
|
|
xor_srcs[count++] = xor_dest;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
for (i = disks; i--; ) {
|
2009-07-15 03:40:19 +07:00
|
|
|
if (i == pd_idx || i == qd_idx)
|
|
|
|
continue;
|
|
|
|
xor_srcs[count++] = sh->dev[i].page;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2009-07-15 01:50:52 +07:00
|
|
|
init_async_submit(&submit, 0, NULL, NULL, NULL,
|
2014-12-15 08:57:02 +07:00
|
|
|
to_addr_conv(sh, percpu, 0));
|
2009-04-09 04:28:37 +07:00
|
|
|
tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
|
2009-06-04 01:43:59 +07:00
|
|
|
&sh->ops.zero_sum_result, &submit);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
atomic_inc(&sh->count);
|
2009-06-04 01:43:59 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_check, sh, NULL);
|
|
|
|
tx = async_trigger_callback(&submit);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu, int checkp)
|
|
|
|
{
|
2014-12-15 08:57:02 +07:00
|
|
|
struct page **srcs = to_addr_page(percpu, 0);
|
2009-07-15 03:40:19 +07:00
|
|
|
struct async_submit_ctl submit;
|
|
|
|
int count;
|
|
|
|
|
|
|
|
pr_debug("%s: stripe %llu checkp: %d\n", __func__,
|
|
|
|
(unsigned long long)sh->sector, checkp);
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
count = set_syndrome_sources(srcs, sh, SYNDROME_SRC_ALL);
|
2009-07-15 03:40:19 +07:00
|
|
|
if (!checkp)
|
|
|
|
srcs[count] = NULL;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
atomic_inc(&sh->count);
|
2009-07-15 03:40:19 +07:00
|
|
|
init_async_submit(&submit, ASYNC_TX_ACK, NULL, ops_complete_check,
|
2014-12-15 08:57:02 +07:00
|
|
|
sh, to_addr_conv(sh, percpu, 0));
|
2009-07-15 03:40:19 +07:00
|
|
|
async_syndrome_val(srcs, 0, count+2, STRIPE_SIZE,
|
|
|
|
&sh->ops.zero_sum_result, percpu->spare_page, &submit);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2013-02-28 05:08:34 +07:00
|
|
|
static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
{
|
|
|
|
int overlap_clear = 0, i, disks = sh->disks;
|
|
|
|
struct dma_async_tx_descriptor *tx = NULL;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2009-07-15 03:40:19 +07:00
|
|
|
int level = conf->level;
|
2009-07-15 01:50:52 +07:00
|
|
|
struct raid5_percpu *percpu;
|
|
|
|
unsigned long cpu;
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2009-07-15 01:50:52 +07:00
|
|
|
cpu = get_cpu();
|
|
|
|
percpu = per_cpu_ptr(conf->percpu, cpu);
|
2008-06-28 05:31:58 +07:00
|
|
|
if (test_bit(STRIPE_OP_BIOFILL, &ops_request)) {
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
ops_run_biofill(sh);
|
|
|
|
overlap_clear++;
|
|
|
|
}
|
|
|
|
|
2008-06-28 05:32:09 +07:00
|
|
|
if (test_bit(STRIPE_OP_COMPUTE_BLK, &ops_request)) {
|
2009-07-15 03:40:19 +07:00
|
|
|
if (level < 6)
|
|
|
|
tx = ops_run_compute5(sh, percpu);
|
|
|
|
else {
|
|
|
|
if (sh->ops.target2 < 0 || sh->ops.target < 0)
|
|
|
|
tx = ops_run_compute6_1(sh, percpu);
|
|
|
|
else
|
|
|
|
tx = ops_run_compute6_2(sh, percpu);
|
|
|
|
}
|
|
|
|
/* terminate the chain if reconstruct is not set to be run */
|
|
|
|
if (tx && !test_bit(STRIPE_OP_RECONSTRUCT, &ops_request))
|
2008-06-28 05:32:09 +07:00
|
|
|
async_tx_ack(tx);
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
|
|
|
|
if (level < 6)
|
|
|
|
tx = ops_run_prexor5(sh, percpu, tx);
|
|
|
|
else
|
|
|
|
tx = ops_run_prexor6(sh, percpu, tx);
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2017-04-04 18:13:58 +07:00
|
|
|
if (test_bit(STRIPE_OP_PARTIAL_PARITY, &ops_request))
|
|
|
|
tx = ops_run_partial_parity(sh, percpu, tx);
|
|
|
|
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) {
|
2008-06-28 05:32:06 +07:00
|
|
|
tx = ops_run_biodrain(sh, tx);
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
overlap_clear++;
|
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
if (test_bit(STRIPE_OP_RECONSTRUCT, &ops_request)) {
|
|
|
|
if (level < 6)
|
|
|
|
ops_run_reconstruct5(sh, percpu, tx);
|
|
|
|
else
|
|
|
|
ops_run_reconstruct6(sh, percpu, tx);
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2009-07-15 03:40:19 +07:00
|
|
|
if (test_bit(STRIPE_OP_CHECK, &ops_request)) {
|
|
|
|
if (sh->check_state == check_state_run)
|
|
|
|
ops_run_check_p(sh, percpu);
|
|
|
|
else if (sh->check_state == check_state_run_q)
|
|
|
|
ops_run_check_pq(sh, percpu, 0);
|
|
|
|
else if (sh->check_state == check_state_run_pq)
|
|
|
|
ops_run_check_pq(sh, percpu, 1);
|
|
|
|
else
|
|
|
|
BUG();
|
|
|
|
}
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
if (overlap_clear && !sh->batch_head)
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
if (test_and_clear_bit(R5_Overlap, &dev->flags))
|
|
|
|
wake_up(&sh->raid_conf->wait_for_overlap);
|
|
|
|
}
|
2009-07-15 01:50:52 +07:00
|
|
|
put_cpu();
|
md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.
To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests. In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing. The following flags outline the
requests that handle_stripe can make of raid5_run_ops:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
- subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
STRIPE_OP_IO
- submit i/o to the member disks (note this was already performed outside
the stripe lock, but it made sense to add it as an operation type
The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
new operations that were previously blocked
Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.
Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2017-04-04 18:13:57 +07:00
|
|
|
static void free_stripe(struct kmem_cache *sc, struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
if (sh->ppl_page)
|
|
|
|
__free_page(sh->ppl_page);
|
|
|
|
kmem_cache_free(sc, sh);
|
|
|
|
}
|
|
|
|
|
2016-08-23 11:14:01 +07:00
|
|
|
static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
|
2017-04-04 18:13:57 +07:00
|
|
|
int disks, struct r5conf *conf)
|
2015-05-08 15:19:04 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *sh;
|
2016-08-23 11:14:01 +07:00
|
|
|
int i;
|
2015-05-08 15:19:04 +07:00
|
|
|
|
|
|
|
sh = kmem_cache_zalloc(sc, gfp);
|
|
|
|
if (sh) {
|
|
|
|
spin_lock_init(&sh->stripe_lock);
|
|
|
|
spin_lock_init(&sh->batch_lock);
|
|
|
|
INIT_LIST_HEAD(&sh->batch_list);
|
|
|
|
INIT_LIST_HEAD(&sh->lru);
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
INIT_LIST_HEAD(&sh->r5c);
|
2016-11-24 13:50:39 +07:00
|
|
|
INIT_LIST_HEAD(&sh->log_list);
|
2015-05-08 15:19:04 +07:00
|
|
|
atomic_set(&sh->count, 1);
|
2017-04-04 18:13:57 +07:00
|
|
|
sh->raid_conf = conf;
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
sh->log_start = MaxSector;
|
2016-08-23 11:14:01 +07:00
|
|
|
for (i = 0; i < disks; i++) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
|
2016-11-22 22:57:21 +07:00
|
|
|
bio_init(&dev->req, &dev->vec, 1);
|
|
|
|
bio_init(&dev->rreq, &dev->rvec, 1);
|
2016-08-23 11:14:01 +07:00
|
|
|
}
|
2017-04-04 18:13:57 +07:00
|
|
|
|
|
|
|
if (raid5_has_ppl(conf)) {
|
|
|
|
sh->ppl_page = alloc_page(gfp);
|
|
|
|
if (!sh->ppl_page) {
|
|
|
|
free_stripe(sc, sh);
|
|
|
|
sh = NULL;
|
|
|
|
}
|
|
|
|
}
|
2015-05-08 15:19:04 +07:00
|
|
|
}
|
|
|
|
return sh;
|
|
|
|
}
|
2015-02-25 08:10:35 +07:00
|
|
|
static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *sh;
|
2015-05-08 15:19:04 +07:00
|
|
|
|
2017-04-04 18:13:57 +07:00
|
|
|
sh = alloc_stripe(conf->slab_cache, gfp, conf->pool_size, conf);
|
2005-11-09 12:39:25 +07:00
|
|
|
if (!sh)
|
|
|
|
return 0;
|
2011-07-18 14:38:50 +07:00
|
|
|
|
2015-02-25 08:02:51 +07:00
|
|
|
if (grow_buffers(sh, gfp)) {
|
2010-06-16 13:45:16 +07:00
|
|
|
shrink_buffers(sh);
|
2017-04-04 18:13:57 +07:00
|
|
|
free_stripe(conf->slab_cache, sh);
|
2005-11-09 12:39:25 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2015-02-25 08:10:35 +07:00
|
|
|
sh->hash_lock_index =
|
|
|
|
conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS;
|
2005-11-09 12:39:25 +07:00
|
|
|
/* we just created an active stripe so... */
|
|
|
|
atomic_inc(&conf->active_stripes);
|
2014-12-15 08:57:03 +07:00
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2015-02-25 08:10:35 +07:00
|
|
|
conf->max_nr_stripes++;
|
2005-11-09 12:39:25 +07:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static int grow_stripes(struct r5conf *conf, int num)
|
2005-11-09 12:39:25 +07:00
|
|
|
{
|
2006-12-07 11:33:20 +07:00
|
|
|
struct kmem_cache *sc;
|
md: raid5: avoid string overflow warning
gcc warns about a possible overflow of the kmem_cache string, when adding
four characters to a string of the same length:
drivers/md/raid5.c: In function 'setup_conf':
drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
^~~~
drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If I'm counting correctly, we need 11 characters for the fixed part
of the string and 18 characters for a 64-bit pointer (when no gendisk
is used), so that leaves three characters for conf->level, which should
always be sufficient.
This makes the code use snprintf() with the correct length, to
make the code more robust against changes, and to get the compiler
to shut up.
In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
kmem_cache when mddev has no gendisk") from 2010, Neil said that
the pointer could be removed "shortly" once devices without gendisk
are disallowed. I have no idea if that happened, but if it did, that
should probably be changed as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-02-20 20:09:11 +07:00
|
|
|
size_t namelen = sizeof(conf->cache_name[0]);
|
2009-10-16 12:35:30 +07:00
|
|
|
int devs = max(conf->raid_disks, conf->previous_raid_disks);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-06-01 16:37:25 +07:00
|
|
|
if (conf->mddev->gendisk)
|
md: raid5: avoid string overflow warning
gcc warns about a possible overflow of the kmem_cache string, when adding
four characters to a string of the same length:
drivers/md/raid5.c: In function 'setup_conf':
drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
^~~~
drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If I'm counting correctly, we need 11 characters for the fixed part
of the string and 18 characters for a 64-bit pointer (when no gendisk
is used), so that leaves three characters for conf->level, which should
always be sufficient.
This makes the code use snprintf() with the correct length, to
make the code more robust against changes, and to get the compiler
to shut up.
In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
kmem_cache when mddev has no gendisk") from 2010, Neil said that
the pointer could be removed "shortly" once devices without gendisk
are disallowed. I have no idea if that happened, but if it did, that
should probably be changed as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-02-20 20:09:11 +07:00
|
|
|
snprintf(conf->cache_name[0], namelen,
|
2010-06-01 16:37:25 +07:00
|
|
|
"raid%d-%s", conf->level, mdname(conf->mddev));
|
|
|
|
else
|
md: raid5: avoid string overflow warning
gcc warns about a possible overflow of the kmem_cache string, when adding
four characters to a string of the same length:
drivers/md/raid5.c: In function 'setup_conf':
drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
^~~~
drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If I'm counting correctly, we need 11 characters for the fixed part
of the string and 18 characters for a 64-bit pointer (when no gendisk
is used), so that leaves three characters for conf->level, which should
always be sufficient.
This makes the code use snprintf() with the correct length, to
make the code more robust against changes, and to get the compiler
to shut up.
In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
kmem_cache when mddev has no gendisk") from 2010, Neil said that
the pointer could be removed "shortly" once devices without gendisk
are disallowed. I have no idea if that happened, but if it did, that
should probably be changed as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-02-20 20:09:11 +07:00
|
|
|
snprintf(conf->cache_name[0], namelen,
|
2010-06-01 16:37:25 +07:00
|
|
|
"raid%d-%p", conf->level, conf->mddev);
|
md: raid5: avoid string overflow warning
gcc warns about a possible overflow of the kmem_cache string, when adding
four characters to a string of the same length:
drivers/md/raid5.c: In function 'setup_conf':
drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
^~~~
drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If I'm counting correctly, we need 11 characters for the fixed part
of the string and 18 characters for a 64-bit pointer (when no gendisk
is used), so that leaves three characters for conf->level, which should
always be sufficient.
This makes the code use snprintf() with the correct length, to
make the code more robust against changes, and to get the compiler
to shut up.
In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
kmem_cache when mddev has no gendisk") from 2010, Neil said that
the pointer could be removed "shortly" once devices without gendisk
are disallowed. I have no idea if that happened, but if it did, that
should probably be changed as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-02-20 20:09:11 +07:00
|
|
|
snprintf(conf->cache_name[1], namelen, "%.27s-alt", conf->cache_name[0]);
|
2010-06-01 16:37:25 +07:00
|
|
|
|
2006-03-27 16:18:07 +07:00
|
|
|
conf->active_name = 0;
|
|
|
|
sc = kmem_cache_create(conf->cache_name[conf->active_name],
|
2005-04-17 05:20:36 +07:00
|
|
|
sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
|
2007-07-20 08:11:58 +07:00
|
|
|
0, 0, NULL);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (!sc)
|
|
|
|
return 1;
|
|
|
|
conf->slab_cache = sc;
|
2006-03-27 16:18:07 +07:00
|
|
|
conf->pool_size = devs;
|
2015-02-25 08:10:35 +07:00
|
|
|
while (num--)
|
|
|
|
if (!grow_one_stripe(conf, GFP_KERNEL))
|
2005-04-17 05:20:36 +07:00
|
|
|
return 1;
|
2015-02-25 08:10:35 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2006-03-27 16:18:10 +07:00
|
|
|
|
2009-07-15 01:50:52 +07:00
|
|
|
/**
|
|
|
|
* scribble_len - return the required size of the scribble region
|
|
|
|
* @num - total number of disks in the array
|
|
|
|
*
|
|
|
|
* The size must be enough to contain:
|
|
|
|
* 1/ a struct page pointer for each device in the array +2
|
|
|
|
* 2/ room to convert each entry in (1) to its corresponding dma
|
|
|
|
* (dma_map_page()) or page (page_address()) address.
|
|
|
|
*
|
|
|
|
* Note: the +2 is for the destination buffers of the ddf/raid6 case where we
|
|
|
|
* calculate over all devices (not just the data blocks), using zeros in place
|
|
|
|
* of the P and Q blocks.
|
|
|
|
*/
|
2014-12-15 08:57:02 +07:00
|
|
|
static struct flex_array *scribble_alloc(int num, int cnt, gfp_t flags)
|
2009-07-15 01:50:52 +07:00
|
|
|
{
|
2014-12-15 08:57:02 +07:00
|
|
|
struct flex_array *ret;
|
2009-07-15 01:50:52 +07:00
|
|
|
size_t len;
|
|
|
|
|
|
|
|
len = sizeof(struct page *) * (num+2) + sizeof(addr_conv_t) * (num+2);
|
2014-12-15 08:57:02 +07:00
|
|
|
ret = flex_array_alloc(len, cnt, flags);
|
|
|
|
if (!ret)
|
|
|
|
return NULL;
|
|
|
|
/* always prealloc all elements, so no locking is required */
|
|
|
|
if (flex_array_prealloc(ret, 0, cnt, flags)) {
|
|
|
|
flex_array_free(ret);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
return ret;
|
2009-07-15 01:50:52 +07:00
|
|
|
}
|
|
|
|
|
2015-05-08 15:19:39 +07:00
|
|
|
static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors)
|
|
|
|
{
|
|
|
|
unsigned long cpu;
|
|
|
|
int err = 0;
|
|
|
|
|
2016-02-25 08:38:28 +07:00
|
|
|
/*
|
|
|
|
* Never shrink. And mddev_suspend() could deadlock if this is called
|
|
|
|
* from raid5d. In that case, scribble_disks and scribble_sectors
|
|
|
|
* should equal to new_disks and new_sectors
|
|
|
|
*/
|
|
|
|
if (conf->scribble_disks >= new_disks &&
|
|
|
|
conf->scribble_sectors >= new_sectors)
|
|
|
|
return 0;
|
2015-05-08 15:19:39 +07:00
|
|
|
mddev_suspend(conf->mddev);
|
|
|
|
get_online_cpus();
|
|
|
|
for_each_present_cpu(cpu) {
|
|
|
|
struct raid5_percpu *percpu;
|
|
|
|
struct flex_array *scribble;
|
|
|
|
|
|
|
|
percpu = per_cpu_ptr(conf->percpu, cpu);
|
|
|
|
scribble = scribble_alloc(new_disks,
|
|
|
|
new_sectors / STRIPE_SECTORS,
|
|
|
|
GFP_NOIO);
|
|
|
|
|
|
|
|
if (scribble) {
|
|
|
|
flex_array_free(percpu->scribble);
|
|
|
|
percpu->scribble = scribble;
|
|
|
|
} else {
|
|
|
|
err = -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
put_online_cpus();
|
|
|
|
mddev_resume(conf->mddev);
|
2016-02-25 08:38:28 +07:00
|
|
|
if (!err) {
|
|
|
|
conf->scribble_disks = new_disks;
|
|
|
|
conf->scribble_sectors = new_sectors;
|
|
|
|
}
|
2015-05-08 15:19:39 +07:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static int resize_stripes(struct r5conf *conf, int newsize)
|
2006-03-27 16:18:07 +07:00
|
|
|
{
|
|
|
|
/* Make all the stripes able to hold 'newsize' devices.
|
|
|
|
* New slots in each stripe get 'page' set to a new page.
|
|
|
|
*
|
|
|
|
* This happens in stages:
|
|
|
|
* 1/ create a new kmem_cache and allocate the required number of
|
|
|
|
* stripe_heads.
|
2012-10-29 22:18:08 +07:00
|
|
|
* 2/ gather all the old stripe_heads and transfer the pages across
|
2006-03-27 16:18:07 +07:00
|
|
|
* to the new stripe_heads. This will have the side effect of
|
|
|
|
* freezing the array as once all stripe_heads have been collected,
|
|
|
|
* no IO will be possible. Old stripe heads are freed once their
|
|
|
|
* pages have been transferred over, and the old kmem_cache is
|
|
|
|
* freed when all stripes are done.
|
|
|
|
* 3/ reallocate conf->disks to be suitable bigger. If this fails,
|
2017-03-15 15:14:53 +07:00
|
|
|
* we simple return a failure status - no need to clean anything up.
|
2006-03-27 16:18:07 +07:00
|
|
|
* 4/ allocate new pages for the new slots in the new stripe_heads.
|
|
|
|
* If this fails, we don't bother trying the shrink the
|
|
|
|
* stripe_heads down again, we just leave them as they are.
|
|
|
|
* As each stripe_head is processed the new one is released into
|
|
|
|
* active service.
|
|
|
|
*
|
|
|
|
* Once step2 is started, we cannot afford to wait for a write,
|
|
|
|
* so we use GFP_NOIO allocations.
|
|
|
|
*/
|
|
|
|
struct stripe_head *osh, *nsh;
|
|
|
|
LIST_HEAD(newstripes);
|
|
|
|
struct disk_info *ndisks;
|
2017-05-08 16:56:55 +07:00
|
|
|
int err = 0;
|
2006-12-07 11:33:20 +07:00
|
|
|
struct kmem_cache *sc;
|
2006-03-27 16:18:07 +07:00
|
|
|
int i;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int hash, cnt;
|
2006-03-27 16:18:07 +07:00
|
|
|
|
2017-05-08 16:56:55 +07:00
|
|
|
md_allow_write(conf->mddev);
|
2007-01-26 15:57:11 +07:00
|
|
|
|
2006-03-27 16:18:07 +07:00
|
|
|
/* Step 1 */
|
|
|
|
sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
|
|
|
|
sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
|
2007-07-20 08:11:58 +07:00
|
|
|
0, 0, NULL);
|
2006-03-27 16:18:07 +07:00
|
|
|
if (!sc)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2015-07-06 09:49:23 +07:00
|
|
|
/* Need to ensure auto-resizing doesn't interfere */
|
|
|
|
mutex_lock(&conf->cache_size_mutex);
|
|
|
|
|
2006-03-27 16:18:07 +07:00
|
|
|
for (i = conf->max_nr_stripes; i; i--) {
|
2017-04-04 18:13:57 +07:00
|
|
|
nsh = alloc_stripe(sc, GFP_KERNEL, newsize, conf);
|
2006-03-27 16:18:07 +07:00
|
|
|
if (!nsh)
|
|
|
|
break;
|
|
|
|
|
|
|
|
list_add(&nsh->lru, &newstripes);
|
|
|
|
}
|
|
|
|
if (i) {
|
|
|
|
/* didn't get enough, give up */
|
|
|
|
while (!list_empty(&newstripes)) {
|
|
|
|
nsh = list_entry(newstripes.next, struct stripe_head, lru);
|
|
|
|
list_del(&nsh->lru);
|
2017-04-04 18:13:57 +07:00
|
|
|
free_stripe(sc, nsh);
|
2006-03-27 16:18:07 +07:00
|
|
|
}
|
|
|
|
kmem_cache_destroy(sc);
|
2015-07-06 09:49:23 +07:00
|
|
|
mutex_unlock(&conf->cache_size_mutex);
|
2006-03-27 16:18:07 +07:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
/* Step 2 - Must use GFP_NOIO now.
|
|
|
|
* OK, we have enough stripes, start collecting inactive
|
|
|
|
* stripes and copying them over
|
|
|
|
*/
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
hash = 0;
|
|
|
|
cnt = 0;
|
2006-03-27 16:18:07 +07:00
|
|
|
list_for_each_entry(nsh, &newstripes, lru) {
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
lock_device_hash_lock(conf, hash);
|
2016-02-26 07:24:42 +07:00
|
|
|
wait_event_cmd(conf->wait_for_stripe,
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
!list_empty(conf->inactive_list + hash),
|
|
|
|
unlock_device_hash_lock(conf, hash),
|
|
|
|
lock_device_hash_lock(conf, hash));
|
|
|
|
osh = get_free_stripe(conf, hash);
|
|
|
|
unlock_device_hash_lock(conf, hash);
|
2015-05-08 15:19:04 +07:00
|
|
|
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
for(i=0; i<conf->pool_size; i++) {
|
2006-03-27 16:18:07 +07:00
|
|
|
nsh->dev[i].page = osh->dev[i].page;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
nsh->dev[i].orig_page = osh->dev[i].page;
|
|
|
|
}
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
nsh->hash_lock_index = hash;
|
2017-04-04 18:13:57 +07:00
|
|
|
free_stripe(conf->slab_cache, osh);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
cnt++;
|
|
|
|
if (cnt >= conf->max_nr_stripes / NR_STRIPE_HASH_LOCKS +
|
|
|
|
!!((conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS) > hash)) {
|
|
|
|
hash++;
|
|
|
|
cnt = 0;
|
|
|
|
}
|
2006-03-27 16:18:07 +07:00
|
|
|
}
|
|
|
|
kmem_cache_destroy(conf->slab_cache);
|
|
|
|
|
|
|
|
/* Step 3.
|
|
|
|
* At this point, we are holding all the stripes so the array
|
|
|
|
* is completely stalled, so now is a good time to resize
|
2009-07-15 01:50:52 +07:00
|
|
|
* conf->disks and the scribble region
|
2006-03-27 16:18:07 +07:00
|
|
|
*/
|
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a * b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kzalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 04:03:40 +07:00
|
|
|
ndisks = kcalloc(newsize, sizeof(struct disk_info), GFP_NOIO);
|
2006-03-27 16:18:07 +07:00
|
|
|
if (ndisks) {
|
2016-11-24 13:50:39 +07:00
|
|
|
for (i = 0; i < conf->pool_size; i++)
|
2006-03-27 16:18:07 +07:00
|
|
|
ndisks[i] = conf->disks[i];
|
2016-11-24 13:50:39 +07:00
|
|
|
|
|
|
|
for (i = conf->pool_size; i < newsize; i++) {
|
|
|
|
ndisks[i].extra_page = alloc_page(GFP_NOIO);
|
|
|
|
if (!ndisks[i].extra_page)
|
|
|
|
err = -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (err) {
|
|
|
|
for (i = conf->pool_size; i < newsize; i++)
|
|
|
|
if (ndisks[i].extra_page)
|
|
|
|
put_page(ndisks[i].extra_page);
|
|
|
|
kfree(ndisks);
|
|
|
|
} else {
|
|
|
|
kfree(conf->disks);
|
|
|
|
conf->disks = ndisks;
|
|
|
|
}
|
2006-03-27 16:18:07 +07:00
|
|
|
} else
|
|
|
|
err = -ENOMEM;
|
|
|
|
|
2015-07-06 09:49:23 +07:00
|
|
|
mutex_unlock(&conf->cache_size_mutex);
|
2017-03-29 14:46:13 +07:00
|
|
|
|
|
|
|
conf->slab_cache = sc;
|
|
|
|
conf->active_name = 1-conf->active_name;
|
|
|
|
|
2006-03-27 16:18:07 +07:00
|
|
|
/* Step 4, return new stripes to service */
|
|
|
|
while(!list_empty(&newstripes)) {
|
|
|
|
nsh = list_entry(newstripes.next, struct stripe_head, lru);
|
|
|
|
list_del_init(&nsh->lru);
|
2009-07-15 01:50:52 +07:00
|
|
|
|
2006-03-27 16:18:07 +07:00
|
|
|
for (i=conf->raid_disks; i < newsize; i++)
|
|
|
|
if (nsh->dev[i].page == NULL) {
|
|
|
|
struct page *p = alloc_page(GFP_NOIO);
|
|
|
|
nsh->dev[i].page = p;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
nsh->dev[i].orig_page = p;
|
2006-03-27 16:18:07 +07:00
|
|
|
if (!p)
|
|
|
|
err = -ENOMEM;
|
|
|
|
}
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(nsh);
|
2006-03-27 16:18:07 +07:00
|
|
|
}
|
|
|
|
/* critical section pass, GFP_NOIO no longer needed */
|
|
|
|
|
2015-05-08 15:19:34 +07:00
|
|
|
if (!err)
|
|
|
|
conf->pool_size = newsize;
|
2006-03-27 16:18:07 +07:00
|
|
|
return err;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-02-25 08:10:35 +07:00
|
|
|
static int drop_one_stripe(struct r5conf *conf)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *sh;
|
2015-08-03 14:09:57 +07:00
|
|
|
int hash = (conf->max_nr_stripes - 1) & STRIPE_HASH_LOCKS_MASK;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
spin_lock_irq(conf->hash_locks + hash);
|
|
|
|
sh = get_free_stripe(conf, hash);
|
|
|
|
spin_unlock_irq(conf->hash_locks + hash);
|
2005-11-09 12:39:25 +07:00
|
|
|
if (!sh)
|
|
|
|
return 0;
|
2006-04-02 18:31:42 +07:00
|
|
|
BUG_ON(atomic_read(&sh->count));
|
2010-06-16 13:45:16 +07:00
|
|
|
shrink_buffers(sh);
|
2017-04-04 18:13:57 +07:00
|
|
|
free_stripe(conf->slab_cache, sh);
|
2005-11-09 12:39:25 +07:00
|
|
|
atomic_dec(&conf->active_stripes);
|
2015-02-25 08:10:35 +07:00
|
|
|
conf->max_nr_stripes--;
|
2005-11-09 12:39:25 +07:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void shrink_stripes(struct r5conf *conf)
|
2005-11-09 12:39:25 +07:00
|
|
|
{
|
2015-02-25 08:10:35 +07:00
|
|
|
while (conf->max_nr_stripes &&
|
|
|
|
drop_one_stripe(conf))
|
|
|
|
;
|
2005-11-09 12:39:25 +07:00
|
|
|
|
2015-09-13 19:15:10 +07:00
|
|
|
kmem_cache_destroy(conf->slab_cache);
|
2005-04-17 05:20:36 +07:00
|
|
|
conf->slab_cache = NULL;
|
|
|
|
}
|
|
|
|
|
2015-07-20 20:29:37 +07:00
|
|
|
static void raid5_end_read_request(struct bio * bi)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2009-03-31 10:39:38 +07:00
|
|
|
struct stripe_head *sh = bi->bi_private;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2006-03-27 16:18:08 +07:00
|
|
|
int disks = sh->disks, i;
|
2006-07-10 18:44:20 +07:00
|
|
|
char b[BDEVNAME_SIZE];
|
2011-12-23 06:17:53 +07:00
|
|
|
struct md_rdev *rdev = NULL;
|
2012-05-21 06:27:00 +07:00
|
|
|
sector_t s;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
for (i=0 ; i<disks; i++)
|
|
|
|
if (bi == &sh->dev[i].req)
|
|
|
|
break;
|
|
|
|
|
2015-07-20 20:29:37 +07:00
|
|
|
pr_debug("end_read_request %llu/%d, count: %d, error %d.\n",
|
2007-07-10 01:56:43 +07:00
|
|
|
(unsigned long long)sh->sector, i, atomic_read(&sh->count),
|
2017-06-03 14:38:06 +07:00
|
|
|
bi->bi_status);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (i == disks) {
|
2016-08-23 11:14:01 +07:00
|
|
|
bio_reset(bi);
|
2005-04-17 05:20:36 +07:00
|
|
|
BUG();
|
2007-09-27 17:47:43 +07:00
|
|
|
return;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2011-12-23 06:17:52 +07:00
|
|
|
if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
|
2011-12-23 06:17:53 +07:00
|
|
|
/* If replacement finished while this request was outstanding,
|
|
|
|
* 'replacement' might be NULL already.
|
|
|
|
* In that case it moved down to 'rdev'.
|
|
|
|
* rdev is not removed until all requests are finished.
|
|
|
|
*/
|
2011-12-23 06:17:52 +07:00
|
|
|
rdev = conf->disks[i].replacement;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (!rdev)
|
2011-12-23 06:17:52 +07:00
|
|
|
rdev = conf->disks[i].rdev;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-05-21 06:27:00 +07:00
|
|
|
if (use_new_offset(conf, sh))
|
|
|
|
s = sh->sector + rdev->new_data_offset;
|
|
|
|
else
|
|
|
|
s = sh->sector + rdev->data_offset;
|
2017-06-03 14:38:06 +07:00
|
|
|
if (!bi->bi_status) {
|
2005-04-17 05:20:36 +07:00
|
|
|
set_bit(R5_UPTODATE, &sh->dev[i].flags);
|
2005-11-09 12:39:22 +07:00
|
|
|
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
|
2011-12-23 06:17:52 +07:00
|
|
|
/* Note that this cannot happen on a
|
|
|
|
* replacement device. We just fail those on
|
|
|
|
* any error
|
|
|
|
*/
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_info_ratelimited(
|
|
|
|
"md/raid:%s: read error corrected (%lu sectors at %llu on %s)\n",
|
2011-07-27 08:00:36 +07:00
|
|
|
mdname(conf->mddev), STRIPE_SECTORS,
|
2012-05-21 06:27:00 +07:00
|
|
|
(unsigned long long)s,
|
2011-07-27 08:00:36 +07:00
|
|
|
bdevname(rdev->bdev, b));
|
2011-07-27 08:00:36 +07:00
|
|
|
atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
|
2005-11-09 12:39:22 +07:00
|
|
|
clear_bit(R5_ReadError, &sh->dev[i].flags);
|
|
|
|
clear_bit(R5_ReWrite, &sh->dev[i].flags);
|
2012-07-31 07:04:21 +07:00
|
|
|
} else if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
|
|
|
|
clear_bit(R5_ReadNoMerge, &sh->dev[i].flags);
|
|
|
|
|
2017-01-13 08:22:41 +07:00
|
|
|
if (test_bit(R5_InJournal, &sh->dev[i].flags))
|
|
|
|
/*
|
|
|
|
* end read for a page in journal, this
|
|
|
|
* must be preparing for prexor in rmw
|
|
|
|
*/
|
|
|
|
set_bit(R5_OrigPageUPTDODATE, &sh->dev[i].flags);
|
|
|
|
|
2011-12-23 06:17:52 +07:00
|
|
|
if (atomic_read(&rdev->read_errors))
|
|
|
|
atomic_set(&rdev->read_errors, 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
} else {
|
2011-12-23 06:17:52 +07:00
|
|
|
const char *bdn = bdevname(rdev->bdev, b);
|
2005-11-09 12:39:31 +07:00
|
|
|
int retry = 0;
|
2012-07-03 12:57:02 +07:00
|
|
|
int set_bad = 0;
|
2006-07-10 18:44:20 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
|
2006-07-10 18:44:20 +07:00
|
|
|
atomic_inc(&rdev->read_errors);
|
2011-12-23 06:17:52 +07:00
|
|
|
if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn_ratelimited(
|
|
|
|
"md/raid:%s: read error on replacement device (sector %llu on %s).\n",
|
2011-12-23 06:17:52 +07:00
|
|
|
mdname(conf->mddev),
|
2012-05-21 06:27:00 +07:00
|
|
|
(unsigned long long)s,
|
2011-12-23 06:17:52 +07:00
|
|
|
bdn);
|
2012-07-03 12:57:02 +07:00
|
|
|
else if (conf->mddev->degraded >= conf->max_degraded) {
|
|
|
|
set_bad = 1;
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn_ratelimited(
|
|
|
|
"md/raid:%s: read error not correctable (sector %llu on %s).\n",
|
2011-07-27 08:00:36 +07:00
|
|
|
mdname(conf->mddev),
|
2012-05-21 06:27:00 +07:00
|
|
|
(unsigned long long)s,
|
2011-07-27 08:00:36 +07:00
|
|
|
bdn);
|
2012-07-03 12:57:02 +07:00
|
|
|
} else if (test_bit(R5_ReWrite, &sh->dev[i].flags)) {
|
2005-11-09 12:39:22 +07:00
|
|
|
/* Oh, no!!! */
|
2012-07-03 12:57:02 +07:00
|
|
|
set_bad = 1;
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn_ratelimited(
|
|
|
|
"md/raid:%s: read error NOT corrected!! (sector %llu on %s).\n",
|
2011-07-27 08:00:36 +07:00
|
|
|
mdname(conf->mddev),
|
2012-05-21 06:27:00 +07:00
|
|
|
(unsigned long long)s,
|
2011-07-27 08:00:36 +07:00
|
|
|
bdn);
|
2012-07-03 12:57:02 +07:00
|
|
|
} else if (atomic_read(&rdev->read_errors)
|
2005-11-09 12:39:31 +07:00
|
|
|
> conf->max_nr_stripes)
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: Too many read errors, failing device %s.\n",
|
2006-07-10 18:44:20 +07:00
|
|
|
mdname(conf->mddev), bdn);
|
2005-11-09 12:39:31 +07:00
|
|
|
else
|
|
|
|
retry = 1;
|
2013-11-14 11:16:17 +07:00
|
|
|
if (set_bad && test_bit(In_sync, &rdev->flags)
|
|
|
|
&& !test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
|
|
|
|
retry = 1;
|
2005-11-09 12:39:31 +07:00
|
|
|
if (retry)
|
2012-07-31 07:04:21 +07:00
|
|
|
if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) {
|
|
|
|
set_bit(R5_ReadError, &sh->dev[i].flags);
|
|
|
|
clear_bit(R5_ReadNoMerge, &sh->dev[i].flags);
|
|
|
|
} else
|
|
|
|
set_bit(R5_ReadNoMerge, &sh->dev[i].flags);
|
2005-11-09 12:39:31 +07:00
|
|
|
else {
|
2005-11-09 12:39:22 +07:00
|
|
|
clear_bit(R5_ReadError, &sh->dev[i].flags);
|
|
|
|
clear_bit(R5_ReWrite, &sh->dev[i].flags);
|
2012-07-03 12:57:02 +07:00
|
|
|
if (!(set_bad
|
|
|
|
&& test_bit(In_sync, &rdev->flags)
|
|
|
|
&& rdev_set_badblocks(
|
|
|
|
rdev, sh->sector, STRIPE_SECTORS, 0)))
|
|
|
|
md_error(conf->mddev, rdev);
|
2005-11-09 12:39:31 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2011-12-23 06:17:52 +07:00
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
2016-09-09 00:43:58 +07:00
|
|
|
bio_reset(bi);
|
2005-04-17 05:20:36 +07:00
|
|
|
clear_bit(R5_LOCKED, &sh->dev[i].flags);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2015-07-20 20:29:37 +07:00
|
|
|
static void raid5_end_write_request(struct bio *bi)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2009-03-31 10:39:38 +07:00
|
|
|
struct stripe_head *sh = bi->bi_private;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2006-03-27 16:18:08 +07:00
|
|
|
int disks = sh->disks, i;
|
2011-12-23 06:17:53 +07:00
|
|
|
struct md_rdev *uninitialized_var(rdev);
|
2011-07-28 08:39:23 +07:00
|
|
|
sector_t first_bad;
|
|
|
|
int bad_sectors;
|
2011-12-23 06:17:53 +07:00
|
|
|
int replacement = 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-12-23 06:17:53 +07:00
|
|
|
for (i = 0 ; i < disks; i++) {
|
|
|
|
if (bi == &sh->dev[i].req) {
|
|
|
|
rdev = conf->disks[i].rdev;
|
2005-04-17 05:20:36 +07:00
|
|
|
break;
|
2011-12-23 06:17:53 +07:00
|
|
|
}
|
|
|
|
if (bi == &sh->dev[i].rreq) {
|
|
|
|
rdev = conf->disks[i].replacement;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (rdev)
|
|
|
|
replacement = 1;
|
|
|
|
else
|
|
|
|
/* rdev was removed and 'replacement'
|
|
|
|
* replaced it. rdev is not removed
|
|
|
|
* until all requests are finished.
|
|
|
|
*/
|
|
|
|
rdev = conf->disks[i].rdev;
|
2011-12-23 06:17:53 +07:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2015-07-20 20:29:37 +07:00
|
|
|
pr_debug("end_write_request %llu/%d, count %d, error: %d.\n",
|
2005-04-17 05:20:36 +07:00
|
|
|
(unsigned long long)sh->sector, i, atomic_read(&sh->count),
|
2017-06-03 14:38:06 +07:00
|
|
|
bi->bi_status);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (i == disks) {
|
2016-08-23 11:14:01 +07:00
|
|
|
bio_reset(bi);
|
2005-04-17 05:20:36 +07:00
|
|
|
BUG();
|
2007-09-27 17:47:43 +07:00
|
|
|
return;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2011-12-23 06:17:53 +07:00
|
|
|
if (replacement) {
|
2017-06-03 14:38:06 +07:00
|
|
|
if (bi->bi_status)
|
2011-12-23 06:17:53 +07:00
|
|
|
md_error(conf->mddev, rdev);
|
|
|
|
else if (is_badblock(rdev, sh->sector,
|
|
|
|
STRIPE_SECTORS,
|
|
|
|
&first_bad, &bad_sectors))
|
|
|
|
set_bit(R5_MadeGoodRepl, &sh->dev[i].flags);
|
|
|
|
} else {
|
2017-06-03 14:38:06 +07:00
|
|
|
if (bi->bi_status) {
|
2014-01-16 05:35:38 +07:00
|
|
|
set_bit(STRIPE_DEGRADED, &sh->state);
|
2011-12-23 06:17:53 +07:00
|
|
|
set_bit(WriteErrorSeen, &rdev->flags);
|
|
|
|
set_bit(R5_WriteError, &sh->dev[i].flags);
|
2011-12-23 06:17:54 +07:00
|
|
|
if (!test_and_set_bit(WantReplacement, &rdev->flags))
|
|
|
|
set_bit(MD_RECOVERY_NEEDED,
|
|
|
|
&rdev->mddev->recovery);
|
2011-12-23 06:17:53 +07:00
|
|
|
} else if (is_badblock(rdev, sh->sector,
|
|
|
|
STRIPE_SECTORS,
|
2013-04-24 08:42:42 +07:00
|
|
|
&first_bad, &bad_sectors)) {
|
2011-12-23 06:17:53 +07:00
|
|
|
set_bit(R5_MadeGood, &sh->dev[i].flags);
|
2013-04-24 08:42:42 +07:00
|
|
|
if (test_bit(R5_ReadError, &sh->dev[i].flags))
|
|
|
|
/* That was a successful write so make
|
|
|
|
* sure it looks like we already did
|
|
|
|
* a re-write.
|
|
|
|
*/
|
|
|
|
set_bit(R5_ReWrite, &sh->dev[i].flags);
|
|
|
|
}
|
2011-12-23 06:17:53 +07:00
|
|
|
}
|
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-06-03 14:38:06 +07:00
|
|
|
if (sh->batch_head && bi->bi_status && !replacement)
|
2014-12-15 08:57:03 +07:00
|
|
|
set_bit(STRIPE_BATCH_ERR, &sh->batch_head->state);
|
|
|
|
|
2016-09-09 00:43:58 +07:00
|
|
|
bio_reset(bi);
|
2011-12-23 06:17:53 +07:00
|
|
|
if (!test_and_clear_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags))
|
|
|
|
clear_bit(R5_LOCKED, &sh->dev[i].flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2014-12-15 08:57:03 +07:00
|
|
|
|
|
|
|
if (sh->batch_head && sh != sh->batch_head)
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh->batch_head);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2016-01-21 04:52:20 +07:00
|
|
|
static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
char b[BDEVNAME_SIZE];
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2011-12-23 06:17:50 +07:00
|
|
|
unsigned long flags;
|
2010-05-03 11:09:02 +07:00
|
|
|
pr_debug("raid456: error called\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-12-23 06:17:50 +07:00
|
|
|
spin_lock_irqsave(&conf->device_lock, flags);
|
2018-09-04 20:08:30 +07:00
|
|
|
|
|
|
|
if (test_bit(In_sync, &rdev->flags) &&
|
|
|
|
mddev->degraded == conf->max_degraded) {
|
|
|
|
/*
|
|
|
|
* Don't allow to achieve failed state
|
|
|
|
* Don't try to recover this device
|
|
|
|
*/
|
|
|
|
conf->recovery_disabled = mddev->recovery_disabled;
|
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-11-17 09:57:44 +07:00
|
|
|
set_bit(Faulty, &rdev->flags);
|
2011-12-23 06:17:50 +07:00
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2017-01-25 01:45:30 +07:00
|
|
|
mddev->degraded = raid5_calc_degraded(conf);
|
2011-12-23 06:17:50 +07:00
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
|
|
|
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
|
|
|
|
|
2011-07-28 08:31:48 +07:00
|
|
|
set_bit(Blocked, &rdev->flags);
|
2016-12-09 06:48:19 +07:00
|
|
|
set_mask_bits(&mddev->sb_flags, 0,
|
|
|
|
BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
|
|
|
|
"md/raid:%s: Operation continuing on %d devices.\n",
|
|
|
|
mdname(mddev),
|
|
|
|
bdevname(rdev->bdev, b),
|
|
|
|
mdname(mddev),
|
|
|
|
conf->raid_disks - mddev->degraded);
|
md/r5cache: gracefully handle journal device errors for writeback mode
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-12 05:28:28 +07:00
|
|
|
r5c_update_on_rdev_error(mddev, rdev);
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Input: a 'big' sector number,
|
|
|
|
* Output: index of the data and parity disk, and the sector # in them.
|
|
|
|
*/
|
2015-08-14 04:31:57 +07:00
|
|
|
sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
|
|
|
|
int previous, int *dd_idx,
|
|
|
|
struct stripe_head *sh)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2010-04-23 04:08:28 +07:00
|
|
|
sector_t stripe, stripe2;
|
2010-04-20 11:13:34 +07:00
|
|
|
sector_t chunk_number;
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned int chunk_offset;
|
2009-03-31 10:39:38 +07:00
|
|
|
int pd_idx, qd_idx;
|
2009-03-31 10:39:38 +07:00
|
|
|
int ddf_layout = 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
sector_t new_sector;
|
2009-03-31 11:20:22 +07:00
|
|
|
int algorithm = previous ? conf->prev_algo
|
|
|
|
: conf->algorithm;
|
2009-06-18 05:45:55 +07:00
|
|
|
int sectors_per_chunk = previous ? conf->prev_chunk_sectors
|
|
|
|
: conf->chunk_sectors;
|
2009-03-31 10:39:38 +07:00
|
|
|
int raid_disks = previous ? conf->previous_raid_disks
|
|
|
|
: conf->raid_disks;
|
|
|
|
int data_disks = raid_disks - conf->max_degraded;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/* First compute the information on this sector */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute the chunk number and the sector offset inside the chunk
|
|
|
|
*/
|
|
|
|
chunk_offset = sector_div(r_sector, sectors_per_chunk);
|
|
|
|
chunk_number = r_sector;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute the stripe number
|
|
|
|
*/
|
2010-04-20 11:13:34 +07:00
|
|
|
stripe = chunk_number;
|
|
|
|
*dd_idx = sector_div(stripe, data_disks);
|
2010-04-23 04:08:28 +07:00
|
|
|
stripe2 = stripe;
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Select the parity disk based on the user selected algorithm.
|
|
|
|
*/
|
2011-07-27 08:00:36 +07:00
|
|
|
pd_idx = qd_idx = -1;
|
2006-06-26 14:27:38 +07:00
|
|
|
switch(conf->level) {
|
|
|
|
case 4:
|
2009-03-31 10:39:38 +07:00
|
|
|
pd_idx = data_disks;
|
2006-06-26 14:27:38 +07:00
|
|
|
break;
|
|
|
|
case 5:
|
2009-03-31 11:20:22 +07:00
|
|
|
switch (algorithm) {
|
2005-04-17 05:20:36 +07:00
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = data_disks - sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
if (*dd_idx >= pd_idx)
|
2005-04-17 05:20:36 +07:00
|
|
|
(*dd_idx)++;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
if (*dd_idx >= pd_idx)
|
2005-04-17 05:20:36 +07:00
|
|
|
(*dd_idx)++;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = data_disks - sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
*dd_idx = (pd_idx + 1 + *dd_idx) % raid_disks;
|
2005-04-17 05:20:36 +07:00
|
|
|
break;
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
*dd_idx = (pd_idx + 1 + *dd_idx) % raid_disks;
|
2005-04-17 05:20:36 +07:00
|
|
|
break;
|
2009-03-31 10:39:38 +07:00
|
|
|
case ALGORITHM_PARITY_0:
|
|
|
|
pd_idx = 0;
|
|
|
|
(*dd_idx)++;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_N:
|
|
|
|
pd_idx = data_disks;
|
|
|
|
break;
|
2005-04-17 05:20:36 +07:00
|
|
|
default:
|
2009-03-31 10:39:38 +07:00
|
|
|
BUG();
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
case 6:
|
|
|
|
|
2009-03-31 11:20:22 +07:00
|
|
|
switch (algorithm) {
|
2006-06-26 14:27:38 +07:00
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = raid_disks - 1 - sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = pd_idx + 1;
|
|
|
|
if (pd_idx == raid_disks-1) {
|
2009-03-31 10:39:38 +07:00
|
|
|
(*dd_idx)++; /* Q D D D P */
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = 0;
|
|
|
|
} else if (*dd_idx >= pd_idx)
|
2006-06-26 14:27:38 +07:00
|
|
|
(*dd_idx) += 2; /* D D P Q D */
|
|
|
|
break;
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = pd_idx + 1;
|
|
|
|
if (pd_idx == raid_disks-1) {
|
2009-03-31 10:39:38 +07:00
|
|
|
(*dd_idx)++; /* Q D D D P */
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = 0;
|
|
|
|
} else if (*dd_idx >= pd_idx)
|
2006-06-26 14:27:38 +07:00
|
|
|
(*dd_idx) += 2; /* D D P Q D */
|
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = raid_disks - 1 - sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = (pd_idx + 1) % raid_disks;
|
|
|
|
*dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks;
|
2006-06-26 14:27:38 +07:00
|
|
|
break;
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = (pd_idx + 1) % raid_disks;
|
|
|
|
*dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks;
|
2006-06-26 14:27:38 +07:00
|
|
|
break;
|
2009-03-31 10:39:38 +07:00
|
|
|
|
|
|
|
case ALGORITHM_PARITY_0:
|
|
|
|
pd_idx = 0;
|
|
|
|
qd_idx = 1;
|
|
|
|
(*dd_idx) += 2;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_N:
|
|
|
|
pd_idx = data_disks;
|
|
|
|
qd_idx = data_disks + 1;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ALGORITHM_ROTATING_ZERO_RESTART:
|
|
|
|
/* Exactly the same as RIGHT_ASYMMETRIC, but or
|
|
|
|
* of blocks for computing Q is different.
|
|
|
|
*/
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = pd_idx + 1;
|
|
|
|
if (pd_idx == raid_disks-1) {
|
|
|
|
(*dd_idx)++; /* Q D D D P */
|
|
|
|
qd_idx = 0;
|
|
|
|
} else if (*dd_idx >= pd_idx)
|
|
|
|
(*dd_idx) += 2; /* D D P Q D */
|
2009-03-31 10:39:38 +07:00
|
|
|
ddf_layout = 1;
|
2009-03-31 10:39:38 +07:00
|
|
|
break;
|
|
|
|
|
|
|
|
case ALGORITHM_ROTATING_N_RESTART:
|
|
|
|
/* Same a left_asymmetric, by first stripe is
|
|
|
|
* D D D P Q rather than
|
|
|
|
* Q D D D P
|
|
|
|
*/
|
2010-04-23 04:08:28 +07:00
|
|
|
stripe2 += 1;
|
|
|
|
pd_idx = raid_disks - 1 - sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = pd_idx + 1;
|
|
|
|
if (pd_idx == raid_disks-1) {
|
|
|
|
(*dd_idx)++; /* Q D D D P */
|
|
|
|
qd_idx = 0;
|
|
|
|
} else if (*dd_idx >= pd_idx)
|
|
|
|
(*dd_idx) += 2; /* D D P Q D */
|
2009-03-31 10:39:38 +07:00
|
|
|
ddf_layout = 1;
|
2009-03-31 10:39:38 +07:00
|
|
|
break;
|
|
|
|
|
|
|
|
case ALGORITHM_ROTATING_N_CONTINUE:
|
|
|
|
/* Same as left_symmetric but Q is before P */
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = raid_disks - 1 - sector_div(stripe2, raid_disks);
|
2009-03-31 10:39:38 +07:00
|
|
|
qd_idx = (pd_idx + raid_disks - 1) % raid_disks;
|
|
|
|
*dd_idx = (pd_idx + 1 + *dd_idx) % raid_disks;
|
2009-03-31 10:39:38 +07:00
|
|
|
ddf_layout = 1;
|
2009-03-31 10:39:38 +07:00
|
|
|
break;
|
|
|
|
|
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC_6:
|
|
|
|
/* RAID5 left_asymmetric, with Q on last device */
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = data_disks - sector_div(stripe2, raid_disks-1);
|
2009-03-31 10:39:38 +07:00
|
|
|
if (*dd_idx >= pd_idx)
|
|
|
|
(*dd_idx)++;
|
|
|
|
qd_idx = raid_disks - 1;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC_6:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = sector_div(stripe2, raid_disks-1);
|
2009-03-31 10:39:38 +07:00
|
|
|
if (*dd_idx >= pd_idx)
|
|
|
|
(*dd_idx)++;
|
|
|
|
qd_idx = raid_disks - 1;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC_6:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = data_disks - sector_div(stripe2, raid_disks-1);
|
2009-03-31 10:39:38 +07:00
|
|
|
*dd_idx = (pd_idx + 1 + *dd_idx) % (raid_disks-1);
|
|
|
|
qd_idx = raid_disks - 1;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC_6:
|
2010-04-23 04:08:28 +07:00
|
|
|
pd_idx = sector_div(stripe2, raid_disks-1);
|
2009-03-31 10:39:38 +07:00
|
|
|
*dd_idx = (pd_idx + 1 + *dd_idx) % (raid_disks-1);
|
|
|
|
qd_idx = raid_disks - 1;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case ALGORITHM_PARITY_0_6:
|
|
|
|
pd_idx = 0;
|
|
|
|
(*dd_idx)++;
|
|
|
|
qd_idx = raid_disks - 1;
|
|
|
|
break;
|
|
|
|
|
2006-06-26 14:27:38 +07:00
|
|
|
default:
|
2009-03-31 10:39:38 +07:00
|
|
|
BUG();
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
|
|
|
break;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2009-03-31 10:39:38 +07:00
|
|
|
if (sh) {
|
|
|
|
sh->pd_idx = pd_idx;
|
|
|
|
sh->qd_idx = qd_idx;
|
2009-03-31 10:39:38 +07:00
|
|
|
sh->ddf_layout = ddf_layout;
|
2009-03-31 10:39:38 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Finally, compute the new sector number
|
|
|
|
*/
|
|
|
|
new_sector = (sector_t)stripe * sectors_per_chunk + chunk_offset;
|
|
|
|
return new_sector;
|
|
|
|
}
|
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2006-12-10 17:20:49 +07:00
|
|
|
int raid_disks = sh->disks;
|
|
|
|
int data_disks = raid_disks - conf->max_degraded;
|
2005-04-17 05:20:36 +07:00
|
|
|
sector_t new_sector = sh->sector, check;
|
2009-06-18 05:45:55 +07:00
|
|
|
int sectors_per_chunk = previous ? conf->prev_chunk_sectors
|
|
|
|
: conf->chunk_sectors;
|
2009-03-31 11:20:22 +07:00
|
|
|
int algorithm = previous ? conf->prev_algo
|
|
|
|
: conf->algorithm;
|
2005-04-17 05:20:36 +07:00
|
|
|
sector_t stripe;
|
|
|
|
int chunk_offset;
|
2010-04-20 11:13:34 +07:00
|
|
|
sector_t chunk_number;
|
|
|
|
int dummy1, dd_idx = i;
|
2005-04-17 05:20:36 +07:00
|
|
|
sector_t r_sector;
|
2009-03-31 10:39:38 +07:00
|
|
|
struct stripe_head sh2;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
chunk_offset = sector_div(new_sector, sectors_per_chunk);
|
|
|
|
stripe = new_sector;
|
|
|
|
|
2006-06-26 14:27:38 +07:00
|
|
|
if (i == sh->pd_idx)
|
|
|
|
return 0;
|
|
|
|
switch(conf->level) {
|
|
|
|
case 4: break;
|
|
|
|
case 5:
|
2009-03-31 11:20:22 +07:00
|
|
|
switch (algorithm) {
|
2005-04-17 05:20:36 +07:00
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC:
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC:
|
|
|
|
if (i > sh->pd_idx)
|
|
|
|
i--;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC:
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC:
|
|
|
|
if (i < sh->pd_idx)
|
|
|
|
i += raid_disks;
|
|
|
|
i -= (sh->pd_idx + 1);
|
|
|
|
break;
|
2009-03-31 10:39:38 +07:00
|
|
|
case ALGORITHM_PARITY_0:
|
|
|
|
i -= 1;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_N:
|
|
|
|
break;
|
2005-04-17 05:20:36 +07:00
|
|
|
default:
|
2009-03-31 10:39:38 +07:00
|
|
|
BUG();
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
case 6:
|
2009-03-31 10:39:38 +07:00
|
|
|
if (i == sh->qd_idx)
|
2006-06-26 14:27:38 +07:00
|
|
|
return 0; /* It is the Q disk */
|
2009-03-31 11:20:22 +07:00
|
|
|
switch (algorithm) {
|
2006-06-26 14:27:38 +07:00
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC:
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC:
|
2009-03-31 10:39:38 +07:00
|
|
|
case ALGORITHM_ROTATING_ZERO_RESTART:
|
|
|
|
case ALGORITHM_ROTATING_N_RESTART:
|
|
|
|
if (sh->pd_idx == raid_disks-1)
|
|
|
|
i--; /* Q D D D P */
|
2006-06-26 14:27:38 +07:00
|
|
|
else if (i > sh->pd_idx)
|
|
|
|
i -= 2; /* D D P Q D */
|
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC:
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC:
|
|
|
|
if (sh->pd_idx == raid_disks-1)
|
|
|
|
i--; /* Q D D D P */
|
|
|
|
else {
|
|
|
|
/* D D P Q D */
|
|
|
|
if (i < sh->pd_idx)
|
|
|
|
i += raid_disks;
|
|
|
|
i -= (sh->pd_idx + 2);
|
|
|
|
}
|
|
|
|
break;
|
2009-03-31 10:39:38 +07:00
|
|
|
case ALGORITHM_PARITY_0:
|
|
|
|
i -= 2;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_N:
|
|
|
|
break;
|
|
|
|
case ALGORITHM_ROTATING_N_CONTINUE:
|
2009-10-16 12:27:34 +07:00
|
|
|
/* Like left_symmetric, but P is before Q */
|
2009-03-31 10:39:38 +07:00
|
|
|
if (sh->pd_idx == 0)
|
|
|
|
i--; /* P D D D Q */
|
2009-10-16 12:27:34 +07:00
|
|
|
else {
|
|
|
|
/* D D Q P D */
|
|
|
|
if (i < sh->pd_idx)
|
|
|
|
i += raid_disks;
|
|
|
|
i -= (sh->pd_idx + 1);
|
|
|
|
}
|
2009-03-31 10:39:38 +07:00
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC_6:
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC_6:
|
|
|
|
if (i > sh->pd_idx)
|
|
|
|
i--;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC_6:
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC_6:
|
|
|
|
if (i < sh->pd_idx)
|
|
|
|
i += data_disks + 1;
|
|
|
|
i -= (sh->pd_idx + 1);
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_0_6:
|
|
|
|
i -= 1;
|
|
|
|
break;
|
2006-06-26 14:27:38 +07:00
|
|
|
default:
|
2009-03-31 10:39:38 +07:00
|
|
|
BUG();
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
|
|
|
break;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
chunk_number = stripe * data_disks + i;
|
2010-04-20 11:13:34 +07:00
|
|
|
r_sector = chunk_number * sectors_per_chunk + chunk_offset;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-03-31 10:39:38 +07:00
|
|
|
check = raid5_compute_sector(conf, r_sector,
|
2009-03-31 11:19:07 +07:00
|
|
|
previous, &dummy1, &sh2);
|
2009-03-31 10:39:38 +07:00
|
|
|
if (check != sh->sector || dummy1 != dd_idx || sh2.pd_idx != sh->pd_idx
|
|
|
|
|| sh2.qd_idx != sh->qd_idx) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: compute_blocknr: map not correct\n",
|
|
|
|
mdname(conf->mddev));
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return r_sector;
|
|
|
|
}
|
|
|
|
|
md/r5cache: shift complex rmw from read path to write path
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 08:12:58 +07:00
|
|
|
/*
|
|
|
|
* There are cases where we want handle_stripe_dirtying() and
|
|
|
|
* schedule_reconstruction() to delay towrite to some dev of a stripe.
|
|
|
|
*
|
|
|
|
* This function checks whether we want to delay the towrite. Specifically,
|
|
|
|
* we delay the towrite when:
|
|
|
|
*
|
|
|
|
* 1. degraded stripe has a non-overwrite to the missing dev, AND this
|
|
|
|
* stripe has data in journal (for other devices).
|
|
|
|
*
|
|
|
|
* In this case, when reading data for the non-overwrite dev, it is
|
|
|
|
* necessary to handle complex rmw of write back cache (prexor with
|
|
|
|
* orig_page, and xor with page). To keep read path simple, we would
|
|
|
|
* like to flush data in journal to RAID disks first, so complex rmw
|
|
|
|
* is handled in the write patch (handle_stripe_dirtying).
|
|
|
|
*
|
2017-01-25 05:08:23 +07:00
|
|
|
* 2. when journal space is critical (R5C_LOG_CRITICAL=1)
|
|
|
|
*
|
|
|
|
* It is important to be able to flush all stripes in raid5-cache.
|
|
|
|
* Therefore, we need reserve some space on the journal device for
|
|
|
|
* these flushes. If flush operation includes pending writes to the
|
|
|
|
* stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
|
|
|
|
* for the flush out. If we exclude these pending writes from flush
|
|
|
|
* operation, we only need (conf->max_degraded + 1) pages per stripe.
|
|
|
|
* Therefore, excluding pending writes in these cases enables more
|
|
|
|
* efficient use of the journal device.
|
|
|
|
*
|
|
|
|
* Note: To make sure the stripe makes progress, we only delay
|
|
|
|
* towrite for stripes with data already in journal (injournal > 0).
|
|
|
|
* When LOG_CRITICAL, stripes with injournal == 0 will be sent to
|
|
|
|
* no_space_stripes list.
|
|
|
|
*
|
md/r5cache: gracefully handle journal device errors for writeback mode
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-12 05:28:28 +07:00
|
|
|
* 3. during journal failure
|
|
|
|
* In journal failure, we try to flush all cached data to raid disks
|
|
|
|
* based on data in stripe cache. The array is read-only to upper
|
|
|
|
* layers, so we would skip all pending writes.
|
|
|
|
*
|
md/r5cache: shift complex rmw from read path to write path
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 08:12:58 +07:00
|
|
|
*/
|
2017-01-25 05:08:23 +07:00
|
|
|
static inline bool delay_towrite(struct r5conf *conf,
|
|
|
|
struct r5dev *dev,
|
|
|
|
struct stripe_head_state *s)
|
md/r5cache: shift complex rmw from read path to write path
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 08:12:58 +07:00
|
|
|
{
|
2017-01-25 05:08:23 +07:00
|
|
|
/* case 1 above */
|
|
|
|
if (!test_bit(R5_OVERWRITE, &dev->flags) &&
|
|
|
|
!test_bit(R5_Insync, &dev->flags) && s->injournal)
|
|
|
|
return true;
|
|
|
|
/* case 2 above */
|
|
|
|
if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
|
|
|
|
s->injournal > 0)
|
|
|
|
return true;
|
md/r5cache: gracefully handle journal device errors for writeback mode
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-12 05:28:28 +07:00
|
|
|
/* case 3 above */
|
|
|
|
if (s->log_failed && s->injournal)
|
|
|
|
return true;
|
2017-01-25 05:08:23 +07:00
|
|
|
return false;
|
md/r5cache: shift complex rmw from read path to write path
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 08:12:58 +07:00
|
|
|
}
|
|
|
|
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
static void
|
2009-08-30 09:13:12 +07:00
|
|
|
schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
int rcw, int expand)
|
2007-01-03 03:52:30 +07:00
|
|
|
{
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx, disks = sh->disks;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2009-08-30 09:13:12 +07:00
|
|
|
int level = conf->level;
|
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
if (rcw) {
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
/*
|
|
|
|
* In some cases, handle_stripe_dirtying initially decided to
|
|
|
|
* run rmw and allocates extra page for prexor. However, rcw is
|
|
|
|
* cheaper later on. We need to free the extra page now,
|
|
|
|
* because we won't be able to do that in ops_complete_prexor().
|
|
|
|
*/
|
|
|
|
r5c_release_extra_page(sh);
|
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
|
2017-01-25 05:08:23 +07:00
|
|
|
if (dev->towrite && !delay_towrite(conf, dev, s)) {
|
2007-01-03 03:52:30 +07:00
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
2008-06-28 05:32:06 +07:00
|
|
|
set_bit(R5_Wantdrain, &dev->flags);
|
2007-01-03 03:52:30 +07:00
|
|
|
if (!expand)
|
|
|
|
clear_bit(R5_UPTODATE, &dev->flags);
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
s->locked++;
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
} else if (test_bit(R5_InJournal, &dev->flags)) {
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
s->locked++;
|
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
}
|
2013-03-04 08:37:14 +07:00
|
|
|
/* if we are not expanding this is a proper write request, and
|
|
|
|
* there will be bios with new data to be drained into the
|
|
|
|
* stripe cache
|
|
|
|
*/
|
|
|
|
if (!expand) {
|
|
|
|
if (!s->locked)
|
|
|
|
/* False alarm, nothing to do */
|
|
|
|
return;
|
|
|
|
sh->reconstruct_state = reconstruct_state_drain_run;
|
|
|
|
set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
|
|
|
|
} else
|
|
|
|
sh->reconstruct_state = reconstruct_state_run;
|
|
|
|
|
|
|
|
set_bit(STRIPE_OP_RECONSTRUCT, &s->ops_request);
|
|
|
|
|
2009-08-30 09:13:12 +07:00
|
|
|
if (s->locked + conf->max_degraded == disks)
|
2008-04-28 16:15:53 +07:00
|
|
|
if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
|
2009-08-30 09:13:12 +07:00
|
|
|
atomic_inc(&conf->pending_full_writes);
|
2007-01-03 03:52:30 +07:00
|
|
|
} else {
|
|
|
|
BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
|
|
|
|
test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
BUG_ON(level == 6 &&
|
|
|
|
(!(test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags) ||
|
|
|
|
test_bit(R5_Wantcompute, &sh->dev[qd_idx].flags))));
|
2007-01-03 03:52:30 +07:00
|
|
|
|
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
if (i == pd_idx || i == qd_idx)
|
2007-01-03 03:52:30 +07:00
|
|
|
continue;
|
|
|
|
|
|
|
|
if (dev->towrite &&
|
|
|
|
(test_bit(R5_UPTODATE, &dev->flags) ||
|
2008-06-28 05:32:06 +07:00
|
|
|
test_bit(R5_Wantcompute, &dev->flags))) {
|
|
|
|
set_bit(R5_Wantdrain, &dev->flags);
|
2007-01-03 03:52:30 +07:00
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
clear_bit(R5_UPTODATE, &dev->flags);
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
s->locked++;
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
} else if (test_bit(R5_InJournal, &dev->flags)) {
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
s->locked++;
|
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
}
|
2013-03-04 08:37:14 +07:00
|
|
|
if (!s->locked)
|
|
|
|
/* False alarm - nothing to do */
|
|
|
|
return;
|
|
|
|
sh->reconstruct_state = reconstruct_state_prexor_drain_run;
|
|
|
|
set_bit(STRIPE_OP_PREXOR, &s->ops_request);
|
|
|
|
set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
|
|
|
|
set_bit(STRIPE_OP_RECONSTRUCT, &s->ops_request);
|
2007-01-03 03:52:30 +07:00
|
|
|
}
|
|
|
|
|
2009-08-30 09:13:12 +07:00
|
|
|
/* keep the parity disk(s) locked while asynchronous operations
|
2007-01-03 03:52:30 +07:00
|
|
|
* are in flight
|
|
|
|
*/
|
|
|
|
set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
|
|
|
|
clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
s->locked++;
|
2007-01-03 03:52:30 +07:00
|
|
|
|
2009-08-30 09:13:12 +07:00
|
|
|
if (level == 6) {
|
|
|
|
int qd_idx = sh->qd_idx;
|
|
|
|
struct r5dev *dev = &sh->dev[qd_idx];
|
|
|
|
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
clear_bit(R5_UPTODATE, &dev->flags);
|
|
|
|
s->locked++;
|
|
|
|
}
|
|
|
|
|
2017-04-04 18:13:57 +07:00
|
|
|
if (raid5_has_ppl(sh->raid_conf) && sh->ppl_page &&
|
2017-03-09 15:59:59 +07:00
|
|
|
test_bit(STRIPE_OP_BIODRAIN, &s->ops_request) &&
|
|
|
|
!test_bit(STRIPE_FULL_WRITE, &sh->state) &&
|
|
|
|
test_bit(R5_Insync, &sh->dev[pd_idx].flags))
|
|
|
|
set_bit(STRIPE_OP_PARTIAL_PARITY, &s->ops_request);
|
|
|
|
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
|
2008-04-28 16:15:50 +07:00
|
|
|
__func__, (unsigned long long)sh->sector,
|
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:32:05 +07:00
|
|
|
s->locked, s->ops_request);
|
2007-01-03 03:52:30 +07:00
|
|
|
}
|
2006-06-26 14:27:38 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Each stripe/dev can have one or more bion attached.
|
2006-06-26 14:27:38 +07:00
|
|
|
* toread/towrite point to the first in a chain.
|
2005-04-17 05:20:36 +07:00
|
|
|
* The bi_next chain must be in order.
|
|
|
|
*/
|
2014-12-15 08:57:03 +07:00
|
|
|
static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
|
|
|
|
int forwrite, int previous)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
struct bio **bip;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2005-09-10 06:23:54 +07:00
|
|
|
int firstwrite=0;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-07-26 08:20:35 +07:00
|
|
|
pr_debug("adding bi b#%llu to stripe s#%llu\n",
|
2013-10-12 05:44:27 +07:00
|
|
|
(unsigned long long)bi->bi_iter.bi_sector,
|
2005-04-17 05:20:36 +07:00
|
|
|
(unsigned long long)sh->sector);
|
|
|
|
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 13:01:31 +07:00
|
|
|
spin_lock_irq(&sh->stripe_lock);
|
2018-04-20 00:28:10 +07:00
|
|
|
sh->dev[dd_idx].write_hint = bi->bi_write_hint;
|
2014-12-15 08:57:03 +07:00
|
|
|
/* Don't allow new IO added to stripes in batch list */
|
|
|
|
if (sh->batch_head)
|
|
|
|
goto overlap;
|
2005-09-10 06:23:54 +07:00
|
|
|
if (forwrite) {
|
2005-04-17 05:20:36 +07:00
|
|
|
bip = &sh->dev[dd_idx].towrite;
|
2012-07-19 13:01:31 +07:00
|
|
|
if (*bip == NULL)
|
2005-09-10 06:23:54 +07:00
|
|
|
firstwrite = 1;
|
|
|
|
} else
|
2005-04-17 05:20:36 +07:00
|
|
|
bip = &sh->dev[dd_idx].toread;
|
2013-10-12 05:44:27 +07:00
|
|
|
while (*bip && (*bip)->bi_iter.bi_sector < bi->bi_iter.bi_sector) {
|
|
|
|
if (bio_end_sector(*bip) > bi->bi_iter.bi_sector)
|
2005-04-17 05:20:36 +07:00
|
|
|
goto overlap;
|
|
|
|
bip = & (*bip)->bi_next;
|
|
|
|
}
|
2013-10-12 05:44:27 +07:00
|
|
|
if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
|
2005-04-17 05:20:36 +07:00
|
|
|
goto overlap;
|
|
|
|
|
2017-03-09 15:59:59 +07:00
|
|
|
if (forwrite && raid5_has_ppl(conf)) {
|
|
|
|
/*
|
|
|
|
* With PPL only writes to consecutive data chunks within a
|
|
|
|
* stripe are allowed because for a single stripe_head we can
|
|
|
|
* only have one PPL entry at a time, which describes one data
|
|
|
|
* range. Not really an overlap, but wait_for_overlap can be
|
|
|
|
* used to handle this.
|
|
|
|
*/
|
|
|
|
sector_t sector;
|
|
|
|
sector_t first = 0;
|
|
|
|
sector_t last = 0;
|
|
|
|
int count = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < sh->disks; i++) {
|
|
|
|
if (i != sh->pd_idx &&
|
|
|
|
(i == dd_idx || sh->dev[i].towrite)) {
|
|
|
|
sector = sh->dev[i].sector;
|
|
|
|
if (count == 0 || sector < first)
|
|
|
|
first = sector;
|
|
|
|
if (sector > last)
|
|
|
|
last = sector;
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (first + conf->chunk_sectors * (count - 1) != last)
|
|
|
|
goto overlap;
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
if (!forwrite || previous)
|
|
|
|
clear_bit(STRIPE_BATCH_READY, &sh->state);
|
|
|
|
|
2006-04-02 18:31:42 +07:00
|
|
|
BUG_ON(*bip && bi->bi_next && (*bip) != bi->bi_next);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (*bip)
|
|
|
|
bi->bi_next = *bip;
|
|
|
|
*bip = bi;
|
2017-03-15 10:05:13 +07:00
|
|
|
bio_inc_remaining(bi);
|
md/raid5: use md_write_start to count stripes, not bios
We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it count stripe_heads that a WRITE bio
has been attached to.
So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().
The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requsts with
raid5_dec_bi_active_stripes(), we now call md_write_end().
This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 10:05:12 +07:00
|
|
|
md_write_inc(conf->mddev, bi);
|
2005-09-10 06:23:54 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
if (forwrite) {
|
|
|
|
/* check if page is covered */
|
|
|
|
sector_t sector = sh->dev[dd_idx].sector;
|
|
|
|
for (bi=sh->dev[dd_idx].towrite;
|
|
|
|
sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
|
2013-10-12 05:44:27 +07:00
|
|
|
bi && bi->bi_iter.bi_sector <= sector;
|
2005-04-17 05:20:36 +07:00
|
|
|
bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
|
2012-09-26 05:05:12 +07:00
|
|
|
if (bio_end_sector(bi) >= sector)
|
|
|
|
sector = bio_end_sector(bi);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
|
2014-12-15 08:57:03 +07:00
|
|
|
if (!test_and_set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags))
|
|
|
|
sh->overwrite_disks++;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2011-07-26 08:20:35 +07:00
|
|
|
|
|
|
|
pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
|
2013-10-12 05:44:27 +07:00
|
|
|
(unsigned long long)(*bip)->bi_iter.bi_sector,
|
2011-07-26 08:20:35 +07:00
|
|
|
(unsigned long long)sh->sector, dd_idx);
|
|
|
|
|
|
|
|
if (conf->mddev->bitmap && firstwrite) {
|
2015-05-27 05:43:45 +07:00
|
|
|
/* Cannot hold spinlock over bitmap_startwrite,
|
|
|
|
* but must ensure this isn't added to a batch until
|
|
|
|
* we have added to the bitmap and set bm_seq.
|
|
|
|
* So set STRIPE_BITMAP_PENDING to prevent
|
|
|
|
* batching.
|
|
|
|
* If multiple add_stripe_bio() calls race here they
|
|
|
|
* much all set STRIPE_BITMAP_PENDING. So only the first one
|
|
|
|
* to complete "bitmap_startwrite" gets to set
|
|
|
|
* STRIPE_BIT_DELAY. This is important as once a stripe
|
|
|
|
* is added to a batch, STRIPE_BIT_DELAY cannot be changed
|
|
|
|
* any more.
|
|
|
|
*/
|
|
|
|
set_bit(STRIPE_BITMAP_PENDING, &sh->state);
|
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_startwrite(conf->mddev->bitmap, sh->sector,
|
|
|
|
STRIPE_SECTORS, 0);
|
2015-05-27 05:43:45 +07:00
|
|
|
spin_lock_irq(&sh->stripe_lock);
|
|
|
|
clear_bit(STRIPE_BITMAP_PENDING, &sh->state);
|
|
|
|
if (!sh->batch_head) {
|
|
|
|
sh->bm_seq = conf->seq_flush+1;
|
|
|
|
set_bit(STRIPE_BIT_DELAY, &sh->state);
|
|
|
|
}
|
2011-07-26 08:20:35 +07:00
|
|
|
}
|
2015-05-27 05:43:45 +07:00
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
2014-12-15 08:57:03 +07:00
|
|
|
|
|
|
|
if (stripe_can_batch(sh))
|
|
|
|
stripe_add_to_batch_list(conf, sh);
|
2005-04-17 05:20:36 +07:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
overlap:
|
|
|
|
set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 13:01:31 +07:00
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void end_reshape(struct r5conf *conf);
|
2006-03-27 16:18:10 +07:00
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
|
2009-03-31 10:39:38 +07:00
|
|
|
struct stripe_head *sh)
|
2006-03-27 16:18:09 +07:00
|
|
|
{
|
2009-03-31 11:19:07 +07:00
|
|
|
int sectors_per_chunk =
|
2009-06-18 05:45:55 +07:00
|
|
|
previous ? conf->prev_chunk_sectors : conf->chunk_sectors;
|
2009-03-31 10:39:38 +07:00
|
|
|
int dd_idx;
|
2006-10-03 15:15:50 +07:00
|
|
|
int chunk_offset = sector_div(stripe, sectors_per_chunk);
|
2009-03-31 10:39:38 +07:00
|
|
|
int disks = previous ? conf->previous_raid_disks : conf->raid_disks;
|
2006-10-03 15:15:50 +07:00
|
|
|
|
2009-03-31 10:39:38 +07:00
|
|
|
raid5_compute_sector(conf,
|
|
|
|
stripe * (disks - conf->max_degraded)
|
2006-12-10 17:20:49 +07:00
|
|
|
*sectors_per_chunk + chunk_offset,
|
2009-03-31 10:39:38 +07:00
|
|
|
previous,
|
2009-03-31 10:39:38 +07:00
|
|
|
&dd_idx, sh);
|
2006-03-27 16:18:09 +07:00
|
|
|
}
|
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
static void
|
2011-10-11 12:49:52 +07:00
|
|
|
handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
|
2017-03-15 10:05:12 +07:00
|
|
|
struct stripe_head_state *s, int disks)
|
2007-07-10 01:56:43 +07:00
|
|
|
{
|
|
|
|
int i;
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2007-07-10 01:56:43 +07:00
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct bio *bi;
|
|
|
|
int bitmap_end = 0;
|
|
|
|
|
|
|
|
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev;
|
2007-07-10 01:56:43 +07:00
|
|
|
rcu_read_lock();
|
|
|
|
rdev = rcu_dereference(conf->disks[i].rdev);
|
2016-06-02 13:19:53 +07:00
|
|
|
if (rdev && test_bit(In_sync, &rdev->flags) &&
|
|
|
|
!test_bit(Faulty, &rdev->flags))
|
2011-07-28 08:39:22 +07:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
else
|
|
|
|
rdev = NULL;
|
2007-07-10 01:56:43 +07:00
|
|
|
rcu_read_unlock();
|
2011-07-28 08:39:22 +07:00
|
|
|
if (rdev) {
|
|
|
|
if (!rdev_set_badblocks(
|
|
|
|
rdev,
|
|
|
|
sh->sector,
|
|
|
|
STRIPE_SECTORS, 0))
|
|
|
|
md_error(conf->mddev, rdev);
|
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
|
|
|
}
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 13:01:31 +07:00
|
|
|
spin_lock_irq(&sh->stripe_lock);
|
2007-07-10 01:56:43 +07:00
|
|
|
/* fail all writes first */
|
|
|
|
bi = sh->dev[i].towrite;
|
|
|
|
sh->dev[i].towrite = NULL;
|
2014-12-15 08:57:03 +07:00
|
|
|
sh->overwrite_disks = 0;
|
raid5: add a per-stripe lock
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.
stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.
If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared, there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe
Let's have an example:
| stripe1 | stripe2 | stripe3 |
...bio1......|bio2|bio3|....bio4.....
stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.
before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 13:01:31 +07:00
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
2012-10-11 09:50:13 +07:00
|
|
|
if (bi)
|
2007-07-10 01:56:43 +07:00
|
|
|
bitmap_end = 1;
|
|
|
|
|
2017-03-09 15:59:58 +07:00
|
|
|
log_stripe_write_finished(sh);
|
raid5: log reclaim support
This is the reclaim support for raid5 log. A stripe write will have
following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused
In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.
It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.
Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.
Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.
There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.
The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.
Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-14 04:32:00 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
|
|
|
|
wake_up(&conf->wait_for_overlap);
|
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
while (bi && bi->bi_iter.bi_sector <
|
2007-07-10 01:56:43 +07:00
|
|
|
sh->dev[i].sector + STRIPE_SECTORS) {
|
|
|
|
struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
|
2015-07-20 20:29:37 +07:00
|
|
|
|
md/raid5: use md_write_start to count stripes, not bios
We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it count stripe_heads that a WRITE bio
has been attached to.
So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().
The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requsts with
raid5_dec_bi_active_stripes(), we now call md_write_end().
This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 10:05:12 +07:00
|
|
|
md_write_end(conf->mddev);
|
2017-07-21 15:33:44 +07:00
|
|
|
bio_io_error(bi);
|
2007-07-10 01:56:43 +07:00
|
|
|
bi = nextbi;
|
|
|
|
}
|
2012-07-19 13:01:31 +07:00
|
|
|
if (bitmap_end)
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_endwrite(conf->mddev->bitmap, sh->sector,
|
|
|
|
STRIPE_SECTORS, 0, 0);
|
2012-07-19 13:01:31 +07:00
|
|
|
bitmap_end = 0;
|
2007-07-10 01:56:43 +07:00
|
|
|
/* and fail all 'written' */
|
|
|
|
bi = sh->dev[i].written;
|
|
|
|
sh->dev[i].written = NULL;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
if (test_and_clear_bit(R5_SkipCopy, &sh->dev[i].flags)) {
|
|
|
|
WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
|
|
|
|
sh->dev[i].page = sh->dev[i].orig_page;
|
|
|
|
}
|
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
if (bi) bitmap_end = 1;
|
2013-10-12 05:44:27 +07:00
|
|
|
while (bi && bi->bi_iter.bi_sector <
|
2007-07-10 01:56:43 +07:00
|
|
|
sh->dev[i].sector + STRIPE_SECTORS) {
|
|
|
|
struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
|
2015-07-20 20:29:37 +07:00
|
|
|
|
md/raid5: use md_write_start to count stripes, not bios
We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it count stripe_heads that a WRITE bio
has been attached to.
So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().
The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requsts with
raid5_dec_bi_active_stripes(), we now call md_write_end().
This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 10:05:12 +07:00
|
|
|
md_write_end(conf->mddev);
|
2017-07-21 15:33:44 +07:00
|
|
|
bio_io_error(bi);
|
2007-07-10 01:56:43 +07:00
|
|
|
bi = bi2;
|
|
|
|
}
|
|
|
|
|
2007-01-03 03:52:31 +07:00
|
|
|
/* fail any reads if this device is non-operational and
|
|
|
|
* the data has not reached the cache yet.
|
|
|
|
*/
|
|
|
|
if (!test_bit(R5_Wantfill, &sh->dev[i].flags) &&
|
2015-10-09 11:54:08 +07:00
|
|
|
s->failed > conf->max_degraded &&
|
2007-01-03 03:52:31 +07:00
|
|
|
(!test_bit(R5_Insync, &sh->dev[i].flags) ||
|
|
|
|
test_bit(R5_ReadError, &sh->dev[i].flags))) {
|
2012-10-11 09:50:12 +07:00
|
|
|
spin_lock_irq(&sh->stripe_lock);
|
2007-07-10 01:56:43 +07:00
|
|
|
bi = sh->dev[i].toread;
|
|
|
|
sh->dev[i].toread = NULL;
|
2012-10-11 09:50:12 +07:00
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
2007-07-10 01:56:43 +07:00
|
|
|
if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
|
|
|
|
wake_up(&conf->wait_for_overlap);
|
2015-09-19 00:20:13 +07:00
|
|
|
if (bi)
|
|
|
|
s->to_read--;
|
2013-10-12 05:44:27 +07:00
|
|
|
while (bi && bi->bi_iter.bi_sector <
|
2007-07-10 01:56:43 +07:00
|
|
|
sh->dev[i].sector + STRIPE_SECTORS) {
|
|
|
|
struct bio *nextbi =
|
|
|
|
r5_next_bio(bi, sh->dev[i].sector);
|
2015-07-20 20:29:37 +07:00
|
|
|
|
2017-07-21 15:33:44 +07:00
|
|
|
bio_io_error(bi);
|
2007-07-10 01:56:43 +07:00
|
|
|
bi = nextbi;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (bitmap_end)
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_endwrite(conf->mddev->bitmap, sh->sector,
|
|
|
|
STRIPE_SECTORS, 0, 0);
|
2011-07-27 08:00:36 +07:00
|
|
|
/* If we were in the middle of a write the parity block might
|
|
|
|
* still be locked - so just clear all R5_LOCKED flags
|
|
|
|
*/
|
|
|
|
clear_bit(R5_LOCKED, &sh->dev[i].flags);
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
2015-09-19 00:20:13 +07:00
|
|
|
s->to_write = 0;
|
|
|
|
s->written = 0;
|
2007-07-10 01:56:43 +07:00
|
|
|
|
2008-04-28 16:15:53 +07:00
|
|
|
if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
|
|
|
|
if (atomic_dec_and_test(&conf->pending_full_writes))
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
|
|
|
|
2011-07-28 08:39:22 +07:00
|
|
|
static void
|
2011-10-11 12:49:52 +07:00
|
|
|
handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
|
2011-07-28 08:39:22 +07:00
|
|
|
struct stripe_head_state *s)
|
|
|
|
{
|
|
|
|
int abort = 0;
|
|
|
|
int i;
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2011-07-28 08:39:22 +07:00
|
|
|
clear_bit(STRIPE_SYNCING, &sh->state);
|
2013-03-12 08:18:06 +07:00
|
|
|
if (test_and_clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags))
|
|
|
|
wake_up(&conf->wait_for_overlap);
|
2011-07-28 08:39:22 +07:00
|
|
|
s->syncing = 0;
|
2011-12-23 06:17:53 +07:00
|
|
|
s->replacing = 0;
|
2011-07-28 08:39:22 +07:00
|
|
|
/* There is nothing more to do for sync/check/repair.
|
2012-04-01 20:48:38 +07:00
|
|
|
* Don't even need to abort as that is handled elsewhere
|
|
|
|
* if needed, and not always wanted e.g. if there is a known
|
|
|
|
* bad block here.
|
2011-12-23 06:17:53 +07:00
|
|
|
* For recover/replace we need to record a bad block on all
|
2011-07-28 08:39:22 +07:00
|
|
|
* non-sync devices, or abort the recovery
|
|
|
|
*/
|
2012-04-01 20:48:38 +07:00
|
|
|
if (test_bit(MD_RECOVERY_RECOVER, &conf->mddev->recovery)) {
|
|
|
|
/* During recovery devices cannot be removed, so
|
|
|
|
* locking and refcounting of rdevs is not needed
|
|
|
|
*/
|
2016-06-02 13:19:52 +07:00
|
|
|
rcu_read_lock();
|
2012-04-01 20:48:38 +07:00
|
|
|
for (i = 0; i < conf->raid_disks; i++) {
|
2016-06-02 13:19:52 +07:00
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev);
|
2012-04-01 20:48:38 +07:00
|
|
|
if (rdev
|
|
|
|
&& !test_bit(Faulty, &rdev->flags)
|
|
|
|
&& !test_bit(In_sync, &rdev->flags)
|
|
|
|
&& !rdev_set_badblocks(rdev, sh->sector,
|
|
|
|
STRIPE_SECTORS, 0))
|
|
|
|
abort = 1;
|
2016-06-02 13:19:52 +07:00
|
|
|
rdev = rcu_dereference(conf->disks[i].replacement);
|
2012-04-01 20:48:38 +07:00
|
|
|
if (rdev
|
|
|
|
&& !test_bit(Faulty, &rdev->flags)
|
|
|
|
&& !test_bit(In_sync, &rdev->flags)
|
|
|
|
&& !rdev_set_badblocks(rdev, sh->sector,
|
|
|
|
STRIPE_SECTORS, 0))
|
|
|
|
abort = 1;
|
|
|
|
}
|
2016-06-02 13:19:52 +07:00
|
|
|
rcu_read_unlock();
|
2012-04-01 20:48:38 +07:00
|
|
|
if (abort)
|
|
|
|
conf->recovery_disabled =
|
|
|
|
conf->mddev->recovery_disabled;
|
2011-07-28 08:39:22 +07:00
|
|
|
}
|
2012-04-01 20:48:38 +07:00
|
|
|
md_done_sync(conf->mddev, STRIPE_SECTORS, !abort);
|
2011-07-28 08:39:22 +07:00
|
|
|
}
|
|
|
|
|
2011-12-23 06:17:53 +07:00
|
|
|
static int want_replace(struct stripe_head *sh, int disk_idx)
|
|
|
|
{
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
int rv = 0;
|
2016-06-02 13:19:52 +07:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
rdev = rcu_dereference(sh->raid_conf->disks[disk_idx].replacement);
|
2011-12-23 06:17:53 +07:00
|
|
|
if (rdev
|
|
|
|
&& !test_bit(Faulty, &rdev->flags)
|
|
|
|
&& !test_bit(In_sync, &rdev->flags)
|
|
|
|
&& (rdev->recovery_offset <= sh->sector
|
|
|
|
|| rdev->mddev->recovery_cp <= sh->sector))
|
|
|
|
rv = 1;
|
2016-06-02 13:19:52 +07:00
|
|
|
rcu_read_unlock();
|
2011-12-23 06:17:53 +07:00
|
|
|
return rv;
|
|
|
|
}
|
|
|
|
|
2015-02-02 07:32:23 +07:00
|
|
|
static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
|
|
|
|
int disk_idx, int disks)
|
2007-07-10 01:56:43 +07:00
|
|
|
{
|
2009-08-30 09:13:12 +07:00
|
|
|
struct r5dev *dev = &sh->dev[disk_idx];
|
2011-07-26 08:35:19 +07:00
|
|
|
struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
|
|
|
|
&sh->dev[s->failed_num[1]] };
|
2015-02-02 10:03:28 +07:00
|
|
|
int i;
|
2009-08-30 09:13:12 +07:00
|
|
|
|
2015-02-02 07:37:59 +07:00
|
|
|
|
|
|
|
if (test_bit(R5_LOCKED, &dev->flags) ||
|
|
|
|
test_bit(R5_UPTODATE, &dev->flags))
|
|
|
|
/* No point reading this as we already have it or have
|
|
|
|
* decided to get it.
|
|
|
|
*/
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (dev->toread ||
|
|
|
|
(dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)))
|
|
|
|
/* We need this block to directly satisfy a request */
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if (s->syncing || s->expanding ||
|
|
|
|
(s->replacing && want_replace(sh, disk_idx)))
|
|
|
|
/* When syncing, or expanding we read everything.
|
|
|
|
* When replacing, we need the replaced block.
|
|
|
|
*/
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if ((s->failed >= 1 && fdev[0]->toread) ||
|
|
|
|
(s->failed >= 2 && fdev[1]->toread))
|
|
|
|
/* If we want to read from a failed device, then
|
|
|
|
* we need to actually read every other device.
|
|
|
|
*/
|
|
|
|
return 1;
|
|
|
|
|
2015-02-02 07:49:10 +07:00
|
|
|
/* Sometimes neither read-modify-write nor reconstruct-write
|
|
|
|
* cycles can work. In those cases we read every block we
|
|
|
|
* can. Then the parity-update is certain to have enough to
|
|
|
|
* work with.
|
|
|
|
* This can only be a problem when we need to write something,
|
|
|
|
* and some device has failed. If either of those tests
|
|
|
|
* fail we need look no further.
|
|
|
|
*/
|
|
|
|
if (!s->failed || !s->to_write)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (test_bit(R5_Insync, &dev->flags) &&
|
|
|
|
!test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
|
|
|
/* Pre-reads at not permitted until after short delay
|
|
|
|
* to gather multiple requests. However if this
|
2017-03-15 15:14:53 +07:00
|
|
|
* device is no Insync, the block could only be computed
|
2015-02-02 07:49:10 +07:00
|
|
|
* and there is no need to delay that.
|
|
|
|
*/
|
|
|
|
return 0;
|
2015-02-02 10:03:28 +07:00
|
|
|
|
2015-09-24 12:25:36 +07:00
|
|
|
for (i = 0; i < s->failed && i < 2; i++) {
|
2015-02-02 10:03:28 +07:00
|
|
|
if (fdev[i]->towrite &&
|
|
|
|
!test_bit(R5_UPTODATE, &fdev[i]->flags) &&
|
|
|
|
!test_bit(R5_OVERWRITE, &fdev[i]->flags))
|
|
|
|
/* If we have a partial write to a failed
|
|
|
|
* device, then we will need to reconstruct
|
|
|
|
* the content of that device, so all other
|
|
|
|
* devices must be read.
|
|
|
|
*/
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If we are forced to do a reconstruct-write, either because
|
|
|
|
* the current RAID6 implementation only supports that, or
|
2017-03-15 15:14:53 +07:00
|
|
|
* because parity cannot be trusted and we are currently
|
2015-02-02 10:03:28 +07:00
|
|
|
* recovering it, there is extra need to be careful.
|
|
|
|
* If one of the devices that we would need to read, because
|
|
|
|
* it is not being overwritten (and maybe not written at all)
|
|
|
|
* is missing/faulty, then we need to read everything we can.
|
|
|
|
*/
|
|
|
|
if (sh->raid_conf->level != 6 &&
|
|
|
|
sh->sector < sh->raid_conf->mddev->recovery_cp)
|
|
|
|
/* reconstruct-write isn't being forced */
|
|
|
|
return 0;
|
2015-09-24 12:25:36 +07:00
|
|
|
for (i = 0; i < s->failed && i < 2; i++) {
|
2015-05-08 15:19:33 +07:00
|
|
|
if (s->failed_num[i] != sh->pd_idx &&
|
|
|
|
s->failed_num[i] != sh->qd_idx &&
|
|
|
|
!test_bit(R5_UPTODATE, &fdev[i]->flags) &&
|
2015-02-02 10:03:28 +07:00
|
|
|
!test_bit(R5_OVERWRITE, &fdev[i]->flags))
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2015-02-02 07:32:23 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-01-13 08:22:42 +07:00
|
|
|
/* fetch_block - checks the given member device to see if its data needs
|
|
|
|
* to be read or computed to satisfy a request.
|
|
|
|
*
|
|
|
|
* Returns 1 when no more member devices need to be checked, otherwise returns
|
|
|
|
* 0 to tell the loop in handle_stripe_fill to continue
|
|
|
|
*/
|
2015-02-02 07:32:23 +07:00
|
|
|
static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
|
|
|
|
int disk_idx, int disks)
|
|
|
|
{
|
|
|
|
struct r5dev *dev = &sh->dev[disk_idx];
|
|
|
|
|
|
|
|
/* is the data in this block needed, and can we get it? */
|
|
|
|
if (need_this_block(sh, s, disk_idx, disks)) {
|
2009-08-30 09:13:12 +07:00
|
|
|
/* we would like to get this block, possibly by computing it,
|
|
|
|
* otherwise read it if the backing disk is insync
|
|
|
|
*/
|
|
|
|
BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
|
|
|
|
BUG_ON(test_bit(R5_Wantread, &dev->flags));
|
2015-05-08 15:19:32 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2017-04-03 09:11:32 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* In the raid6 case if the only non-uptodate disk is P
|
|
|
|
* then we already trusted P to compute the other failed
|
|
|
|
* drives. It is safe to compute rather than re-read P.
|
|
|
|
* In other cases we only compute blocks from failed
|
|
|
|
* devices, otherwise check/repair might fail to detect
|
|
|
|
* a real inconsistency.
|
|
|
|
*/
|
|
|
|
|
2009-08-30 09:13:12 +07:00
|
|
|
if ((s->uptodate == disks - 1) &&
|
2017-04-03 09:11:32 +07:00
|
|
|
((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
|
2011-07-26 08:35:19 +07:00
|
|
|
(s->failed && (disk_idx == s->failed_num[0] ||
|
2017-04-03 09:11:32 +07:00
|
|
|
disk_idx == s->failed_num[1])))) {
|
2009-08-30 09:13:12 +07:00
|
|
|
/* have disk failed, and we're requested to fetch it;
|
|
|
|
* do compute it
|
2007-07-10 01:56:43 +07:00
|
|
|
*/
|
2009-08-30 09:13:12 +07:00
|
|
|
pr_debug("Computing stripe %llu block %d\n",
|
|
|
|
(unsigned long long)sh->sector, disk_idx);
|
|
|
|
set_bit(STRIPE_COMPUTE_RUN, &sh->state);
|
|
|
|
set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
|
|
|
|
set_bit(R5_Wantcompute, &dev->flags);
|
|
|
|
sh->ops.target = disk_idx;
|
|
|
|
sh->ops.target2 = -1; /* no 2nd target */
|
|
|
|
s->req_compute = 1;
|
2011-07-27 08:00:36 +07:00
|
|
|
/* Careful: from this point on 'uptodate' is in the eye
|
|
|
|
* of raid_run_ops which services 'compute' operations
|
|
|
|
* before writes. R5_Wantcompute flags a block that will
|
|
|
|
* be R5_UPTODATE by the time it is needed for a
|
|
|
|
* subsequent operation.
|
|
|
|
*/
|
2009-08-30 09:13:12 +07:00
|
|
|
s->uptodate++;
|
|
|
|
return 1;
|
|
|
|
} else if (s->uptodate == disks-2 && s->failed >= 2) {
|
|
|
|
/* Computing 2-failure is *very* expensive; only
|
|
|
|
* do it if failed >= 2
|
|
|
|
*/
|
|
|
|
int other;
|
|
|
|
for (other = disks; other--; ) {
|
|
|
|
if (other == disk_idx)
|
|
|
|
continue;
|
|
|
|
if (!test_bit(R5_UPTODATE,
|
|
|
|
&sh->dev[other].flags))
|
|
|
|
break;
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
2009-08-30 09:13:12 +07:00
|
|
|
BUG_ON(other < 0);
|
|
|
|
pr_debug("Computing stripe %llu blocks %d,%d\n",
|
|
|
|
(unsigned long long)sh->sector,
|
|
|
|
disk_idx, other);
|
|
|
|
set_bit(STRIPE_COMPUTE_RUN, &sh->state);
|
|
|
|
set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
|
|
|
|
set_bit(R5_Wantcompute, &sh->dev[disk_idx].flags);
|
|
|
|
set_bit(R5_Wantcompute, &sh->dev[other].flags);
|
|
|
|
sh->ops.target = disk_idx;
|
|
|
|
sh->ops.target2 = other;
|
|
|
|
s->uptodate += 2;
|
|
|
|
s->req_compute = 1;
|
|
|
|
return 1;
|
|
|
|
} else if (test_bit(R5_Insync, &dev->flags)) {
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
set_bit(R5_Wantread, &dev->flags);
|
|
|
|
s->locked++;
|
|
|
|
pr_debug("Reading block %d (sync=%d)\n",
|
|
|
|
disk_idx, s->syncing);
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
|
|
|
}
|
2009-08-30 09:13:12 +07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2011-07-27 08:00:36 +07:00
|
|
|
* handle_stripe_fill - read or compute data to satisfy pending requests.
|
2009-08-30 09:13:12 +07:00
|
|
|
*/
|
2011-07-27 08:00:36 +07:00
|
|
|
static void handle_stripe_fill(struct stripe_head *sh,
|
|
|
|
struct stripe_head_state *s,
|
|
|
|
int disks)
|
2009-08-30 09:13:12 +07:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* look for blocks to read/compute, skip this if a compute
|
|
|
|
* is already in flight, or if the stripe contents are in the
|
|
|
|
* midst of changing due to a write
|
|
|
|
*/
|
|
|
|
if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
|
md/r5cache: shift complex rmw from read path to write path
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 08:12:58 +07:00
|
|
|
!sh->reconstruct_state) {
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For degraded stripe with data in journal, do not handle
|
|
|
|
* read requests yet, instead, flush the stripe to raid
|
|
|
|
* disks first, this avoids handling complex rmw of write
|
|
|
|
* back cache (prexor with orig_page, and then xor with
|
|
|
|
* page) in the read path
|
|
|
|
*/
|
|
|
|
if (s->injournal && s->failed) {
|
|
|
|
if (test_bit(STRIPE_R5C_CACHING, &sh->state))
|
|
|
|
r5c_make_stripe_write_out(sh);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2009-08-30 09:13:12 +07:00
|
|
|
for (i = disks; i--; )
|
2011-07-27 08:00:36 +07:00
|
|
|
if (fetch_block(sh, s, i, disks))
|
2009-08-30 09:13:12 +07:00
|
|
|
break;
|
md/r5cache: shift complex rmw from read path to write path
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 08:12:58 +07:00
|
|
|
}
|
|
|
|
out:
|
2007-07-10 01:56:43 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
}
|
|
|
|
|
2015-05-21 09:56:41 +07:00
|
|
|
static void break_stripe_batch_list(struct stripe_head *head_sh,
|
|
|
|
unsigned long handle_flags);
|
2008-06-28 06:16:30 +07:00
|
|
|
/* handle_stripe_clean_event
|
2007-07-10 01:56:43 +07:00
|
|
|
* any written block on an uptodate or failed drive can be returned.
|
|
|
|
* Note that if we 'wrote' to a failed drive, it will be UPTODATE, but
|
|
|
|
* never LOCKED, so we don't need to test 'failed' directly.
|
|
|
|
*/
|
2011-10-11 12:49:52 +07:00
|
|
|
static void handle_stripe_clean_event(struct r5conf *conf,
|
2017-03-15 10:05:12 +07:00
|
|
|
struct stripe_head *sh, int disks)
|
2007-07-10 01:56:43 +07:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct r5dev *dev;
|
2013-03-12 08:18:06 +07:00
|
|
|
int discard_pending = 0;
|
2014-12-15 08:57:03 +07:00
|
|
|
struct stripe_head *head_sh = sh;
|
|
|
|
bool do_endio = false;
|
2007-07-10 01:56:43 +07:00
|
|
|
|
|
|
|
for (i = disks; i--; )
|
|
|
|
if (sh->dev[i].written) {
|
|
|
|
dev = &sh->dev[i];
|
|
|
|
if (!test_bit(R5_LOCKED, &dev->flags) &&
|
2012-10-11 09:49:49 +07:00
|
|
|
(test_bit(R5_UPTODATE, &dev->flags) ||
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
test_bit(R5_Discard, &dev->flags) ||
|
|
|
|
test_bit(R5_SkipCopy, &dev->flags))) {
|
2007-07-10 01:56:43 +07:00
|
|
|
/* We can return any write requests */
|
|
|
|
struct bio *wbi, *wbi2;
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("Return write for disc %d\n", i);
|
2012-11-21 12:33:40 +07:00
|
|
|
if (test_and_clear_bit(R5_Discard, &dev->flags))
|
|
|
|
clear_bit(R5_UPTODATE, &dev->flags);
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
if (test_and_clear_bit(R5_SkipCopy, &dev->flags)) {
|
|
|
|
WARN_ON(test_bit(R5_UPTODATE, &dev->flags));
|
|
|
|
}
|
2014-12-15 08:57:03 +07:00
|
|
|
do_endio = true;
|
|
|
|
|
|
|
|
returnbi:
|
|
|
|
dev->page = dev->orig_page;
|
2007-07-10 01:56:43 +07:00
|
|
|
wbi = dev->written;
|
|
|
|
dev->written = NULL;
|
2013-10-12 05:44:27 +07:00
|
|
|
while (wbi && wbi->bi_iter.bi_sector <
|
2007-07-10 01:56:43 +07:00
|
|
|
dev->sector + STRIPE_SECTORS) {
|
|
|
|
wbi2 = r5_next_bio(wbi, dev->sector);
|
md/raid5: use md_write_start to count stripes, not bios
We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it count stripe_heads that a WRITE bio
has been attached to.
So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().
The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requsts with
raid5_dec_bi_active_stripes(), we now call md_write_end().
This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 10:05:12 +07:00
|
|
|
md_write_end(conf->mddev);
|
2017-03-15 10:05:13 +07:00
|
|
|
bio_endio(wbi);
|
2007-07-10 01:56:43 +07:00
|
|
|
wbi = wbi2;
|
|
|
|
}
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_endwrite(conf->mddev->bitmap, sh->sector,
|
|
|
|
STRIPE_SECTORS,
|
|
|
|
!test_bit(STRIPE_DEGRADED, &sh->state),
|
|
|
|
0);
|
2014-12-15 08:57:03 +07:00
|
|
|
if (head_sh->batch_head) {
|
|
|
|
sh = list_first_entry(&sh->batch_list,
|
|
|
|
struct stripe_head,
|
|
|
|
batch_list);
|
|
|
|
if (sh != head_sh) {
|
|
|
|
dev = &sh->dev[i];
|
|
|
|
goto returnbi;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
sh = head_sh;
|
|
|
|
dev = &sh->dev[i];
|
2013-03-12 08:18:06 +07:00
|
|
|
} else if (test_bit(R5_Discard, &dev->flags))
|
|
|
|
discard_pending = 1;
|
|
|
|
}
|
raid5: add basic stripe log
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.
The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.
For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.
For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.
flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-14 04:31:59 +07:00
|
|
|
|
2017-03-09 15:59:58 +07:00
|
|
|
log_stripe_write_finished(sh);
|
raid5: log reclaim support
This is the reclaim support for raid5 log. A stripe write will have
following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused
In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.
It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.
Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.
Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.
There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.
The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.
Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-14 04:32:00 +07:00
|
|
|
|
2013-03-12 08:18:06 +07:00
|
|
|
if (!discard_pending &&
|
|
|
|
test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags)) {
|
2015-10-31 06:53:50 +07:00
|
|
|
int hash;
|
2013-03-12 08:18:06 +07:00
|
|
|
clear_bit(R5_Discard, &sh->dev[sh->pd_idx].flags);
|
|
|
|
clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
|
|
|
|
if (sh->qd_idx >= 0) {
|
|
|
|
clear_bit(R5_Discard, &sh->dev[sh->qd_idx].flags);
|
|
|
|
clear_bit(R5_UPTODATE, &sh->dev[sh->qd_idx].flags);
|
|
|
|
}
|
|
|
|
/* now that discard is done we can proceed with any sync */
|
|
|
|
clear_bit(STRIPE_DISCARD, &sh->state);
|
2013-10-19 13:51:42 +07:00
|
|
|
/*
|
|
|
|
* SCSI discard will change some bio fields and the stripe has
|
|
|
|
* no updated data, so remove it from hash list and the stripe
|
|
|
|
* will be reinitialized
|
|
|
|
*/
|
2014-12-15 08:57:03 +07:00
|
|
|
unhash:
|
2015-10-31 06:53:50 +07:00
|
|
|
hash = sh->hash_lock_index;
|
|
|
|
spin_lock_irq(conf->hash_locks + hash);
|
2013-10-19 13:51:42 +07:00
|
|
|
remove_hash(sh);
|
2015-10-31 06:53:50 +07:00
|
|
|
spin_unlock_irq(conf->hash_locks + hash);
|
2014-12-15 08:57:03 +07:00
|
|
|
if (head_sh->batch_head) {
|
|
|
|
sh = list_first_entry(&sh->batch_list,
|
|
|
|
struct stripe_head, batch_list);
|
|
|
|
if (sh != head_sh)
|
|
|
|
goto unhash;
|
|
|
|
}
|
|
|
|
sh = head_sh;
|
|
|
|
|
2013-03-12 08:18:06 +07:00
|
|
|
if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state))
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
|
|
|
|
}
|
2008-04-28 16:15:53 +07:00
|
|
|
|
|
|
|
if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
|
|
|
|
if (atomic_dec_and_test(&conf->pending_full_writes))
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
2014-12-15 08:57:03 +07:00
|
|
|
|
2015-05-21 09:56:41 +07:00
|
|
|
if (head_sh->batch_head && do_endio)
|
|
|
|
break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
|
|
|
|
2017-01-13 08:22:41 +07:00
|
|
|
/*
|
|
|
|
* For RMW in write back cache, we need extra page in prexor to store the
|
|
|
|
* old data. This page is stored in dev->orig_page.
|
|
|
|
*
|
|
|
|
* This function checks whether we have data for prexor. The exact logic
|
|
|
|
* is:
|
|
|
|
* R5_UPTODATE && (!R5_InJournal || R5_OrigPageUPTDODATE)
|
|
|
|
*/
|
|
|
|
static inline bool uptodate_for_rmw(struct r5dev *dev)
|
|
|
|
{
|
|
|
|
return (test_bit(R5_UPTODATE, &dev->flags)) &&
|
|
|
|
(!test_bit(R5_InJournal, &dev->flags) ||
|
|
|
|
test_bit(R5_OrigPageUPTDODATE, &dev->flags));
|
|
|
|
}
|
|
|
|
|
2016-11-24 13:50:39 +07:00
|
|
|
static int handle_stripe_dirtying(struct r5conf *conf,
|
|
|
|
struct stripe_head *sh,
|
|
|
|
struct stripe_head_state *s,
|
|
|
|
int disks)
|
2007-07-10 01:56:43 +07:00
|
|
|
{
|
|
|
|
int rmw = 0, rcw = 0, i;
|
2012-10-11 09:50:12 +07:00
|
|
|
sector_t recovery_cp = conf->mddev->recovery_cp;
|
|
|
|
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
/* Check whether resync is now happening or should start.
|
2012-10-11 09:50:12 +07:00
|
|
|
* If yes, then the array is dirty (after unclean shutdown or
|
|
|
|
* initial creation), so parity in some stripes might be inconsistent.
|
|
|
|
* In this case, we need to always do reconstruct-write, to ensure
|
|
|
|
* that in case of drive failure or read-error correction, we
|
|
|
|
* generate correct data from the parity.
|
|
|
|
*/
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
if (conf->rmw_level == PARITY_DISABLE_RMW ||
|
2015-02-18 07:35:14 +07:00
|
|
|
(recovery_cp < MaxSector && sh->sector >= recovery_cp &&
|
|
|
|
s->failed == 0)) {
|
2012-10-11 09:50:12 +07:00
|
|
|
/* Calculate the real rcw later - for now make it
|
2011-07-27 08:00:36 +07:00
|
|
|
* look like rcw is cheaper
|
|
|
|
*/
|
|
|
|
rcw = 1; rmw = 2;
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
pr_debug("force RCW rmw_level=%u, recovery_cp=%llu sh->sector=%llu\n",
|
|
|
|
conf->rmw_level, (unsigned long long)recovery_cp,
|
2012-10-11 09:50:12 +07:00
|
|
|
(unsigned long long)sh->sector);
|
2011-07-27 08:00:36 +07:00
|
|
|
} else for (i = disks; i--; ) {
|
2007-07-10 01:56:43 +07:00
|
|
|
/* would I have to read this buffer for read_modify_write */
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
2017-01-25 05:08:23 +07:00
|
|
|
if (((dev->towrite && !delay_towrite(conf, dev, s)) ||
|
md/r5cache: shift complex rmw from read path to write path
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.
However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.
To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.
Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.
There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 08:12:58 +07:00
|
|
|
i == sh->pd_idx || i == sh->qd_idx ||
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
test_bit(R5_InJournal, &dev->flags)) &&
|
2007-07-10 01:56:43 +07:00
|
|
|
!test_bit(R5_LOCKED, &dev->flags) &&
|
2017-01-13 08:22:41 +07:00
|
|
|
!(uptodate_for_rmw(dev) ||
|
2007-01-03 03:52:30 +07:00
|
|
|
test_bit(R5_Wantcompute, &dev->flags))) {
|
2007-07-10 01:56:43 +07:00
|
|
|
if (test_bit(R5_Insync, &dev->flags))
|
|
|
|
rmw++;
|
|
|
|
else
|
|
|
|
rmw += 2*disks; /* cannot read it */
|
|
|
|
}
|
|
|
|
/* Would I have to read this buffer for reconstruct_write */
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
if (!test_bit(R5_OVERWRITE, &dev->flags) &&
|
|
|
|
i != sh->pd_idx && i != sh->qd_idx &&
|
2007-07-10 01:56:43 +07:00
|
|
|
!test_bit(R5_LOCKED, &dev->flags) &&
|
2007-01-03 03:52:30 +07:00
|
|
|
!(test_bit(R5_UPTODATE, &dev->flags) ||
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
test_bit(R5_Wantcompute, &dev->flags))) {
|
2014-05-28 10:39:22 +07:00
|
|
|
if (test_bit(R5_Insync, &dev->flags))
|
|
|
|
rcw++;
|
2007-07-10 01:56:43 +07:00
|
|
|
else
|
|
|
|
rcw += 2*disks;
|
|
|
|
}
|
|
|
|
}
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
2017-01-25 05:08:23 +07:00
|
|
|
pr_debug("for sector %llu state 0x%lx, rmw=%d rcw=%d\n",
|
|
|
|
(unsigned long long)sh->sector, sh->state, rmw, rcw);
|
2007-07-10 01:56:43 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2016-05-24 07:25:06 +07:00
|
|
|
if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_PREFER_RMW)) && rmw > 0) {
|
2007-07-10 01:56:43 +07:00
|
|
|
/* prefer read-modify-write, but need to get some data */
|
2013-03-08 05:22:01 +07:00
|
|
|
if (conf->mddev->queue)
|
|
|
|
blk_add_trace_msg(conf->mddev->queue,
|
|
|
|
"raid5 rmw %llu %d",
|
|
|
|
(unsigned long long)sh->sector, rmw);
|
2007-07-10 01:56:43 +07:00
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
if (test_bit(R5_InJournal, &dev->flags) &&
|
|
|
|
dev->page == dev->orig_page &&
|
|
|
|
!test_bit(R5_LOCKED, &sh->dev[sh->pd_idx].flags)) {
|
|
|
|
/* alloc page for prexor */
|
2016-11-24 13:50:39 +07:00
|
|
|
struct page *p = alloc_page(GFP_NOIO);
|
|
|
|
|
|
|
|
if (p) {
|
|
|
|
dev->orig_page = p;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* alloc_page() failed, try use
|
|
|
|
* disk_info->extra_page
|
|
|
|
*/
|
|
|
|
if (!test_and_set_bit(R5C_EXTRA_PAGE_IN_USE,
|
|
|
|
&conf->cache_state)) {
|
|
|
|
r5c_use_extra_page(sh);
|
|
|
|
break;
|
|
|
|
}
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
2016-11-24 13:50:39 +07:00
|
|
|
/* extra_page in use, add to delayed_list */
|
|
|
|
set_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
s->waiting_extra_page = 1;
|
|
|
|
return -EAGAIN;
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
}
|
2016-11-24 13:50:39 +07:00
|
|
|
}
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
2016-11-24 13:50:39 +07:00
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
2017-01-25 05:08:23 +07:00
|
|
|
if (((dev->towrite && !delay_towrite(conf, dev, s)) ||
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
i == sh->pd_idx || i == sh->qd_idx ||
|
|
|
|
test_bit(R5_InJournal, &dev->flags)) &&
|
2007-07-10 01:56:43 +07:00
|
|
|
!test_bit(R5_LOCKED, &dev->flags) &&
|
2017-01-13 08:22:41 +07:00
|
|
|
!(uptodate_for_rmw(dev) ||
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
test_bit(R5_Wantcompute, &dev->flags)) &&
|
2007-07-10 01:56:43 +07:00
|
|
|
test_bit(R5_Insync, &dev->flags)) {
|
2014-05-28 10:39:22 +07:00
|
|
|
if (test_bit(STRIPE_PREREAD_ACTIVE,
|
|
|
|
&sh->state)) {
|
|
|
|
pr_debug("Read_old block %d for r-m-w\n",
|
|
|
|
i);
|
2007-07-10 01:56:43 +07:00
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
set_bit(R5_Wantread, &dev->flags);
|
|
|
|
s->locked++;
|
|
|
|
} else {
|
|
|
|
set_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2012-10-31 07:59:09 +07:00
|
|
|
}
|
2016-05-24 07:25:06 +07:00
|
|
|
if ((rcw < rmw || (rcw == rmw && conf->rmw_level != PARITY_PREFER_RMW)) && rcw > 0) {
|
2007-07-10 01:56:43 +07:00
|
|
|
/* want reconstruct write, but need to get some data */
|
2012-10-31 07:59:09 +07:00
|
|
|
int qread =0;
|
2011-07-27 08:00:36 +07:00
|
|
|
rcw = 0;
|
2007-07-10 01:56:43 +07:00
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
if (!test_bit(R5_OVERWRITE, &dev->flags) &&
|
2011-07-27 08:00:36 +07:00
|
|
|
i != sh->pd_idx && i != sh->qd_idx &&
|
2007-07-10 01:56:43 +07:00
|
|
|
!test_bit(R5_LOCKED, &dev->flags) &&
|
2007-01-03 03:52:30 +07:00
|
|
|
!(test_bit(R5_UPTODATE, &dev->flags) ||
|
2011-07-27 08:00:36 +07:00
|
|
|
test_bit(R5_Wantcompute, &dev->flags))) {
|
|
|
|
rcw++;
|
2014-05-28 10:39:22 +07:00
|
|
|
if (test_bit(R5_Insync, &dev->flags) &&
|
|
|
|
test_bit(STRIPE_PREREAD_ACTIVE,
|
|
|
|
&sh->state)) {
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("Read_old block "
|
2007-07-10 01:56:43 +07:00
|
|
|
"%d for Reconstruct\n", i);
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
set_bit(R5_Wantread, &dev->flags);
|
|
|
|
s->locked++;
|
2012-10-31 07:59:09 +07:00
|
|
|
qread++;
|
2007-07-10 01:56:43 +07:00
|
|
|
} else {
|
|
|
|
set_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2013-03-08 05:22:01 +07:00
|
|
|
if (rcw && conf->mddev->queue)
|
2012-10-31 07:59:09 +07:00
|
|
|
blk_add_trace_msg(conf->mddev->queue, "raid5 rcw %llu %d %d %d",
|
|
|
|
(unsigned long long)sh->sector,
|
|
|
|
rcw, qread, test_bit(STRIPE_DELAYED, &sh->state));
|
2011-07-27 08:00:36 +07:00
|
|
|
}
|
2015-02-02 06:44:29 +07:00
|
|
|
|
|
|
|
if (rcw > disks && rmw > disks &&
|
|
|
|
!test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
|
|
|
set_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
/* now if nothing is locked, and if we have enough data,
|
|
|
|
* we can start a write request
|
|
|
|
*/
|
2007-01-03 03:52:30 +07:00
|
|
|
/* since handle_stripe can be called at any time we need to handle the
|
|
|
|
* case where a compute block operation has been submitted and then a
|
2009-07-15 03:40:19 +07:00
|
|
|
* subsequent call wants to start a write request. raid_run_ops only
|
|
|
|
* handles the case where compute block and reconstruct are requested
|
2007-01-03 03:52:30 +07:00
|
|
|
* simultaneously. If this is not the case then new writes need to be
|
|
|
|
* held off until the compute completes.
|
|
|
|
*/
|
2008-06-28 05:32:03 +07:00
|
|
|
if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
|
|
|
|
(s->locked == 0 && (rcw == 0 || rmw == 0) &&
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
!test_bit(STRIPE_BIT_DELAY, &sh->state)))
|
2009-08-30 09:13:12 +07:00
|
|
|
schedule_reconstruction(sh, s, rcw == 0, 0);
|
2016-11-24 13:50:39 +07:00
|
|
|
return 0;
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
|
2007-07-10 01:56:43 +07:00
|
|
|
struct stripe_head_state *s, int disks)
|
|
|
|
{
|
2008-06-28 05:31:57 +07:00
|
|
|
struct r5dev *dev = NULL;
|
2008-04-11 11:29:27 +07:00
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2007-07-10 01:56:43 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2007-01-03 03:52:31 +07:00
|
|
|
|
2008-06-28 05:31:57 +07:00
|
|
|
switch (sh->check_state) {
|
|
|
|
case check_state_idle:
|
|
|
|
/* start a new check operation if there are no failures */
|
2008-04-11 11:29:27 +07:00
|
|
|
if (s->failed == 0) {
|
|
|
|
BUG_ON(s->uptodate != disks);
|
2008-06-28 05:31:57 +07:00
|
|
|
sh->check_state = check_state_run;
|
|
|
|
set_bit(STRIPE_OP_CHECK, &s->ops_request);
|
2008-04-11 11:29:27 +07:00
|
|
|
clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
|
|
|
|
s->uptodate--;
|
2008-06-28 05:31:57 +07:00
|
|
|
break;
|
2008-04-11 11:29:27 +07:00
|
|
|
}
|
2011-07-26 08:35:19 +07:00
|
|
|
dev = &sh->dev[s->failed_num[0]];
|
2008-06-28 05:31:57 +07:00
|
|
|
/* fall through */
|
|
|
|
case check_state_compute_result:
|
|
|
|
sh->check_state = check_state_idle;
|
|
|
|
if (!dev)
|
|
|
|
dev = &sh->dev[sh->pd_idx];
|
|
|
|
|
|
|
|
/* check that a write has not made the stripe insync */
|
|
|
|
if (test_bit(STRIPE_INSYNC, &sh->state))
|
|
|
|
break;
|
2008-05-13 04:02:12 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
/* either failed parity check, or recovery is happening */
|
|
|
|
BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
|
|
|
|
BUG_ON(s->uptodate != disks);
|
|
|
|
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
2008-06-28 05:31:57 +07:00
|
|
|
s->locked++;
|
2007-07-10 01:56:43 +07:00
|
|
|
set_bit(R5_Wantwrite, &dev->flags);
|
2007-01-03 03:52:31 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
clear_bit(STRIPE_DEGRADED, &sh->state);
|
|
|
|
set_bit(STRIPE_INSYNC, &sh->state);
|
2008-06-28 05:31:57 +07:00
|
|
|
break;
|
|
|
|
case check_state_run:
|
|
|
|
break; /* we will be called again upon completion */
|
|
|
|
case check_state_check_result:
|
|
|
|
sh->check_state = check_state_idle;
|
|
|
|
|
|
|
|
/* if a failure occurred during the check operation, leave
|
|
|
|
* STRIPE_INSYNC not set and let the stripe be handled again
|
|
|
|
*/
|
|
|
|
if (s->failed)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* handle a successful check operation, if parity is correct
|
|
|
|
* we are done. Otherwise update the mismatch count and repair
|
|
|
|
* parity if !MD_RECOVERY_CHECK
|
|
|
|
*/
|
2009-08-30 09:09:26 +07:00
|
|
|
if ((sh->ops.zero_sum_result & SUM_CHECK_P_RESULT) == 0)
|
2008-06-28 05:31:57 +07:00
|
|
|
/* parity is correct (on disc,
|
|
|
|
* not in buffer any more)
|
|
|
|
*/
|
|
|
|
set_bit(STRIPE_INSYNC, &sh->state);
|
|
|
|
else {
|
2012-10-11 10:17:59 +07:00
|
|
|
atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
|
2017-05-16 16:13:31 +07:00
|
|
|
if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
|
2008-06-28 05:31:57 +07:00
|
|
|
/* don't try to repair!! */
|
|
|
|
set_bit(STRIPE_INSYNC, &sh->state);
|
2017-05-16 16:13:31 +07:00
|
|
|
pr_warn_ratelimited("%s: mismatch sector in range "
|
|
|
|
"%llu-%llu\n", mdname(conf->mddev),
|
|
|
|
(unsigned long long) sh->sector,
|
|
|
|
(unsigned long long) sh->sector +
|
|
|
|
STRIPE_SECTORS);
|
|
|
|
} else {
|
2008-06-28 05:31:57 +07:00
|
|
|
sh->check_state = check_state_compute_run;
|
2008-06-28 05:32:03 +07:00
|
|
|
set_bit(STRIPE_COMPUTE_RUN, &sh->state);
|
2008-06-28 05:31:57 +07:00
|
|
|
set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
|
|
|
|
set_bit(R5_Wantcompute,
|
|
|
|
&sh->dev[sh->pd_idx].flags);
|
|
|
|
sh->ops.target = sh->pd_idx;
|
2009-07-15 03:40:19 +07:00
|
|
|
sh->ops.target2 = -1;
|
2008-06-28 05:31:57 +07:00
|
|
|
s->uptodate++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case check_state_compute_run:
|
|
|
|
break;
|
|
|
|
default:
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_err("%s: unknown check_state: %d sector: %llu\n",
|
2008-06-28 05:31:57 +07:00
|
|
|
__func__, sh->check_state,
|
|
|
|
(unsigned long long) sh->sector);
|
|
|
|
BUG();
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
|
2009-07-15 01:48:22 +07:00
|
|
|
struct stripe_head_state *s,
|
2011-07-26 08:35:19 +07:00
|
|
|
int disks)
|
2007-07-10 01:56:43 +07:00
|
|
|
{
|
|
|
|
int pd_idx = sh->pd_idx;
|
2009-03-31 11:10:16 +07:00
|
|
|
int qd_idx = sh->qd_idx;
|
2009-07-15 03:40:57 +07:00
|
|
|
struct r5dev *dev;
|
2007-07-10 01:56:43 +07:00
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2007-07-10 01:56:43 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
|
|
|
|
BUG_ON(s->failed > 2);
|
2009-07-15 03:40:57 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
/* Want to check and possibly repair P and Q.
|
|
|
|
* However there could be one 'failed' device, in which
|
|
|
|
* case we can only check one of them, possibly using the
|
|
|
|
* other to generate missing data
|
|
|
|
*/
|
|
|
|
|
2009-07-15 03:40:57 +07:00
|
|
|
switch (sh->check_state) {
|
|
|
|
case check_state_idle:
|
|
|
|
/* start a new check operation if there are < 2 failures */
|
2011-07-26 08:35:19 +07:00
|
|
|
if (s->failed == s->q_failed) {
|
2009-07-15 03:40:57 +07:00
|
|
|
/* The only possible failed device holds Q, so it
|
2007-07-10 01:56:43 +07:00
|
|
|
* makes sense to check P (If anything else were failed,
|
|
|
|
* we would have used P to recreate it).
|
|
|
|
*/
|
2009-07-15 03:40:57 +07:00
|
|
|
sh->check_state = check_state_run;
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
2011-07-26 08:35:19 +07:00
|
|
|
if (!s->q_failed && s->failed < 2) {
|
2009-07-15 03:40:57 +07:00
|
|
|
/* Q is not failed, and we didn't use it to generate
|
2007-07-10 01:56:43 +07:00
|
|
|
* anything, so it makes sense to check it
|
|
|
|
*/
|
2009-07-15 03:40:57 +07:00
|
|
|
if (sh->check_state == check_state_run)
|
|
|
|
sh->check_state = check_state_run_pq;
|
|
|
|
else
|
|
|
|
sh->check_state = check_state_run_q;
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:57 +07:00
|
|
|
/* discard potentially stale zero_sum_result */
|
|
|
|
sh->ops.zero_sum_result = 0;
|
2007-07-10 01:56:43 +07:00
|
|
|
|
2009-07-15 03:40:57 +07:00
|
|
|
if (sh->check_state == check_state_run) {
|
|
|
|
/* async_xor_zero_sum destroys the contents of P */
|
|
|
|
clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
|
|
|
|
s->uptodate--;
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
2009-07-15 03:40:57 +07:00
|
|
|
if (sh->check_state >= check_state_run &&
|
|
|
|
sh->check_state <= check_state_run_pq) {
|
|
|
|
/* async_syndrome_zero_sum preserves P and Q, so
|
|
|
|
* no need to mark them !uptodate here
|
|
|
|
*/
|
|
|
|
set_bit(STRIPE_OP_CHECK, &s->ops_request);
|
|
|
|
break;
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
|
|
|
|
2009-07-15 03:40:57 +07:00
|
|
|
/* we have 2-disk failure */
|
|
|
|
BUG_ON(s->failed != 2);
|
|
|
|
/* fall through */
|
|
|
|
case check_state_compute_result:
|
|
|
|
sh->check_state = check_state_idle;
|
2007-07-10 01:56:43 +07:00
|
|
|
|
2009-07-15 03:40:57 +07:00
|
|
|
/* check that a write has not made the stripe insync */
|
|
|
|
if (test_bit(STRIPE_INSYNC, &sh->state))
|
|
|
|
break;
|
2007-07-10 01:56:43 +07:00
|
|
|
|
|
|
|
/* now write out any block on a failed drive,
|
2009-07-15 03:40:57 +07:00
|
|
|
* or P or Q if they were recomputed
|
2007-07-10 01:56:43 +07:00
|
|
|
*/
|
2009-07-15 03:40:57 +07:00
|
|
|
BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
|
2007-07-10 01:56:43 +07:00
|
|
|
if (s->failed == 2) {
|
2011-07-26 08:35:19 +07:00
|
|
|
dev = &sh->dev[s->failed_num[1]];
|
2007-07-10 01:56:43 +07:00
|
|
|
s->locked++;
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
set_bit(R5_Wantwrite, &dev->flags);
|
|
|
|
}
|
|
|
|
if (s->failed >= 1) {
|
2011-07-26 08:35:19 +07:00
|
|
|
dev = &sh->dev[s->failed_num[0]];
|
2007-07-10 01:56:43 +07:00
|
|
|
s->locked++;
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
set_bit(R5_Wantwrite, &dev->flags);
|
|
|
|
}
|
2009-07-15 03:40:57 +07:00
|
|
|
if (sh->ops.zero_sum_result & SUM_CHECK_P_RESULT) {
|
2007-07-10 01:56:43 +07:00
|
|
|
dev = &sh->dev[pd_idx];
|
|
|
|
s->locked++;
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
set_bit(R5_Wantwrite, &dev->flags);
|
|
|
|
}
|
2009-07-15 03:40:57 +07:00
|
|
|
if (sh->ops.zero_sum_result & SUM_CHECK_Q_RESULT) {
|
2007-07-10 01:56:43 +07:00
|
|
|
dev = &sh->dev[qd_idx];
|
|
|
|
s->locked++;
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
set_bit(R5_Wantwrite, &dev->flags);
|
|
|
|
}
|
|
|
|
clear_bit(STRIPE_DEGRADED, &sh->state);
|
|
|
|
|
|
|
|
set_bit(STRIPE_INSYNC, &sh->state);
|
2009-07-15 03:40:57 +07:00
|
|
|
break;
|
|
|
|
case check_state_run:
|
|
|
|
case check_state_run_q:
|
|
|
|
case check_state_run_pq:
|
|
|
|
break; /* we will be called again upon completion */
|
|
|
|
case check_state_check_result:
|
|
|
|
sh->check_state = check_state_idle;
|
|
|
|
|
|
|
|
/* handle a successful check operation, if parity is correct
|
|
|
|
* we are done. Otherwise update the mismatch count and repair
|
|
|
|
* parity if !MD_RECOVERY_CHECK
|
|
|
|
*/
|
|
|
|
if (sh->ops.zero_sum_result == 0) {
|
|
|
|
/* both parities are correct */
|
|
|
|
if (!s->failed)
|
|
|
|
set_bit(STRIPE_INSYNC, &sh->state);
|
|
|
|
else {
|
|
|
|
/* in contrast to the raid5 case we can validate
|
|
|
|
* parity, but still have a failure to write
|
|
|
|
* back
|
|
|
|
*/
|
|
|
|
sh->check_state = check_state_compute_result;
|
|
|
|
/* Returning at this point means that we may go
|
|
|
|
* off and bring p and/or q uptodate again so
|
|
|
|
* we make sure to check zero_sum_result again
|
|
|
|
* to verify if p or q need writeback
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
} else {
|
2012-10-11 10:17:59 +07:00
|
|
|
atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
|
2017-05-16 16:13:31 +07:00
|
|
|
if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
|
2009-07-15 03:40:57 +07:00
|
|
|
/* don't try to repair!! */
|
|
|
|
set_bit(STRIPE_INSYNC, &sh->state);
|
2017-05-16 16:13:31 +07:00
|
|
|
pr_warn_ratelimited("%s: mismatch sector in range "
|
|
|
|
"%llu-%llu\n", mdname(conf->mddev),
|
|
|
|
(unsigned long long) sh->sector,
|
|
|
|
(unsigned long long) sh->sector +
|
|
|
|
STRIPE_SECTORS);
|
|
|
|
} else {
|
2009-07-15 03:40:57 +07:00
|
|
|
int *target = &sh->ops.target;
|
|
|
|
|
|
|
|
sh->ops.target = -1;
|
|
|
|
sh->ops.target2 = -1;
|
|
|
|
sh->check_state = check_state_compute_run;
|
|
|
|
set_bit(STRIPE_COMPUTE_RUN, &sh->state);
|
|
|
|
set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
|
|
|
|
if (sh->ops.zero_sum_result & SUM_CHECK_P_RESULT) {
|
|
|
|
set_bit(R5_Wantcompute,
|
|
|
|
&sh->dev[pd_idx].flags);
|
|
|
|
*target = pd_idx;
|
|
|
|
target = &sh->ops.target2;
|
|
|
|
s->uptodate++;
|
|
|
|
}
|
|
|
|
if (sh->ops.zero_sum_result & SUM_CHECK_Q_RESULT) {
|
|
|
|
set_bit(R5_Wantcompute,
|
|
|
|
&sh->dev[qd_idx].flags);
|
|
|
|
*target = qd_idx;
|
|
|
|
s->uptodate++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case check_state_compute_run:
|
|
|
|
break;
|
|
|
|
default:
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("%s: unknown check_state: %d sector: %llu\n",
|
|
|
|
__func__, sh->check_state,
|
|
|
|
(unsigned long long) sh->sector);
|
2009-07-15 03:40:57 +07:00
|
|
|
BUG();
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
|
2007-07-10 01:56:43 +07:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* We have read all the blocks in this stripe and now we need to
|
|
|
|
* copy some of them into a target stripe for expand.
|
|
|
|
*/
|
2007-01-03 03:52:31 +07:00
|
|
|
struct dma_async_tx_descriptor *tx = NULL;
|
2014-12-15 08:57:03 +07:00
|
|
|
BUG_ON(sh->batch_head);
|
2007-07-10 01:56:43 +07:00
|
|
|
clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
|
|
|
|
for (i = 0; i < sh->disks; i++)
|
2009-03-31 11:10:16 +07:00
|
|
|
if (i != sh->pd_idx && i != sh->qd_idx) {
|
2009-03-31 10:39:38 +07:00
|
|
|
int dd_idx, j;
|
2007-07-10 01:56:43 +07:00
|
|
|
struct stripe_head *sh2;
|
2009-06-04 01:43:59 +07:00
|
|
|
struct async_submit_ctl submit;
|
2007-07-10 01:56:43 +07:00
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
sector_t bn = raid5_compute_blocknr(sh, i, 1);
|
2009-03-31 10:39:38 +07:00
|
|
|
sector_t s = raid5_compute_sector(conf, bn, 0,
|
|
|
|
&dd_idx, NULL);
|
2015-08-14 04:31:57 +07:00
|
|
|
sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1);
|
2007-07-10 01:56:43 +07:00
|
|
|
if (sh2 == NULL)
|
|
|
|
/* so far only the early blocks of this stripe
|
|
|
|
* have been requested. When later blocks
|
|
|
|
* get requested, we will try again
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
if (!test_bit(STRIPE_EXPANDING, &sh2->state) ||
|
|
|
|
test_bit(R5_Expanded, &sh2->dev[dd_idx].flags)) {
|
|
|
|
/* must have already done this block */
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh2);
|
2007-07-10 01:56:43 +07:00
|
|
|
continue;
|
|
|
|
}
|
2007-01-03 03:52:31 +07:00
|
|
|
|
|
|
|
/* place all the copies on one channel */
|
2009-06-04 01:43:59 +07:00
|
|
|
init_async_submit(&submit, 0, tx, NULL, NULL, NULL);
|
2007-01-03 03:52:31 +07:00
|
|
|
tx = async_memcpy(sh2->dev[dd_idx].page,
|
2009-04-10 06:16:18 +07:00
|
|
|
sh->dev[i].page, 0, 0, STRIPE_SIZE,
|
2009-06-04 01:43:59 +07:00
|
|
|
&submit);
|
2007-01-03 03:52:31 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
|
|
|
|
set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
|
|
|
|
for (j = 0; j < conf->raid_disks; j++)
|
|
|
|
if (j != sh2->pd_idx &&
|
2011-07-27 08:00:36 +07:00
|
|
|
j != sh2->qd_idx &&
|
2007-07-10 01:56:43 +07:00
|
|
|
!test_bit(R5_Expanded, &sh2->dev[j].flags))
|
|
|
|
break;
|
|
|
|
if (j == conf->raid_disks) {
|
|
|
|
set_bit(STRIPE_EXPAND_READY, &sh2->state);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh2->state);
|
|
|
|
}
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh2);
|
2007-01-03 03:52:31 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
2007-09-12 05:23:36 +07:00
|
|
|
/* done submitting copies, wait for them to complete */
|
2012-11-20 10:11:15 +07:00
|
|
|
async_tx_quiesce(&tx);
|
2007-07-10 01:56:43 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* handle_stripe - do things to a stripe.
|
|
|
|
*
|
2011-12-23 06:17:53 +07:00
|
|
|
* We lock the stripe by setting STRIPE_ACTIVE and then examine the
|
|
|
|
* state of various bits to see what needs to be done.
|
2005-04-17 05:20:36 +07:00
|
|
|
* Possible results:
|
2011-12-23 06:17:53 +07:00
|
|
|
* return some read requests which now have data
|
|
|
|
* return some write requests which are safely on storage
|
2005-04-17 05:20:36 +07:00
|
|
|
* schedule a read on some buffers
|
|
|
|
* schedule a write of some buffers
|
|
|
|
* return confirmation of parity correctness
|
|
|
|
*
|
|
|
|
*/
|
2007-07-10 01:56:43 +07:00
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2007-03-01 11:11:53 +07:00
|
|
|
int disks = sh->disks;
|
2011-07-27 08:00:36 +07:00
|
|
|
struct r5dev *dev;
|
|
|
|
int i;
|
2011-12-23 06:17:53 +07:00
|
|
|
int do_recovery = 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
memset(s, 0, sizeof(*s));
|
|
|
|
|
2014-12-15 08:57:04 +07:00
|
|
|
s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state) && !sh->batch_head;
|
|
|
|
s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state) && !sh->batch_head;
|
2011-07-27 08:00:36 +07:00
|
|
|
s->failed_num[0] = -1;
|
|
|
|
s->failed_num[1] = -1;
|
2015-10-09 11:54:08 +07:00
|
|
|
s->log_failed = r5l_log_disk_error(conf);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
/* Now to look around and see what can be done */
|
2005-04-17 05:20:36 +07:00
|
|
|
rcu_read_lock();
|
2006-06-26 14:27:38 +07:00
|
|
|
for (i=disks; i--; ) {
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev;
|
2011-07-28 08:39:22 +07:00
|
|
|
sector_t first_bad;
|
|
|
|
int bad_sectors;
|
|
|
|
int is_bad = 0;
|
2011-07-27 08:00:36 +07:00
|
|
|
|
2006-06-26 14:27:38 +07:00
|
|
|
dev = &sh->dev[i];
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
|
2011-12-23 06:17:53 +07:00
|
|
|
i, dev->flags,
|
|
|
|
dev->toread, dev->towrite, dev->written);
|
2009-08-30 09:13:13 +07:00
|
|
|
/* maybe we can reply to a read
|
|
|
|
*
|
|
|
|
* new wantfill requests are only permitted while
|
|
|
|
* ops_complete_biofill is guaranteed to be inactive
|
|
|
|
*/
|
|
|
|
if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
|
|
|
|
!test_bit(STRIPE_BIOFILL_RUN, &sh->state))
|
|
|
|
set_bit(R5_Wantfill, &dev->flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-06-26 14:27:38 +07:00
|
|
|
/* now count some things */
|
2011-07-26 08:35:35 +07:00
|
|
|
if (test_bit(R5_LOCKED, &dev->flags))
|
|
|
|
s->locked++;
|
|
|
|
if (test_bit(R5_UPTODATE, &dev->flags))
|
|
|
|
s->uptodate++;
|
2009-09-17 02:11:54 +07:00
|
|
|
if (test_bit(R5_Wantcompute, &dev->flags)) {
|
2011-07-26 08:35:35 +07:00
|
|
|
s->compute++;
|
|
|
|
BUG_ON(s->compute > 2);
|
2009-09-17 02:11:54 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
if (test_bit(R5_Wantfill, &dev->flags))
|
2011-07-26 08:35:35 +07:00
|
|
|
s->to_fill++;
|
2011-07-27 08:00:36 +07:00
|
|
|
else if (dev->toread)
|
2011-07-26 08:35:35 +07:00
|
|
|
s->to_read++;
|
2006-06-26 14:27:38 +07:00
|
|
|
if (dev->towrite) {
|
2011-07-26 08:35:35 +07:00
|
|
|
s->to_write++;
|
2006-06-26 14:27:38 +07:00
|
|
|
if (!test_bit(R5_OVERWRITE, &dev->flags))
|
2011-07-26 08:35:35 +07:00
|
|
|
s->non_overwrite++;
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
2007-07-10 01:56:43 +07:00
|
|
|
if (dev->written)
|
2011-07-26 08:35:35 +07:00
|
|
|
s->written++;
|
2011-12-23 06:17:52 +07:00
|
|
|
/* Prefer to use the replacement for reads, but only
|
|
|
|
* if it is recovered enough and has no bad blocks.
|
|
|
|
*/
|
|
|
|
rdev = rcu_dereference(conf->disks[i].replacement);
|
|
|
|
if (rdev && !test_bit(Faulty, &rdev->flags) &&
|
|
|
|
rdev->recovery_offset >= sh->sector + STRIPE_SECTORS &&
|
|
|
|
!is_badblock(rdev, sh->sector, STRIPE_SECTORS,
|
|
|
|
&first_bad, &bad_sectors))
|
|
|
|
set_bit(R5_ReadRepl, &dev->flags);
|
|
|
|
else {
|
2015-07-17 10:26:23 +07:00
|
|
|
if (rdev && !test_bit(Faulty, &rdev->flags))
|
2011-12-23 06:17:53 +07:00
|
|
|
set_bit(R5_NeedReplace, &dev->flags);
|
2015-07-17 10:26:23 +07:00
|
|
|
else
|
|
|
|
clear_bit(R5_NeedReplace, &dev->flags);
|
2011-12-23 06:17:52 +07:00
|
|
|
rdev = rcu_dereference(conf->disks[i].rdev);
|
|
|
|
clear_bit(R5_ReadRepl, &dev->flags);
|
|
|
|
}
|
2011-12-08 12:27:57 +07:00
|
|
|
if (rdev && test_bit(Faulty, &rdev->flags))
|
|
|
|
rdev = NULL;
|
2011-07-28 08:39:22 +07:00
|
|
|
if (rdev) {
|
|
|
|
is_bad = is_badblock(rdev, sh->sector, STRIPE_SECTORS,
|
|
|
|
&first_bad, &bad_sectors);
|
|
|
|
if (s->blocked_rdev == NULL
|
|
|
|
&& (test_bit(Blocked, &rdev->flags)
|
|
|
|
|| is_bad < 0)) {
|
|
|
|
if (is_bad < 0)
|
|
|
|
set_bit(BlockedBadBlocks,
|
|
|
|
&rdev->flags);
|
|
|
|
s->blocked_rdev = rdev;
|
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
}
|
2008-04-30 14:52:32 +07:00
|
|
|
}
|
2010-06-17 14:25:21 +07:00
|
|
|
clear_bit(R5_Insync, &dev->flags);
|
|
|
|
if (!rdev)
|
|
|
|
/* Not in-sync */;
|
2011-07-28 08:39:22 +07:00
|
|
|
else if (is_bad) {
|
|
|
|
/* also not in-sync */
|
2012-04-01 20:48:38 +07:00
|
|
|
if (!test_bit(WriteErrorSeen, &rdev->flags) &&
|
|
|
|
test_bit(R5_UPTODATE, &dev->flags)) {
|
2011-07-28 08:39:22 +07:00
|
|
|
/* treat as in-sync, but with a read error
|
|
|
|
* which we can now try to correct
|
|
|
|
*/
|
|
|
|
set_bit(R5_Insync, &dev->flags);
|
|
|
|
set_bit(R5_ReadError, &dev->flags);
|
|
|
|
}
|
|
|
|
} else if (test_bit(In_sync, &rdev->flags))
|
2010-06-17 14:25:21 +07:00
|
|
|
set_bit(R5_Insync, &dev->flags);
|
2011-12-23 05:57:00 +07:00
|
|
|
else if (sh->sector + STRIPE_SECTORS <= rdev->recovery_offset)
|
2010-06-17 14:25:21 +07:00
|
|
|
/* in sync if before recovery_offset */
|
2011-12-23 05:57:00 +07:00
|
|
|
set_bit(R5_Insync, &dev->flags);
|
|
|
|
else if (test_bit(R5_UPTODATE, &dev->flags) &&
|
|
|
|
test_bit(R5_Expanded, &dev->flags))
|
|
|
|
/* If we've reshaped into here, we assume it is Insync.
|
|
|
|
* We will shortly update recovery_offset to make
|
|
|
|
* it official.
|
|
|
|
*/
|
|
|
|
set_bit(R5_Insync, &dev->flags);
|
|
|
|
|
2014-01-06 09:19:42 +07:00
|
|
|
if (test_bit(R5_WriteError, &dev->flags)) {
|
2011-12-23 06:17:52 +07:00
|
|
|
/* This flag does not apply to '.replacement'
|
|
|
|
* only to .rdev, so make sure to check that*/
|
|
|
|
struct md_rdev *rdev2 = rcu_dereference(
|
|
|
|
conf->disks[i].rdev);
|
|
|
|
if (rdev2 == rdev)
|
|
|
|
clear_bit(R5_Insync, &dev->flags);
|
|
|
|
if (rdev2 && !test_bit(Faulty, &rdev2->flags)) {
|
2011-07-28 08:39:22 +07:00
|
|
|
s->handle_bad_blocks = 1;
|
2011-12-23 06:17:52 +07:00
|
|
|
atomic_inc(&rdev2->nr_pending);
|
2011-07-28 08:39:22 +07:00
|
|
|
} else
|
|
|
|
clear_bit(R5_WriteError, &dev->flags);
|
|
|
|
}
|
2014-01-06 09:19:42 +07:00
|
|
|
if (test_bit(R5_MadeGood, &dev->flags)) {
|
2011-12-23 06:17:52 +07:00
|
|
|
/* This flag does not apply to '.replacement'
|
|
|
|
* only to .rdev, so make sure to check that*/
|
|
|
|
struct md_rdev *rdev2 = rcu_dereference(
|
|
|
|
conf->disks[i].rdev);
|
|
|
|
if (rdev2 && !test_bit(Faulty, &rdev2->flags)) {
|
2011-07-28 08:39:23 +07:00
|
|
|
s->handle_bad_blocks = 1;
|
2011-12-23 06:17:52 +07:00
|
|
|
atomic_inc(&rdev2->nr_pending);
|
2011-07-28 08:39:23 +07:00
|
|
|
} else
|
|
|
|
clear_bit(R5_MadeGood, &dev->flags);
|
|
|
|
}
|
2011-12-23 06:17:53 +07:00
|
|
|
if (test_bit(R5_MadeGoodRepl, &dev->flags)) {
|
|
|
|
struct md_rdev *rdev2 = rcu_dereference(
|
|
|
|
conf->disks[i].replacement);
|
|
|
|
if (rdev2 && !test_bit(Faulty, &rdev2->flags)) {
|
|
|
|
s->handle_bad_blocks = 1;
|
|
|
|
atomic_inc(&rdev2->nr_pending);
|
|
|
|
} else
|
|
|
|
clear_bit(R5_MadeGoodRepl, &dev->flags);
|
|
|
|
}
|
2010-06-17 14:25:21 +07:00
|
|
|
if (!test_bit(R5_Insync, &dev->flags)) {
|
2006-06-26 14:27:38 +07:00
|
|
|
/* The ReadError flag will just be confusing now */
|
|
|
|
clear_bit(R5_ReadError, &dev->flags);
|
|
|
|
clear_bit(R5_ReWrite, &dev->flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2010-06-17 14:25:21 +07:00
|
|
|
if (test_bit(R5_ReadError, &dev->flags))
|
|
|
|
clear_bit(R5_Insync, &dev->flags);
|
|
|
|
if (!test_bit(R5_Insync, &dev->flags)) {
|
2011-07-26 08:35:35 +07:00
|
|
|
if (s->failed < 2)
|
|
|
|
s->failed_num[s->failed] = i;
|
|
|
|
s->failed++;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (rdev && !test_bit(Faulty, &rdev->flags))
|
|
|
|
do_recovery = 1;
|
md/raid5: fix data corruption of replacements after originals dropped
During raid5 replacement, the stripes can be marked with R5_NeedReplace
flag. Data can be read from being-replaced devices and written to
replacing spares without reading all other devices. (It's 'replace'
mode. s.replacing = 1) If a being-replaced device is dropped, the
replacement progress will be interrupted and resumed with pure recovery
mode. However, existing stripes before being interrupted cannot read
from the dropped device anymore. It prints lots of WARN_ON messages.
And it results in data corruption because existing stripes write
problematic data into its replacement device and update the progress.
\# Erase disks (1MB + 2GB)
dd if=/dev/zero of=/dev/sda bs=1MB count=2049
dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
\# Ensure array stores non-zero data
dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
\# Start replacement
mdadm /dev/md0 -a /dev/sdd
mdadm /dev/md0 --replace /dev/sda
Then, Hot-plug out /dev/sda during recovery, and wait for recovery done.
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.
Soon after you hot-plug out /dev/sda, you will see many WARN_ON
messages. The replacement recovery will be interrupted shortly. After
the recovery finishes, it will result in data corruption.
Actually, it's just an unhandled case of replacement. In commit
<f94c0b6658c7> (md/raid5: fix interaction of 'replace' and 'recovery'.),
if a NeedReplace device is not UPTODATE then that is an error, the
commit just simply print WARN_ON but also mark these corrupted stripes
with R5_WantReplace. (it means it's ready for writes.)
To fix this case, we can leverage 'sync and replace' mode mentioned in
commit <9a3e1101b827> (md/raid5: detect and handle replacements during
recovery.). We can add logics to detect and use 'sync and replace' mode
for these stripes.
Reported-by: Alex Chen <alexchen@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-01 16:08:36 +07:00
|
|
|
else if (!rdev) {
|
|
|
|
rdev = rcu_dereference(
|
|
|
|
conf->disks[i].replacement);
|
|
|
|
if (rdev && !test_bit(Faulty, &rdev->flags))
|
|
|
|
do_recovery = 1;
|
|
|
|
}
|
2010-06-17 14:25:21 +07:00
|
|
|
}
|
md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With rhe RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:38 +07:00
|
|
|
|
|
|
|
if (test_bit(R5_InJournal, &dev->flags))
|
|
|
|
s->injournal++;
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
if (test_bit(R5_InJournal, &dev->flags) && dev->written)
|
|
|
|
s->just_cached++;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2011-12-23 06:17:53 +07:00
|
|
|
if (test_bit(STRIPE_SYNCING, &sh->state)) {
|
|
|
|
/* If there is a failed device being replaced,
|
|
|
|
* we must be recovering.
|
|
|
|
* else if we are after recovery_cp, we must be syncing
|
2012-04-01 22:16:59 +07:00
|
|
|
* else if MD_RECOVERY_REQUESTED is set, we also are syncing.
|
2011-12-23 06:17:53 +07:00
|
|
|
* else we can only be replacing
|
|
|
|
* sync and recovery both need to read all devices, and so
|
|
|
|
* use the same flag.
|
|
|
|
*/
|
|
|
|
if (do_recovery ||
|
2012-04-01 22:16:59 +07:00
|
|
|
sh->sector >= conf->mddev->recovery_cp ||
|
|
|
|
test_bit(MD_RECOVERY_REQUESTED, &(conf->mddev->recovery)))
|
2011-12-23 06:17:53 +07:00
|
|
|
s->syncing = 1;
|
|
|
|
else
|
|
|
|
s->replacing = 1;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
rcu_read_unlock();
|
2011-07-26 08:35:35 +07:00
|
|
|
}
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
static int clear_batch_ready(struct stripe_head *sh)
|
|
|
|
{
|
2015-05-22 12:20:04 +07:00
|
|
|
/* Return '1' if this is a member of batch, or
|
|
|
|
* '0' if it is a lone stripe or a head which can now be
|
|
|
|
* handled.
|
|
|
|
*/
|
2014-12-15 08:57:03 +07:00
|
|
|
struct stripe_head *tmp;
|
|
|
|
if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
|
2015-05-22 12:20:04 +07:00
|
|
|
return (sh->batch_head && sh->batch_head != sh);
|
2014-12-15 08:57:03 +07:00
|
|
|
spin_lock(&sh->stripe_lock);
|
|
|
|
if (!sh->batch_head) {
|
|
|
|
spin_unlock(&sh->stripe_lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* this stripe could be added to a batch list before we check
|
|
|
|
* BATCH_READY, skips it
|
|
|
|
*/
|
|
|
|
if (sh->batch_head != sh) {
|
|
|
|
spin_unlock(&sh->stripe_lock);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
spin_lock(&sh->batch_lock);
|
|
|
|
list_for_each_entry(tmp, &sh->batch_list, batch_list)
|
|
|
|
clear_bit(STRIPE_BATCH_READY, &tmp->state);
|
|
|
|
spin_unlock(&sh->batch_lock);
|
|
|
|
spin_unlock(&sh->stripe_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* BATCH_READY is cleared, no new stripes can be added.
|
|
|
|
* batch_list can be accessed without lock
|
|
|
|
*/
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-05-21 09:20:36 +07:00
|
|
|
static void break_stripe_batch_list(struct stripe_head *head_sh,
|
|
|
|
unsigned long handle_flags)
|
2014-12-15 08:57:03 +07:00
|
|
|
{
|
2015-05-21 08:50:16 +07:00
|
|
|
struct stripe_head *sh, *next;
|
2014-12-15 08:57:03 +07:00
|
|
|
int i;
|
2015-05-21 09:00:47 +07:00
|
|
|
int do_wakeup = 0;
|
2014-12-15 08:57:03 +07:00
|
|
|
|
2015-05-08 15:19:40 +07:00
|
|
|
list_for_each_entry_safe(sh, next, &head_sh->batch_list, batch_list) {
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
list_del_init(&sh->batch_list);
|
|
|
|
|
2016-03-10 01:08:38 +07:00
|
|
|
WARN_ONCE(sh->state & ((1 << STRIPE_ACTIVE) |
|
2015-05-21 09:40:26 +07:00
|
|
|
(1 << STRIPE_SYNCING) |
|
|
|
|
(1 << STRIPE_REPLACED) |
|
|
|
|
(1 << STRIPE_DELAYED) |
|
|
|
|
(1 << STRIPE_BIT_DELAY) |
|
|
|
|
(1 << STRIPE_FULL_WRITE) |
|
|
|
|
(1 << STRIPE_BIOFILL_RUN) |
|
|
|
|
(1 << STRIPE_COMPUTE_RUN) |
|
|
|
|
(1 << STRIPE_OPS_REQ_PENDING) |
|
|
|
|
(1 << STRIPE_DISCARD) |
|
|
|
|
(1 << STRIPE_BATCH_READY) |
|
|
|
|
(1 << STRIPE_BATCH_ERR) |
|
2016-03-10 01:08:38 +07:00
|
|
|
(1 << STRIPE_BITMAP_PENDING)),
|
|
|
|
"stripe state: %lx\n", sh->state);
|
|
|
|
WARN_ONCE(head_sh->state & ((1 << STRIPE_DISCARD) |
|
|
|
|
(1 << STRIPE_REPLACED)),
|
|
|
|
"head stripe state: %lx\n", head_sh->state);
|
2015-05-21 09:40:26 +07:00
|
|
|
|
|
|
|
set_mask_bits(&sh->state, ~(STRIPE_EXPAND_SYNC_FLAGS |
|
2016-03-09 08:58:25 +07:00
|
|
|
(1 << STRIPE_PREREAD_ACTIVE) |
|
2017-09-06 10:02:35 +07:00
|
|
|
(1 << STRIPE_DEGRADED) |
|
|
|
|
(1 << STRIPE_ON_UNPLUG_LIST)),
|
2015-05-21 09:40:26 +07:00
|
|
|
head_sh->state & (1 << STRIPE_INSYNC));
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
sh->check_state = head_sh->check_state;
|
|
|
|
sh->reconstruct_state = head_sh->reconstruct_state;
|
2018-05-16 17:59:35 +07:00
|
|
|
spin_lock_irq(&sh->stripe_lock);
|
|
|
|
sh->batch_head = NULL;
|
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
2015-05-21 09:00:47 +07:00
|
|
|
for (i = 0; i < sh->disks; i++) {
|
|
|
|
if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
|
|
|
|
do_wakeup = 1;
|
2014-12-15 08:57:03 +07:00
|
|
|
sh->dev[i].flags = head_sh->dev[i].flags &
|
|
|
|
(~((1 << R5_WriteError) | (1 << R5_Overlap)));
|
2015-05-21 09:00:47 +07:00
|
|
|
}
|
2015-05-21 09:20:36 +07:00
|
|
|
if (handle_flags == 0 ||
|
|
|
|
sh->state & handle_flags)
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2014-12-15 08:57:03 +07:00
|
|
|
}
|
2015-05-21 09:00:47 +07:00
|
|
|
spin_lock_irq(&head_sh->stripe_lock);
|
|
|
|
head_sh->batch_head = NULL;
|
|
|
|
spin_unlock_irq(&head_sh->stripe_lock);
|
|
|
|
for (i = 0; i < head_sh->disks; i++)
|
|
|
|
if (test_and_clear_bit(R5_Overlap, &head_sh->dev[i].flags))
|
|
|
|
do_wakeup = 1;
|
2015-05-21 09:20:36 +07:00
|
|
|
if (head_sh->state & handle_flags)
|
|
|
|
set_bit(STRIPE_HANDLE, &head_sh->state);
|
2015-05-21 09:00:47 +07:00
|
|
|
|
|
|
|
if (do_wakeup)
|
|
|
|
wake_up(&head_sh->raid_conf->wait_for_overlap);
|
2014-12-15 08:57:03 +07:00
|
|
|
}
|
|
|
|
|
2011-07-26 08:35:35 +07:00
|
|
|
static void handle_stripe(struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
struct stripe_head_state s;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = sh->raid_conf;
|
2011-07-27 08:00:36 +07:00
|
|
|
int i;
|
2011-07-27 08:00:36 +07:00
|
|
|
int prexor;
|
|
|
|
int disks = sh->disks;
|
2011-07-27 08:00:36 +07:00
|
|
|
struct r5dev *pdev, *qdev;
|
2011-07-26 08:35:35 +07:00
|
|
|
|
|
|
|
clear_bit(STRIPE_HANDLE, &sh->state);
|
2011-11-08 12:22:06 +07:00
|
|
|
if (test_and_set_bit_lock(STRIPE_ACTIVE, &sh->state)) {
|
2011-07-26 08:35:35 +07:00
|
|
|
/* already being handled, ensure it gets handled
|
|
|
|
* again when current action finishes */
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
if (clear_batch_ready(sh) ) {
|
|
|
|
clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2015-05-21 08:50:16 +07:00
|
|
|
if (test_and_clear_bit(STRIPE_BATCH_ERR, &sh->state))
|
2015-05-21 09:20:36 +07:00
|
|
|
break_stripe_batch_list(sh, 0);
|
2014-12-15 08:57:03 +07:00
|
|
|
|
2014-12-15 08:57:04 +07:00
|
|
|
if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state) && !sh->batch_head) {
|
2013-03-12 08:18:06 +07:00
|
|
|
spin_lock(&sh->stripe_lock);
|
2017-05-12 07:03:44 +07:00
|
|
|
/*
|
|
|
|
* Cannot process 'sync' concurrently with 'discard'.
|
|
|
|
* Flush data in r5cache before 'sync'.
|
|
|
|
*/
|
|
|
|
if (!test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state) &&
|
|
|
|
!test_bit(STRIPE_R5C_FULL_STRIPE, &sh->state) &&
|
|
|
|
!test_bit(STRIPE_DISCARD, &sh->state) &&
|
2013-03-12 08:18:06 +07:00
|
|
|
test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
|
|
|
|
set_bit(STRIPE_SYNCING, &sh->state);
|
|
|
|
clear_bit(STRIPE_INSYNC, &sh->state);
|
2013-07-22 09:57:21 +07:00
|
|
|
clear_bit(STRIPE_REPLACED, &sh->state);
|
2013-03-12 08:18:06 +07:00
|
|
|
}
|
|
|
|
spin_unlock(&sh->stripe_lock);
|
2011-07-26 08:35:35 +07:00
|
|
|
}
|
|
|
|
clear_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
|
|
|
|
pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
|
|
|
|
"pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
|
|
|
|
(unsigned long long)sh->sector, sh->state,
|
|
|
|
atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
|
|
|
|
sh->check_state, sh->reconstruct_state);
|
2011-07-27 08:00:36 +07:00
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
analyse_stripe(sh, &s);
|
2011-07-27 08:00:36 +07:00
|
|
|
|
2015-08-14 04:31:58 +07:00
|
|
|
if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
|
|
|
|
goto finish;
|
|
|
|
|
2017-03-15 10:05:12 +07:00
|
|
|
if (s.handle_bad_blocks ||
|
|
|
|
test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
|
2011-07-28 08:39:22 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
goto finish;
|
|
|
|
}
|
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
if (unlikely(s.blocked_rdev)) {
|
|
|
|
if (s.syncing || s.expanding || s.expanded ||
|
2011-12-23 06:17:53 +07:00
|
|
|
s.replacing || s.to_write || s.written) {
|
2011-07-27 08:00:36 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
goto finish;
|
|
|
|
}
|
|
|
|
/* There is nothing for the blocked_rdev to block */
|
|
|
|
rdev_dec_pending(s.blocked_rdev, conf->mddev);
|
|
|
|
s.blocked_rdev = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
|
|
|
|
set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
|
|
|
|
set_bit(STRIPE_BIOFILL_RUN, &sh->state);
|
|
|
|
}
|
|
|
|
|
|
|
|
pr_debug("locked=%d uptodate=%d to_read=%d"
|
|
|
|
" to_write=%d failed=%d failed_num=%d,%d\n",
|
|
|
|
s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
|
|
|
|
s.failed_num[0], s.failed_num[1]);
|
md/r5cache: gracefully handle journal device errors for writeback mode
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-12 05:28:28 +07:00
|
|
|
/*
|
|
|
|
* check if the array has lost more than max_degraded devices and,
|
2011-07-27 08:00:36 +07:00
|
|
|
* if so, some requests might need to be failed.
|
md/r5cache: gracefully handle journal device errors for writeback mode
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-12 05:28:28 +07:00
|
|
|
*
|
|
|
|
* When journal device failed (log_failed), we will only process
|
|
|
|
* the stripe if there is data need write to raid disks
|
2011-07-27 08:00:36 +07:00
|
|
|
*/
|
md/r5cache: gracefully handle journal device errors for writeback mode
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-12 05:28:28 +07:00
|
|
|
if (s.failed > conf->max_degraded ||
|
|
|
|
(s.log_failed && s.injournal == 0)) {
|
2011-11-08 12:22:01 +07:00
|
|
|
sh->check_state = 0;
|
|
|
|
sh->reconstruct_state = 0;
|
2015-05-22 11:03:10 +07:00
|
|
|
break_stripe_batch_list(sh, 0);
|
2011-11-08 12:22:01 +07:00
|
|
|
if (s.to_read+s.to_write+s.written)
|
2017-03-15 10:05:12 +07:00
|
|
|
handle_failed_stripe(conf, sh, &s, disks);
|
2011-12-23 06:17:53 +07:00
|
|
|
if (s.syncing + s.replacing)
|
2011-11-08 12:22:01 +07:00
|
|
|
handle_failed_sync(conf, sh, &s);
|
|
|
|
}
|
2011-07-27 08:00:36 +07:00
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
/* Now we check to see if any write operations have recently
|
|
|
|
* completed
|
|
|
|
*/
|
|
|
|
prexor = 0;
|
|
|
|
if (sh->reconstruct_state == reconstruct_state_prexor_drain_result)
|
|
|
|
prexor = 1;
|
|
|
|
if (sh->reconstruct_state == reconstruct_state_drain_result ||
|
|
|
|
sh->reconstruct_state == reconstruct_state_prexor_drain_result) {
|
|
|
|
sh->reconstruct_state = reconstruct_state_idle;
|
|
|
|
|
|
|
|
/* All the 'written' buffers and the parity block are ready to
|
|
|
|
* be written back to disk
|
|
|
|
*/
|
2012-10-11 09:49:49 +07:00
|
|
|
BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags) &&
|
|
|
|
!test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags));
|
2011-07-27 08:00:36 +07:00
|
|
|
BUG_ON(sh->qd_idx >= 0 &&
|
2012-10-11 09:49:49 +07:00
|
|
|
!test_bit(R5_UPTODATE, &sh->dev[sh->qd_idx].flags) &&
|
|
|
|
!test_bit(R5_Discard, &sh->dev[sh->qd_idx].flags));
|
2011-07-27 08:00:36 +07:00
|
|
|
for (i = disks; i--; ) {
|
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
if (test_bit(R5_LOCKED, &dev->flags) &&
|
|
|
|
(i == sh->pd_idx || i == sh->qd_idx ||
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
dev->written || test_bit(R5_InJournal,
|
|
|
|
&dev->flags))) {
|
2011-07-27 08:00:36 +07:00
|
|
|
pr_debug("Writing block %d\n", i);
|
|
|
|
set_bit(R5_Wantwrite, &dev->flags);
|
|
|
|
if (prexor)
|
|
|
|
continue;
|
2014-08-13 06:57:07 +07:00
|
|
|
if (s.failed > 1)
|
|
|
|
continue;
|
2011-07-27 08:00:36 +07:00
|
|
|
if (!test_bit(R5_Insync, &dev->flags) ||
|
|
|
|
((i == sh->pd_idx || i == sh->qd_idx) &&
|
|
|
|
s.failed == 0))
|
|
|
|
set_bit(STRIPE_INSYNC, &sh->state);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
|
|
|
s.dec_preread_active = 1;
|
|
|
|
}
|
|
|
|
|
2012-11-22 05:13:36 +07:00
|
|
|
/*
|
|
|
|
* might be able to return some write requests if the parity blocks
|
|
|
|
* are safe, or on a failed drive
|
|
|
|
*/
|
|
|
|
pdev = &sh->dev[sh->pd_idx];
|
|
|
|
s.p_failed = (s.failed >= 1 && s.failed_num[0] == sh->pd_idx)
|
|
|
|
|| (s.failed >= 2 && s.failed_num[1] == sh->pd_idx);
|
|
|
|
qdev = &sh->dev[sh->qd_idx];
|
|
|
|
s.q_failed = (s.failed >= 1 && s.failed_num[0] == sh->qd_idx)
|
|
|
|
|| (s.failed >= 2 && s.failed_num[1] == sh->qd_idx)
|
|
|
|
|| conf->level < 6;
|
|
|
|
|
|
|
|
if (s.written &&
|
|
|
|
(s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
|
|
|
|
&& !test_bit(R5_LOCKED, &pdev->flags)
|
|
|
|
&& (test_bit(R5_UPTODATE, &pdev->flags) ||
|
|
|
|
test_bit(R5_Discard, &pdev->flags))))) &&
|
|
|
|
(s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
|
|
|
|
&& !test_bit(R5_LOCKED, &qdev->flags)
|
|
|
|
&& (test_bit(R5_UPTODATE, &qdev->flags) ||
|
|
|
|
test_bit(R5_Discard, &qdev->flags))))))
|
2017-03-15 10:05:12 +07:00
|
|
|
handle_stripe_clean_event(conf, sh, disks);
|
2012-11-22 05:13:36 +07:00
|
|
|
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
if (s.just_cached)
|
2017-03-15 10:05:12 +07:00
|
|
|
r5c_handle_cached_data_endio(conf, sh, disks);
|
2017-03-09 15:59:58 +07:00
|
|
|
log_stripe_write_finished(sh);
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
2012-11-22 05:13:36 +07:00
|
|
|
/* Now we might consider reading some blocks, either to check/generate
|
|
|
|
* parity, or to satisfy requests
|
|
|
|
* or to load a block that is being partially written.
|
|
|
|
*/
|
|
|
|
if (s.to_read || s.non_overwrite
|
|
|
|
|| (conf->level == 6 && s.to_write && s.failed)
|
|
|
|
|| (s.syncing && (s.uptodate + s.compute < disks))
|
|
|
|
|| s.replacing
|
|
|
|
|| s.expanding)
|
|
|
|
handle_stripe_fill(sh, &s, disks);
|
|
|
|
|
md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With rhe RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:38 +07:00
|
|
|
/*
|
|
|
|
* When the stripe finishes full journal write cycle (write to journal
|
|
|
|
* and raid disk), this is the clean up procedure so it is ready for
|
|
|
|
* next operation.
|
|
|
|
*/
|
|
|
|
r5c_finish_stripe_write_out(conf, sh, &s);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now to consider new write requests, cache write back and what else,
|
|
|
|
* if anything should be read. We do not handle new writes when:
|
2011-07-27 08:00:36 +07:00
|
|
|
* 1/ A 'write' operation (copy+xor) is already in flight.
|
|
|
|
* 2/ A 'check' operation is in flight, as it may clobber the parity
|
|
|
|
* block.
|
md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With rhe RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:38 +07:00
|
|
|
* 3/ A r5c cache log write is in flight.
|
2011-07-27 08:00:36 +07:00
|
|
|
*/
|
md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With rhe RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:38 +07:00
|
|
|
|
|
|
|
if (!sh->reconstruct_state && !sh->check_state && !sh->log_io) {
|
|
|
|
if (!r5c_is_writeback(conf->log)) {
|
|
|
|
if (s.to_write)
|
|
|
|
handle_stripe_dirtying(conf, sh, &s, disks);
|
|
|
|
} else { /* write back cache */
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/* First, try handle writes in caching phase */
|
|
|
|
if (s.to_write)
|
|
|
|
ret = r5c_try_caching_write(conf, sh, &s,
|
|
|
|
disks);
|
|
|
|
/*
|
|
|
|
* If caching phase failed: ret == -EAGAIN
|
|
|
|
* OR
|
|
|
|
* stripe under reclaim: !caching && injournal
|
|
|
|
*
|
|
|
|
* fall back to handle_stripe_dirtying()
|
|
|
|
*/
|
|
|
|
if (ret == -EAGAIN ||
|
|
|
|
/* stripe under reclaim: !caching && injournal */
|
|
|
|
(!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
|
2016-11-24 13:50:39 +07:00
|
|
|
s.injournal > 0)) {
|
|
|
|
ret = handle_stripe_dirtying(conf, sh, &s,
|
|
|
|
disks);
|
|
|
|
if (ret == -EAGAIN)
|
|
|
|
goto finish;
|
|
|
|
}
|
md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.
With write-back cache, every stripe could operate in two different
phases:
- caching
- writing-out
In caching phase, the stripe handles writes as:
- write to journal
- return IO
In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* With rhe RAID cache, each stripe works in two phases:
* - caching phase
* - writing-out phase
*
* These two phases are controlled by bit STRIPE_R5C_CACHING:
* if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
* if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
*
* When there is no journal, or the journal is in write-through mode,
* the stripe is always in writing-out phase.
*
* For write-back journal, the stripe is sent to caching phase on write
* (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
* the write-out phase by clearing STRIPE_R5C_CACHING.
*
* Stripes in caching phase do not write the raid disks. Instead, all
* writes are committed from the log device. Therefore, a stripe in
* caching phase handles writes as:
* - write to log device
* - return IO
*
* Stripes in writing-out phase handle writes as:
* - calculate parity
* - write pending data and parity to journal
* - write data and parity to raid disks
* - return IO for pending writes
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:38 +07:00
|
|
|
}
|
|
|
|
}
|
2011-07-27 08:00:36 +07:00
|
|
|
|
|
|
|
/* maybe we need to check and possibly fix the parity for this stripe
|
|
|
|
* Any reads will already have been scheduled, so we just see if enough
|
|
|
|
* data is available. The parity check is held off while parity
|
|
|
|
* dependent operations are in flight.
|
|
|
|
*/
|
|
|
|
if (sh->check_state ||
|
|
|
|
(s.syncing && s.locked == 0 &&
|
|
|
|
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
|
|
|
|
!test_bit(STRIPE_INSYNC, &sh->state))) {
|
|
|
|
if (conf->level == 6)
|
|
|
|
handle_parity_checks6(conf, sh, &s, disks);
|
|
|
|
else
|
|
|
|
handle_parity_checks5(conf, sh, &s, disks);
|
|
|
|
}
|
2011-07-27 08:00:36 +07:00
|
|
|
|
2013-07-22 09:57:21 +07:00
|
|
|
if ((s.replacing || s.syncing) && s.locked == 0
|
|
|
|
&& !test_bit(STRIPE_COMPUTE_RUN, &sh->state)
|
|
|
|
&& !test_bit(STRIPE_REPLACED, &sh->state)) {
|
2011-12-23 06:17:53 +07:00
|
|
|
/* Write out to replacement devices where possible */
|
|
|
|
for (i = 0; i < conf->raid_disks; i++)
|
2013-07-22 09:57:21 +07:00
|
|
|
if (test_bit(R5_NeedReplace, &sh->dev[i].flags)) {
|
|
|
|
WARN_ON(!test_bit(R5_UPTODATE, &sh->dev[i].flags));
|
2011-12-23 06:17:53 +07:00
|
|
|
set_bit(R5_WantReplace, &sh->dev[i].flags);
|
|
|
|
set_bit(R5_LOCKED, &sh->dev[i].flags);
|
|
|
|
s.locked++;
|
|
|
|
}
|
2013-07-22 09:57:21 +07:00
|
|
|
if (s.replacing)
|
|
|
|
set_bit(STRIPE_INSYNC, &sh->state);
|
|
|
|
set_bit(STRIPE_REPLACED, &sh->state);
|
2011-12-23 06:17:53 +07:00
|
|
|
}
|
|
|
|
if ((s.syncing || s.replacing) && s.locked == 0 &&
|
2013-07-22 09:57:21 +07:00
|
|
|
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
|
2011-12-23 06:17:53 +07:00
|
|
|
test_bit(STRIPE_INSYNC, &sh->state)) {
|
2011-07-27 08:00:36 +07:00
|
|
|
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
|
|
|
|
clear_bit(STRIPE_SYNCING, &sh->state);
|
2013-03-12 08:18:06 +07:00
|
|
|
if (test_and_clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags))
|
|
|
|
wake_up(&conf->wait_for_overlap);
|
2011-07-27 08:00:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* If the failed drives are just a ReadError, then we might need
|
|
|
|
* to progress the repair/check process
|
|
|
|
*/
|
|
|
|
if (s.failed <= conf->max_degraded && !conf->mddev->ro)
|
|
|
|
for (i = 0; i < s.failed; i++) {
|
|
|
|
struct r5dev *dev = &sh->dev[s.failed_num[i]];
|
|
|
|
if (test_bit(R5_ReadError, &dev->flags)
|
|
|
|
&& !test_bit(R5_LOCKED, &dev->flags)
|
|
|
|
&& test_bit(R5_UPTODATE, &dev->flags)
|
|
|
|
) {
|
|
|
|
if (!test_bit(R5_ReWrite, &dev->flags)) {
|
|
|
|
set_bit(R5_Wantwrite, &dev->flags);
|
|
|
|
set_bit(R5_ReWrite, &dev->flags);
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
s.locked++;
|
|
|
|
} else {
|
|
|
|
/* let's read it back */
|
|
|
|
set_bit(R5_Wantread, &dev->flags);
|
|
|
|
set_bit(R5_LOCKED, &dev->flags);
|
|
|
|
s.locked++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
/* Finish reconstruct operations initiated by the expansion process */
|
|
|
|
if (sh->reconstruct_state == reconstruct_state_result) {
|
|
|
|
struct stripe_head *sh_src
|
2015-08-14 04:31:57 +07:00
|
|
|
= raid5_get_active_stripe(conf, sh->sector, 1, 1, 1);
|
2011-07-27 08:00:36 +07:00
|
|
|
if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) {
|
|
|
|
/* sh cannot be written until sh_src has been read.
|
|
|
|
* so arrange for sh to be delayed a little
|
|
|
|
*/
|
|
|
|
set_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
|
|
|
|
&sh_src->state))
|
|
|
|
atomic_inc(&conf->preread_active_stripes);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh_src);
|
2011-07-27 08:00:36 +07:00
|
|
|
goto finish;
|
|
|
|
}
|
|
|
|
if (sh_src)
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh_src);
|
2011-07-27 08:00:36 +07:00
|
|
|
|
|
|
|
sh->reconstruct_state = reconstruct_state_idle;
|
|
|
|
clear_bit(STRIPE_EXPANDING, &sh->state);
|
|
|
|
for (i = conf->raid_disks; i--; ) {
|
|
|
|
set_bit(R5_Wantwrite, &sh->dev[i].flags);
|
|
|
|
set_bit(R5_LOCKED, &sh->dev[i].flags);
|
|
|
|
s.locked++;
|
|
|
|
}
|
|
|
|
}
|
2007-03-01 11:11:53 +07:00
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
|
|
|
|
!sh->reconstruct_state) {
|
|
|
|
/* Need to write out all blocks after computing parity */
|
|
|
|
sh->disks = conf->raid_disks;
|
|
|
|
stripe_set_idx(sh->sector, conf, 0, sh);
|
|
|
|
schedule_reconstruction(sh, &s, 1, 1);
|
|
|
|
} else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
|
|
|
|
clear_bit(STRIPE_EXPAND_READY, &sh->state);
|
|
|
|
atomic_dec(&conf->reshape_stripes);
|
|
|
|
wake_up(&conf->wait_for_overlap);
|
|
|
|
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.expanding && s.locked == 0 &&
|
|
|
|
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
|
|
|
|
handle_stripe_expansion(conf, sh);
|
2006-06-26 14:27:38 +07:00
|
|
|
|
2011-07-27 08:00:36 +07:00
|
|
|
finish:
|
2008-04-30 14:52:32 +07:00
|
|
|
/* wait for this device to become unblocked */
|
2012-07-03 09:13:29 +07:00
|
|
|
if (unlikely(s.blocked_rdev)) {
|
|
|
|
if (conf->mddev->external)
|
|
|
|
md_wait_for_blocked_rdev(s.blocked_rdev,
|
|
|
|
conf->mddev);
|
|
|
|
else
|
|
|
|
/* Internal metadata will immediately
|
|
|
|
* be written by raid5d, so we don't
|
|
|
|
* need to wait here.
|
|
|
|
*/
|
|
|
|
rdev_dec_pending(s.blocked_rdev,
|
|
|
|
conf->mddev);
|
|
|
|
}
|
2008-04-30 14:52:32 +07:00
|
|
|
|
2011-07-28 08:39:22 +07:00
|
|
|
if (s.handle_bad_blocks)
|
|
|
|
for (i = disks; i--; ) {
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev;
|
2011-07-28 08:39:22 +07:00
|
|
|
struct r5dev *dev = &sh->dev[i];
|
|
|
|
if (test_and_clear_bit(R5_WriteError, &dev->flags)) {
|
|
|
|
/* We own a safe reference to the rdev */
|
|
|
|
rdev = conf->disks[i].rdev;
|
|
|
|
if (!rdev_set_badblocks(rdev, sh->sector,
|
|
|
|
STRIPE_SECTORS, 0))
|
|
|
|
md_error(conf->mddev, rdev);
|
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
|
|
|
}
|
2011-07-28 08:39:23 +07:00
|
|
|
if (test_and_clear_bit(R5_MadeGood, &dev->flags)) {
|
|
|
|
rdev = conf->disks[i].rdev;
|
|
|
|
rdev_clear_badblocks(rdev, sh->sector,
|
2012-05-21 06:27:00 +07:00
|
|
|
STRIPE_SECTORS, 0);
|
2011-07-28 08:39:23 +07:00
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
|
|
|
}
|
2011-12-23 06:17:53 +07:00
|
|
|
if (test_and_clear_bit(R5_MadeGoodRepl, &dev->flags)) {
|
|
|
|
rdev = conf->disks[i].replacement;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (!rdev)
|
|
|
|
/* rdev have been moved down */
|
|
|
|
rdev = conf->disks[i].rdev;
|
2011-12-23 06:17:53 +07:00
|
|
|
rdev_clear_badblocks(rdev, sh->sector,
|
2012-05-21 06:27:00 +07:00
|
|
|
STRIPE_SECTORS, 0);
|
2011-12-23 06:17:53 +07:00
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
|
|
|
}
|
2011-07-28 08:39:22 +07:00
|
|
|
}
|
|
|
|
|
2009-08-30 09:13:13 +07:00
|
|
|
if (s.ops_request)
|
|
|
|
raid_run_ops(sh, s.ops_request);
|
|
|
|
|
2008-06-28 05:31:55 +07:00
|
|
|
ops_run_io(sh, &s);
|
2006-06-26 14:27:38 +07:00
|
|
|
|
2011-07-26 08:35:20 +07:00
|
|
|
if (s.dec_preread_active) {
|
2009-12-14 08:49:50 +07:00
|
|
|
/* We delay this until after ops_run_io so that if make_request
|
2010-09-03 16:56:18 +07:00
|
|
|
* is waiting on a flush, it won't continue until the writes
|
2009-12-14 08:49:50 +07:00
|
|
|
* have actually been submitted.
|
|
|
|
*/
|
|
|
|
atomic_dec(&conf->preread_active_stripes);
|
|
|
|
if (atomic_read(&conf->preread_active_stripes) <
|
|
|
|
IO_THRESHOLD)
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
|
|
|
}
|
|
|
|
|
2011-11-08 12:22:06 +07:00
|
|
|
clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void raid5_activate_delayed(struct r5conf *conf)
|
2006-06-26 14:27:38 +07:00
|
|
|
{
|
|
|
|
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
|
|
|
|
while (!list_empty(&conf->delayed_list)) {
|
|
|
|
struct list_head *l = conf->delayed_list.next;
|
|
|
|
struct stripe_head *sh;
|
|
|
|
sh = list_entry(l, struct stripe_head, lru);
|
|
|
|
list_del_init(l);
|
|
|
|
clear_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
|
|
|
atomic_inc(&conf->preread_active_stripes);
|
2008-04-28 16:15:53 +07:00
|
|
|
list_add_tail(&sh->lru, &conf->hold_list);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
raid5_wakeup_stripe_thread(sh);
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
2011-04-18 15:25:42 +07:00
|
|
|
}
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
static void activate_bit_delay(struct r5conf *conf,
|
|
|
|
struct list_head *temp_inactive_list)
|
2006-06-26 14:27:38 +07:00
|
|
|
{
|
|
|
|
/* device_lock is held */
|
|
|
|
struct list_head head;
|
|
|
|
list_add(&head, &conf->bitmap_list);
|
|
|
|
list_del_init(&conf->bitmap_list);
|
|
|
|
while (!list_empty(&head)) {
|
|
|
|
struct stripe_head *sh = list_entry(head.next, struct stripe_head, lru);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int hash;
|
2006-06-26 14:27:38 +07:00
|
|
|
list_del_init(&sh->lru);
|
|
|
|
atomic_inc(&sh->count);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
hash = sh->hash_lock_index;
|
|
|
|
__release_stripe(conf, sh, &temp_inactive_list[hash]);
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:56:56 +07:00
|
|
|
static int raid5_congested(struct mddev *mddev, int bits)
|
2006-10-03 15:15:56 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2006-10-03 15:15:56 +07:00
|
|
|
|
|
|
|
/* No difference between reads and writes. Just check
|
|
|
|
* how busy the stripe_cache is
|
|
|
|
*/
|
2009-09-23 15:10:29 +07:00
|
|
|
|
2015-02-26 08:21:04 +07:00
|
|
|
if (test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state))
|
2006-10-03 15:15:56 +07:00
|
|
|
return 1;
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
|
|
|
|
/* Also checks whether there is pressure on r5cache log space */
|
|
|
|
if (test_bit(R5C_LOG_TIGHT, &conf->cache_state))
|
|
|
|
return 1;
|
2006-10-03 15:15:56 +07:00
|
|
|
if (conf->quiesce)
|
|
|
|
return 1;
|
2013-11-14 11:16:17 +07:00
|
|
|
if (atomic_read(&conf->empty_inactive_list_nr))
|
2006-10-03 15:15:56 +07:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int in_chunk_boundary(struct mddev *mddev, struct bio *bio)
|
2006-12-10 17:20:46 +07:00
|
|
|
{
|
2015-07-15 14:24:17 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2017-08-24 00:10:28 +07:00
|
|
|
sector_t sector = bio->bi_iter.bi_sector;
|
2015-07-15 14:24:17 +07:00
|
|
|
unsigned int chunk_sectors;
|
2013-02-06 06:19:29 +07:00
|
|
|
unsigned int bio_sectors = bio_sectors(bio);
|
2006-12-10 17:20:46 +07:00
|
|
|
|
2017-08-24 00:10:28 +07:00
|
|
|
WARN_ON_ONCE(bio->bi_partno);
|
|
|
|
|
2015-07-15 14:24:17 +07:00
|
|
|
chunk_sectors = min(conf->chunk_sectors, conf->prev_chunk_sectors);
|
2006-12-10 17:20:46 +07:00
|
|
|
return chunk_sectors >=
|
|
|
|
((sector & (chunk_sectors - 1)) + bio_sectors);
|
|
|
|
}
|
|
|
|
|
2006-12-10 17:20:47 +07:00
|
|
|
/*
|
|
|
|
* add bio to the retry LIFO ( in O(1) ... we are in interrupt )
|
|
|
|
* later sampled by raid5d.
|
|
|
|
*/
|
2011-10-11 12:49:52 +07:00
|
|
|
static void add_bio_to_retry(struct bio *bi,struct r5conf *conf)
|
2006-12-10 17:20:47 +07:00
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&conf->device_lock, flags);
|
|
|
|
|
|
|
|
bi->bi_next = conf->retry_read_aligned_list;
|
|
|
|
conf->retry_read_aligned_list = bi;
|
|
|
|
|
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
|
|
|
md_wakeup_thread(conf->mddev->thread);
|
|
|
|
}
|
|
|
|
|
2017-03-15 10:05:13 +07:00
|
|
|
static struct bio *remove_bio_from_retry(struct r5conf *conf,
|
|
|
|
unsigned int *offset)
|
2006-12-10 17:20:47 +07:00
|
|
|
{
|
|
|
|
struct bio *bi;
|
|
|
|
|
|
|
|
bi = conf->retry_read_aligned;
|
|
|
|
if (bi) {
|
2017-03-15 10:05:13 +07:00
|
|
|
*offset = conf->retry_read_offset;
|
2006-12-10 17:20:47 +07:00
|
|
|
conf->retry_read_aligned = NULL;
|
|
|
|
return bi;
|
|
|
|
}
|
|
|
|
bi = conf->retry_read_aligned_list;
|
|
|
|
if(bi) {
|
2007-02-09 05:20:29 +07:00
|
|
|
conf->retry_read_aligned_list = bi->bi_next;
|
2006-12-10 17:20:47 +07:00
|
|
|
bi->bi_next = NULL;
|
2017-03-15 10:05:13 +07:00
|
|
|
*offset = 0;
|
2006-12-10 17:20:47 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
return bi;
|
|
|
|
}
|
|
|
|
|
2006-12-10 17:20:46 +07:00
|
|
|
/*
|
|
|
|
* The "raid5_align_endio" should check if the read succeeded and if it
|
|
|
|
* did, call bio_endio on the original bio (having bio_put the new bio
|
|
|
|
* first).
|
|
|
|
* If the read failed..
|
|
|
|
*/
|
2015-07-20 20:29:37 +07:00
|
|
|
static void raid5_align_endio(struct bio *bi)
|
2006-12-10 17:20:46 +07:00
|
|
|
{
|
|
|
|
struct bio* raid_bi = bi->bi_private;
|
2011-10-11 12:47:53 +07:00
|
|
|
struct mddev *mddev;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf;
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev;
|
2017-06-03 14:38:06 +07:00
|
|
|
blk_status_t error = bi->bi_status;
|
2006-12-10 17:20:47 +07:00
|
|
|
|
2006-12-10 17:20:46 +07:00
|
|
|
bio_put(bi);
|
2006-12-10 17:20:47 +07:00
|
|
|
|
|
|
|
rdev = (void*)raid_bi->bi_next;
|
|
|
|
raid_bi->bi_next = NULL;
|
2010-03-25 12:06:03 +07:00
|
|
|
mddev = rdev->mddev;
|
|
|
|
conf = mddev->private;
|
2006-12-10 17:20:47 +07:00
|
|
|
|
|
|
|
rdev_dec_pending(rdev, conf->mddev);
|
|
|
|
|
2015-08-11 06:05:18 +07:00
|
|
|
if (!error) {
|
2015-07-20 20:29:37 +07:00
|
|
|
bio_endio(raid_bi);
|
2006-12-10 17:20:47 +07:00
|
|
|
if (atomic_dec_and_test(&conf->active_aligned_reads))
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 15:19:06 +07:00
|
|
|
wake_up(&conf->wait_for_quiescent);
|
2007-09-27 17:47:43 +07:00
|
|
|
return;
|
2006-12-10 17:20:47 +07:00
|
|
|
}
|
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("raid5_align_endio : io error...handing IO for a retry\n");
|
2006-12-10 17:20:47 +07:00
|
|
|
|
|
|
|
add_bio_to_retry(raid_bi, conf);
|
2006-12-10 17:20:46 +07:00
|
|
|
}
|
|
|
|
|
2015-05-07 12:51:24 +07:00
|
|
|
static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
|
2006-12-10 17:20:46 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2009-12-14 08:49:47 +07:00
|
|
|
int dd_idx;
|
2006-12-10 17:20:46 +07:00
|
|
|
struct bio* align_bi;
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev;
|
2011-12-23 06:17:52 +07:00
|
|
|
sector_t end_sector;
|
2006-12-10 17:20:46 +07:00
|
|
|
|
|
|
|
if (!in_chunk_boundary(mddev, raid_bio)) {
|
2015-05-07 12:51:24 +07:00
|
|
|
pr_debug("%s: non aligned\n", __func__);
|
2006-12-10 17:20:46 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
/*
|
2017-02-14 22:29:03 +07:00
|
|
|
* use bio_clone_fast to make a copy of the bio
|
2006-12-10 17:20:46 +07:00
|
|
|
*/
|
2018-05-21 05:25:52 +07:00
|
|
|
align_bi = bio_clone_fast(raid_bio, GFP_NOIO, &mddev->bio_set);
|
2006-12-10 17:20:46 +07:00
|
|
|
if (!align_bi)
|
|
|
|
return 0;
|
|
|
|
/*
|
|
|
|
* set bi_end_io to a new function, and set bi_private to the
|
|
|
|
* original bio.
|
|
|
|
*/
|
|
|
|
align_bi->bi_end_io = raid5_align_endio;
|
|
|
|
align_bi->bi_private = raid_bio;
|
|
|
|
/*
|
|
|
|
* compute position
|
|
|
|
*/
|
2013-10-12 05:44:27 +07:00
|
|
|
align_bi->bi_iter.bi_sector =
|
|
|
|
raid5_compute_sector(conf, raid_bio->bi_iter.bi_sector,
|
|
|
|
0, &dd_idx, NULL);
|
2006-12-10 17:20:46 +07:00
|
|
|
|
2012-09-26 05:05:12 +07:00
|
|
|
end_sector = bio_end_sector(align_bi);
|
2006-12-10 17:20:46 +07:00
|
|
|
rcu_read_lock();
|
2011-12-23 06:17:52 +07:00
|
|
|
rdev = rcu_dereference(conf->disks[dd_idx].replacement);
|
|
|
|
if (!rdev || test_bit(Faulty, &rdev->flags) ||
|
|
|
|
rdev->recovery_offset < end_sector) {
|
|
|
|
rdev = rcu_dereference(conf->disks[dd_idx].rdev);
|
|
|
|
if (rdev &&
|
|
|
|
(test_bit(Faulty, &rdev->flags) ||
|
|
|
|
!(test_bit(In_sync, &rdev->flags) ||
|
|
|
|
rdev->recovery_offset >= end_sector)))
|
|
|
|
rdev = NULL;
|
|
|
|
}
|
md/r5cache: enable chunk_aligned_read with write back cache
Chunk aligned read significantly reduces CPU usage of raid456.
However, it is not safe to fully bypass the write back cache.
This patch enables chunk aligned read with write back cache.
For chunk aligned read, we track stripes in write back cache at
a bigger granularity, "big_stripe". Each chunk may contain more
than one stripe (for example, a 256kB chunk contains 64 4kB-page,
so this chunk contain 64 stripes). For chunk_aligned_read, these
stripes are grouped into one big_stripe, so we only need one lookup
for the whole chunk.
For each big_stripe, struct big_stripe_info tracks how many stripes
of this big_stripe are in the write back cache. We count how many
stripes of this big_stripe are in the write back cache. These
counters are tracked in a radix tree (big_stripe_tree).
r5c_tree_index() is used to calculate keys for the radix tree.
chunk_aligned_read() calls r5c_big_stripe_cached() to look up
big_stripe of each chunk in the tree. If this big_stripe is in the
tree, chunk_aligned_read() aborts. This look up is protected by
rcu_read_lock().
It is necessary to remember whether a stripe is counted in
big_stripe_tree. Instead of adding new flag, we reuses existing flags:
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
two flags are set, the stripe is counted in big_stripe_tree. This
requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
r5c_try_caching_write(); and moving clear_bit of
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
r5c_finish_stripe_write_out().
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-12 04:39:14 +07:00
|
|
|
|
|
|
|
if (r5c_big_stripe_cached(conf, align_bi->bi_iter.bi_sector)) {
|
|
|
|
rcu_read_unlock();
|
|
|
|
bio_put(align_bi);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-12-23 06:17:52 +07:00
|
|
|
if (rdev) {
|
2011-07-28 08:39:22 +07:00
|
|
|
sector_t first_bad;
|
|
|
|
int bad_sectors;
|
|
|
|
|
2006-12-10 17:20:46 +07:00
|
|
|
atomic_inc(&rdev->nr_pending);
|
|
|
|
rcu_read_unlock();
|
2006-12-10 17:20:47 +07:00
|
|
|
raid_bio->bi_next = (void*)rdev;
|
2017-08-24 00:10:32 +07:00
|
|
|
bio_set_dev(align_bi, rdev->bdev);
|
2015-07-25 01:37:59 +07:00
|
|
|
bio_clear_flag(align_bi, BIO_SEG_VALID);
|
2006-12-10 17:20:47 +07:00
|
|
|
|
2013-09-26 03:37:01 +07:00
|
|
|
if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
|
2013-10-12 05:44:27 +07:00
|
|
|
bio_sectors(align_bi),
|
2011-07-28 08:39:22 +07:00
|
|
|
&first_bad, &bad_sectors)) {
|
2007-02-09 05:20:29 +07:00
|
|
|
bio_put(align_bi);
|
|
|
|
rdev_dec_pending(rdev, mddev);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-06-12 07:31:10 +07:00
|
|
|
/* No reshape active, so we can trust rdev->data_offset */
|
2013-10-12 05:44:27 +07:00
|
|
|
align_bi->bi_iter.bi_sector += rdev->data_offset;
|
2012-06-12 07:31:10 +07:00
|
|
|
|
2006-12-10 17:20:47 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 15:19:06 +07:00
|
|
|
wait_event_lock_irq(conf->wait_for_quiescent,
|
2006-12-10 17:20:47 +07:00
|
|
|
conf->quiesce == 0,
|
2012-11-30 17:42:40 +07:00
|
|
|
conf->device_lock);
|
2006-12-10 17:20:47 +07:00
|
|
|
atomic_inc(&conf->active_aligned_reads);
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
|
2013-03-08 05:22:01 +07:00
|
|
|
if (mddev->gendisk)
|
2017-08-24 00:10:32 +07:00
|
|
|
trace_block_bio_remap(align_bi->bi_disk->queue,
|
2013-03-08 05:22:01 +07:00
|
|
|
align_bi, disk_devt(mddev->gendisk),
|
2013-10-12 05:44:27 +07:00
|
|
|
raid_bio->bi_iter.bi_sector);
|
2006-12-10 17:20:46 +07:00
|
|
|
generic_make_request(align_bi);
|
|
|
|
return 1;
|
|
|
|
} else {
|
|
|
|
rcu_read_unlock();
|
2006-12-10 17:20:47 +07:00
|
|
|
bio_put(align_bi);
|
2006-12-10 17:20:46 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-05-07 12:51:24 +07:00
|
|
|
static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
|
|
|
|
{
|
|
|
|
struct bio *split;
|
2017-04-05 11:05:51 +07:00
|
|
|
sector_t sector = raid_bio->bi_iter.bi_sector;
|
|
|
|
unsigned chunk_sects = mddev->chunk_sectors;
|
|
|
|
unsigned sectors = chunk_sects - (sector & (chunk_sects-1));
|
2015-05-07 12:51:24 +07:00
|
|
|
|
2017-04-05 11:05:51 +07:00
|
|
|
if (sectors < bio_sectors(raid_bio)) {
|
|
|
|
struct r5conf *conf = mddev->private;
|
2018-05-21 05:25:52 +07:00
|
|
|
split = bio_split(raid_bio, sectors, GFP_NOIO, &conf->bio_split);
|
2017-04-05 11:05:51 +07:00
|
|
|
bio_chain(split, raid_bio);
|
|
|
|
generic_make_request(raid_bio);
|
|
|
|
raid_bio = split;
|
|
|
|
}
|
2015-05-07 12:51:24 +07:00
|
|
|
|
2017-04-05 11:05:51 +07:00
|
|
|
if (!raid5_read_one_chunk(mddev, raid_bio))
|
|
|
|
return raid_bio;
|
2015-05-07 12:51:24 +07:00
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2008-04-28 16:15:53 +07:00
|
|
|
/* __get_priority_stripe - get the next stripe to process
|
|
|
|
*
|
|
|
|
* Full stripe writes are allowed to pass preread active stripes up until
|
|
|
|
* the bypass_threshold is exceeded. In general the bypass_count
|
|
|
|
* increments when the handle_list is handled before the hold_list; however, it
|
|
|
|
* will not be incremented when STRIPE_IO_STARTED is sampled set signifying a
|
|
|
|
* stripe with in flight i/o. The bypass_count will be reset when the
|
|
|
|
* head of the hold_list has changed, i.e. the head was promoted to the
|
|
|
|
* handle_list.
|
|
|
|
*/
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
|
2008-04-28 16:15:53 +07:00
|
|
|
{
|
2017-02-16 10:37:32 +07:00
|
|
|
struct stripe_head *sh, *tmp;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
struct list_head *handle_list = NULL;
|
2017-02-16 10:37:32 +07:00
|
|
|
struct r5worker_group *wg;
|
md/r5cache: gracefully handle journal device errors for writeback mode
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-12 05:28:28 +07:00
|
|
|
bool second_try = !r5c_is_writeback(conf->log) &&
|
|
|
|
!r5l_log_disk_error(conf);
|
|
|
|
bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state) ||
|
|
|
|
r5l_log_disk_error(conf);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
|
2017-02-16 10:37:32 +07:00
|
|
|
again:
|
|
|
|
wg = NULL;
|
|
|
|
sh = NULL;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
if (conf->worker_cnt_per_group == 0) {
|
2017-02-16 10:37:32 +07:00
|
|
|
handle_list = try_loprio ? &conf->loprio_list :
|
|
|
|
&conf->handle_list;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
} else if (group != ANY_GROUP) {
|
2017-02-16 10:37:32 +07:00
|
|
|
handle_list = try_loprio ? &conf->worker_groups[group].loprio_list :
|
|
|
|
&conf->worker_groups[group].handle_list;
|
2013-08-29 14:40:32 +07:00
|
|
|
wg = &conf->worker_groups[group];
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
} else {
|
|
|
|
int i;
|
|
|
|
for (i = 0; i < conf->group_cnt; i++) {
|
2017-02-16 10:37:32 +07:00
|
|
|
handle_list = try_loprio ? &conf->worker_groups[i].loprio_list :
|
|
|
|
&conf->worker_groups[i].handle_list;
|
2013-08-29 14:40:32 +07:00
|
|
|
wg = &conf->worker_groups[i];
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
if (!list_empty(handle_list))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2008-04-28 16:15:53 +07:00
|
|
|
|
|
|
|
pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
|
|
|
|
__func__,
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
list_empty(handle_list) ? "empty" : "busy",
|
2008-04-28 16:15:53 +07:00
|
|
|
list_empty(&conf->hold_list) ? "empty" : "busy",
|
|
|
|
atomic_read(&conf->pending_full_writes), conf->bypass_count);
|
|
|
|
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
if (!list_empty(handle_list)) {
|
|
|
|
sh = list_entry(handle_list->next, typeof(*sh), lru);
|
2008-04-28 16:15:53 +07:00
|
|
|
|
|
|
|
if (list_empty(&conf->hold_list))
|
|
|
|
conf->bypass_count = 0;
|
|
|
|
else if (!test_bit(STRIPE_IO_STARTED, &sh->state)) {
|
|
|
|
if (conf->hold_list.next == conf->last_hold)
|
|
|
|
conf->bypass_count++;
|
|
|
|
else {
|
|
|
|
conf->last_hold = conf->hold_list.next;
|
|
|
|
conf->bypass_count -= conf->bypass_threshold;
|
|
|
|
if (conf->bypass_count < 0)
|
|
|
|
conf->bypass_count = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else if (!list_empty(&conf->hold_list) &&
|
|
|
|
((conf->bypass_threshold &&
|
|
|
|
conf->bypass_count > conf->bypass_threshold) ||
|
|
|
|
atomic_read(&conf->pending_full_writes) == 0)) {
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
|
|
|
|
list_for_each_entry(tmp, &conf->hold_list, lru) {
|
|
|
|
if (conf->worker_cnt_per_group == 0 ||
|
|
|
|
group == ANY_GROUP ||
|
|
|
|
!cpu_online(tmp->cpu) ||
|
|
|
|
cpu_to_group(tmp->cpu) == group) {
|
|
|
|
sh = tmp;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sh) {
|
|
|
|
conf->bypass_count -= conf->bypass_threshold;
|
|
|
|
if (conf->bypass_count < 0)
|
|
|
|
conf->bypass_count = 0;
|
|
|
|
}
|
2013-08-29 14:40:32 +07:00
|
|
|
wg = NULL;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
}
|
|
|
|
|
2017-02-16 10:37:32 +07:00
|
|
|
if (!sh) {
|
|
|
|
if (second_try)
|
|
|
|
return NULL;
|
|
|
|
second_try = true;
|
|
|
|
try_loprio = !try_loprio;
|
|
|
|
goto again;
|
|
|
|
}
|
2008-04-28 16:15:53 +07:00
|
|
|
|
2013-08-29 14:40:32 +07:00
|
|
|
if (wg) {
|
|
|
|
wg->stripes_cnt--;
|
|
|
|
sh->group = NULL;
|
|
|
|
}
|
2008-04-28 16:15:53 +07:00
|
|
|
list_del_init(&sh->lru);
|
2014-04-15 08:12:54 +07:00
|
|
|
BUG_ON(atomic_inc_return(&sh->count) != 1);
|
2008-04-28 16:15:53 +07:00
|
|
|
return sh;
|
|
|
|
}
|
2006-12-10 17:20:46 +07:00
|
|
|
|
2012-08-02 05:33:00 +07:00
|
|
|
struct raid5_plug_cb {
|
|
|
|
struct blk_plug_cb cb;
|
|
|
|
struct list_head list;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
struct list_head temp_inactive_list[NR_STRIPE_HASH_LOCKS];
|
2012-08-02 05:33:00 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
static void raid5_unplug(struct blk_plug_cb *blk_cb, bool from_schedule)
|
|
|
|
{
|
|
|
|
struct raid5_plug_cb *cb = container_of(
|
|
|
|
blk_cb, struct raid5_plug_cb, cb);
|
|
|
|
struct stripe_head *sh;
|
|
|
|
struct mddev *mddev = cb->cb.data;
|
|
|
|
struct r5conf *conf = mddev->private;
|
2012-10-31 07:59:09 +07:00
|
|
|
int cnt = 0;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int hash;
|
2012-08-02 05:33:00 +07:00
|
|
|
|
|
|
|
if (cb->list.next && !list_empty(&cb->list)) {
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
while (!list_empty(&cb->list)) {
|
|
|
|
sh = list_first_entry(&cb->list, struct stripe_head, lru);
|
|
|
|
list_del_init(&sh->lru);
|
|
|
|
/*
|
|
|
|
* avoid race release_stripe_plug() sees
|
|
|
|
* STRIPE_ON_UNPLUG_LIST clear but the stripe
|
|
|
|
* is still in our list
|
|
|
|
*/
|
2014-03-18 00:06:10 +07:00
|
|
|
smp_mb__before_atomic();
|
2012-08-02 05:33:00 +07:00
|
|
|
clear_bit(STRIPE_ON_UNPLUG_LIST, &sh->state);
|
2013-08-27 16:50:39 +07:00
|
|
|
/*
|
|
|
|
* STRIPE_ON_RELEASE_LIST could be set here. In that
|
|
|
|
* case, the count is always > 1 here
|
|
|
|
*/
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
hash = sh->hash_lock_index;
|
|
|
|
__release_stripe(conf, sh, &cb->temp_inactive_list[hash]);
|
2012-10-31 07:59:09 +07:00
|
|
|
cnt++;
|
2012-08-02 05:33:00 +07:00
|
|
|
}
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
}
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
release_inactive_stripe_list(conf, cb->temp_inactive_list,
|
|
|
|
NR_STRIPE_HASH_LOCKS);
|
2013-03-08 05:22:01 +07:00
|
|
|
if (mddev->queue)
|
|
|
|
trace_block_unplug(mddev->queue, cnt, !from_schedule);
|
2012-08-02 05:33:00 +07:00
|
|
|
kfree(cb);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void release_stripe_plug(struct mddev *mddev,
|
|
|
|
struct stripe_head *sh)
|
|
|
|
{
|
|
|
|
struct blk_plug_cb *blk_cb = blk_check_plugged(
|
|
|
|
raid5_unplug, mddev,
|
|
|
|
sizeof(struct raid5_plug_cb));
|
|
|
|
struct raid5_plug_cb *cb;
|
|
|
|
|
|
|
|
if (!blk_cb) {
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2012-08-02 05:33:00 +07:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
cb = container_of(blk_cb, struct raid5_plug_cb, cb);
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
if (cb->list.next == NULL) {
|
|
|
|
int i;
|
2012-08-02 05:33:00 +07:00
|
|
|
INIT_LIST_HEAD(&cb->list);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
|
|
|
|
INIT_LIST_HEAD(cb->temp_inactive_list + i);
|
|
|
|
}
|
2012-08-02 05:33:00 +07:00
|
|
|
|
|
|
|
if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
|
|
|
|
list_add_tail(&sh->lru, &cb->list);
|
|
|
|
else
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2012-08-02 05:33:00 +07:00
|
|
|
}
|
|
|
|
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
static void make_discard_request(struct mddev *mddev, struct bio *bi)
|
|
|
|
{
|
|
|
|
struct r5conf *conf = mddev->private;
|
|
|
|
sector_t logical_sector, last_sector;
|
|
|
|
struct stripe_head *sh;
|
|
|
|
int stripe_sectors;
|
|
|
|
|
|
|
|
if (mddev->reshape_position != MaxSector)
|
|
|
|
/* Skip discard while reshape is happening */
|
|
|
|
return;
|
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
|
|
|
|
last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9);
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
|
|
|
|
bi->bi_next = NULL;
|
|
|
|
|
|
|
|
stripe_sectors = conf->chunk_sectors *
|
|
|
|
(conf->raid_disks - conf->max_degraded);
|
|
|
|
logical_sector = DIV_ROUND_UP_SECTOR_T(logical_sector,
|
|
|
|
stripe_sectors);
|
|
|
|
sector_div(last_sector, stripe_sectors);
|
|
|
|
|
|
|
|
logical_sector *= conf->chunk_sectors;
|
|
|
|
last_sector *= conf->chunk_sectors;
|
|
|
|
|
|
|
|
for (; logical_sector < last_sector;
|
|
|
|
logical_sector += STRIPE_SECTORS) {
|
|
|
|
DEFINE_WAIT(w);
|
|
|
|
int d;
|
|
|
|
again:
|
2015-08-14 04:31:57 +07:00
|
|
|
sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
prepare_to_wait(&conf->wait_for_overlap, &w,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
2013-03-12 08:18:06 +07:00
|
|
|
set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
|
|
|
|
if (test_bit(STRIPE_SYNCING, &sh->state)) {
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2013-03-12 08:18:06 +07:00
|
|
|
schedule();
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
spin_lock_irq(&sh->stripe_lock);
|
|
|
|
for (d = 0; d < conf->raid_disks; d++) {
|
|
|
|
if (d == sh->pd_idx || d == sh->qd_idx)
|
|
|
|
continue;
|
|
|
|
if (sh->dev[d].towrite || sh->dev[d].toread) {
|
|
|
|
set_bit(R5_Overlap, &sh->dev[d].flags);
|
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
schedule();
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
}
|
2013-03-12 08:18:06 +07:00
|
|
|
set_bit(STRIPE_DISCARD, &sh->state);
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
finish_wait(&conf->wait_for_overlap, &w);
|
2014-12-15 08:57:03 +07:00
|
|
|
sh->overwrite_disks = 0;
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
for (d = 0; d < conf->raid_disks; d++) {
|
|
|
|
if (d == sh->pd_idx || d == sh->qd_idx)
|
|
|
|
continue;
|
|
|
|
sh->dev[d].towrite = bi;
|
|
|
|
set_bit(R5_OVERWRITE, &sh->dev[d].flags);
|
2017-03-15 10:05:13 +07:00
|
|
|
bio_inc_remaining(bi);
|
md/raid5: use md_write_start to count stripes, not bios
We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it count stripe_heads that a WRITE bio
has been attached to.
So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().
The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requsts with
raid5_dec_bi_active_stripes(), we now call md_write_end().
This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 10:05:12 +07:00
|
|
|
md_write_inc(mddev, bi);
|
2014-12-15 08:57:03 +07:00
|
|
|
sh->overwrite_disks++;
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
}
|
|
|
|
spin_unlock_irq(&sh->stripe_lock);
|
|
|
|
if (conf->mddev->bitmap) {
|
|
|
|
for (d = 0;
|
|
|
|
d < conf->raid_disks - conf->max_degraded;
|
|
|
|
d++)
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_startwrite(mddev->bitmap,
|
|
|
|
sh->sector,
|
|
|
|
STRIPE_SECTORS,
|
|
|
|
0);
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
sh->bm_seq = conf->seq_flush + 1;
|
|
|
|
set_bit(STRIPE_BIT_DELAY, &sh->state);
|
|
|
|
}
|
|
|
|
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
clear_bit(STRIPE_DELAYED, &sh->state);
|
|
|
|
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
|
|
|
atomic_inc(&conf->preread_active_stripes);
|
|
|
|
release_stripe_plug(mddev, sh);
|
|
|
|
}
|
|
|
|
|
2017-03-15 10:05:13 +07:00
|
|
|
bio_endio(bi);
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
}
|
|
|
|
|
2017-06-05 13:49:39 +07:00
|
|
|
static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2009-03-31 10:39:38 +07:00
|
|
|
int dd_idx;
|
2005-04-17 05:20:36 +07:00
|
|
|
sector_t new_sector;
|
|
|
|
sector_t logical_sector, last_sector;
|
|
|
|
struct stripe_head *sh;
|
2005-11-01 15:26:16 +07:00
|
|
|
const int rw = bio_data_dir(bi);
|
2014-04-09 10:25:47 +07:00
|
|
|
DEFINE_WAIT(w);
|
|
|
|
bool do_prepare;
|
2016-11-19 07:46:50 +07:00
|
|
|
bool do_flush = false;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2016-08-06 04:35:16 +07:00
|
|
|
if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
|
2017-12-27 16:31:40 +07:00
|
|
|
int ret = log_handle_flush_request(conf, bi);
|
2015-09-03 03:49:49 +07:00
|
|
|
|
|
|
|
if (ret == 0)
|
2017-06-05 13:49:39 +07:00
|
|
|
return true;
|
2015-09-03 03:49:49 +07:00
|
|
|
if (ret == -ENODEV) {
|
|
|
|
md_flush_request(mddev, bi);
|
2017-06-05 13:49:39 +07:00
|
|
|
return true;
|
2015-09-03 03:49:49 +07:00
|
|
|
}
|
|
|
|
/* ret == -EAGAIN, fallback */
|
2016-11-19 07:46:50 +07:00
|
|
|
/*
|
|
|
|
* if r5l_handle_flush_request() didn't clear REQ_PREFLUSH,
|
|
|
|
* we need to flush journal device
|
|
|
|
*/
|
|
|
|
do_flush = bi->bi_opf & REQ_PREFLUSH;
|
2005-09-10 06:23:41 +07:00
|
|
|
}
|
|
|
|
|
2017-06-05 13:49:39 +07:00
|
|
|
if (!md_write_start(mddev, bi))
|
|
|
|
return false;
|
md/raid5: don't do chunk aligned read on degraded array.
When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.
This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.
Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.
I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
1) optimal, as baseline;
2) degraded;
3) degraded with this patch.
Kernel version is 4.0-rc3.
Each individual test I only did once so there might be some variations,
but we just focus on big trend.
Sequential Read:
bs=(KiB) optimal(MiB/s) degraded(MiB/s) degraded-with-patch (MiB/s)
1024 1608 656 995
512 1624 710 956
256 1635 728 980
128 1636 771 983
64 1612 1119 1000
32 1580 1420 1004
16 1368 688 986
8 768 647 953
4 411 413 850
Random Read:
bs=(KiB) optimal(IOPS) degraded(IOPS) degraded-with-patch (IOPS)
1024 163 160 156
512 274 273 272
256 426 428 424
128 576 592 591
64 726 724 726
32 849 848 837
16 900 970 971
8 927 940 929
4 948 940 955
Some notes:
* In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
* In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
* In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.
If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.
Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.
Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-19 12:39:11 +07:00
|
|
|
/*
|
|
|
|
* If array is degraded, better not do chunk aligned read because
|
|
|
|
* later we might have to read it again in order to reconstruct
|
|
|
|
* data on failed drives.
|
|
|
|
*/
|
|
|
|
if (rw == READ && mddev->degraded == 0 &&
|
2015-05-07 12:51:24 +07:00
|
|
|
mddev->reshape_position == MaxSector) {
|
|
|
|
bi = chunk_aligned_read(mddev, bi);
|
|
|
|
if (!bi)
|
2017-06-05 13:49:39 +07:00
|
|
|
return true;
|
2015-05-07 12:51:24 +07:00
|
|
|
}
|
2006-12-10 17:20:48 +07:00
|
|
|
|
2016-06-06 02:32:07 +07:00
|
|
|
if (unlikely(bio_op(bi) == REQ_OP_DISCARD)) {
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
make_discard_request(mddev, bi);
|
2017-06-05 13:49:39 +07:00
|
|
|
md_write_end(mddev);
|
|
|
|
return true;
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
}
|
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
|
2012-09-26 05:05:12 +07:00
|
|
|
last_sector = bio_end_sector(bi);
|
2005-04-17 05:20:36 +07:00
|
|
|
bi->bi_next = NULL;
|
2005-06-22 07:17:12 +07:00
|
|
|
|
2014-04-09 10:25:47 +07:00
|
|
|
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
|
2005-04-17 05:20:36 +07:00
|
|
|
for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
|
2009-03-31 10:39:38 +07:00
|
|
|
int previous;
|
2013-08-27 12:52:13 +07:00
|
|
|
int seq;
|
2006-03-27 16:18:12 +07:00
|
|
|
|
2014-04-09 10:25:47 +07:00
|
|
|
do_prepare = false;
|
2006-03-27 16:18:08 +07:00
|
|
|
retry:
|
2013-08-27 12:52:13 +07:00
|
|
|
seq = read_seqcount_begin(&conf->gen_lock);
|
2009-03-31 10:39:38 +07:00
|
|
|
previous = 0;
|
2014-04-09 10:25:47 +07:00
|
|
|
if (do_prepare)
|
|
|
|
prepare_to_wait(&conf->wait_for_overlap, &w,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
2009-03-31 11:27:18 +07:00
|
|
|
if (unlikely(conf->reshape_progress != MaxSector)) {
|
2009-03-31 11:16:46 +07:00
|
|
|
/* spinlock is needed as reshape_progress may be
|
2006-03-27 16:18:15 +07:00
|
|
|
* 64bit on a 32bit platform, and so it might be
|
|
|
|
* possible to see a half-updated value
|
2011-04-10 23:06:17 +07:00
|
|
|
* Of course reshape_progress could change after
|
2006-03-27 16:18:15 +07:00
|
|
|
* the lock is dropped, so once we get a reference
|
|
|
|
* to the stripe that we think it is, we will have
|
|
|
|
* to check again.
|
|
|
|
*/
|
2006-03-27 16:18:08 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
2012-05-21 06:27:00 +07:00
|
|
|
if (mddev->reshape_backwards
|
2009-03-31 11:16:46 +07:00
|
|
|
? logical_sector < conf->reshape_progress
|
|
|
|
: logical_sector >= conf->reshape_progress) {
|
2009-03-31 10:39:38 +07:00
|
|
|
previous = 1;
|
|
|
|
} else {
|
2012-05-21 06:27:00 +07:00
|
|
|
if (mddev->reshape_backwards
|
2009-03-31 11:16:46 +07:00
|
|
|
? logical_sector < conf->reshape_safe
|
|
|
|
: logical_sector >= conf->reshape_safe) {
|
2006-03-27 16:18:12 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
schedule();
|
2014-04-09 10:25:47 +07:00
|
|
|
do_prepare = true;
|
2006-03-27 16:18:12 +07:00
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
}
|
2006-03-27 16:18:08 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
}
|
2006-06-26 14:27:38 +07:00
|
|
|
|
2009-03-31 10:39:38 +07:00
|
|
|
new_sector = raid5_compute_sector(conf, logical_sector,
|
|
|
|
previous,
|
2009-03-31 10:39:38 +07:00
|
|
|
&dd_idx, NULL);
|
2016-01-21 04:52:20 +07:00
|
|
|
pr_debug("raid456: raid5_make_request, sector %llu logical %llu\n",
|
2013-08-27 12:52:13 +07:00
|
|
|
(unsigned long long)new_sector,
|
2005-04-17 05:20:36 +07:00
|
|
|
(unsigned long long)logical_sector);
|
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
sh = raid5_get_active_stripe(conf, new_sector, previous,
|
2016-08-06 04:35:16 +07:00
|
|
|
(bi->bi_opf & REQ_RAHEAD), 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (sh) {
|
2009-03-31 11:27:18 +07:00
|
|
|
if (unlikely(previous)) {
|
2006-03-27 16:18:08 +07:00
|
|
|
/* expansion might have moved on while waiting for a
|
2006-03-27 16:18:15 +07:00
|
|
|
* stripe, so we must do the range check again.
|
|
|
|
* Expansion could still move past after this
|
|
|
|
* test, but as we are holding a reference to
|
|
|
|
* 'sh', we know that if that happens,
|
|
|
|
* STRIPE_EXPANDING will get set and the expansion
|
|
|
|
* won't proceed until we finish with the stripe.
|
2006-03-27 16:18:08 +07:00
|
|
|
*/
|
|
|
|
int must_retry = 0;
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
2012-05-21 06:27:00 +07:00
|
|
|
if (mddev->reshape_backwards
|
2009-03-31 11:27:18 +07:00
|
|
|
? logical_sector >= conf->reshape_progress
|
|
|
|
: logical_sector < conf->reshape_progress)
|
2006-03-27 16:18:08 +07:00
|
|
|
/* mismatch, need to try again */
|
|
|
|
must_retry = 1;
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
if (must_retry) {
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2009-06-17 06:00:33 +07:00
|
|
|
schedule();
|
2014-04-09 10:25:47 +07:00
|
|
|
do_prepare = true;
|
2006-03-27 16:18:08 +07:00
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
}
|
2013-08-27 12:52:13 +07:00
|
|
|
if (read_seqcount_retry(&conf->gen_lock, seq)) {
|
|
|
|
/* Might have got the wrong stripe_head
|
|
|
|
* by accident
|
|
|
|
*/
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2013-08-27 12:52:13 +07:00
|
|
|
goto retry;
|
|
|
|
}
|
2009-07-01 10:15:35 +07:00
|
|
|
|
2006-03-27 16:18:08 +07:00
|
|
|
if (test_bit(STRIPE_EXPANDING, &sh->state) ||
|
2014-12-15 08:57:03 +07:00
|
|
|
!add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
|
2006-03-27 16:18:08 +07:00
|
|
|
/* Stripe is busy expanding or
|
|
|
|
* add failed due to overlap. Flush everything
|
2005-04-17 05:20:36 +07:00
|
|
|
* and wait a while
|
|
|
|
*/
|
2011-04-18 15:25:42 +07:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2005-04-17 05:20:36 +07:00
|
|
|
schedule();
|
2014-04-09 10:25:47 +07:00
|
|
|
do_prepare = true;
|
2005-04-17 05:20:36 +07:00
|
|
|
goto retry;
|
|
|
|
}
|
2016-11-19 07:46:50 +07:00
|
|
|
if (do_flush) {
|
|
|
|
set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
|
|
|
|
/* we only need flush for one stripe */
|
|
|
|
do_flush = false;
|
|
|
|
}
|
|
|
|
|
2008-02-06 16:40:00 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
clear_bit(STRIPE_DELAYED, &sh->state);
|
2014-12-15 08:57:03 +07:00
|
|
|
if ((!sh->batch_head || sh == sh->batch_head) &&
|
2016-08-06 04:35:16 +07:00
|
|
|
(bi->bi_opf & REQ_SYNC) &&
|
2009-12-14 08:49:50 +07:00
|
|
|
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
|
|
|
|
atomic_inc(&conf->preread_active_stripes);
|
2012-08-02 05:33:00 +07:00
|
|
|
release_stripe_plug(mddev, sh);
|
2005-04-17 05:20:36 +07:00
|
|
|
} else {
|
|
|
|
/* cannot get stripe for read-ahead, just give-up */
|
2017-06-03 14:38:06 +07:00
|
|
|
bi->bi_status = BLK_STS_IOERR;
|
2005-04-17 05:20:36 +07:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2014-04-09 10:25:47 +07:00
|
|
|
finish_wait(&conf->wait_for_overlap, &w);
|
2011-04-18 15:25:43 +07:00
|
|
|
|
md/raid5: use md_write_start to count stripes, not bios
We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it count stripe_heads that a WRITE bio
has been attached to.
So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().
The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
Whenever we set bi_phys_segments to 1, we now call md_write_start.
Whenever we increment it on non-read requests with
raid5_inc_bi_active_stripes(), we now call md_write_start().
Whenever we decrement bi_phys_segments on non-read requsts with
raid5_dec_bi_active_stripes(), we now call md_write_end().
This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-15 10:05:12 +07:00
|
|
|
if (rw == WRITE)
|
|
|
|
md_write_end(mddev);
|
2017-03-15 10:05:13 +07:00
|
|
|
bio_endio(bi);
|
2017-06-05 13:49:39 +07:00
|
|
|
return true;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static sector_t raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks);
|
2009-03-31 11:00:31 +07:00
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *skipped)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-06-26 14:27:43 +07:00
|
|
|
/* reshaping is quite different to recovery/resync so it is
|
|
|
|
* handled quite separately ... here.
|
|
|
|
*
|
|
|
|
* On each call to sync_request, we gather one chunk worth of
|
|
|
|
* destination stripes and flag them as expanding.
|
|
|
|
* Then we find all the source stripes and request reads.
|
|
|
|
* As the reads complete, handle_stripe will copy the data
|
|
|
|
* into the destination stripe and release that stripe.
|
|
|
|
*/
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2005-04-17 05:20:36 +07:00
|
|
|
struct stripe_head *sh;
|
2017-10-17 12:18:36 +07:00
|
|
|
struct md_rdev *rdev;
|
2006-03-27 16:18:09 +07:00
|
|
|
sector_t first_sector, last_sector;
|
2007-03-01 11:11:53 +07:00
|
|
|
int raid_disks = conf->previous_raid_disks;
|
|
|
|
int data_disks = raid_disks - conf->max_degraded;
|
|
|
|
int new_data_disks = conf->raid_disks - conf->max_degraded;
|
2006-06-26 14:27:43 +07:00
|
|
|
int i;
|
|
|
|
int dd_idx;
|
2009-03-31 11:28:40 +07:00
|
|
|
sector_t writepos, readpos, safepos;
|
2009-03-31 11:17:38 +07:00
|
|
|
sector_t stripe_addr;
|
2009-03-31 11:21:40 +07:00
|
|
|
int reshape_sectors;
|
2009-03-31 11:26:47 +07:00
|
|
|
struct list_head stripes;
|
2015-07-06 09:28:45 +07:00
|
|
|
sector_t retn;
|
2006-06-26 14:27:43 +07:00
|
|
|
|
2009-03-31 11:16:46 +07:00
|
|
|
if (sector_nr == 0) {
|
|
|
|
/* If restarting in the middle, skip the initial sectors */
|
2012-05-21 06:27:00 +07:00
|
|
|
if (mddev->reshape_backwards &&
|
2009-03-31 11:16:46 +07:00
|
|
|
conf->reshape_progress < raid5_size(mddev, 0, 0)) {
|
|
|
|
sector_nr = raid5_size(mddev, 0, 0)
|
|
|
|
- conf->reshape_progress;
|
2015-07-24 10:30:32 +07:00
|
|
|
} else if (mddev->reshape_backwards &&
|
|
|
|
conf->reshape_progress == MaxSector) {
|
|
|
|
/* shouldn't happen, but just in case, finish up.*/
|
|
|
|
sector_nr = MaxSector;
|
2012-05-21 06:27:00 +07:00
|
|
|
} else if (!mddev->reshape_backwards &&
|
2009-03-31 11:16:46 +07:00
|
|
|
conf->reshape_progress > 0)
|
|
|
|
sector_nr = conf->reshape_progress;
|
2007-03-01 11:11:53 +07:00
|
|
|
sector_div(sector_nr, new_data_disks);
|
2009-03-31 11:16:46 +07:00
|
|
|
if (sector_nr) {
|
2009-11-06 10:59:29 +07:00
|
|
|
mddev->curr_resync_completed = sector_nr;
|
|
|
|
sysfs_notify(&mddev->kobj, NULL, "sync_completed");
|
2009-03-31 11:16:46 +07:00
|
|
|
*skipped = 1;
|
2015-07-06 09:28:45 +07:00
|
|
|
retn = sector_nr;
|
|
|
|
goto finish;
|
2009-03-31 11:16:46 +07:00
|
|
|
}
|
2006-06-26 14:27:43 +07:00
|
|
|
}
|
|
|
|
|
2009-03-31 11:21:40 +07:00
|
|
|
/* We need to process a full chunk at a time.
|
|
|
|
* If old and new chunk sizes differ, we need to process the
|
|
|
|
* largest of these
|
|
|
|
*/
|
2015-07-15 14:24:17 +07:00
|
|
|
|
|
|
|
reshape_sectors = max(conf->chunk_sectors, conf->prev_chunk_sectors);
|
2009-03-31 11:21:40 +07:00
|
|
|
|
2012-05-21 06:27:01 +07:00
|
|
|
/* We update the metadata at least every 10 seconds, or when
|
|
|
|
* the data about to be copied would over-write the source of
|
|
|
|
* the data at the front of the range. i.e. one new_stripe
|
|
|
|
* along from reshape_progress new_maps to after where
|
|
|
|
* reshape_safe old_maps to
|
2006-06-26 14:27:43 +07:00
|
|
|
*/
|
2009-03-31 11:16:46 +07:00
|
|
|
writepos = conf->reshape_progress;
|
2007-03-01 11:11:53 +07:00
|
|
|
sector_div(writepos, new_data_disks);
|
2009-03-31 11:28:40 +07:00
|
|
|
readpos = conf->reshape_progress;
|
|
|
|
sector_div(readpos, data_disks);
|
2009-03-31 11:16:46 +07:00
|
|
|
safepos = conf->reshape_safe;
|
2007-03-01 11:11:53 +07:00
|
|
|
sector_div(safepos, data_disks);
|
2012-05-21 06:27:00 +07:00
|
|
|
if (mddev->reshape_backwards) {
|
2015-07-15 14:54:15 +07:00
|
|
|
BUG_ON(writepos < reshape_sectors);
|
|
|
|
writepos -= reshape_sectors;
|
2009-03-31 11:28:40 +07:00
|
|
|
readpos += reshape_sectors;
|
2009-03-31 11:21:40 +07:00
|
|
|
safepos += reshape_sectors;
|
2009-03-31 11:16:46 +07:00
|
|
|
} else {
|
2009-03-31 11:21:40 +07:00
|
|
|
writepos += reshape_sectors;
|
2015-07-15 14:54:15 +07:00
|
|
|
/* readpos and safepos are worst-case calculations.
|
|
|
|
* A negative number is overly pessimistic, and causes
|
|
|
|
* obvious problems for unsigned storage. So clip to 0.
|
|
|
|
*/
|
2009-05-27 18:39:05 +07:00
|
|
|
readpos -= min_t(sector_t, reshape_sectors, readpos);
|
|
|
|
safepos -= min_t(sector_t, reshape_sectors, safepos);
|
2009-03-31 11:16:46 +07:00
|
|
|
}
|
2006-06-26 14:27:43 +07:00
|
|
|
|
2012-05-21 06:27:01 +07:00
|
|
|
/* Having calculated the 'writepos' possibly use it
|
|
|
|
* to set 'stripe_addr' which is where we will write to.
|
|
|
|
*/
|
|
|
|
if (mddev->reshape_backwards) {
|
|
|
|
BUG_ON(conf->reshape_progress == 0);
|
|
|
|
stripe_addr = writepos;
|
|
|
|
BUG_ON((mddev->dev_sectors &
|
|
|
|
~((sector_t)reshape_sectors - 1))
|
|
|
|
- reshape_sectors - stripe_addr
|
|
|
|
!= sector_nr);
|
|
|
|
} else {
|
|
|
|
BUG_ON(writepos != sector_nr + reshape_sectors);
|
|
|
|
stripe_addr = sector_nr;
|
|
|
|
}
|
|
|
|
|
2009-03-31 11:28:40 +07:00
|
|
|
/* 'writepos' is the most advanced device address we might write.
|
|
|
|
* 'readpos' is the least advanced device address we might read.
|
|
|
|
* 'safepos' is the least address recorded in the metadata as having
|
|
|
|
* been reshaped.
|
2012-05-21 06:27:01 +07:00
|
|
|
* If there is a min_offset_diff, these are adjusted either by
|
|
|
|
* increasing the safepos/readpos if diff is negative, or
|
|
|
|
* increasing writepos if diff is positive.
|
|
|
|
* If 'readpos' is then behind 'writepos', there is no way that we can
|
2009-03-31 11:28:40 +07:00
|
|
|
* ensure safety in the face of a crash - that must be done by userspace
|
|
|
|
* making a backup of the data. So in that case there is no particular
|
|
|
|
* rush to update metadata.
|
|
|
|
* Otherwise if 'safepos' is behind 'writepos', then we really need to
|
|
|
|
* update the metadata to advance 'safepos' to match 'readpos' so that
|
|
|
|
* we can be safe in the event of a crash.
|
|
|
|
* So we insist on updating metadata if safepos is behind writepos and
|
|
|
|
* readpos is beyond writepos.
|
|
|
|
* In any case, update the metadata every 10 seconds.
|
|
|
|
* Maybe that number should be configurable, but I'm not sure it is
|
|
|
|
* worth it.... maybe it could be a multiple of safemode_delay???
|
|
|
|
*/
|
2012-05-21 06:27:01 +07:00
|
|
|
if (conf->min_offset_diff < 0) {
|
|
|
|
safepos += -conf->min_offset_diff;
|
|
|
|
readpos += -conf->min_offset_diff;
|
|
|
|
} else
|
|
|
|
writepos += conf->min_offset_diff;
|
|
|
|
|
2012-05-21 06:27:00 +07:00
|
|
|
if ((mddev->reshape_backwards
|
2009-03-31 11:28:40 +07:00
|
|
|
? (safepos > writepos && readpos < writepos)
|
|
|
|
: (safepos < writepos && readpos > writepos)) ||
|
|
|
|
time_after(jiffies, conf->reshape_checkpoint + 10*HZ)) {
|
2006-06-26 14:27:43 +07:00
|
|
|
/* Cannot proceed until we've updated the superblock... */
|
|
|
|
wait_event(conf->wait_for_overlap,
|
2013-11-19 08:02:01 +07:00
|
|
|
atomic_read(&conf->reshape_stripes)==0
|
|
|
|
|| test_bit(MD_RECOVERY_INTR, &mddev->recovery));
|
|
|
|
if (atomic_read(&conf->reshape_stripes) != 0)
|
|
|
|
return 0;
|
2009-03-31 11:16:46 +07:00
|
|
|
mddev->reshape_position = conf->reshape_progress;
|
2011-01-14 05:14:34 +07:00
|
|
|
mddev->curr_resync_completed = sector_nr;
|
2017-10-17 12:18:36 +07:00
|
|
|
if (!mddev->reshape_backwards)
|
|
|
|
/* Can update recovery_offset */
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
if (rdev->raid_disk >= 0 &&
|
|
|
|
!test_bit(Journal, &rdev->flags) &&
|
|
|
|
!test_bit(In_sync, &rdev->flags) &&
|
|
|
|
rdev->recovery_offset < sector_nr)
|
|
|
|
rdev->recovery_offset = sector_nr;
|
|
|
|
|
2009-03-31 11:28:40 +07:00
|
|
|
conf->reshape_checkpoint = jiffies;
|
2016-12-09 06:48:19 +07:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2006-06-26 14:27:43 +07:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2016-12-09 06:48:19 +07:00
|
|
|
wait_event(mddev->sb_wait, mddev->sb_flags == 0 ||
|
2013-11-19 08:02:01 +07:00
|
|
|
test_bit(MD_RECOVERY_INTR, &mddev->recovery));
|
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
|
|
|
|
return 0;
|
2006-06-26 14:27:43 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
2009-03-31 11:16:46 +07:00
|
|
|
conf->reshape_safe = mddev->reshape_position;
|
2006-06-26 14:27:43 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
wake_up(&conf->wait_for_overlap);
|
2009-04-14 13:28:34 +07:00
|
|
|
sysfs_notify(&mddev->kobj, NULL, "sync_completed");
|
2006-06-26 14:27:43 +07:00
|
|
|
}
|
|
|
|
|
2009-03-31 11:26:47 +07:00
|
|
|
INIT_LIST_HEAD(&stripes);
|
2009-03-31 11:21:40 +07:00
|
|
|
for (i = 0; i < reshape_sectors; i += STRIPE_SECTORS) {
|
2006-06-26 14:27:43 +07:00
|
|
|
int j;
|
2009-09-23 15:06:41 +07:00
|
|
|
int skipped_disk = 0;
|
2015-08-14 04:31:57 +07:00
|
|
|
sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1);
|
2006-06-26 14:27:43 +07:00
|
|
|
set_bit(STRIPE_EXPANDING, &sh->state);
|
|
|
|
atomic_inc(&conf->reshape_stripes);
|
|
|
|
/* If any of this stripe is beyond the end of the old
|
|
|
|
* array, then we need to zero those blocks
|
|
|
|
*/
|
|
|
|
for (j=sh->disks; j--;) {
|
|
|
|
sector_t s;
|
|
|
|
if (j == sh->pd_idx)
|
|
|
|
continue;
|
2007-03-01 11:11:53 +07:00
|
|
|
if (conf->level == 6 &&
|
2009-03-31 10:39:38 +07:00
|
|
|
j == sh->qd_idx)
|
2007-03-01 11:11:53 +07:00
|
|
|
continue;
|
2015-08-14 04:31:57 +07:00
|
|
|
s = raid5_compute_blocknr(sh, j, 0);
|
2009-03-31 11:00:31 +07:00
|
|
|
if (s < raid5_size(mddev, 0, 0)) {
|
2009-09-23 15:06:41 +07:00
|
|
|
skipped_disk = 1;
|
2006-06-26 14:27:43 +07:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
memset(page_address(sh->dev[j].page), 0, STRIPE_SIZE);
|
|
|
|
set_bit(R5_Expanded, &sh->dev[j].flags);
|
|
|
|
set_bit(R5_UPTODATE, &sh->dev[j].flags);
|
|
|
|
}
|
2009-09-23 15:06:41 +07:00
|
|
|
if (!skipped_disk) {
|
2006-06-26 14:27:43 +07:00
|
|
|
set_bit(STRIPE_EXPAND_READY, &sh->state);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
|
|
|
}
|
2009-03-31 11:26:47 +07:00
|
|
|
list_add(&sh->lru, &stripes);
|
2006-06-26 14:27:43 +07:00
|
|
|
}
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
2012-05-21 06:27:00 +07:00
|
|
|
if (mddev->reshape_backwards)
|
2009-03-31 11:21:40 +07:00
|
|
|
conf->reshape_progress -= reshape_sectors * new_data_disks;
|
2009-03-31 11:16:46 +07:00
|
|
|
else
|
2009-03-31 11:21:40 +07:00
|
|
|
conf->reshape_progress += reshape_sectors * new_data_disks;
|
2006-06-26 14:27:43 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
/* Ok, those stripe are ready. We can start scheduling
|
|
|
|
* reads on the source stripes.
|
|
|
|
* The source stripes are determined by mapping the first and last
|
|
|
|
* block on the destination stripes.
|
|
|
|
*/
|
|
|
|
first_sector =
|
2009-03-31 11:17:38 +07:00
|
|
|
raid5_compute_sector(conf, stripe_addr*(new_data_disks),
|
2009-03-31 10:39:38 +07:00
|
|
|
1, &dd_idx, NULL);
|
2006-06-26 14:27:43 +07:00
|
|
|
last_sector =
|
2009-06-09 13:32:22 +07:00
|
|
|
raid5_compute_sector(conf, ((stripe_addr+reshape_sectors)
|
2009-06-18 05:45:55 +07:00
|
|
|
* new_data_disks - 1),
|
2009-03-31 10:39:38 +07:00
|
|
|
1, &dd_idx, NULL);
|
2009-03-31 10:33:13 +07:00
|
|
|
if (last_sector >= mddev->dev_sectors)
|
|
|
|
last_sector = mddev->dev_sectors - 1;
|
2006-06-26 14:27:43 +07:00
|
|
|
while (first_sector <= last_sector) {
|
2015-08-14 04:31:57 +07:00
|
|
|
sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1);
|
2006-06-26 14:27:43 +07:00
|
|
|
set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
|
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2006-06-26 14:27:43 +07:00
|
|
|
first_sector += STRIPE_SECTORS;
|
|
|
|
}
|
2009-03-31 11:26:47 +07:00
|
|
|
/* Now that the sources are clearly marked, we can release
|
|
|
|
* the destination stripes
|
|
|
|
*/
|
|
|
|
while (!list_empty(&stripes)) {
|
|
|
|
sh = list_entry(stripes.next, struct stripe_head, lru);
|
|
|
|
list_del_init(&sh->lru);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2009-03-31 11:26:47 +07:00
|
|
|
}
|
2008-02-06 16:39:52 +07:00
|
|
|
/* If this takes us to the resync_max point where we have to pause,
|
|
|
|
* then we need to write out the superblock.
|
|
|
|
*/
|
2009-03-31 11:21:40 +07:00
|
|
|
sector_nr += reshape_sectors;
|
2015-07-06 09:28:45 +07:00
|
|
|
retn = reshape_sectors;
|
|
|
|
finish:
|
2015-07-17 09:06:02 +07:00
|
|
|
if (mddev->curr_resync_completed > mddev->resync_max ||
|
|
|
|
(sector_nr - mddev->curr_resync_completed) * 2
|
2009-04-17 08:06:30 +07:00
|
|
|
>= mddev->resync_max - mddev->curr_resync_completed) {
|
2008-02-06 16:39:52 +07:00
|
|
|
/* Cannot proceed until we've updated the superblock... */
|
|
|
|
wait_event(conf->wait_for_overlap,
|
2013-11-19 08:02:01 +07:00
|
|
|
atomic_read(&conf->reshape_stripes) == 0
|
|
|
|
|| test_bit(MD_RECOVERY_INTR, &mddev->recovery));
|
|
|
|
if (atomic_read(&conf->reshape_stripes) != 0)
|
|
|
|
goto ret;
|
2009-03-31 11:16:46 +07:00
|
|
|
mddev->reshape_position = conf->reshape_progress;
|
2011-01-14 05:14:34 +07:00
|
|
|
mddev->curr_resync_completed = sector_nr;
|
2017-10-17 12:18:36 +07:00
|
|
|
if (!mddev->reshape_backwards)
|
|
|
|
/* Can update recovery_offset */
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
if (rdev->raid_disk >= 0 &&
|
|
|
|
!test_bit(Journal, &rdev->flags) &&
|
|
|
|
!test_bit(In_sync, &rdev->flags) &&
|
|
|
|
rdev->recovery_offset < sector_nr)
|
|
|
|
rdev->recovery_offset = sector_nr;
|
2009-03-31 11:28:40 +07:00
|
|
|
conf->reshape_checkpoint = jiffies;
|
2016-12-09 06:48:19 +07:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2008-02-06 16:39:52 +07:00
|
|
|
md_wakeup_thread(mddev->thread);
|
|
|
|
wait_event(mddev->sb_wait,
|
2016-12-09 06:48:19 +07:00
|
|
|
!test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags)
|
2013-11-19 08:02:01 +07:00
|
|
|
|| test_bit(MD_RECOVERY_INTR, &mddev->recovery));
|
|
|
|
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
|
|
|
|
goto ret;
|
2008-02-06 16:39:52 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
2009-03-31 11:16:46 +07:00
|
|
|
conf->reshape_safe = mddev->reshape_position;
|
2008-02-06 16:39:52 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
wake_up(&conf->wait_for_overlap);
|
2009-04-14 13:28:34 +07:00
|
|
|
sysfs_notify(&mddev->kobj, NULL, "sync_completed");
|
2008-02-06 16:39:52 +07:00
|
|
|
}
|
2013-11-19 08:02:01 +07:00
|
|
|
ret:
|
2015-07-06 09:28:45 +07:00
|
|
|
return retn;
|
2006-06-26 14:27:43 +07:00
|
|
|
}
|
|
|
|
|
2016-01-21 04:52:20 +07:00
|
|
|
static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_nr,
|
|
|
|
int *skipped)
|
2006-06-26 14:27:43 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2006-06-26 14:27:43 +07:00
|
|
|
struct stripe_head *sh;
|
2009-03-31 10:33:13 +07:00
|
|
|
sector_t max_sector = mddev->dev_sectors;
|
2010-10-19 06:03:39 +07:00
|
|
|
sector_t sync_blocks;
|
2006-06-26 14:27:38 +07:00
|
|
|
int still_degraded = 0;
|
|
|
|
int i;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2005-09-10 06:23:54 +07:00
|
|
|
if (sector_nr >= max_sector) {
|
2005-04-17 05:20:36 +07:00
|
|
|
/* just being told to finish up .. nothing much to do */
|
2009-03-31 11:15:05 +07:00
|
|
|
|
2006-03-27 16:18:10 +07:00
|
|
|
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
|
|
|
|
end_reshape(conf);
|
|
|
|
return 0;
|
|
|
|
}
|
2005-09-10 06:23:54 +07:00
|
|
|
|
|
|
|
if (mddev->curr_resync < max_sector) /* aborted */
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
|
|
|
|
&sync_blocks, 1);
|
2006-06-26 14:27:38 +07:00
|
|
|
else /* completed sync */
|
2005-09-10 06:23:54 +07:00
|
|
|
conf->fullsync = 0;
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_close_sync(mddev->bitmap);
|
2005-09-10 06:23:54 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2006-03-27 16:18:09 +07:00
|
|
|
|
2009-08-03 07:59:58 +07:00
|
|
|
/* Allow raid5_quiesce to complete */
|
|
|
|
wait_event(conf->wait_for_overlap, conf->quiesce != 2);
|
|
|
|
|
2006-06-26 14:27:43 +07:00
|
|
|
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
|
|
|
|
return reshape_request(mddev, sector_nr, skipped);
|
2006-03-27 16:18:11 +07:00
|
|
|
|
2008-02-06 16:39:52 +07:00
|
|
|
/* No need to check resync_max as we never do more than one
|
|
|
|
* stripe, and as resync_max will always be on a chunk boundary,
|
|
|
|
* if the check in md_do_sync didn't fire, there is no chance
|
|
|
|
* of overstepping resync_max here
|
|
|
|
*/
|
|
|
|
|
2006-06-26 14:27:38 +07:00
|
|
|
/* if there is too many failed drives and we are trying
|
2005-04-17 05:20:36 +07:00
|
|
|
* to resync, then assert that we are finished, because there is
|
|
|
|
* nothing we can do.
|
|
|
|
*/
|
2006-06-26 14:27:55 +07:00
|
|
|
if (mddev->degraded >= conf->max_degraded &&
|
2006-06-26 14:27:38 +07:00
|
|
|
test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
|
2009-03-31 10:33:13 +07:00
|
|
|
sector_t rv = mddev->dev_sectors - sector_nr;
|
2005-06-22 07:17:13 +07:00
|
|
|
*skipped = 1;
|
2005-04-17 05:20:36 +07:00
|
|
|
return rv;
|
|
|
|
}
|
2013-04-24 08:42:41 +07:00
|
|
|
if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
|
|
|
|
!conf->fullsync &&
|
2018-08-02 05:20:50 +07:00
|
|
|
!md_bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, 1) &&
|
2013-04-24 08:42:41 +07:00
|
|
|
sync_blocks >= STRIPE_SECTORS) {
|
2005-09-10 06:23:54 +07:00
|
|
|
/* we can skip this block, and probably more */
|
|
|
|
sync_blocks /= STRIPE_SECTORS;
|
|
|
|
*skipped = 1;
|
|
|
|
return sync_blocks * STRIPE_SECTORS; /* keep things rounded to whole stripes */
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false);
|
2008-02-06 16:39:50 +07:00
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (sh == NULL) {
|
2015-08-14 04:31:57 +07:00
|
|
|
sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
/* make sure we don't swamp the stripe cache if someone else
|
2006-06-26 14:27:38 +07:00
|
|
|
* is trying to get access
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2005-11-07 16:01:17 +07:00
|
|
|
schedule_timeout_uninterruptible(1);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2006-06-26 14:27:38 +07:00
|
|
|
/* Need to check if array will still be degraded after recovery/resync
|
2015-01-07 00:35:02 +07:00
|
|
|
* Note in case of > 1 drive failures it's possible we're rebuilding
|
|
|
|
* one drive while leaving another faulty drive in array.
|
2006-06-26 14:27:38 +07:00
|
|
|
*/
|
2015-01-07 00:35:02 +07:00
|
|
|
rcu_read_lock();
|
|
|
|
for (i = 0; i < conf->raid_disks; i++) {
|
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 04:07:29 +07:00
|
|
|
struct md_rdev *rdev = READ_ONCE(conf->disks[i].rdev);
|
2015-01-07 00:35:02 +07:00
|
|
|
|
|
|
|
if (rdev == NULL || test_bit(Faulty, &rdev->flags))
|
2006-06-26 14:27:38 +07:00
|
|
|
still_degraded = 1;
|
2015-01-07 00:35:02 +07:00
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2006-06-26 14:27:38 +07:00
|
|
|
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
|
2006-06-26 14:27:38 +07:00
|
|
|
|
2011-07-26 08:19:49 +07:00
|
|
|
set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
|
2014-06-10 07:06:19 +07:00
|
|
|
set_bit(STRIPE_HANDLE, &sh->state);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
return STRIPE_SECTORS;
|
|
|
|
}
|
|
|
|
|
2017-03-15 10:05:13 +07:00
|
|
|
static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
|
|
|
|
unsigned int offset)
|
2006-12-10 17:20:47 +07:00
|
|
|
{
|
|
|
|
/* We may not be able to submit a whole bio at once as there
|
|
|
|
* may not be enough stripe_heads available.
|
|
|
|
* We cannot pre-allocate enough stripe_heads as we may need
|
|
|
|
* more than exist in the cache (if we allow ever large chunks).
|
|
|
|
* So we do one stripe head at a time and record in
|
|
|
|
* ->bi_hw_segments how many have been done.
|
|
|
|
*
|
|
|
|
* We *know* that this entire raid_bio is in one chunk, so
|
|
|
|
* it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
|
|
|
|
*/
|
|
|
|
struct stripe_head *sh;
|
2009-03-31 10:39:38 +07:00
|
|
|
int dd_idx;
|
2006-12-10 17:20:47 +07:00
|
|
|
sector_t sector, logical_sector, last_sector;
|
|
|
|
int scnt = 0;
|
|
|
|
int handled = 0;
|
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
logical_sector = raid_bio->bi_iter.bi_sector &
|
|
|
|
~((sector_t)STRIPE_SECTORS-1);
|
2009-03-31 10:39:38 +07:00
|
|
|
sector = raid5_compute_sector(conf, logical_sector,
|
2009-03-31 10:39:38 +07:00
|
|
|
0, &dd_idx, NULL);
|
2012-09-26 05:05:12 +07:00
|
|
|
last_sector = bio_end_sector(raid_bio);
|
2006-12-10 17:20:47 +07:00
|
|
|
|
|
|
|
for (; logical_sector < last_sector;
|
2007-02-09 05:20:29 +07:00
|
|
|
logical_sector += STRIPE_SECTORS,
|
|
|
|
sector += STRIPE_SECTORS,
|
|
|
|
scnt++) {
|
2006-12-10 17:20:47 +07:00
|
|
|
|
2017-03-15 10:05:13 +07:00
|
|
|
if (scnt < offset)
|
2006-12-10 17:20:47 +07:00
|
|
|
/* already done this stripe */
|
|
|
|
continue;
|
|
|
|
|
2015-08-14 04:31:57 +07:00
|
|
|
sh = raid5_get_active_stripe(conf, sector, 0, 1, 1);
|
2006-12-10 17:20:47 +07:00
|
|
|
|
|
|
|
if (!sh) {
|
|
|
|
/* failed to get a stripe - must wait */
|
|
|
|
conf->retry_read_aligned = raid_bio;
|
2017-03-15 10:05:13 +07:00
|
|
|
conf->retry_read_offset = scnt;
|
2006-12-10 17:20:47 +07:00
|
|
|
return handled;
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:57:03 +07:00
|
|
|
if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) {
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2007-02-09 05:20:29 +07:00
|
|
|
conf->retry_read_aligned = raid_bio;
|
2017-03-15 10:05:13 +07:00
|
|
|
conf->retry_read_offset = scnt;
|
2007-02-09 05:20:29 +07:00
|
|
|
return handled;
|
|
|
|
}
|
|
|
|
|
2012-07-31 07:04:21 +07:00
|
|
|
set_bit(R5_ReadNoMerge, &sh->dev[dd_idx].flags);
|
2009-07-15 01:48:22 +07:00
|
|
|
handle_stripe(sh);
|
2015-08-14 04:31:57 +07:00
|
|
|
raid5_release_stripe(sh);
|
2006-12-10 17:20:47 +07:00
|
|
|
handled++;
|
|
|
|
}
|
2017-03-15 10:05:13 +07:00
|
|
|
|
|
|
|
bio_endio(raid_bio);
|
|
|
|
|
2006-12-10 17:20:47 +07:00
|
|
|
if (atomic_dec_and_test(&conf->active_aligned_reads))
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 15:19:06 +07:00
|
|
|
wake_up(&conf->wait_for_quiescent);
|
2006-12-10 17:20:47 +07:00
|
|
|
return handled;
|
|
|
|
}
|
|
|
|
|
2013-08-29 14:40:32 +07:00
|
|
|
static int handle_active_stripes(struct r5conf *conf, int group,
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
struct r5worker *worker,
|
|
|
|
struct list_head *temp_inactive_list)
|
2012-08-02 05:33:15 +07:00
|
|
|
{
|
|
|
|
struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int i, batch_size = 0, hash;
|
|
|
|
bool release_inactive = false;
|
2012-08-02 05:33:15 +07:00
|
|
|
|
|
|
|
while (batch_size < MAX_STRIPE_BATCH &&
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
(sh = __get_priority_stripe(conf, group)) != NULL)
|
2012-08-02 05:33:15 +07:00
|
|
|
batch[batch_size++] = sh;
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
if (batch_size == 0) {
|
|
|
|
for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
|
|
|
|
if (!list_empty(temp_inactive_list + i))
|
|
|
|
break;
|
2015-09-03 03:49:46 +07:00
|
|
|
if (i == NR_STRIPE_HASH_LOCKS) {
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2017-12-27 16:31:40 +07:00
|
|
|
log_flush_stripe_to_raid(conf);
|
2015-09-03 03:49:46 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
return batch_size;
|
2015-09-03 03:49:46 +07:00
|
|
|
}
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
release_inactive = true;
|
|
|
|
}
|
2012-08-02 05:33:15 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
release_inactive_stripe_list(conf, temp_inactive_list,
|
|
|
|
NR_STRIPE_HASH_LOCKS);
|
|
|
|
|
2015-09-03 03:49:46 +07:00
|
|
|
r5l_flush_stripe_to_raid(conf->log);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
if (release_inactive) {
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-08-02 05:33:15 +07:00
|
|
|
for (i = 0; i < batch_size; i++)
|
|
|
|
handle_stripe(batch[i]);
|
2017-03-09 15:59:58 +07:00
|
|
|
log_write_stripe_run(conf);
|
2012-08-02 05:33:15 +07:00
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
for (i = 0; i < batch_size; i++) {
|
|
|
|
hash = batch[i]->hash_lock_index;
|
|
|
|
__release_stripe(conf, batch[i], &temp_inactive_list[hash]);
|
|
|
|
}
|
2012-08-02 05:33:15 +07:00
|
|
|
return batch_size;
|
|
|
|
}
|
2006-12-10 17:20:47 +07:00
|
|
|
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
static void raid5_do_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct r5worker *worker = container_of(work, struct r5worker, work);
|
|
|
|
struct r5worker_group *group = worker->group;
|
|
|
|
struct r5conf *conf = group->conf;
|
2017-03-15 10:05:12 +07:00
|
|
|
struct mddev *mddev = conf->mddev;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
int group_id = group - conf->worker_groups;
|
|
|
|
int handled;
|
|
|
|
struct blk_plug plug;
|
|
|
|
|
|
|
|
pr_debug("+++ raid5worker active\n");
|
|
|
|
|
|
|
|
blk_start_plug(&plug);
|
|
|
|
handled = 0;
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
while (1) {
|
|
|
|
int batch_size, released;
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
released = release_stripe_list(conf, worker->temp_inactive_list);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
batch_size = handle_active_stripes(conf, group_id, worker,
|
|
|
|
worker->temp_inactive_list);
|
2013-08-29 14:40:32 +07:00
|
|
|
worker->working = false;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
if (!batch_size && !released)
|
|
|
|
break;
|
|
|
|
handled += batch_size;
|
2017-03-15 10:05:12 +07:00
|
|
|
wait_event_lock_irq(mddev->sb_wait,
|
|
|
|
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
|
|
|
|
conf->device_lock);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
}
|
|
|
|
pr_debug("%d stripes handled\n", handled);
|
|
|
|
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2017-07-24 13:17:40 +07:00
|
|
|
|
2017-08-24 23:53:59 +07:00
|
|
|
flush_deferred_bios(conf);
|
|
|
|
|
|
|
|
r5l_flush_stripe_to_raid(conf->log);
|
|
|
|
|
2017-07-24 13:17:40 +07:00
|
|
|
async_tx_issue_pending_all();
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
blk_finish_plug(&plug);
|
|
|
|
|
|
|
|
pr_debug("--- raid5worker inactive\n");
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* This is our raid5 kernel thread.
|
|
|
|
*
|
|
|
|
* We scan the hash table for stripes which can be handled now.
|
|
|
|
* During the scan, completed stripes are saved for us by the interrupt
|
|
|
|
* handler, so that they will not have to wait for our next wakeup.
|
|
|
|
*/
|
2012-10-11 09:34:00 +07:00
|
|
|
static void raid5d(struct md_thread *thread)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-10-11 09:34:00 +07:00
|
|
|
struct mddev *mddev = thread->mddev;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2005-04-17 05:20:36 +07:00
|
|
|
int handled;
|
2011-04-18 15:25:41 +07:00
|
|
|
struct blk_plug plug;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("+++ raid5d active\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
md_check_recovery(mddev);
|
|
|
|
|
2011-04-18 15:25:41 +07:00
|
|
|
blk_start_plug(&plug);
|
2005-04-17 05:20:36 +07:00
|
|
|
handled = 0;
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
while (1) {
|
2006-12-10 17:20:47 +07:00
|
|
|
struct bio *bio;
|
2013-08-27 16:50:39 +07:00
|
|
|
int batch_size, released;
|
2017-03-15 10:05:13 +07:00
|
|
|
unsigned int offset;
|
2013-08-27 16:50:39 +07:00
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
released = release_stripe_list(conf, conf->temp_inactive_list);
|
2015-02-26 08:47:56 +07:00
|
|
|
if (released)
|
|
|
|
clear_bit(R5_DID_ALLOC, &conf->cache_state);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-07-31 14:08:14 +07:00
|
|
|
if (
|
2011-04-18 15:25:43 +07:00
|
|
|
!list_empty(&conf->bitmap_list)) {
|
|
|
|
/* Now is a good time to flush some bitmap updates */
|
|
|
|
conf->seq_flush++;
|
2005-11-29 04:44:10 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2018-08-02 05:20:50 +07:00
|
|
|
md_bitmap_unplug(mddev->bitmap);
|
2005-11-29 04:44:10 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
2011-04-18 15:25:43 +07:00
|
|
|
conf->seq_write = conf->seq_flush;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
activate_bit_delay(conf, conf->temp_inactive_list);
|
2005-09-10 06:23:54 +07:00
|
|
|
}
|
2012-07-31 14:08:14 +07:00
|
|
|
raid5_activate_delayed(conf);
|
2005-09-10 06:23:54 +07:00
|
|
|
|
2017-03-15 10:05:13 +07:00
|
|
|
while ((bio = remove_bio_from_retry(conf, &offset))) {
|
2006-12-10 17:20:47 +07:00
|
|
|
int ok;
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2017-03-15 10:05:13 +07:00
|
|
|
ok = retry_aligned_read(conf, bio, offset);
|
2006-12-10 17:20:47 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
if (!ok)
|
|
|
|
break;
|
|
|
|
handled++;
|
|
|
|
}
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
|
|
|
|
conf->temp_inactive_list);
|
2013-08-27 16:50:39 +07:00
|
|
|
if (!batch_size && !released)
|
2005-04-17 05:20:36 +07:00
|
|
|
break;
|
2012-08-02 05:33:15 +07:00
|
|
|
handled += batch_size;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2016-12-09 06:48:19 +07:00
|
|
|
if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
|
2012-08-02 05:33:15 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2011-07-28 08:31:48 +07:00
|
|
|
md_check_recovery(mddev);
|
2012-08-02 05:33:15 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("%d stripes handled\n", handled);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2015-07-06 09:49:23 +07:00
|
|
|
if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
|
|
|
|
mutex_trylock(&conf->cache_size_mutex)) {
|
2015-02-26 08:47:56 +07:00
|
|
|
grow_one_stripe(conf, __GFP_NOWARN);
|
|
|
|
/* Set flag even if allocation failed. This helps
|
|
|
|
* slow down allocation requests when mem is short
|
|
|
|
*/
|
|
|
|
set_bit(R5_DID_ALLOC, &conf->cache_state);
|
2015-07-06 09:49:23 +07:00
|
|
|
mutex_unlock(&conf->cache_size_mutex);
|
2015-02-26 08:47:56 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
flush_deferred_bios(conf);
|
|
|
|
|
raid5: log reclaim support
This is the reclaim support for raid5 log. A stripe write will have
following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused
In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.
It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.
Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.
Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.
There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.
The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.
Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-14 04:32:00 +07:00
|
|
|
r5l_flush_stripe_to_raid(conf->log);
|
|
|
|
|
2008-07-24 02:05:51 +07:00
|
|
|
async_tx_issue_pending_all();
|
2011-04-18 15:25:41 +07:00
|
|
|
blk_finish_plug(&plug);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-10 01:56:43 +07:00
|
|
|
pr_debug("--- raid5d inactive\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2005-11-09 12:39:25 +07:00
|
|
|
static ssize_t
|
2011-10-11 12:47:53 +07:00
|
|
|
raid5_show_stripe_cache_size(struct mddev *mddev, char *page)
|
2005-11-09 12:39:25 +07:00
|
|
|
{
|
2014-12-15 08:56:59 +07:00
|
|
|
struct r5conf *conf;
|
|
|
|
int ret = 0;
|
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
conf = mddev->private;
|
2005-11-09 12:39:39 +07:00
|
|
|
if (conf)
|
2015-02-26 08:47:56 +07:00
|
|
|
ret = sprintf(page, "%d\n", conf->min_nr_stripes);
|
2014-12-15 08:56:59 +07:00
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
return ret;
|
2005-11-09 12:39:25 +07:00
|
|
|
}
|
|
|
|
|
2010-06-01 16:37:24 +07:00
|
|
|
int
|
2011-10-11 12:47:53 +07:00
|
|
|
raid5_set_cache_size(struct mddev *mddev, int size)
|
2005-11-09 12:39:25 +07:00
|
|
|
{
|
2018-03-28 06:54:16 +07:00
|
|
|
int result = 0;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2008-06-28 11:44:04 +07:00
|
|
|
|
2010-06-01 16:37:24 +07:00
|
|
|
if (size <= 16 || size > 32768)
|
2005-11-09 12:39:25 +07:00
|
|
|
return -EINVAL;
|
2015-02-25 08:10:35 +07:00
|
|
|
|
2015-02-26 08:47:56 +07:00
|
|
|
conf->min_nr_stripes = size;
|
2015-07-06 09:49:23 +07:00
|
|
|
mutex_lock(&conf->cache_size_mutex);
|
2015-02-25 08:10:35 +07:00
|
|
|
while (size < conf->max_nr_stripes &&
|
|
|
|
drop_one_stripe(conf))
|
|
|
|
;
|
2015-07-06 09:49:23 +07:00
|
|
|
mutex_unlock(&conf->cache_size_mutex);
|
2015-02-25 08:10:35 +07:00
|
|
|
|
2017-05-08 16:56:55 +07:00
|
|
|
md_allow_write(mddev);
|
2015-02-25 08:10:35 +07:00
|
|
|
|
2015-07-06 09:49:23 +07:00
|
|
|
mutex_lock(&conf->cache_size_mutex);
|
2015-02-25 08:10:35 +07:00
|
|
|
while (size > conf->max_nr_stripes)
|
2018-03-28 06:54:16 +07:00
|
|
|
if (!grow_one_stripe(conf, GFP_KERNEL)) {
|
|
|
|
conf->min_nr_stripes = conf->max_nr_stripes;
|
|
|
|
result = -ENOMEM;
|
2015-02-25 08:10:35 +07:00
|
|
|
break;
|
2018-03-28 06:54:16 +07:00
|
|
|
}
|
2015-07-06 09:49:23 +07:00
|
|
|
mutex_unlock(&conf->cache_size_mutex);
|
2015-02-25 08:10:35 +07:00
|
|
|
|
2018-03-28 06:54:16 +07:00
|
|
|
return result;
|
2010-06-01 16:37:24 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(raid5_set_cache_size);
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 12:47:53 +07:00
|
|
|
raid5_store_stripe_cache_size(struct mddev *mddev, const char *page, size_t len)
|
2010-06-01 16:37:24 +07:00
|
|
|
{
|
2014-12-15 08:57:01 +07:00
|
|
|
struct r5conf *conf;
|
2010-06-01 16:37:24 +07:00
|
|
|
unsigned long new;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (len >= PAGE_SIZE)
|
|
|
|
return -EINVAL;
|
2013-06-01 14:15:16 +07:00
|
|
|
if (kstrtoul(page, 10, &new))
|
2010-06-01 16:37:24 +07:00
|
|
|
return -EINVAL;
|
2014-12-15 08:57:01 +07:00
|
|
|
err = mddev_lock(mddev);
|
2010-06-01 16:37:24 +07:00
|
|
|
if (err)
|
|
|
|
return err;
|
2014-12-15 08:57:01 +07:00
|
|
|
conf = mddev->private;
|
|
|
|
if (!conf)
|
|
|
|
err = -ENODEV;
|
|
|
|
else
|
|
|
|
err = raid5_set_cache_size(mddev, new);
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
|
|
|
|
return err ?: len;
|
2005-11-09 12:39:25 +07:00
|
|
|
}
|
2005-11-09 12:39:30 +07:00
|
|
|
|
2005-11-09 12:39:39 +07:00
|
|
|
static struct md_sysfs_entry
|
|
|
|
raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
|
|
|
|
raid5_show_stripe_cache_size,
|
|
|
|
raid5_store_stripe_cache_size);
|
2005-11-09 12:39:25 +07:00
|
|
|
|
2014-12-15 08:57:05 +07:00
|
|
|
static ssize_t
|
|
|
|
raid5_show_rmw_level(struct mddev *mddev, char *page)
|
|
|
|
{
|
|
|
|
struct r5conf *conf = mddev->private;
|
|
|
|
if (conf)
|
|
|
|
return sprintf(page, "%d\n", conf->rmw_level);
|
|
|
|
else
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
raid5_store_rmw_level(struct mddev *mddev, const char *page, size_t len)
|
|
|
|
{
|
|
|
|
struct r5conf *conf = mddev->private;
|
|
|
|
unsigned long new;
|
|
|
|
|
|
|
|
if (!conf)
|
|
|
|
return -ENODEV;
|
|
|
|
|
|
|
|
if (len >= PAGE_SIZE)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (kstrtoul(page, 10, &new))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (new != PARITY_DISABLE_RMW && !raid6_call.xor_syndrome)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (new != PARITY_DISABLE_RMW &&
|
|
|
|
new != PARITY_ENABLE_RMW &&
|
|
|
|
new != PARITY_PREFER_RMW)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
conf->rmw_level = new;
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry
|
|
|
|
raid5_rmw_level = __ATTR(rmw_level, S_IRUGO | S_IWUSR,
|
|
|
|
raid5_show_rmw_level,
|
|
|
|
raid5_store_rmw_level);
|
|
|
|
|
|
|
|
|
2008-04-28 16:15:53 +07:00
|
|
|
static ssize_t
|
2011-10-11 12:47:53 +07:00
|
|
|
raid5_show_preread_threshold(struct mddev *mddev, char *page)
|
2008-04-28 16:15:53 +07:00
|
|
|
{
|
2014-12-15 08:56:59 +07:00
|
|
|
struct r5conf *conf;
|
|
|
|
int ret = 0;
|
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
conf = mddev->private;
|
2008-04-28 16:15:53 +07:00
|
|
|
if (conf)
|
2014-12-15 08:56:59 +07:00
|
|
|
ret = sprintf(page, "%d\n", conf->bypass_threshold);
|
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
return ret;
|
2008-04-28 16:15:53 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
2011-10-11 12:47:53 +07:00
|
|
|
raid5_store_preread_threshold(struct mddev *mddev, const char *page, size_t len)
|
2008-04-28 16:15:53 +07:00
|
|
|
{
|
2014-12-15 08:57:01 +07:00
|
|
|
struct r5conf *conf;
|
2008-04-28 16:15:54 +07:00
|
|
|
unsigned long new;
|
2014-12-15 08:57:01 +07:00
|
|
|
int err;
|
|
|
|
|
2008-04-28 16:15:53 +07:00
|
|
|
if (len >= PAGE_SIZE)
|
|
|
|
return -EINVAL;
|
2013-06-01 14:15:16 +07:00
|
|
|
if (kstrtoul(page, 10, &new))
|
2008-04-28 16:15:53 +07:00
|
|
|
return -EINVAL;
|
2014-12-15 08:57:01 +07:00
|
|
|
|
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
conf = mddev->private;
|
|
|
|
if (!conf)
|
|
|
|
err = -ENODEV;
|
2015-02-26 08:47:56 +07:00
|
|
|
else if (new > conf->min_nr_stripes)
|
2014-12-15 08:57:01 +07:00
|
|
|
err = -EINVAL;
|
|
|
|
else
|
|
|
|
conf->bypass_threshold = new;
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
2008-04-28 16:15:53 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry
|
|
|
|
raid5_preread_bypass_threshold = __ATTR(preread_bypass_threshold,
|
|
|
|
S_IRUGO | S_IWUSR,
|
|
|
|
raid5_show_preread_threshold,
|
|
|
|
raid5_store_preread_threshold);
|
|
|
|
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
static ssize_t
|
|
|
|
raid5_show_skip_copy(struct mddev *mddev, char *page)
|
|
|
|
{
|
2014-12-15 08:56:59 +07:00
|
|
|
struct r5conf *conf;
|
|
|
|
int ret = 0;
|
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
conf = mddev->private;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
if (conf)
|
2014-12-15 08:56:59 +07:00
|
|
|
ret = sprintf(page, "%d\n", conf->skip_copy);
|
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
return ret;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t
|
|
|
|
raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len)
|
|
|
|
{
|
2014-12-15 08:57:01 +07:00
|
|
|
struct r5conf *conf;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
unsigned long new;
|
2014-12-15 08:57:01 +07:00
|
|
|
int err;
|
|
|
|
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
if (len >= PAGE_SIZE)
|
|
|
|
return -EINVAL;
|
|
|
|
if (kstrtoul(page, 10, &new))
|
|
|
|
return -EINVAL;
|
|
|
|
new = !!new;
|
2014-12-15 08:57:01 +07:00
|
|
|
|
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
conf = mddev->private;
|
|
|
|
if (!conf)
|
|
|
|
err = -ENODEV;
|
|
|
|
else if (new != conf->skip_copy) {
|
|
|
|
mddev_suspend(mddev);
|
|
|
|
conf->skip_copy = new;
|
|
|
|
if (new)
|
2017-02-02 21:56:50 +07:00
|
|
|
mddev->queue->backing_dev_info->capabilities |=
|
2014-12-15 08:57:01 +07:00
|
|
|
BDI_CAP_STABLE_WRITES;
|
|
|
|
else
|
2017-02-02 21:56:50 +07:00
|
|
|
mddev->queue->backing_dev_info->capabilities &=
|
2014-12-15 08:57:01 +07:00
|
|
|
~BDI_CAP_STABLE_WRITES;
|
|
|
|
mddev_resume(mddev);
|
|
|
|
}
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return err ?: len;
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry
|
|
|
|
raid5_skip_copy = __ATTR(skip_copy, S_IRUGO | S_IWUSR,
|
|
|
|
raid5_show_skip_copy,
|
|
|
|
raid5_store_skip_copy);
|
|
|
|
|
2005-11-09 12:39:25 +07:00
|
|
|
static ssize_t
|
2011-10-11 12:47:53 +07:00
|
|
|
stripe_cache_active_show(struct mddev *mddev, char *page)
|
2005-11-09 12:39:25 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2005-11-09 12:39:39 +07:00
|
|
|
if (conf)
|
|
|
|
return sprintf(page, "%d\n", atomic_read(&conf->active_stripes));
|
|
|
|
else
|
|
|
|
return 0;
|
2005-11-09 12:39:25 +07:00
|
|
|
}
|
|
|
|
|
2005-11-09 12:39:39 +07:00
|
|
|
static struct md_sysfs_entry
|
|
|
|
raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
|
2005-11-09 12:39:25 +07:00
|
|
|
|
2013-08-27 16:50:42 +07:00
|
|
|
static ssize_t
|
|
|
|
raid5_show_group_thread_cnt(struct mddev *mddev, char *page)
|
|
|
|
{
|
2014-12-15 08:56:59 +07:00
|
|
|
struct r5conf *conf;
|
|
|
|
int ret = 0;
|
|
|
|
spin_lock(&mddev->lock);
|
|
|
|
conf = mddev->private;
|
2013-08-27 16:50:42 +07:00
|
|
|
if (conf)
|
2014-12-15 08:56:59 +07:00
|
|
|
ret = sprintf(page, "%d\n", conf->worker_cnt_per_group);
|
|
|
|
spin_unlock(&mddev->lock);
|
|
|
|
return ret;
|
2013-08-27 16:50:42 +07:00
|
|
|
}
|
|
|
|
|
2013-11-14 11:16:20 +07:00
|
|
|
static int alloc_thread_groups(struct r5conf *conf, int cnt,
|
|
|
|
int *group_cnt,
|
|
|
|
int *worker_cnt_per_group,
|
|
|
|
struct r5worker_group **worker_groups);
|
2013-08-27 16:50:42 +07:00
|
|
|
static ssize_t
|
|
|
|
raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len)
|
|
|
|
{
|
2014-12-15 08:57:01 +07:00
|
|
|
struct r5conf *conf;
|
2017-09-22 01:03:52 +07:00
|
|
|
unsigned int new;
|
2013-08-27 16:50:42 +07:00
|
|
|
int err;
|
2013-11-14 11:16:20 +07:00
|
|
|
struct r5worker_group *new_groups, *old_groups;
|
|
|
|
int group_cnt, worker_cnt_per_group;
|
2013-08-27 16:50:42 +07:00
|
|
|
|
|
|
|
if (len >= PAGE_SIZE)
|
|
|
|
return -EINVAL;
|
2017-09-22 01:03:52 +07:00
|
|
|
if (kstrtouint(page, 10, &new))
|
|
|
|
return -EINVAL;
|
|
|
|
/* 8192 should be big enough */
|
|
|
|
if (new > 8192)
|
2013-08-27 16:50:42 +07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2014-12-15 08:57:01 +07:00
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
conf = mddev->private;
|
|
|
|
if (!conf)
|
|
|
|
err = -ENODEV;
|
|
|
|
else if (new != conf->worker_cnt_per_group) {
|
|
|
|
mddev_suspend(mddev);
|
2013-08-27 16:50:42 +07:00
|
|
|
|
2014-12-15 08:57:01 +07:00
|
|
|
old_groups = conf->worker_groups;
|
|
|
|
if (old_groups)
|
|
|
|
flush_workqueue(raid5_wq);
|
2013-11-14 11:16:19 +07:00
|
|
|
|
2014-12-15 08:57:01 +07:00
|
|
|
err = alloc_thread_groups(conf, new,
|
|
|
|
&group_cnt, &worker_cnt_per_group,
|
|
|
|
&new_groups);
|
|
|
|
if (!err) {
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
|
|
|
conf->group_cnt = group_cnt;
|
|
|
|
conf->worker_cnt_per_group = worker_cnt_per_group;
|
|
|
|
conf->worker_groups = new_groups;
|
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2013-08-27 16:50:42 +07:00
|
|
|
|
2014-12-15 08:57:01 +07:00
|
|
|
if (old_groups)
|
|
|
|
kfree(old_groups[0].workers);
|
|
|
|
kfree(old_groups);
|
|
|
|
}
|
|
|
|
mddev_resume(mddev);
|
2013-08-27 16:50:42 +07:00
|
|
|
}
|
2014-12-15 08:57:01 +07:00
|
|
|
mddev_unlock(mddev);
|
2013-08-27 16:50:42 +07:00
|
|
|
|
2014-12-15 08:57:01 +07:00
|
|
|
return err ?: len;
|
2013-08-27 16:50:42 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct md_sysfs_entry
|
|
|
|
raid5_group_thread_cnt = __ATTR(group_thread_cnt, S_IRUGO | S_IWUSR,
|
|
|
|
raid5_show_group_thread_cnt,
|
|
|
|
raid5_store_group_thread_cnt);
|
|
|
|
|
2005-11-09 12:39:30 +07:00
|
|
|
static struct attribute *raid5_attrs[] = {
|
2005-11-09 12:39:25 +07:00
|
|
|
&raid5_stripecache_size.attr,
|
|
|
|
&raid5_stripecache_active.attr,
|
2008-04-28 16:15:53 +07:00
|
|
|
&raid5_preread_bypass_threshold.attr,
|
2013-08-27 16:50:42 +07:00
|
|
|
&raid5_group_thread_cnt.attr,
|
raid5: add an option to avoid copy data from bio to stripe cache
The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.
In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.
Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.
Neilb:
changed BUG_ON to WARN_ON
Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-05-21 16:57:44 +07:00
|
|
|
&raid5_skip_copy.attr,
|
2014-12-15 08:57:05 +07:00
|
|
|
&raid5_rmw_level.attr,
|
2016-11-18 06:24:41 +07:00
|
|
|
&r5c_journal_mode.attr,
|
2005-11-09 12:39:25 +07:00
|
|
|
NULL,
|
|
|
|
};
|
2005-11-09 12:39:30 +07:00
|
|
|
static struct attribute_group raid5_attrs_group = {
|
|
|
|
.name = NULL,
|
|
|
|
.attrs = raid5_attrs,
|
2005-11-09 12:39:25 +07:00
|
|
|
};
|
|
|
|
|
2013-11-14 11:16:20 +07:00
|
|
|
static int alloc_thread_groups(struct r5conf *conf, int cnt,
|
|
|
|
int *group_cnt,
|
|
|
|
int *worker_cnt_per_group,
|
|
|
|
struct r5worker_group **worker_groups)
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
{
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int i, j, k;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
ssize_t size;
|
|
|
|
struct r5worker *workers;
|
|
|
|
|
2013-11-14 11:16:20 +07:00
|
|
|
*worker_cnt_per_group = cnt;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
if (cnt == 0) {
|
2013-11-14 11:16:20 +07:00
|
|
|
*group_cnt = 0;
|
|
|
|
*worker_groups = NULL;
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2013-11-14 11:16:20 +07:00
|
|
|
*group_cnt = num_possible_nodes();
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
size = sizeof(struct r5worker) * cnt;
|
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a * b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kzalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 04:03:40 +07:00
|
|
|
workers = kcalloc(size, *group_cnt, GFP_NOIO);
|
|
|
|
*worker_groups = kcalloc(*group_cnt, sizeof(struct r5worker_group),
|
|
|
|
GFP_NOIO);
|
2013-11-14 11:16:20 +07:00
|
|
|
if (!*worker_groups || !workers) {
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
kfree(workers);
|
2013-11-14 11:16:20 +07:00
|
|
|
kfree(*worker_groups);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2013-11-14 11:16:20 +07:00
|
|
|
for (i = 0; i < *group_cnt; i++) {
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
struct r5worker_group *group;
|
|
|
|
|
2013-11-25 07:12:43 +07:00
|
|
|
group = &(*worker_groups)[i];
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
INIT_LIST_HEAD(&group->handle_list);
|
2017-02-16 10:37:32 +07:00
|
|
|
INIT_LIST_HEAD(&group->loprio_list);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
group->conf = conf;
|
|
|
|
group->workers = workers + i * cnt;
|
|
|
|
|
|
|
|
for (j = 0; j < cnt; j++) {
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
struct r5worker *worker = group->workers + j;
|
|
|
|
worker->group = group;
|
|
|
|
INIT_WORK(&worker->work, raid5_do_work);
|
|
|
|
|
|
|
|
for (k = 0; k < NR_STRIPE_HASH_LOCKS; k++)
|
|
|
|
INIT_LIST_HEAD(worker->temp_inactive_list + k);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_thread_groups(struct r5conf *conf)
|
|
|
|
{
|
|
|
|
if (conf->worker_groups)
|
|
|
|
kfree(conf->worker_groups[0].workers);
|
|
|
|
kfree(conf->worker_groups);
|
|
|
|
conf->worker_groups = NULL;
|
|
|
|
}
|
|
|
|
|
2009-03-18 08:10:40 +07:00
|
|
|
static sector_t
|
2011-10-11 12:47:53 +07:00
|
|
|
raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks)
|
2009-03-18 08:10:40 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2009-03-18 08:10:40 +07:00
|
|
|
|
|
|
|
if (!sectors)
|
|
|
|
sectors = mddev->dev_sectors;
|
2009-10-16 12:35:30 +07:00
|
|
|
if (!raid_disks)
|
2009-03-31 11:10:36 +07:00
|
|
|
/* size is defined by the smallest of previous and new size */
|
2009-10-16 12:35:30 +07:00
|
|
|
raid_disks = min(conf->raid_disks, conf->previous_raid_disks);
|
2009-03-18 08:10:40 +07:00
|
|
|
|
2015-07-15 14:24:17 +07:00
|
|
|
sectors &= ~((sector_t)conf->chunk_sectors - 1);
|
|
|
|
sectors &= ~((sector_t)conf->prev_chunk_sectors - 1);
|
2009-03-18 08:10:40 +07:00
|
|
|
return sectors * (raid_disks - conf->max_degraded);
|
|
|
|
}
|
|
|
|
|
2014-02-06 05:12:45 +07:00
|
|
|
static void free_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu)
|
|
|
|
{
|
|
|
|
safe_put_page(percpu->spare_page);
|
2014-12-15 08:57:02 +07:00
|
|
|
if (percpu->scribble)
|
|
|
|
flex_array_free(percpu->scribble);
|
2014-02-06 05:12:45 +07:00
|
|
|
percpu->spare_page = NULL;
|
|
|
|
percpu->scribble = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int alloc_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu)
|
|
|
|
{
|
|
|
|
if (conf->level == 6 && !percpu->spare_page)
|
|
|
|
percpu->spare_page = alloc_page(GFP_KERNEL);
|
|
|
|
if (!percpu->scribble)
|
2014-12-15 08:57:02 +07:00
|
|
|
percpu->scribble = scribble_alloc(max(conf->raid_disks,
|
2015-05-08 15:19:39 +07:00
|
|
|
conf->previous_raid_disks),
|
|
|
|
max(conf->chunk_sectors,
|
|
|
|
conf->prev_chunk_sectors)
|
|
|
|
/ STRIPE_SECTORS,
|
|
|
|
GFP_KERNEL);
|
2014-02-06 05:12:45 +07:00
|
|
|
|
|
|
|
if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) {
|
|
|
|
free_scratch_buffer(conf, percpu);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-08-18 19:57:24 +07:00
|
|
|
static int raid456_cpu_dead(unsigned int cpu, struct hlist_node *node)
|
2009-07-15 01:48:22 +07:00
|
|
|
{
|
2016-08-18 19:57:24 +07:00
|
|
|
struct r5conf *conf = hlist_entry_safe(node, struct r5conf, node);
|
|
|
|
|
|
|
|
free_scratch_buffer(conf, per_cpu_ptr(conf->percpu, cpu));
|
|
|
|
return 0;
|
|
|
|
}
|
2009-07-15 01:48:22 +07:00
|
|
|
|
2016-08-18 19:57:24 +07:00
|
|
|
static void raid5_free_percpu(struct r5conf *conf)
|
|
|
|
{
|
2009-07-15 01:48:22 +07:00
|
|
|
if (!conf->percpu)
|
|
|
|
return;
|
|
|
|
|
2016-08-18 19:57:24 +07:00
|
|
|
cpuhp_state_remove_instance(CPUHP_MD_RAID5_PREPARE, &conf->node);
|
2009-07-15 01:48:22 +07:00
|
|
|
free_percpu(conf->percpu);
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void free_conf(struct r5conf *conf)
|
2009-07-31 09:39:15 +07:00
|
|
|
{
|
2016-11-24 13:50:39 +07:00
|
|
|
int i;
|
|
|
|
|
2017-03-09 15:59:58 +07:00
|
|
|
log_exit(conf);
|
|
|
|
|
2017-12-24 01:20:31 +07:00
|
|
|
unregister_shrinker(&conf->shrinker);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
free_thread_groups(conf);
|
2009-07-31 09:39:15 +07:00
|
|
|
shrink_stripes(conf);
|
2009-07-15 01:48:22 +07:00
|
|
|
raid5_free_percpu(conf);
|
2016-11-24 13:50:39 +07:00
|
|
|
for (i = 0; i < conf->pool_size; i++)
|
|
|
|
if (conf->disks[i].extra_page)
|
|
|
|
put_page(conf->disks[i].extra_page);
|
2009-07-31 09:39:15 +07:00
|
|
|
kfree(conf->disks);
|
2018-05-21 05:25:52 +07:00
|
|
|
bioset_exit(&conf->bio_split);
|
2009-07-31 09:39:15 +07:00
|
|
|
kfree(conf->stripe_hashtbl);
|
2017-03-04 13:06:12 +07:00
|
|
|
kfree(conf->pending_data);
|
2009-07-31 09:39:15 +07:00
|
|
|
kfree(conf);
|
|
|
|
}
|
|
|
|
|
2016-08-18 19:57:24 +07:00
|
|
|
static int raid456_cpu_up_prepare(unsigned int cpu, struct hlist_node *node)
|
2009-07-15 01:48:22 +07:00
|
|
|
{
|
2016-08-18 19:57:24 +07:00
|
|
|
struct r5conf *conf = hlist_entry_safe(node, struct r5conf, node);
|
2009-07-15 01:48:22 +07:00
|
|
|
struct raid5_percpu *percpu = per_cpu_ptr(conf->percpu, cpu);
|
|
|
|
|
2016-08-18 19:57:24 +07:00
|
|
|
if (alloc_scratch_buffer(conf, percpu)) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("%s: failed memory allocation for cpu%u\n",
|
|
|
|
__func__, cpu);
|
2016-08-18 19:57:24 +07:00
|
|
|
return -ENOMEM;
|
2009-07-15 01:48:22 +07:00
|
|
|
}
|
2016-08-18 19:57:24 +07:00
|
|
|
return 0;
|
2009-07-15 01:48:22 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static int raid5_alloc_percpu(struct r5conf *conf)
|
2009-07-15 01:48:22 +07:00
|
|
|
{
|
2014-02-06 05:12:45 +07:00
|
|
|
int err = 0;
|
2009-07-15 01:48:22 +07:00
|
|
|
|
2014-02-06 05:12:45 +07:00
|
|
|
conf->percpu = alloc_percpu(struct raid5_percpu);
|
|
|
|
if (!conf->percpu)
|
2009-07-15 01:48:22 +07:00
|
|
|
return -ENOMEM;
|
2014-02-06 05:12:45 +07:00
|
|
|
|
2016-08-18 19:57:24 +07:00
|
|
|
err = cpuhp_state_add_instance(CPUHP_MD_RAID5_PREPARE, &conf->node);
|
2016-02-25 08:38:28 +07:00
|
|
|
if (!err) {
|
|
|
|
conf->scribble_disks = max(conf->raid_disks,
|
|
|
|
conf->previous_raid_disks);
|
|
|
|
conf->scribble_sectors = max(conf->chunk_sectors,
|
|
|
|
conf->prev_chunk_sectors);
|
|
|
|
}
|
2009-07-15 01:48:22 +07:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2015-02-26 08:47:56 +07:00
|
|
|
static unsigned long raid5_cache_scan(struct shrinker *shrink,
|
|
|
|
struct shrink_control *sc)
|
|
|
|
{
|
|
|
|
struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
|
2015-07-06 09:49:23 +07:00
|
|
|
unsigned long ret = SHRINK_STOP;
|
|
|
|
|
|
|
|
if (mutex_trylock(&conf->cache_size_mutex)) {
|
|
|
|
ret= 0;
|
2015-08-03 14:09:57 +07:00
|
|
|
while (ret < sc->nr_to_scan &&
|
|
|
|
conf->max_nr_stripes > conf->min_nr_stripes) {
|
2015-07-06 09:49:23 +07:00
|
|
|
if (drop_one_stripe(conf) == 0) {
|
|
|
|
ret = SHRINK_STOP;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret++;
|
|
|
|
}
|
|
|
|
mutex_unlock(&conf->cache_size_mutex);
|
2015-02-26 08:47:56 +07:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long raid5_cache_count(struct shrinker *shrink,
|
|
|
|
struct shrink_control *sc)
|
|
|
|
{
|
|
|
|
struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
|
|
|
|
|
|
|
|
if (conf->max_nr_stripes < conf->min_nr_stripes)
|
|
|
|
/* unlikely, but not impossible */
|
|
|
|
return 0;
|
|
|
|
return conf->max_nr_stripes - conf->min_nr_stripes;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static struct r5conf *setup_conf(struct mddev *mddev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf;
|
2009-10-16 12:35:30 +07:00
|
|
|
int raid_disk, memory, max_disks;
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev;
|
2005-04-17 05:20:36 +07:00
|
|
|
struct disk_info *disk;
|
2012-07-03 12:56:52 +07:00
|
|
|
char pers_name[6];
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
int i;
|
2013-11-14 11:16:20 +07:00
|
|
|
int group_cnt, worker_cnt_per_group;
|
|
|
|
struct r5worker_group *new_group;
|
2018-05-21 05:25:52 +07:00
|
|
|
int ret;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-03-31 10:39:39 +07:00
|
|
|
if (mddev->new_level != 5
|
|
|
|
&& mddev->new_level != 4
|
|
|
|
&& mddev->new_level != 6) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: raid level not set to 4/5/6 (%d)\n",
|
|
|
|
mdname(mddev), mddev->new_level);
|
2009-03-31 10:39:39 +07:00
|
|
|
return ERR_PTR(-EIO);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2009-03-31 10:39:39 +07:00
|
|
|
if ((mddev->new_level == 5
|
|
|
|
&& !algorithm_valid_raid5(mddev->new_layout)) ||
|
|
|
|
(mddev->new_level == 6
|
|
|
|
&& !algorithm_valid_raid6(mddev->new_layout))) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: layout %d not supported\n",
|
|
|
|
mdname(mddev), mddev->new_layout);
|
2009-03-31 10:39:39 +07:00
|
|
|
return ERR_PTR(-EIO);
|
2009-03-31 10:39:38 +07:00
|
|
|
}
|
2009-03-31 10:39:39 +07:00
|
|
|
if (mddev->new_level == 6 && mddev->raid_disks < 4) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: not enough configured devices (%d, minimum 4)\n",
|
|
|
|
mdname(mddev), mddev->raid_disks);
|
2009-03-31 10:39:39 +07:00
|
|
|
return ERR_PTR(-EINVAL);
|
2008-10-13 07:55:12 +07:00
|
|
|
}
|
|
|
|
|
2009-06-18 05:45:27 +07:00
|
|
|
if (!mddev->new_chunk_sectors ||
|
|
|
|
(mddev->new_chunk_sectors << 9) % PAGE_SIZE ||
|
|
|
|
!is_power_of_2(mddev->new_chunk_sectors)) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: invalid chunk size %d\n",
|
|
|
|
mdname(mddev), mddev->new_chunk_sectors << 9);
|
2009-03-31 10:39:39 +07:00
|
|
|
return ERR_PTR(-EINVAL);
|
2006-03-27 16:18:11 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
conf = kzalloc(sizeof(struct r5conf), GFP_KERNEL);
|
2009-03-31 10:39:39 +07:00
|
|
|
if (conf == NULL)
|
2005-04-17 05:20:36 +07:00
|
|
|
goto abort;
|
2017-03-04 13:06:12 +07:00
|
|
|
INIT_LIST_HEAD(&conf->free_list);
|
|
|
|
INIT_LIST_HEAD(&conf->pending_list);
|
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a * b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kzalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 04:03:40 +07:00
|
|
|
conf->pending_data = kcalloc(PENDING_IO_MAX,
|
|
|
|
sizeof(struct r5pending_data),
|
|
|
|
GFP_KERNEL);
|
2017-03-04 13:06:12 +07:00
|
|
|
if (!conf->pending_data)
|
|
|
|
goto abort;
|
|
|
|
for (i = 0; i < PENDING_IO_MAX; i++)
|
|
|
|
list_add(&conf->pending_data[i].sibling, &conf->free_list);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
/* Don't enable multi-threading by default*/
|
2013-11-14 11:16:20 +07:00
|
|
|
if (!alloc_thread_groups(conf, 0, &group_cnt, &worker_cnt_per_group,
|
|
|
|
&new_group)) {
|
|
|
|
conf->group_cnt = group_cnt;
|
|
|
|
conf->worker_cnt_per_group = worker_cnt_per_group;
|
|
|
|
conf->worker_groups = new_group;
|
|
|
|
} else
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
goto abort;
|
2009-10-16 11:55:38 +07:00
|
|
|
spin_lock_init(&conf->device_lock);
|
2013-08-27 12:52:13 +07:00
|
|
|
seqcount_init(&conf->gen_lock);
|
2015-07-06 09:49:23 +07:00
|
|
|
mutex_init(&conf->cache_size_mutex);
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 15:19:06 +07:00
|
|
|
init_waitqueue_head(&conf->wait_for_quiescent);
|
2016-02-26 07:24:42 +07:00
|
|
|
init_waitqueue_head(&conf->wait_for_stripe);
|
2009-10-16 11:55:38 +07:00
|
|
|
init_waitqueue_head(&conf->wait_for_overlap);
|
|
|
|
INIT_LIST_HEAD(&conf->handle_list);
|
2017-02-16 10:37:32 +07:00
|
|
|
INIT_LIST_HEAD(&conf->loprio_list);
|
2009-10-16 11:55:38 +07:00
|
|
|
INIT_LIST_HEAD(&conf->hold_list);
|
|
|
|
INIT_LIST_HEAD(&conf->delayed_list);
|
|
|
|
INIT_LIST_HEAD(&conf->bitmap_list);
|
2013-08-27 16:50:39 +07:00
|
|
|
init_llist_head(&conf->released_stripes);
|
2009-10-16 11:55:38 +07:00
|
|
|
atomic_set(&conf->active_stripes, 0);
|
|
|
|
atomic_set(&conf->preread_active_stripes, 0);
|
|
|
|
atomic_set(&conf->active_aligned_reads, 0);
|
raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.
To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.
Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.
My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:
without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.
Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 00:33:23 +07:00
|
|
|
spin_lock_init(&conf->pending_bios_lock);
|
|
|
|
conf->batch_bio_dispatch = true;
|
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
if (test_bit(Journal, &rdev->flags))
|
|
|
|
continue;
|
|
|
|
if (blk_queue_nonrot(bdev_get_queue(rdev->bdev))) {
|
|
|
|
conf->batch_bio_dispatch = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-10-16 11:55:38 +07:00
|
|
|
conf->bypass_threshold = BYPASS_THRESHOLD;
|
2011-10-26 07:54:39 +07:00
|
|
|
conf->recovery_disabled = mddev->recovery_disabled - 1;
|
2009-03-31 10:39:39 +07:00
|
|
|
|
|
|
|
conf->raid_disks = mddev->raid_disks;
|
|
|
|
if (mddev->reshape_position == MaxSector)
|
|
|
|
conf->previous_raid_disks = mddev->raid_disks;
|
|
|
|
else
|
2006-03-27 16:18:11 +07:00
|
|
|
conf->previous_raid_disks = mddev->raid_disks - mddev->delta_disks;
|
2009-10-16 12:35:30 +07:00
|
|
|
max_disks = max(conf->raid_disks, conf->previous_raid_disks);
|
2006-03-27 16:18:11 +07:00
|
|
|
|
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a * b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kzalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 04:03:40 +07:00
|
|
|
conf->disks = kcalloc(max_disks, sizeof(struct disk_info),
|
2006-03-27 16:18:06 +07:00
|
|
|
GFP_KERNEL);
|
2016-11-24 13:50:39 +07:00
|
|
|
|
2006-03-27 16:18:06 +07:00
|
|
|
if (!conf->disks)
|
|
|
|
goto abort;
|
2006-01-06 15:20:32 +07:00
|
|
|
|
2016-11-24 13:50:39 +07:00
|
|
|
for (i = 0; i < max_disks; i++) {
|
|
|
|
conf->disks[i].extra_page = alloc_page(GFP_KERNEL);
|
|
|
|
if (!conf->disks[i].extra_page)
|
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
|
2018-05-21 05:25:52 +07:00
|
|
|
ret = bioset_init(&conf->bio_split, BIO_POOL_SIZE, 0, 0);
|
|
|
|
if (ret)
|
2017-04-05 11:05:51 +07:00
|
|
|
goto abort;
|
2005-04-17 05:20:36 +07:00
|
|
|
conf->mddev = mddev;
|
|
|
|
|
2006-01-06 15:20:33 +07:00
|
|
|
if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
|
2005-04-17 05:20:36 +07:00
|
|
|
goto abort;
|
|
|
|
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
/* We init hash_locks[0] separately to that it can be used
|
|
|
|
* as the reference lock in the spin_lock_nest_lock() call
|
|
|
|
* in lock_all_device_hash_locks_irq in order to convince
|
|
|
|
* lockdep that we know what we are doing.
|
|
|
|
*/
|
|
|
|
spin_lock_init(conf->hash_locks);
|
|
|
|
for (i = 1; i < NR_STRIPE_HASH_LOCKS; i++)
|
|
|
|
spin_lock_init(conf->hash_locks + i);
|
|
|
|
|
|
|
|
for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
|
|
|
|
INIT_LIST_HEAD(conf->inactive_list + i);
|
|
|
|
|
|
|
|
for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
|
|
|
|
INIT_LIST_HEAD(conf->temp_inactive_list + i);
|
|
|
|
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
atomic_set(&conf->r5c_cached_full_stripes, 0);
|
|
|
|
INIT_LIST_HEAD(&conf->r5c_full_stripe_list);
|
|
|
|
atomic_set(&conf->r5c_cached_partial_stripes, 0);
|
|
|
|
INIT_LIST_HEAD(&conf->r5c_partial_stripe_list);
|
2017-02-11 07:18:09 +07:00
|
|
|
atomic_set(&conf->r5c_flushing_full_stripes, 0);
|
|
|
|
atomic_set(&conf->r5c_flushing_partial_stripes, 0);
|
md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
Writing-out phase of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->orig_page and the old data
is read into it. Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.
r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more than necessary data
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:39 +07:00
|
|
|
|
2009-07-15 01:48:22 +07:00
|
|
|
conf->level = mddev->new_level;
|
2014-12-15 08:57:02 +07:00
|
|
|
conf->chunk_sectors = mddev->new_chunk_sectors;
|
2009-07-15 01:48:22 +07:00
|
|
|
if (raid5_alloc_percpu(conf) != 0)
|
|
|
|
goto abort;
|
|
|
|
|
2010-05-03 11:09:02 +07:00
|
|
|
pr_debug("raid456: run(%s) called.\n", mdname(mddev));
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-03-19 08:46:39 +07:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2005-04-17 05:20:36 +07:00
|
|
|
raid_disk = rdev->raid_disk;
|
2009-10-16 12:35:30 +07:00
|
|
|
if (raid_disk >= max_disks
|
2015-10-09 11:54:12 +07:00
|
|
|
|| raid_disk < 0 || test_bit(Journal, &rdev->flags))
|
2005-04-17 05:20:36 +07:00
|
|
|
continue;
|
|
|
|
disk = conf->disks + raid_disk;
|
|
|
|
|
2011-12-23 06:17:53 +07:00
|
|
|
if (test_bit(Replacement, &rdev->flags)) {
|
|
|
|
if (disk->replacement)
|
|
|
|
goto abort;
|
|
|
|
disk->replacement = rdev;
|
|
|
|
} else {
|
|
|
|
if (disk->rdev)
|
|
|
|
goto abort;
|
|
|
|
disk->rdev = rdev;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2005-11-09 12:39:31 +07:00
|
|
|
if (test_bit(In_sync, &rdev->flags)) {
|
2005-04-17 05:20:36 +07:00
|
|
|
char b[BDEVNAME_SIZE];
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_info("md/raid:%s: device %s operational as raid disk %d\n",
|
|
|
|
mdname(mddev), bdevname(rdev->bdev, b), raid_disk);
|
2011-06-09 06:00:28 +07:00
|
|
|
} else if (rdev->saved_raid_disk != raid_disk)
|
Ensure interrupted recovery completed properly (v1 metadata plus bitmap)
If, while assembling an array, we find a device which is not fully
in-sync with the array, it is important to set the "fullsync" flags.
This is an exact analog to the setting of this flag in hot_add_disk
methods.
Currently, only v1.x metadata supports having devices in an array
which are not fully in-sync (it keep track of how in sync they are).
The 'fullsync' flag only makes a difference when a write-intent bitmap
is being used. In this case it tells recovery to ignore the bitmap
and recovery all blocks.
This fix is already in place for raid1, but not raid5/6 or raid10.
So without this fix, a raid1 ir raid4/5/6 array with version 1.x
metadata and a write intent bitmaps, that is stopped in the middle
of a recovery, will appear to complete the recovery instantly
after it is reassembled, but the recovery will not be correct.
If you might have an array like that, issueing
echo repair > /sys/block/mdXX/md/sync_action
will make sure recovery completes properly.
Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
2008-06-28 05:30:52 +07:00
|
|
|
/* Cannot rely on bitmap to complete recovery */
|
|
|
|
conf->fullsync = 1;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2009-03-31 10:39:39 +07:00
|
|
|
conf->level = mddev->new_level;
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
if (conf->level == 6) {
|
2006-06-26 14:27:38 +07:00
|
|
|
conf->max_degraded = 2;
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
if (raid6_call.xor_syndrome)
|
|
|
|
conf->rmw_level = PARITY_ENABLE_RMW;
|
|
|
|
else
|
|
|
|
conf->rmw_level = PARITY_DISABLE_RMW;
|
|
|
|
} else {
|
2006-06-26 14:27:38 +07:00
|
|
|
conf->max_degraded = 1;
|
md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.
1) Enable xor_syndrome() in the async layer.
2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.
3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.
4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.
5) Adapt the several places where we ignored Q handling up to now.
Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s
Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-15 08:57:05 +07:00
|
|
|
conf->rmw_level = PARITY_ENABLE_RMW;
|
|
|
|
}
|
2009-03-31 10:39:39 +07:00
|
|
|
conf->algorithm = mddev->new_layout;
|
2009-03-31 11:16:46 +07:00
|
|
|
conf->reshape_progress = mddev->reshape_position;
|
2009-03-31 11:20:22 +07:00
|
|
|
if (conf->reshape_progress != MaxSector) {
|
2009-06-18 05:45:55 +07:00
|
|
|
conf->prev_chunk_sectors = mddev->chunk_sectors;
|
2009-03-31 11:20:22 +07:00
|
|
|
conf->prev_algo = mddev->layout;
|
2015-07-17 09:17:50 +07:00
|
|
|
} else {
|
|
|
|
conf->prev_chunk_sectors = conf->chunk_sectors;
|
|
|
|
conf->prev_algo = conf->algorithm;
|
2009-03-31 11:20:22 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-02-26 08:47:56 +07:00
|
|
|
conf->min_nr_stripes = NR_STRIPES;
|
2016-08-31 00:29:33 +07:00
|
|
|
if (mddev->reshape_position != MaxSector) {
|
|
|
|
int stripes = max_t(int,
|
|
|
|
((mddev->chunk_sectors << 9) / STRIPE_SIZE) * 4,
|
|
|
|
((mddev->new_chunk_sectors << 9) / STRIPE_SIZE) * 4);
|
|
|
|
conf->min_nr_stripes = max(NR_STRIPES, stripes);
|
|
|
|
if (conf->min_nr_stripes != NR_STRIPES)
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_info("md/raid:%s: force stripe size %d for reshape\n",
|
2016-08-31 00:29:33 +07:00
|
|
|
mdname(mddev), conf->min_nr_stripes);
|
|
|
|
}
|
2015-02-26 08:47:56 +07:00
|
|
|
memory = conf->min_nr_stripes * (sizeof(struct stripe_head) +
|
2009-10-16 12:35:30 +07:00
|
|
|
max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
|
2013-11-14 11:16:17 +07:00
|
|
|
atomic_set(&conf->empty_inactive_list_nr, NR_STRIPE_HASH_LOCKS);
|
2015-02-26 08:47:56 +07:00
|
|
|
if (grow_stripes(conf, conf->min_nr_stripes)) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: couldn't allocate %dkB for buffers\n",
|
|
|
|
mdname(mddev), memory);
|
2009-03-31 10:39:39 +07:00
|
|
|
goto abort;
|
|
|
|
} else
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_debug("md/raid:%s: allocated %dkB\n", mdname(mddev), memory);
|
2015-02-26 08:47:56 +07:00
|
|
|
/*
|
|
|
|
* Losing a stripe head costs more than the time to refill it,
|
|
|
|
* it reduces the queue depth and so can hurt throughput.
|
|
|
|
* So set it rather large, scaled by number of devices.
|
|
|
|
*/
|
|
|
|
conf->shrinker.seeks = DEFAULT_SEEKS * conf->raid_disks * 4;
|
|
|
|
conf->shrinker.scan_objects = raid5_cache_scan;
|
|
|
|
conf->shrinker.count_objects = raid5_cache_count;
|
|
|
|
conf->shrinker.batch = 128;
|
|
|
|
conf->shrinker.flags = 0;
|
2016-09-20 09:33:57 +07:00
|
|
|
if (register_shrinker(&conf->shrinker)) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: couldn't register shrinker.\n",
|
|
|
|
mdname(mddev));
|
2016-09-20 09:33:57 +07:00
|
|
|
goto abort;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-07-03 12:56:52 +07:00
|
|
|
sprintf(pers_name, "raid%d", mddev->new_level);
|
|
|
|
conf->thread = md_register_thread(raid5d, mddev, pers_name);
|
2009-03-31 10:39:39 +07:00
|
|
|
if (!conf->thread) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: couldn't allocate thread.\n",
|
|
|
|
mdname(mddev));
|
2006-06-26 14:27:38 +07:00
|
|
|
goto abort;
|
|
|
|
}
|
2009-03-31 10:39:39 +07:00
|
|
|
|
|
|
|
return conf;
|
|
|
|
|
|
|
|
abort:
|
|
|
|
if (conf) {
|
2009-07-31 09:39:15 +07:00
|
|
|
free_conf(conf);
|
2009-03-31 10:39:39 +07:00
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
} else
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
|
2009-11-13 13:47:00 +07:00
|
|
|
static int only_parity(int raid_disk, int algo, int raid_disks, int max_degraded)
|
|
|
|
{
|
|
|
|
switch (algo) {
|
|
|
|
case ALGORITHM_PARITY_0:
|
|
|
|
if (raid_disk < max_degraded)
|
|
|
|
return 1;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_N:
|
|
|
|
if (raid_disk >= raid_disks - max_degraded)
|
|
|
|
return 1;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_0_6:
|
2014-09-30 11:23:59 +07:00
|
|
|
if (raid_disk == 0 ||
|
2009-11-13 13:47:00 +07:00
|
|
|
raid_disk == raid_disks - 1)
|
|
|
|
return 1;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC_6:
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC_6:
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC_6:
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC_6:
|
|
|
|
if (raid_disk == raid_disks - 1)
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-01-21 04:52:20 +07:00
|
|
|
static int raid5_run(struct mddev *mddev)
|
2009-03-31 10:39:39 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf;
|
2010-07-26 09:04:13 +07:00
|
|
|
int working_disks = 0;
|
2009-11-13 13:47:00 +07:00
|
|
|
int dirty_parity_disks = 0;
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev;
|
2015-08-14 04:32:03 +07:00
|
|
|
struct md_rdev *journal_dev = NULL;
|
2009-11-13 13:47:00 +07:00
|
|
|
sector_t reshape_offset = 0;
|
2011-12-23 06:17:53 +07:00
|
|
|
int i;
|
2012-05-21 06:27:01 +07:00
|
|
|
long long min_offset_diff = 0;
|
|
|
|
int first = 1;
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2017-06-05 13:05:13 +07:00
|
|
|
if (mddev_init_writes_pending(mddev) < 0)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2009-06-18 05:48:06 +07:00
|
|
|
if (mddev->recovery_cp != MaxSector)
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_notice("md/raid:%s: not clean -- starting background reconstruction\n",
|
|
|
|
mdname(mddev));
|
2012-05-21 06:27:01 +07:00
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev) {
|
|
|
|
long long diff;
|
2015-08-14 04:32:03 +07:00
|
|
|
|
2015-10-09 11:54:12 +07:00
|
|
|
if (test_bit(Journal, &rdev->flags)) {
|
2015-08-14 04:32:03 +07:00
|
|
|
journal_dev = rdev;
|
2015-10-09 11:54:12 +07:00
|
|
|
continue;
|
|
|
|
}
|
2012-05-21 06:27:01 +07:00
|
|
|
if (rdev->raid_disk < 0)
|
|
|
|
continue;
|
|
|
|
diff = (rdev->new_data_offset - rdev->data_offset);
|
|
|
|
if (first) {
|
|
|
|
min_offset_diff = diff;
|
|
|
|
first = 0;
|
|
|
|
} else if (mddev->reshape_backwards &&
|
|
|
|
diff < min_offset_diff)
|
|
|
|
min_offset_diff = diff;
|
|
|
|
else if (!mddev->reshape_backwards &&
|
|
|
|
diff > min_offset_diff)
|
|
|
|
min_offset_diff = diff;
|
|
|
|
}
|
|
|
|
|
2017-10-17 10:24:09 +07:00
|
|
|
if ((test_bit(MD_HAS_JOURNAL, &mddev->flags) || journal_dev) &&
|
|
|
|
(mddev->bitmap_info.offset || mddev->bitmap_info.file)) {
|
|
|
|
pr_notice("md/raid:%s: array cannot have both journal and bitmap\n",
|
|
|
|
mdname(mddev));
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2009-03-31 10:39:39 +07:00
|
|
|
if (mddev->reshape_position != MaxSector) {
|
|
|
|
/* Check that we can continue the reshape.
|
2012-05-21 06:27:01 +07:00
|
|
|
* Difficulties arise if the stripe we would write to
|
|
|
|
* next is at or after the stripe we would read from next.
|
|
|
|
* For a reshape that changes the number of devices, this
|
|
|
|
* is only possible for a very short time, and mdadm makes
|
|
|
|
* sure that time appears to have past before assembling
|
|
|
|
* the array. So we fail if that time hasn't passed.
|
|
|
|
* For a reshape that keeps the number of devices the same
|
|
|
|
* mdadm must be monitoring the reshape can keeping the
|
|
|
|
* critical areas read-only and backed up. It will start
|
|
|
|
* the array in read-only mode, so we check for that.
|
2009-03-31 10:39:39 +07:00
|
|
|
*/
|
|
|
|
sector_t here_new, here_old;
|
|
|
|
int old_disks;
|
2009-03-31 11:00:56 +07:00
|
|
|
int max_degraded = (mddev->level == 6 ? 2 : 1);
|
2015-07-15 14:36:21 +07:00
|
|
|
int chunk_sectors;
|
|
|
|
int new_data_disks;
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2015-08-14 04:32:03 +07:00
|
|
|
if (journal_dev) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: don't support reshape with journal - aborting.\n",
|
|
|
|
mdname(mddev));
|
2015-08-14 04:32:03 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2009-03-31 11:24:23 +07:00
|
|
|
if (mddev->new_level != mddev->level) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: unsupported reshape required - aborting.\n",
|
|
|
|
mdname(mddev));
|
2009-03-31 10:39:39 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
old_disks = mddev->raid_disks - mddev->delta_disks;
|
|
|
|
/* reshape_position must be on a new-stripe boundary, and one
|
|
|
|
* further up in new geometry must map after here in old
|
|
|
|
* geometry.
|
2015-07-15 14:36:21 +07:00
|
|
|
* If the chunk sizes are different, then as we perform reshape
|
|
|
|
* in units of the largest of the two, reshape_position needs
|
|
|
|
* be a multiple of the largest chunk size times new data disks.
|
2009-03-31 10:39:39 +07:00
|
|
|
*/
|
|
|
|
here_new = mddev->reshape_position;
|
2015-07-15 14:36:21 +07:00
|
|
|
chunk_sectors = max(mddev->chunk_sectors, mddev->new_chunk_sectors);
|
|
|
|
new_data_disks = mddev->raid_disks - max_degraded;
|
|
|
|
if (sector_div(here_new, chunk_sectors * new_data_disks)) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: reshape_position not on a stripe boundary\n",
|
|
|
|
mdname(mddev));
|
2009-03-31 10:39:39 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2015-07-15 14:36:21 +07:00
|
|
|
reshape_offset = here_new * chunk_sectors;
|
2009-03-31 10:39:39 +07:00
|
|
|
/* here_new is the stripe we will write to */
|
|
|
|
here_old = mddev->reshape_position;
|
2015-07-15 14:36:21 +07:00
|
|
|
sector_div(here_old, chunk_sectors * (old_disks-max_degraded));
|
2009-03-31 10:39:39 +07:00
|
|
|
/* here_old is the first stripe that we might need to read
|
|
|
|
* from */
|
2009-08-13 07:06:24 +07:00
|
|
|
if (mddev->delta_disks == 0) {
|
|
|
|
/* We cannot be sure it is safe to start an in-place
|
2012-05-21 06:27:01 +07:00
|
|
|
* reshape. It is only safe if user-space is monitoring
|
2009-08-13 07:06:24 +07:00
|
|
|
* and taking constant backups.
|
|
|
|
* mdadm always starts a situation like this in
|
|
|
|
* readonly mode so it can take control before
|
|
|
|
* allowing any writes. So just check for that.
|
|
|
|
*/
|
2012-05-21 06:27:01 +07:00
|
|
|
if (abs(min_offset_diff) >= mddev->chunk_sectors &&
|
|
|
|
abs(min_offset_diff) >= mddev->new_chunk_sectors)
|
|
|
|
/* not really in-place - so OK */;
|
|
|
|
else if (mddev->ro == 0) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: in-place reshape must be started in read-only mode - aborting\n",
|
|
|
|
mdname(mddev));
|
2009-08-13 07:06:24 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2012-05-21 06:27:00 +07:00
|
|
|
} else if (mddev->reshape_backwards
|
2015-07-15 14:36:21 +07:00
|
|
|
? (here_new * chunk_sectors + min_offset_diff <=
|
|
|
|
here_old * chunk_sectors)
|
|
|
|
: (here_new * chunk_sectors >=
|
|
|
|
here_old * chunk_sectors + (-min_offset_diff))) {
|
2009-03-31 10:39:39 +07:00
|
|
|
/* Reading from the same stripe as writing to - bad */
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: reshape_position too early for auto-recovery - aborting.\n",
|
|
|
|
mdname(mddev));
|
2009-03-31 10:39:39 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_debug("md/raid:%s: reshape will continue\n", mdname(mddev));
|
2009-03-31 10:39:39 +07:00
|
|
|
/* OK, we should be able to continue; */
|
|
|
|
} else {
|
|
|
|
BUG_ON(mddev->level != mddev->new_level);
|
|
|
|
BUG_ON(mddev->layout != mddev->new_layout);
|
2009-06-18 05:45:27 +07:00
|
|
|
BUG_ON(mddev->chunk_sectors != mddev->new_chunk_sectors);
|
2009-03-31 10:39:39 +07:00
|
|
|
BUG_ON(mddev->delta_disks != 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2017-03-09 15:59:59 +07:00
|
|
|
if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
|
|
|
|
test_bit(MD_HAS_PPL, &mddev->flags)) {
|
|
|
|
pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
|
|
|
|
mdname(mddev));
|
|
|
|
clear_bit(MD_HAS_PPL, &mddev->flags);
|
2017-08-16 22:13:45 +07:00
|
|
|
clear_bit(MD_HAS_MULTIPLE_PPLS, &mddev->flags);
|
2017-03-09 15:59:59 +07:00
|
|
|
}
|
|
|
|
|
2009-03-31 10:39:39 +07:00
|
|
|
if (mddev->private == NULL)
|
|
|
|
conf = setup_conf(mddev);
|
|
|
|
else
|
|
|
|
conf = mddev->private;
|
|
|
|
|
2009-03-31 10:39:39 +07:00
|
|
|
if (IS_ERR(conf))
|
|
|
|
return PTR_ERR(conf);
|
|
|
|
|
2016-08-20 05:34:01 +07:00
|
|
|
if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
|
|
|
|
if (!journal_dev) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: journal disk is missing, force array readonly\n",
|
|
|
|
mdname(mddev));
|
2016-08-20 05:34:01 +07:00
|
|
|
mddev->ro = 1;
|
|
|
|
set_disk_ro(mddev->gendisk, 1);
|
|
|
|
} else if (mddev->recovery_cp == MaxSector)
|
|
|
|
set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
|
2015-10-09 11:54:10 +07:00
|
|
|
}
|
|
|
|
|
2012-05-21 06:27:01 +07:00
|
|
|
conf->min_offset_diff = min_offset_diff;
|
2009-03-31 10:39:39 +07:00
|
|
|
mddev->thread = conf->thread;
|
|
|
|
conf->thread = NULL;
|
|
|
|
mddev->private = conf;
|
|
|
|
|
2011-12-23 06:17:53 +07:00
|
|
|
for (i = 0; i < conf->raid_disks && conf->previous_raid_disks;
|
|
|
|
i++) {
|
|
|
|
rdev = conf->disks[i].rdev;
|
|
|
|
if (!rdev && conf->disks[i].replacement) {
|
|
|
|
/* The replacement is all we have yet */
|
|
|
|
rdev = conf->disks[i].replacement;
|
|
|
|
conf->disks[i].replacement = NULL;
|
|
|
|
clear_bit(Replacement, &rdev->flags);
|
|
|
|
conf->disks[i].rdev = rdev;
|
|
|
|
}
|
|
|
|
if (!rdev)
|
2009-11-13 13:47:00 +07:00
|
|
|
continue;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (conf->disks[i].replacement &&
|
|
|
|
conf->reshape_progress != MaxSector) {
|
|
|
|
/* replacements and reshape simply do not mix. */
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md: cannot handle concurrent replacement and reshape.\n");
|
2011-12-23 06:17:53 +07:00
|
|
|
goto abort;
|
|
|
|
}
|
2010-06-17 14:41:03 +07:00
|
|
|
if (test_bit(In_sync, &rdev->flags)) {
|
2009-03-31 10:39:39 +07:00
|
|
|
working_disks++;
|
2010-06-17 14:41:03 +07:00
|
|
|
continue;
|
|
|
|
}
|
2009-11-13 13:47:00 +07:00
|
|
|
/* This disc is not fully in-sync. However if it
|
|
|
|
* just stored parity (beyond the recovery_offset),
|
|
|
|
* when we don't need to be concerned about the
|
|
|
|
* array being dirty.
|
|
|
|
* When reshape goes 'backwards', we never have
|
|
|
|
* partially completed devices, so we only need
|
|
|
|
* to worry about reshape going forwards.
|
|
|
|
*/
|
|
|
|
/* Hack because v0.91 doesn't store recovery_offset properly. */
|
|
|
|
if (mddev->major_version == 0 &&
|
|
|
|
mddev->minor_version > 90)
|
|
|
|
rdev->recovery_offset = reshape_offset;
|
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-12 21:37:43 +07:00
|
|
|
|
2009-11-13 13:47:00 +07:00
|
|
|
if (rdev->recovery_offset < reshape_offset) {
|
|
|
|
/* We need to check old and new layout */
|
|
|
|
if (!only_parity(rdev->raid_disk,
|
|
|
|
conf->algorithm,
|
|
|
|
conf->raid_disks,
|
|
|
|
conf->max_degraded))
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (!only_parity(rdev->raid_disk,
|
|
|
|
conf->prev_algo,
|
|
|
|
conf->previous_raid_disks,
|
|
|
|
conf->max_degraded))
|
|
|
|
continue;
|
|
|
|
dirty_parity_disks++;
|
|
|
|
}
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2011-12-23 06:17:53 +07:00
|
|
|
/*
|
|
|
|
* 0 for a fully functional array, 1 or 2 for a degraded array.
|
|
|
|
*/
|
2017-01-25 01:45:30 +07:00
|
|
|
mddev->degraded = raid5_calc_degraded(conf);
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2010-06-16 14:17:53 +07:00
|
|
|
if (has_failed(conf)) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_crit("md/raid:%s: not enough operational devices (%d/%d failed)\n",
|
2006-10-03 15:15:47 +07:00
|
|
|
mdname(mddev), mddev->degraded, conf->raid_disks);
|
2005-04-17 05:20:36 +07:00
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
|
2009-03-31 10:39:39 +07:00
|
|
|
/* device size must be a multiple of chunk size */
|
2009-06-18 05:45:01 +07:00
|
|
|
mddev->dev_sectors &= ~(mddev->chunk_sectors - 1);
|
2009-03-31 10:39:39 +07:00
|
|
|
mddev->resync_max_sectors = mddev->dev_sectors;
|
|
|
|
|
2009-11-13 13:47:00 +07:00
|
|
|
if (mddev->degraded > dirty_parity_disks &&
|
2005-04-17 05:20:36 +07:00
|
|
|
mddev->recovery_cp != MaxSector) {
|
2017-03-09 16:00:01 +07:00
|
|
|
if (test_bit(MD_HAS_PPL, &mddev->flags))
|
|
|
|
pr_crit("md/raid:%s: starting dirty degraded array with PPL.\n",
|
|
|
|
mdname(mddev));
|
|
|
|
else if (mddev->ok_start_degraded)
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_crit("md/raid:%s: starting dirty degraded array - data corruption possible.\n",
|
|
|
|
mdname(mddev));
|
2006-01-06 15:20:15 +07:00
|
|
|
else {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_crit("md/raid:%s: cannot start dirty degraded array.\n",
|
|
|
|
mdname(mddev));
|
2006-01-06 15:20:15 +07:00
|
|
|
goto abort;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_info("md/raid:%s: raid level %d active with %d out of %d devices, algorithm %d\n",
|
|
|
|
mdname(mddev), conf->level,
|
|
|
|
mddev->raid_disks-mddev->degraded, mddev->raid_disks,
|
|
|
|
mddev->new_layout);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
print_raid5_conf(conf);
|
|
|
|
|
2009-03-31 11:16:46 +07:00
|
|
|
if (conf->reshape_progress != MaxSector) {
|
|
|
|
conf->reshape_safe = conf->reshape_progress;
|
2006-03-27 16:18:11 +07:00
|
|
|
atomic_set(&conf->reshape_stripes, 0);
|
|
|
|
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
|
|
|
|
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
|
|
|
|
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
|
|
|
mddev->sync_thread = md_register_thread(md_do_sync, mddev,
|
2009-09-23 15:09:45 +07:00
|
|
|
"reshape");
|
2006-03-27 16:18:11 +07:00
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* Ok, everything is just fine now */
|
2010-04-14 14:15:37 +07:00
|
|
|
if (mddev->to_remove == &raid5_attrs_group)
|
|
|
|
mddev->to_remove = NULL;
|
2010-06-01 16:37:23 +07:00
|
|
|
else if (mddev->kobj.sd &&
|
|
|
|
sysfs_create_group(&mddev->kobj, &raid5_attrs_group))
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("raid5: failed to create sysfs attributes for %s\n",
|
|
|
|
mdname(mddev));
|
2010-06-01 16:37:28 +07:00
|
|
|
md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
|
2005-05-17 11:53:16 +07:00
|
|
|
|
2010-06-01 16:37:28 +07:00
|
|
|
if (mddev->queue) {
|
2010-07-26 09:04:13 +07:00
|
|
|
int chunk_size;
|
2010-06-01 16:37:28 +07:00
|
|
|
/* read-ahead size must cover two whole stripes, which
|
|
|
|
* is 2 * (datadisks) * chunksize where 'n' is the
|
|
|
|
* number of raid devices
|
|
|
|
*/
|
|
|
|
int data_disks = conf->previous_raid_disks - conf->max_degraded;
|
|
|
|
int stripe = data_disks *
|
|
|
|
((mddev->chunk_sectors << 9) / PAGE_SIZE);
|
2017-02-02 21:56:50 +07:00
|
|
|
if (mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
|
|
|
|
mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2010-07-26 09:04:13 +07:00
|
|
|
chunk_size = mddev->chunk_sectors << 9;
|
|
|
|
blk_queue_io_min(mddev->queue, chunk_size);
|
|
|
|
blk_queue_io_opt(mddev->queue, chunk_size *
|
|
|
|
(conf->raid_disks - conf->max_degraded));
|
2013-07-12 12:39:53 +07:00
|
|
|
mddev->queue->limits.raid_partial_stripes_expensive = 1;
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
/*
|
|
|
|
* We can only discard a whole stripe. It doesn't make sense to
|
|
|
|
* discard data disk but write parity disk
|
|
|
|
*/
|
|
|
|
stripe = stripe * PAGE_SIZE;
|
2012-11-19 09:11:26 +07:00
|
|
|
/* Round up to power of 2, as discard handling
|
|
|
|
* currently assumes that */
|
|
|
|
while ((stripe-1) & stripe)
|
|
|
|
stripe = (stripe | (stripe-1)) + 1;
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
mddev->queue->limits.discard_alignment = stripe;
|
|
|
|
mddev->queue->limits.discard_granularity = stripe;
|
2016-11-27 23:32:32 +07:00
|
|
|
|
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-12 21:37:43 +07:00
|
|
|
blk_queue_max_write_same_sectors(mddev->queue, 0);
|
2017-04-06 00:21:03 +07:00
|
|
|
blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
|
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-12 21:37:43 +07:00
|
|
|
|
2012-05-21 06:27:00 +07:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2010-07-26 09:04:13 +07:00
|
|
|
disk_stack_limits(mddev->gendisk, rdev->bdev,
|
|
|
|
rdev->data_offset << 9);
|
2012-05-21 06:27:00 +07:00
|
|
|
disk_stack_limits(mddev->gendisk, rdev->bdev,
|
|
|
|
rdev->new_data_offset << 9);
|
|
|
|
}
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
|
2017-04-06 00:21:23 +07:00
|
|
|
/*
|
|
|
|
* zeroing is required, otherwise data
|
|
|
|
* could be lost. Consider a scenario: discard a stripe
|
|
|
|
* (the stripe could be inconsistent if
|
|
|
|
* discard_zeroes_data is 0); write one disk of the
|
|
|
|
* stripe (the stripe could be inconsistent again
|
|
|
|
* depending on which disks are used to calculate
|
|
|
|
* parity); the disk is broken; The stripe data of this
|
|
|
|
* disk is lost.
|
|
|
|
*
|
|
|
|
* We only allow DISCARD if the sysadmin has confirmed that
|
|
|
|
* only safe devices are in use by setting a module parameter.
|
|
|
|
* A better idea might be to turn DISCARD into WRITE_ZEROES
|
|
|
|
* requests, as that is required to be safe.
|
|
|
|
*/
|
|
|
|
if (devices_handle_discard_safely &&
|
2016-02-17 04:44:24 +07:00
|
|
|
mddev->queue->limits.max_discard_sectors >= (stripe >> 9) &&
|
|
|
|
mddev->queue->limits.discard_granularity >= stripe)
|
2018-03-08 08:10:10 +07:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_DISCARD,
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
mddev->queue);
|
|
|
|
else
|
2018-03-08 08:10:10 +07:00
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_DISCARD,
|
MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk. To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.
Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.
The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.
For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard. Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.
If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.
If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 09:49:05 +07:00
|
|
|
mddev->queue);
|
2016-09-09 00:49:06 +07:00
|
|
|
|
|
|
|
blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);
|
2010-07-26 09:04:13 +07:00
|
|
|
}
|
2006-12-10 17:20:45 +07:00
|
|
|
|
2017-04-04 18:13:57 +07:00
|
|
|
if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
|
2017-03-09 15:59:58 +07:00
|
|
|
goto abort;
|
2015-08-14 04:32:04 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
abort:
|
2011-09-21 12:30:20 +07:00
|
|
|
md_unregister_thread(&mddev->thread);
|
2011-10-07 10:22:49 +07:00
|
|
|
print_raid5_conf(conf);
|
|
|
|
free_conf(conf);
|
2005-04-17 05:20:36 +07:00
|
|
|
mddev->private = NULL;
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: failed to run raid set.\n", mdname(mddev));
|
2005-04-17 05:20:36 +07:00
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
2014-12-15 08:56:58 +07:00
|
|
|
static void raid5_free(struct mddev *mddev, void *priv)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-12-15 08:56:58 +07:00
|
|
|
struct r5conf *conf = priv;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-07-31 09:39:15 +07:00
|
|
|
free_conf(conf);
|
2010-04-14 14:15:37 +07:00
|
|
|
mddev->to_remove = &raid5_attrs_group;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2016-01-21 04:52:20 +07:00
|
|
|
static void raid5_status(struct seq_file *seq, struct mddev *mddev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2005-04-17 05:20:36 +07:00
|
|
|
int i;
|
|
|
|
|
2009-06-18 05:45:01 +07:00
|
|
|
seq_printf(seq, " level %d, %dk chunk, algorithm %d", mddev->level,
|
2015-07-15 14:24:17 +07:00
|
|
|
conf->chunk_sectors / 2, mddev->layout);
|
2006-10-03 15:15:47 +07:00
|
|
|
seq_printf (seq, " [%d/%d] [", conf->raid_disks, conf->raid_disks - mddev->degraded);
|
2016-06-02 13:19:52 +07:00
|
|
|
rcu_read_lock();
|
|
|
|
for (i = 0; i < conf->raid_disks; i++) {
|
|
|
|
struct md_rdev *rdev = rcu_dereference(conf->disks[i].rdev);
|
|
|
|
seq_printf (seq, "%s", rdev && test_bit(In_sync, &rdev->flags) ? "U" : "_");
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2005-04-17 05:20:36 +07:00
|
|
|
seq_printf (seq, "]");
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:52 +07:00
|
|
|
static void print_raid5_conf (struct r5conf *conf)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct disk_info *tmp;
|
|
|
|
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_debug("RAID conf printout:\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
if (!conf) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_debug("(conf==NULL)\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
return;
|
|
|
|
}
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_debug(" --- level:%d rd:%d wd:%d\n", conf->level,
|
2010-05-03 11:09:02 +07:00
|
|
|
conf->raid_disks,
|
|
|
|
conf->raid_disks - conf->mddev->degraded);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
for (i = 0; i < conf->raid_disks; i++) {
|
|
|
|
char b[BDEVNAME_SIZE];
|
|
|
|
tmp = conf->disks + i;
|
|
|
|
if (tmp->rdev)
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_debug(" disk %d, o:%d, dev:%s\n",
|
2010-05-03 11:09:02 +07:00
|
|
|
i, !test_bit(Faulty, &tmp->rdev->flags),
|
|
|
|
bdevname(tmp->rdev->bdev, b));
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int raid5_spare_active(struct mddev *mddev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int i;
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2005-04-17 05:20:36 +07:00
|
|
|
struct disk_info *tmp;
|
2010-08-18 08:56:59 +07:00
|
|
|
int count = 0;
|
|
|
|
unsigned long flags;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
for (i = 0; i < conf->raid_disks; i++) {
|
|
|
|
tmp = conf->disks + i;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (tmp->replacement
|
|
|
|
&& tmp->replacement->recovery_offset == MaxSector
|
|
|
|
&& !test_bit(Faulty, &tmp->replacement->flags)
|
|
|
|
&& !test_and_set_bit(In_sync, &tmp->replacement->flags)) {
|
|
|
|
/* Replacement has just become active. */
|
|
|
|
if (!tmp->rdev
|
|
|
|
|| !test_and_clear_bit(In_sync, &tmp->rdev->flags))
|
|
|
|
count++;
|
|
|
|
if (tmp->rdev) {
|
|
|
|
/* Replaced device not technically faulty,
|
|
|
|
* but we need to be sure it gets removed
|
|
|
|
* and never re-added.
|
|
|
|
*/
|
|
|
|
set_bit(Faulty, &tmp->rdev->flags);
|
|
|
|
sysfs_notify_dirent_safe(
|
|
|
|
tmp->rdev->sysfs_state);
|
|
|
|
}
|
|
|
|
sysfs_notify_dirent_safe(tmp->replacement->sysfs_state);
|
|
|
|
} else if (tmp->rdev
|
2010-06-16 14:01:25 +07:00
|
|
|
&& tmp->rdev->recovery_offset == MaxSector
|
2005-11-09 12:39:31 +07:00
|
|
|
&& !test_bit(Faulty, &tmp->rdev->flags)
|
2006-10-03 15:15:53 +07:00
|
|
|
&& !test_and_set_bit(In_sync, &tmp->rdev->flags)) {
|
2010-08-18 08:56:59 +07:00
|
|
|
count++;
|
2011-01-14 05:14:33 +07:00
|
|
|
sysfs_notify_dirent_safe(tmp->rdev->sysfs_state);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
}
|
2010-08-18 08:56:59 +07:00
|
|
|
spin_lock_irqsave(&conf->device_lock, flags);
|
2017-01-25 01:45:30 +07:00
|
|
|
mddev->degraded = raid5_calc_degraded(conf);
|
2010-08-18 08:56:59 +07:00
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
print_raid5_conf(conf);
|
2010-08-18 08:56:59 +07:00
|
|
|
return count;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2011-12-23 06:17:51 +07:00
|
|
|
static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2005-04-17 05:20:36 +07:00
|
|
|
int err = 0;
|
2011-12-23 06:17:51 +07:00
|
|
|
int number = rdev->raid_disk;
|
2011-12-23 06:17:52 +07:00
|
|
|
struct md_rdev **rdevp;
|
2005-04-17 05:20:36 +07:00
|
|
|
struct disk_info *p = conf->disks + number;
|
|
|
|
|
|
|
|
print_raid5_conf(conf);
|
2015-12-21 06:51:02 +07:00
|
|
|
if (test_bit(Journal, &rdev->flags) && conf->log) {
|
2015-10-09 11:54:07 +07:00
|
|
|
/*
|
2015-12-21 06:51:02 +07:00
|
|
|
* we can't wait pending write here, as this is called in
|
|
|
|
* raid5d, wait will deadlock.
|
2017-03-15 10:05:14 +07:00
|
|
|
* neilb: there is no locking about new writes here,
|
|
|
|
* so this cannot be safe.
|
2015-10-09 11:54:07 +07:00
|
|
|
*/
|
md/r5cache: gracefully handle journal device errors for writeback mode
For the raid456 with writeback cache, when journal device failed during
normal operation, it is still possible to persist all data, as all
pending data is still in stripe cache. However, it is necessary to handle
journal failure gracefully.
During journal failures, the following logic handles the graceful shutdown
of journal:
1. raid5_error() marks the device as Faulty and schedules async work
log->disable_writeback_work;
2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
suspended, set to write through, and then resumed. mddev_suspend()
flushes all cached stripes;
3. All cached stripes need to be flushed carefully to the RAID array.
This patch fixes issues within the process above:
1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
journal failures;
2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
since raid5_error() updates superblock.
3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
to make progress during log_failed;
4. In delay_towrite(), if log failed only process data in the cache (skip
new writes in dev->towrite);
5. In __get_priority_stripe(), process loprio_list during journal device
failures.
6. In raid5_remove_disk(), wait for all cached stripes are flushed before
calling log_exit().
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-12 05:28:28 +07:00
|
|
|
if (atomic_read(&conf->active_stripes) ||
|
|
|
|
atomic_read(&conf->r5c_cached_full_stripes) ||
|
|
|
|
atomic_read(&conf->r5c_cached_partial_stripes)) {
|
2015-12-21 06:51:02 +07:00
|
|
|
return -EBUSY;
|
2017-03-15 10:05:14 +07:00
|
|
|
}
|
2017-03-09 15:59:58 +07:00
|
|
|
log_exit(conf);
|
2015-12-21 06:51:02 +07:00
|
|
|
return 0;
|
2015-10-09 11:54:07 +07:00
|
|
|
}
|
2011-12-23 06:17:52 +07:00
|
|
|
if (rdev == p->rdev)
|
|
|
|
rdevp = &p->rdev;
|
|
|
|
else if (rdev == p->replacement)
|
|
|
|
rdevp = &p->replacement;
|
|
|
|
else
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (number >= conf->raid_disks &&
|
|
|
|
conf->reshape_progress == MaxSector)
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
|
|
|
|
|
|
|
if (test_bit(In_sync, &rdev->flags) ||
|
|
|
|
atomic_read(&rdev->nr_pending)) {
|
|
|
|
err = -EBUSY;
|
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
/* Only remove non-faulty devices if recovery
|
|
|
|
* isn't possible.
|
|
|
|
*/
|
|
|
|
if (!test_bit(Faulty, &rdev->flags) &&
|
|
|
|
mddev->recovery_disabled != conf->recovery_disabled &&
|
|
|
|
!has_failed(conf) &&
|
2011-12-23 06:17:53 +07:00
|
|
|
(!p->replacement || p->replacement == rdev) &&
|
2011-12-23 06:17:52 +07:00
|
|
|
number < conf->raid_disks) {
|
|
|
|
err = -EBUSY;
|
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
*rdevp = NULL;
|
2016-06-02 13:19:53 +07:00
|
|
|
if (!test_bit(RemoveSynchronized, &rdev->flags)) {
|
|
|
|
synchronize_rcu();
|
|
|
|
if (atomic_read(&rdev->nr_pending)) {
|
|
|
|
/* lost the race, try later */
|
|
|
|
err = -EBUSY;
|
|
|
|
*rdevp = rdev;
|
|
|
|
}
|
|
|
|
}
|
2017-03-09 16:00:02 +07:00
|
|
|
if (!err) {
|
|
|
|
err = log_modify(conf, rdev, false);
|
|
|
|
if (err)
|
|
|
|
goto abort;
|
|
|
|
}
|
2016-06-02 13:19:53 +07:00
|
|
|
if (p->replacement) {
|
2011-12-23 06:17:53 +07:00
|
|
|
/* We must have just cleared 'rdev' */
|
|
|
|
p->rdev = p->replacement;
|
|
|
|
clear_bit(Replacement, &p->replacement->flags);
|
|
|
|
smp_mb(); /* Make sure other CPUs may see both as identical
|
|
|
|
* but will never see neither - if they are careful
|
|
|
|
*/
|
|
|
|
p->replacement = NULL;
|
2017-03-09 16:00:02 +07:00
|
|
|
|
|
|
|
if (!err)
|
|
|
|
err = log_modify(conf, p->rdev, true);
|
2017-04-24 14:58:04 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
clear_bit(WantReplacement, &rdev->flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
abort:
|
|
|
|
|
|
|
|
print_raid5_conf(conf);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2008-06-28 05:31:33 +07:00
|
|
|
int err = -EEXIST;
|
2005-04-17 05:20:36 +07:00
|
|
|
int disk;
|
|
|
|
struct disk_info *p;
|
2008-06-28 05:31:31 +07:00
|
|
|
int first = 0;
|
|
|
|
int last = conf->raid_disks - 1;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-12-21 06:51:02 +07:00
|
|
|
if (test_bit(Journal, &rdev->flags)) {
|
|
|
|
if (conf->log)
|
|
|
|
return -EBUSY;
|
|
|
|
|
|
|
|
rdev->raid_disk = 0;
|
|
|
|
/*
|
|
|
|
* The array is in readonly mode if journal is missing, so no
|
|
|
|
* write requests running. We should be safe
|
|
|
|
*/
|
2017-04-04 18:13:57 +07:00
|
|
|
log_init(conf, rdev, false);
|
2015-12-21 06:51:02 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2011-07-28 08:39:22 +07:00
|
|
|
if (mddev->recovery_disabled == conf->recovery_disabled)
|
|
|
|
return -EBUSY;
|
|
|
|
|
2012-03-19 08:46:37 +07:00
|
|
|
if (rdev->saved_raid_disk < 0 && has_failed(conf))
|
2005-04-17 05:20:36 +07:00
|
|
|
/* no point adding a device */
|
2008-06-28 05:31:33 +07:00
|
|
|
return -EINVAL;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-06-28 05:31:31 +07:00
|
|
|
if (rdev->raid_disk >= 0)
|
|
|
|
first = last = rdev->raid_disk;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
2006-06-26 14:27:38 +07:00
|
|
|
* find the disk ... but prefer rdev->saved_raid_disk
|
|
|
|
* if possible.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2006-06-26 14:27:38 +07:00
|
|
|
if (rdev->saved_raid_disk >= 0 &&
|
2008-06-28 05:31:31 +07:00
|
|
|
rdev->saved_raid_disk >= first &&
|
2006-06-26 14:27:38 +07:00
|
|
|
conf->disks[rdev->saved_raid_disk].rdev == NULL)
|
2012-07-03 08:46:53 +07:00
|
|
|
first = rdev->saved_raid_disk;
|
|
|
|
|
|
|
|
for (disk = first; disk <= last; disk++) {
|
2011-12-23 06:17:53 +07:00
|
|
|
p = conf->disks + disk;
|
|
|
|
if (p->rdev == NULL) {
|
2005-11-09 12:39:31 +07:00
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
rdev->raid_disk = disk;
|
2005-09-10 06:23:54 +07:00
|
|
|
if (rdev->saved_raid_disk != disk)
|
|
|
|
conf->fullsync = 1;
|
2005-11-09 12:39:27 +07:00
|
|
|
rcu_assign_pointer(p->rdev, rdev);
|
2017-03-09 16:00:02 +07:00
|
|
|
|
|
|
|
err = log_modify(conf, rdev, true);
|
|
|
|
|
2012-07-03 08:46:53 +07:00
|
|
|
goto out;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2012-07-03 08:46:53 +07:00
|
|
|
}
|
|
|
|
for (disk = first; disk <= last; disk++) {
|
|
|
|
p = conf->disks + disk;
|
2011-12-23 06:17:53 +07:00
|
|
|
if (test_bit(WantReplacement, &p->rdev->flags) &&
|
|
|
|
p->replacement == NULL) {
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
|
|
|
set_bit(Replacement, &rdev->flags);
|
|
|
|
rdev->raid_disk = disk;
|
|
|
|
err = 0;
|
|
|
|
conf->fullsync = 1;
|
|
|
|
rcu_assign_pointer(p->replacement, rdev);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2012-07-03 08:46:53 +07:00
|
|
|
out:
|
2005-04-17 05:20:36 +07:00
|
|
|
print_raid5_conf(conf);
|
2008-06-28 05:31:33 +07:00
|
|
|
return err;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int raid5_resize(struct mddev *mddev, sector_t sectors)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
/* no resync is happening, and there is enough space
|
|
|
|
* on all devices, so we can resize.
|
|
|
|
* We need to make sure resync covers any new space.
|
|
|
|
* If the array is shrinking we should possibly wait until
|
|
|
|
* any io in the removed space completes, but it hardly seems
|
|
|
|
* worth it.
|
|
|
|
*/
|
2012-05-22 10:55:27 +07:00
|
|
|
sector_t newsize;
|
2015-07-15 14:24:17 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
|
|
|
|
2018-08-30 01:05:42 +07:00
|
|
|
if (raid5_has_log(conf) || raid5_has_ppl(conf))
|
2015-08-14 04:32:03 +07:00
|
|
|
return -EINVAL;
|
2015-07-15 14:24:17 +07:00
|
|
|
sectors &= ~((sector_t)conf->chunk_sectors - 1);
|
2012-05-22 10:55:27 +07:00
|
|
|
newsize = raid5_size(mddev, sectors, mddev->raid_disks);
|
|
|
|
if (mddev->external_size &&
|
|
|
|
mddev->array_sectors > newsize)
|
2009-03-31 11:00:31 +07:00
|
|
|
return -EINVAL;
|
2012-05-22 10:55:27 +07:00
|
|
|
if (mddev->bitmap) {
|
2018-08-02 05:20:50 +07:00
|
|
|
int ret = md_bitmap_resize(mddev->bitmap, sectors, 0, 0);
|
2012-05-22 10:55:27 +07:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
md_set_array_sectors(mddev, newsize);
|
2011-05-11 12:52:21 +07:00
|
|
|
if (sectors > mddev->dev_sectors &&
|
|
|
|
mddev->recovery_cp > mddev->dev_sectors) {
|
2009-03-31 10:33:13 +07:00
|
|
|
mddev->recovery_cp = mddev->dev_sectors;
|
2005-04-17 05:20:36 +07:00
|
|
|
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
|
|
|
|
}
|
2009-03-31 10:33:13 +07:00
|
|
|
mddev->dev_sectors = sectors;
|
2005-07-28 01:43:28 +07:00
|
|
|
mddev->resync_max_sectors = sectors;
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int check_stripe_cache(struct mddev *mddev)
|
2009-06-18 05:47:20 +07:00
|
|
|
{
|
|
|
|
/* Can only proceed if there are plenty of stripe_heads.
|
|
|
|
* We need a minimum of one full stripe,, and for sensible progress
|
|
|
|
* it is best to have about 4 times that.
|
|
|
|
* If we require 4 times, then the default 256 4K stripe_heads will
|
|
|
|
* allow for chunk sizes up to 256K, which is probably OK.
|
|
|
|
* If the chunk size is greater, user-space should request more
|
|
|
|
* stripe_heads first.
|
|
|
|
*/
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2009-06-18 05:47:20 +07:00
|
|
|
if (((mddev->chunk_sectors << 9) / STRIPE_SIZE) * 4
|
2015-02-26 08:47:56 +07:00
|
|
|
> conf->min_nr_stripes ||
|
2009-06-18 05:47:20 +07:00
|
|
|
((mddev->new_chunk_sectors << 9) / STRIPE_SIZE) * 4
|
2015-02-26 08:47:56 +07:00
|
|
|
> conf->min_nr_stripes) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: reshape: not enough stripes. Needed %lu\n",
|
|
|
|
mdname(mddev),
|
|
|
|
((max(mddev->chunk_sectors, mddev->new_chunk_sectors) << 9)
|
|
|
|
/ STRIPE_SIZE)*4);
|
2009-06-18 05:47:20 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int check_reshape(struct mddev *mddev)
|
2006-03-27 16:18:10 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2006-03-27 16:18:10 +07:00
|
|
|
|
2018-08-30 01:05:42 +07:00
|
|
|
if (raid5_has_log(conf) || raid5_has_ppl(conf))
|
2015-08-14 04:32:03 +07:00
|
|
|
return -EINVAL;
|
2009-03-31 11:24:23 +07:00
|
|
|
if (mddev->delta_disks == 0 &&
|
|
|
|
mddev->new_layout == mddev->layout &&
|
2009-06-18 05:45:27 +07:00
|
|
|
mddev->new_chunk_sectors == mddev->chunk_sectors)
|
2009-06-18 05:47:55 +07:00
|
|
|
return 0; /* nothing to do */
|
2010-06-16 14:17:53 +07:00
|
|
|
if (has_failed(conf))
|
2009-03-31 11:17:38 +07:00
|
|
|
return -EINVAL;
|
2013-07-04 13:38:16 +07:00
|
|
|
if (mddev->delta_disks < 0 && mddev->reshape_position == MaxSector) {
|
2009-03-31 11:17:38 +07:00
|
|
|
/* We might be able to shrink, but the devices must
|
|
|
|
* be made bigger first.
|
|
|
|
* For raid6, 4 is the minimum size.
|
|
|
|
* Otherwise 2 is the minimum
|
|
|
|
*/
|
|
|
|
int min = 2;
|
|
|
|
if (mddev->level == 6)
|
|
|
|
min = 4;
|
|
|
|
if (mddev->raid_disks + mddev->delta_disks < min)
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2006-03-27 16:18:10 +07:00
|
|
|
|
2009-06-18 05:47:20 +07:00
|
|
|
if (!check_stripe_cache(mddev))
|
2006-03-27 16:18:10 +07:00
|
|
|
return -ENOSPC;
|
|
|
|
|
2015-05-08 15:19:39 +07:00
|
|
|
if (mddev->new_chunk_sectors > mddev->chunk_sectors ||
|
|
|
|
mddev->delta_disks > 0)
|
|
|
|
if (resize_chunks(conf,
|
|
|
|
conf->previous_raid_disks
|
|
|
|
+ max(0, mddev->delta_disks),
|
|
|
|
max(mddev->new_chunk_sectors,
|
|
|
|
mddev->chunk_sectors)
|
|
|
|
) < 0)
|
|
|
|
return -ENOMEM;
|
2017-04-04 18:13:57 +07:00
|
|
|
|
|
|
|
if (conf->previous_raid_disks + mddev->delta_disks <= conf->pool_size)
|
|
|
|
return 0; /* never bother to shrink */
|
2012-10-11 10:24:13 +07:00
|
|
|
return resize_stripes(conf, (conf->previous_raid_disks
|
|
|
|
+ mddev->delta_disks));
|
2006-03-27 16:18:13 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int raid5_start_reshape(struct mddev *mddev)
|
2006-03-27 16:18:13 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev;
|
2006-03-27 16:18:13 +07:00
|
|
|
int spares = 0;
|
2006-10-03 15:15:53 +07:00
|
|
|
unsigned long flags;
|
2006-03-27 16:18:13 +07:00
|
|
|
|
2007-03-01 11:11:53 +07:00
|
|
|
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
2006-03-27 16:18:13 +07:00
|
|
|
return -EBUSY;
|
|
|
|
|
2009-06-18 05:47:20 +07:00
|
|
|
if (!check_stripe_cache(mddev))
|
|
|
|
return -ENOSPC;
|
|
|
|
|
2012-05-22 10:55:28 +07:00
|
|
|
if (has_failed(conf))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2012-05-21 06:27:00 +07:00
|
|
|
rdev_for_each(rdev, mddev) {
|
2011-01-31 07:57:43 +07:00
|
|
|
if (!test_bit(In_sync, &rdev->flags)
|
|
|
|
&& !test_bit(Faulty, &rdev->flags))
|
2006-03-27 16:18:10 +07:00
|
|
|
spares++;
|
2012-05-21 06:27:00 +07:00
|
|
|
}
|
2006-03-27 16:18:13 +07:00
|
|
|
|
2007-03-01 11:11:53 +07:00
|
|
|
if (spares - mddev->degraded < mddev->delta_disks - conf->max_degraded)
|
2006-03-27 16:18:10 +07:00
|
|
|
/* Not enough devices even to make a degraded array
|
|
|
|
* of that size
|
|
|
|
*/
|
|
|
|
return -EINVAL;
|
|
|
|
|
2009-03-31 11:17:38 +07:00
|
|
|
/* Refuse to reduce size of the array. Any reductions in
|
|
|
|
* array size must be through explicit setting of array_size
|
|
|
|
* attribute.
|
|
|
|
*/
|
|
|
|
if (raid5_size(mddev, 0, conf->raid_disks + mddev->delta_disks)
|
|
|
|
< mddev->array_sectors) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: array size must be reduced before number of disks\n",
|
|
|
|
mdname(mddev));
|
2009-03-31 11:17:38 +07:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2006-03-27 16:18:11 +07:00
|
|
|
atomic_set(&conf->reshape_stripes, 0);
|
2006-03-27 16:18:10 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
2013-08-27 12:52:13 +07:00
|
|
|
write_seqcount_begin(&conf->gen_lock);
|
2006-03-27 16:18:10 +07:00
|
|
|
conf->previous_raid_disks = conf->raid_disks;
|
2006-03-27 16:18:13 +07:00
|
|
|
conf->raid_disks += mddev->delta_disks;
|
2009-06-18 05:45:55 +07:00
|
|
|
conf->prev_chunk_sectors = conf->chunk_sectors;
|
|
|
|
conf->chunk_sectors = mddev->new_chunk_sectors;
|
2009-03-31 11:24:23 +07:00
|
|
|
conf->prev_algo = conf->algorithm;
|
|
|
|
conf->algorithm = mddev->new_layout;
|
2012-05-21 06:27:00 +07:00
|
|
|
conf->generation++;
|
|
|
|
/* Code that selects data_offset needs to see the generation update
|
|
|
|
* if reshape_progress has been set - so a memory barrier needed.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
2012-05-21 06:27:00 +07:00
|
|
|
if (mddev->reshape_backwards)
|
2009-03-31 11:16:46 +07:00
|
|
|
conf->reshape_progress = raid5_size(mddev, 0, 0);
|
|
|
|
else
|
|
|
|
conf->reshape_progress = 0;
|
|
|
|
conf->reshape_safe = conf->reshape_progress;
|
2013-08-27 12:52:13 +07:00
|
|
|
write_seqcount_end(&conf->gen_lock);
|
2006-03-27 16:18:10 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
|
2013-08-27 12:57:47 +07:00
|
|
|
/* Now make sure any requests that proceeded on the assumption
|
|
|
|
* the reshape wasn't running - like Discard or Read - have
|
|
|
|
* completed.
|
|
|
|
*/
|
|
|
|
mddev_suspend(mddev);
|
|
|
|
mddev_resume(mddev);
|
|
|
|
|
2006-03-27 16:18:10 +07:00
|
|
|
/* Add some new drives, as many as will fit.
|
|
|
|
* We know there are enough to make the newly sized array work.
|
2010-06-17 14:48:26 +07:00
|
|
|
* Don't add devices if we are reducing the number of
|
|
|
|
* devices in the array. This is because it is not possible
|
|
|
|
* to correctly record the "partially reconstructed" state of
|
|
|
|
* such devices during the reshape and confusion could result.
|
2006-03-27 16:18:10 +07:00
|
|
|
*/
|
2011-01-31 07:57:43 +07:00
|
|
|
if (mddev->delta_disks >= 0) {
|
2012-03-19 08:46:39 +07:00
|
|
|
rdev_for_each(rdev, mddev)
|
2011-01-31 07:57:43 +07:00
|
|
|
if (rdev->raid_disk < 0 &&
|
|
|
|
!test_bit(Faulty, &rdev->flags)) {
|
|
|
|
if (raid5_add_disk(mddev, rdev) == 0) {
|
|
|
|
if (rdev->raid_disk
|
2012-03-13 07:21:21 +07:00
|
|
|
>= conf->previous_raid_disks)
|
2011-01-31 07:57:43 +07:00
|
|
|
set_bit(In_sync, &rdev->flags);
|
2012-03-13 07:21:21 +07:00
|
|
|
else
|
2011-01-31 07:57:43 +07:00
|
|
|
rdev->recovery_offset = 0;
|
2011-07-27 08:00:36 +07:00
|
|
|
|
|
|
|
if (sysfs_link_rdev(mddev, rdev))
|
2011-01-31 07:57:43 +07:00
|
|
|
/* Failure here is OK */;
|
2011-01-31 07:57:43 +07:00
|
|
|
}
|
2011-01-31 07:57:43 +07:00
|
|
|
} else if (rdev->raid_disk >= conf->previous_raid_disks
|
|
|
|
&& !test_bit(Faulty, &rdev->flags)) {
|
|
|
|
/* This is a spare that was manually added */
|
|
|
|
set_bit(In_sync, &rdev->flags);
|
|
|
|
}
|
2006-03-27 16:18:10 +07:00
|
|
|
|
2011-01-31 07:57:43 +07:00
|
|
|
/* When a reshape changes the number of devices,
|
|
|
|
* ->degraded is measured against the larger of the
|
|
|
|
* pre and post number of devices.
|
|
|
|
*/
|
2009-03-31 11:17:38 +07:00
|
|
|
spin_lock_irqsave(&conf->device_lock, flags);
|
2017-01-25 01:45:30 +07:00
|
|
|
mddev->degraded = raid5_calc_degraded(conf);
|
2009-03-31 11:17:38 +07:00
|
|
|
spin_unlock_irqrestore(&conf->device_lock, flags);
|
|
|
|
}
|
2006-03-27 16:18:13 +07:00
|
|
|
mddev->raid_disks = conf->raid_disks;
|
2009-08-03 07:59:57 +07:00
|
|
|
mddev->reshape_position = conf->reshape_progress;
|
2016-12-09 06:48:19 +07:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2006-03-27 16:18:11 +07:00
|
|
|
|
2006-03-27 16:18:10 +07:00
|
|
|
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
|
|
|
|
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
|
2015-06-12 17:05:04 +07:00
|
|
|
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
|
2006-03-27 16:18:10 +07:00
|
|
|
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
|
|
|
|
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
|
|
|
|
mddev->sync_thread = md_register_thread(md_do_sync, mddev,
|
2009-09-23 15:09:45 +07:00
|
|
|
"reshape");
|
2006-03-27 16:18:10 +07:00
|
|
|
if (!mddev->sync_thread) {
|
|
|
|
mddev->recovery = 0;
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
2013-11-14 11:16:15 +07:00
|
|
|
write_seqcount_begin(&conf->gen_lock);
|
2006-03-27 16:18:10 +07:00
|
|
|
mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
|
2013-11-14 11:16:15 +07:00
|
|
|
mddev->new_chunk_sectors =
|
|
|
|
conf->chunk_sectors = conf->prev_chunk_sectors;
|
|
|
|
mddev->new_layout = conf->algorithm = conf->prev_algo;
|
2012-05-21 06:27:00 +07:00
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
rdev->new_data_offset = rdev->data_offset;
|
|
|
|
smp_wmb();
|
2013-11-14 11:16:15 +07:00
|
|
|
conf->generation --;
|
2009-03-31 11:16:46 +07:00
|
|
|
conf->reshape_progress = MaxSector;
|
2012-03-13 07:21:18 +07:00
|
|
|
mddev->reshape_position = MaxSector;
|
2013-11-14 11:16:15 +07:00
|
|
|
write_seqcount_end(&conf->gen_lock);
|
2006-03-27 16:18:10 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
2009-03-31 11:28:40 +07:00
|
|
|
conf->reshape_checkpoint = jiffies;
|
2006-03-27 16:18:10 +07:00
|
|
|
md_wakeup_thread(mddev->sync_thread);
|
|
|
|
md_new_event(mddev);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-03-31 11:17:38 +07:00
|
|
|
/* This is called from the reshape thread and should make any
|
|
|
|
* changes needed in 'conf'
|
|
|
|
*/
|
2011-10-11 12:49:52 +07:00
|
|
|
static void end_reshape(struct r5conf *conf)
|
2006-03-27 16:18:10 +07:00
|
|
|
{
|
|
|
|
|
2006-03-27 16:18:11 +07:00
|
|
|
if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
|
2017-10-17 12:18:36 +07:00
|
|
|
struct md_rdev *rdev;
|
2006-03-27 16:18:11 +07:00
|
|
|
|
|
|
|
spin_lock_irq(&conf->device_lock);
|
2009-03-31 11:15:05 +07:00
|
|
|
conf->previous_raid_disks = conf->raid_disks;
|
2017-07-05 16:34:04 +07:00
|
|
|
md_finish_reshape(conf->mddev);
|
2012-05-21 06:27:00 +07:00
|
|
|
smp_wmb();
|
2009-03-31 11:16:46 +07:00
|
|
|
conf->reshape_progress = MaxSector;
|
2015-07-24 10:30:32 +07:00
|
|
|
conf->mddev->reshape_position = MaxSector;
|
2017-10-17 12:18:36 +07:00
|
|
|
rdev_for_each(rdev, conf->mddev)
|
|
|
|
if (rdev->raid_disk >= 0 &&
|
|
|
|
!test_bit(Journal, &rdev->flags) &&
|
|
|
|
!test_bit(In_sync, &rdev->flags))
|
|
|
|
rdev->recovery_offset = MaxSector;
|
2006-03-27 16:18:11 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2009-03-31 11:27:18 +07:00
|
|
|
wake_up(&conf->wait_for_overlap);
|
2006-06-26 14:27:38 +07:00
|
|
|
|
|
|
|
/* read-ahead size must cover two whole stripes, which is
|
|
|
|
* 2 * (datadisks) * chunksize where 'n' is the number of raid devices
|
|
|
|
*/
|
2010-06-01 16:37:28 +07:00
|
|
|
if (conf->mddev->queue) {
|
2009-03-31 11:15:05 +07:00
|
|
|
int data_disks = conf->raid_disks - conf->max_degraded;
|
2009-06-18 05:45:55 +07:00
|
|
|
int stripe = data_disks * ((conf->chunk_sectors << 9)
|
2009-03-31 11:15:05 +07:00
|
|
|
/ PAGE_SIZE);
|
2017-02-02 21:56:50 +07:00
|
|
|
if (conf->mddev->queue->backing_dev_info->ra_pages < 2 * stripe)
|
|
|
|
conf->mddev->queue->backing_dev_info->ra_pages = 2 * stripe;
|
2006-06-26 14:27:38 +07:00
|
|
|
}
|
2006-03-27 16:18:10 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-03-31 11:17:38 +07:00
|
|
|
/* This is called from the raid5d thread with mddev_lock held.
|
|
|
|
* It makes config changes to the device.
|
|
|
|
*/
|
2011-10-11 12:47:53 +07:00
|
|
|
static void raid5_finish_reshape(struct mddev *mddev)
|
2009-03-31 11:15:05 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2009-03-31 11:15:05 +07:00
|
|
|
|
|
|
|
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
|
|
|
|
|
md: fix a potential deadlock of raid5/raid10 reshape
There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.
How the deadlock happens?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb->s_umount lock and tries to
write through critical data into raid5 emulated block device. So it
waits for raid5 kernel thread handling stripes in order to finish it
I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
called first in order to reap the raid5 resync thread. That is,
raid5_finish_reshape() will be called. In this function, it will try
to update conf and call VFS revalidate_disk() to grow the raid5
emulated block device. It will try to acquire VFS sb->s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb->s_umount lock. The deadlock happens.
The raid5 kernel thread is an emulated block device. It is responible to
handle I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.
For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.
2017/12/29:
Thanks to Guoqing Jiang <gqjiang@suse.com>,
we confirmed that there is the same deadlock issue in raid10. It's
reproducible and can be fixed by this patch. For raid10.c, we can remove
the similar code to prevent deadlock as well since they has been called
before.
Reported-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
2018-02-22 12:34:46 +07:00
|
|
|
if (mddev->delta_disks <= 0) {
|
2009-03-31 11:17:38 +07:00
|
|
|
int d;
|
2011-12-23 06:17:50 +07:00
|
|
|
spin_lock_irq(&conf->device_lock);
|
2017-01-25 01:45:30 +07:00
|
|
|
mddev->degraded = raid5_calc_degraded(conf);
|
2011-12-23 06:17:50 +07:00
|
|
|
spin_unlock_irq(&conf->device_lock);
|
2009-03-31 11:17:38 +07:00
|
|
|
for (d = conf->raid_disks ;
|
|
|
|
d < conf->raid_disks - mddev->delta_disks;
|
2009-08-13 07:41:49 +07:00
|
|
|
d++) {
|
2011-10-11 12:45:26 +07:00
|
|
|
struct md_rdev *rdev = conf->disks[d].rdev;
|
2012-05-22 10:55:33 +07:00
|
|
|
if (rdev)
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
|
|
|
rdev = conf->disks[d].replacement;
|
|
|
|
if (rdev)
|
|
|
|
clear_bit(In_sync, &rdev->flags);
|
2009-08-13 07:41:49 +07:00
|
|
|
}
|
2009-03-31 11:15:05 +07:00
|
|
|
}
|
2009-03-31 11:24:23 +07:00
|
|
|
mddev->layout = conf->algorithm;
|
2009-06-18 05:45:55 +07:00
|
|
|
mddev->chunk_sectors = conf->chunk_sectors;
|
2009-03-31 11:17:38 +07:00
|
|
|
mddev->reshape_position = MaxSector;
|
|
|
|
mddev->delta_disks = 0;
|
2012-05-21 06:27:00 +07:00
|
|
|
mddev->reshape_backwards = 0;
|
2009-03-31 11:15:05 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-10-19 08:49:15 +07:00
|
|
|
static void raid5_quiesce(struct mddev *mddev, int quiesce)
|
2005-09-10 06:23:54 +07:00
|
|
|
{
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2005-09-10 06:23:54 +07:00
|
|
|
|
2017-10-19 08:49:15 +07:00
|
|
|
if (quiesce) {
|
|
|
|
/* stop all writes */
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
lock_all_device_hash_locks_irq(conf);
|
2009-08-03 07:59:58 +07:00
|
|
|
/* '2' tells resync/reshape to pause so that all
|
|
|
|
* active stripes can drain
|
|
|
|
*/
|
md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.
In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
or cached stripes is enough for a full stripe (chunk size / 4k)
(r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
r5c_do_reclaim() contains new logic of reclaim.
For stripe cache:
When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
For log space:
To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.
Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.
r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.
When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).
This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 06:24:40 +07:00
|
|
|
r5c_flush_cache(conf, INT_MAX);
|
2009-08-03 07:59:58 +07:00
|
|
|
conf->quiesce = 2;
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 15:19:06 +07:00
|
|
|
wait_event_cmd(conf->wait_for_quiescent,
|
2006-12-10 17:20:47 +07:00
|
|
|
atomic_read(&conf->active_stripes) == 0 &&
|
|
|
|
atomic_read(&conf->active_aligned_reads) == 0,
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
unlock_all_device_hash_locks_irq(conf),
|
|
|
|
lock_all_device_hash_locks_irq(conf));
|
2009-08-03 07:59:58 +07:00
|
|
|
conf->quiesce = 1;
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
unlock_all_device_hash_locks_irq(conf);
|
2009-08-03 07:59:58 +07:00
|
|
|
/* allow reshape to continue */
|
|
|
|
wake_up(&conf->wait_for_overlap);
|
2017-10-19 08:49:15 +07:00
|
|
|
} else {
|
|
|
|
/* re-enable writes */
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
lock_all_device_hash_locks_irq(conf);
|
2005-09-10 06:23:54 +07:00
|
|
|
conf->quiesce = 0;
|
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.
After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.
Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.
So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.
v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
Commit log refactor suggestion from Neil.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 15:19:06 +07:00
|
|
|
wake_up(&conf->wait_for_quiescent);
|
2006-03-27 16:18:14 +07:00
|
|
|
wake_up(&conf->wait_for_overlap);
|
raid5: relieve lock contention in get_active_stripe()
get_active_stripe() is the last place we have lock contention. It has two
paths. One is stripe isn't found and new stripe is allocated, the other is
stripe is found.
The first path basically calls __find_stripe and init_stripe. It accesses
conf->generation, conf->previous_raid_disks, conf->raid_disks,
conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
stripe_hashtbl and inactive_list, other fields are changed very rarely.
With this patch, we split inactive_list and add new hash locks. Each free
stripe belongs to a specific inactive list. Which inactive list is determined
by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
is determined by it's lock_hash too. The lock_hash is derivied from current
stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
list too. The goal of the new hash locks introduced is we can only use the new
locks in the first path of get_active_stripe(). Since we have several hash
locks, lock contention is relieved significantly.
The first path of get_active_stripe() accesses other fields, since they are
changed rarely, changing them now need take conf->device_lock and all hash
locks. For a slow path, this isn't a problem.
If we need lock device_lock and hash lock, we always lock hash lock first. The
tricky part is release_stripe and friends. We need take device_lock first.
Neil's suggestion is we put inactive stripes to a temporary list and readd it
to inactive_list after device_lock is released. In this way, we add stripes to
temporary list with device_lock hold and remove stripes from the list with hash
lock hold. So we don't allow concurrent access to the temporary list, which
means we need allocate temporary list for all participants of release_stripe.
One downside is free stripes are maintained in their inactive list, they can't
across between the lists. By default, we have total 256 stripes and 8 lists, so
each list will have 32 stripes. It's possible one list has free stripe but
other list hasn't. The chance should be rare because stripes allocation are
even distributed. And we can always allocate more stripes for cache, several
mega bytes memory isn't a big deal.
This completely removes the lock contention of the first path of
get_active_stripe(). It slows down the second code path a little bit though
because we now need takes two locks, but since the hash lock isn't contended,
the overhead should be quite small (several atomic instructions). The second
path of get_active_stripe() (basically sequential write or big request size
randwrite) still has lock contentions.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14 11:16:17 +07:00
|
|
|
unlock_all_device_hash_locks_irq(conf);
|
2005-09-10 06:23:54 +07:00
|
|
|
}
|
2017-12-27 16:31:40 +07:00
|
|
|
log_quiesce(conf, quiesce);
|
2005-09-10 06:23:54 +07:00
|
|
|
}
|
2006-01-06 15:20:16 +07:00
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static void *raid45_takeover_raid0(struct mddev *mddev, int level)
|
2010-03-08 12:02:42 +07:00
|
|
|
{
|
2011-10-11 12:48:59 +07:00
|
|
|
struct r0conf *raid0_conf = mddev->private;
|
2011-04-21 23:07:26 +07:00
|
|
|
sector_t sectors;
|
2010-03-08 12:02:42 +07:00
|
|
|
|
2010-05-02 08:09:05 +07:00
|
|
|
/* for raid0 takeover only one zone is supported */
|
2011-10-11 12:48:59 +07:00
|
|
|
if (raid0_conf->nr_strip_zones > 1) {
|
2016-11-02 10:16:50 +07:00
|
|
|
pr_warn("md/raid:%s: cannot takeover raid0 with more than one zone.\n",
|
|
|
|
mdname(mddev));
|
2010-05-02 08:09:05 +07:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:48:59 +07:00
|
|
|
sectors = raid0_conf->strip_zone[0].zone_end;
|
|
|
|
sector_div(sectors, raid0_conf->strip_zone[0].nb_dev);
|
2011-04-20 12:38:18 +07:00
|
|
|
mddev->dev_sectors = sectors;
|
2010-05-02 08:09:05 +07:00
|
|
|
mddev->new_level = level;
|
2010-03-08 12:02:42 +07:00
|
|
|
mddev->new_layout = ALGORITHM_PARITY_N;
|
|
|
|
mddev->new_chunk_sectors = mddev->chunk_sectors;
|
|
|
|
mddev->raid_disks += 1;
|
|
|
|
mddev->delta_disks = 1;
|
|
|
|
/* make sure it will be not marked as dirty */
|
|
|
|
mddev->recovery_cp = MaxSector;
|
|
|
|
|
|
|
|
return setup_conf(mddev);
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static void *raid5_takeover_raid1(struct mddev *mddev)
|
2009-03-31 10:39:39 +07:00
|
|
|
{
|
|
|
|
int chunksect;
|
2016-12-09 06:48:17 +07:00
|
|
|
void *ret;
|
2009-03-31 10:39:39 +07:00
|
|
|
|
|
|
|
if (mddev->raid_disks != 2 ||
|
|
|
|
mddev->degraded > 1)
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
|
|
|
/* Should check if there are write-behind devices? */
|
|
|
|
|
|
|
|
chunksect = 64*2; /* 64K by default */
|
|
|
|
|
|
|
|
/* The array must be an exact multiple of chunksize */
|
|
|
|
while (chunksect && (mddev->array_sectors & (chunksect-1)))
|
|
|
|
chunksect >>= 1;
|
|
|
|
|
|
|
|
if ((chunksect<<9) < STRIPE_SIZE)
|
|
|
|
/* array size does not allow a suitable chunk size */
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
|
|
|
mddev->new_level = 5;
|
|
|
|
mddev->new_layout = ALGORITHM_LEFT_SYMMETRIC;
|
2009-06-18 05:45:27 +07:00
|
|
|
mddev->new_chunk_sectors = chunksect;
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2016-12-09 06:48:17 +07:00
|
|
|
ret = setup_conf(mddev);
|
2017-01-07 07:31:35 +07:00
|
|
|
if (!IS_ERR(ret))
|
2017-01-05 07:10:19 +07:00
|
|
|
mddev_clear_unsupported_flags(mddev,
|
|
|
|
UNSUPPORTED_MDDEV_FLAGS);
|
2016-12-09 06:48:17 +07:00
|
|
|
return ret;
|
2009-03-31 10:39:39 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static void *raid5_takeover_raid6(struct mddev *mddev)
|
2009-03-31 10:57:20 +07:00
|
|
|
{
|
|
|
|
int new_layout;
|
|
|
|
|
|
|
|
switch (mddev->layout) {
|
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC_6:
|
|
|
|
new_layout = ALGORITHM_LEFT_ASYMMETRIC;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC_6:
|
|
|
|
new_layout = ALGORITHM_RIGHT_ASYMMETRIC;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC_6:
|
|
|
|
new_layout = ALGORITHM_LEFT_SYMMETRIC;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC_6:
|
|
|
|
new_layout = ALGORITHM_RIGHT_SYMMETRIC;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_0_6:
|
|
|
|
new_layout = ALGORITHM_PARITY_0;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_N:
|
|
|
|
new_layout = ALGORITHM_PARITY_N;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
mddev->new_level = 5;
|
|
|
|
mddev->new_layout = new_layout;
|
|
|
|
mddev->delta_disks = -1;
|
|
|
|
mddev->raid_disks -= 1;
|
|
|
|
return setup_conf(mddev);
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int raid5_check_reshape(struct mddev *mddev)
|
2009-03-31 10:56:41 +07:00
|
|
|
{
|
2009-03-31 11:24:23 +07:00
|
|
|
/* For a 2-drive array, the layout and chunk size can be changed
|
|
|
|
* immediately as not restriping is needed.
|
|
|
|
* For larger arrays we record the new value - after validation
|
|
|
|
* to be used by a reshape pass.
|
2009-03-31 10:56:41 +07:00
|
|
|
*/
|
2011-10-11 12:49:52 +07:00
|
|
|
struct r5conf *conf = mddev->private;
|
2009-06-18 05:47:42 +07:00
|
|
|
int new_chunk = mddev->new_chunk_sectors;
|
2009-03-31 10:56:41 +07:00
|
|
|
|
2009-06-18 05:47:42 +07:00
|
|
|
if (mddev->new_layout >= 0 && !algorithm_valid_raid5(mddev->new_layout))
|
2009-03-31 10:56:41 +07:00
|
|
|
return -EINVAL;
|
|
|
|
if (new_chunk > 0) {
|
2009-06-18 05:46:10 +07:00
|
|
|
if (!is_power_of_2(new_chunk))
|
2009-03-31 10:56:41 +07:00
|
|
|
return -EINVAL;
|
2009-06-18 05:47:42 +07:00
|
|
|
if (new_chunk < (PAGE_SIZE>>9))
|
2009-03-31 10:56:41 +07:00
|
|
|
return -EINVAL;
|
2009-06-18 05:47:42 +07:00
|
|
|
if (mddev->array_sectors & (new_chunk-1))
|
2009-03-31 10:56:41 +07:00
|
|
|
/* not factor of array size */
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* They look valid */
|
|
|
|
|
2009-03-31 11:24:23 +07:00
|
|
|
if (mddev->raid_disks == 2) {
|
2009-06-18 05:47:42 +07:00
|
|
|
/* can make the change immediately */
|
|
|
|
if (mddev->new_layout >= 0) {
|
|
|
|
conf->algorithm = mddev->new_layout;
|
|
|
|
mddev->layout = mddev->new_layout;
|
2009-03-31 11:24:23 +07:00
|
|
|
}
|
|
|
|
if (new_chunk > 0) {
|
2009-06-18 05:47:42 +07:00
|
|
|
conf->chunk_sectors = new_chunk ;
|
|
|
|
mddev->chunk_sectors = new_chunk;
|
2009-03-31 11:24:23 +07:00
|
|
|
}
|
2016-12-09 06:48:19 +07:00
|
|
|
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
|
2009-03-31 11:24:23 +07:00
|
|
|
md_wakeup_thread(mddev->thread);
|
2009-03-31 10:56:41 +07:00
|
|
|
}
|
2009-06-18 05:47:55 +07:00
|
|
|
return check_reshape(mddev);
|
2009-03-31 11:24:23 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static int raid6_check_reshape(struct mddev *mddev)
|
2009-03-31 11:24:23 +07:00
|
|
|
{
|
2009-06-18 05:47:42 +07:00
|
|
|
int new_chunk = mddev->new_chunk_sectors;
|
2009-06-18 05:47:55 +07:00
|
|
|
|
2009-06-18 05:47:42 +07:00
|
|
|
if (mddev->new_layout >= 0 && !algorithm_valid_raid6(mddev->new_layout))
|
2009-03-31 11:24:23 +07:00
|
|
|
return -EINVAL;
|
2009-03-31 10:56:41 +07:00
|
|
|
if (new_chunk > 0) {
|
2009-06-18 05:46:10 +07:00
|
|
|
if (!is_power_of_2(new_chunk))
|
2009-03-31 11:24:23 +07:00
|
|
|
return -EINVAL;
|
2009-06-18 05:47:42 +07:00
|
|
|
if (new_chunk < (PAGE_SIZE >> 9))
|
2009-03-31 11:24:23 +07:00
|
|
|
return -EINVAL;
|
2009-06-18 05:47:42 +07:00
|
|
|
if (mddev->array_sectors & (new_chunk-1))
|
2009-03-31 11:24:23 +07:00
|
|
|
/* not factor of array size */
|
|
|
|
return -EINVAL;
|
2009-03-31 10:56:41 +07:00
|
|
|
}
|
2009-03-31 11:24:23 +07:00
|
|
|
|
|
|
|
/* They look valid */
|
2009-06-18 05:47:55 +07:00
|
|
|
return check_reshape(mddev);
|
2009-03-31 10:56:41 +07:00
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static void *raid5_takeover(struct mddev *mddev)
|
2009-03-31 10:39:39 +07:00
|
|
|
{
|
|
|
|
/* raid5 can take over:
|
2010-05-02 08:09:05 +07:00
|
|
|
* raid0 - if there is only one strip zone - make it a raid4 layout
|
2009-03-31 10:39:39 +07:00
|
|
|
* raid1 - if there are two drives. We need to know the chunk size
|
|
|
|
* raid4 - trivial - just use a raid4 layout.
|
|
|
|
* raid6 - Providing it is a *_6 layout
|
|
|
|
*/
|
2010-05-02 08:09:05 +07:00
|
|
|
if (mddev->level == 0)
|
|
|
|
return raid45_takeover_raid0(mddev, 5);
|
2009-03-31 10:39:39 +07:00
|
|
|
if (mddev->level == 1)
|
|
|
|
return raid5_takeover_raid1(mddev);
|
2009-03-31 10:57:09 +07:00
|
|
|
if (mddev->level == 4) {
|
|
|
|
mddev->new_layout = ALGORITHM_PARITY_N;
|
|
|
|
mddev->new_level = 5;
|
|
|
|
return setup_conf(mddev);
|
|
|
|
}
|
2009-03-31 10:57:20 +07:00
|
|
|
if (mddev->level == 6)
|
|
|
|
return raid5_takeover_raid6(mddev);
|
2009-03-31 10:39:39 +07:00
|
|
|
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static void *raid4_takeover(struct mddev *mddev)
|
2010-03-22 12:53:49 +07:00
|
|
|
{
|
2010-05-02 08:09:05 +07:00
|
|
|
/* raid4 can take over:
|
|
|
|
* raid0 - if there is only one strip zone
|
|
|
|
* raid5 - if layout is right
|
2010-03-22 12:53:49 +07:00
|
|
|
*/
|
2010-05-02 08:09:05 +07:00
|
|
|
if (mddev->level == 0)
|
|
|
|
return raid45_takeover_raid0(mddev, 4);
|
2010-03-22 12:53:49 +07:00
|
|
|
if (mddev->level == 5 &&
|
|
|
|
mddev->layout == ALGORITHM_PARITY_N) {
|
|
|
|
mddev->new_layout = 0;
|
|
|
|
mddev->new_level = 4;
|
|
|
|
return setup_conf(mddev);
|
|
|
|
}
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2011-10-11 12:49:58 +07:00
|
|
|
static struct md_personality raid5_personality;
|
2009-03-31 10:39:39 +07:00
|
|
|
|
2011-10-11 12:47:53 +07:00
|
|
|
static void *raid6_takeover(struct mddev *mddev)
|
2009-03-31 10:39:39 +07:00
|
|
|
{
|
|
|
|
/* Currently can only take over a raid5. We map the
|
|
|
|
* personality to an equivalent raid6 personality
|
|
|
|
* with the Q block at the end.
|
|
|
|
*/
|
|
|
|
int new_layout;
|
|
|
|
|
|
|
|
if (mddev->pers != &raid5_personality)
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
if (mddev->degraded > 1)
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
if (mddev->raid_disks > 253)
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
if (mddev->raid_disks < 3)
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
|
|
|
switch (mddev->layout) {
|
|
|
|
case ALGORITHM_LEFT_ASYMMETRIC:
|
|
|
|
new_layout = ALGORITHM_LEFT_ASYMMETRIC_6;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_RIGHT_ASYMMETRIC:
|
|
|
|
new_layout = ALGORITHM_RIGHT_ASYMMETRIC_6;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_LEFT_SYMMETRIC:
|
|
|
|
new_layout = ALGORITHM_LEFT_SYMMETRIC_6;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_RIGHT_SYMMETRIC:
|
|
|
|
new_layout = ALGORITHM_RIGHT_SYMMETRIC_6;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_0:
|
|
|
|
new_layout = ALGORITHM_PARITY_0_6;
|
|
|
|
break;
|
|
|
|
case ALGORITHM_PARITY_N:
|
|
|
|
new_layout = ALGORITHM_PARITY_N;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
mddev->new_level = 6;
|
|
|
|
mddev->new_layout = new_layout;
|
|
|
|
mddev->delta_disks = 1;
|
|
|
|
mddev->raid_disks += 1;
|
|
|
|
return setup_conf(mddev);
|
|
|
|
}
|
|
|
|
|
2017-03-09 16:00:03 +07:00
|
|
|
static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
|
|
|
|
{
|
|
|
|
struct r5conf *conf;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = mddev_lock(mddev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
conf = mddev->private;
|
|
|
|
if (!conf) {
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
return -ENODEV;
|
|
|
|
}
|
|
|
|
|
2017-04-04 18:13:57 +07:00
|
|
|
if (strncmp(buf, "ppl", 3) == 0) {
|
2017-03-28 00:51:33 +07:00
|
|
|
/* ppl only works with RAID 5 */
|
2017-04-04 18:13:57 +07:00
|
|
|
if (!raid5_has_ppl(conf) && conf->level == 5) {
|
|
|
|
err = log_init(conf, NULL, true);
|
|
|
|
if (!err) {
|
|
|
|
err = resize_stripes(conf, conf->pool_size);
|
|
|
|
if (err)
|
|
|
|
log_exit(conf);
|
|
|
|
}
|
2017-03-28 00:51:33 +07:00
|
|
|
} else
|
|
|
|
err = -EINVAL;
|
|
|
|
} else if (strncmp(buf, "resync", 6) == 0) {
|
|
|
|
if (raid5_has_ppl(conf)) {
|
|
|
|
mddev_suspend(mddev);
|
|
|
|
log_exit(conf);
|
|
|
|
mddev_resume(mddev);
|
2017-04-04 18:13:57 +07:00
|
|
|
err = resize_stripes(conf, conf->pool_size);
|
2017-03-28 00:51:33 +07:00
|
|
|
} else if (test_bit(MD_HAS_JOURNAL, &conf->mddev->flags) &&
|
|
|
|
r5l_log_disk_error(conf)) {
|
|
|
|
bool journal_dev_exists = false;
|
|
|
|
struct md_rdev *rdev;
|
|
|
|
|
|
|
|
rdev_for_each(rdev, mddev)
|
|
|
|
if (test_bit(Journal, &rdev->flags)) {
|
|
|
|
journal_dev_exists = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!journal_dev_exists) {
|
|
|
|
mddev_suspend(mddev);
|
|
|
|
clear_bit(MD_HAS_JOURNAL, &mddev->flags);
|
|
|
|
mddev_resume(mddev);
|
|
|
|
} else /* need remove journal device first */
|
|
|
|
err = -EBUSY;
|
|
|
|
} else
|
|
|
|
err = -EINVAL;
|
2017-03-09 16:00:03 +07:00
|
|
|
} else {
|
|
|
|
err = -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!err)
|
|
|
|
md_update_sb(mddev, 1);
|
|
|
|
|
|
|
|
mddev_unlock(mddev);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2017-11-20 13:17:01 +07:00
|
|
|
static int raid5_start(struct mddev *mddev)
|
|
|
|
{
|
|
|
|
struct r5conf *conf = mddev->private;
|
|
|
|
|
|
|
|
return r5l_start(conf->log);
|
|
|
|
}
|
|
|
|
|
2011-10-11 12:49:58 +07:00
|
|
|
static struct md_personality raid6_personality =
|
2006-06-26 14:27:38 +07:00
|
|
|
{
|
|
|
|
.name = "raid6",
|
|
|
|
.level = 6,
|
|
|
|
.owner = THIS_MODULE,
|
2016-01-21 04:52:20 +07:00
|
|
|
.make_request = raid5_make_request,
|
|
|
|
.run = raid5_run,
|
2017-11-20 13:17:01 +07:00
|
|
|
.start = raid5_start,
|
2014-12-15 08:56:58 +07:00
|
|
|
.free = raid5_free,
|
2016-01-21 04:52:20 +07:00
|
|
|
.status = raid5_status,
|
|
|
|
.error_handler = raid5_error,
|
2006-06-26 14:27:38 +07:00
|
|
|
.hot_add_disk = raid5_add_disk,
|
|
|
|
.hot_remove_disk= raid5_remove_disk,
|
|
|
|
.spare_active = raid5_spare_active,
|
2016-01-21 04:52:20 +07:00
|
|
|
.sync_request = raid5_sync_request,
|
2006-06-26 14:27:38 +07:00
|
|
|
.resize = raid5_resize,
|
2009-03-18 08:10:40 +07:00
|
|
|
.size = raid5_size,
|
2009-06-18 05:47:55 +07:00
|
|
|
.check_reshape = raid6_check_reshape,
|
2007-03-01 11:11:53 +07:00
|
|
|
.start_reshape = raid5_start_reshape,
|
2009-03-31 11:15:05 +07:00
|
|
|
.finish_reshape = raid5_finish_reshape,
|
2006-06-26 14:27:38 +07:00
|
|
|
.quiesce = raid5_quiesce,
|
2009-03-31 10:39:39 +07:00
|
|
|
.takeover = raid6_takeover,
|
2014-12-15 08:56:56 +07:00
|
|
|
.congested = raid5_congested,
|
2017-03-28 00:51:33 +07:00
|
|
|
.change_consistency_policy = raid5_change_consistency_policy,
|
2006-06-26 14:27:38 +07:00
|
|
|
};
|
2011-10-11 12:49:58 +07:00
|
|
|
static struct md_personality raid5_personality =
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
.name = "raid5",
|
2006-01-06 15:20:36 +07:00
|
|
|
.level = 5,
|
2005-04-17 05:20:36 +07:00
|
|
|
.owner = THIS_MODULE,
|
2016-01-21 04:52:20 +07:00
|
|
|
.make_request = raid5_make_request,
|
|
|
|
.run = raid5_run,
|
2017-11-20 13:17:01 +07:00
|
|
|
.start = raid5_start,
|
2014-12-15 08:56:58 +07:00
|
|
|
.free = raid5_free,
|
2016-01-21 04:52:20 +07:00
|
|
|
.status = raid5_status,
|
|
|
|
.error_handler = raid5_error,
|
2005-04-17 05:20:36 +07:00
|
|
|
.hot_add_disk = raid5_add_disk,
|
|
|
|
.hot_remove_disk= raid5_remove_disk,
|
|
|
|
.spare_active = raid5_spare_active,
|
2016-01-21 04:52:20 +07:00
|
|
|
.sync_request = raid5_sync_request,
|
2005-04-17 05:20:36 +07:00
|
|
|
.resize = raid5_resize,
|
2009-03-18 08:10:40 +07:00
|
|
|
.size = raid5_size,
|
2006-03-27 16:18:13 +07:00
|
|
|
.check_reshape = raid5_check_reshape,
|
|
|
|
.start_reshape = raid5_start_reshape,
|
2009-03-31 11:15:05 +07:00
|
|
|
.finish_reshape = raid5_finish_reshape,
|
2005-09-10 06:23:54 +07:00
|
|
|
.quiesce = raid5_quiesce,
|
2009-03-31 10:39:39 +07:00
|
|
|
.takeover = raid5_takeover,
|
2014-12-15 08:56:56 +07:00
|
|
|
.congested = raid5_congested,
|
2017-03-09 16:00:03 +07:00
|
|
|
.change_consistency_policy = raid5_change_consistency_policy,
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
2011-10-11 12:49:58 +07:00
|
|
|
static struct md_personality raid4_personality =
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-01-06 15:20:36 +07:00
|
|
|
.name = "raid4",
|
|
|
|
.level = 4,
|
|
|
|
.owner = THIS_MODULE,
|
2016-01-21 04:52:20 +07:00
|
|
|
.make_request = raid5_make_request,
|
|
|
|
.run = raid5_run,
|
2017-11-20 13:17:01 +07:00
|
|
|
.start = raid5_start,
|
2014-12-15 08:56:58 +07:00
|
|
|
.free = raid5_free,
|
2016-01-21 04:52:20 +07:00
|
|
|
.status = raid5_status,
|
|
|
|
.error_handler = raid5_error,
|
2006-01-06 15:20:36 +07:00
|
|
|
.hot_add_disk = raid5_add_disk,
|
|
|
|
.hot_remove_disk= raid5_remove_disk,
|
|
|
|
.spare_active = raid5_spare_active,
|
2016-01-21 04:52:20 +07:00
|
|
|
.sync_request = raid5_sync_request,
|
2006-01-06 15:20:36 +07:00
|
|
|
.resize = raid5_resize,
|
2009-03-18 08:10:40 +07:00
|
|
|
.size = raid5_size,
|
2007-03-27 12:32:13 +07:00
|
|
|
.check_reshape = raid5_check_reshape,
|
|
|
|
.start_reshape = raid5_start_reshape,
|
2009-03-31 11:15:05 +07:00
|
|
|
.finish_reshape = raid5_finish_reshape,
|
2006-01-06 15:20:36 +07:00
|
|
|
.quiesce = raid5_quiesce,
|
2010-03-22 12:53:49 +07:00
|
|
|
.takeover = raid4_takeover,
|
2014-12-15 08:56:56 +07:00
|
|
|
.congested = raid5_congested,
|
2017-03-28 00:51:33 +07:00
|
|
|
.change_consistency_policy = raid5_change_consistency_policy,
|
2006-01-06 15:20:36 +07:00
|
|
|
};
|
|
|
|
|
|
|
|
static int __init raid5_init(void)
|
|
|
|
{
|
2016-08-18 19:57:24 +07:00
|
|
|
int ret;
|
|
|
|
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
raid5_wq = alloc_workqueue("raid5wq",
|
|
|
|
WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE|WQ_SYSFS, 0);
|
|
|
|
if (!raid5_wq)
|
|
|
|
return -ENOMEM;
|
2016-08-18 19:57:24 +07:00
|
|
|
|
|
|
|
ret = cpuhp_setup_state_multi(CPUHP_MD_RAID5_PREPARE,
|
|
|
|
"md/raid5:prepare",
|
|
|
|
raid456_cpu_up_prepare,
|
|
|
|
raid456_cpu_dead);
|
|
|
|
if (ret) {
|
|
|
|
destroy_workqueue(raid5_wq);
|
|
|
|
return ret;
|
|
|
|
}
|
2006-06-26 14:27:38 +07:00
|
|
|
register_md_personality(&raid6_personality);
|
2006-01-06 15:20:36 +07:00
|
|
|
register_md_personality(&raid5_personality);
|
|
|
|
register_md_personality(&raid4_personality);
|
|
|
|
return 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2006-01-06 15:20:36 +07:00
|
|
|
static void raid5_exit(void)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-06-26 14:27:38 +07:00
|
|
|
unregister_md_personality(&raid6_personality);
|
2006-01-06 15:20:36 +07:00
|
|
|
unregister_md_personality(&raid5_personality);
|
|
|
|
unregister_md_personality(&raid4_personality);
|
2016-08-18 19:57:24 +07:00
|
|
|
cpuhp_remove_multi_state(CPUHP_MD_RAID5_PREPARE);
|
raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use workqueue.
raid5 handles request (especially write) in stripe unit. A stripe is page size
aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
state machine for the corresponding stripe, which includes reading some disks
of the stripe, calculating parity, and writing some disks of the stripe. The
state machine is running in raid5d thread currently. Since there is only one
thread, it doesn't scale well for high speed storage. An obvious solution is
multi-threading.
To get better performance, we have some requirements:
a. locality. stripe corresponding to request submitted from one cpu is better
handled in thread in local cpu or local node. local cpu is preferred but some
times could be a bottleneck, for example, parity calculation is too heavy.
local node running has wide adaptability.
b. configurablity. Different setup of raid5 array might need diffent
configuration. Especially the thread number. More threads don't always mean
better performance because of lock contentions.
My original implementation is creating some kernel threads. There are
interfaces to control which cpu's stripe each thread should handle. And
userspace can set affinity of the threads. This provides biggest flexibility
and configurability. But it's hard to use and apparently a new thread pool
implementation is disfavor.
Recent workqueue improvement is quite promising. unbound workqueue will be
bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
do affinity setting. For example, we can only include one HT sibling in
affinity. Since work is non-reentrant by default, and we can control running
thread number by limiting dispatched work_struct number.
In this patch, I created several stripe worker group. A group is a numa node.
stripes from cpus of one node will be added to a group list. Workqueue thread
of one node will only handle stripes of worker group of the node. In this way,
stripe handling has numa node locality. And as I said, we can control thread
number by limiting dispatched work_struct number.
The work_struct callback function handles several stripes in one run. A typical
work queue usage is to run one unit in each work_struct. In raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need lock because of reference accounting
and stripe isn't in any list, queuing a work_struct for each stripe will make
workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
might dispatch request. If each work_struct only handles one stripe, such block
plug is meaningless.
This implementation can't do very fine grained configuration. But the numa
binding is most popular usage model, should be enough for most workloads.
Note: since we have only one stripe queue, switching to multi-thread might
decrease request size dispatching down to low level layer. The impact depends
on thread number, raid configuration and workload. So multi-thread raid5 might
not be proper for all setups.
Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disabling multi-threading by default
3. Add more descriptions in changelog
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 13:30:16 +07:00
|
|
|
destroy_workqueue(raid5_wq);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
module_init(raid5_init);
|
|
|
|
module_exit(raid5_exit);
|
|
|
|
MODULE_LICENSE("GPL");
|
2009-12-14 08:49:58 +07:00
|
|
|
MODULE_DESCRIPTION("RAID4/5/6 (striping with parity) personality for MD");
|
2005-04-17 05:20:36 +07:00
|
|
|
MODULE_ALIAS("md-personality-4"); /* RAID5 */
|
2006-01-06 15:20:51 +07:00
|
|
|
MODULE_ALIAS("md-raid5");
|
|
|
|
MODULE_ALIAS("md-raid4");
|
2006-01-06 15:20:36 +07:00
|
|
|
MODULE_ALIAS("md-level-5");
|
|
|
|
MODULE_ALIAS("md-level-4");
|
2006-06-26 14:27:38 +07:00
|
|
|
MODULE_ALIAS("md-personality-8"); /* RAID6 */
|
|
|
|
MODULE_ALIAS("md-raid6");
|
|
|
|
MODULE_ALIAS("md-level-6");
|
|
|
|
|
|
|
|
/* This used to be two separate modules, they were: */
|
|
|
|
MODULE_ALIAS("raid5");
|
|
|
|
MODULE_ALIAS("raid6");
|