bitmap_get_counter returns the number of sectors covered
by the counter in a pass-by-reference variable.
In some cases this can be very large, so make it a sector_t
for safety.
Signed-off-by: NeilBrown <neilb@suse.de>
lock_kernel calls were recently pushed down into open/release
functions.
md doesn't need that protection.
Then the BKL calls were change to md_mutex. We don't need those
either.
So remove it all.
Signed-off-by: NeilBrown <neilb@suse.de>
A RAID1 which has no persistent metadata, whether internal or
external, will hang on the first write.
This is caused by commit 070dc6dd71
In that case, MD_CHANGE_PENDING never gets cleared.
So during md_update_sb, is neither persistent or external,
clear MD_CHANGE_PENDING.
This is suitable for 2.6.36-stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
Silly though it is, completions and wait_queue_heads use foo_ONSTACK
(COMPLETION_INITIALIZER_ONSTACK, DECLARE_COMPLETION_ONSTACK,
__WAIT_QUEUE_HEAD_INIT_ONSTACK and DECLARE_WAIT_QUEUE_HEAD_ONSTACK) so I
guess workqueues should do the same thing.
s/INIT_WORK_ON_STACK/INIT_WORK_ONSTACK/
s/INIT_DELAYED_WORK_ON_STACK/INIT_DELAYED_WORK_ONSTACK/
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
xen-blkfront: disable barrier/flush write support
Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
block: remove BLKDEV_IFL_WAIT
aic7xxx_old: removed unused 'req' variable
block: remove the BH_Eopnotsupp flag
block: remove the BLKDEV_IFL_BARRIER flag
block: remove the WRITE_BARRIER flag
swap: do not send discards as barriers
fat: do not send discards as barriers
ext4: do not send discards as barriers
jbd2: replace barriers with explicit flush / FUA usage
jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
jbd: replace barriers with explicit flush / FUA usage
nilfs2: replace barriers with explicit flush / FUA usage
reiserfs: replace barriers with explicit flush / FUA usage
gfs2: replace barriers with explicit flush / FUA usage
btrfs: replace barriers with explicit flush / FUA usage
xfs: replace barriers with explicit flush / FUA usage
block: pass gfp_mask and flags to sb_issue_discard
dm: convey that all flushes are processed as empty
...
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
cfq-iosched: Fix a gcc 4.5 warning and put some comments
block: Turn bvec_k{un,}map_irq() into static inline functions
block: fix accounting bug on cross partition merges
block: Make the integrity mapped property a bio flag
block: Fix double free in blk_integrity_unregister
block: Ensure physical block size is unsigned int
blkio-throttle: Fix possible multiplication overflow in iops calculations
blkio-throttle: limit max iops value to UINT_MAX
blkio-throttle: There is no need to convert jiffies to milli seconds
blkio-throttle: Fix link failure failure on i386
blkio: Recalculate the throttled bio dispatch time upon throttle limit change
blkio: Add root group to td->tg_list
blkio: deletion of a cgroup was causes oops
blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: revert bad fix for memory hotplug causing bounces
Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: Prevent hang_check firing during long I/O
cfq: improve fsync performance for small files
...
Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
block: autoconvert trivial BKL users to private mutex
drivers: autoconvert trivial BKL users to private mutex
ipmi: autoconvert trivial BKL users to private mutex
mac: autoconvert trivial BKL users to private mutex
mtd: autoconvert trivial BKL users to private mutex
scsi: autoconvert trivial BKL users to private mutex
Fix up trivial conflicts (due to addition of private mutex right next to
deletion of a version string) in drivers/char/pcmcia/cm40[04]0_cs.c
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Function read_sb_page may return ERR_PTR(...). Check for it.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When performing a resync we pre-allocate some bios and repeatedly use
them. This requires us to re-initialise them each time.
One field (bi_comp_cpu) and some flags weren't being initiaised
reliably.
Signed-off-by: NeilBrown <neilb@suse.de>
bitmap_start_sync returns - via a pass-by-reference variable - the
number of sectors before we need to check with the bitmap again.
Since commit ef42567335 this number can be substantially larger,
2^27 is a common value.
Unfortunately it is an 'int' and so when raid1.c:sync_request shifts
it 9 places to the left it becomes 0. This results in a zero-length
read which the scsi layer justifiably complains about.
This patch just removes the shift so the common case becomes safe with
a trivially-correct patch.
In the next merge window we will convert this 'int' to a 'sector_t'
Reported-by: "George Spelvin" <linux@horizon.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The block device drivers have all gained new lock_kernel
calls from a recent pushdown, and some of the drivers
were already using the BKL before.
This turns the BKL into a set of per-driver mutexes.
Still need to check whether this is safe to do.
file=$1
name=$2
if grep -q lock_kernel ${file} ; then
if grep -q 'include.*linux.mutex.h' ${file} ; then
sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
else
sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
fi
sed -i ${file} \
-e "/^#include.*linux.mutex.h/,$ {
1,/^\(static\|int\|long\)/ {
/^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);
} }" \
-e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
-e '/[ ]*cycle_kernel_lock();/d'
else
sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \
-e '/cycle_kernel_lock()/d'
fi
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
If an array with 1.x metadata is assembled with the last disk missing,
md doesn't properly record the fact that the disk was missing.
This is unlikely to cause a real problem as the event count will be
different to the count on the missing disk so it won't be included in
the array. However it could still cause confusion.
So make sure we clear all the relevant slots, not just the early ones.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that we depend on md_update_sb to clear variable bits in
mddev->flags (rather than trying not to set them) it is important to
always call md_update_sb when appropriate.
md_check_recovery has this job but explicitly avoids it for ->external
metadata arrays. This is not longer appropraite, or needed.
However we do want to avoid taking the mddev lock if only
MD_CHANGE_PENDING is set as that is not cleared by md_update_sb for
external-metadata arrays.
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We have several users of min_not_zero, each of them using their own
definition. Move the define to kernel.h.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
Rename __clone_and_map_flush to __clone_and_map_empty_flush for added
clarity.
Simplify logic associated with REQ_FLUSH conditionals.
Introduce a BUG_ON() and add a few more helpful comments to the code
so that it is clear that all flushes are empty.
Cleanup __split_and_process_bio() so that an empty flush isn't processed
by a 'sector_count' focused while loop.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Now queue_io() is called from dec_pending(), which may be called with
interrupts disabled, so queue_io() must not enable interrupts
unconditionally and must save/restore the current interrupts status.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
against other bio's. This patch relaxes ordering around flushes.
* A flush bio is no longer deferred to workqueue directly. It's
processed like other bio's but __split_and_process_bio() uses
md->flush_bio as the clone source. md->flush_bio is initialized to
empty flush during md initialization and shared for all flushes.
* As a flush bio now travels through the same execution path as other
bio's, there's no need for dedicated error handling path either. It
can use the same error handling path in dec_pending(). Dedicated
error handling removed along with md->flush_error.
* When dec_pending() detects that a flush has completed, it checks
whether the original bio has data. If so, the bio is queued to the
deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
* As flush sequencing is handled in the usual issue/completion path,
dm_wq_work() no longer needs to handle flushes differently. Now its
only responsibility is re-issuing deferred bio's the same way as
_dm_request() would. REQ_FLUSH handling logic including
process_flush() is dropped.
* There's no reason for queue_io() and dm_wq_work() write lock
dm->io_lock. queue_io() now only uses md->deferred_lock and
dm_wq_work() read locks dm->io_lock.
* bio's no longer need to be queued on the deferred list while a flush
is in progress making DMF_QUEUE_IO_TO_THREAD unncessary. Drop it.
This avoids stalling the device during flushes and simplifies the
implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch converts request-based dm to support the new REQ_FLUSH/FUA.
The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.
In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special mostly independent path which can issue
flushes to multiple targets and sequence them. However, the
capability isn't currently in use and adds a lot of complexity.
Moreoever, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.
This patch rips out special flush code path and deals handles
REQ_FLUSH/FUA requests the same way as other requests. The only
special treatment is that REQ_FLUSH requests use the block address 0
when finding target, which is enough for now.
* added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
suggested by Mike Snitzer
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
now deprecated REQ_HARDBARRIER.
* -EOPNOTSUPP handling logic dropped.
* Preflush is handled as before but postflush is dropped and replaced
with passing down REQ_FUA to member request_queues. This replaces
one array wide cache flush w/ member specific FUA writes.
* __split_and_process_bio() now calls __clone_and_map_flush() directly
for flushes and guarantees all FLUSH bio's going to targets are zero
` length.
* It's now guaranteed that all FLUSH bio's which are passed onto dm
targets are zero length. bio_empty_barrier() tests are replaced
with REQ_FLUSH tests.
* Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.
* Dropped unlikely() around REQ_FLUSH tests. Flushes are not unlikely
enough to be marked with unlikely().
* Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue
doesn't support cache flushing. Advertise REQ_FLUSH | REQ_FUA
capability.
* Request based dm isn't converted yet. dm_init_request_based_queue()
resets flush support to 0 for now. To avoid disturbing request
based dm code, dm->flush_error is added for bio based dm while
requested based dm continues to use dm->barrier_error.
Lightly tested linear, stripe, raid1, snap and crypt targets. Please
proceed with caution as I'm not familiar with the code base.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: dm-devel@redhat.com
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.
* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.
* REQ_FLUSH/FUA failures are final and its users don't need retry
logic. Retry logic is removed.
* Preflush needs to be issued to all member devices but FUA writes can
be handled the same way as other writes - their processing can be
deferred to request_queue of member devices. md_barrier_request()
is renamed to md_flush_request() and simplified accordingly.
For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.
* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.
* raid5: Queue draining logic dropped. FUA bit is propagated through
biodrain and stripe resconstruction such that all the updated parts
of the stripe are written out with FUA writes if any of the dirtying
writes was FUA. preread_active_stripes handling in make_request()
is updated as suggested by Neil Brown.
* raid10: FUA bit needs to be propagated to write clones.
linear, raid0, 1, 5 and 10 tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
-EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
blk_queue_flush().
blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a
device has write cache and can flush it, it should set REQ_FLUSH. If
the device can handle FUA writes, it should also set REQ_FUA.
All blk_queue_ordered() users are converted.
* ORDERED_DRAIN is mapped to 0 which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Cc: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
MD_CHANGE_CLEAN is used for two different purposes and this leads to
confusion.
One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
not used for anything else, so have MD_CHANGE_PENDING take over that
purpose fully.
The two purposes are:
1/ tell md_update_sb that an update is needed and that it is just a
clean/dirty transition.
2/ tell user-space that an transition from clean to dirty is pending
(something wants to write), and tell te kernel (by clearin the
flag) that the transition is OK.
The first purpose remains wit MD_CHANGE_CLEAN, the second is moved
fully to MD_CHANGE_PENDING.
This means that various places which conditionally set or cleared
MD_CHANGE_CLEAN no longer need to be conditional.
Signed-off-by: NeilBrown <neilb@suse.de>
If this bit is cleared in md_update_sb() the kernel will allow writes to the
array if userspace triggers md_allow_write(), e.g. through stripe_cache_size,
when mdmon is not active. When mdmon is active the array transitions to
active-idle bypassing write-pending, setting up a race for mdmon to set the
array clean before a write arrives.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 7b6d91daee changed the behaviour
of a few variables in raid1 and raid10 from flags to bit-sets, but
left them as type 'bool' so they did not work.
Change them (back) to unsigned long.
(historical note: see 1ef04fefe2)
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jiri Slaby <jslaby@suse.cz> and many others
md_check_recovery expects ->spare_active to return 'true' if any
spares were activated, but none of them do, so the consequent change
in 'degraded' is not notified through sysfs.
So count the number of spares activated, subtract it from 'degraded'
just once, and return it.
Reported-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When RAID1 is done syncing disks, it'll update the state
of synced rdevs to In_sync. But it neglected to notify
sysfs that the attribute changed. So any programs that
are waiting for an rdev's state to change will not be
woken.
(raid5/raid10 added by neilb)
Signed-off-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The update of ->recovery_offset in sync_sbs is appropriate even then external
metadata is in use. However sync_sbs is only called when native
metadata is used.
So move that update in to the top of md_update_sb (which is the only
caller of sync_sbs) before the test on ->external.
This moves the update out of ->write_lock protection, but those fields
only need ->reconfig_mutex protection which they still have.
Also move the test on ->persistent up to where ->external is set as
for metadata update purposes they are the same.
Clear MD_CHANGE_DEVS and MD_CHANGE_CLEAN as they can only be confusing
if ->external is set or ->persistent isn't.
Finally move the update of ->utime down as it is only relevent (like
the ->events update) for native metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
Enable discard support in the DM multipath target.
This discard support depends on a few discard-specific fixes to the
block layer's request stacking driver methods.
Discard requests are optional so don't allow a failed discard to trigger
path failures. If there is a real problem with a given path the
barriers associated with the discard (either before or after the
discard) will cause path failure. That said, unconditionally passing
discard failures up the stack is not ideal. This must be fixed once DM
has more information about the nature of the underlying storage failure.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
The DM core will submit a discard bio to the stripe target for each
stripe in a striped DM device. The stripe target will determine
stripe-specific portions of the supplied bio to be remapped into
individual (at most 'num_discard_requests' extents). If a given
stripe-specific discard bio doesn't touch a particular stripe the bio
will be dropped.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Update __clone_and_map_discard to loop across all targets in a DM
device's table when it processes a discard bio. If a discard crosses a
target boundary it must be split accordingly.
Update __issue_target_requests and __issue_target_request to allow a
cloned discard bio to have a custom start sector and size.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Optimize sector division: If the number of stripes is a power of two,
we can do shift and mask instead of division.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move sector to stripe translation into a function.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Have the error target respond to a discard request with a hard -EIO
rather than fail the request with -EOPNOTSUPP.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Have the zero target silently drop a discard rather than fail the
request with -EOPNOTSUPP.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Split max_io_len_target_boundary out of max_io_len so that the discard
support can make use of it without duplicating max_io_len code.
Avoiding max_io_len's split_io logic enables DM's discard support to
submit the entire discard request to a target. But discards must still
be split on target boundaries.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename __flush_target to __issue_target_request now that it is used to
issue both flush and discard requests.
Introduce __issue_target_requests as a convenient wrapper to
__issue_target_request 'num_flush_requests' or 'num_discard_requests'
times per target.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allow discards to be passed through to linear mappings if at least one
underlying device supports it. Discards will be forwarded only to
devices that support them.
A target that supports discards should set num_discard_requests to
indicate how many times each discard request must be submitted to it.
Verify table's underlying devices support discards prior to setting the
associated DM device as capable of discards (via QUEUE_FLAG_DISCARD).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allocate cipher strings indpendently of struct crypt_config and move
cipher parsing and allocation into a separate function to prepare for
supporting the cryptoapi format e.g. "xts(aes)".
No functional change in this patch.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use just one label and reuse common destructor for crypt target.
Parse remaining argv arguments in logic order.
Also do not ignore error values from IV init and set key functions.
No functional change in this patch except changed return codes
based on above.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add devname:mapper/control and MAPPER_CTRL_MINOR module alias
to support dm-mod module autoloading.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
'target_request_nr' is a more generic name that reflects the fact that
it will be used for both flush and discard support.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This change unifies the various checks and finalization that occurs on a
table prior to use. By doing so, it allows table construction without
traversing the dm-ioctl interface.
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Implement merge method for the snapshot origin to improve read
performance.
Without merge method, dm asks the upper layers to submit smallest possible
bios --- one page. Submitting such small bios impacts performance negatively
when reading or writing the origin device.
Without this patch, CPU consumption when reading the origin on lvm on md-raid0
was 6 to 12%, with this patch, it drops to 1 to 4%.
Note: in my testing, it actually degraded performance in some settings, I
traced it to Maxtor disks having problems with > 512-sector requests.
Reducing the number of sectors to /sys/block/sd*/queue/max_sectors_kb to
256 fixed the read performance. I think we don't have to care about weird
disks that actually degrade performance because of large requests being
sent to them.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change bio-based mapped devices no longer to have a fully initialized
request_queue (request_fn, elevator, etc). This means bio-based DM
devices no longer register elevator sysfs attributes ('iosched/' tree
or 'scheduler' other than "none").
In contrast, a request-based DM device will continue to have a full
request_queue and will register elevator sysfs attributes. Therefore
a user can determine a DM device's type by checking if elevator sysfs
attributes exist.
First allocate a minimalist request_queue structure for a DM device
(needed for both bio and request-based DM).
Initialization of a full request_queue is deferred until it is known
that the DM device is request-based, at the end of the table load
sequence.
Factor DM device's request_queue initialization:
- common to both request-based and bio-based into dm_init_md_queue().
- specific to request-based into dm_init_request_based_queue().
The md->type_lock mutex is used to protect md->queue, in addition to
md->type, during table_load().
A DM device's first table_load will establish the immutable md->type.
But md->queue initialization, based on md->type, may fail at that time
(because blk_init_allocated_queue cannot allocate memory). Therefore
any subsequent table_load must (re)try dm_setup_md_queue independently of
establishing md->type.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Determine whether a mapped device is bio-based or request-based when
loading its first (inactive) table and don't allow that to be changed
later.
This patch performs different device initialisation in each of the two
cases. (We don't think it's necessary to add code to support changing
between the two types.)
Allowed md->type transitions:
DM_TYPE_NONE to DM_TYPE_BIO_BASED
DM_TYPE_NONE to DM_TYPE_REQUEST_BASED
We now prevent table_load from replacing the inactive table with a
conflicting type of table even after an explicit table_clear.
Introduce 'type_lock' into the struct mapped_device to protect md->type
and to prepare for the next patch that will change the queue
initialization and allocate memory while md->type_lock is held.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
drivers/md/dm-ioctl.c | 15 +++++++++++++++
drivers/md/dm.c | 37 ++++++++++++++++++++++++++++++-------
drivers/md/dm.h | 5 +++++
include/linux/dm-ioctl.h | 4 ++--
4 files changed, 52 insertions(+), 9 deletions(-)
When processing barriers, skip the second flush if processing the bio
failed with -EOPNOTSUPP. This can happen with discard+barrier requests.
If the device doesn't support discard, there would be two useless
SYNCHRONIZE CACHE commands. The first dm_flush cannot be so easily
optimized out, so we leave it there.
Previously, -EOPNOTSUPP could be received in dec_pending only with empty
barriers and we ignored that error, assuming the device not supporting
cache flushes has cache always consistent. With the addition of discard
barriers, this -EOPNOTSUPP can also be generated by discards and we
must record it in md->barrier_error for process_barrier.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch fixes hard-coded value for the size of a chunk that includes
disk header for persistent snapshot. It should be changed to existing
macro NUM_SNAPSHOT_HDR_CHUNKS instead of using hard-coded value 1.
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use kstrdup when the goal of an allocation is copy a string into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to;
expression flag,E1,E2;
statement S;
@@
- to = kmalloc(strlen(from) + 1,flag);
+ to = kstrdup(from, flag);
... when != \(from = E1 \| to = E1 \)
if (to==NULL || ...) S
... when != \(from = E2 \| to = E2 \)
- strcpy(to, from);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The dm control device does not implement read/write, so it has no use for
seeking. Using no_llseek prevents falling back to default_llseek, which
requires the BKL.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch separates the device deletion code from dm_put()
to make sure the deletion happens in the process context.
By this patch, device deletion always occurs in an ioctl (process)
context and dm_put() can be called in interrupt context.
As a result, the request-based dm's bad dm_put() usage pointed out
by Mikulas below disappears.
http://marc.info/?l=dm-devel&m=126699981019735&w=2
Without this patch, I confirmed there is a case to crash the system:
dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())
Some more backgrounds and details:
In request-based dm, a device opener can remove a mapped_device
while the last request is still completing, because bios in the last
request complete first and then the device opener can close and remove
the mapped_device before the last request completes:
CPU0 CPU1
=================================================================
<<INTERRUPT>>
blk_end_request_all(clone_rq)
blk_update_request(clone_rq)
bio_endio(clone_bio) == end_clone_bio
blk_update_request(orig_rq)
bio_endio(orig_bio)
<<I/O completed>>
dm_blk_close()
dev_remove()
dm_put(md)
<<Free md>>
blk_finish_request(clone_rq)
....
dm_end_request(clone_rq)
free_rq_clone(clone_rq)
blk_end_request_all(orig_rq)
rq_completed(md)
So request-based dm used dm_get()/dm_put() to hold md for each I/O
until its request completion handling is fully done.
However, the final dm_put() can call the device deletion code which
must not be run in interrupt context and may cause kernel panic.
To solve the problem, this patch moves the device deletion code,
dm_destroy(), to predetermined places that is actually deleting
the mapped_device in ioctl (process) context, and changes dm_put()
just to decrement the reference count of the mapped_device.
By this change, dm_put() can be used in any context and the symmetric
model below is introduced:
dm_create(): create a mapped_device
dm_destroy(): destroy a mapped_device
dm_get(): increment the reference count of a mapped_device
dm_put(): decrement the reference count of a mapped_device
dm_destroy() waits for all references of the mapped_device to disappear,
then deletes the mapped_device.
dm_destroy() uses active waiting with msleep(1), since deleting
the mapped_device isn't performance-critical task.
And since at this point, nobody opens the mapped_device and no new
reference will be taken, the pending counts are just for racing
completing activity and will eventually decrease to zero.
For the unlikely case of the forced module unload, dm_destroy_immediate(),
which doesn't wait and forcibly deletes the mapped_device, is also
introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
may be stuck and never return.
And now, because the mapped_device is deleted at this point, subsequent
accesses to the mapped_device may cause NULL pointer references.
Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch changes dm_hash_remove_all() to release _hash_lock when
removing a device. After removing the device, dm_hash_remove_all()
takes _hash_lock and searches the hash from scratch again.
This patch is a preparation for the next patch, which changes device
deletion code to wait for md reference to be 0. Without this patch,
the wait in the next patch may cause AB-BA deadlock:
CPU0 CPU1
-----------------------------------------------------------------------
dm_hash_remove_all()
down_write(_hash_lock)
table_status()
md = find_device()
dm_get(md)
<increment md->holders>
dm_get_live_or_inactive_table()
dm_get_inactive_table()
down_write(_hash_lock)
<in the md deletion code>
<wait for md->holders to be 0>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch prevents access to mapped_device which is being deleted.
Currently, even after a mapped_device has been removed from the hash,
it could be accessed through idr_find() using minor number.
That could cause a race and NULL pointer reference below:
CPU0 CPU1
------------------------------------------------------------------
dev_remove(param)
down_write(_hash_lock)
dm_lock_for_deletion(md)
spin_lock(_minor_lock)
set_bit(DMF_DELETING)
spin_unlock(_minor_lock)
__hash_remove(hc)
up_write(_hash_lock)
dev_status(param)
md = find_device(param)
down_read(_hash_lock)
__find_device_hash_cell(param)
dm_get_md(param->dev)
md = dm_find_md(dev)
spin_lock(_minor_lock)
md = idr_find(MINOR(dev))
spin_unlock(_minor_lock)
dm_put(md)
free_dev(md)
dm_get(md)
up_read(_hash_lock)
__dev_status(md, param)
dm_put(md)
This patch fixes such problems.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
All the dm ioctls that generate uevents set the DM_UEVENT_GENERATED flag so
that userspace knows whether or not to wait for a uevent to be processed
before continuing,
The dm rename ioctl sets this flag but was not structured to return it
to userspace. This patch restructures the rename ioctl processing to
behave like the other ioctls that return data and so fix this.
Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove useless __dev_status call while processing an ioctl that sets up
device geometry and target message. The data is not returned to
userspace so there is no point collecting it and in the case of
target_message it is collected before processing the message so if it
did return it might be stale.
Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Validate chunk size against both origin and snapshot sector size
Don't allow chunk size smaller than either origin or snapshot logical
sector size. Reading or writing data not aligned to sector size is not
allowed and causes immediate errors.
This requires us to open the origin before initialising the
exception store and to export dm_snap_origin.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Iterate both origin and snapshot devices
iterate_devices method should call the callback for all the devices where
the bio may be remapped. Thus, snapshot_iterate_devices should call the callback
for both snapshot and origin underlying devices because it remaps some bios
to the snapshot and some to the origin.
snapshot_iterate_devices called the callback only for the origin device.
This led to badly calculated device limits if snapshot and origin were placed
on different types of disks.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
multipath_ctr() forgets to return an error after detecting
missing path parameters. Fix this.
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* 'for-linus' of git://neil.brown.name/md: (24 commits)
md: clean up do_md_stop
md: fix another deadlock with removing sysfs attributes.
md: move revalidate_disk() back outside open_mutex
md/raid10: fix deadlock with unaligned read during resync
md/bitmap: separate out loading a bitmap from initialising the structures.
md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
md/bitmap: optimise scanning of empty bitmaps.
md/bitmap: clean up plugging calls.
md/bitmap: reduce dependence on sysfs.
md/bitmap: white space clean up and similar.
md/raid5: export raid5 unplugging interface.
md/plug: optionally use plugger to unplug an array during resync/recovery.
md/raid5: add simple plugging infrastructure.
md/raid5: export is_congested test
raid5: Don't set read-ahead when there is no queue
md: add support for raising dm events.
md: export various start/stop interfaces
md: split out md_rdev_init
md: be more careful setting MD_CHANGE_CLEAN
md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
...
There is only one error exit from do_md_stop, so make that more
explicit and discard the 'err' variable.
Also drop the 'revalidate' variable by moving the unlock calls around.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the deletion of sysfs attributes from reconfig_mutex to
open_mutex didn't really help as a process can try to take
open_mutex while holding reconfig_mutex, so the same deadlock can
happen, just requiring one more process to be involved in the chain.
I looks like I cannot easily use locking to wait for the sysfs
deletion to complete, so don't.
The only things that we cannot do while the deletions are still
pending is other things which can change the sysfs namespace: run,
takeover, stop. Each of these can fail with -EBUSY.
So set a flag while doing a sysfs deletion, and fail run, takeover,
stop if that flag is set.
This is suitable for 2.6.35.x
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Commit b821eaa5 "md: remove ->changed and related code" moved
revalidate_disk() under open_mutex, and lockdep noticed.
[ INFO: possible circular locking dependency detected ]
2.6.32-mdadm-locking #1
-------------------------------------------------------
mdadm/3640 is trying to acquire lock:
(&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811acecb>] revalidate_disk+0x5b/0x90
but task is already holding lock:
(&mddev->open_mutex){+.+...}, at: [<ffffffffa055e07a>] do_md_stop+0x4a/0x4d0 [md_mod]
which lock already depends on the new lock.
It is suitable for 2.6.35.x
Cc: <stable@kernel.org>
Reported-by: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The open and release block_device_operations are currently
called with the BKL held. In order to change that, we must
first make sure that all drivers that currently rely
on this have no regressions.
This blindly pushes the BKL into all .open and .release
operations for all block drivers to prepare for the
next step. The drivers can subsequently replace the BKL
with their own locks or remove it completely when it can
be shown that it is not needed.
The functions blkdev_get and blkdev_put are the only
remaining users of the big kernel lock in the block
layer, besides a few uses in the ioctl code, none
of which need to serialize with blkdev_{get,put}.
Most of these two functions is also under the protection
of bdev->bd_mutex, including the actual calls to
->open and ->release, and the common code does not
access any global data structures that need the BKL.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct requests. This allows much easier grepping for different request
types instead of unwinding through macros.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the 'bio_split' path in raid10-read is used while
resync/recovery is happening it is possible to deadlock.
Fix this be elevating ->nr_waiting for the duration of both
parts of the split request.
This fixes a bug that has been present since 2.6.22
but has only started manifesting recently for unknown reasons.
It is suitable for and -stable since then.
Reported-by: Justin Bronder <jsbronder@gentoo.org>
Tested-by: Justin Bronder <jsbronder@gentoo.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
dm makes this distinction between ->ctr and ->resume, so we need to
too.
Also get the new bitmap_load to clear out the bitmap first, as this is
most consistent with the dm suspend/resume approach
Signed-off-by: NeilBrown <neilb@suse.de>
This allows md/raid5 to fully work as a dm target.
Normally md uses a 'filemap' which contains a list of pages of bits
each of which may be written separately.
dm-log uses and all-or-nothing approach to writing the log, so
when using a dm-log, ->filemap is NULL and the flags normally stored
in filemap_attr are stored in ->logattrs instead.
Signed-off-by: NeilBrown <neilb@suse.de>
A bitmap is stored as one page per 2048 bits.
If none of the bits are set, the page is not allocated.
When bitmap_get_counter finds that a page isn't allocate,
it just reports that one bit work of space isn't flagged,
rather than reporting that 2048 bits worth of space are
unflagged.
This can cause searches for flagged bits (e.g. bitmap_close_sync)
to do more work than is really necessary.
So change bitmap_get_counter (when creating) to report a number of
blocks that more accurately reports the range of the device for which
no counter currently exists.
Signed-off-by: NeilBrown <neilb@suse.de>
1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
arrays with no queue attached.
2/ Don't bother plugging the queue when we set a bit in the bitmap.
The reason for this was to encourage as many bits as possible to
get set before we unplug and write stuff out.
However every personality already plugs the queue after
bitmap_startwrite either directly (raid1/raid10) or be setting
STRIPE_BIT_DELAY which causes the queue to be plugged later
(raid5).
Signed-off-by: NeilBrown <neilb@suse.de>
For dm-raid45 we will want to use bitmaps in dm-targets which don't
have entries in sysfs, so cope with the mddev not living in sysfs.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes some whitespace problems
Fixed some checkpatch.pl complaints.
Replaced kmalloc ... memset(0), with kzalloc
Fixed an unlikely memory leak on an error path.
Reformatted a number of 'if/else' sets, sometimes
replacing goto with an else clause.
Removed some old comments and commented-out code.
Signed-off-by: NeilBrown <neilb@suse.de>
If an array doesn't have a 'queue' then md_do_sync cannot
unplug it.
In that case it will have a 'plugger', so make that available
to the mddev, and use it to unplug the array if needed.
Signed-off-by: NeilBrown <neilb@suse.de>
md/raid5 uses the plugging infrastructure provided by the block layer
and 'struct request_queue'. However when we plug raid5 under dm there
is no request queue so we cannot use that.
So create a similar infrastructure that is much lighter weight and use
it for raid5.
Signed-off-by: NeilBrown <neilb@suse.de>
the dm module will need this for dm-raid45.
Also only access ->queue->backing_dev_info->congested_fn
if ->queue actually exists. It won't in a dm target.
Signed-off-by: NeilBrown <neilb@suse.de>
dm-raid456 does not provide a 'queue' for raid5 to use,
so we must make raid5 stop depending on the queue.
First: read_ahead
dm handles read-ahead adjustment fully in userspace, so
simply don't do any readahead adjustments if there is
no queue.
Also re-arrange code slightly so all the accesses to ->queue are
together.
Finally, move the blk_queue_merge_bvec function into the 'if' as
the ->split_io setting in dm-raid456 has the same effect.
Signed-off-by: NeilBrown <neilb@suse.de>
dm uses scheduled work to raise events to user-space.
So allow md device to have work_structs and schedule them on an error.
Signed-off-by: NeilBrown <neilb@suse.de>
export entry points for starting and stopping md arrays.
This will be used by a module to make md/raid5 work under
dm.
Also stop calling md_stop_writes from md_stop, as that won't
work well with dm - it will want to call the two separately.
Signed-off-by: NeilBrown <neilb@suse.de>
This functionality will be needed separately in a subsequent patch, so
split it into it's own exported function.
Signed-off-by: NeilBrown <neilb@suse.de>
When MD_CHANGE_CLEAN is set we might block in md_write_start.
So we should only set it when fairly sure that something will clear
it.
There are two places where it is set so as to encourage a metadata
update to record the progress of resync/recovery. This should only
be done if the internal metadata update mechanisms are in use, which
can be tested by by inspecting '->persistent'.
Signed-off-by: NeilBrown <neilb@suse.de>
We will shortly allow md devices with no gendisk (they are attached to
a dm-target instead). That will cause mdname() to return 'mdX'.
There is one place where mdname really needs to be unique: when
creating the name for a slab cache.
So in that case, if there is no gendisk, you the address of the mddev
formatted in HEX to provide a unique name.
Signed-off-by: NeilBrown <neilb@suse.de>
We will want md devices to live as dm targets where sysfs is not
visible. So allow md to not connect to sysfs.
Signed-off-by: NeilBrown <neilb@suse.de>
There are few situations where it would make any sense to add a spare
when reducing the number of devices in an array, but it is
conceivable: A 6 drive RAID6 with two missing devices could be
reshaped to a 5 drive RAID6, and a spare could become available
just in time for the reshape, but not early enough to have been
recovered first. 'freezing' recovery can make this easy to
do without any races.
However doing such a thing is a bad idea. md will not record the
partially-recovered state of the 'spare' and when the reshape
finished it will think that the spare is still spare.
Easiest way to avoid this confusion is to simply disallow it.
Signed-off-by: NeilBrown <neilb@suse.de>
As the comment says, the tail of this loop only applies to devices
that are not fully in sync, so if In_sync was set, we should avoid
the rest of the loop.
This bug will hardly ever cause an actual problem. The worst it
can do is allow an array to be assembled that is dirty and degraded,
which is not generally a good idea (without warning the sysadmin
first).
This will only happen if the array is RAID4 or a RAID5/6 in an
intermediate state during a reshape and so has one drive that is
all 'parity' - no data - while some other device has failed.
This is certainly possible, but not at all common.
Signed-off-by: NeilBrown <neilb@suse.de>
During a recovery of reshape the early part of some devices might be
in-sync while the later parts are not.
We we know we are looking at an early part it is good to treat that
part as in-sync for stripe calculations.
This is particularly important for a reshape which suffers device
failure. Treating the data as in-sync can mean the difference between
data-safety and data-loss.
Signed-off-by: NeilBrown <neilb@suse.de>
When we are reshaping an array, the device failure combinations
that cause us to decide that the array as failed are more subtle.
In particular, any 'spare' will be fully in-sync in the section
of the array that has already been reshaped, thus failures that
affect only that section are less critical.
So encode this subtlety in a new function and call it as appropriate.
The case that showed this problem was a 4 drive RAID5 to 8 drive RAID6
conversion where the last two devices failed.
This resulted in:
good good good good incomplete good good failed failed
while converting a 5-drive RAID6 to 8 drive RAID5
The incomplete device causes the whole array to look bad,
bad as it was actually good for the section that had been
converted to 8-drives, all the data was actually safe.
Reported-by: Terry Morris <tbmorris@tbmorris.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is reshaped to have fewer devices, the reshape proceeds
from the end of the devices to the beginning.
If a device happens to be non-In_sync (which is possible but rare)
we would normally update the ->recovery_offset as the reshape
progresses. However that would be wrong as the recover_offset records
that the early part of the device is in_sync, while in fact it would
only be the later part that is in_sync, and in any case the offset
number would be measured from the wrong end of the device.
Relatedly, if after a reshape a spare is discovered to not be
recoverred all the way to the end, not allow spare_active
to incorporate it in the array.
This becomes relevant in the following sample scenario:
A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined
operation.
The RAID5->RAID6 conversion will cause a 5 drive to be included as a
spare, then the 5drive -> 6drive reshape will effectively rebuild that
spare as it progresses. The 6th drive is treated as in_sync the whole
time as there is never any case that we might consider reading from
it, but must not because there is no valid data.
If we interrupt this reshape part-way through and reverse it to return
to a 5-drive RAID6 (or event a 4-drive RAID5), we don't want to update
the recovery_offset - as that would be wrong - and we don't want to
include that spare as active in the 5-drive RAID6 when the reversed
reshape completed and it will be mostly out-of-sync still.
Signed-off-by: NeilBrown <neilb@suse.de>
The entries in the stripe_cache maintained by raid5 are enlarged
when we increased the number of devices in the array, but not
shrunk when we reduce the number of devices.
So if entries are added after reducing the number of devices, we
much ensure to initialise the whole entry, not just the part that
is currently relevant. Otherwise if we enlarge the array again,
we will reference uninitialised values.
As grow_buffers/shrink_buffer now want to use a count that is stored
explicity in the raid_conf, they should get it from there rather than
being passed it as a parameter.
Signed-off-by: NeilBrown <neilb@suse.de>