Many 'printk' messages from the raid456 module mention 'raid5' even
though it may be a 'raid6' or even 'raid4' array. This can cause
confusion.
Also the actual array name is not always reported and when it is
it is not reported consistently.
So change all the messages to start:
md/raid:%s:
where '%s' becomes e.g. md3 to identify the particular array.
Signed-off-by: NeilBrown <neilb@suse.de>
RAID10 has been available for quite a while now and is quite well
tested, so we can remove the EXPERIMENTAL designation.
Reported-by: Eric MSP Veith <eveith@wwweb-library.net>
Signed-off-by: NeilBrown <neilb@suse.de>
Level modifications change the output of mdstat. The mdmon manager
thread is interested in these events for external metadata management.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When a raid1 array is configured to support write-behind
on some devices, it normally only reads from other devices.
If all devices are write-behind (because the rest have failed)
it is possible for a read request to be serviced before a
behind-write request, which would appear as data corruption.
So when forced to read from a WriteMostly device, wait for any
write-behind to complete, and don't start any more behind-writes.
Signed-off-by: NeilBrown <neilb@suse.de>
This message seems to suggest the named device is the one on which a
read failed, however it is actually the device that the read will be
redirected to.
So make the message a little clearer.
Reported-by: Tim Burgess <ozburgess@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This is
- unnecessary because mddev_suspend is always followed by a call to
->stop, and each ->stop unregisters the thread, and
- a problem as it makes it awkwards to suspend and then resume a
device as we will want later.
Signed-off-by: NeilBrown <neilb@suse.de>
We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
at lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.
Signed-off-by: NeilBrown <neilb@suse.de>
This moves the call to the other side of set_readonly, but that should
not be an issue.
This encapsulates in 'md_stop' all of the functionality for internally
stopping the array, leaving all the interactions with externalities
(sysfs, request_queue, gendisk) in do_md_stop.
Signed-off-by: NeilBrown <neilb@suse.de>
Using do_md_stop to set an array to read-only is a little confusing.
Now most of the common code has been factored out, split
md_set_readonly off in to a separate function.
Signed-off-by: NeilBrown <neilb@suse.de>
Further refactoring of do_md_stop.
This one requires some explanation as it takes code from different
places in do_md_stop, so some re-ordering happens.
We only get into this part of do_md_stop if there are no active opens
of the device, so no writes can be happening and the device must have
been flushed. In md_stop_writes we want to stop any internal sources
of writes - i.e. resync - and flush out the metadata.
The only code that was previously before some of this code is
code to clean up the queue, the mddev, the gendisk, or sysfs, all
of which is probably better after code that makes active changes (i.e.
triggers writes).
Signed-off-by: NeilBrown <neilb@suse.de>
do_md_stop is large and clunky, so hard to understand.
This is a first step of refactoring, pulling two simple
sub-functions out.
Signed-off-by: NeilBrown <neilb@suse.de>
As part of relaxing the binding between an mddev and gendisk,
we separate do_md_run into two functions.
md_run does all the work internal to md
do_md_run calls md_run and makes and changes to gendisk
that are required.
Signed-off-by: NeilBrown <neilb@suse.de>
We set ->changed to 1 and call check_disk_change at the end
of md_open so that bd_invalidated would be set and thus
partition rescan would happen appropriately.
Now that we call revalidate_disk directly, which sets bd_invalidates,
that indirection is no longer needed and can be removed.
Signed-off-by: NeilBrown <neilb@suse.de>
Using ->array_sectors rather than get_capacity() is more
direct and is a step towards relaxing the tight connection
between mddev and gendisk.
Signed-off-by: NeilBrown <neilb@suse.de>
While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.
Signed-off-by: NeilBrown <neilb@suse.de>
Diving through ->queue to find mddev is unnecessarily complex - there
is an easier path to finding mddev, so use that.
Signed-off-by: NeilBrown <neilb@suse.de>
Level changes can be very significant, so make sure
to notify them via sysfs.
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When metadata is being managed by user-space, md doesn't know
what the maximum number of devices allowed in an array is
so ->max_disks is 0. In this case we should allow any (+ve)
number of disks.
Signed-off-by: NeilBrown <neilb@suse.de>
Writing "none" to "../md/dev-xx/slot" removes that device
from being an active part of the array, but it didn't
set ->raid_disk to -1 to record this fact.
Signed-off-by: Maciej Trela <Maciej.Trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In a subsequent patch we will make it possible to change
mddev->raid_disks while a RAID0 or RAID10 array is active. This is
part of the process of reshaping such an array.
This means that we cannot use this value while processes requests
(it is OK to use it during initialisation as we are locked against
changes then).
Both RAID0 and RAID10 have the same value stored in the private data
structure, so use that value instead.
Signed-off-by: NeilBrown <neilb@suse.de>
This was needed when sysfs files could only be 'notified'
from process context. Now that we have sys_notify_direct,
we can call it directly from an interrupt.
Signed-off-by: NeilBrown <neilb@suse.de>
void pointers do not need to be cast to other pointer types.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Keep track of the maximum number of concurrent write-behind requests
for an md array and exposed this number in sysfs at
md/bitmap/max_backlog_used
Writing any value to this file will clear it.
This allows userspace to be involved in tuning bitmap/backlog.
Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: NeilBrown <neilb@suse.de>
These fields have never been used.
commit 4b6d287f62
added them, but also added identical files to bitmap_super_s,
and only used the latter.
So remove these unused fields.
Signed-off-by: NeilBrown <neilb@suse.de>
There is a very small race window when writing to a
RAID1 such that if a device is marked faulty at exactly the wrong
time, the write-in-progress will not be sent to the device,
but the bitmap (if present) will be updated to say that
the write was sent.
Then if the device turned out to still be usable as was re-added
to the array, the bitmap-based-resync would skip resyncing that
block, possibly leading to corruption. This would only be a problem
if no further writes were issued to that area of the device (i.e.
that bitmap chunk).
Suitable for any pending -stable kernel.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Some levels expect the 'redundancy group' to be present,
others don't.
So when we change level of an array we might need to
add or remove this group.
This requires fixing up the current practice of overloading ->private
to indicate (when ->pers == NULL) that something needs to be removed.
So create a new ->to_remove to fill that role.
When changing levels, we may need to add or remove attributes. When
changing RAID5 -> RAID6, we both add and remove the same thing. It is
important to catch this and optimise it out as the removal is delayed
until a lock is released, so trying to add immediately would cause
problems.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is stopped we need to remove some
sysfs files which are dependent on the type of array.
We need to delay that deletion as deleting them while holding
reconfig_mutex can lead to deadlocks.
We currently delay them until the array is completely destroyed.
However it is possible to deactivate and then reactivate the array.
It is also possible to need to remove sysfs files when changing level,
which can potentially happen several times before an array is
destroyed.
So we need to delete these files more promptly: as soon as
reconfig_mutex is dropped.
We need to ensure this happens before do_md_run can restart the array,
so we use open_mutex for some extra locking. This is not deadlock
prone.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Since commit ef286f6fa6
it has been important that each personality clears
->private in the ->stop() function, or sets it to a
attribute group to be removed.
linear.c doesn't. This can sometimes lead to an oops,
though it doesn't always.
Suitable for 2.6.33-stable and 2.6.34.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
When the user sets the block device to readwrite then the mddev should
follow suit. Otherwise, the BUG_ON in md_write_start() will be set to
trigger.
The reverse direction, setting mddev->ro to match a set readonly
request, can be ignored because the blkdev level readonly flag precludes
the need to have mddev->ro set correctly. Nevermind the fact that
setting mddev->ro to 1 may fail if the array is in use.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a component device has a merge_bvec_fn then as we never call it
we must ensure we never need to. Currently this is done by setting
max_sector to 1 PAGE, however this does not stop a bio being created
with several sub-page iovecs that would violate the merge_bvec_fn.
So instead set max_segments to 1 and set the segment boundary to the
same as a page boundary to ensure there is only ever one single-page
segment of IO requested at a time.
This can particularly be an issue when 'xen' is used as it is
known to submit multiple small buffers in a single bio.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
To prevent deadlock, bios in the hold list should be flushed before
dm_rh_stop_recovery() is called in mirror_suspend().
The recovery can't start because there are pending bios and therefore
dm_rh_stop_recovery deadlocks.
When there are pending bios in the hold list, the recovery waits for
the completion of the bios after recovery_count is acquired.
The recovery_count is released when the recovery finished, however,
the bios in the hold list are processed after dm_rh_stop_recovery() in
mirror_presuspend(). dm_rh_stop_recovery() also acquires recovery_count,
then deadlock occurs.
Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Eliminate a 4-byte hole in 'struct dm_io_memory' by moving 'offset' above the
'ptr' to which it applies (size reduced from 24 to 16 bytes). And by
association, 1-4 byte hole is eliminated in 'struct dm_io_request' (size
reduced from 56 to 48 bytes).
Eliminate all 6 4-byte holes and 1 cache-line in 'struct dm_snapshot' (size
reduced from 392 to 368 bytes).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Set a new DM_UEVENT_GENERATED_FLAG when returning from ioctls to
indicate that a uevent was actually generated. This tells the userspace
caller that it may need to wait for the event to be processed.
Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Free the dm_io structure before calling bio_endio() instead of after it,
to ensure that the io_pool containing it is not referenced after it is
freed.
This partially fixes a problem described here
https://www.redhat.com/archives/dm-devel/2010-February/msg00109.html
thread 1:
bio_endio(bio, io_error);
/* scheduling happens */
thread 2:
close the device
remove the device
thread 1:
free_io(md, io);
Thread 2, when removing the device, sees non-empty md->io_pool (because the
io hasn't been freed by thread 1 yet) and may crash with BUG in mempool_free.
Thread 1 may also crash, when freeing into a nonexisting mempool.
To fix this we must make sure that bio_endio() is the last call and
the md structure is not accessed afterwards.
There is another bio_endio in process_barrier, but it is called from the thread
and the thread is destroyed prior to freeing the mempools, so this call is
not affected by the bug.
A similar bug exists with module unloads - the module may be unloaded
immediately after bio_endio - but that is more difficult to fix.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove unused parameters(start and len) of dm_get_device()
and fix the callers.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Only issue a uevent on a resume if the state of the device changed,
i.e. if it was suspended and/or its table was replaced.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If all mirror legs fail, always return an error instead of holding the
bio, even if the handle_errors option was set. At present it is the
responsibility of the driver underneath us to deal with retries,
multipath etc.
The patch adds the bio to the failures list instead of holding it
directly. do_failures tests first if all legs failed and, if so,
returns the bio with -EIO. If any leg is still alive and handle_errors
is set, do_failures calls hold_bio.
Reviewed-by: Takahiro Yasui <tyasui@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch pulls the pg_init path activation code out of
process_queued_ios() into a new function.
No functional change.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When suspending the device we must wait for all I/O to complete, but
pg-init may be still in progress even after flushing the workqueue
for kmpath_handlerd in multipath_postsuspend.
This patch waits for pg-init completion correctly in
multipath_postsuspend().
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>