This export is required for clustering module in order to
co-ordinate remove/readd a rdev from all nodes.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00
"
The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the
generic_start_io_acct to account the disk stats rather than the open code,
but it also introduced the increase to .in_flight[rw] which is needless to
md. So we re-use the open code here to fix it.
Reported-by: Simon Kirby <sim@hostway.ca>
Cc: <stable@vger.kernel.org> 3.19
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A --cluster-confirm without an --add (by another node) can
crash the kernel.
Fix it by guarding it using a state.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If ->run() fails, it can either free the data structures it
allocated, or leave that task to ->free() which will be called
on failures.
However:
md.c calls ->free() even if ->private_data is NULL, which
causes problems in some personalities.
raid0.c frees the data, but doesn't clear ->private_data,
which will become a problem when we fix md.c
So better fix both these issues at once.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 5aa61f427e
URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381
Signed-off-by: NeilBrown <neilb@suse.de>
Recent change to bitmap_create mishandles errors.
In particular a failure doesn't alway cause 'err' to be set.
Signed-off-by: NeilBrown <neilb@suse.de>
Since __ATTR_PREALLOC was introduced in v3.19-rc1~78^2~18
it can now be used by md.
This ensure that writing to these sysfs attributes will never
block due to a memory allocation.
Such blocking could become a deadlock if mdmon is trying to
reconfigure an array after a failure prior to re-enabling writes.
Signed-off-by: NeilBrown <neilb@suse.de>
Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
was found:
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
disc.number set to slot number)
ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
as SpareLocal
9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
If there is a resync going on, all nodes must suspend writes to the
range. This is recorded in the suspend_info/suspend_list.
If there is an I/O within the ranges of any of the suspend_info,
should_suspend will return 1.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When a resync is initiated, RESYNCING message is sent to all active
nodes with the range (lo,hi). When the resync is over, a RESYNCING
message is sent with (0,0). A high sector value of zero indicates
that the resync is over.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, this is detected using
events recorded in the devices. If it is old as compared to the mddev
mark it is faulty.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
- request to send a message
- make changes to superblock
- send messages telling everyone that the superblock has changed
- other nodes all read the superblock
- other nodes all ack the messages
- updating node release the "I'm sending a message" resource.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When a node joins, it does not know of other nodes performing resync.
So, each node keeps the resync information in it's LVB. When a new
node joins, it reads the LVB of each "online" bitmap.
[TODO] The new node attempts to get the PW lock on other bitmap, if
it is successful, it reads the bitmap and performs the resync (if
required) on it's behalf.
If the node does not get the PW, it requests CR and reads the LVB
for the resync information.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
DLM offers callbacks when a node fails and the lock remastery
is performed:
1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
can start
3. recover_done: called when all nodes have completed recover_slot
recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
md_cluster_info stores the cluster information in the MD device.
The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
1. Setup a DLM lockspace
2. Setup all initial locks such as super block locks and bitmap lock (will come later)
The leave() clears up the lockspace and all the locks held.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Rather than using mddev_lock() to take the reconfig_mutex
when writing to any md sysfs file, we only take mddev_lock()
in the particular _store() functions that require it.
Admittedly this is most, but it isn't all.
This also allows us to remove special-case handling for new_dev_store
(in md_attr_store).
Signed-off-by: NeilBrown <neilb@suse.de>
The one which is not inline (mddev_unlock) gets EXPORTed.
This makes the locking available to personality modules so that it
doesn't have to be imposed upon them.
Signed-off-by: NeilBrown <neilb@suse.de>
There are interdependencies between these two sysfs attributes
and whether a resync is currently running.
Rather than depending on reconfig_mutex to ensure no races when
testing these interdependencies are met, use the spinlock.
This will allow the mutex to be remove from protecting this
code in a subsequent patch.
Signed-off-by: NeilBrown <neilb@suse.de>
There isn't really much room for races with ->safemode_delay.
But as I am trying to clean up any racy code and will soon
be removing reconfig_mutex protection from most _store()
functions:
- only set mddev->safemode_delay once, to ensure no code
can see an intermediate value
- use safemode_timer to call md_safemode_timeout() rather than
calling it directly, to ensure it never races with itself.
Signed-off-by: NeilBrown <neilb@suse.de>
It makes more sense to report bitmap_info->file, rather than
bitmap->file (the later is only available once the array is
active).
With that change, use mddev->lock to protect bitmap_info being
set to NULL, and we can call get_bitmap_file() without taking
the mutex.
Signed-off-by: NeilBrown <neilb@suse.de>
1/ delay setting mddev->bitmap_info.file until 'f' looks
usable, so we don't have to unset it.
2/ Don't allow bitmap file to be set if bitmap_info.file
is already set.
Signed-off-by: NeilBrown <neilb@suse.de>
'buf' is only used because d_path fills from the end of the
buffer instead of from the start.
We don't need a separate buf to handle that, we just need to use
memmove() to move the string to the start.
Signed-off-by: NeilBrown <neilb@suse.de>
No rdev attributes need locking for 'show', though
state_show() might benefit from ensuring it sees a
consistent set of flags.
None even use rdev->mddev, so testing for it isn't really
needed and it certainly doesn't need to be held constant.
So improve state_show() and remove the locking.
Signed-off-by: NeilBrown <neilb@suse.de>
Most attributes can be read safely without any locking.
A race might lead to a slightly out-dated value, but nothing wrong.
We already have locking in some places where needed.
All that remains is can_clear_show(), behind_writes_used_show()
and action_show() which are easily fixed.
Signed-off-by: NeilBrown <neilb@suse.de>
The only access in md_seq_show that could suffer from races
not protected by ->lock is walking the rdev list.
This can receive sufficient protection from 'rcu'.
So use rdev_for_each_rcu() and get rid of mddev_lock().
Now reading /proc/mdstat will never block in md_seq_show.
Signed-off-by: NeilBrown <neilb@suse.de>
->pers is already protected by ->reconfig_mutex, and
cannot possibly change when there are threads running or
outstanding IO.
However there are some places where we access ->pers
not in a thread or IO context, and where ->reconfig_mutex
is unnecessarily heavy-weight: level_show and md_seq_show().
So protect all changes, and those accesses, with ->lock.
This is a step toward taking those accesses out from under
reconfig_mutex.
[Fixed missing "mddev->pers" -> "pers" conversion, thanks to
Dan Carpenter <dan.carpenter@oracle.com>]
Signed-off-by: NeilBrown <neilb@suse.de>
Gather all the changes that can happen atomically and might
be relevant to other code into one place. This will
make it easier to refine the locking.
Note that this puts quite a few things between mddev_detach()
and ->free(). Enabling this was the point of some recent patches.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that the ->stop function only frees the private data,
rename is accordingly.
Also pass in the private pointer as an arg rather than using
mddev->private. This flexibility will be useful in level_store().
Finally, don't clear ->private. It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before ->to_remove was
introduced).
Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.
Signed-off-by: NeilBrown <neilb@suse.de>
Each md personality has a 'stop' operation which does two
things:
1/ it finalizes some aspects of the array to ensure nothing
is accessing the ->private data
2/ it frees the ->private data.
All the steps in '1' can apply to all arrays and so can be
performed in common code.
This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.
So split the 'step 1' functionality out into a new mddev_detach().
Signed-off-by: NeilBrown <neilb@suse.de>
There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.
So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.
Signed-off-by: NeilBrown <neilb@suse.de>
There is currently no locking around calls to the 'congested'
bdi function. If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.
So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.
When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock. Fortunately this is the case.
Signed-off-by: NeilBrown <neilb@suse.de>
This lock is used for (slightly) more than helping with writing
superblocks, and it will soon be extended further. So the
name is inappropriate.
Also, the _irq variant hasn't been needed since 2.6.37 as it is
never taking from interrupt or bh context.
So:
-rename write_lock to lock
-document what it protects
-remove _irq ... except in md_flush_request() as there
is no wait_event_lock() (with no _irq). This can be
cleaned up after appropriate changes to wait.h.
Signed-off-by: NeilBrown <neilb@suse.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVIzozznsnt1WYoG5AQKEOQ/9G31KYGhe3/Tn+Hhyy7o16+6fjOBSUN2z
6GCGtOuHhom0Q+y4cdeAohw+eW9bJPqc9tRyRrLnVCuAkeymPP1t7fIRLGAuOHq5
QjjX7ZwKK8urmPAFIxRFgJwrJctFEoaCQiZHk4Pcx9YcYm4paJP2liqqQ0khGPcl
KGGjWgD3KMAG0moJzNmLMWBZevG3SLCTirPxOcDJuBkQ7KfP0e4UGVn/UYXdgF+f
73yForADVIE/Jq/CXnsOEZa/nPTM7DIsXZAQqqFCnkUiuCvwhgfkowxhwCExx2d6
w3+I69anD1WII1Z27bNDBgnJGR5xvdiAl6NTJ+GY1uMCo3lnV9OYpQrJxAEJP4dC
nXILm6SVm7WLfLvVY1lJm3wZndIyWNmTcDxODGyMZhAY4RmjDHG+AZ9tcnY1pM+9
rt0jHbmaj/ZZ5jKbGxPKBlQ3/U7EP8d4oiruOaDWCGT4sN+e2dbZVLpaWoNaBFUx
eyUft+DEvY82ryz2Ktn/Exsx1aWk2Cy3cMl0ELq2L/2WAcjhA6gsikNmKG7tfafn
Bd6WIGzAygATaFs4hOtFANyXqWubWi8TgLUHXNYUkJ2JooaMuSIfWrpeMNJyHHU7
5bWTaSZqZ86Ygb6upvJqaA3olePMo1RPBsOev46TnuFT9vtfpzPRy6FlvbXfx8sK
M5YCqNpTDTs=
=Ar2y
-----END PGP SIGNATURE-----
Merge tag 'md/3.19' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"Three fixes for md.
I did have a largish set of locking changes queued, but late testing
showed they weren't quite as stable as I thought and while I fixed
what I found, I decided it safer to delay them a release ...
particularly as I'll be AFK for a few weeks. So expect a larger batch
next time :-)"
* tag 'md/3.19' of git://neil.brown.name/md:
md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.
md: fix semicolon.cocci warnings
md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
A recent change to md started the ->sync_thread from a asynchronously
from a work_queue rather than synchronously. This means that there
can be a small window between the time when MD_RECOVERY_RUNNING is set
and when ->sync_thread is set.
So code that checks ->sync_thread might now conclude that the thread
has not been started and (because a lock is held) will not be started.
That is no longer the case.
Most of those places are best fixed by testing MD_RECOVERY_RUNNING
as well. To make this completely reliable, we wake_up(&resync_wait)
after clearing that flag as well as after clearing ->sync_thread.
Other places are better served by flushing the relevant workqueue
to ensure that that if the sync thread was starting, it has now
started. This is particularly best if we are about to stop the
sync thread.
Fixes: ac05f25669
Signed-off-by: NeilBrown <neilb@suse.de>
md_check_recovery will skip any recovery and also clear
MD_RECOVERY_NEEDED if MD_RECOVERY_FROZEN is set.
So when we clear _FROZEN, we must set _NEEDED and ensure that
md_check_recovery gets run.
Otherwise we could miss out on something that is needed.
In particular, this can make it impossible to remove a
failed device from an array is the 'recovery-needed' processing
didn't happen.
Suitable for stable kernels since 3.13.
Cc: stable@vger.kernel.org (3.13+)
Reported-and-tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Fixes: 30b8feb730
Signed-off-by: NeilBrown <neilb@suse.de>
All the interesting information printed by this ioctl
is provided in /proc/mdstat and/or sysfs.
So it isn't needed and isn't used and would be best if it didn't
exist.
Signed-off-by: NeilBrown <neilb@suse.de>
Most of the places that call this are doing so pointlessly.
A couple of the others a best replaced with WARN_ON().
Signed-off-by: NeilBrown <neilb@suse.de>
unknown ioctls no longer get this deep into md_ioctl since
md_ioctl_valid() was introduced in 3.14.
So remove the test and the misleading comment.
Signed-off-by: NeilBrown <neilb@suse.de>
If an array is active, devices can be marked 'faulty', but simply
removing the 'sync' flag is wrong. That only makes sense
for an array which is not active (and is probably only useful
for testing anyway).
Signed-off-by: NeilBrown <neilb@suse.de>
The main 'md' thread is needed for processing writes, so if it blocks
write requests could be delayed.
Starting a new thread requires some GFP_KERNEL allocations and so can
wait for writes to complete. This can deadlock.
So instead, ask a workqueue to start the sync thread.
There is no particular rush for this to happen, so any work queue
will do.
MD_RECOVERY_RUNNING is used to ensure only one thread is started.
Reported-by: BillStuff <billstuff2001@sbcglobal.net>
Signed-off-by: NeilBrown <neilb@suse.de>
We don't really need the full mddev_lock here, and having to
drop it is messy.
RCU is enough to protect these lists.
Signed-off-by: NeilBrown <neilb@suse.de>
printk may cause long time lapse if value of printk_delay in sysctl is
configured large by user. If register_md_personality takes long time to print in
spinlock pers_lock, we may encounter high CPU usage rate when there are other
pers_lock competitors who may be blocked to spin.
We can avoid this condition by moving printk out of coverage of pers_lock
spinlock.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In general we don't allow an array to be stopped if it is in use.
However if the array hasn't really been started yet, then any
apparent use is an anomily, probably due to 'udev' or similar
having a look to see what is there.
This means that if something goes wrong while assembling an array
it cannot reliably be un-assembled - STOP_ARRAY could fail.
There is no value here, so change do_md_stop() to succeed
despite concurrent opens if the array has not yet been
activated. i.e. if ->pers is NULL.
Reported-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
An array can only accept a bitmap if it will call bitmap_daemon_work
periodically, which means it needs a thread running.
If there is no thread, don't allow a bitmap to be added.
Signed-off-by: NeilBrown <neilb@suse.de>
When we calculate the speed of recovery, the numerator that contains
the recovery done sectors. It's need to subtract the sectors which
don't finish recovery.
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The way md devices are traditionally created in the kernel
is simply to open the device with the desired major/minor number.
This can be problematic as some support tools, notably udev and
programs run by udev, can open a device just to see what is there, and
find that it has created something. It is easy for a race to cause
udev to open an md device just after it was destroy, causing it to
suddenly re-appear.
For some time we have had an alternate way to create md devices
echo md_somename > /sys/modules/md_mod/paramaters/new_array
This will always use a minor number of 512 or higher, which mdadm
normally avoids.
Using this makes the creation-by-opening unnecessary, but does
not disable it, so it is still there to cause problems.
This patch disable probing for devices with a major of 9 (MD_MAJOR)
and a minor of 512 and up. This devices created by writing to
new_array cannot be re-created by opening the node in /dev.
Signed-off-by: NeilBrown <neilb@suse.de>
When we write to a degraded array which has a bitmap, we
make sure the relevant bit in the bitmap remains set when
the write completes (so a 're-add' can quickly rebuilt a
temporarily-missing device).
If, immediately after such a write starts, we incorporate a spare,
commence recovery, and skip over the region where the write is
happening (because the 'needs recovery' flag isn't set yet),
then that write will not get to the new device.
Once the recovery finishes the new device will be trusted, but will
have incorrect data, leading to possible corruption.
We cannot set the 'needs recovery' flag when we start the write as we
do not know easily if the write will be "degraded" or not. That
depends on details of the particular raid level and particular write
request.
This patch fixes a corruption issue of long standing and so it
suitable for any -stable kernel. It applied correctly to 3.0 at
least and will minor editing to earlier kernels.
Reported-by: Bill <billstuff2001@sbcglobal.net>
Tested-by: Bill <billstuff2001@sbcglobal.net>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net
Signed-off-by: NeilBrown <neilb@suse.de>
If an array has a bitmap, the when we set the "has bitmap" flag we
incorrectly clear the "is clean" flag.
"is clean" isn't really important when a bitmap is present, but it is
best to get it right anyway.
Reported-by: George Duffield <forumscollective@gmail.com>
Link: http://lkml.kernel.org/CAG__1a4MRV6gJL38XLAurtoSiD3rLBTmWpcS5HYvPpSfPR88UQ@mail.gmail.com
Fixes: 36fa30636f (v2.6.14)
Signed-off-by: NeilBrown <neilb@suse.de>
Julia Lawall and coccinelle report that md_clear_badblocks always
returns 0, despite appearing to have an error path.
The error path really should return an error code. ENOSPC is
reasonably appropriate.
Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.de>
read-only arrays should not be changed. This includes changing
the level, layout, size, or number of devices.
So reject those changes for readonly arrays.
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 8313b8e57f
md: fix problem when adding device to read-only array with bitmap.
added a called to md_reap_sync_thread() which cause a reshape thread
to be interrupted (in particular, it could cause md_thread() to never even
call md_do_sync()).
However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not
know that the reshape didn't complete.
This only happens when mddev->ro is set and normally reshape threads
don't run in that situation. But raid5 and raid10 can start a reshape
thread during "run" is the array is in the middle of a reshape.
They do this even if ->ro is set.
So it is best to set MD_RECOVERY_INTR before abortingg the
sync thread, just in case.
Though it rare for this to trigger a problem it can cause data corruption
because the reshape isn't finished properly.
So it is suitable for any stable which the offending commit was applied to.
(3.2 or later)
Fixes: 8313b8e57f
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>
If mddev->ro is set, md_to_sync will (correctly) abort.
However in that case MD_RECOVERY_INTR isn't set.
If a RESHAPE had been requested, then ->finish_reshape() will be
called and it will think the reshape was successful even though
nothing happened.
Normally a resync will not be requested if ->ro is set, but if an
array is stopped while a reshape is on-going, then when the array is
started, the reshape will be restarted. If the array is also set
read-only at this point, the reshape will instantly appear to success,
resulting in data corruption.
Consequently, this patch is suitable for any -stable kernel.
Cc: stable@vger.kernel.org (any)
Signed-off-by: NeilBrown <neilb@suse.de>
If an md array with externally managed metadata (e.g. DDF or IMSM)
is in use, then we should not set safemode==2 at shutdown because:
1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata
and md_check_recover enters an infinite loop.
Even at shutdown, an infinite-looping process can be problematic, so this
could cause shutdown to hang.
Cc: stable@vger.kernel.org (any kernel)
Signed-off-by: NeilBrown <neilb@suse.de>
If md-mod is unloaded while some process is in poll() or select(),
then that process maintains a pointer to md_event_waiters, and when
the try to unlink from that list, they will oops.
The procfs infrastructure ensures that ->poll won't be called after
remove_proc_entry, but doesn't provide a wait_queue_head for us to
use, and the waitqueue code doesn't provide a way to remove all
listeners from a waitqueue.
So we need to:
1/ make sure no further references to md_event_waiters are taken (by
setting md_unloading)
2/ wake up all processes currently waiting, and
3/ wait until all those processes have disconnected from our
wait_queue_head.
Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
md bitmap code currently tries to use i_writecount to stop any other
process from writing to out bitmap file. But that is really an abuse
and has bit-rotted so locking is all wrong.
So discard that - root should be allowed to shoot self in foot.
Still use it in a much less intrusive way to stop the same file being
used as bitmap on two different array, and apply other checks to
ensure the file is at least vaguely usable for bitmap storage
(is regular, is open for write. Support for ->bmap is already checked
elsewhere).
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
All bug fixes, two tagged for -stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAUuG8WTnsnt1WYoG5AQKW7A//V93TYUiUAG6zNUFrNZjuoXP0ym3jlpkH
eIIFdcV7rr0Irgtd8+s9cW8Cjsbq3d/vMbFlwP1Co32mnCnFFojeKtCvM9GqkYrH
o4Zr1nVAVYzKO4awByK3wBT9WbzEc/XlDgYQIpZExIYeZdzLOm6HyvlRbcE86Ug5
QoGYOUlLu4LUZmFgB9zQ7JM0GACV5pS1afSObtACj2t2x5GVHNU84u+M+D8urPXO
wnf+AIAzquh5F+8MX+DxmMEUaSzUHf8fXOM3jYVbzPI71SpaHssL4SwBn+j4I/8/
SCSqeIh7qMSuqUy63/iHKCy5qAgNuRdL9fYlOTkpxzHm81Ddj8u7fySsApggVOa2
yeKTkSRlsMFeu+LiGKNi/fINVxboaoYJVZ2DTNtKxSuW2VL2aPNz1Qjq4QnR3nSI
LpaB3VeVKdMsH8Em1a8cgZWcjo5YFAcNtUnJq2fvj9VZ3SJNw4ZoKDL+l718iGao
xIwAXMSafAHQVAAaNVFkwrea13TeOyxikY5Ra4vWfm+Fw8TzmYq5DqO0zaILwdAJ
2FnNj2/2y3hk2K7qBcEvjjEakxPlTwzrzxZMfJDRMuQLqvrjbXiMGOnWzgl1D/9x
4/uPjeFZLG7byxmIyg4Y83NkPgkWnRPpGK98r26pUH1UgnRF0a5aUFXQk7rsQrU6
noRkZ9EPD8s=
=d81E
-----END PGP SIGNATURE-----
Merge tag 'md/3.14' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"All bug fixes, two tagged for -stable"
* tag 'md/3.14' of git://neil.brown.name/md:
md/raid5: close recently introduced race in stripe_head management.
md/raid5: fix long-standing problem with bitmap handling on write failure.
md: check command validity early in md_ioctl().
md: ensure metadata is writen after raid level change.
md/raid10: avoid fullsync when not necessary.
md: allow a partially recovered device to be hot-added to an array.
md: Change handling of save_raid_disk and metadata update during recovery.
Verify that the cmd parameter passed to md_ioctl() is valid before
doing anything.
This fixes mddev->hold_active being set to 0 when an invalid ioctl
command is passed to md_ioctl() before the array has been configured.
Clearing mddev->hold_active in that case can lead to a livelock
situation when an invalid ioctl number is given to md_ioctl() by a
process when the mddev is currently being opened by another process:
Process 1 Process 2
--------- ---------
md_alloc()
mddev_find()
-> returns a new mddev with
hold_active == UNTIL_IOCTL
add_disk()
-> sends KOBJ_ADD uevent
(sees KOBJ_ADD uevent for device)
md_open()
md_ioctl(INVALID_IOCTL)
-> returns ENODEV and clears
mddev->hold_active
md_release()
md_put()
-> deletes the mddev as
hold_active is 0
md_open()
mddev_find()
-> returns a newly
allocated mddev with
mddev->gendisk == NULL
-> returns with ERESTARTSYS
(kernel restarts the open syscall)
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: NeilBrown <neilb@suse.de>
All of these fix real bugs the people have hit, and are tagged
for -stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIVAwUAUtYZqznsnt1WYoG5AQK50g//XuqVR/esIpGR+knf+1sD3Zk85Vp33kGL
2UfbQbi40q/uLjBhJhOSkx/sYGw1Eo255vNX+yMVjYT9F+xbhI8vlLfecqx5Fk5J
M+JH1sM7E2T79boFLoOBGSl/qppsQsPHa3p87FmFHQrrAuEMIbFiP98MnQjdSiv4
Cu9cAR7x7njepHeMXBFiV7URaYtCHAXR9iMdkebkKIFlfND8w2QYD+LWo3SzBKs9
jTrSBJRpXLHE+bZLOQPhAryb7nWkcT1R7N0vsVMQKcq1o6ZiRNnk/B9xNtV34hkc
5zwTPe/d5AsV6Tsxg0dSs7xcBn/A+F5lg8fzdOhyE1F13COmB7sepjPTMPAy/oP1
zjyPwnnWkHMDUW2usf3aqPMt+LGMofRCJHXjkqpMgIWQ96SQUY8F9PPxchkUCsx/
A38I+vXl2jGDHh/DFSduef3sDOF6TYyKyLteJftyny96dc1RutrZSbHPdrkDz1YQ
6zcyvpv0FexiXITrLg70FG8fnRMK91ZfHrmuzVP7tpm2TyeIfDriLhTAIXAcXHOT
l22a1bNj4shFfztnD0CbH6nY/iJM7ov0x5+IyG5/iYbipon02MenQeV9km6JVwQb
OCGHYCTswiFSduX1E1ru52dHXifbANWgzcUH0sjGQ0YZNmxvPRBWDjB1H2J1auzW
J8T10qimw1w=
=uvyl
-----END PGP SIGNATURE-----
Merge tag 'md/3.13-fixes' of git://neil.brown.name/md
Pull late md fixes from Neil Brown:
"Half a dozen md bug fixes.
All of these fix real bugs the people have hit, and are tagged for
-stable. Sorry they are late .... Christmas holidays and all that.
Hopefully they can still squeak into 3.13"
* tag 'md/3.13-fixes' of git://neil.brown.name/md:
md: fix problem when adding device to read-only array with bitmap.
md/raid10: fix bug when raid10 recovery fails to recover a block.
md/raid5: fix a recently broken BUG_ON().
md/raid1: fix request counting bug in new 'barrier' code.
md/raid10: fix two bugs in handling of known-bad-blocks.
md/raid5: Fix possible confusion when multiple write errors occur.
level_store() currently does not make sure the metadata is
updates to reflect the new raid level. It simply sets MD_CHANGE_DEVS.
Any level with a ->thread will quickly notice this and update the
metadata. However RAID0 and Linear do not have a thread so no
metadata update happens until the array is stopped. At that point the
metadata is written.
This is later that we would like. While the delay doesn't risk any
data it can cause confusion. So if there is no md thread, immediately
update the metadata after a level change.
Reported-by: Richard Michael <rmichael@edgeofthenet.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When adding a new device into an array it is normally important to
clear any stale data from ->recovery_offset else the new device may
not be recovered properly.
However when re-adding a device which is known to be nearly in-sync,
this is not needed and can be detrimental. The (bitmap-based)
resync will still happen, and further recovery is only needed from
where-ever it was already up to.
So if save_raid_disk is set, signifying a re-add, don't clear
->recovery_offset.
Signed-off-by: NeilBrown <neilb@suse.de>
Since commit d70ed2e4fa
MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences. This patch
changes things to make them work better.
At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.
Ideally the recovery should pick up where it left off.
However the metadata cannot represent the state "A recovery is in
process which is guided by the bitmap".
Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>". So after
a restart, a full recovery (not bitmap-guided) would happen from
where-ever it was up to.
After the commit the metadata wasn't updated so it still said "This
device is fully in sync with <this> event count". That leads to a
bitmap-based recovery following the whole bitmap, which should be a
lot less work than a full recovery from some starting point. So this
was an improvement.
However updates some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date device
record that the array has all devices present (even though some are
recovering). So on restart, mdadm wants to find all devices and
expects them to have current event counts.
Obviously it doesn't (some have old event counts) so (when assembling
with --incremental) it waits indefinitely for the rest of the expected
devices.
It really is wrong to not update all the metadata together. Do that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata. i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a Feature flag to say that recovery is happening. We
now add another one to say that it is a bitmap-based recovery.
With this we can remove the code that disables the write-out of
metadata on some devices.
So this patch:
- moves the setting of 'saved_raid_disk' from add_new_disk to
the validate_super methods. This makes sure it is always set
properly, both when adding a new device to an array, and when
assembling an array from a collection of devices.
- Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a
bitmap-based recovery is allowed.
This is only present in v1.x metadata. v0.90 doesn't support
devices which are in the middle of recovery at all.
- Only skips writing metadata to Faulty devices.
- Also allows rdev state to be set to "-insync" via sysfs.
This can be used for external-metadata arrays. When the
'role' is set the device is assumed to be in-sync. If, after
setting the role, we set the state to "-insync", the role is
moved to saved_raid_disk which effectively says the device is
partly in-sync with that slot and needs a bitmap recovery.
Cc: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.
If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.
However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update. We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.
This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag. If that is set,
then the device will not be added to a read-only array.
Cc: Andrei Warkentin <andreiw@vmware.com>
Fixes: d70ed2e4fa
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
commit 7a0a5355cb md: Don't test all of mddev->flags at once.
made most tests on mddev->flags safer, but missed one.
When
commit 260fa034ef md: avoid deadlock when dirty buffers during md_stop.
added MD_STILL_CLOSED, this caused md_check_recovery to misbehave.
It can think there is something to do but find nothing. This can
lead to the md thread spinning during array shutdown.
https://bugzilla.kernel.org/show_bug.cgi?id=65721
Reported-and-tested-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 260fa034ef
Cc: stable@vger.kernel.org (3.12)
Signed-off-by: NeilBrown <neilb@suse.de>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io
during resync.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUovzDznsnt1WYoG5AQJ1pQ//bDuXadoJ5dwjWjVxFOKoQ9j/9joEI0yH
XTApD3ADKckdBc4TSLOIbCNLW1Pbe23HlOI/GjCiJ/7mePL3OwHd7Fx8Rfq3BubV
f7NgjVwu8nwYD0OXEZsshImptEtrbYwQdy+qlKcHXcZz1MUfR+Egih3r/ouTEfEt
FNq/6MpyN0IKSY82xP/jFZgesBucgKz/YOUIbwClxm7UiyISKvWQLBIAfLB3dyI3
HoEdEzQX6I56Rw0mkSUG4Mk+8xx/8twxL+yqEUqfdJREWuB56Km8kl8y/e465Nk0
ZZg6j/TrslVEwbEeVMx0syvYcaAWFZ4X2jdKfo1lI0g9beZp7H1GRF8yR1s2t/h4
g/vb55MEN++4LPaE9ut4z7SG2yLyGkZgFTzTjyq5of+DFL0cayO7wXxbgpcD7JYf
Doef/OSa6csKiGiJI48iQa08Bolmz9ZWzZQXhAthKfFQ9Rv+GEtIAi4kLR8EZPbu
0/FL1ylYNUY9O7p0g+iy9Kcoc+xW36I95pPZf8pO8GFcXTjyuCCBVh/SNvFZZHPl
3xk3aZJknAEID8VrVG2IJPkeDI8WK8YxmpU/nARCoytn07Df6Ye8jGvLdR8pL3lB
TIZV6eRY4yciB8LtoK9Kg4XTmOMhBtjt4c3znkljp98vhOQQb/oHN+BXMGcwqvr9
fk0KGrg31VA=
=8RCg
-----END PGP SIGNATURE-----
Merge tag 'md/3.13' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io during
resync"
* tag 'md/3.13' of git://neil.brown.name/md:
md/raid5: Use conf->device_lock protect changing of multi-thread resources.
md/raid5: Before freeing old multi-thread worker, it should flush them.
md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
raid1: Rewrite the implementation of iobarrier.
raid1: Add some macros to make code clearly.
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
raid1: Add a field array_frozen to indicate whether raid in freeze state.
md: Convert use of typedef ctl_table to struct ctl_table
md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
md: fix some places where mddev_lock return value is not checked.
raid5: Retry R5_ReadNoMerge flag when hit a read error.
raid5: relieve lock contention in get_active_stripe()
raid5: relieve lock contention in get_active_stripe()
wait: add wait_event_cmd()
md/raid5.c: add proper locking to error path of raid5_start_reshape.
md: fix calculation of stacking limits on level change.
raid5: Use slow_path to release stripe when mddev->thread is null
When raid5 recovery hits a fresh badblock, this badblock will flagged as unack
badblock until md_update_sb() is called.
But md_stop will take reconfig lock which means raid5d can't call
md_update_sb() in md_check_recovery(), the badblock will always
be unack, so raid5d thread enters an infinite loop and md_stop_write()
can never stop sync_thread. This causes deadlock.
To solve this, when STOP_ARRAY ioctl is issued and sync_thread is
running, we need set md->recovery FROZEN and INTR flags and wait for
sync_thread to stop before we (re)take reconfig lock.
This requires that raid5 reshape_request notices MD_RECOVERY_INTR
(which it probably should have noticed anyway) and stops waiting for a
metadata update in that case.
Reported-by: Jianpeng Ma <majianpeng@gmail.com>
Reported-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We currently use kthread_should_stop() in various places in the
sync/reshape code to abort early.
However some places set MD_RECOVERY_INTR but don't immediately call
md_reap_sync_thread() (and we will shortly get another one).
When this happens we are relying on md_check_recovery() to reap the
thread and that only happen when it finishes normally.
So MD_RECOVERY_INTR must lead to a normal finish without the
kthread_should_stop() test.
So replace all relevant tests, and be more careful when the thread is
interrupted not to acknowledge that latest step in a reshape as it may
not be fully committed yet.
Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
so we don't wait have to wait for the speed to drop before we can abort.
Signed-off-by: NeilBrown <neilb@suse.de>
Sometimes we need to lock and mddev and cannot cope with
failure due to interrupt.
In these cases we should use mutex_lock, not mutex_lock_interruptible.
Signed-off-by: NeilBrown <neilb@suse.de>
The various ->run routines of md personalities assume that the 'queue'
has been initialised by the blk_set_stacking_limits() call in
md_alloc().
However when the level is changed (by level_store()) the ->run routine
for the new level is called for an array which has already had the
stacking limits modified. This can result in incorrect final
settings.
So call blk_set_stacking_limits() before ->run in level_store().
A specific consequence of this bug is that it causes
discard_granularity to be set incorrectly when reshaping a RAID4 to a
RAID0.
This is suitable for any -stable kernel since 3.3 in which
blk_set_stacking_limits() was introduced.
Cc: stable@vger.kernel.org (3.3+)
Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block IO core updates from Jens Axboe:
"This is the pull request for the core changes in the block layer for
3.13. It contains:
- The new blk-mq request interface.
This is a new and more scalable queueing model that marries the
best part of the request based interface we currently have (which
is fully featured, but scales poorly) and the bio based "interface"
which the new drivers for high IOPS devices end up using because
it's much faster than the request based one.
The bio interface has no block layer support, since it taps into
the stack much earlier. This means that drivers end up having to
implement a lot of functionality on their own, like tagging,
timeout handling, requeue, etc. The blk-mq interface provides all
these. Some drivers even provide a switch to select bio or rq and
has code to handle both, since things like merging only works in
the rq model and hence is faster for some workloads. This is a
huge mess. Conversion of these drivers nets us a substantial code
reduction. Initial results on converting SCSI to this model even
shows an 8x improvement on single queue devices. So while the
model was intended to work on the newer multiqueue devices, it has
substantial improvements for "classic" hardware as well. This code
has gone through extensive testing and development, it's now ready
to go. A pull request is coming to convert virtio-blk to this
model will be will be coming as well, with more drivers scheduled
for 3.14 conversion.
- Two blktrace fixes from Jan and Chen Gang.
- A plug merge fix from Alireza Haghdoost.
- Conversion of __get_cpu_var() from Christoph Lameter.
- Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.
- A fix for a race between request completion and the timeout
handling from Jeff Moyer. This is what caused the merge conflict
with blk-mq/core, in case you are looking at that.
- A dm stacking fix from Mike Snitzer.
- A code consolidation fix and duplicated code removal from Kent
Overstreet.
- A handful of block bug fixes from Mikulas Patocka, fixing a loop
crash and memory corruption on blk cg.
- Elevator switch bug fix from Tomoki Sekiyama.
A heads-up that I had to rebase this branch. Initially the immutable
bio_vecs had been queued up for inclusion, but a week later, it became
clear that it wasn't fully cooked yet. So the decision was made to
pull this out and postpone it until 3.14. It was a straight forward
rebase, just pruning out the immutable series and the later fixes of
problems with it. The rest of the patches applied directly and no
further changes were made"
* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: Do not call sector_div() with a 64-bit divisor
kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
block: Consolidate duplicated bio_trim() implementations
block: Use rw_copy_check_uvector()
block: Enable sysfs nomerge control for I/O requests in the plug list
block: properly stack underlying max_segment_size to DM device
elevator: acquire q->sysfs_lock in elevator_change()
elevator: Fix a race in elevator switching and md device initialization
block: Replace __get_cpu_var uses
bdi: test bdi_init failure
block: fix a probe argument to blk_register_region
loop: fix crash if blk_alloc_queue fails
blk-core: Fix memory corruption if blkcg_init_queue fails
block: fix race between request completion and timeout handling
blktrace: Send BLK_TN_PROCESS events to all running traces
blk-mq: don't disallow request merges for req->special being set
blk-mq: mq plug list breakage
blk-mq: fix for flush deadlock
...
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here's the big driver core / sysfs update for 3.13-rc1.
There's lots of dev_groups updates for different subsystems, as they all
get slowly migrated over to the safe versions of the attribute groups
(removing userspace races with the creation of the sysfs files.) Also
in here are some kobject updates, devres expansions, and the first round
of Tejun's sysfs reworking to enable it to be used by other subsystems
as a backend for an in-kernel filesystem.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlJ6xAMACgkQMUfUDdst+yk1kQCfcHXhfnrvFZ5J/mDP509IzhNS
ddEAoLEWoivtBppNsgrWqXpD1vi4UMsE
=JmVW
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core / sysfs update for 3.13-rc1.
There's lots of dev_groups updates for different subsystems, as they
all get slowly migrated over to the safe versions of the attribute
groups (removing userspace races with the creation of the sysfs
files.) Also in here are some kobject updates, devres expansions, and
the first round of Tejun's sysfs reworking to enable it to be used by
other subsystems as a backend for an in-kernel filesystem.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits)
sysfs: rename sysfs_assoc_lock and explain what it's about
sysfs: use generic_file_llseek() for sysfs_file_operations
sysfs: return correct error code on unimplemented mmap()
mdio_bus: convert bus code to use dev_groups
device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name
sysfs: separate out dup filename warning into a separate function
sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c
sysfs: remove unused sysfs_get_dentry() prototype
sysfs: honor bin_attr.attr.ignore_lockdep
sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr
devres: restore zeroing behavior of devres_alloc()
sysfs: fix sysfs_write_file for bin file
input: gameport: convert bus code to use dev_groups
input: serio: remove bus usage of dev_attrs
input: serio: use DEVICE_ATTR_RO()
i2o: convert bus code to use dev_groups
memstick: convert bus code to use dev_groups
tifm: convert bus code to use dev_groups
virtio: convert bus code to use dev_groups
ipack: convert bus code to use dev_groups
...
The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention. For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.
This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns(). This makes confusions a lot less
likely.
There are other interfaces which take @ns before @name. They'll be
updated by following patches.
This patch doesn't introduce any functional changes.
v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
error on module builds. Reported by build test robot. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the last process closes /dev/mdX sync_blockdev will be called so
that all buffers get flushed.
So if it is then opened for the STOP_ARRAY ioctl to be sent there will
be nothing to flush.
However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
moments before some other process which was writing closes their file
descriptor, then there won't be a 'last close' and the buffers might
not get flushed.
So do_md_stop() calls sync_blockdev(). However at this point it is
holding ->reconfig_mutex. So if the array is currently 'clean' then
the writes from sync_blockdev() will not complete until the array
can be marked dirty and that won't happen until some other thread
can get ->reconfig_mutex. So we deadlock.
We need to move the sync_blockdev() call to before we take
->reconfig_mutex.
However then some other thread could open /dev/mdX and write to it
after we call sync_blockdev() and before we actually stop the array.
This can leave dirty data in the page cache which is awkward.
So introduce new flag MD_STILL_CLOSED. Set it before calling
sync_blockdev(), clear it if anyone does open the file, and abort the
STOP_ARRAY attempt if it gets set before we lock against further
opens.
It is still possible to get problems if you open /dev/mdX, write to
it, then issue the STOP_ARRAY ioctl. Just don't do that.
Signed-off-by: NeilBrown <neilb@suse.de>
mddev->flags is mostly used to record if an update of the
metadata is needed. Sometimes the whole field is tested
instead of just the important bits. This makes it difficult
to introduce more state bits.
So replace all bare tests of mddev->flags with tests for the bits
that actually need testing.
Signed-off-by: NeilBrown <neilb@suse.de>
Setting a variable to itself probably wasn't the intention here.
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Whe we set the safe_mode_timeout to a smaller value we trigger a timeout
immediately - otherwise the small value might not be honoured.
However if the previous timeout was 0 meaning "no timeout", we didn't.
This would mean that no timeout happens until the next write completes,
which could be a long time.
Signed-off-by: NeilBrown <neilb@suse.de>
There is no really need as GFP_NOIO is very likely sufficient,
and failure is not catastrophic.
Calling md_allow_write here will convert a read-auto array to
read/write which could be confusing when you are just performing
a read operation.
Signed-off-by: NeilBrown <neilb@suse.de>
commit 7ceb17e87b
md: Allow devices to be re-added to a read-only array.
allowed a bit more than just that. It also allows devices to be added
to a read-write array and to end up skipping recovery.
This patch removes the offending piece of code pending a rewrite for a
subsequent release.
More specifically:
If the array has a bitmap, then the device will still need a bitmap
based resync ('saved_raid_disk' is set under different conditions
is a bitmap is present).
If the array doesn't have a bitmap, then this is correct as long as
nothing has been written to the array since the metadata was checked
by ->validate_super. However there is no locking to ensure that there
was no write.
Bug was introduced in 3.10 and causes data corruption so
patch is suitable for 3.10-stable.
Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD: Remember the last sync operation that was performed
This patch adds a field to the mddev structure to track the last
sync operation that was performed. This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs. If the
last operation was "data-check", then it reports the number of
descrepancies found by the user-initiated check. If it was a
"repair" operation, then it is reporting the number of
descrepancies repaired. etc.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When a device has failed, it needs to be removed from the personality
module before it can be removed from the array as a whole.
The first step is performed by md_check_recovery() which is called
from the raid management thread.
So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
to have run. This increases the chance that the ioctl will succeed.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Neil Brown <nfbrown@suse.de>
Some tagged for -stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUbl1mznsnt1WYoG5AQKGlQ//eixdawF+DUK5hadqZ9EDni+BAVzb7m69
+zU6ilQ7UOh7bxtAoJqrgFVykK+LG8wvYsEBwMjB9oRDLA96/YDXXiBzXHvd6mGh
g271lwMTQ9h+O8L6psLUX6qsrH3i7SJmF8ySPKi6Fe5ruT8ToOB8Ii8XQebEZdXo
VOzRz2VgSTcBdrTyKPDsBJByDQX36hsK8Gs5YSl5F3nvyV4dvGWMlyoTF1TRRt9K
YCCZ8pSk3kTXaSdl0syrJxI17pEUC8mtcA01S6JD/GV49CGO8LYAckVJ4ijWw7VV
IGGlH0DsYSMgJ7yyuLz4ifaqRnsWsAGW0WyiZYYKvjtNUiyBuBBbo2cQ1lNkR5p4
jnLhpJJVh0hLCPn6wcCWIBIdT/mFaBpXkvZPd3ks5kefGXsfpVPm0fK8r0fzkzgy
tJCZtZFZHeK1qsgaDsiS76S2ZNcFh0HQVIa84Q200/XUDgh8dYlD0+7oIsVu0UBZ
72Aop+Ak9+k4vKTvB9/hpcY+Rt0MI7zKewXBDSDK1sXhIHLQqv8rCEeNYiuPPqr/
ghRukn+C/Wtr7JYBsX+jMjxtmSzYtwBOihwLoZCH9pp3C5jTvyQk9s8n1j13V2RK
sAFtfpCVoQ8tTa7IITKRMfftzHn1WiPlPsj6VbigJ6A4N98csgv7x2rF7FyqcF0X
aoj69nQ3i/4=
=8iy3
-----END PGP SIGNATURE-----
Merge tag 'md-3.10-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.
However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward. This can particularly cause problems or dm-raid.
So change __md_stop_writes() to always freeze recovery. This is safe
and more predicatable.
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Maintenance of a bad-block-list currently defaults to 'enabled'
and is then disabled when it cannot be supported.
This is backwards and causes problem for dm-raid which didn't know
to disable it.
So fix the defaults, and only enabled for v1.x metadata which
explicitly has bad blocks enabled.
The problem with dm-raid has been present since badblock support was
added in v3.1, so this patch is suitable for any -stable from 3.1
onwards.
Cc: stable@vger.kernel.org (3.1+)
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD: Export 'md_reap_sync_thread' function
Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c.
- rename reap_sync_thread to md_reap_sync_thread
- move the fn after md_check_recovery to match md.h declaration placement
- export md_reap_sync_thread
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
read-only arrays should stay that way as much as possible.
Updating the metadata - which could be triggered by a re-add
while assembling the array metadata - should be avoided.
Signed-off-by: NeilBrown <neilb@suse.de>
When assembling an array incrementally we might want to make
it device available when "enough" devices are present, but maybe
not "all" devices are present.
If the remaining devices appear before the array is actually used,
they should be added transparently.
We do this by using the "read-auto" mode where the array acts like
it is read-only until a write request arrives.
Current an add-device request switches a read-auto array to active.
This means that only one device can be added after the array is first
made read-auto. This isn't a problem for RAID5, but is not ideal for
RAID6 or RAID10.
Also we don't really want to switch the array to read-auto at all
when re-adding a device as this doesn't really imply any change.
So:
- remove the "md_update_sb()" call from add_new_disk(). This isn't
really needed as just adding a disk doesn't require a metadata
update. Instead, just set MD_CHANGE_DEVS. This will effect a
metadata update soon enough, once the array is not read-only.
- Allow the ADD_NEW_DISK ioctl to succeed without activating a
read-auto array, providing the MD_DISK_SYNC flag is set.
In this case, the device will be rejected if it cannot be added
with the correct device number, or has an incorrect event count.
- Teach remove_and_add_spares() to be careful about adding spares
when the array is read-only (or read-mostly) - only add devices
that are thought to be in-sync, and only do it if the array is
in-sync itself.
- In md_check_recovery, use remove_and_add_spares in the read-only
case, rather than open coding just the 'remove' part of it.
Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
If a fail device or a spare is removed from an array, there is
not need to make the array 'active'. If/when the array does become
active for some other reason the metadata will be update to reflect
the removal.
If that never happens and the array is stopped while still read-auto,
then there is no loss in forgetting the that the device had 'failed'.
A read-only array will leave failed devices attached to
the array personality, so we need to explicitly call
remove_and_add_spares() to free it (clearing Blocked just
like we do in store_slot()).
Signed-off-by: NeilBrown <neilb@suse.de>
slot_store and remove_and_add_spares both call ->hot_remove_disk(),
but with slightly different tests and consequences, which is
at least untidy and might be buggy.
So modify remove_and_add_spaces() so that it can be asked
to remove a specific device, and call it from slot_store().
We also clear the Blocked flag to ensure that doesn't prevent
removal. The purpose of Blocked is to prevent automatic removal
by the kernel before an error is acknowledged.
If the array is read/write then user-space would have not reason
to remove a device unless it was known to be 'spare' or 'faulty' in
which it would have already cleared the Blocked flag.
If the array is read-only, the flag might still be blocked, but
there is no harm in clearing the flag for read-only arrays.
Signed-off-by: NeilBrown <neilb@suse.de>
Normally we don't even try to update the metadata if
the array is read-only. However future patches
will increase the number of things that can happen on a read-only
array, so it is safest to explicitly disable this.
Every time that mddev->ro is set to 0, either
- md_update_sb will be called again (at least if MD_CHANGE_DEVS
is set) or
- the mddev->thread is scheduled, which will also run
md_update_sb if needed.
So this is safe: if the array ever become read-write the
metadata will be updated.
Signed-off-by: NeilBrown <neilb@suse.de>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
MD: Prevent sysfs operations on uninitialized kobjects
Device-mapper does not use sysfs; but when device-mapper is leveraging
MD's RAID personalities, MD sometimes attempts to update sysfs. This
patch adds checks for 'mddev-kobj.sd' in sysfs_[un]link_rdev to ensure
it is about to operate on something valid. This patch also checks for
'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'.
Although 'sysfs_notify' already makes this check, doing so in
'remove_and_add_spares' prevents an additional mutex operation.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If something has failed while the array was read-auto,
then when we switch to 'active' we need to update the metadata.
This will happen anyway but it is good to expedite it, and
also to ensure any failed device has been released by the
underlying device before we try to action the ioctl which
caused us to switch to 'active' mode.
Reported-by: Joe Lawrence <Joe.Lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
You cannot resize a RAID0 array (in terms of making the devices
bigger), but the code doesn't entirely stop you.
So:
disable setting of the available size on each device for
RAID0 and Linear devices. This must not change as doing so
can change the effective layout of data.
Make sure that the size that raid0_size() reports is accurate,
but rounding devices sizes to chunk sizes. As the device sizes
cannot change now, this isn't so important, but it is best to be
safe.
Without this change:
mdadm --grow /dev/md0 -z max
mdadm --grow /dev/md0 -Z max
then read to the end of the array
can cause a BUG in a RAID0 array.
These bugs have been present ever since it became possible
to resize any device, which is a long time. So the fix is
suitable for any -stable kerenl.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
If an fsync occurs on a read-only array, we need to send a
completion for the IO and may not increment the active IO count.
Otherwise, we hit a bug trace and can't stop the MD array anymore.
By advice of Christoph Hellwig we return success upon a flush
request but we return -EROFS for other writes.
We detect flush requests by checking if the bio has zero sectors.
This patch is suitable to any -stable kernel to which it applies.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Paul Menzel <paulepanter@users.sourceforge.net>
Signed-off-by: NeilBrown <neilb@suse.de>
Mostly just little fixes. Probably biggest part is
AVX accelerated RAID6 calculations.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUM/w2Dnsnt1WYoG5AQKXlg/9F5juv4CjRkRRFLqZgOPBLmn/s/2Vspgh
2Kv8Jcyixd8jUQNbobZv0ahlJH/iSU61kpOE8QjLbKi5Y42vAbM0ZU2aHJ6nqGZy
HiTI8K+7kTvCK3ZXLcUQ+4oPPBNTcoTZbLWaEOmIqB1ruLddoIR7M9fG3PspVeG0
jijnXR8IfL6mr4YDXnJkEhFrneTysVik05RkKYZKyM/9r3stAoMJ9o0/EFy3OFxb
lO6mLEtvjVArXcnuf1RMCw2YKgki9Y4r73HCplgQsVFvcxcpsya4gFF+lRR5j7cO
/eMYbSQ89iWEYKh1dJ9u1nofc8fX5ia71QQyO1fkO4GXRHXPVIyBgKSbe7SaL6iG
JUMm7idUV2rZGeq3ln3k8Yor4QqHvN1n7pRKKUF+ZdsPoQ1B/TABu+qpsAdo5ZhP
fxDsULsHrzEaxgetd4V8F2Uptca9ni43sMI8mwsvVlA0p6SOzMIyoJLC9xAZpx11
b3H3+7Oje/fasmszBoq5B9uAlSt9XXVN4DDn2q6cX+S96JSX6jcsN1c6cJBO+ZxB
OU6a6P5mnU6HuxU02rspe7G8BeU+ybaonErOW+GdyC4r7M/cImC0dSp0NGHK2211
oqu0xBx/Q/ddTFwKQqa4HzR2ws09+LhKbjdqYIhCEKttIbLIAjf73ARZ19XPSRRX
pDR/ey2CB6E=
=uK52
-----END PGP SIGNATURE-----
Merge tag 'md-3.8' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly just little fixes. Probably biggest part is AVX accelerated
RAID6 calculations."
* tag 'md-3.8' of git://neil.brown.name/md:
md/raid5: add blktrace calls
md/raid5: use async_tx_quiesce() instead of open-coding it.
md: Use ->curr_resync as last completed request when cleanly aborting resync.
lib/raid6: build proper files on corresponding arch
lib/raid6: Add AVX2 optimized gen_syndrome functions
lib/raid6: Add AVX2 optimized recovery functions
md: Update checkpoint of resync/recovery based on time.
md:Add place to update ->recovery_cp.
md.c: re-indent various 'switch' statements.
md: close race between removing and adding a device.
md: removed unused variable in calc_sb_1_csm.
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
If a resync is aborted cleanly, ->curr_resync is a reliable
record of where we got up to.
If there was an error it is less reliable but we always know that
->curr_resync_completed is safe.
So add a flag MD_RECOVERY_ERROR to differentiate between these cases
and set recovery_cp accordingly.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
md will current only only checkpoint recovery or resync ever 1/16th
of the device size. As devices get larger this can become a long time
an so a lot of work that might need to be duplicated after a shutdown.
So add a time-based checkpoint. Every 5 minutes limits the amount of
duplicated effort to at most 5 minutes, and has almost zero impact on
performance.
[changelog entry re-written by NeilBrown]
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In resyncing, recovery_cp only updated when resync aborted or completed.
But in md drives,many place used it to judge.So add a place to update.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we remove a device from an md array, the final removal of
the "dev-XX" sys entry is run asynchronously.
If we then re-add that device immediately before the worker thread
gets to run, we can end up trying to add the "dev-XX" sysfs entry back
before it has been removed.
So in both places where we add a device, call
flush_workqueue(md_misc_wq);
before taking the md lock (as holding the md lock can prevent removal
to complete).
Signed-off-by: NeilBrown <neilb@suse.de>
New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
moves the private wait_event_lock_irq() macro from MD to regular wait
includes, introduces new macro wait_event_lock_irq_cmd() instead of using
the old method with omitting cmd parameter which is ugly and makes a use
of new macros in the MD. It also introduces the _interruptible_ variant.
The use of new interface is when one have a special lock to protect data
structures used in the condition, or one also needs to invoke "cmd"
before putting it to sleep.
All new macros are expected to be called with the lock taken. The lock
is released before sleep and is reacquired afterwards. We will leave the
macro with the lock held.
Note to DM: IMO this should also fix theoretical race on waitqueue while
using simultaneously wait_event_lock_irq() and wait_event() because of
lack of locking around current state setting and wait queue removal.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
md_stop() would stop an array, but not free various attached
data structures.
For internal arrays, these are freed later in do_md_stop() or
mddev_put(), but they don't apply for dm-raid arrays.
So get md_stop() to free them, and only all it from dm-raid.
For internal arrays we now call __md_stop.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If read_seqretry returned true and bbp was changed, it will write
invalid address which can cause some serious problem.
This bug was introduced by commit v3.0-rc7-130-g2699b67.
So fix is suitable for 3.0.y thru 3.6.y.
Reported-by: zhuwenfeng@kedacom.com
Tested-by: zhuwenfeng@kedacom.com
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This bug was introduced by commit(v3.0-rc7-126-g2230dfe).
So fix is suitable for 3.0.y thru 3.6.y.
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
"discard" support, some dm-raid improvements and other assorted
bits and pieces.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAUHk6Rjnsnt1WYoG5AQKovQ//Ym0ROo5a6uekb2USLyFSdQH3TC7z0v0+
+kujrgoc4nHZU/vj5yfMvPVomEUsAhHEwTkvvCiXFFHn6cxPzC8ezm8d40xEeISX
qp6i2bPlvGURhsW1tYeD+THtY82/oyzQ4Wa/vaE1sjVLQ+caa2q7kVVgAL9Bj/Kz
aESIZjAuPxQNE1674/KR0EmMFcbpd0z1WDV+ydKlRV5jHCHGYf8OmxOenJFf+V/b
/f9p2u+NUq5BN5WLhThcysO8lPX1Y7GG8IYay3DlSt/crU24R2a2j0qh/BDoK8+t
/DceoHipbIiGxXLVjM7y+1RwPpCh75HJSZQHltPype2Z3iwtwEth9uTkEE3M2h/W
tOQEbOZku0kcgsrys7JBmpkBwkR9oZqq1kDd4YBzqW4PiGVP6z0JRH8QpjjB+mjN
47ODYIZcaEYZ+0Jj8kcVxo3gv4Xj4DWH+auSNZihTVmjQPVqrcy3CAt3CkuDzTkY
34fZVuCDiCetLGCGQKrwfMDnySVy5xOmtC6iWsEY5rExAeb0E+BCzcBvbAXzt+ef
MPDsrxWbo/ZkvpuwXOwLFTccBuRtAsFi7CM4jcow53W6XMnPpdubphNw5nylaEm1
DEzfID58mv8VHWRuW15vr7SbtROjYJkEFCIaEK3oprrRUYftZntIABcknqvcIYR+
/ULNzkRU1w4=
=XRmL
-----END PGP SIGNATURE-----
Merge tag 'md-3.7' of git://neil.brown.name/md
Pull md updates from NeilBrown:
- "discard" support, some dm-raid improvements and other assorted bits
and pieces.
* tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
md: refine reporting of resync/reshape delays.
md/raid5: be careful not to resize_stripes too big.
md: make sure manual changes to recovery checkpoint are saved.
md/raid10: use correct limit variable
md: writing to sync_action should clear the read-auto state.
Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
md/raid5: make sure to_read and to_write never go negative.
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
md/raid5: protect debug message against NULL derefernce.
md/raid5: add some missing locking in handle_failed_stripe.
MD: raid5 avoid unnecessary zero page for trim
MD: raid5 trim support
md/bitmap:Don't use IS_ERR to judge alloc_page().
md/raid1: Don't release reference to device while handling read error.
raid: replace list_for_each_continue_rcu with new interface
add further __init annotations to crypto/xor.c
DM RAID: Fix for "sync" directive ineffectiveness
DM RAID: Fix comparison of index and quantity for "rebuild" parameter
DM RAID: Add rebuild capability for RAID10
DM RAID: Move 'rebuild' checking code to its own function
...
If 'resync_max' is set to 0 (as is often done when starting a
reshape, so the mdadm can remain in control during a sensitive
period), and if the reshape request is initially delayed because
another array using the same array is resyncing or reshaping etc,
when user-space cannot easily tell when the delay changes from being
due to a conflicting reshape, to being due to resync_max = 0.
So introduce a new state: (curr_resync == 3) to reflect this, make
sure it is visible both via /proc/mdstat and via the "sync_completed"
sysfs attribute, and ensure that the event transition from one delay
state to the other is properly notified.
Signed-off-by: NeilBrown <neilb@suse.de>
If you make an array bigger but suppress resync of the new region with
mdadm --grow /dev/mdX --size=max --assume-clean
then stop the array before anything is written to it, the effect of
the "--assume-clean" is lost and the array will resync the new space
when restarted.
So ensure that we update the metadata in the case.
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In some cases array are started in 'read-auto' state where in
nothing gets written to any device until the array is written
to. The purpose of this is to make accidental auto-assembly
of the wrong arrays less of a risk, and to allow arrays to be
started to read suspend-to-disk images without actually changing
anything (as might happen if the array were dirty and a
resync seemed necessary).
Explicitly writing the 'sync_action' for a read-auto array currently
doesn't clear the read-auto state, so the sync action doesn't
happen, which can be confusing.
So allow any successful write to sync_action to clear any read-auto
state.
Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID10: Fix a couple potential kernel panics if RAID10 is used by dm-raid
When device-mapper uses the RAID10 personality through dm-raid.c, there is no
'gendisk' structure in mddev and some sysfs information is also not populated.
This patch avoids touching those non-existent structures.
Signed-off-by: Jonathan Brassow <jbrassow@rehdat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Some ioctls don't need to take the mutex and doing so can cause
a delay as it is held during super-block update.
So move those ioctls out of the mutex and rely on rcu locking
to ensure we don't access stale data.
Signed-off-by: NeilBrown <neilb@suse.de>
Change the thread parameter, so the thread can carry extra info. Next patch
will use it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block IO update from Jens Axboe:
"Core block IO bits for 3.7. Not a huge round this time, it contains:
- First series from Kent cleaning up and generalizing bio allocation
and freeing.
- WRITE_SAME support from Martin.
- Mikulas patches to prevent O_DIRECT crashes when someone changes
the block size of a device.
- Make bio_split() work on data-less bio's (like trim/discards).
- A few other minor fixups."
Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
Morton. It is due to the VM no longer using a prio-tree (see commit
6b2dbba8b6: "mm: replace vma prio_tree with an interval tree").
So make set_blocksize() use mapping_mapped() instead of open-coding the
internal VM knowledge that has changed.
* 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
block: makes bio_split support bio without data
scatterlist: refactor the sg_nents
scatterlist: add sg_nents
fs: fix include/percpu-rwsem.h export error
percpu-rw-semaphore: fix documentation typos
fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
blockdev: turn a rw semaphore into a percpu rw semaphore
Fix a crash when block device is read and block size is changed at the same time
block: fix request_queue->flags initialization
block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
block: ioctl to zero block ranges
block: Make blkdev_issue_zeroout use WRITE SAME
block: Implement support for WRITE SAME
block: Consolidate command flag and queue limit checks for merges
block: Clean up special command handling logic
block/blk-tag.c: Remove useless kfree
block: remove the duplicated setting for congestion_threshold
block: reject invalid queue attribute values
block: Add bio_clone_bioset(), bio_clone_kmalloc()
block: Consolidate bio_alloc_bioset(), bio_kmalloc()
...
It isn't always necessary to update the metadata when spares are
removed as the presence-or-not of a spare isn't really important to
the integrity of an array.
Also activating a spare doesn't always require updating the metadata
as the update on 'recovery-completed' is usually sufficient.
However the introduction of 'replacement' devices have made these
transitions sometimes more important. For example the 'Replacement'
flag isn't cleared until the original device is removed, so we need
to ensure a metadata update after that 'spare' is removed.
So set MD_CHANGE_DEVS whenever a spare is activated or removed, to
complement the current situation where it is set when a spare is added
or a device is failed (or a number of other less common situations).
This is suitable for -stable as out-of-data metadata could lead
to data corruption.
This is only relevant for 3.3 and later 9when 'replacement' as
introduced.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().
This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch adedd.
This will also help in a later patch changing how bio cloning works.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that bios keep track of where they were allocated from,
bio_integrity_alloc_bioset() becomes redundant.
Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
related functions and make them use bio->bi_pool.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.
This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.
v6: Explain the temporary if statement in bio_put
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Nicholas Bellinger <nab@linux-iscsi.org>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit 27a7b260f7
md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.
changed 0.90 metadata handling to truncated size to 4TB as that is
all that 0.90 can record.
However for RAID0 and Linear, 0.90 doesn't need to record the size, so
this truncation is not needed and causes working arrays to become too small.
So avoid the truncation for RAID0 and Linear
This bug was introduced in 3.1 and is suitable for any stable kernels
from then onwards.
As the offending commit was tagged for 'stable', any stable kernel
that it was applied to should also get this patch. That includes
at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for
providing that list).
Cc: stable@vger.kernel.org
Signed-off-by: Neil Brown <neilb@suse.de>
Pull block driver changes from Jens Axboe:
- Making the plugging support for drivers a bit more sane from Neil.
This supersedes the plugging change from Shaohua as well.
- The usual round of drbd updates.
- Using a tail add instead of a head add in the request completion for
ndb, making us find the most completed request more quickly.
- A few floppy changes, getting rid of a duplicated flag and also
running the floppy init async (since it takes forever in boot terms)
from Andi.
* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
floppy: remove duplicated flag FD_RAW_NEED_DISK
blk: pass from_schedule to non-request unplug functions.
block: stack unplug
blk: centralize non-request unplug handling.
md: remove plug_cnt feature of plugging.
block/nbd: micro-optimization in nbd request completion
drbd: announce FLUSH/FUA capability to upper layers
drbd: fix max_bio_size to be unsigned
drbd: flush drbd work queue before invalidate/invalidate remote
drbd: fix potential access after free
drbd: call local-io-error handler early
drbd: do not reset rs_pending_cnt too early
drbd: reset congestion information before reporting it in /proc/drbd
drbd: report congestion if we are waiting for some userland callback
drbd: differentiate between normal and forced detach
drbd: cleanup, remove two unused global flags
floppy: Run floppy initialization asynchronous
This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.
So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.
This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.
Removing this will make it easier to centralise the unplug code, and
will clear the other for other unplug enhancements which have a
measurable effect.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
do_md_stop tests mddev->openers while holding ->open_mutex,
and fails if this count is too high.
So callers do not need to check mddev->openers and doing so isn't
very meaningful as they don't hold ->open_mutex so the number could
change.
So remove the unnecessary tests on mddev->openers.
These are not called often enough for there to be any gain in
an early test on ->open_mutex to avoid the need for a slightly more
costly mutex_lock call.
Signed-off-by: NeilBrown <neilb@suse.de>
md will refuse to stop an array if any other fd (or mounted fs) is
using it.
When any fs is unmounted of when the last open fd is closed all
pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
so there will be no pending IO to worry about when the array is
stopped.
However in order to send the STOP_ARRAY ioctl to stop the array one
must first get and open fd on the block device.
If some fd is being used to write to the block device and it is closed
after mdadm open the block device, but before mdadm issues the
STOP_ARRAY ioctl, then there will be no last-close on the md device so
__blkdev_put will not call sync_blockdev.
If this happens, then IO can still be in-flight while md tears down
the array and bad things can happen (use-after-free and subsequent
havoc).
So in the case where do_md_stop is being called from an open file
descriptor, call sync_block after taking the mutex to ensure there
will be no new openers.
This is needed when setting a read-write device to read-only too.
Cc: stable@vger.kernel.org
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>