linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 11:18:45 +07:00

Author	SHA1	Message	Date
Shaohua Li	7184ef8bab	MD: fix sleep in atomic bioset_free() will take a mutex, so can't get called with spinlock hold. Fix: 5a85071c2cbc(md: use a separate bio_set for synchronous IO.) Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-07-03 14:38:59 -07:00
Shaohua Li	7f053a6a74	MD: fix a null dereference rdev->mddev could be null in start time. Reported-by: Ming Lei <ming.lei@redhat.com> Fix: 5a85071c2cbc(md: use a separate bio_set for synchronous IO.) Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-06-23 09:19:49 -07:00
NeilBrown	5a85071c2c	md: use a separate bio_set for synchronous IO. md devices allocate a bio_set and use it for two distinct purposes. mddev->bio_set is used to clone bios as part of sending upper level requests down to lower level devices, and it is also use for synchronous IO such as superblock and bitmap updates, and for correcting read errors. This multiple usage can lead to deadlocks. It is likely that cloned bios might be queued for write and to be waiting for a metadata update before the write can be permitted. If the cloning exhausted mddev->bio_set, the metadata update may not be able to proceed. This scenario has been seen during heavy testing, with lots of IO and lots of memory pressure. Address this by adding a new bio_set specifically for synchronous IO. All synchronous IO goes directly to the underlying device and is not queued at the md level, so request using entries from the new mddev->sync_set will complete in a timely fashion. Requests that use mddev->bio_set will sometimes need to wait for synchronous IO, but will no longer risk deadlocking that iO. Also: small simplification in mddev_put(): there is no need to wait until the spinlock is released before calling bioset_free(). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-06-21 10:52:36 -07:00
NeilBrown	9b10f6a9c2	block: remove bio_clone() and all references. bio_clone() is no longer used. Only bio_clone_bioset() or bio_clone_fast(). This is for the best, as bio_clone() used fs_bio_set, and filesystems are unlikely to want to use bio_clone(). So remove bio_clone() and all references. This includes a fix to some incorrect documentation. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-18 12:40:59 -06:00
NeilBrown	011067b056	blk: replace bioset_create_nobvec() with a flags arg to bioset_create() "flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-18 12:40:59 -06:00
NeilBrown	af67c31fba	blk: remove bio_set arg from blk_queue_split() blk_queue_split() is always called with the last arg being q->bio_split, where 'q' is the first arg. Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses q->bio_split. This is inconsistent and unnecessary. Remove the last arg and always use q->bio_split inside blk_queue_split() Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed) Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-18 12:40:59 -06:00
Lidong Zhong	8df7202439	md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE The value for spare spot of sb->dev_roles is changed from MD_DISK_ROLE_FAULTY to MD_DISK_ROLE_SPARE to keep align with the value when the superblock is firstly created in userspace. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-06-16 12:04:09 -07:00
NeilBrown	cc27b0c78c	md: fix deadlock between mddev_suspend() and md_write_start() If mddev_suspend() races with md_write_start() we can deadlock with mddev_suspend() waiting for the request that is currently in md_write_start() to complete the ->make_request() call, and md_write_start() waiting for the metadata to be updated to mark the array as 'dirty'. As metadata updates done by md_check_recovery() only happen then the mddev_lock() can be claimed, and as mddev_suspend() is often called with the lock held, these threads wait indefinitely for each other. We fix this by having md_write_start() abort if mddev_suspend() is happening, and ->make_request() aborts if md_write_start() aborted. md_make_request() can detect this abort, decrease the ->active_io count, and wait for mddev_suspend(). Reported-by: Nix <nix@esperi.org.uk> Fix: 68866e425be2(MD: no sync IO while suspended) Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-06-13 10:18:01 -07:00
Christoph Hellwig	fdd050b5b3	Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base	2017-06-13 11:45:14 +02:00
Jens Axboe	8f66439eec	Linux 4.12-rc5 -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU 4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+ 2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4= =m2Gx -----END PGP SIGNATURE----- Merge tag 'v4.12-rc5' into for-4.13/block We've already got a few conflicts and upcoming work depends on some of the changes that have gone into mainline as regression fixes for this series. Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream trees to continue working on 4.13 changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-12 08:30:13 -06:00
Christoph Hellwig	4e4cbee93d	block: switch bios to blk_status_t Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-06-09 09:27:32 -06:00
NeilBrown	a415c0f106	md: initialise ->writes_pending in personality modules. The new per-cpu counter for writes_pending is initialised in md_alloc(), which is not called by dm-raid. So dm-raid fails when md_write_start() is called. Move the initialization to the personality modules that need it. This way it is always initialised when needed, but isn't unnecessarily initialized (requiring memory allocation) when the personality doesn't use writes_pending. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Fixes: `4ad23a9764` ("MD: use per-cpu counter for writes_pending") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-06-05 16:04:35 -07:00
Amir Goldstein	e6fd2093a8	md: namespace private helper names The md private helper uuid_equal() collides with a generic helper of the same name. Rename the md private helper to md_uuid_equal() and do the same for md_sb_equal(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>	2017-06-05 16:56:37 +02:00
Jan Kara	5a8948f8a3	md: Make flush bios explicitely sync Commit `b685d3d65a` "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_{FUA\|PREFLUSH\|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. CC: linux-raid@vger.kernel.org CC: Shaohua Li <shli@kernel.org> Fixes: `b685d3d65a` CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Shaohua Li <shli@fb.com>	2017-05-31 09:25:53 -07:00
Artur Paszkiewicz	2214c260c7	md: don't return -EAGAIN in md_allow_write for external metadata arrays This essentially reverts commit `b5470dc5fc` ("md: resolve external metadata handling deadlock in md_allow_write") with some adjustments. Since commit `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.") changing array_state to 'active' does not use mddev_lock() and will not cause a deadlock with md_allow_write(). This revert simplifies userspace tools that write to sysfs attributes like "stripe_cache_size" or "consistency_policy" because it removes the need for special handling for external metadata arrays, checking for EAGAIN and retrying the write. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-05-08 10:32:59 -07:00
NeilBrown	97b20ef784	md: handle read-only member devices better. 1/ If an array has any read-only devices when it is started, the array itself must be read-only 2/ A read-only device cannot be added to an array after it is started. 3/ Setting an array to read-write should not succeed if any member devices are read-only Reported-and-Tested-by: Nanda Kishore Chinnaram <Nanda_Kishore_Chinna@dell.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-20 13:25:51 -07:00
NeilBrown	78b6350dca	md: support disabling of create-on-open semantics. md allows a new array device to be created by simply opening a device file. This make it difficult to remove the device and udev is likely to open the device file as part of processing the REMOVE event. There is an alternate mechanism for creating arrays by writing to the new_array module parameter. When using tools that work with this parameter, it is best to disable the old semantics. This new module parameter allows that. Signed-off-by: NeilBrown <neilb@suse.com> Acted-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-12 12:30:17 -07:00
NeilBrown	039b7225e6	md: allow creation of mdNNN arrays via md_mod/parameters/new_array The intention when creating the "new_array" parameter and the possibility of having array names line "md_HOME" was to transition away from the old way of creating arrays and to eventually only use this new way. The "old" way of creating array is to create a device node in /dev and then open it. The act of opening creates the array. This is problematic because sometimes the device node can be opened when we don't want to create an array. This can easily happen when some rule triggered by udev looks at a device as it is being destroyed. The node in /dev continues to exist for a short period after an array is stopped, and opening it during this time recreates the array (as an inactive array). Unfortunately no clear plan for the transition was created. It is now time to fix that. This patch allows devices with numeric names, like "md999" to be created by writing to "new_array". This will only work if the minor number given is not already in use. This will allow mdadm to support the creation of arrays with numbers > 511 (currently not possible) by writing to new_array. mdadm can, at some point, use this approach to create all arrays, which will allow the transition to only using the new-way. Signed-off-by: NeilBrown <neilb@suse.com> Acted-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-12 12:30:11 -07:00
Zhilong Liu	b670883bb9	md.c:didn't unlock the mddev before return EINVAL in array_size_store md.c: it needs to release the mddev lock before the array_size_store() returns. Fixes: `ab5a98b132` ("md-cluster: change array_sectors and update size are not supported") Signed-off-by: Zhilong Liu <zlliu@suse.com> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 10:50:24 -07:00
NeilBrown	065e519e71	md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop if called md_set_readonly and set MD_CLOSING bit, the mddev cannot be opened any more due to the MD_CLOING bit wasn't cleared. Thus it needs to be cleared in md_ioctl after any call to md_set_readonly() or do_md_stop(). Signed-off-by: NeilBrown <neilb@suse.com> Fixes: `af8d8e6f03` ("md: changes for MD_STILL_CLOSED flag") Cc: stable@vger.kernel.org (v4.9+) Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 10:47:50 -07:00
NeilBrown	4ad23a9764	MD: use per-cpu counter for writes_pending The 'writes_pending' counter is used to determine when the array is stable so that it can be marked in the superblock as "Clean". Consequently it needs to be updated frequently but only checked for zero occasionally. Recent changes to raid5 cause the count to be updated even more often - once per 4K rather than once per bio. This provided justification for making the updates more efficient. So we replace the atomic counter a percpu-refcount. This can be incremented and decremented cheaply most of the time, and can be switched to "atomic" mode when more precise counting is needed. As it is possible for multiple threads to want a precise count, we introduce a "sync_checker" counter to count the number of threads in "set_in_sync()", and only switch the refcount back to percpu mode when that is zero. We need to be careful about races between set_in_sync() setting ->in_sync to 1, and md_write_start() setting it to zero. md_write_start() holds the rcu_read_lock() while checking if the refcount is in percpu mode. If it is, then we know a switch to 'atomic' will not happen until after we call rcu_read_unlock(), in which case set_in_sync() will see the elevated count, and not set in_sync to 1. If it is not in percpu mode, we take the mddev->lock to ensure proper synchronization. It is no longer possible to quickly check if the count is zero, which we previously did to update a timer or to schedule the md_thread. So now we do these every time we decrement that counter, but make sure they are fast. mod_timer() already optimizes the case where the timeout value doesn't actually change. We leverage that further by always rounding off the jiffies to the timeout value. This may delay the marking of 'clean' slightly, but ensure we only perform atomic operation here when absolutely needed. md_wakeup_thread() current always calls wake_up(), even if THREAD_WAKEUP is already set. That too can be optimised to avoid calls to wake_up(). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:56 -07:00
NeilBrown	55cc39f345	md: close a race with setting mddev->in_sync If ->in_sync is being set just as md_write_start() is being called, it is possible that set_in_sync() won't see the elevated ->writes_pending, and md_write_start() won't see the set ->in_sync. To close this race, re-test ->writes_pending after setting ->in_sync, and add memory barriers to ensure the increment of ->writes_pending will be seen by the time of this second test, or the new ->in_sync will be seen by md_write_start(). Add a spinlock to array_state_show() to ensure this temporary instability is never visible from userspace. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:30 -07:00
NeilBrown	6497709b5d	md: factor out set_in_sync() Three separate places in md.c check if the number of active writes is zero and, if so, sets mddev->in_sync. There are a few differences, but there shouldn't be: - it is always appropriate to notify the change in sysfs_state, and there is no need to do this outside a spin-locked region. - we never need to check ->recovery_cp. The state of resync is not relevant for whether there are any pending writes or not (which is what ->in_sync reports). So create set_in_sync() which does the correct tests and makes the correct changes, and call this in all three places. Any behaviour changes here a minor and cosmetic. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:18 -07:00
NeilBrown	497280509f	md/raid5: use md_write_start to count stripes, not bios We use md_write_start() to increase the count of pending writes, and md_write_end() to decrement the count. We currently count bios submitted to md/raid5. Change it count stripe_heads that a WRITE bio has been attached to. So now, raid5_make_request() calls md_write_start() and then md_write_end() to keep the count elevated during the setup of the request. add_stripe_bio() calls md_write_start() for each stripe_head, and the completion routines always call md_write_end(), instead of only calling it when raid5_dec_bi_active_stripes() returns 0. make_discard_request also calls md_write_start/end(). The parallel between md_write_{start,end} and use of bi_phys_segments can be seen in that: Whenever we set bi_phys_segments to 1, we now call md_write_start. Whenever we increment it on non-read requests with raid5_inc_bi_active_stripes(), we now call md_write_start(). Whenever we decrement bi_phys_segments on non-read requsts with raid5_dec_bi_active_stripes(), we now call md_write_end(). This reduces our dependence on keeping a per-bio count of active stripes in bi_phys_segments. md_write_inc() is added which parallels md_write_start(), but requires that a write has already been started, and is certain never to sleep. This can be used inside a spinlocked region when adding to a write request. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:15:42 -07:00
Guoqing Jiang	48df498daf	md: move bitmap_destroy to the beginning of __md_stop Since we have switched to sync way to handle METADATA_UPDATED msg for md-cluster, then process_metadata_update is depended on mddev->thread->wqueue. With the new change, clustered raid could possible hang if array received a METADATA_UPDATED msg after array unregistered mddev->thread, so we need to stop clustered raid (bitmap_destroy -> bitmap_free -> md_cluster_stop) earlier than unregister thread (mddev_detach -> md_unregister_thread). And this change should be safe for non-clustered raid since all writes are stopped before the destroy. Also in md_run, we activate the personality (pers->run()) before activating the bitmap (bitmap_create()). So it is pleasingly symmetric to stop the bitmap (bitmap_destroy()) before stopping the personality (__md_stop() calls pers->free()), we achieve this by move bitmap_destroy to the beginning of __md_stop. But we don't want to break the codes for waiting behind IO as Shaohua mentioned, so introduce bitmap_wait_behind_writes to call the codes, and call the new fun in both mddev_detach and bitmap_destroy, then we will not break original behind IO code and also fit the new condition well. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:58 -07:00
Artur Paszkiewicz	ba903a3ea4	raid5-ppl: runtime PPL enabling or disabling Allow writing to 'consistency_policy' attribute when the array is active. Add a new function 'change_consistency_policy' to the md_personality operations structure to handle the change in the personality code. Values "ppl" and "resync" are accepted and turn PPL on and off respectively. When enabling PPL its location and size should first be set using 'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be written at this location on each member device. Enabling or disabling PPL is performed under a suspended array. The raid5_reset_stripe_cache function frees the stripe cache and allocates it again in order to allocate or free the ppl_pages for the stripes in the stripe cache. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:56 -07:00
Artur Paszkiewicz	664aed0444	md: add sysfs entries for PPL Add 'consistency_policy' attribute for array. It indicates how the array maintains consistency in case of unexpected shutdown. Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location and size of the PPL space on the device. They can't be changed for active members if the array is started and PPL is enabled, so in the setter functions only basic checks are performed. More checks are done in ppl_validate_rdev() when starting the log. These attributes are writable to allow enabling PPL for external metadata arrays and (later) to enable/disable PPL for a running array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:55 -07:00
Artur Paszkiewicz	ea0213e0c7	md: superblock changes for PPL Include information about PPL location and size into mdp_superblock_1 and copy it to/from rdev. Because PPL is mutually exclusive with bitmap, put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for 'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to mddev->flags to indicate that PPL is enabled on an array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:53 -07:00
Guoqing Jiang	818da59f97	md-cluster: add the support for resize To update size for cluster raid, we need to make sure all nodes can perform the change successfully. However, it is possible that some of them can't do it due to failure (bitmap_resize could fail). So we need to consider the issue before we set the capacity unconditionally, and we use below steps to perform sanity check. 1. A change the size, then broadcast METADATA_UPDATED msg. 2. B and C receive METADATA_UPDATED change the size excepts call set_capacity, sync_size is not update if the change failed. Also call bitmap_update_sb to sync sb to disk. 3. A checks other node's sync_size, if sync_size has been updated in all nodes, then send CHANGE_CAPACITY msg otherwise send msg to revert previous change. 4. B and C call set_capacity if receive CHANGE_CAPACITY msg, otherwise pers->resize will be called to restore the old value. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:50 -07:00
Guoqing Jiang	0ba959774e	md-cluster: use sync way to handle METADATA_UPDATED msg Previously, when node received METADATA_UPDATED msg, it just need to wakeup mddev->thread, then md_reload_sb will be called eventually. We taken the asynchronous way to avoid a deadlock issue, the deadlock issue could happen when one node is receiving the METADATA_UPDATED msg (wants reconfig_mutex) and trying to run the path: md_check_recovery -> mddev_trylock(hold reconfig_mutex) -> md_update_sb-metadata_update_start (want EX on token however token is got by the sending node) Since we will support resizing for clustered raid, and we need the metadata update handling to be synchronous so that the initiating node can detect failure, so we need to change the way for handling METADATA_UPDATED msg. But, we obviously need to avoid above deadlock with the sync way. To make this happen, we considered to not hold reconfig_mutex to call md_reload_sb, if some other thread has already taken reconfig_mutex and waiting for the 'token', then process_recvd_msg() can safely call md_reload_sb() without taking the mutex. This is because we can be certain that no other thread will take the mutex, and we also certain that the actions performed by md_reload_sb() won't interfere with anything that the other thread is in the middle of. To make this more concrete, we added a new cinfo->state bit MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD Which is set in lock_token() just before dlm_lock_sync() is called, and cleared just after. As lock_token() is always called with reconfig_mutex() held (the specific case is the resync_info_update which is distinguished well in previous patch), if process_recvd_msg() finds that the new bit is set, then the mutex must be held by some other thread, and it will keep waiting. So process_metadata_update() can call md_reload_sb() if either mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set. The tricky bit is what to do if neither of these apply. We need to wait. Fortunately mddev_unlock() always calls wake_up() on mddev->thread->wqueue. So we can get lock_token() to call wake_up() on that when it sets the bit. There are also some related changes inside this commit: 1. remove RELOAD_SB related codes since there are not valid anymore. 2. mddev is added into md_cluster_info then we can get mddev inside lock_token. 3. add new parameter for lock_token to distinguish reconfig_mutex is held or not. And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below: 1. set it before unregister thread, otherwise a deadlock could appear if stop a resyncing array. This is because md_unregister_thread(&cinfo->recv_thread) is blocked by recv_daemon -> process_recvd_msg -> process_metadata_update. To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is also need to be set before unregister thread. 2. set it in metadata_update_start to fix another deadlock. a. Node A sends METADATA_UPDATED msg (held Token lock). b. Node B wants to do resync, and is blocked since it can't get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is not set since the callchain (md_do_sync -> sync_request -> resync_info_update -> sendmsg -> lock_comm -> lock_token) doesn't hold reconfig_mutex. c. Node B trys to update sb (held reconfig_mutex), but stopped at wait_event() in metadata_update_start since we have set MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2). d. Then Node B receives METADATA_UPDATED msg from A, of course recv_daemon is blocked forever. Since metadata_update_start always calls lock_token with reconfig_mutex, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and lock_token don't need to set it twice unless lock_token is invoked from lock_comm. Finally, thanks to Neil for his great idea and help! Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:49 -07:00
Jason Yan	1345921393	md: fix incorrect use of lexx_to_cpu in does_sb_need_changing The sb->layout is of type __le32, so we shoud use le32_to_cpu. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-10 12:54:38 -08:00
Jason Yan	3fb632e40d	md: fix super_offset endianness in super_1_rdev_size_change The sb->super_offset should be big-endian, but the rdev->sb_start is in host byte order, so fix this by adding cpu_to_le64. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-10 12:54:37 -08:00
NeilBrown	1b3bae49fb	md: don't impose the MD_SB_DISKS limit on arrays without metadata. These arrays, created with "mdadm --build" don't benefit from a limit. The default will be used, which is '0' and is interpreted as "don't impose a limit". Reported-by: ian_bruce@mail.ru Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-09 09:02:30 -08:00
Guoqing Jiang	c948363421	md: move funcs from pers->resize to update_size raid1_resize and raid5_resize should also check the mddev->queue if run underneath dm-raid. And both set_capacity and revalidate_disk are used in pers->resize such as raid1, raid10 and raid5. So move them from personality file to common code. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-09 09:02:18 -08:00
Shaohua Li	99b3d74ec0	md: delete dead code Nobody is using mddev_check_plugged(), so delete the dead code Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-09 09:01:29 -08:00
Ingo Molnar	3f07c01441	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:29 +01:00
Linus Torvalds	a682e00354	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull md updates from Shaohua Li: "Mainly fixes bugs and improves performance: - Improve scalability for raid1 from Coly - Improve raid5-cache read performance, disk efficiency and IO pattern from Song and me - Fix a race condition of disk hotplug for linear from Coly - A few cleanup patches from Ming and Byungchul - Fix a memory leak from Neil - Fix WRITE SAME IO failure from me - Add doc for raid5-cache from me" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits) md/raid1: fix write behind issues introduced by bio_clone_bioset_partial md/raid1: handle flush request correctly md/linear: shutup lockdep warnning md/raid1: fix a use-after-free bug RAID1: avoid unnecessary spin locks in I/O barrier code RAID1: a new I/O barrier implementation to remove resync window md/raid5: Don't reinvent the wheel but use existing llist API md: fast clone bio in bio_clone_mddev() md: remove unnecessary check on mddev md/raid1: use bio_clone_bioset_partial() in case of write behind md: fail if mddev->bio_set can't be created block: introduce bio_clone_bioset_partial() md: disable WRITE SAME if it fails in underlayer disks md/raid5-cache: exclude reclaiming stripes in reclaim check md/raid5-cache: stripe reclaim only counts valid stripes MD: add doc for raid5-cache Documentation: move MD related doc into a separate dir md: ensure md devices are freed before module is unloaded. md/r5cache: improve journal device efficiency md/r5cache: enable chunk_aligned_read with write back cache ...	2017-02-24 14:42:19 -08:00
Jens Axboe	818551e2b2	Merge branch 'for-4.11/next' into for-4.11/linus-merge Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-17 14:08:19 -07:00
Ming Lei	d7a1030839	md: fast clone bio in bio_clone_mddev() Firstly bio_clone_mddev() is used in raid normal I/O and isn't in resync I/O path. Secondly all the direct access to bvec table in raid happens on resync I/O except for write behind of raid1, in which we still use bio_clone() for allocating new bvec table. So this patch replaces bio_clone() with bio_clone_fast() in bio_clone_mddev(). Also kill bio_clone_mddev() and call bio_clone_fast() directly, as suggested by Christoph Hellwig. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-15 11:24:54 -08:00
Ming Lei	ed7ef732ca	md: remove unnecessary check on mddev mddev is never NULL and neither is ->bio_set, so remove the check. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-15 11:24:13 -08:00
Ming Lei	10273170fd	md: fail if mddev->bio_set can't be created The current behaviour is to fall back to allocate bio from 'fs_bio_set', that isn't a correct way because it might cause deadlock. So this patch simply return failure if mddev->bio_set can't be created. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-15 11:23:24 -08:00
NeilBrown	9356863c94	md: ensure md devices are freed before module is unloaded. Commit: `cbd1998377` ("md: Fix unfortunate interaction with evms") change mddev_put() so that it would not destroy an md device while ->ctime was non-zero. Unfortunately, we didn't make sure to clear ->ctime when unloading the module, so it is possible for an md device to remain after module unload. An attempt to open such a device will trigger an invalid memory reference in: get_gendisk -> kobj_lookup -> exact_lock -> get_disk when tring to access disk->fops, which was in the module that has been removed. So ensure we clear ->ctime in md_exit(), and explain how that is useful, as it isn't immediately obvious when looking at the code. Fixes: `cbd1998377` ("md: Fix unfortunate interaction with evms") Tested-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-13 09:17:53 -08:00
Jan Kara	dc3b17cc8b	block: Use pointer to backing_dev_info from request_queue We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-02 08:20:48 -07:00
Song Liu	a85dd7b8df	md/r5cache: flush data only stripes in r5l_recovery_log() For safer operation, all arrays start in write-through mode, which has been better tested and is more mature. And actually the write-through/write-mode isn't persistent after array restarted, so we always start array in write-through mode. However, if recovery found data-only stripes before the shutdown (from previous write-back mode), it is not safe to start the array in write-through mode, as write-through mode can not handle stripes with data in write-back cache. To solve this problem, we flush all data-only stripes in r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with empty cache in write-through mode. This logic is implemented in r5c_recovery_flush_data_only_stripes(): 1. enable write back cache 2. flush all stripes 3. wake up conf->mddev->thread 4. wait for all stripes get flushed (reuse wait_for_quiescent) 5. disable write back cache The wait in 4 will be waked up in release_inactive_stripe_list() when conf->active_stripes reaches 0. It is safe to wake up mddev->thread here because all the resource required for the thread has been initialized. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-01-24 11:20:15 -08:00
Shaohua Li	20737738d3	Merge branch 'md-next' into md-linus	2016-12-13 12:40:15 -08:00
Linus Torvalds	36869cb93d	Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...	2016-12-13 10:19:16 -08:00
Shaohua Li	2953079c69	md: separate flags for superblock changes The mddev->flags are used for different purposes. There are a lot of places we check/change the flags without masking unrelated flags, we could check/change unrelated flags. These usage are most for superblock write, so spearate superblock related flags. This should make the code clearer and also fix real bugs. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-12-08 22:01:47 -08:00
Shaohua Li	82a301cb0e	md: MD_RECOVERY_NEEDED is set for mddev->recovery Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device removal.") Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-12-08 22:00:43 -08:00
NeilBrown	e2342ca832	md: fix refcount problem on mddev when stopping array. md_open() gets a counted reference on an mddev using mddev_find(). If it ends up returning an error, it must drop this reference. There are two error paths where the reference is not dropped. One only happens if the process is signalled and an awkward time, which is quite unlikely. The other was introduced recently in commit `af8d8e6f0`. Change the code to ensure the drop the reference when returning an error, and make it harded to re-introduce this sort of bug in the future. Reported-by: Marc Smith <marc.smith@mcc.edu> Fixes: `af8d8e6f03` ("md: changes for MD_STILL_CLOSED flag") Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-12-05 17:11:03 -08:00
Shaohua Li	034e33f5ed	md: stop write should stop journal reclaim __md_stop_writes currently doesn't stop raid5-cache reclaim thread. It's possible the reclaim thread is still running and doing write, which doesn't match what __md_stop_writes should do. The extra ->quiesce() call should not harm any raid types. For raid5-cache, this will guarantee we reclaim all caches before we update superblock. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Cc: Song Liu <songliubraving@fb.com>	2016-11-23 19:30:25 -08:00
Shaohua Li	ce1ccd079f	raid5-cache: suspend reclaim thread instead of shutdown There is mechanism to suspend a kernel thread. Use it instead of playing create/destroy game. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Cc: Song Liu <songliubraving@fb.com>	2016-11-23 19:30:06 -08:00
NeilBrown	46533ff7fe	md: Use REQ_FAILFAST_* on metadata writes where appropriate This can only be supported on personalities which ensure that md_error() never causes an array to enter the 'failed' state. i.e. if marking a device Faulty would cause some data to be inaccessible, the device is status is left as non-Faulty. This is true for RAID1 and RAID10. If we get a failure writing metadata but the device doesn't fail, it must be the last device so we re-write without FAILFAST to improve chance of success. We also flag the device as LastDev so that future metadata updates don't waste time on failfast writes. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-22 09:11:33 -08:00
NeilBrown	688834e6ae	md/failfast: add failfast flag for md to be used by some personalities. This patch just adds a 'failfast' per-device flag which can be stored in v0.90 or v1.x metadata. The flag is not used yet but the intent is that it can be used for mirrored (raid1/raid10) arrays where low latency is more important than keeping all devices on-line. Setting the flag for a device effectively gives permission for that device to be marked as Faulty and excluded from the array on the first error. The underlying driver will be directed not to retry requests that result in failures. There is a proviso that the device must not be marked faulty if that would cause the array as a whole to fail, it may only be marked Faulty if the array remains functional, but is degraded. Failures on read requests will cause the device to be marked as Faulty immediately so that further reads will avoid that device. No attempt will be made to correct read errors by over-writing with the correct data. It is expected that if transient errors, such as cable unplug, are possible, then something in user-space will revalidate failed devices and re-add them when they appear to be working again. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-22 08:58:17 -08:00
Shaohua Li	504634f60f	md: add blktrace event for writes to superblock superblock write is an expensive operation. With raid5-cache, it can be called regularly. Tracing to help performance debug. Signed-off-by: Shaohua Li <shli@fb.com> Cc: NeilBrown <neilb@suse.com>	2016-11-18 09:47:57 -08:00
NeilBrown	6119e6792b	md: remove md_super_wait() call after bitmap_flush() bitmap_flush() finishes with bitmap_update_sb(), and that finishes with write_page(..., 1), so write_page() will wait for all writes to complete. So there is no point calling md_super_wait() immediately afterwards. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-09 17:14:28 -08:00
NeilBrown	060b0689f5	md: perform async updates for metadata where possible. When adding devices to, or removing device from, an array we need to update the metadata. However we don't need to do it synchronously as data integrity doesn't depend on these changes being recorded instantly. So avoid the synchronous call to md_update_sb and just set a flag so that the thread will do it. This can reduce the number of updates performed when lots of devices are being added or removed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:23 -08:00
NeilBrown	9d48739ef1	md: change all printk() to pr_err() or pr_warn() etc. 1/ using pr_debug() for a number of messages reduces the noise of md, but still allows them to be enabled when needed. 2/ try to be consistent in the usage of pr_err() and pr_warn(), and document the intention 3/ When strings have been split onto multiple lines, rejoin into a single string. The cost of having lines > 80 chars is less than the cost of not being able to easily search for a particular message. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
NeilBrown	7f0f0d87fa	md: fix some issues with alloc_disk_sb() 1/ don't print a warning if allocation fails. page_alloc() does that already. 2/ always check return status for error. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
Tomasz Majchrzak	91a6c4aded	md: wake up personality thread after array state update When raid1/raid10 array fails to write to one of the drives, the request is added to bio_end_io_list and finished by personality thread. The thread doesn't handle it as long as MD_CHANGE_PENDING flag is set. In case of external metadata this flag is cleared, however the thread is not woken up. It causes request to be blocked for few seconds (until another action on the array wakes up the thread) or to get stuck indefinitely. Wake up personality thread once MD_CHANGE_PENDING has been cleared. Moving 'restart_array' call after the flag is cleared it not a solution because in read-write mode the call doesn't wake up the thread. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
Tomasz Majchrzak	dcbcb48650	md: don't fail an array if there are unacknowledged bad blocks If external metadata handler supports bad blocks and unacknowledged bad blocks are present, don't report disk via sysfs as faulty. Such situation can be still handled so disk just has to be blocked for a moment. It makes it consistent with kernel state as corresponding rdev flag is also not set. When the disk in being unblocked there are few cases: 1. Disk has been in blocked and faulty state, it is being unblocked but it still remains in faulty state. Metadata handler will remove it from array in the next call. 2. There is no bad block support in external metadata handler and bad blocks are present - put the disk in blocked and faulty state (see case 1). 3. There is bad block support in external metadata handler and all bad blocks are acknowledged - clear all flags, continue. 4. There is bad block support in external metadata handler but there are still unacknowledged bad blocks - clear all flags, continue. It is fine to clear Blocked flag because it was probably not set anyway (if it was it is case 1). BlockedBadBlocks flag can also be cleared because the request waiting for it will set it again when it finds out that some bad block is still not acknowledged. Recovery is not necessary but there are no problems if the flag is set. Sysfs rdev state is still reported as blocked (due to unacknowledged bad blocks) so metadata handler will process remaining bad blocks and unblock disk again. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:20 -08:00
Tomasz Majchrzak	35b785f769	md: add bad block support for external metadata Add new rdev flag which external metadata handler can use to switch on/off bad block support. If new bad block is encountered, notify it via rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been cleared, notify update to rdev 'bad_blocks' sysfs file. When bad blocks support is being removed, just clear rdev flag. It is not necessary to reset badblocks->shift field. If there are bad blocks cleared or added at the same time, it is ok for those changes to be applied to the structure. The array is in blocked state and the drive which cannot handle bad blocks any more will be removed from the array before it is unlocked. Simplify state_show function by adding a separator at the end of each string and overwrite last separator with new line. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:20 -08:00
Christoph Hellwig	70fd76140a	block,fs: use REQ_* flags directly Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
NeilBrown	1217e1d199	md: be careful not lot leak internal curr_resync value into metadata. -- (all) mddev->curr_resync usually records where the current resync is up to, but during the starting phase it has some "magic" values. 1 - means that the array is trying to start a resync, but has yielded to another array which shares physical devices, and also needs to start a resync 2 - means the array is trying to start resync, but has found another array which shares physical devices and has already started resync. 3 - means that resync has commensed, but it is possible that nothing has actually been resynced yet. It is important that this value not be visible to user-space and particularly that it doesn't get written to the metadata, as the resync or recovery checkpoint. In part, this is because it may be slightly higher than the correct value, though this is very rare. In part, because it is not a multiple of 4K, and some devices only support 4K aligned accesses. There are two places where this value is propagates into either ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These currently avoid the propagation of values 1 and 3, but will allow 3 to leak through. Change them to only propagate the value if it is > 3. As this can cause an array to fail, the patch is suitable for -stable. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Viswesh <viswesh.vichu@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-28 22:04:05 -07:00
Tomasz Majchrzak	16f889499a	md: report 'write_pending' state when array in sync If there is a bad block on a disk and there is a recovery performed from this disk, the same bad block is reported for a new disk. It involves setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external metadata this flag is not being cleared as array state is reported as 'clean'. The read request to bad block in RAID5 array gets stuck as it is waiting for a flag to be cleared - as per commit `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns."). The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been clarified in commit `070dc6dd71` ("md: resolve confusion of MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in personality error handlers since and it doesn't fully comply with initial purpose. It was supposed to notify that write request is about to start, however now it is also used to request metadata update. Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has been set and in_sync has been set to 0 at the same time. Error handlers just set the flag without modifying in_sync value. Sysfs array state is a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is set and in_sync is set to 1. Userspace has no idea it is expected to take some action. Swap the order that array state is checked so 'write_pending' is reported ahead of 'clean' ('write_pending' is a misleading name but it is too late to rename it now). Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:19 -07:00
Shaohua Li	bb086a89a4	md: set rotational bit if all disks in an array are non-rotational, set the array non-rotational. This only works for array with all disks populated at startup. Support for disk hotadd/hotremove could be added later if necessary. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-03 10:20:27 -07:00
Shaohua Li	90bcf13381	md: fix a potential deadlock lockdep reports a potential deadlock. Fix this by droping the mutex before md_import_device [ 1137.126601] ====================================================== [ 1137.127013] [ INFO: possible circular locking dependency detected ] [ 1137.127013] 4.8.0-rc4+ #538 Not tainted [ 1137.127013] ------------------------------------------------------- [ 1137.127013] mdadm/16675 is trying to acquire lock: [ 1137.127013] (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450 [ 1137.127013] but task is already holding lock: [ 1137.127013] (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50 [ 1137.127013] which lock already depends on the new lock. [ 1137.127013] the existing dependency chain (in reverse order) is: [ 1137.127013] -> #1 (detected_devices_mutex){+.+.+.}: [ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220 [ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0 [ 1137.127013] [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90 [ 1137.127013] [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0 [ 1137.127013] [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0 [ 1137.127013] [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40 [ 1137.127013] [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30 [ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50 [ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0 [ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70 [ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 1137.127013] -> #0 (&bdev->bd_mutex){+.+.+.}: [ 1137.127013] [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690 [ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220 [ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0 [ 1137.127013] [<ffffffff81243cf3>] __blkdev_get+0x63/0x450 [ 1137.127013] [<ffffffff81244307>] blkdev_get+0x227/0x350 [ 1137.127013] [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50 [ 1137.127013] [<ffffffff81a46d65>] lock_rdev+0x35/0x80 [ 1137.127013] [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0 [ 1137.127013] [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50 [ 1137.127013] [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30 [ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50 [ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0 [ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70 [ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 1137.127013] other info that might help us debug this: [ 1137.127013] Possible unsafe locking scenario: [ 1137.127013] CPU0 CPU1 [ 1137.127013] ---- ---- [ 1137.127013] lock(detected_devices_mutex); [ 1137.127013] lock(&bdev->bd_mutex); [ 1137.127013] lock(detected_devices_mutex); [ 1137.127013] lock(&bdev->bd_mutex); [ 1137.127013] * DEADLOCK * Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	c20c33f0e2	md-cluster: clean related infos of cluster cluster_info and bitmap_info.nodes also need to be cleared when array is stopped. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	af8d8e6f03	md: changes for MD_STILL_CLOSED flag When stop clustered raid while it is pending on resync, MD_STILL_CLOSED flag could be cleared since udev rule is triggered to open the mddev. So obviously array can't be stopped soon and returns EBUSY. mdadm -Ss md-raid-arrays.rules set MD_STILL_CLOSED md_open() ... ... ... clear MD_STILL_CLOSED do_md_stop We make below changes to resolve this issue: 1. rename MD_STILL_CLOSED to MD_CLOSING since it is set when stop array and it means we are stopping array. 2. let md_open returns early if CLOSING is set, so no other threads will open array if one thread is trying to close it. 3. no need to clear CLOSING bit in md_open because 1 has ensure the bit is cleared, then we also don't need to test CLOSING bit in do_md_stop. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	e566aef12a	md-cluster: call md_kick_rdev_from_array once ack failed The new_disk_ack could return failure if WAITING_FOR_NEWDISK is not set, so we need to kick the dev from array in case failure happened. And we missed to check err before call new_disk_ack othwise we could kick a rdev which isn't in array, thanks for the reminder from Shaohua. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	47a7b0d888	md-cluster: make md-cluster also can work when compiled into kernel The md-cluster is compiled as module by default, if it is compiled by built-in way, then we can't make md-cluster works. [64782.630008] md/raid1:md127: active with 2 out of 2 mirrors [64782.630528] md-cluster module not found. [64782.630530] md127: Could not setup cluster service (-2) Fixes: `edb39c9` ("Introduce md_cluster_operations to handle cluster functions") Cc: stable@vger.kernel.org (v4.1+) Reported-by: Marc Smith <marc.smith@mcc.edu> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-08 11:11:27 -07:00
Linus Torvalds	86a1679860	Merge tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: "This includes several bug fixes: - Alexey Obitotskiy fixed a hang for faulty raid5 array with external management - Song Liu fixed two raid5 journal related bugs - Tomasz Majchrzak fixed a bad block recording issue and an accounting issue for raid10 - ZhengYuan Liu fixed an accounting issue for raid5 - I fixed a potential race condition and memory leak with DIF/DIX enabled - other trival fixes" * tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid5: avoid unnecessary bio data set raid5: fix memory leak of bio integrity data raid10: record correct address of bad block md-cluster: fix error return code in join() r5cache: set MD_JOURNAL_CLEAN correctly md: don't print the same repeated messages about delayed sync operation md: remove obsolete ret in md_start_sync md: do not count journal as spare in GET_ARRAY_INFO md: Prevent IO hold during accessing to faulty raid5 array MD: hold mddev lock to change bitmap location raid5: fix incorrectly counter of conf->empty_inactive_list_nr raid10: increment write counter after bio is split	2016-08-30 11:24:04 -07:00
Song Liu	486b0f7bcd	r5cache: set MD_JOURNAL_CLEAN correctly Currently, the code sets MD_JOURNAL_CLEAN when the array has MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array will be MD_JOURNAL_CLEAN even if the journal device is missing. With this patch, the MD_JOURNAL_CLEAN is only set when the journal device presents. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-24 10:21:50 -07:00
Artur Paszkiewicz	c622ca543b	md: don't print the same repeated messages about delayed sync operation This fixes a long-standing bug that caused a flood of messages like: "md: delaying data-check of md1 until md2 has finished (they share one or more physical units)" It can be reproduced like this: 1. Create at least 3 raid1 arrays on a pair of disks, each on different partitions. 2. Request a sync operation like 'check' or 'repair' on 2 arrays by writing to their md/sync_action attribute files. One operation should start and one should be delayed and a message like the above will be printed. 3. Issue a write to the third array. Each write will cause 2 copies of the message to be printed. This happens when wake_up(&resync_wait) is called, usually by md_check_recovery(). Then the delayed sync thread again prints the message and is put to sleep. This patch adds a check in md_do_sync() to prevent printing this message more than once for the same pair of devices. Reported-by: Sven Koehler <sven.koehler@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801 Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-17 10:22:08 -07:00
Guoqing Jiang	207efcd2b5	md: remove obsolete ret in md_start_sync The ret is not needed anymore since we have already move resync_start into md_do_sync in commit `41a9a0d`. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-17 10:22:07 -07:00
Song Liu	b347af816a	md: do not count journal as spare in GET_ARRAY_INFO GET_ARRAY_INFO counts journal as spare (spare_disks), which is not accurate. This patch fixes this. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-16 18:34:15 -07:00
Jens Axboe	1eff9d322a	block: rename bio bi_rw to bi_opf Since commit `63a4cc2486`, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-07 14:41:02 -06:00
Shaohua Li	3f35e210ed	Merge branch 'mymd/for-next' into mymd/for-linus	2016-07-28 09:34:14 -07:00
Shaohua Li	5d8817833c	MD: fix null pointer deference The md device might not have personality (for example, ddf raid array). The issue is introduced by 8430e7e0af9a15(md: disconnect device from personality before trying to remove it) Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-28 09:06:34 -07:00
Tomasz Majchrzak	573275b58e	md: add missing sysfs_notify on array_state update Changeset `6791875e2e` has added early return from a function so there is no sysfs notification for 'active' and 'clean' state change. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-19 11:28:39 -07:00
Alexey Obitotskiy	4cb9da7d9c	Fix kernel module refcount handling md loads raidX modules and increments module refcount each time level has changed but does not decrement it. You are unable to unload raid0 module after reshape because raid0 reshape changes level to raid4 and back to raid0. Signed-off-by: Aleksey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-19 11:17:31 -07:00
Arnd Bergmann	0e3ef49eda	md: use seconds granularity for error logging The md code stores the exact time of the last error in the last_read_error variable using a timespec structure. It only ever uses the seconds portion of that though, so we can use a scalar for it. There won't be an overflow in 2038 here, because it already used monotonic time and 32-bit is enough for that, but I've decided to use time64_t for consistency in the conversion. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-19 11:00:47 -07:00
NeilBrown	d787be4092	md: reduce the number of synchronize_rcu() calls when multiple devices fail. Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made which can delay several milliseconds in some case. If lots of devices fail at once - as could happen with a large RAID10 where one set of devices are removed all at once - these delays can add up to be very inconcenient. As failure is not reversible we can check for that first, setting a separate flag if it is found, and then all synchronize_rcu() once for all the flagged devices. Then ->hot_remove_disk() function can skip the synchronize_rcu() step if the flag is set. fix build error(Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:22 -07:00
NeilBrown	8430e7e0af	md: disconnect device from personality before trying to remove it. When the HOT_REMOVE_DISK ioctl is used to remove a device, we call remove_and_add_spares() which will remove it from the personality if possible. This improves the chances that the removal will succeed. When writing "remove" to dev-XX/state, we don't. So that can fail more easily. So add the remove_and_add_spares() into "remove" handling. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:12 -07:00
Xiao Ni	4ba1e78891	MD:Update superblock when err == 0 in size_store This is a simple check before updating the superblock. It should update the superblock when update_size return 0. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:11 -07:00
Cong Wang	5b1f5bc332	md: use a mutex to protect a global list We saw a list corruption in the list all_detected_devices: WARNING: CPU: 16 PID: 226 at lib/list_debug.c:29 __list_add+0x3c/0xa9() list_add corruption. next->prev should be prev (ffff880859d58320), but was ffff880859ce74c0. (next=ffffffff81abfdb0). Modules linked in: ahci libahci libata sd_mod scsi_mod CPU: 16 PID: 226 Comm: kworker/u241:4 Not tainted 4.1.20 #1 Hardware name: Dell Inc. PowerEdge C6220/04GD66, BIOS 2.2.3 11/07/2013 Workqueue: events_unbound async_run_entry_fn 0000000000000000 ffff880859a5baf8 ffffffff81502872 ffff880859a5bb48 0000000000000009 ffff880859a5bb38 ffffffff810692a5 ffff880859ee8828 ffffffff812ad02c ffff880859d58320 ffffffff81abfdb0 ffff880859eb90c0 Call Trace: [<ffffffff81502872>] dump_stack+0x4d/0x63 [<ffffffff810692a5>] warn_slowpath_common+0xa1/0xbb [<ffffffff812ad02c>] ? __list_add+0x3c/0xa9 [<ffffffff81069305>] warn_slowpath_fmt+0x46/0x48 [<ffffffff812ad02c>] __list_add+0x3c/0xa9 [<ffffffff81406f28>] md_autodetect_dev+0x41/0x62 [<ffffffff81285862>] rescan_partitions+0x25f/0x29d [<ffffffff81506372>] ? mutex_lock+0x13/0x31 [<ffffffff811a090f>] __blkdev_get+0x1aa/0x3cd [<ffffffff811a0b91>] blkdev_get+0x5f/0x294 [<ffffffff81377ceb>] ? put_device+0x17/0x19 [<ffffffff8128227c>] ? disk_put_part+0x12/0x14 [<ffffffff812836f3>] add_disk+0x29d/0x407 [<ffffffff81384345>] ? __pm_runtime_use_autosuspend+0x5c/0x64 [<ffffffffa004a724>] sd_probe_async+0x115/0x1af [sd_mod] [<ffffffff81083177>] async_run_entry_fn+0x72/0x12c [<ffffffff8107c44c>] process_one_work+0x198/0x2ce [<ffffffff8107cac7>] worker_thread+0x1dd/0x2bb [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15 [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15 [<ffffffff81080d9c>] kthread+0xae/0xb6 [<ffffffff81080000>] ? param_array_set+0x40/0xfa [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61 [<ffffffff81508152>] ret_from_fork+0x42/0x70 [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61 I suspect it is because there is no lock protecting this global list, autostart_arrays() is called in ioctl() path where there is no lock. Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-09 09:37:23 -07:00
Mike Christie	28a8f0d317	block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	796a5cf083	md: use bio op accessors Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	4e49ea4a3d	block/fs/drivers: remove rw argument from submit_bio This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Guoqing Jiang	db76767213	md: simplify the code with md_kick_rdev_from_array Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-03 16:23:02 -07:00
Guoqing Jiang	bb8bf15bd6	md-cluster: fix deadlock issue when add disk to an recoverying array Add a disk to an array which is performing recovery is a little complicated, we need to do both reap the sync thread and perform add disk for the case, then it caused deadlock as follows. linux44:~ # ps aux\|grep md\|grep D root 1822 0.0 0.0 0 0 ? D 16:50 0:00 [md127_resync] root 1848 0.0 0.0 19860 952 pts/0 D+ 16:50 0:00 mdadm --manage /dev/md127 --re-add /dev/vdb linux44:~ # cat /proc/1848/stack [<ffffffff8107afde>] kthread_stop+0x6e/0x120 [<ffffffffa051ddb0>] md_unregister_thread+0x40/0x80 [md_mod] [<ffffffffa0526e45>] md_reap_sync_thread+0x15/0x150 [md_mod] [<ffffffffa05271e0>] action_store+0x260/0x270 [md_mod] [<ffffffffa05206b4>] md_attr_store+0xb4/0x100 [md_mod] [<ffffffff81214a7e>] sysfs_write_file+0xbe/0x140 [<ffffffff811a6b98>] vfs_write+0xb8/0x1e0 [<ffffffff811a75b8>] SyS_write+0x48/0xa0 [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b [<00007f068ea1ed30>] 0x7f068ea1ed30 linux44:~ # cat /proc/1822/stack [<ffffffffa05251a6>] md_do_sync+0x846/0xf40 [md_mod] [<ffffffffa052402d>] md_thread+0x16d/0x180 [md_mod] [<ffffffff8107ad94>] kthread+0xb4/0xc0 [<ffffffff8152a518>] ret_from_fork+0x58/0x90 Task1848 Task1822 md_attr_store (held reconfig_mutex by call mddev_lock()) action_store md_reap_sync_thread md_unregister_thread kthread_stop md_wakeup_thread(mddev->thread); wait_event(mddev->sb_wait, !test_bit(MD_CHANGE_PENDING)) md_check_recovery is triggered by wakeup mddev->thread, but it can't clear MD_CHANGE_PENDING flag since it can't get lock which was held by md_attr_store already. To solve the deadlock problem, we move "->resync_finish()" from md_do_sync to md_reap_sync_thread (after md_update_sb), also MD_HELD_RESYNC_LOCK is introduced since it is possible that node can't get resync lock in md_do_sync. Then we do not need to wait for MD_CHANGE_PENDING is cleared or not since metadata should be updated after md_update_sb, so just call resync_finish if MD_HELD_RESYNC_LOCK is set. We also unified the code after skip label, since set PENDING for non-clustered case should be harmless. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-03 16:22:59 -07:00
Linus Torvalds	feaa7cb5c5	Merge tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD updates from Shaohua Li: "Several patches from Guoqing fixing md-cluster bugs and several patches from Heinz fixing dm-raid bugs" * tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md-cluster: check the return value of process_recvd_msg md-cluster: gather resync infos and enable recv_thread after bitmap is ready md: set MD_CHANGE_PENDING in a atomic region md: raid5: add prerequisite to run underneath dm-raid md: raid10: add prerequisite to run underneath dm-raid md: md.c: fix oops in mddev_suspend for raid0 md-cluster: fix ifnullfree.cocci warnings md-cluster/bitmap: unplug bitmap to sync dirty pages to disk md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit md-cluster/bitmap: fix wrong calcuation of offset md-cluster: sync bitmap when node received RESYNCING msg md-cluster: always setup in-memory bitmap md-cluster: wakeup thread if activated a spare disk md-cluster: change array_sectors and update size are not supported md-cluster: fix locking when node joins cluster during message broadcast md-cluster: unregister thread if err happened md-cluster: wake up thread to continue recovery md-cluser: make resync_finish only called after pers->sync_request md-cluster: change resync lock from asynchronous to synchronous	2016-05-19 17:25:13 -07:00
Linus Torvalds	24b9f0cf00	Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...	2016-05-17 16:03:32 -07:00
Guoqing Jiang	85ad1d13ee	md: set MD_CHANGE_PENDING in a atomic region Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:24:02 -07:00
Heinz Mauelshagen	092398dce8	md: md.c: fix oops in mddev_suspend for raid0 Introduced by upstream commit `70d9798b95` The raid0 personality does not create mddev->thread as oposed to other personalities leading to its unconditional access in mddev_suspend() causing an oops. Patch checks for mddev->thread in order to keep the intention of aforementioned commit. Fixes: `70d9798b95` ("MD: warn for potential deadlock") Cc: stable@vger.kernel.org (4.5+) Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:23:23 -07:00
Guoqing Jiang	a578183ed9	md-cluster: wakeup thread if activated a spare disk When a device is re-added, it will ultimately need to be activated and that happens in md_check_recovery, so we need to set MD_RECOVERY_NEEDED right after remove_and_add_spares. A specifical issue without the change is that when one node perform fail/remove/readd on a disk, but slave nodes could not add the disk back to array as expected (added as missed instead of in sync). So give slave nodes a chance to do resync. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	ab5a98b132	md-cluster: change array_sectors and update size are not supported Currently, some features are not supported yet, such as change array_sectors and update size, so return EINVAL for them and listed it in document. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	2c97cf1385	md-cluser: make resync_finish only called after pers->sync_request It is not reasonable that cluster raid to release resync lock before the last pers->sync_request has finished. As the metadata will be changed when node performs resync, we need to inform other nodes to update metadata, so the MD_CHANGE_PENDING flag is set before finish resync. Then metadata_update_finish is move ahead to ensure that METADATA_UPDATED msg is sent before finish resync, and metadata_update_start need to be run after "repeat:" label accordingly. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	41a9a0dcf8	md-cluster: change resync lock from asynchronous to synchronous If multiple nodes choose to attempt do resync at the same time they need to be serialized so they don't duplicate effort. This serialization is done by locking the 'resync' DLM lock. Currently if a node cannot get the lock immediately it doesn't request notification when the lock becomes available (i.e. DLM_LKF_NOQUEUE is set), so it may not reliably find out when it is safe to try again. Rather than trying to arrange an async wake-up when the lock becomes available, switch to using synchronous locking - this is a lot easier to think about. As it is not permitted to block in the 'raid1d' thread, move the locking to the resync thread. So the rsync thread is forked immediately, but it blocks until the resync lock is available. Once the lock is locked it checks again if any resync action is needed. A particular symptom of the current problem is that a node can get stuck with "resync=pending" indefinitely. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Shaohua Li	9c573de328	MD: make bio mergeable blk_queue_split marks bio unmergeable, which makes sense for normal bio. But if dispatching the bio to underlayer disk, the blk_queue_split checks are invalid, hence it's possible the bio becomes mergeable. In the reported bug, this bug causes trim against raid0 performance slash https://bugzilla.kernel.org/show_bug.cgi?id=117051 Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com> Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio) Cc: stable@vger.kernel.org (v4.3+) Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-04-25 18:21:33 -07:00
Jens Axboe	56883a7ec8	md: update to using blk_queue_write_cache() Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-04-12 16:00:39 -06:00

1 2 3 4 5 ...

1039 Commits