linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-13 09:26:49 +07:00

Author	SHA1	Message	Date
Song Liu	bac624f3f8	MD: add a new disk role to present write journal device Next patches will use a disk as raid5/6 journaling. We need a new disk role to present the journal device and add MD_FEATURE_JOURNAL to feature_map for backward compability. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Song Liu	c4d4c91b44	MD: replace special disk roles with macros Add the following two macros for special roles: spare and faulty MD_DISK_ROLE_SPARE 0xffff MD_DISK_ROLE_FAULTY 0xfffe Add MD_DISK_ROLE_MAX 0xff00 as the maximal possible regular role, and minimal value of special role. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Goldwyn Rodrigues	28c1b9fdf4	md-cluster: Call update_raid_disks() if another node --grow's raid_disks To incorporate --grow feature executed on one node, other nodes need to acknowledge the change in number of disks. Call update_raid_disks() to update internal data structures. This leads to call check_reshape() -> md_allow_write() -> md_update_sb(), this results in a deadlock. This is done so it can safely allocate memory (which might trigger writeback which might write to raid1). This is not required for md with a bitmap. In the clustered case, we don't perform md_update_sb() in md_allow_write(), but in do_md_run(). Also we disable safemode for clustered mode. mddev->recovery_cp need not be set in check_sb_changes() because this is required only when a node reads another node's bitmap. mddev->recovery_cp (which is read from sb->resync_offset), is set only if mddev is in_sync. Since we disabled safemode, in_sync is set to zero. In a clustered environment, the MD may not be in sync because another node could be writing to it. So make sure that in_sync is not set in case of clustered node in __md_stop_writes(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Guoqing Jiang	23b63f9fa8	md: check the return value for metadata_update_start We shouldn't run related funs of md_cluster_ops in case metadata_update_start returned failure. Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	a9720903d1	md-cluster: only call kick_rdev_from_array after remove disk successfully For cluster raid, we should not kick it from array if the disk can't be remove from array successfully. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 11:58:15 -05:00
Goldwyn Rodrigues	dbb64f8635	md-cluster: Fix adding of new disk with new reload code Adding the disk worked incorrectly with the new reload code. Fix it: - No operation should be performed on rdev marked as Candidate - After a metadata update operation, kick disk if role is 0xfffe else clear Candidate bit and continue with the regular change check. - Saving the mode of the lock resource to check if token lock is already locked, because it can be called twice while adding a disk. However, unlock_comm() must be called only once. - add_new_disk() is called by the node initiating the --add operation. If it needs to be canceled, call add_new_disk_cancel(). The operation is completed by md_update_sb() which will write and unlock the communication. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 03:35:30 -05:00
Goldwyn Rodrigues	c186b128cd	md-cluster: Perform resync/recovery under a DLM lock Resync or recovery must be performed by only one node at a time. A DLM lock resource, resync_lockres provides the mutual exclusion so that only one node performs the recovery/resync at a time. If a node is unable to get the resync_lockres, because recovery is being performed by another node, it set MD_RECOVER_NEEDED so as to schedule recovery in the future. Remove the debug message in resync_info_update() used during development. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 03:32:44 -05:00
Goldwyn Rodrigues	2aa82191ac	md-cluster: Perform a lazy update In a clustered environment, a change such as marking a device faulty, can be recorded by any of the nodes. This is communicated to all the nodes and re-recording such a change is unnecessary, and quite often pretty disruptive. With this patch, just before the update, we detect for the changes and if the changes are already in superblock, we abort the update after clearing all the flags Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 01:37:17 -05:00
Goldwyn Rodrigues	70bcecdb15	md-cluster: Improve md_reload_sb to be less error prone md_reload_sb is too simplistic and it explicitly needs to determine the changes made by the writing node. However, there are multiple areas where a simple reload could fail. Instead, read the superblock of one of the "good" rdevs and update the necessary information: - read the superblock into a newly allocated page, by temporarily swapping out rdev->sb_page and calling ->load_super. - if that fails return - if it succeeds, call check_sb_changes 1. iterates over list of active devices and checks the matching dev_roles[] value. If that is 'faulty', the device must be marked as faulty - call md_error to mark the device as faulty. Make sure not to set CHANGE_DEVS and wakeup mddev->thread or else it would initiate a resync process, which is the responsibility of the "primary" node. - clear the Blocked bit - Call remove_and_add_spares() to hot remove the device. If the device is 'spare': - call remove_and_add_spares() to get the number of spares added in this operation. - Reduce mddev->degraded to mark the array as not degraded. 2. reset recovery_cp - read the rest of the rdevs to update recovery_offset. If recovery_offset is equal to MaxSector, call spare_active() to set it In_sync This required that recovery_offset be initialized to MaxSector, as opposed to zero so as to communicate the end of sync for a rdev. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 01:34:48 -05:00
Goldwyn Rodrigues	2910ff17d1	md: remove_and_add_spares() to activate specific rdev remove_and_add_spares() checks for all devices to activate spare. Change it to activate a specific device if a non-null rdev argument is passed. remove_and_add_spares() can be used to activate spares in slot_store() as well. For hot_remove_disk(), check if rdev->raid_disk == -1 before calling remove_and_add_spares() Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 01:33:58 -05:00
Goldwyn Rodrigues	c40f341f1e	md-cluster: Use a small window for resync Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window (32M) is maintained in r1conf as cluster_sync_low and cluster_sync_high and processed in raid1's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Check if the sync will fit in the new window, if not issue a wait_barrier() and set cluster_sync_low to sector_nr. 3. Set cluster_sync_high to cluster_sync_low + resync_window. 4. Send a message to all nodes so they may add it in their suspension list. bitmap_cond_end_sync is modified to allow to force a sync inorder to get the curr_resync_completed uptodate with the sector passed. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-10-12 01:32:05 -05:00
Goldwyn Rodrigues	3c462c880b	md: Increment version for clustered bitmaps Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels to assemble a clustered device. In order to maximize compatibility, the major version is set to BITMAP_MAJOR_CLUSTERED only if the bitmap is clustered. Added MD_FEATURE_CLUSTERED in order to return error for older kernels which would assemble MD even if the bitmap is corrupted. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-12 01:31:33 -05:00
Shaohua Li	d4929add83	md: clear CHANGE_PENDING in readonly array If faulty disks of an array are more than allowed degraded number, the array enters error handling. It will be marked as read-only with MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't clear CHANGE_PENDING bit for read-only array. If MD_CHANGE_PENDING is set for a raid5 array, all returned IO will be hold on a list till the bit is clear. But recovery nevery clears this bit, the IO is always in pending state and nevery finish. This has bad effects like upper layer can't get an IO error and the array can't be stopped. Fixes: `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns.") Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
NeilBrown	88724bfa68	md: wait for pending superblock updates before switching to read-only If a superblock update is pending, wait for it to complete before letting md_set_readonly() switch to readonly. Otherwise we might lose important information about a device having failed. For external arrays, waiting for superblock updates can wait on user-space, so in that case, just return an error. Reported-and-tested-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:43 +10:00
NeilBrown	e89c6fdf9e	Merge linux-block/for-4.3/core into md/for-linux There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>	2015-09-05 11:08:32 +02:00
Linus Torvalds	1081230b74	Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...	2015-09-02 13:10:25 -07:00
NeilBrown	55ce74d4bf	md/raid1: ensure device failure recorded before write request returns. When a write to one of the legs of a RAID1 fails, the failure is recorded in the metadata of the other leg(s) so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:23 +02:00
NeilBrown	6022e75bf0	md: extend spinlock protection in register_md_cluster_operations This code looks racy. The only possible race is if two modules try to register at the same time and that won't happen. But make the code look safe anyway. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:59 +02:00
Guoqing Jiang	dc737d7c3d	md-cluster: transfer the resync ownership to another node When node A stops an array while the array is doing a resync, we need to let another node B take over the resync task. To achieve the goal, we need the A send an explicit BITMAP_NEEDS_SYNC message to the cluster. And the node B which received that message will invoke __recover_slot to do resync. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:41:12 +02:00
Sasha Levin	25b2edfa3b	md: setup safemode_timer before it's being used We used to set up the safemode_timer timer in md_run. If md_run would fail before the timer was set up we'd end up trying to modify a timer that doesn't have a callback function when we access safe_delay_store, which would trigger a BUG. neilb: delete init_timer() call as setup_timer() does that. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:39:39 +02:00
NeilBrown	5ed1df2eac	md: sync sync_completed has correct value as recovery finishes. There can be a small window between the moment that recovery actually writes the last block and the time when various sysfs and /proc/mdstat attributes report that it has finished. During this time, 'sync_completed' can have the wrong value. This can confuse monitoring software. So: - don't set curr_resync_completed beyond the end of the devices, - set it correctly when resync/recovery has completed. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:38:17 +02:00
NeilBrown	c5e19d906a	md: be careful when testing resync_max against curr_resync_completed. While it generally shouldn't happen, it is not impossible for curr_resync_completed to exceed resync_max. This can particularly happen when reshaping RAID5 - the current status isn't copied to curr_resync_completed promptly, so when it is, it can exceed resync_max. This happens when the reshape is 'frozen', resync_max is set low, and reshape is re-enabled. Taking a difference between two unsigned numbers is always dangerous anyway, so add a test to behave correctly if curr_resync_completed > resync_max Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:37:33 +02:00
NeilBrown	a4a3d26d87	md: set MD_RECOVERY_RECOVER when starting a degraded array. This ensures that 'sync_action' will show 'recover' immediately the array is started. If there is no spare the status will change to 'idle' once that is detected. Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change happens. This allows scripts which monitor status not to get confused - particularly my test scripts. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:37:03 +02:00
NeilBrown	985ca973b6	md: close some races between setting and checking sync_action. When checking sync_action in a script, we want to be sure it is as accurate as possible. As resync/reshape etc doesn't always start immediately (a separate thread is scheduled to do it), it is best if 'action_show' checks if MD_RECOVER_NEEDED is set (which it does) and in that case reports what is likely to start soon (which it only sometimes does). So: - report 'reshape' if reshape_position suggests one might start. - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very likely to happen next. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:30:40 +02:00
NeilBrown	f7851be736	md: Keep /proc/mdstat reporting recovery until fully DONE. Currently when a recovery completes, mdstat shows that it has finished before the new device is marked as a full member. Because of this it can appear to a script that the recovery finished but the array isn't in sync. So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery". Once md_reap_sync_thread() completes, the spare will be active and then MD_RECOVERY_DONE will be cleared. To ensure this is race-free, set MD_RECOVERY_DONE before clearning curr_resync. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:29:09 +02:00
Kent Overstreet	8ae126660f	block: kill merge_bvec_fn() completely As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:57 -06:00
Kent Overstreet	54efd50bfd	block: make generic_make_request handle arbitrarily sized bios The way the block layer is currently written, it goes to great lengths to avoid having to split bios; upper layer code (such as bio_add_page()) checks what the underlying device can handle and tries to always create bios that don't need to be split. But this approach becomes unwieldy and eventually breaks down with stacked devices and devices with dynamic limits, and it adds a lot of complexity. If the block layer could split bios as needed, we could eliminate a lot of complexity elsewhere - particularly in stacked drivers. Code that creates bios can then create whatever size bios are convenient, and more importantly stacked drivers don't have to deal with both their own bio size limitations and the limitations of the (potentially multiple) devices underneath them. In the future this will let us delete merge_bvec_fn and a bunch of other code. We do this by adding calls to blk_queue_split() to the various make_request functions that need it - a few can already handle arbitrary size bios. Note that we add the call _after_ any call to blk_queue_bounce(); this means that blk_queue_split() and blk_recalc_rq_segments() don't need to be concerned with bouncing affecting segment merging. Some make_request_fn() callbacks were simple enough to audit and verify they don't need blk_queue_split() calls. The skipped ones are: * nfhd_make_request (arch/m68k/emu/nfblock.c) * axon_ram_make_request (arch/powerpc/sysdev/axonram.c) * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c) * brd_make_request (ramdisk - drivers/block/brd.c) * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c) * loop_make_request * null_queue_bio * bcache's make_request fns Some others are almost certainly safe to remove now, but will be left for future patches. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Geoff Levand <geoff@infradead.org> Cc: Jim Paris <jim@jtan.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: skip more mq-based drivers, resolve merge conflicts, etc.] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:33 -06:00
Benjamin Randazzo	25eafe1a81	md: simplify get_bitmap_file now that "file" is zeroed. There is no point assigning '\0' to file->pathname[0] as file is now zeroed out, so remove that branch and simplify the code. [Original patch combined this with the change to use kzalloc. I split the two so that the change to kzalloc is easier to backport. - neilb] Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-03 17:12:44 +10:00
Benjamin Randazzo	b6878d9e03	md: use kzalloc() when bitmap is disabled In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a mdu_bitmap_file_t called "file". 5769 file = kmalloc(sizeof(file), GFP_NOIO); 5770 if (!file) 5771 return -ENOMEM; This structure is copied to user space at the end of the function. 5786 if (err == 0 && 5787 copy_to_user(arg, file, sizeof(file))) 5788 err = -EFAULT But if bitmap is disabled only the first byte of "file" is initialized with zero, so it's possible to read some bytes (up to 4095) of kernel space memory from user space. This is an information leak. 5775 /* bitmap disabled, zero the first byte and copy out */ 5776 if (!mddev->bitmap_info.file) 5777 file->pathname[0] = '\0'; Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-03 14:56:02 +10:00
Christoph Hellwig	4246a0b63b	block: add a bi_error field to struct bio Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:15 -06:00
Linus Torvalds	aca105a697	Some md fixes for 4.2 Several are tagged for -stable. A few aren't because they are not very, serious or because they are in the 'experimental' cluster code. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVsbRgAAoJEDnsnt1WYoG54BcQAJlVxcGdMXNAbFAfhToH+cwY DcCKmnXiu3TCcK/gtr0SlUKIhv7kPE2HaGbTko3g5/uTk7SCuMSWRbjpQJr1u98U 9VUPZV0RGLEcQXgsjG3sobEtdSYMSn1/BpdJpmsn/Q6cyTheiEJA9fEghn9F0Iw9 3Ctc56aL0nKsnpeRTvWmKR3F995L2Ene+pIHPbSqTlQcGU+DdxQL/iY9YspMzeih 6SWQICo59+pGSRQdKaU968nZ1a8hJCvhV2NbW0JsJMo9iGo5htCJLdxZatscZV6E xAppCRymW/Nl0n59wCOBhUp+0skVhtYZ1UmNdx/vUA7vcZJX85vIG7WGaaSQAmlF Y4nNMSaX6SWbt/oVgq+JiD39T1oFZN5N5Fh9juqcp3fshiWLO74cnTwea1LtkQZX LfW2HhajDUgoJqoXL6ppKDK8l7alQdcaoYn0C6SfqRZ+PLB5ERrSBHNZpKDbI9aw CFW93lHTtlwtwZn0S7k06mCG8ilO4iPthUVaJEeI1kUWVKY0Ju4xnVTt/7jCroUa Km14KdWeZsS8j9/xr6FYBjarjuh7M9SoflgUK84txE+PIZsSczyrErpBkMf2YBWQ 0KduqQabXa1XYWGtZ7Uhj6iOOGNBcTcZsww8cBEDOJEDyCtSs5w/iPmsFCEGUuo2 YTz6qKpYPgB1VzK2PoBG =6ZCK -----END PGP SIGNATURE----- Merge tag 'md/4.2-fixes' of git://neil.brown.name/md Pull md fixes from Neil Brown: "Some md fixes for 4.2 Several are tagged for -stable. A few aren't because they are not very, serious or because they are in the 'experimental' cluster code" * tag 'md/4.2-fixes' of git://neil.brown.name/md: md/raid5: clear R5_NeedReplace when no longer needed. Fix read-balancing during node failure md-cluster: fix bitmap sub-offset in bitmap_read_sb md: Return error if request_module fails and returns positive value md: Skip cluster setup in case of error while reading bitmap md/raid1: fix test for 'was read error from last working device'. md: Skip cluster setup for dm-raid md: flush ->event_work before stopping array. md/raid10: always set reshape_safe when initializing reshape_position. md/raid5: avoid races when changing cache size.	2015-07-25 11:24:58 -07:00
Goldwyn Rodrigues	b0c26a79d6	md: Return error if request_module fails and returns positive value request_module() can return 256 (process exited) in some cases, which is not as specified in the documentation before the request_module() definition. Convert the error to -ENOENT. The positive error number results in bitmap_create() returning a value that is meant to be an error but doesn't look like one, so it is dereferenced as a point and causes a crash. (not needed for stable as this is "experimental" code) Fixes: `edb39c9ded` ("Introduce md_cluster_operations to handle cluster functions") Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-24 13:37:51 +10:00
NeilBrown	ee5d004fd0	md: flush ->event_work before stopping array. The 'event_work' worker used by dm-raid may still be running when the array is stopped. This can result in an oops. So flush the workqueue on which it is run after detaching and before destroying the device. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release) Fixes: `9d09e663d5` ("dm: raid456 basic support")	2015-07-22 14:09:29 +10:00
Linus Torvalds	1dc51b8288	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...	2015-07-04 19:36:06 -07:00
Linus Torvalds	6aaf0da872	md updates for 4.2 A mixed bag - a few bug fixes - some performance improvement that decrease lock contention - some clean-up Nothing major. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVi6weAAoJEDnsnt1WYoG50CsP/RqFbZicRSIvzXUURwP+yCP0 3YZuURj4IXC6Cy/HLX+bZoj1p/b+GIRsZ72fWFJrd2LheaAI6WojCCLlnmXUtI/Y LIppF8/A2hfCNbF9cILByvrbzfndeEGK8kvootBDpvD0jlYiGePPAMQY2zx0MAyb T4yJ/KiziLniP6x7vqZrQ6I1MRVjeanN6RWXktFtixMpNOKUJe3PiZbUz4VDIrHR DaiHCbMjvRIkUWgNY8HmijEt+c8AYia7muqLj359dy2xF1hlUIdCx+61cgFD1zd8 enKDH3xp+3B9BEgHe+AtxTAzpqSgU93tdhUjGcy/orA+yYjAAcA4ifngrzfE3VKb kwQgPh2JvUrubavrcto0hthS5RldrCpDXebOM4aEq+7lDHCwrZ39Qio5+1F7TLt5 A5E3Eb7dPRdp9T3LrluX8/f7bO/Wbmxvv/RwnSLTpnGQoBWIAqCpQ+e9ro446Gsx /phXv3tE78fKj88LgQY/mm8ICeCppmQGLrpmjk9bkaZzqFdzQoURVmPh8QPMuJB4 iMHpOOKLzrUlW/23rRxaIKwPuFyxlNuLAvyA3ezsymGiZ+SqSeFCEm1jN64EfMCI 39rpfZt2pcVVOZJ9YeuzZG9wpie96yGZgnVWlP3FPjqRpboXqmtHlYA6EMRtqDAy mjSiGDF2bxkT1/YcjELD =sXTI -----END PGP SIGNATURE----- Merge tag 'md/4.2' of git://neil.brown.name/md Pull md updates from Neil Brown: "A mixed bag - a few bug fixes - some performance improvement that decrease lock contention - some clean-up Nothing major" * tag 'md/4.2' of git://neil.brown.name/md: md: clear Blocked flag on failed devices when array is read-only. md: unlock mddev_lock on an error path. md: clear mddev->private when it has been freed. md: fix a build warning md/raid5: ignore released_stripes check md/raid5: per hash value and exclusive wait_for_stripe md/raid5: split wait_for_stripe and introduce wait_for_quiescent wait: introduce wait_event_exclusive_cmd md: convert to kstrto*() md/raid10: make sync_request_write() call bio_copy_data()	2015-06-29 11:10:56 -07:00
Rasmus Villemoes	90a9befb20	drivers/md/md.c: use strreplace() There's no point in starting over when we meet a '/'. This also eliminates a stack variable and a little .text. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-06-25 17:00:40 -07:00
Neil Brown	ab16bfc732	md: clear Blocked flag on failed devices when array is read-only. The Blocked flag indicates that a device has failed but that this fact hasn't been recorded in the metadata yet. Writes to such devices cannot be allowed until the metadata has been updated. On a read-only array, the Blocked flag will never be cleared. This prevents the device being removed from the array. If the metadata is being handled by the kernel (i.e. !mddev->external), then we can be sure that if the array is switch to writable, then a metadata update will happen and will record the failure. So we don't need the flag set. If metadata is externally managed, it is upto the external manager to clear the 'blocked' flag. Reported-by: XiaoNi <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-25 17:16:49 +10:00
NeilBrown	9a8c0fa861	md: unlock mddev_lock on an error path. This error path retuns while still holding the lock - bad. Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.") Cc: stable@vger.kernel.org (v4.0+) Signed-off-by: NeilBrown <neilb@suse.com>	2015-06-25 17:14:09 +10:00
NeilBrown	bd6919228d	md: clear mddev->private when it has been freed. If ->private is set when ->run is called, it is assumed to be a 'config' prepared as part of 'reshape'. So it is important when we free that config, that we also clear ->private. This is not often a problem as the mddev will normally be discarded shortly after the config us freed. However if an 'assemble' races with a final close, the assemble can use the old mddev which has a stale ->private. This leads to any of various sorts of crashes. So clear ->private after calling ->free(). Reported-by: Nate Clark <nate@neworld.us> Cc: stable@vger.kernel.org (v4.0+) Fixes: `afa0f557cb` ("md: rename ->stop to ->free") Signed-off-by: NeilBrown <neilb@suse.com>	2015-06-25 17:14:09 +10:00
Miklos Szeredi	9bf39ab2ad	vfs: add file_path() helper Turn d_path(&file->f_path, ...); into file_path(file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-23 18:00:05 -04:00
Firo Yang	4e02361232	md: fix a build warning Warning like this: drivers/md/md.c: In function "update_array_info": drivers/md/md.c:6394:26: warning: logical not is only applied to the left hand side of comparison [-Wlogical-not-parentheses] !mddev->persistent != info->not_persistent\|\| Fix it as Neil Brown said: mddev->persistent != !info->not_persistent \|\| Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:38 +10:00
Alexey Dobriyan	4c9309c0cc	md: convert to kstrto() Convert away from deprecated simple_strto() functions. Add "fit into sector_t" checks. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:06 +10:00
NeilBrown	ea358cd0d2	md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync MD_RECOVERY_DONE is normally cleared by md_check_recovery after a resync etc finished. However it is possible for raid5_start_reshape to race and start a reshape before MD_RECOVERY_DONE is cleared. This can lean to multiple reshapes running at the same time, which isn't good. To make sure it is cleared before starting a reshape, and also clear it when reaping a thread, just to be safe. Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-12 20:16:33 +10:00
NeilBrown	8e8e2518fc	md: Close race when setting 'action' to 'idle'. Checking ->sync_thread without holding the mddev_lock() isn't really safe, even after flushing the workqueue which ensures md_start_sync() has been run. While this code is waiting for the lock, md_check_recovery could reap the thread itself, and then start another thread (e.g. recovery might finish, then reshape starts). When this thread gets the lock md_start_sync() hasn't run so it doesn't get reaped, but MD_RECOVERY_RUNNING gets cleared. This allows two threads to start which leads to confusion. So don't both if MD_RECOVERY_RUNNING isn't set, but if it is do the flush and the test and the reap all under the mddev_lock to avoid any race with md_check_recovery. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.") Cc: stable@vger.kernel.org (v4.0+)	2015-06-12 20:16:26 +10:00
NeilBrown	c008f1d356	md: don't return 0 from array_state_store Returning zero from a 'store' function is bad. The return value should be either len length of the string or an error. So use 'len' if 'err' is zero. Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.") Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel (v4.0+)	2015-06-12 20:16:16 +10:00
Linus Torvalds	c492e2d464	Assorted fixes for new RAID5 stripe-batching functionality. Unfortunately this functionality was merged a little prematurely. The necessary testing and code review is now complete (or as complete as it can be) and to code passes a variety of tests and looks quite sensible. Also a fix for some recent locking changes - a race was introduced which causes a reshape request to sometimes fail. No data safety issues. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVWf/sDnsnt1WYoG5AQJ8hQ/+KIGUijacXXUXBE4QuO1DMTkltV61bk6E TJQ6fTuMvXeOuyGm+BoSFTJrOJiP6/PVxl4jnAkjLlvAK/JVKekG0PXv2flmD9EJ udK/g8d2k+4L2O0uiGdGSfOQaEaQ4OQvNmQOP9GF/FXNdyYfZbSJnxG+kzWnStGZ 3LNEMoDok9TiUDVSJ3PgibnUHYr3zNJFjBGszfRW0HqXBRWM5TI6HQ0bWwrm61mQ sIOvFeS7CVOBQWW7zkY3uvz/g7dpuPlXqmDOomF+prKlU320SrpSDDBD2Qg56rXh 8YGAzLPV8R6xB5hjGFnoHtvxF/f5Fntb3WbC5az0zv+q/phDYA9Nd2UN5APemyGB PJuxW4Ojq2DWIAvmf0HQEkvjJlqeugCgIQXJJ8yvIaBXJJjit1jMSEXjolM4vlLh h6Su/hwoyTi9NxdYpFeR6JuyHjzTyrjyBkbW8y12wVQjmDncBdKtieYZX4TvPxVz n7Qrk2bpFhR/icP6eYWCvt6iwU1e+5lXNb/18AYm9bJe5BE5/N1X0azrxbdZT4cl 1DvQw2HAMBGp+nSr+R1lqO4yX+busBZUTYsaGvH4T7Ubs+UjwgTE3tPoevj6w829 0/7r/UPfSn0XFbd5rrPY+bOBsAOIMDG5g3mj7K7+38sVeX9VOVN4sGftS5dWTr9e RQBTZAK0+qI= =Y0Vm -----END PGP SIGNATURE----- Merge tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md Pull m,ore md bugfixes gfrom Neil Brown: "Assorted fixes for new RAID5 stripe-batching functionality. Unfortunately this functionality was merged a little prematurely. The necessary testing and code review is now complete (or as complete as it can be) and to code passes a variety of tests and looks quite sensible. Also a fix for some recent locking changes - a race was introduced which causes a reshape request to sometimes fail. No data safety issues" * tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md: md: fix race when unfreezing sync_action md/raid5: break stripe-batches when the array has failed. md/raid5: call break_stripe_batch_list from handle_stripe_clean_event md/raid5: be more selective about distributing flags across batch. md/raid5: add handle_flags arg to break_stripe_batch_list. md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list md/raid5: remove condition test from check_break_stripe_batch_list. md/raid5: Ensure a batch member is not handled prematurely. md/raid5: close race between STRIPE_BIT_DELAY and batching. md/raid5: ensure whole batch is delayed for all required bitmap updates.	2015-05-29 10:35:21 -07:00
NeilBrown	56ccc1125b	md: fix race when unfreezing sync_action A recent change removed the need for locking around writing to "sync_action" (and various other places), but introduced a subtle race. When e.g. setting 'reshape' on a 'frozen' array, the 'frozen' flag is cleared before 'reshape' is set, so the md thread can get in and start trying recovery - which isn't wanted. So instead of clearing MD_RECOVERY_FROZEN for any command except 'frozen', only clear it when each specific command is parsed. This allows the handling of 'reshape' to clear the bit while a lock is held. Also remove some places where we set MD_RECOVERY_NEEDED, as it is always set on non-error exit of the function. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.")	2015-05-28 18:04:45 +10:00
Linus Torvalds	1daac193f2	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A collection of fixes since the merge window; - fix for a double elevator module release, from Chao Yu. Ancient bug. - the splice() MORE flag fix from Christophe Leroy. - a fix for NVMe, fixing a patch that went in in the merge window. From Keith. - two fixes for blk-mq CPU hotplug handling, from Ming Lei. - bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md. - two blk-mq fixes from Shaohua, fixing a race on queue stop and a bad merge issue with FUA writes. - division-by-zero fix for writeback from Tejun. - a block bounce page accounting fix, making sure we inc/dec after bouncing so that pre/post IO pages match up. From Wang YanQing" * 'for-linus' of git://git.kernel.dk/linux-block: splice: sendfile() at once fails for big files blk-mq: don't lose requests if a stopped queue restarts blk-mq: fix FUA request hang block: destroy bdi before blockdev is unregistered. block:bounce: fix call inc_\|dec_zone_page_state on different pages confuse value of NR_BOUNCE elevator: fix double release of elevator module writeback: use \|1 instead of +1 to protect against div by zero blk-mq: fix CPU hotplug handling blk-mq: fix race between timeout and CPU hotplug NVMe: Fix VPD B0 max sectors translation	2015-05-08 19:49:35 -07:00
NeilBrown	6cd18e711d	block: destroy bdi before blockdev is unregistered. Because of the peculiar way that md devices are created (automatically when the device node is opened), a new device can be created and registered immediately after the blk_unregister_region(disk_devt(disk), disk->minors); call in del_gendisk(). Therefore it is important that all visible artifacts of the previous device are removed before this call. In particular, the 'bdi'. Since: commit `c4db59d31e` Author: Christoph Hellwig <hch@lst.de> fs: don't reassign dirty inodes to default_backing_dev_info moved the device_unregister(bdi->dev); call from bdi_unregister() to bdi_destroy() it has been quite easy to lose a race and have a new (e.g.) "md127" be created after the blk_unregister_region() call and before bdi_destroy() is ultimately called by the final 'put_disk', which must come after del_gendisk(). The new device finds that the bdi name is already registered in sysfs and complains > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70() > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127' We can fix this by moving the bdi_destroy() call out of blk_release_queue() (which can happen very late when a refcount reaches zero) and into blk_cleanup_queue() - which happens exactly when the md device driver calls it. Then it is only necessary for md to call blk_cleanup_queue() before del_gendisk(). As loop.c devices are also created on demand by opening the device node, we make the same change there. Fixes: `c4db59d31e` Reported-by: Azat Khuzhin <a3at.mail@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org (v4.0) Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-04-27 10:27:20 -06:00
NeilBrown	ac8fa4196d	md: allow resync to go faster when there is competing IO. When md notices non-sync IO happening while it is trying to resync (or reshape or recover) it slows down to the set minimum. The default minimum might have made sense many years ago but the drives have become faster. Changing the default to match the times isn't really a long term solution. This patch changes the code so that instead of waiting until the speed has dropped to the target, it just waits until pending requests have completed. This means that the delay inserted is a function of the speed of the devices. Testing shows that: - for some loads, the resync speed is unchanged. For those loads increasing the minimum doesn't change the speed either. So this is a good result. To increase resync speed under such loads we would probably need to increase the resync window size. - for other loads, resync speed does increase to a reasonable fraction (e.g. 20%) of maximum possible, and throughput of the load only drops a little bit (e.g. 10%) - for other loads, throughput of the non-sync load drops quite a bit more. These seem to be latency-sensitive loads. So it isn't a perfect solution, but it is mostly an improvement. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00

1 2 3 4 5 ...

847 Commits