linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-13 17:36:46 +07:00

Author	SHA1	Message	Date
NeilBrown	0472a42ba1	md/raid5: remove over-loading of ->bi_phys_segments. When a read request, which bypassed the cache, fails, we need to retry it through the cache. This involves attaching it to a sequence of stripe_heads, and it may not be possible to get all the stripe_heads we need at once. We do what we can, and record how far we got in ->bi_phys_segments so we can pick up again later. There is only ever one bio which may have a non-zero offset stored in ->bi_phys_segments, the one that is either active in the single thread which calls retry_aligned_read(), or is in conf->retry_read_aligned waiting for retry_aligned_read() to be called again. So we only need to store one offset value. This can be in a local variable passed between remove_bio_from_retry() and retry_aligned_read(), or in the r5conf structure next to the ->retry_read_aligned pointer. Storing it there allows the last usage of ->bi_phys_segments to be removed from md/raid5.c. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:56 -07:00
NeilBrown	016c76ac76	md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter md/raid5 needs to keep track of how many stripe_heads are processing a bio so that it can delay calling bio_endio() until all stripe_heads have completed. It currently uses 16 bits of ->bi_phys_segments for this purpose. 16 bits is only enough for 256M requests, and it is possible for a single bio to be larger than this, which causes problems. Also, the bio struct contains a larger counter, __bi_remaining, which has a purpose very similar to the purpose of our counter. So stop using ->bi_phys_segments, and instead use __bi_remaining. This means we don't need to initialize the counter, as our caller initializes it to '1'. It also means we can call bio_endio() directly as it tests this counter internally. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:30 -07:00
NeilBrown	bd83d0a28c	md/raid5: call bio_endio() directly rather than queueing for later. We currently gather bios that need to be returned into a bio_list and call bio_endio() on them all together. The original reason for this was to avoid making the calls while holding a spinlock. Locking has changed a lot since then, and that reason is no longer valid. So discard return_io() and various return_bi lists, and just call bio_endio() directly as needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:12 -07:00
NeilBrown	16d997b78b	md/raid5: simplfy delaying of writes while metadata is updated. If a device fails during a write, we must ensure the failure is recorded in the metadata before the completion of the write is acknowleged. Commit `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns.") added code for this, but it was unnecessarily complicated. We already had similar functionality for handling updates to the bad-block-list, thanks to Commit `de393cdea6` ("md: make it easier to wait for bad blocks to be acknowledged.") So revert most of the former commit, and instead avoid collecting completed writes if MD_CHANGE_PENDING is set. raid5d() will then flush the metadata and retry the stripe_head. As this change can leave a stripe_head ready for handling immediately after handle_active_stripes() returns, we change raid5_do_work() to pause when MD_CHANGE_PENDING is set, so that it doesn't spin. We check MD_CHANGE_PENDING after analyse_stripe() as it could be set asynchronously. After analyse_stripe(), we have collected stable data about the state of devices, which will be used to make decisions. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:15:57 -07:00
NeilBrown	497280509f	md/raid5: use md_write_start to count stripes, not bios We use md_write_start() to increase the count of pending writes, and md_write_end() to decrement the count. We currently count bios submitted to md/raid5. Change it count stripe_heads that a WRITE bio has been attached to. So now, raid5_make_request() calls md_write_start() and then md_write_end() to keep the count elevated during the setup of the request. add_stripe_bio() calls md_write_start() for each stripe_head, and the completion routines always call md_write_end(), instead of only calling it when raid5_dec_bi_active_stripes() returns 0. make_discard_request also calls md_write_start/end(). The parallel between md_write_{start,end} and use of bi_phys_segments can be seen in that: Whenever we set bi_phys_segments to 1, we now call md_write_start. Whenever we increment it on non-read requests with raid5_inc_bi_active_stripes(), we now call md_write_start(). Whenever we decrement bi_phys_segments on non-read requsts with raid5_dec_bi_active_stripes(), we now call md_write_end(). This reduces our dependence on keeping a per-bio count of active stripes in bi_phys_segments. md_write_inc() is added which parallels md_write_start(), but requires that a write has already been started, and is certain never to sleep. This can be used inside a spinlocked region when adding to a write request. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:15:42 -07:00
Artur Paszkiewicz	ba903a3ea4	raid5-ppl: runtime PPL enabling or disabling Allow writing to 'consistency_policy' attribute when the array is active. Add a new function 'change_consistency_policy' to the md_personality operations structure to handle the change in the personality code. Values "ppl" and "resync" are accepted and turn PPL on and off respectively. When enabling PPL its location and size should first be set using 'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be written at this location on each member device. Enabling or disabling PPL is performed under a suspended array. The raid5_reset_stripe_cache function frees the stripe cache and allocates it again in order to allocate or free the ppl_pages for the stripes in the stripe cache. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:56 -07:00
Artur Paszkiewicz	6358c239d8	raid5-ppl: support disk hot add/remove with PPL Add a function to modify the log by removing an rdev when a drive fails or adding when a spare/replacement is activated as a raid member. Removing a disk just clears the child log rdev pointer. No new stripes will be accepted for this child log in ppl_write_stripe() and running io units will be processed without writing PPL to the device. Adding a disk sets the child log rdev pointer and writes an empty PPL header. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:56 -07:00
Artur Paszkiewicz	4536bf9ba2	raid5-ppl: load and recover the log Load the log from each disk when starting the array and recover if the array is dirty. The initial empty PPL is written by mdadm. When loading the log we verify the header checksum and signature. For external metadata arrays the signature is verified in userspace, so here we read it from the header, verifying only if it matches on all disks, and use it later when writing PPL. In addition to the header checksum, each header entry also contains a checksum of its partial parity data. If the header is valid, recovery is performed for each entry until an invalid entry is found. If the array is not degraded and recovery using PPL fully succeeds, there is no need to resync the array because data and parity will be consistent, so in this case resync will be disabled. Due to compatibility with IMSM implementations on other systems, we can't assume that the recovery data block size is always 4K. Writes generated by MD raid5 don't have this issue, but when recovering PPL written in other environments it is possible to have entries with 512-byte sector granularity. The recovery code takes this into account and also the logical sector size of the underlying drives. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:55 -07:00
Artur Paszkiewicz	3418d036c8	raid5-ppl: Partial Parity Log write logging implementation Implement the calculation of partial parity for a stripe and PPL write logging functionality. The description of PPL is added to the documentation. More details can be found in the comments in raid5-ppl.c. Attach a page for holding the partial parity data to stripe_head. Allocate it only if mddev has the MD_HAS_PPL flag set. Partial parity is the xor of not modified data chunks of a stripe and is calculated as follows: - reconstruct-write case: xor data from all not updated disks in a stripe - read-modify-write case: xor old data and parity from all updated disks in a stripe Implement it using the async_tx API and integrate into raid_run_ops(). It must be called when we still have access to old data, so do it when STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is stored into sh->ppl_page. Partial parity is not meaningful for full stripe write and is not stored in the log or used for recovery, so don't attempt to calculate it when stripe has STRIPE_FULL_WRITE. Put the PPL metadata structures to md_p.h because userspace tools (mdadm) will also need to read/write PPL. Warn about using PPL with enabled disk volatile write-back cache for now. It can be removed once disk cache flushing before writing PPL is implemented. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:54 -07:00
Artur Paszkiewicz	ff875738ed	raid5: separate header for log functions Move raid5-cache declarations from raid5.h to raid5-log.h, add inline wrappers for functions which will be shared with ppl and use them in raid5 core instead of direct calls to raid5-cache. Remove unused parameter from r5c_cache_data(), move two duplicated pr_debug() calls to r5l_init_log(). Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:54 -07:00
Shaohua Li	aaf9f12ebf	md/raid5: sort bios Previous patch (raid5: only dispatch IO from raid5d for harddisk raid) defers IO dispatching. The goal is to create better IO pattern. At that time, we don't sort the deffered IO and hope the block layer can do IO merge and sort. Now the raid5-cache writeback could create large amount of bios. And if we enable muti-thread for stripe handling, we can't control when to dispatch IO to raid disks. In a lot of time, we are dispatching IO which block layer can't do merge effectively. This patch moves further for the IO dispatching defer. We accumulate bios, but we don't dispatch all the bios after a threshold is met. This 'dispatch partial portion of bios' stragety allows bios coming in a large time window are sent to disks together. At the dispatching time, there is large chance the block layer can merge the bios. To make this more effective, we dispatch IO in ascending order. This increases request merge chance and reduces disk seek. Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:52 -07:00
Shaohua Li	535ae4eb12	md/raid5: prioritize stripes for writeback In raid5-cache writeback mode, we have two types of stripes to handle. - stripes which aren't cached yet - stripes which are cached and flushing out to raid disks Upperlayer is more sensistive to latency of the first type of stripes generally. But we only one handle list for all these stripes, where the two types of stripes are mixed together. When reclaim flushes a lot of stripes, the first type of stripes could be noticeably delayed. On the other hand, if the log space is tight, we'd like to handle the second type of stripes faster and free log space. This patch destinguishes the two types stripes. They are added into different handle list. When we try to get a stripe to handl, we prefer the first type of stripes unless log space is tight. This should have no impact for !writeback case. Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:51 -07:00
Song Liu	0977762f6d	md/r5cache: fix set_syndrome_sources() for data in cache Before this patch, device InJournal will be included in prexor (SYNDROME_SRC_WANT_DRAIN) but not in reconstruct (SYNDROME_SRC_WRITTEN). So it will break parity calculation. With srctype == SYNDROME_SRC_WRITTEN, we need include both dev with non-null ->written and dev with R5_InJournal. This fixes logic in 1e6d690(md/r5cache: caching phase of r5cache) Cc: stable@vger.kernel.org (v4.10+) Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-14 09:57:10 -07:00
Guoqing Jiang	c948363421	md: move funcs from pers->resize to update_size raid1_resize and raid5_resize should also check the mddev->queue if run underneath dm-raid. And both set_capacity and revalidate_disk are used in pers->resize such as raid1, raid10 and raid5. So move them from personality file to common code. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-09 09:02:18 -08:00
Ingo Molnar	3f07c01441	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:29 +01:00
Linus Torvalds	a682e00354	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull md updates from Shaohua Li: "Mainly fixes bugs and improves performance: - Improve scalability for raid1 from Coly - Improve raid5-cache read performance, disk efficiency and IO pattern from Song and me - Fix a race condition of disk hotplug for linear from Coly - A few cleanup patches from Ming and Byungchul - Fix a memory leak from Neil - Fix WRITE SAME IO failure from me - Add doc for raid5-cache from me" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits) md/raid1: fix write behind issues introduced by bio_clone_bioset_partial md/raid1: handle flush request correctly md/linear: shutup lockdep warnning md/raid1: fix a use-after-free bug RAID1: avoid unnecessary spin locks in I/O barrier code RAID1: a new I/O barrier implementation to remove resync window md/raid5: Don't reinvent the wheel but use existing llist API md: fast clone bio in bio_clone_mddev() md: remove unnecessary check on mddev md/raid1: use bio_clone_bioset_partial() in case of write behind md: fail if mddev->bio_set can't be created block: introduce bio_clone_bioset_partial() md: disable WRITE SAME if it fails in underlayer disks md/raid5-cache: exclude reclaiming stripes in reclaim check md/raid5-cache: stripe reclaim only counts valid stripes MD: add doc for raid5-cache Documentation: move MD related doc into a separate dir md: ensure md devices are freed before module is unloaded. md/r5cache: improve journal device efficiency md/r5cache: enable chunk_aligned_read with write back cache ...	2017-02-24 14:42:19 -08:00
Jens Axboe	818551e2b2	Merge branch 'for-4.11/next' into for-4.11/linus-merge Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-17 14:08:19 -07:00
Byungchul Park	eae8263fb1	md/raid5: Don't reinvent the wheel but use existing llist API Although llist provides proper APIs, they are not used. Make them used. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-16 14:49:05 -08:00
Ming Lei	d7a1030839	md: fast clone bio in bio_clone_mddev() Firstly bio_clone_mddev() is used in raid normal I/O and isn't in resync I/O path. Secondly all the direct access to bvec table in raid happens on resync I/O except for write behind of raid1, in which we still use bio_clone() for allocating new bvec table. So this patch replaces bio_clone() with bio_clone_fast() in bio_clone_mddev(). Also kill bio_clone_mddev() and call bio_clone_fast() directly, as suggested by Christoph Hellwig. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-15 11:24:54 -08:00
Shaohua Li	e33fbb9cc7	md/raid5-cache: exclude reclaiming stripes in reclaim check stripes which are being reclaimed are still accounted into cached stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the stripes and does unnecessary stripe reclaim. In practice, I saw one stripe is reclaimed one time. This will cause bad IO pattern. Fixing this by excluding the reclaing stripes in the check. Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-13 09:20:05 -08:00
Song Liu	39b99586b3	md/r5cache: improve journal device efficiency It is important to be able to flush all stripes in raid5-cache. Therefore, we need reserve some space on the journal device for these flushes. If flush operation includes pending writes to the stripe, we need to reserve (conf->raid_disk + 1) pages per stripe for the flush out. This reduces the efficiency of journal space. If we exclude these pending writes from flush operation, we only need (conf->max_degraded + 1) pages per stripe. With this patch, when log space is critical (R5C_LOG_CRITICAL=1), pending writes will be excluded from stripe flush out. Therefore, we can reduce reserved space for flush out and thus improve journal device efficiency. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-13 09:17:52 -08:00
Song Liu	03b047f45c	md/r5cache: enable chunk_aligned_read with write back cache Chunk aligned read significantly reduces CPU usage of raid456. However, it is not safe to fully bypass the write back cache. This patch enables chunk aligned read with write back cache. For chunk aligned read, we track stripes in write back cache at a bigger granularity, "big_stripe". Each chunk may contain more than one stripe (for example, a 256kB chunk contains 64 4kB-page, so this chunk contain 64 stripes). For chunk_aligned_read, these stripes are grouped into one big_stripe, so we only need one lookup for the whole chunk. For each big_stripe, struct big_stripe_info tracks how many stripes of this big_stripe are in the write back cache. We count how many stripes of this big_stripe are in the write back cache. These counters are tracked in a radix tree (big_stripe_tree). r5c_tree_index() is used to calculate keys for the radix tree. chunk_aligned_read() calls r5c_big_stripe_cached() to look up big_stripe of each chunk in the tree. If this big_stripe is in the tree, chunk_aligned_read() aborts. This look up is protected by rcu_read_lock(). It is necessary to remember whether a stripe is counted in big_stripe_tree. Instead of adding new flag, we reuses existing flags: STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these two flags are set, the stripe is counted in big_stripe_tree. This requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to r5c_try_caching_write(); and moving clear_bit of STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to r5c_finish_stripe_write_out(). Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-13 09:17:51 -08:00
Shaohua Li	765d704db1	raid5: only dispatch IO from raid5d for harddisk raid We made raid5 stripe handling multi-thread before. It works well for SSD. But for harddisk, the multi-threading creates more disk seek, so not always improve performance. For several hard disks based raid5, multi-threading is required as raid5d becames a bottleneck especially for sequential write. To overcome the disk seek issue, we only dispatch IO from raid5d if the array is harddisk based. Other threads can still handle stripes, but can't dispatch IO. Idealy, we should control IO dispatching order according to IO position interrnally. Right now we still depend on block layer, which isn't very efficient sometimes though. My setup has 9 harddisks, each disk can do around 180M/s sequential write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential write. The test machine uses an ATOM CPU. I measure sequential write with large iodepth bandwidth to raid array: without patch: ~600M/s without patch and group_thread_cnt=4: 750M/s with patch and group_thread_cnt=4: 950M/s with patch, group_thread_cnt=4, skip_copy=1: 1150M/s We are pretty close to the maximum bandwidth in the large iodepth iodepth case. The performance gap of small iodepth sequential write between software raid and theory value is still very big though, because we don't have an efficient pipeline. Cc: NeilBrown <neilb@suse.com> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-02-13 09:17:50 -08:00
Jan Kara	dc3b17cc8b	block: Use pointer to backing_dev_info from request_queue We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-02 08:20:48 -07:00
Song Liu	2e38a37f23	md/r5cache: disable write back for degraded array write-back cache in degraded mode introduces corner cases to the array. Although we try to cover all these corner cases, it is safer to just disable write-back cache when the array is in degraded mode. In this patch, we disable writeback cache for degraded mode: 1. On device failure, if the array enters degraded mode, raid5_error() will submit async job r5c_disable_writeback_async to disable writeback; 2. In r5c_journal_mode_store(), it is invalid to enable writeback in degraded mode; 3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled in write-through mode. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-01-24 11:26:06 -08:00
Song Liu	07e8336484	md/r5cache: shift complex rmw from read path to write path Write back cache requires a complex RMW mechanism, where old data is read into dev->orig_page for prexor, and then xor is done with dev->page. This logic is already implemented in the write path. However, current read path is not awared of this requirement. When the array is optimal, the RMW is not required, as the data are read from raid disks. However, when the target stripe is degraded, complex RMW is required to generate right data. To keep read path as clean as possible, we handle read path by flushing degraded, in-journal stripes before processing reads to missing dev. Specifically, when there is read requests to a degraded stripe with data in journal, handle_stripe_fill() calls r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying() will do the complex RMW and flush the stripe to RAID disks. After that, read requests are handled. There is one more corner case when there is non-overwrite bio for the missing (or out of sync) dev. handle_stripe_dirtying() will not be able to process the non-overwrite bios without constructing the data in handle_stripe_fill(). This is fixed by delaying non-overwrite bios in handle_stripe_dirtying(). So handle_stripe_fill() works on these bios after the stripe is flushed to raid disks. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-01-24 11:20:15 -08:00
Song Liu	ba02684daf	md/raid5: move comment of fetch_block to right location Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-01-24 11:20:14 -08:00
Song Liu	86aa1397dd	md/r5cache: read data into orig_page for prexor of cached data With write back cache, we use orig_page to do prexor. This patch makes sure we read data into orig_page for it. Flag R5_OrigPageUPTDODATE is added to show whether orig_page has the latest data from raid disk. We introduce a helper function uptodate_for_rmw() to simplify the a couple conditions in handle_stripe_dirtying(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-01-24 11:20:14 -08:00
Jes Sorensen	32cd7cbbac	md/raid5: Use correct IS_ERR() variation on pointer check This fixes a build error on certain architectures, such as ppc64. Fixes: 6995f0b247e("md: takeover should clear unrelated bits") Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-01-09 13:58:10 -08:00
Shaohua Li	394ed8e474	md: cleanup mddev flag clear for takeover Commit `6995f0b` (md: takeover should clear unrelated bits) clear unrelated bits, but it's quite fragile. To avoid error in the future, define a macro for unsupported mddev flags for each raid type and use it to clear unsupported mddev flags. This should be less error-prone. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-01-05 11:45:18 -08:00
Shaohua Li	20737738d3	Merge branch 'md-next' into md-linus	2016-12-13 12:40:15 -08:00
Shaohua Li	2953079c69	md: separate flags for superblock changes The mddev->flags are used for different purposes. There are a lot of places we check/change the flags without masking unrelated flags, we could check/change unrelated flags. These usage are most for superblock write, so spearate superblock related flags. This should make the code clearer and also fix real bugs. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-12-08 22:01:47 -08:00
Shaohua Li	6995f0b247	md: takeover should clear unrelated bits When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit will be accidentally set, but raid5 doesn't support it. The same is true for the MD_HAS_JOURNAL bit. Fix: `46533ff` (md: Use REQ_FAILFAST_* on metadata writes where appropriate) Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-12-08 22:00:11 -08:00
Konstantin Khlebnikov	e8d7c33232	md/raid5: limit request size according to implementation limits Current implementation employ 16bit counter of active stripes in lower bits of bio->bi_phys_segments. If request is big enough to overflow this counter bio will be completed and freed too early. Fortunately this not happens in default configuration because several other limits prevent that: stripe_cache_size * nr_disks effectively limits count of active stripes. And small max_sectors_kb at lower disks prevent that during normal read/write operations. Overflow easily happens in discard if it's enabled by module parameter "devices_handle_discard_safely" and stripe_cache_size is set big enough. This patch limits requests size with 256Mb - 8Kb to prevent overflows. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Shaohua Li <shli@kernel.org> Cc: Neil Brown <neilb@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-29 15:53:21 -08:00
Song Liu	d7bd398e97	md/r5cache: handle alloc_page failure RMW of r5c write back cache uses an extra page to store old data for prexor. handle_stripe_dirtying() allocates this page by calling alloc_page(). However, alloc_page() may fail. To handle alloc_page() failures, this patch adds an extra page to disk_info. When alloc_page fails, handle_stripe() trys to use these pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE), the stripe is added to delayed_list. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-27 21:35:38 -08:00
Ming Lei	3a83f46775	block: bio: pass bvec table to bio_init() Some drivers often use external bvec table, so introduce this helper for this case. It is always safe to access the bio->bi_io_vec in this way for this case. After converting to this usage, it will becomes a bit easier to evaluate the remaining direct access to bio->bi_io_vec, so it can help to prepare for the following multipage bvec support. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixed up the new O_DIRECT cases. Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-22 08:57:21 -07:00
Song Liu	3bddb7f8f2	md/r5cache: handle FLUSH and FUA With raid5 cache, we committing data from journal device. When there is flush request, we need to flush journal device's cache. This was not needed in raid5 journal, because we will flush the journal before committing data to raid disks. This is similar to FUA, except that we also need flush journal for FUA. Otherwise, corruptions in earlier meta data will stop recovery from reaching FUA data. slightly changed the code by Shaohua Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-18 17:13:49 -08:00
Song Liu	5aabf7c49d	md/r5cache: r5cache recovery: part 2 1. In previous patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In this patchpatch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-18 13:28:28 -08:00
Song Liu	2c7da14b90	md/r5cache: sysfs entry journal_mode With write cache, journal_mode is the knob to switch between write-back and write-through. Below is an example: root@virt-test:~/# cat /sys/block/md0/md/journal_mode [write-through] write-back root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode root@virt-test:~/# cat /sys/block/md0/md/journal_mode write-through [write-back] Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-18 13:27:24 -08:00
Song Liu	a39f7afde3	md/r5cache: write-out phase and reclaim support There are two limited resources, stripe cache and journal disk space. For better performance, we priotize reclaim of full stripe writes. To free up more journal space, we free earliest data on the journal. In current implementation, reclaim happens when: 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim if there is no reclaim in the past 5 seconds. 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes, or cached stripes is enough for a full stripe (chunk size / 4k) (r5c_check_cached_full_stripe) 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage) 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data) r5c_do_reclaim() contains new logic of reclaim. For stripe cache: When stripe cache pressure is high (more than 3/4 stripes are cached, or there is empty inactive lists), flush all full stripe. If fewer than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes are flushed, flush some paritial stripes. When stripe cache pressure is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes. For log space: To avoid deadlock due to log space, we need to reserve enough space to flush cached data. The size of required log space depends on total number of cached stripes (stripe_in_journal_count). In current implementation, the writing-out phase automatically include pending data writes with parity writes (similar to write through case). Therefore, we need up to (conf->raid_disks + 1) pages for each cached stripe (1 page for meta data, raid_disks pages for all data and parity). r5c_log_required_to_flush_cache() calculates log space required to flush cache. In the following, we refer to the space calculated by r5c_log_required_to_flush_cache() as reclaim_required_space. Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL is set when free space on the log device is less than 2x of reclaim_required_space. r5c_cache keeps all data in cache (not fully committed to RAID) in a list (stripe_in_journal_list). These stripes are in the order of their first appearance on the journal. So the log tail (last_checkpoint) should point to the journal_start of the first item in the list. When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is set, the state machine only writes data that are already in the log device (in stripe_in_journal_list). This patch includes a fix to improve performance by Shaohua Li <shli@fb.com>. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-18 13:26:48 -08:00
Song Liu	1e6d690b93	md/r5cache: caching phase of r5cache As described in previous patch, write back cache operates in two phases: caching and writing-out. The caching phase works as: 1. write data to journal (r5c_handle_stripe_dirtying, r5c_cache_data) 2. call bio_endio (r5c_handle_data_cached, r5c_return_dev_pending_writes). Then the writing-out phase is as: 1. Mark the stripe as write-out (r5c_make_stripe_write_out) 2. Calcualte parity (reconstruct or RMW) 3. Write parity (and maybe some other data) to journal device 4. Write data and parity to RAID disks This patch implements caching phase. The cache is integrated with stripe cache of raid456. It leverages code of r5l_log to write data to journal device. Writing-out phase of the cache is implemented in the next patch. With r5cache, write operation does not wait for parity calculation and write out, so the write latency is lower (1 write to journal device vs. read and then write to raid disks). Also, r5cache will reduce RAID overhead (multipile IO due to read-modify-write of parity) and provide more opportunities of full stripe writes. This patch adds 2 flags to stripe_head.state: - STRIPE_R5C_PARTIAL_STRIPE, - STRIPE_R5C_FULL_STRIPE, Instead of inactive_list, stripes with cached data are tracked in r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list. STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for stripes in these lists. Note: stripes in r5c_full/partial_stripe_list are not considered as "active". For RMW, the code allocates an extra page for each data block being updated. This is stored in r5dev->orig_page and the old data is read into it. Then the prexor calculation subtracts ->orig_page from the parity block, and the reconstruct calculation adds the ->page data back into the parity block. r5cache naturally excludes SkipCopy. When the array has write back cache, async_copy_data() will not skip copy. There are some known limitations of the cache implementation: 1. Write cache only covers full page writes (R5_OVERWRITE). Writes of smaller granularity are write through. 2. Only one log io (sh->log_io) for each stripe at anytime. Later writes for the same stripe have to wait. This can be improved by moving log_io to r5dev. 3. With writeback cache, read path must enter state machine, which is a significant bottleneck for some workloads. 4. There is no per stripe checkpoint (with r5l_payload_flush) in the log, so recovery code has to replay more than necessary data (sometimes all the log from last_checkpoint). This reduces availability of the array. This patch includes a fix proposed by ZhengYuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-18 13:26:30 -08:00
Song Liu	2ded370373	md/r5cache: State machine for raid5-cache write back mode This patch adds state machine for raid5-cache. With log device, the raid456 array could operate in two different modes (r5c_journal_mode): - write-back (R5C_MODE_WRITE_BACK) - write-through (R5C_MODE_WRITE_THROUGH) Existing code of raid5-cache only has write-through mode. For write-back cache, it is necessary to extend the state machine. With write-back cache, every stripe could operate in two different phases: - caching - writing-out In caching phase, the stripe handles writes as: - write to journal - return IO In writing-out phase, the stripe behaviors as a stripe in write through mode R5C_MODE_WRITE_THROUGH. STRIPE_R5C_CACHING is added to sh->state to differentiate caching and writing-out phase. Please note: this is a "no-op" patch for raid5-cache write-through mode. The following detailed explanation is copied from the raid5-cache.c: /* * raid5 cache state machine * * With rhe RAID cache, each stripe works in two phases: * - caching phase * - writing-out phase * * These two phases are controlled by bit STRIPE_R5C_CACHING: * if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase * if STRIPE_R5C_CACHING == 1, the stripe is in caching phase * * When there is no journal, or the journal is in write-through mode, * the stripe is always in writing-out phase. * * For write-back journal, the stripe is sent to caching phase on write * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off * the write-out phase by clearing STRIPE_R5C_CACHING. * * Stripes in caching phase do not write the raid disks. Instead, all * writes are committed from the log device. Therefore, a stripe in * caching phase handles writes as: * - write to log device * - return IO * * Stripes in writing-out phase handle writes as: * - calculate parity * - write pending data and parity to journal * - write data and parity to raid disks * - return IO for pending writes */ Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-18 13:26:07 -08:00
Song Liu	937621c36e	md/r5cache: move some code to raid5.h Move some define and inline functions to raid5.h, so they can be used in raid5-cache.c Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-18 13:25:40 -08:00
NeilBrown	cc6167b4f3	md/raid5: change printk() to pr_*() Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:22 -08:00
Tomasz Majchrzak	7adb072ca8	raid5: revert commit `11367799f3` Revert commit `11367799f3` ("md: Prevent IO hold during accessing to faulty raid5 array") as it doesn't comply with commit `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns."). That change is not required anymore as the problem is resolved by commit `16f889499a` ("md: report 'write_pending' state when array in sync") - read request is stuck as array state is not reported correctly via sysfs attribute. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
Christoph Hellwig	70fd76140a	block,fs: use REQ_* flags directly Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
Linus Torvalds	c23112e039	Merge tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD updates from Shaohua Li: "This update includes: - new AVX512 instruction based raid6 gen/recovery algorithm - a couple of md-cluster related bug fixes - fix a potential deadlock - set nonrotational bit for raid array with SSD - set correct max_hw_sectors for raid5/6, which hopefuly can improve performance a little bit - other minor fixes" * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: set rotational bit raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays raid5: handle register_shrinker failure raid5: fix to detect failure of register_shrinker md: fix a potential deadlock md/bitmap: fix wrong cleanup raid5: allow arbitrary max_hw_sectors lib/raid6: Add AVX512 optimized xor_syndrome functions lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions lib/raid6: Add AVX512 optimized recovery functions lib/raid6: Add AVX512 optimized gen_syndrome functions md-cluster: make resync lock also could be interruptted md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang md-cluster: convert the completion to wait queue md-cluster: protect md_find_rdev_nr_rcu with rcu lock md-cluster: clean related infos of cluster md: changes for MD_STILL_CLOSED flag md-cluster: remove some unnecessary dlm_unlock_sync md-cluster: use FORCEUNLOCK in lockres_free md-cluster: call md_kick_rdev_from_array once ack failed	2016-10-07 09:45:43 -07:00
Linus Torvalds	597f03f9d1	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: "Yet another batch of cpu hotplug core updates and conversions: - Provide core infrastructure for multi instance drivers so the drivers do not have to keep custom lists. - Convert custom lists to the new infrastructure. The block-mq custom list conversion comes through the block tree and makes the diffstat tip over to more lines removed than added. - Handle unbalanced hotplug enable/disable calls more gracefully. - Remove the obsolete CPU_STARTING/DYING notifier support. - Convert another batch of notifier users. The relayfs changes which conflicted with the conversion have been shipped to me by Andrew. The remaining lot is targeted for 4.10 so that we finally can remove the rest of the notifiers" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) cpufreq: Fix up conversion to hotplug state machine blk/mq: Reserve hotplug states for block multiqueue x86/apic/uv: Convert to hotplug state machine s390/mm/pfault: Convert to hotplug state machine mips/loongson/smp: Convert to hotplug state machine mips/octeon/smp: Convert to hotplug state machine fault-injection/cpu: Convert to hotplug state machine padata: Convert to hotplug state machine cpufreq: Convert to hotplug state machine ACPI/processor: Convert to hotplug state machine virtio scsi: Convert to hotplug state machine oprofile/timer: Convert to hotplug state machine block/softirq: Convert to hotplug state machine lib/irq_poll: Convert to hotplug state machine x86/microcode: Convert to hotplug state machine sh/SH-X3 SMP: Convert to hotplug state machine ia64/mca: Convert to hotplug state machine ARM/OMAP/wakeupgen: Convert to hotplug state machine ARM/shmobile: Convert to hotplug state machine arm64/FP/SIMD: Convert to hotplug state machine ...	2016-10-03 19:43:08 -07:00
Shaohua Li	30c8946566	raid5: handle register_shrinker failure register_shrinker() now can fail. When it happens, shrinker.nr_deferred is null. We use it to determine if unregister_shrinker is required. Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Chao Yu	6a0f53ff35	raid5: fix to detect failure of register_shrinker register_shrinker can fail after commit `1d3d4437ea` ("vmscan: per-node deferred work"), we should detect the failure of it, otherwise we may fail to register shrinker after raid5 configuration was setup successfully. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00

1 2 3 4 5 ...

709 Commits