linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-04 13:56:51 +07:00

colyli@suse.de 824e47dadd RAID1: avoid unnecessary spin locks in I/O barrier code

When I ran a parallel read performance test on an md raid1 device with two NVMe SSDs, I observed surprisingly bad throughput: with fio using a 64KB block size, 40 sequential read I/O jobs and an iodepth of 128, overall throughput was only 2.7GB/s, around 50% of the ideal performance number.

perf reports lock contention in the allow_barrier() and wait_barrier() code:

    -   41.41%  fio  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
       - _raw_spin_lock_irqsave
          + 89.92% allow_barrier
          + 9.34% __wake_up
    -   37.30%  fio  [kernel.kallsyms]  [k] _raw_spin_lock_irq
       - _raw_spin_lock_irq
          - 100.00% wait_barrier

The reason is that these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
always take conf->resync_lock first, even when there is only regular read I/O and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code that only takes conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provided comments to improve it. I continued to work on it and brought the patch into its current form.

In the new, simpler raid1 I/O barrier implementation, there are two wait barrier functions:
 - wait_barrier()
   Calls _wait_barrier() and is used for regular write I/O. If resync I/O is happening on the same I/O barrier bucket, or the whole array is frozen, the task waits until there is no barrier on that barrier bucket, or until the whole array is unfrozen.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance() makes sure only up-to-date data is read), regular read I/O does not need to wait for the barrier; waiting is only necessary when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and conf->barrier[idx] are very carefully designed in raise_barrier(), lower_barrier(), _wait_barrier() and wait_read_barrier() in order to avoid unnecessary spin locks in these functions. Once conf->nr_pending[idx] is increased, a resync I/O with the same barrier bucket index has to wait in raise_barrier(). Then, in _wait_barrier(), if no barrier is raised on the same barrier bucket index and the array is not frozen, the regular I/O doesn't need to take conf->resync_lock; it can just increase conf->nr_pending[idx] and return to its caller. wait_read_barrier() is very similar to _wait_barrier(); the only difference is that it only waits when the array is frozen. For heavily parallel read I/O, the lockless I/O barrier code gets rid of almost all of the spin lock cost.

This patch significantly improves raid1 read performance. In my testing, on a raid1 device built from two NVMe SSDs, running fio with a 64KB block size, 40 sequential read I/O jobs and an iodepth of 128, overall throughput increases from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V4:
- Change conf->nr_queued[] to atomic_t.
- Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf->nr_queued[] from atomic_t to int.
- Change conf->array_frozen from atomic_t back to int, and use READ_ONCE(conf->array_frozen) to check the value of conf->array_frozen in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]), to fix a deadlock between _wait_barrier()/wait_read_barrier() and freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no race when checking two atomic_t variables at the same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-19 22:04:25 -08:00
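
The fast path described in the commit message above can be illustrated with a small user-space sketch. This is not the kernel code: it uses C11 atomics in place of atomic_t and smp_mb__after_atomic(), and it omits the slow path entirely (taking conf->resync_lock, bumping nr_waiting[idx] and sleeping). The names struct barrier_sketch, enter_fast(), leave() and barrier_demo() are hypothetical, and the 4 KiB page size and 4-byte counter size used to derive the bucket count are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed for illustration: 4 KiB pages and a 4-byte atomic counter. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_ATOMIC_SHIFT 2   /* ilog2(sizeof(atomic_t)) when the size is 4 */
#define BARRIER_BUCKETS_NR_BITS (SKETCH_PAGE_SHIFT - SKETCH_ATOMIC_SHIFT)
#define BARRIER_BUCKETS_NR      (1 << BARRIER_BUCKETS_NR_BITS)

/* Hypothetical stand-in for the per-bucket counters kept in the raid1 conf. */
struct barrier_sketch {
	atomic_int nr_pending[BARRIER_BUCKETS_NR];
	atomic_int barrier[BARRIER_BUCKETS_NR];
	atomic_int array_frozen;
};

/*
 * Try to start a regular I/O on bucket idx without taking any lock.
 * The counter is published first and the barrier state is checked
 * afterwards; the sequentially consistent increment orders the two steps.
 * Returns true when the lockless fast path succeeded. On false the
 * increment is undone and a real implementation would fall back to the
 * locked slow path, which this sketch leaves out.
 */
static bool enter_fast(struct barrier_sketch *c, int idx, bool is_read)
{
	atomic_fetch_add(&c->nr_pending[idx], 1);
	if (!atomic_load(&c->array_frozen) &&
	    (is_read || !atomic_load(&c->barrier[idx])))
		return true;
	atomic_fetch_sub(&c->nr_pending[idx], 1);
	return false;
}

/* Finish an I/O that entered through enter_fast(). */
static void leave(struct barrier_sketch *c, int idx)
{
	atomic_fetch_sub(&c->nr_pending[idx], 1);
}

/* Walk through the cases from the commit message; returns fast-path hits. */
static int barrier_demo(void)
{
	static struct barrier_sketch c; /* static storage starts zeroed */
	int fast = 0;

	fast += enter_fast(&c, 7, false);  /* no barrier: write proceeds */
	atomic_store(&c.barrier[7], 1);    /* resync raises a barrier */
	fast += enter_fast(&c, 7, false);  /* write would have to wait */
	fast += enter_fast(&c, 7, true);   /* read ignores the barrier */
	atomic_store(&c.array_frozen, 1);  /* whole array frozen */
	fast += enter_fast(&c, 7, true);   /* even reads would have to wait */

	/* Two I/Os are in flight at this point; complete them. */
	assert(atomic_load(&c.nr_pending[7]) == 2);
	leave(&c, 7);
	leave(&c, 7);
	assert(atomic_load(&c.nr_pending[7]) == 0);
	return fast;
}
```

Under the assumed sizes the counter array has 1024 buckets and fills exactly one page. Calling barrier_demo() takes the fast path twice: once for the write before the barrier is raised, and once for the read while the barrier is up but the array is not frozen.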
bcache	bcache: partition support: add 16 minors per bcacheN device	2016-12-17 13:02:00 -07:00
persistent-data	dm space map: always set ev if sm_ll_mutate() succeeds	2016-12-08 14:13:15 -05:00
bitmap.c	md: separate flags for superblock changes	2016-12-08 22:01:47 -08:00
bitmap.h	md-cluster: sync bitmap when node received RESYNCING msg	2016-05-04 12:39:35 -07:00
dm-bio-prison.c	block: add a bi_error field to struct bio	2015-07-29 08:55:15 -06:00
dm-bio-prison.h	dm bio prison: add dm_cell_promote_or_release()	2015-05-29 14:19:06 -04:00
dm-bio-record.h
dm-bufio.c	. various fixes and improvements to request-based DM and DM multipath	2016-12-14 11:01:00 -08:00
dm-bufio.h
dm-builtin.c	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-cache-block-types.h	linux: drop __bitwise__ everywhere	2016-12-16 00:13:41 +02:00
dm-cache-metadata.c	dm cache metadata: remove an extra newline in DMERR and code	2016-11-21 09:52:02 -05:00
dm-cache-metadata.h	dm cache: make sure every metadata function checks fail_io	2016-03-10 17:12:12 -05:00
dm-cache-policy-cleaner.c	dm cache: speed up writing of the hint array	2016-09-22 11:15:02 -04:00
dm-cache-policy-internal.h	dm cache: speed up writing of the hint array	2016-09-22 11:15:02 -04:00
dm-cache-policy-smq.c	dm cache policy smq: use hash_32() instead of hash_32_generic()	2016-12-08 19:42:37 -05:00
dm-cache-policy.c
dm-cache-policy.h	dm cache: speed up writing of the hint array	2016-09-22 11:15:02 -04:00
dm-cache-target.c	dm cache: add missing cache device name to DMERR in set_cache_mode()	2016-11-21 09:52:03 -05:00
dm-core.h	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-crypt.c	dm crypt: replace RCU read-side section with rwsem	2017-02-03 10:26:14 -05:00
dm-delay.c	dm: rename target's per_bio_data_size to per_io_data_size	2016-02-22 22:34:37 -05:00
dm-era-target.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-exception-store.c	- Revert a dm-multipath change that caused a regression for unprivledged	2015-11-04 21:19:53 -08:00
dm-exception-store.h	dm snapshot: fix hung bios when copy error occurs	2016-01-08 20:03:05 -05:00
dm-flakey.c	dm flakey: introduce "error_writes" feature	2016-12-13 15:01:31 -05:00
dm-io.c	dm io: use bvec iterator helpers to implement .get_page and .next_page	2016-11-21 09:51:57 -05:00
dm-ioctl.c	Replace <asm/uaccess.h> with <linux/uaccess.h> globally	2016-12-24 11:46:01 -08:00
dm-kcopyd.c	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-linear.c	libnvdimm for 4.8	2016-07-28 17:38:16 -07:00
dm-log-userspace-base.c	dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()	2015-10-31 19:06:00 -04:00
dm-log-userspace-transfer.c	dm log userspace transfer: match wait_for_completion_timeout return type	2015-04-15 12:10:20 -04:00
dm-log-userspace-transfer.h
dm-log-writes.c	Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block	2016-10-07 14:42:05 -07:00
dm-log.c	block,fs: use REQ_* flags directly	2016-11-01 09:43:26 -06:00
dm-mpath.c	dm mpath: cleanup -Wbool-operation warning in choose_pgpath()	2017-02-03 10:18:37 -05:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h	dm path selector: remove 'repeat_count' return from .select_path hook	2016-02-22 22:34:42 -05:00
dm-queue-length.c	dm path selector: remove 'repeat_count' return from .select_path hook	2016-02-22 22:34:42 -05:00
dm-raid1.c	Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block	2016-12-13 10:19:16 -08:00
dm-raid.c	. various fixes and improvements to request-based DM and DM multipath	2016-12-14 11:01:00 -08:00
dm-region-hash.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-round-robin.c	dm round robin: do not use this_cpu_ptr() without having preemption disabled	2016-08-15 09:23:14 -04:00
dm-rq.c	dm rq: cope with DM device destruction while in dm_old_request_fn()	2017-02-03 10:18:43 -05:00
dm-rq.h	dm rq: introduce dm_mq_kick_requeue_list()	2016-09-15 11:16:05 -04:00
dm-service-time.c	dm path selector: remove 'repeat_count' return from .select_path hook	2016-02-22 22:34:42 -05:00
dm-snap-persistent.c	block,fs: use REQ_* flags directly	2016-11-01 09:43:26 -06:00
dm-snap-transient.c	dm snapshot: fix hung bios when copy error occurs	2016-01-08 20:03:05 -05:00
dm-snap.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-stats.c	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-stats.h	dm stats: support precise timestamps	2015-06-17 12:40:40 -04:00
dm-stripe.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-switch.c	dm switch: simplify conditional in alloc_region_table()	2015-10-31 19:06:06 -04:00
dm-sysfs.c	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
dm-table.c	dm table: simplify dm_table_determine_type()	2016-12-08 14:13:03 -05:00
dm-target.c	libnvdimm for 4.8	2016-07-28 17:38:16 -07:00
dm-thin-metadata.c	dm thin: fix a race condition between discarding and provisioning a block	2016-07-20 12:43:35 -04:00
dm-thin-metadata.h	dm thin: fix a race condition between discarding and provisioning a block	2016-07-20 12:43:35 -04:00
dm-thin.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm-uevent.c
dm-uevent.h
dm-verity-fec.c	dm verity fec: fix block calculation	2016-07-01 23:29:08 -04:00
dm-verity-fec.h	dm verity: add support for forward error correction	2015-12-10 10:39:03 -05:00
dm-verity-target.c	dm verity: fix incorrect error message	2016-11-21 09:52:01 -05:00
dm-verity.h	dm verity: add ignore_zero_blocks feature	2015-12-10 10:39:03 -05:00
dm-zero.c	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
dm.c	. various fixes and improvements to request-based DM and DM multipath	2016-12-14 11:01:00 -08:00
dm.h	dm: add infrastructure for DAX support	2016-07-20 23:49:49 -04:00
faulty.c	md: fast clone bio in bio_clone_mddev()	2017-02-15 11:24:54 -08:00
Kconfig	dm block manager: make block locking optional	2016-11-14 15:17:47 -05:00
linear.c	md: disable WRITE SAME if it fails in underlayer disks	2017-02-13 19:24:16 -08:00
linear.h	md linear: fix a race between linear_add() and linear_congested()	2017-02-13 09:17:50 -08:00
Makefile	dm: move request-based code out to dm-rq.[hc]	2016-06-10 15:15:44 -04:00
md-cluster.c	md-cluster: make resync lock also could be interruptted	2016-09-21 09:09:44 -07:00
md-cluster.h	md-cluster: gather resync infos and enable recv_thread after bitmap is ready	2016-05-09 09:24:03 -07:00
md.c	md: fast clone bio in bio_clone_mddev()	2017-02-15 11:24:54 -08:00
md.h	md: fast clone bio in bio_clone_mddev()	2017-02-15 11:24:54 -08:00
multipath.c	md: disable WRITE SAME if it fails in underlayer disks	2017-02-13 19:24:16 -08:00
multipath.h
raid0.c	md: disable WRITE SAME if it fails in underlayer disks	2017-02-13 19:24:16 -08:00
raid0.h	block: kill merge_bvec_fn() completely	2015-08-13 12:31:57 -06:00
raid1.c	RAID1: avoid unnecessary spin locks in I/O barrier code	2017-02-19 22:04:25 -08:00
raid1.h	RAID1: avoid unnecessary spin locks in I/O barrier code	2017-02-19 22:04:25 -08:00
raid5-cache.c	md/raid5-cache: exclude reclaiming stripes in reclaim check	2017-02-13 09:20:05 -08:00
raid5.c	md/raid5: Don't reinvent the wheel but use existing llist API	2017-02-16 14:49:05 -08:00
raid5.h	md/raid5-cache: exclude reclaiming stripes in reclaim check	2017-02-13 09:20:05 -08:00
raid10.c	md: fast clone bio in bio_clone_mddev()	2017-02-15 11:24:54 -08:00
raid10.h	md/raid10: add failfast handling for reads.	2016-11-22 09:14:28 -08:00