linux_dsm_epyc7002/drivers/md
colyli@suse.de 824e47dadd RAID1: avoid unnecessary spin locks in I/O barrier code
When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
only 2.7GB/s, this is around 50% of the idea performance number.

The perf reports locking contention happens at allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is, in these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
They always hold conf->resync_lock firstly, even there are only regular
reading I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in I/O barrier code, and only
holding conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provides
comments to improve it. I continue to work on it, and make the patch into
current form.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   Which calls _wait_barrier(), is used for regular write I/O. If there is
   resync I/O happening on the same I/O barrier bucket, or the whole
   array is frozen, task will wait until no barrier on same barrier bucket,
   or the whold array is unfreezed.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   will make sure only uptodate data will be read out), it is unnecessary
   to wait for barrier in regular read I/Os, waiting in only necessary
   when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf->
nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
raised in same barrier bucket index and array is not frozen, the regular
I/O doesn't need to hold conf->resync_lock, it can just increase
conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
very similar to _wait_barrier(), the only difference is it only waits when
array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
code almostly gets rid of all spin lock cost.

This patch significantly improves raid1 reading peroformance. From my
testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V4:
- Change conf->nr_queued[] to atomic_t.
- Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf->nr_queued[] from atomic_t to int.
- Change conf->array_frozen from atomic_t back to int, and use
  READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
  in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
  wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
  to fix a deadlock between  _wait_barrier()/wait_read_barrier and
  freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no racy when checking two
  atomic_t variables at same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-19 22:04:25 -08:00
..
bcache bcache: partition support: add 16 minors per bcacheN device 2016-12-17 13:02:00 -07:00
persistent-data dm space map: always set ev if sm_ll_mutate() succeeds 2016-12-08 14:13:15 -05:00
bitmap.c md: separate flags for superblock changes 2016-12-08 22:01:47 -08:00
bitmap.h md-cluster: sync bitmap when node received RESYNCING msg 2016-05-04 12:39:35 -07:00
dm-bio-prison.c block: add a bi_error field to struct bio 2015-07-29 08:55:15 -06:00
dm-bio-prison.h dm bio prison: add dm_cell_promote_or_release() 2015-05-29 14:19:06 -04:00
dm-bio-record.h
dm-bufio.c . various fixes and improvements to request-based DM and DM multipath 2016-12-14 11:01:00 -08:00
dm-bufio.h
dm-builtin.c dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-cache-block-types.h linux: drop __bitwise__ everywhere 2016-12-16 00:13:41 +02:00
dm-cache-metadata.c dm cache metadata: remove an extra newline in DMERR and code 2016-11-21 09:52:02 -05:00
dm-cache-metadata.h dm cache: make sure every metadata function checks fail_io 2016-03-10 17:12:12 -05:00
dm-cache-policy-cleaner.c dm cache: speed up writing of the hint array 2016-09-22 11:15:02 -04:00
dm-cache-policy-internal.h dm cache: speed up writing of the hint array 2016-09-22 11:15:02 -04:00
dm-cache-policy-smq.c dm cache policy smq: use hash_32() instead of hash_32_generic() 2016-12-08 19:42:37 -05:00
dm-cache-policy.c
dm-cache-policy.h dm cache: speed up writing of the hint array 2016-09-22 11:15:02 -04:00
dm-cache-target.c dm cache: add missing cache device name to DMERR in set_cache_mode() 2016-11-21 09:52:03 -05:00
dm-core.h dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-crypt.c dm crypt: replace RCU read-side section with rwsem 2017-02-03 10:26:14 -05:00
dm-delay.c dm: rename target's per_bio_data_size to per_io_data_size 2016-02-22 22:34:37 -05:00
dm-era-target.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-exception-store.c - Revert a dm-multipath change that caused a regression for unprivledged 2015-11-04 21:19:53 -08:00
dm-exception-store.h dm snapshot: fix hung bios when copy error occurs 2016-01-08 20:03:05 -05:00
dm-flakey.c dm flakey: introduce "error_writes" feature 2016-12-13 15:01:31 -05:00
dm-io.c dm io: use bvec iterator helpers to implement .get_page and .next_page 2016-11-21 09:51:57 -05:00
dm-ioctl.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
dm-kcopyd.c dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-linear.c libnvdimm for 4.8 2016-07-28 17:38:16 -07:00
dm-log-userspace-base.c dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() 2015-10-31 19:06:00 -04:00
dm-log-userspace-transfer.c dm log userspace transfer: match wait_for_completion_timeout return type 2015-04-15 12:10:20 -04:00
dm-log-userspace-transfer.h
dm-log-writes.c Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block 2016-10-07 14:42:05 -07:00
dm-log.c block,fs: use REQ_* flags directly 2016-11-01 09:43:26 -06:00
dm-mpath.c dm mpath: cleanup -Wbool-operation warning in choose_pgpath() 2017-02-03 10:18:37 -05:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h dm path selector: remove 'repeat_count' return from .select_path hook 2016-02-22 22:34:42 -05:00
dm-queue-length.c dm path selector: remove 'repeat_count' return from .select_path hook 2016-02-22 22:34:42 -05:00
dm-raid1.c Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block 2016-12-13 10:19:16 -08:00
dm-raid.c . various fixes and improvements to request-based DM and DM multipath 2016-12-14 11:01:00 -08:00
dm-region-hash.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-round-robin.c dm round robin: do not use this_cpu_ptr() without having preemption disabled 2016-08-15 09:23:14 -04:00
dm-rq.c dm rq: cope with DM device destruction while in dm_old_request_fn() 2017-02-03 10:18:43 -05:00
dm-rq.h dm rq: introduce dm_mq_kick_requeue_list() 2016-09-15 11:16:05 -04:00
dm-service-time.c dm path selector: remove 'repeat_count' return from .select_path hook 2016-02-22 22:34:42 -05:00
dm-snap-persistent.c block,fs: use REQ_* flags directly 2016-11-01 09:43:26 -06:00
dm-snap-transient.c dm snapshot: fix hung bios when copy error occurs 2016-01-08 20:03:05 -05:00
dm-snap.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-stats.c dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-stats.h dm stats: support precise timestamps 2015-06-17 12:40:40 -04:00
dm-stripe.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-switch.c dm switch: simplify conditional in alloc_region_table() 2015-10-31 19:06:06 -04:00
dm-sysfs.c dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
dm-table.c dm table: simplify dm_table_determine_type() 2016-12-08 14:13:03 -05:00
dm-target.c libnvdimm for 4.8 2016-07-28 17:38:16 -07:00
dm-thin-metadata.c dm thin: fix a race condition between discarding and provisioning a block 2016-07-20 12:43:35 -04:00
dm-thin-metadata.h dm thin: fix a race condition between discarding and provisioning a block 2016-07-20 12:43:35 -04:00
dm-thin.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm-uevent.c
dm-uevent.h
dm-verity-fec.c dm verity fec: fix block calculation 2016-07-01 23:29:08 -04:00
dm-verity-fec.h dm verity: add support for forward error correction 2015-12-10 10:39:03 -05:00
dm-verity-target.c dm verity: fix incorrect error message 2016-11-21 09:52:01 -05:00
dm-verity.h dm verity: add ignore_zero_blocks feature 2015-12-10 10:39:03 -05:00
dm-zero.c block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
dm.c . various fixes and improvements to request-based DM and DM multipath 2016-12-14 11:01:00 -08:00
dm.h dm: add infrastructure for DAX support 2016-07-20 23:49:49 -04:00
faulty.c md: fast clone bio in bio_clone_mddev() 2017-02-15 11:24:54 -08:00
Kconfig dm block manager: make block locking optional 2016-11-14 15:17:47 -05:00
linear.c md: disable WRITE SAME if it fails in underlayer disks 2017-02-13 19:24:16 -08:00
linear.h md linear: fix a race between linear_add() and linear_congested() 2017-02-13 09:17:50 -08:00
Makefile dm: move request-based code out to dm-rq.[hc] 2016-06-10 15:15:44 -04:00
md-cluster.c md-cluster: make resync lock also could be interruptted 2016-09-21 09:09:44 -07:00
md-cluster.h md-cluster: gather resync infos and enable recv_thread after bitmap is ready 2016-05-09 09:24:03 -07:00
md.c md: fast clone bio in bio_clone_mddev() 2017-02-15 11:24:54 -08:00
md.h md: fast clone bio in bio_clone_mddev() 2017-02-15 11:24:54 -08:00
multipath.c md: disable WRITE SAME if it fails in underlayer disks 2017-02-13 19:24:16 -08:00
multipath.h
raid0.c md: disable WRITE SAME if it fails in underlayer disks 2017-02-13 19:24:16 -08:00
raid0.h block: kill merge_bvec_fn() completely 2015-08-13 12:31:57 -06:00
raid1.c RAID1: avoid unnecessary spin locks in I/O barrier code 2017-02-19 22:04:25 -08:00
raid1.h RAID1: avoid unnecessary spin locks in I/O barrier code 2017-02-19 22:04:25 -08:00
raid5-cache.c md/raid5-cache: exclude reclaiming stripes in reclaim check 2017-02-13 09:20:05 -08:00
raid5.c md/raid5: Don't reinvent the wheel but use existing llist API 2017-02-16 14:49:05 -08:00
raid5.h md/raid5-cache: exclude reclaiming stripes in reclaim check 2017-02-13 09:20:05 -08:00
raid10.c md: fast clone bio in bio_clone_mddev() 2017-02-15 11:24:54 -08:00
raid10.h md/raid10: add failfast handling for reads. 2016-11-22 09:14:28 -08:00