Go to file
colyli@suse.de 824e47dadd RAID1: avoid unnecessary spin locks in I/O barrier code
When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
only 2.7GB/s, this is around 50% of the idea performance number.

The perf reports locking contention happens at allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is, in these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
They always hold conf->resync_lock firstly, even there are only regular
reading I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in I/O barrier code, and only
holding conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provides
comments to improve it. I continue to work on it, and make the patch into
current form.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   Which calls _wait_barrier(), is used for regular write I/O. If there is
   resync I/O happening on the same I/O barrier bucket, or the whole
   array is frozen, task will wait until no barrier on same barrier bucket,
   or the whold array is unfreezed.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   will make sure only uptodate data will be read out), it is unnecessary
   to wait for barrier in regular read I/Os, waiting in only necessary
   when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf->
nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
raised in same barrier bucket index and array is not frozen, the regular
I/O doesn't need to hold conf->resync_lock, it can just increase
conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
very similar to _wait_barrier(), the only difference is it only waits when
array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
code almostly gets rid of all spin lock cost.

This patch significantly improves raid1 reading peroformance. From my
testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V4:
- Change conf->nr_queued[] to atomic_t.
- Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf->nr_queued[] from atomic_t to int.
- Change conf->array_frozen from atomic_t back to int, and use
  READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
  in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
  wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
  to fix a deadlock between  _wait_barrier()/wait_read_barrier and
  freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no racy when checking two
  atomic_t variables at same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-19 22:04:25 -08:00
arch Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-02-11 10:31:46 -08:00
block block: introduce bio_clone_bioset_partial() 2017-02-15 11:22:05 -08:00
certs certs: Add a secondary system keyring that can be added to dynamically 2016-04-11 22:48:09 +01:00
crypto crypto: algif_aead - Fix kernel panic on list_del 2017-02-03 17:45:48 +08:00
Documentation MD: add doc for raid5-cache 2017-02-13 09:17:54 -08:00
drivers RAID1: avoid unnecessary spin locks in I/O barrier code 2017-02-19 22:04:25 -08:00
firmware WHENCE: use https://linuxtv.org for LinuxTV URLs 2015-12-04 10:35:11 -02:00
fs Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2017-02-11 09:15:58 -08:00
include block: introduce bio_clone_bioset_partial() 2017-02-15 11:22:05 -08:00
init kbuild: modversions: add infrastructure for emitting relative CRCs 2017-02-03 08:28:25 -08:00
ipc ipc/sem.c: fix incorrect sem_lock pairing 2017-01-10 18:31:55 -08:00
kernel Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-02-11 10:24:16 -08:00
lib EXPORT_SYMBOL radix_tree_replace_slot 2017-02-13 09:17:51 -08:00
mm mm/slub.c: fix random_seq offset destruction 2017-02-08 15:41:43 -08:00
net l2tp: do not use udp_ioctl() 2017-02-10 15:57:34 -05:00
samples bpf: fix samples xdp_tx_iptunnel and tc_l2_redirect with fake KBUILD_MODNAME 2017-01-20 12:04:07 -05:00
scripts kbuild: modversions: add infrastructure for emitting relative CRCs 2017-02-03 08:28:25 -08:00
security selinux: fix off-by-one in setprocattr 2017-02-08 19:09:53 +11:00
sound ALSA: hda - adding a new NV HDMI/DP codec ID in the driver 2017-02-09 08:57:47 +01:00
tools perf/urgent fixes: 2017-02-03 20:42:30 +01:00
usr kbuild: initramfs cleanup, set target from Kconfig 2017-01-05 09:40:16 -08:00
virt KVM/ARM updates for 4.10-rc4 2017-01-17 15:04:59 +01:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.get_maintainer.ignore Add hch to .get_maintainer.ignore 2015-08-21 14:30:10 -07:00
.gitattributes .gitattributes: set git diff driver for C source code files 2016-10-07 18:46:30 -07:00
.gitignore Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild 2016-08-02 16:48:52 -04:00
.mailmap mailmap: add codeaurora.org names for nameless email commits 2017-01-10 18:31:55 -08:00
COPYING [PATCH] update FSF address in COPYING 2005-09-10 10:06:29 -07:00
CREDITS CREDITS: Remove outdated address information 2016-12-21 15:21:29 -08:00
Kbuild scripts/gdb: provide linux constants 2016-05-23 17:04:14 -07:00
Kconfig kbuild: migrate all arch to the kconfig mainmenu upgrade 2010-09-19 22:54:11 -04:00
MAINTAINERS ARM: SoC fixes for v4.10 2017-02-08 10:01:39 -08:00
Makefile Linux 4.10-rc8 2017-02-12 13:03:20 -08:00
README README: add a new README file, pointing to the Documentation/ 2016-10-24 08:12:35 -02:00

Linux kernel
============

This file was moved to Documentation/admin-guide/README.rst

Please notice that there are several guides for kernel developers and users.
These guides can be rendered in a number of formats, like HTML and PDF.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
See Documentation/00-INDEX for a list of what is contained in each file.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.