linux_dsm_epyc7002/block
NeilBrown 79bd99596b blk: improve order of bio handling in generic_make_request()
To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling.  They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled seqeuntially.  If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete.  This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device.  Both md and dm have examples where this happens.

These deadlocks can be erradicated by more selective ordering of bios.
Specifically by handling them in depth-first order.  That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent.  That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submited requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue.  However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn.  After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, it not enough to remove all deadlocks.  It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request.  This includes never
allocing from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return.  The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off.  If it splits again, the same process happens.  In
each case one bio will be completely handled before the next one is attempted.

With this is place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-08 10:55:17 -07:00
..
partitions partitions/efi: Fix integer overflow in GPT size calculation 2017-01-17 09:02:31 -07:00
badblocks.c badblocks: badblocks_set/clear update unacked_exist 2016-10-21 15:45:47 -06:00
bio-integrity.c block: remove bio_is_rw 2016-10-28 08:45:17 -06:00
bio.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-02-24 14:42:19 -08:00
blk-cgroup.c sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
blk-core.c blk: improve order of bio handling in generic_make_request() 2017-03-08 10:55:17 -07:00
blk-exec.c block: introduce blk_rq_is_passthrough 2017-01-31 14:00:34 -07:00
blk-flush.c block: don't defer flushes on blk-mq + scheduling 2017-02-17 12:35:47 -07:00
blk-integrity.c block: Use pointer to backing_dev_info from request_queue 2017-02-02 08:20:48 -07:00
blk-ioc.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2017-03-03 10:53:35 -08:00
blk-lib.c block: don't try Write Same from __blkdev_issue_zeroout 2017-02-06 09:34:46 -07:00
blk-map.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h> 2017-03-02 08:42:36 +01:00
blk-merge.c block: optionally merge discontiguous discard bios into a single request 2017-02-08 13:43:08 -07:00
blk-mq-cpumap.c blk-mq: export blk_mq_map_queues 2016-11-08 17:30:00 -05:00
blk-mq-debugfs.c block: use same block debugfs directory for blk-mq and blktrace 2017-02-02 10:20:16 -07:00
blk-mq-pci.c blk_mq: linux/blk-mq.h does not include all the headers it depends on 2016-09-19 08:21:51 -06:00
blk-mq-sched.c blk-mq: move update of tags->rqs to __blk_mq_alloc_request() 2017-03-02 08:56:04 -07:00
blk-mq-sched.h blk-mq-sched: separate mark hctx and queue restart operations 2017-02-23 11:55:47 -07:00
blk-mq-sysfs.c blk-mq: free hctx->cpumask in release handler of hctx's kobject 2017-03-08 09:56:12 -07:00
blk-mq-tag.c blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset 2017-03-02 08:56:04 -07:00
blk-mq-tag.h blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset 2017-03-02 08:56:04 -07:00
blk-mq-virtio.c blk-mq: provide a default queue mapping for virtio device 2017-02-27 20:54:05 +02:00
blk-mq.c blk-mq: free hctx->cpumask in release handler of hctx's kobject 2017-03-08 09:56:12 -07:00
blk-mq.h blk-mq: make lifetime consitent between q/ctx and its kobject 2017-03-08 09:56:12 -07:00
blk-settings.c block: optionally merge discontiguous discard bios into a single request 2017-02-08 13:43:08 -07:00
blk-softirq.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h> 2017-03-02 08:42:26 +01:00
blk-stat.c blk-stat: fix a few cases of missing batch flushing 2016-12-09 13:08:35 -07:00
blk-stat.h block: add scalable completion tracking of requests 2016-11-10 13:53:26 -07:00
blk-sysfs.c block: don't call ioc_exit_icq() with the queue lock held for blk-mq 2017-03-02 13:59:08 -07:00
blk-tag.c blk-mq-sched: add framework for MQ capable IO schedulers 2017-01-17 10:04:20 -07:00
blk-throttle.c scripts/spelling.txt: add "embeded" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
blk-timeout.c block: remove REQ_NO_TIMEOUT flag 2015-12-22 09:38:34 -07:00
blk-wbt.c block: Use pointer to backing_dev_info from request_queue 2017-02-02 08:20:48 -07:00
blk-wbt.h blk-wbt: allow wbt to be enabled always through sysfs 2016-11-28 10:27:03 -07:00
blk-zoned.c block: Rename blk_queue_zone_size and bdev_zone_size 2017-01-12 07:58:32 -07:00
blk.h block: optionally merge discontiguous discard bios into a single request 2017-02-08 13:43:08 -07:00
bounce.c
bsg-lib.c block: split scsi_request out of struct request 2017-01-27 15:08:35 -07:00
bsg.c lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
cfq-iosched.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> 2017-03-02 08:42:27 +01:00
cmdline-parser.c
compat_ioctl.c block: Get rid of blk_get_backing_dev_info() 2017-02-02 08:21:32 -07:00
deadline-iosched.c block: enumify ELEVATOR_*_MERGE 2017-02-08 13:43:06 -07:00
elevator.c block: don't call ioc_exit_icq() with the queue lock held for blk-mq 2017-03-02 13:59:08 -07:00
genhd.c Revert "scsi, block: fix duplicate bdi name registration crashes" 2017-03-08 10:55:17 -07:00
ioctl.c block: Get rid of blk_get_backing_dev_info() 2017-02-02 08:21:32 -07:00
ioprio.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
Kconfig virtio, vhost: optimizations, fixes 2017-03-02 13:53:13 -08:00
Kconfig.iosched block: get rid of blk-mq default scheduler choice Kconfig entries 2017-02-22 13:19:45 -07:00
Makefile virtio, vhost: optimizations, fixes 2017-03-02 13:53:13 -08:00
mq-deadline.c block: enumify ELEVATOR_*_MERGE 2017-02-08 13:43:06 -07:00
noop-iosched.c block: move existing elevator ops to union 2017-01-17 10:03:33 -07:00
opal_proto.h block/sed-opal: allocate struct opal_dev dynamically 2017-02-17 12:41:47 -07:00
partition-generic.c block: Rename blk_queue_zone_size and bdev_zone_size 2017-01-12 07:58:32 -07:00
scsi_ioctl.c block: fold cmd_type into the REQ_OP_ space 2017-01-31 14:00:44 -07:00
sed-opal.c block/sed: Fix opal user range check and unused variables 2017-03-08 09:56:12 -07:00
t10-pi.c