2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Copyright (C) 1991, 1992 Linus Torvalds
|
|
|
|
* Copyright (C) 1994, Karl Keyte: Added support for disk statistics
|
|
|
|
* Elevator latency, (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
|
|
|
|
* Queue request tables / lock, selectable elevator, Jens Axboe <axboe@suse.de>
|
2008-01-31 19:03:55 +07:00
|
|
|
* kernel-doc documentation started by NeilBrown <neilb@cse.unsw.edu.au>
|
|
|
|
* - July2000
|
2005-04-17 05:20:36 +07:00
|
|
|
* bio rewrite, highmem i/o, etc, Jens Axboe <axboe@suse.de> - may 2001
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This handles all read/write requests to block devices
|
|
|
|
*/
|
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/backing-dev.h>
|
|
|
|
#include <linux/bio.h>
|
|
|
|
#include <linux/blkdev.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
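The funnelling of per-cpu queues into hardware submission queues can be sketched in user space. This is only an illustration of the 1:1 versus N:M idea, not the blk-mq implementation: the helper name `map_cpu_to_hwq` and the modulo policy are assumptions made for the example.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel API): pick the hardware submission
 * queue that a per-cpu software queue funnels into. With at least as many
 * hardware queues as CPUs the mapping is 1:1; with fewer it becomes N:M
 * and several CPUs share one hardware queue.
 */
static unsigned int map_cpu_to_hwq(unsigned int cpu, unsigned int nr_cpus,
				   unsigned int nr_hw_queues)
{
	assert(nr_cpus > 0 && nr_hw_queues > 0 && cpu < nr_cpus);

	if (nr_hw_queues >= nr_cpus)
		return cpu;		/* 1:1: each CPU has its own queue */
	return cpu % nr_hw_queues;	/* N:M: CPUs share hardware queues */
}
```

With four CPUs and two hardware queues, for instance, CPUs 1 and 3 would both submit through hardware queue 1 under this assumed policy.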
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking requests on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
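The per-request payload item in the list above can be illustrated with a small user-space sketch. It assumes one allocation that holds the request followed by the driver's private command area; `struct sketch_request`, `sketch_alloc_request` and `sketch_rq_to_payload` are hypothetical names for this example, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Simplified stand-in for a block layer request; not the kernel struct.
 * The alignment keeps the payload that follows suitably aligned for any type.
 */
struct sketch_request {
	_Alignas(max_align_t) int tag;
};

/*
 * Allocate one request followed by cmd_size bytes of driver payload.
 * cmd_size is the size a driver would declare once at init time.
 */
static struct sketch_request *sketch_alloc_request(size_t cmd_size)
{
	assert(cmd_size > 0);
	return calloc(1, sizeof(struct sketch_request) + cmd_size);
}

/* The driver's private command area starts right after the request. */
static void *sketch_rq_to_payload(struct sketch_request *rq)
{
	return rq + 1;
}
```

Because the payload lives in the same allocation as the request, a driver following this pattern would need no separate lookup table or extra allocation per command.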
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 15:20:05 +07:00
|
|
|
#include <linux/blk-mq.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/highmem.h>
|
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/kernel_stat.h>
|
|
|
|
#include <linux/string.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/completion.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/writeback.h>
|
2006-12-10 17:19:35 +07:00
|
|
|
#include <linux/task_io_accounting_ops.h>
|
2006-12-08 17:39:46 +07:00
|
|
|
#include <linux/fault-inject.h>
|
2011-03-08 19:19:51 +07:00
|
|
|
#include <linux/list_sort.h>
|
2011-10-19 19:32:38 +07:00
|
|
|
#include <linux/delay.h>
|
2012-04-20 06:29:22 +07:00
|
|
|
#include <linux/ratelimit.h>
|
2013-03-23 10:42:26 +07:00
|
|
|
#include <linux/pm_runtime.h>
|
2015-05-23 04:13:17 +07:00
|
|
|
#include <linux/blk-cgroup.h>
|
2017-02-01 05:53:20 +07:00
|
|
|
#include <linux/debugfs.h>
|
tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the device from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print,
whereas blktrace does the conversion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
dd dd + ioctl blktrace dd + TRACE_EVENT (splice)
1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s
2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s
3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s
So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 12:43:05 +07:00
|
|
|
|
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/block.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-01-29 20:51:59 +07:00
|
|
|
#include "blk.h"
|
2013-12-26 20:31:35 +07:00
|
|
|
#include "blk-mq.h"
|
2017-01-17 20:03:22 +07:00
|
|
|
#include "blk-mq-sched.h"
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration from the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply decrement the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 02:38:14 +07:00
|
|
|
#include "blk-wbt.h"
|
2008-01-29 20:51:59 +07:00
|
|
|
|
2017-02-01 05:53:20 +07:00
|
|
|
#ifdef CONFIG_DEBUG_FS
|
|
|
|
struct dentry *blk_debugfs_root;
|
|
|
|
#endif
|
|
|
|
|
2010-11-16 18:52:38 +07:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
|
2009-10-02 02:16:13 +07:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
|
2013-04-18 23:00:26 +07:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
|
2014-04-29 01:30:52 +07:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_split);
|
2012-12-15 02:49:27 +07:00
|
|
|
EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
|
2008-11-26 17:59:56 +07:00
|
|
|
|
2011-12-14 06:33:37 +07:00
|
|
|
DEFINE_IDA(blk_queue_ida);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* For the allocated request tables
|
|
|
|
*/
|
2015-11-24 08:58:45 +07:00
|
|
|
struct kmem_cache *request_cachep;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For queue allocation
|
|
|
|
*/
|
2008-01-31 19:03:55 +07:00
|
|
|
struct kmem_cache *blk_requestq_cachep;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Controlling structure to kblockd
|
|
|
|
*/
|
2006-01-09 22:02:34 +07:00
|
|
|
static struct workqueue_struct *kblockd_workqueue;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-05-23 04:13:42 +07:00
|
|
|
static void blk_clear_congested(struct request_list *rl, int sync)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_CGROUP_WRITEBACK
|
|
|
|
clear_wb_congested(rl->blkg->wb_congested, sync);
|
|
|
|
#else
|
2015-05-23 04:13:43 +07:00
|
|
|
/*
|
|
|
|
* If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
|
|
|
|
* flip its congestion state for events on other blkcgs.
|
|
|
|
*/
|
|
|
|
if (rl == &rl->q->root_rl)
|
2017-02-02 21:56:50 +07:00
|
|
|
clear_wb_congested(rl->q->backing_dev_info->wb.congested, sync);
|
2015-05-23 04:13:42 +07:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_set_congested(struct request_list *rl, int sync)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_CGROUP_WRITEBACK
|
|
|
|
set_wb_congested(rl->blkg->wb_congested, sync);
|
|
|
|
#else
|
2015-05-23 04:13:43 +07:00
|
|
|
/* see blk_clear_congested() */
|
|
|
|
if (rl == &rl->q->root_rl)
|
2017-02-02 21:56:50 +07:00
|
|
|
set_wb_congested(rl->q->backing_dev_info->wb.congested, sync);
|
2015-05-23 04:13:42 +07:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2008-01-29 20:51:59 +07:00
|
|
|
void blk_queue_congestion_threshold(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int nr;
|
|
|
|
|
|
|
|
nr = q->nr_requests - (q->nr_requests / 8) + 1;
|
|
|
|
if (nr > q->nr_requests)
|
|
|
|
nr = q->nr_requests;
|
|
|
|
q->nr_congestion_on = nr;
|
|
|
|
|
|
|
|
nr = q->nr_requests - (q->nr_requests / 8) - (q->nr_requests / 16) - 1;
|
|
|
|
if (nr < 1)
|
|
|
|
nr = 1;
|
|
|
|
q->nr_congestion_off = nr;
|
|
|
|
}
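The threshold arithmetic in blk_queue_congestion_threshold() can be checked in isolation. The sketch below restates the same integer arithmetic in user space; `struct sketch_thresholds` and `sketch_congestion` are names invented for this example, and the field comments describe how the two values are presumably used as a hysteresis pair.

```c
#include <assert.h>

struct sketch_thresholds {
	int on;		/* value stored in nr_congestion_on */
	int off;	/* value stored in nr_congestion_off */
};

/* User-space restatement of the arithmetic shown above. */
static struct sketch_thresholds sketch_congestion(int nr_requests)
{
	struct sketch_thresholds t;
	int nr;

	assert(nr_requests > 0);

	/* "on" threshold: 7/8 of the queue plus one, capped at the queue size */
	nr = nr_requests - (nr_requests / 8) + 1;
	if (nr > nr_requests)
		nr = nr_requests;
	t.on = nr;

	/* "off" threshold: a further 1/16 lower, minus one, but at least 1 */
	nr = nr_requests - (nr_requests / 8) - (nr_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.off = nr;

	return t;
}
```

For nr_requests = 128 this yields 113 and 103; the gap between the two values keeps the congestion state from flapping when the queue depth hovers around a single threshold.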
|
|
|
|
|
2008-04-29 14:54:36 +07:00
|
|
|
void blk_rq_init(struct request_queue *q, struct request *rq)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2008-04-25 17:26:28 +07:00
|
|
|
memset(rq, 0, sizeof(*rq));
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
INIT_LIST_HEAD(&rq->queuelist);
|
2008-09-14 19:55:09 +07:00
|
|
|
INIT_LIST_HEAD(&rq->timeout_list);
|
2008-09-14 01:26:01 +07:00
|
|
|
rq->cpu = -1;
|
2008-02-08 18:41:03 +07:00
|
|
|
rq->q = q;
|
2009-05-07 20:24:44 +07:00
|
|
|
rq->__sector = (sector_t) -1;
|
2006-07-13 16:55:04 +07:00
|
|
|
INIT_HLIST_NODE(&rq->hash);
|
|
|
|
RB_CLEAR_NODE(&rq->rb_node);
|
2008-02-08 18:41:03 +07:00
|
|
|
rq->tag = -1;
|
2017-01-17 20:03:22 +07:00
|
|
|
rq->internal_tag = -1;
|
2009-04-23 09:05:18 +07:00
|
|
|
rq->start_time = jiffies;
|
2010-04-02 05:01:41 +07:00
|
|
|
set_start_time_ns(rq);
|
2011-01-05 22:57:38 +07:00
|
|
|
rq->part = NULL;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2008-04-29 14:54:36 +07:00
|
|
|
EXPORT_SYMBOL(blk_rq_init);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-09-27 17:46:13 +07:00
|
|
|
static void req_bio_endio(struct request *rq, struct bio *bio,
|
|
|
|
unsigned int nbytes, int error)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2015-06-26 21:01:13 +07:00
|
|
|
if (error)
|
2015-07-20 20:29:37 +07:00
|
|
|
bio->bi_error = error;
|
2006-01-06 15:51:03 +07:00
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
if (unlikely(rq->rq_flags & RQF_QUIET))
|
2015-07-25 01:37:59 +07:00
|
|
|
bio_set_flag(bio, BIO_QUIET);
|
block: Suppress Buffer I/O errors when SCSI REQ_QUIET flag set
Allow the scsi request REQ_QUIET flag to be propagated to the buffer
file system layer. The basic idea is to pass the flag from the scsi
request to the bio (block IO) and then to the buffer layer. The buffer
layer can then suppress needless printks.
This patch declutters the kernel log by removing the 40-50 (per lun)
buffer io error messages seen during a boot in my multipath setup. There
is a good chance any real errors will be missed in the "noise" in the
logs without this patch.
During boot I see blocks of messages like
"
__ratelimit: 211 callbacks suppressed
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242847
Buffer I/O error on device sdm, logical block 1
Buffer I/O error on device sdm, logical block 5242878
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242872
"
in my logs.
My disk environment is multipath fiber channel using the SCSI_DH_RDAC
code and multipathd. This topology includes an "active" and "ghost"
path for each lun. IOs to the "ghost" path will never complete and the
SCSI layer, via the scsi device handler rdac code, quickly returns the IOs
to these paths and sets the REQ_QUIET scsi flag to suppress the scsi
layer messages.
I want to extend the QUIET behavior to include the buffer file
system layer to deal with these errors as well. I have been running this
patch for a while now on several boxes without issue. A few runs of
bonnie++ show no noticeable difference in performance in my setup.
Thanks to John Stultz for the quiet_error finalization.
Submitted-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-25 16:24:35 +07:00
|
|
|
|
2012-09-21 06:38:30 +07:00
|
|
|
bio_advance(bio, nbytes);
|
2008-07-01 01:04:41 +07:00
|
|
|
|
2011-01-25 18:43:52 +07:00
|
|
|
/* don't actually finish bio if it's part of flush sequence */
|
2016-10-20 20:12:13 +07:00
|
|
|
if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
|
2015-07-20 20:29:37 +07:00
|
|
|
bio_endio(bio);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
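The completion logic of req_bio_endio() can be modelled without kernel types. The sketch below is a user-space approximation under stated assumptions: `struct sketch_bio` keeps only a byte count and three flags, subtracting from `remaining` stands in for bio_advance(), and setting `ended` stands in for bio_endio(). None of these names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced model of a bio: just what the completion path inspects. */
struct sketch_bio {
	unsigned int remaining;	/* bytes still to complete */
	int error;		/* first or latest error reported */
	bool quiet;		/* error messages should be suppressed */
	bool ended;		/* completion has been signalled */
};

static void sketch_req_bio_endio(struct sketch_bio *bio, unsigned int nbytes,
				 int error, bool rq_quiet, bool rq_flush_seq)
{
	assert(nbytes <= bio->remaining);

	if (error)
		bio->error = error;

	/* propagate the request's quiet flag down to the bio */
	if (rq_quiet)
		bio->quiet = true;

	bio->remaining -= nbytes;	/* stands in for bio_advance() */

	/* finish only when fully consumed and not part of a flush sequence */
	if (bio->remaining == 0 && !rq_flush_seq)
		bio->ended = true;
}
```

A bio completed in two halves is therefore signalled only on the second call, and a bio that belongs to a flush sequence is left for the flush machinery to finish.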
|
|
|
|
|
|
|
|
void blk_dump_rq_flags(struct request *rq, char *msg)
|
|
|
|
{
|
2017-01-31 22:57:31 +07:00
|
|
|
printk(KERN_INFO "%s: dev %s: flags=%llx\n", msg,
|
|
|
|
rq->rq_disk ? rq->rq_disk->disk_name : "?",
|
2013-05-23 17:25:08 +07:00
|
|
|
(unsigned long long) rq->cmd_flags);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-05-07 20:24:39 +07:00
|
|
|
printk(KERN_INFO " sector %llu, nr/cnr %u/%u\n",
|
|
|
|
(unsigned long long)blk_rq_pos(rq),
|
|
|
|
blk_rq_sectors(rq), blk_rq_cur_sectors(rq));
|
2014-04-10 22:46:28 +07:00
|
|
|
printk(KERN_INFO " bio %p, biotail %p, len %u\n",
|
|
|
|
rq->bio, rq->biotail, blk_rq_bytes(rq));
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_dump_rq_flags);
|
|
|
|
|
2011-03-02 23:08:00 +07:00
|
|
|
static void blk_delay_work(struct work_struct *work)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-03-02 23:08:00 +07:00
|
|
|
struct request_queue *q;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-03-02 23:08:00 +07:00
|
|
|
q = container_of(work, struct request_queue, delay_work.work);
|
|
|
|
spin_lock_irq(q->queue_lock);
|
2011-04-18 16:41:33 +07:00
|
|
|
__blk_run_queue(q);
|
2011-03-02 23:08:00 +07:00
|
|
|
spin_unlock_irq(q->queue_lock);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2011-03-02 23:08:00 +07:00
|
|
|
* blk_delay_queue - restart queueing after defined interval
|
|
|
|
* @q: The &struct request_queue in question
|
|
|
|
* @msecs: Delay in msecs
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Description:
|
2011-03-02 23:08:00 +07:00
|
|
|
* Sometimes queueing needs to be postponed for a little while, to allow
|
|
|
|
* resources to come back. This function will make sure that queueing is
|
2012-11-28 19:45:56 +07:00
|
|
|
* restarted around the specified time. Queue lock must be held.
|
2011-03-02 23:08:00 +07:00
|
|
|
*/
|
|
|
|
void blk_delay_queue(struct request_queue *q, unsigned long msecs)
|
2007-11-08 02:26:56 +07:00
|
|
|
{
|
2012-11-28 19:45:56 +07:00
|
|
|
if (likely(!blk_queue_dead(q)))
|
|
|
|
queue_delayed_work(kblockd_workqueue, &q->delay_work,
|
|
|
|
msecs_to_jiffies(msecs));
|
2007-11-08 02:26:56 +07:00
|
|
|
}
|
2011-03-02 23:08:00 +07:00
|
|
|
EXPORT_SYMBOL(blk_delay_queue);
|
2007-11-08 02:26:56 +07:00
|
|
|
|
2015-12-29 03:01:22 +07:00
|
|
|
/**
|
|
|
|
* blk_start_queue_async - asynchronously restart a previously stopped queue
|
|
|
|
* @q: The &struct request_queue in question
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* blk_start_queue_async() will clear the stop flag on the queue, and
|
|
|
|
* ensure that the request_fn for the queue is run from an async
|
|
|
|
* context.
|
|
|
|
**/
|
|
|
|
void blk_start_queue_async(struct request_queue *q)
|
|
|
|
{
|
|
|
|
queue_flag_clear(QUEUE_FLAG_STOPPED, q);
|
|
|
|
blk_run_queue_async(q);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_start_queue_async);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/**
|
|
|
|
* blk_start_queue - restart a previously stopped queue
|
2007-07-24 14:28:11 +07:00
|
|
|
* @q: The &struct request_queue in question
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* blk_start_queue() will clear the stop flag on the queue, and call
|
|
|
|
* the request_fn for the queue if it was in a stopped state when
|
|
|
|
* entered. Also see blk_stop_queue(). Queue lock must be held.
|
|
|
|
**/
|
2007-07-24 14:28:11 +07:00
|
|
|
void blk_start_queue(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2006-06-05 17:09:01 +07:00
|
|
|
WARN_ON(!irqs_disabled());
|
|
|
|
|
2008-04-29 19:48:33 +07:00
|
|
|
queue_flag_clear(QUEUE_FLAG_STOPPED, q);
|
2011-04-18 16:41:33 +07:00
|
|
|
__blk_run_queue(q);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_start_queue);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_stop_queue - stop a queue
|
2007-07-24 14:28:11 +07:00
|
|
|
* @q: The &struct request_queue in question
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* The Linux block layer assumes that a block driver will consume all
|
|
|
|
* entries on the request queue when the request_fn strategy is called.
|
|
|
|
* Often this will not happen, because of hardware limitations (queue
|
|
|
|
* depth settings). If a device driver gets a 'queue full' response,
|
|
|
|
* or if it simply chooses not to queue more I/O at one point, it can
|
|
|
|
* call this function to prevent the request_fn from being called until
|
|
|
|
* the driver has signalled it's ready to go again. This happens by calling
|
|
|
|
* blk_start_queue() to restart queue operations. Queue lock must be held.
|
|
|
|
**/
|
2007-07-24 14:28:11 +07:00
|
|
|
void blk_stop_queue(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-08-22 03:18:24 +07:00
|
|
|
cancel_delayed_work(&q->delay_work);
|
2008-04-29 19:48:33 +07:00
|
|
|
queue_flag_set(QUEUE_FLAG_STOPPED, q);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_stop_queue);
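The interplay between blk_stop_queue(), blk_start_queue() and the stopped check in the run path can be shown with a toy model. Everything here is hypothetical and single-threaded: `struct sketch_queue` and the `sketch_*` functions are invented for the example, and locking and the delayed work item are left out.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_queue {
	bool stopped;
	int pending;	/* requests waiting to be dispatched */
	int dispatched;	/* requests handed to the "hardware" */
};

/* Stand-in for a driver's request_fn: consume everything pending. */
static void sketch_request_fn(struct sketch_queue *q)
{
	assert(q->pending >= 0);
	q->dispatched += q->pending;
	q->pending = 0;
}

/* Running a stopped queue is a no-op. */
static void sketch_run_queue(struct sketch_queue *q)
{
	if (q->stopped)
		return;
	sketch_request_fn(q);
}

static void sketch_stop_queue(struct sketch_queue *q)
{
	q->stopped = true;
}

/* Clearing the stop flag also kicks the queue so pending work resumes. */
static void sketch_start_queue(struct sketch_queue *q)
{
	q->stopped = false;
	sketch_run_queue(q);
}
```

In this model, requests queued while the device reports "queue full" simply stay pending until the driver calls the start function again.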
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_sync_queue - cancel any pending callbacks on a queue
|
|
|
|
* @q: the queue
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* The block layer may perform asynchronous callback activity
|
|
|
|
* on a queue, such as calling the unplug function after a timeout.
|
|
|
|
* A block device may call blk_sync_queue to ensure that any
|
|
|
|
* such activity is cancelled, thus allowing it to release resources
|
2007-05-09 13:57:56 +07:00
|
|
|
* that the callbacks might use. The caller must already have made sure
|
2005-04-17 05:20:36 +07:00
|
|
|
* that its ->make_request_fn will not re-add plugging prior to calling
|
|
|
|
* this function.
|
|
|
|
*
|
2011-03-03 07:05:33 +07:00
|
|
|
* This function does not cancel any asynchronous activity arising
|
2014-09-08 23:27:23 +07:00
|
|
|
* out of elevator or throttling code. That would require elevator_exit()
|
2012-03-06 04:15:12 +07:00
|
|
|
* and blkcg_exit_queue() to be called with queue lock initialized.
|
2011-03-03 07:05:33 +07:00
|
|
|
*
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
void blk_sync_queue(struct request_queue *q)
|
|
|
|
{
|
2008-11-19 20:38:39 +07:00
|
|
|
del_timer_sync(&q->timeout);
|
2013-12-26 20:31:36 +07:00
|
|
|
|
|
|
|
if (q->mq_ops) {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
int i;
|
|
|
|
|
2014-04-16 23:48:08 +07:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2016-08-25 04:54:25 +07:00
|
|
|
cancel_work_sync(&hctx->run_work);
|
2014-04-16 23:48:08 +07:00
|
|
|
cancel_delayed_work_sync(&hctx->delay_work);
|
|
|
|
}
|
2013-12-26 20:31:36 +07:00
|
|
|
} else {
|
|
|
|
cancel_delayed_work_sync(&q->delay_work);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_sync_queue);
|
|
|
|
|
2012-12-06 20:32:01 +07:00
|
|
|
/**
|
|
|
|
* __blk_run_queue_uncond - run a queue whether or not it has been stopped
|
|
|
|
* @q: The queue to run
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Invoke request handling on a queue if there are any pending requests.
|
|
|
|
* May be used to restart request handling after a request has completed.
|
|
|
|
* This variant runs the queue whether or not the queue has been
|
|
|
|
* stopped. Must be called with the queue lock held and interrupts
|
|
|
|
* disabled. See also @blk_run_queue.
|
|
|
|
*/
|
|
|
|
inline void __blk_run_queue_uncond(struct request_queue *q)
|
|
|
|
{
|
|
|
|
if (unlikely(blk_queue_dead(q)))
|
|
|
|
return;
|
|
|
|
|
2012-11-28 19:46:45 +07:00
|
|
|
/*
|
|
|
|
* Some request_fn implementations, e.g. scsi_request_fn(), unlock
|
|
|
|
* the queue lock internally. As a result multiple threads may be
|
|
|
|
* running such a request function concurrently. Keep track of the
|
|
|
|
* number of active request_fn invocations such that blk_drain_queue()
|
|
|
|
* can wait until all these request_fn calls have finished.
|
|
|
|
*/
|
|
|
|
q->request_fn_active++;
|
2012-12-06 20:32:01 +07:00
|
|
|
q->request_fn(q);
|
2012-11-28 19:46:45 +07:00
|
|
|
q->request_fn_active--;
|
2012-12-06 20:32:01 +07:00
|
|
|
}
|
2015-04-18 03:37:20 +07:00
|
|
|
EXPORT_SYMBOL_GPL(__blk_run_queue_uncond);
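The request_fn_active bookkeeping can be illustrated with a small single-threaded model. It is a sketch under assumptions: `struct sketch_rq_queue` and the `sketch_*` helpers are invented names, and the real code relies on the queue lock for consistency, which is omitted here.

```c
#include <assert.h>

struct sketch_rq_queue {
	int request_fn_active;	/* request_fn invocations in flight */
	int seen_active;	/* value observed from inside request_fn */
	void (*request_fn)(struct sketch_rq_queue *q);
};

/* Bracket every request_fn call with the in-flight counter. */
static void sketch_run_uncond(struct sketch_rq_queue *q)
{
	q->request_fn_active++;
	q->request_fn(q);
	q->request_fn_active--;
}

/* A drain loop may only finish once no request_fn call is in flight. */
static int sketch_can_finish_drain(const struct sketch_rq_queue *q)
{
	assert(q->request_fn_active >= 0);
	return q->request_fn_active == 0;
}

/* Example request_fn: record the counter as seen during the call. */
static void sketch_example_request_fn(struct sketch_rq_queue *q)
{
	q->seen_active = q->request_fn_active;
}
```

While a request function is executing, the counter is non-zero, so a drain that polls it cannot complete underneath a request_fn that has temporarily dropped the lock.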
|
2012-12-06 20:32:01 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/**
|
2008-10-14 14:51:06 +07:00
|
|
|
* __blk_run_queue - run a single device queue
|
2005-04-17 05:20:36 +07:00
|
|
|
* @q: The queue to run
|
2008-10-14 14:51:06 +07:00
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* See @blk_run_queue. This variant must be called with the queue lock
|
2011-04-18 16:41:33 +07:00
|
|
|
* held and interrupts disabled.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2011-04-18 16:41:33 +07:00
|
|
|
void __blk_run_queue(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2009-04-23 09:05:17 +07:00
|
|
|
if (unlikely(blk_queue_stopped(q)))
|
|
|
|
return;
|
|
|
|
|
2012-12-06 20:32:01 +07:00
|
|
|
__blk_run_queue_uncond(q);
|
2008-04-29 19:48:33 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__blk_run_queue);
|
2006-05-11 13:20:16 +07:00
|
|
|
|
2011-04-18 16:41:33 +07:00
|
|
|
/**
|
|
|
|
* blk_run_queue_async - run a single device queue in workqueue context
|
|
|
|
* @q: The queue to run
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Tells kblockd to perform the equivalent of @blk_run_queue on behalf
|
2012-11-28 19:45:56 +07:00
|
|
|
* of us. The caller must hold the queue lock.
|
2011-04-18 16:41:33 +07:00
|
|
|
*/
|
|
|
|
void blk_run_queue_async(struct request_queue *q)
|
|
|
|
{
|
2012-11-28 19:45:56 +07:00
|
|
|
if (likely(!blk_queue_stopped(q) && !blk_queue_dead(q)))
|
2012-08-22 03:18:24 +07:00
|
|
|
mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
|
2011-04-18 16:41:33 +07:00
|
|
|
}
|
2011-04-19 18:32:46 +07:00
|
|
|
EXPORT_SYMBOL(blk_run_queue_async);
|
2011-04-18 16:41:33 +07:00
|
|
|
|
2008-04-29 19:48:33 +07:00
|
|
|
/**
|
|
|
|
* blk_run_queue - run a single device queue
|
|
|
|
* @q: The queue to run
|
2008-10-14 14:51:06 +07:00
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Invoke request handling on this queue, if it has pending work to do.
|
2009-04-23 09:05:17 +07:00
|
|
|
* May be used to restart queueing when a request has completed.
|
2008-04-29 19:48:33 +07:00
|
|
|
*/
|
|
|
|
void blk_run_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(q->queue_lock, flags);
|
2011-04-18 16:41:33 +07:00
|
|
|
__blk_run_queue(q);
|
2005-04-17 05:20:36 +07:00
|
|
|
spin_unlock_irqrestore(q->queue_lock, flags);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_run_queue);
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
void blk_put_queue(struct request_queue *q)
|
2006-03-19 06:34:37 +07:00
|
|
|
{
|
|
|
|
kobject_put(&q->kobj);
|
|
|
|
}
|
2011-05-27 12:44:43 +07:00
|
|
|
EXPORT_SYMBOL(blk_put_queue);
|
2006-03-19 06:34:37 +07:00
|
|
|
|
2011-10-19 19:32:38 +07:00
|
|
|
/**
|
2012-11-28 19:43:38 +07:00
|
|
|
* __blk_drain_queue - drain requests from request_queue
|
2011-10-19 19:32:38 +07:00
|
|
|
* @q: queue to drain
|
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
A similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shut down whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost, putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
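The shutdown/release split described here is a general pattern for refcounted objects and can be sketched in user space. The model below is an assumption-laden illustration, not the block layer code: `struct sketch_obj` and the `sketch_*` functions are hypothetical, draining is reduced to resetting a counter, and -ENODEV is merely the error chosen for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_obj {
	int refcount;
	bool dead;	/* set by shutdown; submissions fail afterwards */
	int pending;	/* requests accepted but not yet completed */
};

static struct sketch_obj *sketch_create(void)
{
	struct sketch_obj *o = calloc(1, sizeof(*o));

	if (o)
		o->refcount = 1;
	return o;
}

static void sketch_get(struct sketch_obj *o)
{
	o->refcount++;
}

/* Release: free the object when the last reference goes away. */
static bool sketch_put(struct sketch_obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0) {
		free(o);
		return true;
	}
	return false;
}

/* Reference holders may still submit after shutdown, but fail at once. */
static int sketch_submit(struct sketch_obj *o)
{
	if (o->dead)
		return -ENODEV;
	o->pending++;
	return 0;
}

/* Shutdown: drain what is pending, then mark the object dead. */
static void sketch_shutdown(struct sketch_obj *o)
{
	o->pending = 0;		/* stands in for draining all requests */
	o->dead = true;
}
```

The object stays valid for as long as any reference is held, which is the guarantee the text says users such as bsg were missing.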
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and frees td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 19:42:16 +07:00
|
|
|
* @drain_all: whether to drain all requests or only the ones w/ ELVPRIV
|
2011-10-19 19:32:38 +07:00
|
|
|
*
|
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 19:42:16 +07:00
|
|
|
* Drain requests from @q. If @drain_all is set, all requests are drained.
|
|
|
|
* If not, only ELVPRIV requests are drained. The caller is responsible
|
|
|
|
* for ensuring that no new requests which need to be drained are queued.
|
2011-10-19 19:32:38 +07:00
|
|
|
*/
|
2012-11-28 19:43:38 +07:00
|
|
|
static void __blk_drain_queue(struct request_queue *q, bool drain_all)
|
|
|
|
__releases(q->queue_lock)
|
|
|
|
__acquires(q->queue_lock)
|
2011-10-19 19:32:38 +07:00
|
|
|
{
|
2012-06-15 13:45:25 +07:00
|
|
|
int i;
|
|
|
|
|
2012-11-28 19:43:38 +07:00
|
|
|
lockdep_assert_held(q->queue_lock);
|
|
|
|
|
2011-10-19 19:32:38 +07:00
|
|
|
while (true) {
|
2011-12-14 06:33:37 +07:00
|
|
|
bool drain = false;
|
2011-10-19 19:32:38 +07:00
|
|
|
|
2012-03-07 03:24:55 +07:00
|
|
|
/*
|
|
|
|
* The caller might be trying to drain @q before its
|
|
|
|
* elevator is initialized.
|
|
|
|
*/
|
|
|
|
if (q->elevator)
|
|
|
|
elv_drain_elevator(q);
|
|
|
|
|
2012-03-06 04:15:12 +07:00
|
|
|
blkcg_drain_queue(q);
|
2011-10-19 19:32:38 +07:00
|
|
|
|
2011-12-16 02:03:04 +07:00
|
|
|
/*
|
|
|
|
* This function might be called on a queue which failed
|
2012-03-07 03:24:55 +07:00
|
|
|
* driver init after queue creation or is not yet fully
|
|
|
|
* active. Some drivers (e.g. fd and loop) get unhappy
|
|
|
|
* in such cases. Kick queue iff dispatch queue has
|
|
|
|
* something on it and @q has request_fn set.
|
2011-12-16 02:03:04 +07:00
|
|
|
*/
|
2012-03-07 03:24:55 +07:00
|
|
|
if (!list_empty(&q->queue_head) && q->request_fn)
|
2011-12-16 02:03:04 +07:00
|
|
|
__blk_run_queue(q);
|
2011-10-19 19:42:16 +07:00
|
|
|
|
2012-06-05 10:40:58 +07:00
|
|
|
drain |= q->nr_rqs_elvpriv;
|
2012-11-28 19:46:45 +07:00
|
|
|
drain |= q->request_fn_active;
|
2011-12-14 06:33:37 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Unfortunately, requests are queued at and tracked from
|
|
|
|
* multiple places and there's no single counter which can
|
|
|
|
* be drained. Check all the queues and counters.
|
|
|
|
*/
|
|
|
|
if (drain_all) {
|
2014-09-25 22:23:46 +07:00
|
|
|
struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
|
2011-12-14 06:33:37 +07:00
|
|
|
drain |= !list_empty(&q->queue_head);
|
|
|
|
for (i = 0; i < 2; i++) {
|
2012-06-05 10:40:58 +07:00
|
|
|
drain |= q->nr_rqs[i];
|
2011-12-14 06:33:37 +07:00
|
|
|
drain |= q->in_flight[i];
|
2014-09-25 22:23:43 +07:00
|
|
|
if (fq)
|
|
|
|
drain |= !list_empty(&fq->flush_queue[i]);
|
2011-12-14 06:33:37 +07:00
|
|
|
}
|
|
|
|
}
|
2011-10-19 19:32:38 +07:00
|
|
|
|
2011-12-14 06:33:37 +07:00
|
|
|
if (!drain)
|
2011-10-19 19:32:38 +07:00
|
|
|
break;
|
2012-11-28 19:43:38 +07:00
|
|
|
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
|
2011-10-19 19:32:38 +07:00
|
|
|
msleep(10);
|
2012-11-28 19:43:38 +07:00
|
|
|
|
|
|
|
spin_lock_irq(q->queue_lock);
|
2011-10-19 19:32:38 +07:00
|
|
|
}
|
2012-06-15 13:45:25 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* With queue marked dead, any woken up waiter will fail the
|
|
|
|
* allocation path, so the wakeup chaining is lost and we're
|
|
|
|
* left with hung waiters. We need to wake up those waiters.
|
|
|
|
*/
|
|
|
|
if (q->request_fn) {
|
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 05:05:44 +07:00
|
|
|
struct request_list *rl;
|
|
|
|
|
|
|
|
blk_queue_for_each_rl(rl, q)
|
|
|
|
for (i = 0; i < ARRAY_SIZE(rl->wait); i++)
|
|
|
|
wake_up_all(&rl->wait[i]);
|
2012-06-15 13:45:25 +07:00
|
|
|
}
|
2011-10-19 19:32:38 +07:00
|
|
|
}
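To illustrate the polling pattern used by __blk_drain_queue() above, here is a minimal userspace sketch, not kernel code. It assumes a hypothetical struct model_queue whose fields stand in for q->nr_rqs_elvpriv, q->nr_rqs[] and q->in_flight[], models the queue lock with a plain flag, and replaces msleep(10) with a caller-supplied wait hook so the loop can be exercised deterministically. All model_* names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_queue {
	bool lock_held;   /* stands in for holding q->queue_lock */
	int nr_elvpriv;   /* models q->nr_rqs_elvpriv */
	int nr_rqs[2];    /* models q->nr_rqs[] */
	int in_flight[2]; /* models q->in_flight[] */
	int polls;        /* how many times the drainer had to wait */
};

/* Invoked with the lock dropped; stands in for msleep(10). */
typedef void (*model_wait_fn)(struct model_queue *q);

/* There is no single counter to watch, so every counter is checked. */
static bool model_needs_drain(const struct model_queue *q, bool drain_all)
{
	bool drain = q->nr_elvpriv != 0;

	if (drain_all) {
		for (size_t i = 0; i < 2; i++) {
			drain |= q->nr_rqs[i] != 0;
			drain |= q->in_flight[i] != 0;
		}
	}
	return drain;
}

/* Caller must hold the (modelled) lock; it is held again on return. */
static void model_drain(struct model_queue *q, bool drain_all,
			model_wait_fn wait)
{
	assert(q->lock_held);

	while (model_needs_drain(q, drain_all)) {
		q->lock_held = false;  /* drop the lock while waiting */
		wait(q);
		q->polls++;
		q->lock_held = true;   /* retake it before re-checking */
	}
}

/* Example wait hook: retire one outstanding request per poll. */
static void model_complete_one(struct model_queue *q)
{
	assert(!q->lock_held);

	if (q->nr_elvpriv > 0) {
		q->nr_elvpriv--;
		return;
	}
	for (size_t i = 0; i < 2; i++) {
		if (q->in_flight[i] > 0) {
			q->in_flight[i]--;
			return;
		}
		if (q->nr_rqs[i] > 0) {
			q->nr_rqs[i]--;
			return;
		}
	}
}
```

With drain_all false, model_drain() returns as soon as the ELVPRIV count reaches zero and leaves the other counters alone. With drain_all true, it keeps polling until every counter is zero.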
|
|
|
|
|
2012-03-06 04:14:58 +07:00
|
|
|
/**
|
|
|
|
* blk_queue_bypass_start - enter queue bypass mode
|
|
|
|
* @q: queue of interest
|
|
|
|
*
|
|
|
|
* In bypass mode, only the dispatch FIFO queue of @q is used. This
|
|
|
|
* function makes @q enter bypass mode and drains all requests which were
|
2012-03-06 04:14:59 +07:00
|
|
|
* throttled or issued before. On return, it's guaranteed that no request
|
2012-04-14 04:50:53 +07:00
|
|
|
* is being throttled or has ELVPRIV set and blk_queue_bypass() is %true
|
|
|
|
* inside queue or RCU read lock.
|
2012-03-06 04:14:58 +07:00
|
|
|
*/
|
|
|
|
void blk_queue_bypass_start(struct request_queue *q)
|
|
|
|
{
|
|
|
|
spin_lock_irq(q->queue_lock);
|
2014-07-01 23:29:17 +07:00
|
|
|
q->bypass_depth++;
|
2012-03-06 04:14:58 +07:00
|
|
|
queue_flag_set(QUEUE_FLAG_BYPASS, q);
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
|
2014-07-01 23:29:17 +07:00
|
|
|
/*
|
|
|
|
* Queues start drained. Skip actual draining till init is
|
|
|
|
* complete. This avoids lengthy delays during queue init which
|
|
|
|
* can happen many times during boot.
|
|
|
|
*/
|
|
|
|
if (blk_queue_init_done(q)) {
|
2012-11-28 19:43:38 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
__blk_drain_queue(q, false);
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
|
2012-04-14 03:11:31 +07:00
|
|
|
/* ensure blk_queue_bypass() is %true inside RCU read lock */
|
|
|
|
synchronize_rcu();
|
|
|
|
}
|
2012-03-06 04:14:58 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_queue_bypass_start);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_queue_bypass_end - leave queue bypass mode
|
|
|
|
* @q: queue of interest
|
|
|
|
*
|
|
|
|
* Leave bypass mode and restore the normal queueing behavior.
|
|
|
|
*/
|
|
|
|
void blk_queue_bypass_end(struct request_queue *q)
|
|
|
|
{
|
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
if (!--q->bypass_depth)
|
|
|
|
queue_flag_clear(QUEUE_FLAG_BYPASS, q);
|
|
|
|
WARN_ON_ONCE(q->bypass_depth < 0);
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_queue_bypass_end);
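The start/end pair above nests: each blk_queue_bypass_start() bumps a depth counter, and the bypass flag is cleared only when the matching number of blk_queue_bypass_end() calls brings the counter back to zero. A minimal userspace sketch of that counting follows. The struct bypass_model type and the bypass_model_* functions are hypothetical stand-ins for q->bypass_depth and QUEUE_FLAG_BYPASS; locking, draining and synchronize_rcu() are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the nesting counter; not a kernel structure. */
struct bypass_model {
	int depth;    /* models q->bypass_depth */
	bool bypass;  /* models QUEUE_FLAG_BYPASS */
};

/* Enter bypass mode; may be called again while already bypassing. */
static void bypass_model_start(struct bypass_model *q)
{
	q->depth++;
	q->bypass = true;
}

/* Leave one level of bypass; the flag clears only at depth zero. */
static void bypass_model_end(struct bypass_model *q)
{
	/* The kernel warns after the decrement; here an unbalanced end aborts. */
	assert(q->depth > 0);

	if (--q->depth == 0)
		q->bypass = false;
}
```

Two nested starts followed by a single end leave the model in bypass mode with depth 1. The second end returns it to normal queueing.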
|
|
|
|
|
2014-12-23 04:04:42 +07:00
|
|
|
void blk_set_queue_dying(struct request_queue *q)
|
|
|
|
{
|
2016-08-17 06:48:36 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
queue_flag_set(QUEUE_FLAG_DYING, q);
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
2014-12-23 04:04:42 +07:00
|
|
|
|
2017-03-27 19:06:58 +07:00
|
|
|
/*
|
|
|
|
* When queue DYING flag is set, we need to block new req
|
|
|
|
* entering queue, so we call blk_freeze_queue_start() to
|
|
|
|
* prevent I/O from crossing blk_queue_enter().
|
|
|
|
*/
|
|
|
|
blk_freeze_queue_start(q);
|
|
|
|
|
2014-12-23 04:04:42 +07:00
|
|
|
if (q->mq_ops)
|
|
|
|
blk_mq_wake_waiters(q);
|
|
|
|
else {
|
|
|
|
struct request_list *rl;
|
|
|
|
|
2017-02-01 13:36:50 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
2014-12-23 04:04:42 +07:00
|
|
|
blk_queue_for_each_rl(rl, q) {
|
|
|
|
if (rl->rq_pool) {
|
|
|
|
wake_up(&rl->wait[BLK_RW_SYNC]);
|
|
|
|
wake_up(&rl->wait[BLK_RW_ASYNC]);
|
|
|
|
}
|
|
|
|
}
|
2017-02-01 13:36:50 +07:00
|
|
|
spin_unlock_irq(q->queue_lock);
|
2014-12-23 04:04:42 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_set_queue_dying);
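Marking a queue dying is the first step of the shutdown/release split described in the "fix request_queue lifetime handling" commit message earlier in this listing: shutdown drains and marks the queue, later submissions fail immediately with -ENODEV, and the memory goes away only when the last reference is dropped. The following userspace sketch models just that lifecycle. Everything named life_queue* is hypothetical, and shutdown "drains" by completing pending requests itself, whereas the real code waits for them to finish.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted queue; a model, not the kernel's request_queue. */
struct life_queue {
	int refcount;  /* owner plus any other reference holders */
	bool dying;    /* set by shutdown, never cleared */
	int pending;   /* requests accepted but not yet completed */
};

static struct life_queue *life_queue_alloc(void)
{
	struct life_queue *q = calloc(1, sizeof(*q));

	if (q)
		q->refcount = 1;
	return q;
}

static void life_queue_get(struct life_queue *q)
{
	q->refcount++;
}

/* Reference holders may still call this after shutdown; it just fails. */
static int life_queue_submit(struct life_queue *q)
{
	if (q->dying)
		return -ENODEV;
	q->pending++;
	return 0;
}

static void life_queue_complete(struct life_queue *q)
{
	assert(q->pending > 0);
	q->pending--;
}

/* Shutdown: refuse new work, then drain what was already accepted. */
static void life_queue_shutdown(struct life_queue *q)
{
	q->dying = true;
	while (q->pending > 0)
		life_queue_complete(q);  /* simplification: complete inline */
}

/* Release: free only when the last reference goes away. Returns true if freed. */
static bool life_queue_put(struct life_queue *q)
{
	if (--q->refcount == 0) {
		free(q);
		return true;
	}
	return false;
}
```

A second reference holder can keep calling life_queue_submit() after life_queue_shutdown() and only ever sees -ENODEV. The structure stays valid until that holder calls life_queue_put().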
|
|
|
|
|
2011-10-19 19:42:16 +07:00
|
|
|
/**
|
|
|
|
* blk_cleanup_queue - shutdown a request queue
|
|
|
|
* @q: request queue to shutdown
|
|
|
|
*
|
2012-12-06 20:32:01 +07:00
|
|
|
* Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
|
|
|
|
* put it. All future requests will be failed immediately with -ENODEV.
|
2011-03-03 07:04:42 +07:00
|
|
|
*/
|
2008-01-31 19:03:55 +07:00
|
|
|
void blk_cleanup_queue(struct request_queue *q)
|
2006-03-19 06:34:37 +07:00
|
|
|
{
|
2011-10-19 19:42:16 +07:00
|
|
|
spinlock_t *lock = q->queue_lock;
|
2008-09-18 23:22:54 +07:00
|
|
|
|
2012-11-28 19:42:38 +07:00
|
|
|
/* mark @q DYING, no new request or merges will be allowed afterwards */
|
2006-03-19 06:34:37 +07:00
|
|
|
mutex_lock(&q->sysfs_lock);
|
2014-12-23 04:04:42 +07:00
|
|
|
blk_set_queue_dying(q);
|
2011-10-19 19:42:16 +07:00
|
|
|
spin_lock_irq(lock);
|
2012-03-06 04:14:59 +07:00
|
|
|
|
2012-04-14 04:50:53 +07:00
|
|
|
/*
|
2012-11-28 19:42:38 +07:00
|
|
|
* A dying queue is permanently in bypass mode till released. Note
|
2012-04-14 04:50:53 +07:00
|
|
|
* that, unlike blk_queue_bypass_start(), we aren't performing
|
|
|
|
* synchronize_rcu() after entering bypass mode to avoid the delay
|
|
|
|
* as some drivers create and destroy a lot of queues while
|
|
|
|
* probing. This is still safe because blk_release_queue() will be
|
|
|
|
* called only after the queue refcnt drops to zero and nothing,
|
|
|
|
* RCU or not, would be traversing the queue by then.
|
|
|
|
*/
|
2012-03-06 04:14:59 +07:00
|
|
|
q->bypass_depth++;
|
|
|
|
queue_flag_set(QUEUE_FLAG_BYPASS, q);
|
|
|
|
|
2011-10-19 19:42:16 +07:00
|
|
|
queue_flag_set(QUEUE_FLAG_NOMERGES, q);
|
|
|
|
queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
|
2012-11-28 19:42:38 +07:00
|
|
|
queue_flag_set(QUEUE_FLAG_DYING, q);
|
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
A similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shut down whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost, putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and frees td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
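The shutdown/release split described in this commit message can be modelled in userspace. What follows is a minimal single-threaded sketch under stated assumptions, not kernel code: `toy_queue`, `toy_submit()`, `toy_shutdown()`, `toy_get()` and `toy_put()` are hypothetical names, locking is left out, and "draining" simply completes all pending work at once.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model of a refcounted queue (not struct request_queue). */
struct toy_queue {
	int refcount;   /* references held by the owner and by other users */
	int pending;    /* requests accepted but not yet completed */
	bool dead;      /* set by shutdown; new requests fail from then on */
	int *elevator;  /* stands in for state that is torn down at shutdown */
};

struct toy_queue *toy_alloc(void)
{
	struct toy_queue *q = calloc(1, sizeof(*q));

	if (!q)
		return NULL;
	q->elevator = calloc(1, sizeof(*q->elevator));
	if (!q->elevator) {
		free(q);
		return NULL;
	}
	q->refcount = 1;
	return q;
}

void toy_get(struct toy_queue *q)
{
	q->refcount++;
}

/* Release: the remaining teardown runs when the last reference is dropped.
 * Returns true if the queue was freed. */
bool toy_put(struct toy_queue *q)
{
	if (--q->refcount > 0)
		return false;
	free(q->elevator);  /* free(NULL) is a no-op after shutdown */
	free(q);
	return true;
}

/* Submitting to a dead queue is still safe, but fails immediately. */
int toy_submit(struct toy_queue *q)
{
	if (q->dead)
		return -ENODEV;
	q->pending++;
	return 0;
}

/* Shutdown: drain, mark dead, tear down what is no longer needed.
 * The queue memory itself stays valid for remaining reference holders. */
void toy_shutdown(struct toy_queue *q)
{
	q->pending = 0;
	q->dead = true;
	free(q->elevator);
	q->elevator = NULL;
}

void example_shutdown_release(void)
{
	struct toy_queue *q = toy_alloc();

	assert(q);
	toy_get(q);                        /* a second user holds a reference */
	assert(toy_submit(q) == 0);
	toy_shutdown(q);
	assert(toy_submit(q) == -ENODEV);  /* fails fast, no use-after-free */
	assert(!toy_put(q));               /* owner drops its reference */
	assert(toy_put(q));                /* last holder triggers release */
}
```

The point of the sketch is that `toy_shutdown()` never frees the queue itself, so a late `toy_submit()` from another reference holder touches valid memory and gets a clean error.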
2011-10-19 19:42:16 +07:00
|
|
|
spin_unlock_irq(lock);
|
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
|
2012-12-06 20:32:01 +07:00
|
|
|
/*
|
|
|
|
* Drain all requests queued before DYING marking. Set DEAD flag to
|
|
|
|
* prevent q->request_fn() from being invoked after draining has finished.
|
|
|
|
*/
|
2015-10-22 00:20:12 +07:00
|
|
|
blk_freeze_queue(q);
|
|
|
|
spin_lock_irq(lock);
|
|
|
|
if (!q->mq_ops)
|
2013-12-26 20:31:35 +07:00
|
|
|
__blk_drain_queue(q, true);
|
2012-12-06 20:32:01 +07:00
|
|
|
queue_flag_set(QUEUE_FLAG_DEAD, q);
|
2012-11-28 19:43:38 +07:00
|
|
|
spin_unlock_irq(lock);
|
2011-10-19 19:42:16 +07:00
|
|
|
|
2015-10-22 00:20:23 +07:00
|
|
|
/* for synchronous bio-based driver finish in-flight integrity i/o */
|
|
|
|
blk_flush_integrity();
|
|
|
|
|
2011-10-19 19:42:16 +07:00
|
|
|
/* @q won't process any more request, flush async actions */
|
2017-02-02 21:56:50 +07:00
|
|
|
del_timer_sync(&q->backing_dev_info->laptop_mode_wb_timer);
|
2011-10-19 19:42:16 +07:00
|
|
|
blk_sync_queue(q);
|
|
|
|
|
2014-12-09 22:57:48 +07:00
|
|
|
if (q->mq_ops)
|
|
|
|
blk_mq_free_queue(q);
|
2015-10-22 00:20:12 +07:00
|
|
|
percpu_ref_exit(&q->q_usage_counter);
|
2014-12-09 22:57:48 +07:00
|
|
|
|
2012-05-24 22:28:52 +07:00
|
|
|
spin_lock_irq(lock);
|
|
|
|
if (q->queue_lock != &q->__queue_lock)
|
|
|
|
q->queue_lock = &q->__queue_lock;
|
|
|
|
spin_unlock_irq(lock);
|
|
|
|
|
2011-10-19 19:42:16 +07:00
|
|
|
/* @q is and will stay empty, shutdown and put */
|
2006-03-19 06:34:37 +07:00
|
|
|
blk_put_queue(q);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
EXPORT_SYMBOL(blk_cleanup_queue);
|
|
|
|
|
2015-03-25 06:21:16 +07:00
|
|
|
/* Allocate memory local to the request queue */
|
2017-01-27 23:51:45 +07:00
|
|
|
static void *alloc_request_simple(gfp_t gfp_mask, void *data)
|
2015-03-25 06:21:16 +07:00
|
|
|
{
|
2017-01-27 23:51:45 +07:00
|
|
|
struct request_queue *q = data;
|
|
|
|
|
|
|
|
return kmem_cache_alloc_node(request_cachep, gfp_mask, q->node);
|
2015-03-25 06:21:16 +07:00
|
|
|
}
|
|
|
|
|
2017-01-27 23:51:45 +07:00
|
|
|
static void free_request_simple(void *element, void *data)
|
2015-03-25 06:21:16 +07:00
|
|
|
{
|
|
|
|
kmem_cache_free(request_cachep, element);
|
|
|
|
}
|
|
|
|
|
2017-01-27 23:51:45 +07:00
|
|
|
static void *alloc_request_size(gfp_t gfp_mask, void *data)
|
|
|
|
{
|
|
|
|
struct request_queue *q = data;
|
|
|
|
struct request *rq;
|
|
|
|
|
|
|
|
rq = kmalloc_node(sizeof(struct request) + q->cmd_size, gfp_mask,
|
|
|
|
q->node);
|
|
|
|
if (rq && q->init_rq_fn && q->init_rq_fn(q, rq, gfp_mask) < 0) {
|
|
|
|
kfree(rq);
|
|
|
|
rq = NULL;
|
|
|
|
}
|
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_request_size(void *element, void *data)
|
|
|
|
{
|
|
|
|
struct request_queue *q = data;
|
|
|
|
|
|
|
|
if (q->exit_rq_fn)
|
|
|
|
q->exit_rq_fn(q, element);
|
|
|
|
kfree(element);
|
|
|
|
}
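alloc_request_size() above allocates each request with q->cmd_size extra bytes and undoes the allocation when the driver's init hook fails. A userspace sketch of that pattern follows, with malloc() standing in for kmalloc_node(). All `toy_*` names are hypothetical, and placing the payload directly behind the header is an assumption of this sketch; the payload is treated as raw bytes, so alignment of larger types is not addressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request header; the driver payload follows it in memory. */
struct toy_request {
	int tag;
};

typedef int (*toy_init_fn)(struct toy_request *rq);

struct toy_pool_cfg {
	size_t cmd_size;      /* bytes of driver-private payload per request */
	toy_init_fn init_rq;  /* optional driver hook, may be NULL */
};

/* In this sketch the payload starts right after the header. */
void *toy_request_pdu(struct toy_request *rq)
{
	return rq + 1;
}

/* Allocate header plus payload; free again if the init hook reports failure. */
struct toy_request *toy_alloc_request(const struct toy_pool_cfg *cfg)
{
	struct toy_request *rq = malloc(sizeof(*rq) + cfg->cmd_size);

	if (rq && cfg->init_rq && cfg->init_rq(rq) < 0) {
		free(rq);
		rq = NULL;
	}
	return rq;
}

int toy_init_ok(struct toy_request *rq)
{
	rq->tag = 7;
	return 0;
}

int toy_init_fail(struct toy_request *rq)
{
	(void)rq;
	return -1;
}

void example_request_payload(void)
{
	struct toy_pool_cfg ok = { .cmd_size = 16, .init_rq = toy_init_ok };
	struct toy_pool_cfg bad = { .cmd_size = 16, .init_rq = toy_init_fail };
	struct toy_request *rq = toy_alloc_request(&ok);

	assert(rq && rq->tag == 7);
	assert((char *)toy_request_pdu(rq) == (char *)rq + sizeof(*rq));
	free(rq);
	assert(toy_alloc_request(&bad) == NULL);  /* nothing leaked on failure */
}
```

A single allocation for header and payload means one free path covers both, which is why the failure branch only needs one free() call.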
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
int blk_init_rl(struct request_list *rl, struct request_queue *q,
|
|
|
|
gfp_t gfp_mask)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2010-05-26 00:15:15 +07:00
|
|
|
if (unlikely(rl->rq_pool))
|
|
|
|
return 0;
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
rl->q = q;
|
2009-04-06 19:48:01 +07:00
|
|
|
rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
|
|
|
|
rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
|
|
|
|
init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
|
|
|
|
init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-01-27 23:51:45 +07:00
|
|
|
if (q->cmd_size) {
|
|
|
|
rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
|
|
|
|
alloc_request_size, free_request_size,
|
|
|
|
q, gfp_mask, q->node);
|
|
|
|
} else {
|
|
|
|
rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
|
|
|
|
alloc_request_simple, free_request_simple,
|
|
|
|
q, gfp_mask, q->node);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
if (!rl->rq_pool)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
void blk_exit_rl(struct request_list *rl)
|
|
|
|
{
|
|
|
|
if (rl->rq_pool)
|
|
|
|
mempool_destroy(rl->rq_pool);
|
|
|
|
}
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-11-10 16:39:44 +07:00
|
|
|
return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE);
|
2005-06-23 14:08:19 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_alloc_queue);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-11-26 15:13:05 +07:00
|
|
|
int blk_queue_enter(struct request_queue *q, bool nowait)
|
2015-10-22 00:20:12 +07:00
|
|
|
{
|
|
|
|
while (true) {
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (percpu_ref_tryget_live(&q->q_usage_counter))
|
|
|
|
return 0;
|
|
|
|
|
2015-11-26 15:13:05 +07:00
|
|
|
if (nowait)
|
2015-10-22 00:20:12 +07:00
|
|
|
return -EBUSY;
|
|
|
|
|
2017-03-27 19:06:56 +07:00
|
|
|
/*
|
2017-03-27 19:06:57 +07:00
|
|
|
* This pairs with the barrier in blk_freeze_queue_start():
|
2017-03-27 19:06:56 +07:00
|
|
|
* we need to order reading __PERCPU_REF_DEAD flag of
|
2017-03-27 19:06:58 +07:00
|
|
|
* .q_usage_counter and reading .mq_freeze_depth or
|
|
|
|
* queue dying flag, otherwise the following wait may
|
|
|
|
* never return if the two reads are reordered.
|
2017-03-27 19:06:56 +07:00
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
|
2015-10-22 00:20:12 +07:00
|
|
|
ret = wait_event_interruptible(q->mq_freeze_wq,
|
|
|
|
!atomic_read(&q->mq_freeze_depth) ||
|
|
|
|
blk_queue_dying(q));
|
|
|
|
if (blk_queue_dying(q))
|
|
|
|
return -ENODEV;
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void blk_queue_exit(struct request_queue *q)
|
|
|
|
{
|
|
|
|
percpu_ref_put(&q->q_usage_counter);
|
|
|
|
}
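The return codes of blk_queue_enter() can be traced with a small single-threaded model. This is a sketch under loud assumptions: `toy_gate`, `toy_enter()` and `toy_exit()` are hypothetical, a dying gate is simply treated as not live, and the place where the kernel function would sleep on mq_freeze_wq is replaced by returning -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the gate in front of a queue. */
struct toy_gate {
	int users;         /* stands in for q_usage_counter */
	int freeze_depth;  /* stands in for mq_freeze_depth */
	bool dying;        /* stands in for the queue dying flag */
};

/* Take a usage reference only while the gate is live. */
static bool toy_tryget_live(struct toy_gate *g)
{
	if (g->freeze_depth > 0 || g->dying)
		return false;
	g->users++;
	return true;
}

/* Same decision order as the function above: try, then nowait, then dying. */
int toy_enter(struct toy_gate *g, bool nowait)
{
	if (toy_tryget_live(g))
		return 0;
	if (nowait)
		return -EBUSY;
	if (g->dying)
		return -ENODEV;
	return -EAGAIN;  /* model only: real code sleeps here and retries */
}

void toy_exit(struct toy_gate *g)
{
	assert(g->users > 0);
	g->users--;
}

void example_enter_exit(void)
{
	struct toy_gate g = { 0 };

	assert(toy_enter(&g, false) == 0);       /* live gate: reference taken */
	toy_exit(&g);
	g.freeze_depth = 1;
	assert(toy_enter(&g, true) == -EBUSY);   /* frozen, caller cannot wait */
	assert(toy_enter(&g, false) == -EAGAIN); /* frozen, caller would sleep */
	g.dying = true;
	assert(toy_enter(&g, false) == -ENODEV); /* dying: give up for good */
	assert(g.users == 0);                    /* every enter was paired */
}
```

Every successful `toy_enter()` must be paired with exactly one `toy_exit()`, mirroring the blk_queue_enter()/blk_queue_exit() pairing.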
|
|
|
|
|
|
|
|
static void blk_queue_usage_counter_release(struct percpu_ref *ref)
|
|
|
|
{
|
|
|
|
struct request_queue *q =
|
|
|
|
container_of(ref, struct request_queue, q_usage_counter);
|
|
|
|
|
|
|
|
wake_up_all(&q->mq_freeze_wq);
|
|
|
|
}
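blk_queue_usage_counter_release() receives only a pointer to the embedded counter and recovers the enclosing queue with container_of(). A userspace sketch of the same technique follows. `toy_container_of` is a simplified stand-in without the kernel macro's type checking, and the `toy_*` structures are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_ref {
	int count;
	void (*release)(struct toy_ref *ref);  /* called when count hits zero */
};

struct toy_owner {
	int wakeups;           /* counts how often waiters would be woken */
	struct toy_ref usage;  /* embedded counter, like q_usage_counter */
};

/* The callback only sees the embedded ref and recovers its owner from it. */
void toy_owner_release(struct toy_ref *ref)
{
	struct toy_owner *o = toy_container_of(ref, struct toy_owner, usage);

	o->wakeups++;  /* stands in for wake_up_all() */
}

void toy_ref_put(struct toy_ref *ref)
{
	if (--ref->count == 0)
		ref->release(ref);
}

void example_container_of(void)
{
	struct toy_owner o = {
		.wakeups = 0,
		.usage = { .count = 2, .release = toy_owner_release },
	};

	toy_ref_put(&o.usage);
	assert(o.wakeups == 0);  /* one reference still held */
	toy_ref_put(&o.usage);
	assert(o.wakeups == 1);  /* last put ran the release callback */
}
```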
|
|
|
|
|
2015-10-30 19:57:30 +07:00
|
|
|
static void blk_rq_timed_out_timer(unsigned long data)
|
|
|
|
{
|
|
|
|
struct request_queue *q = (struct request_queue *)data;
|
|
|
|
|
|
|
|
kblockd_schedule_work(&q->timeout_work);
|
|
|
|
}
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
|
2005-06-23 14:08:19 +07:00
|
|
|
{
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *q;
|
2005-06-23 14:08:19 +07:00
|
|
|
|
2008-01-29 20:51:59 +07:00
|
|
|
q = kmem_cache_alloc_node(blk_requestq_cachep,
|
2007-07-17 18:03:29 +07:00
|
|
|
gfp_mask | __GFP_ZERO, node_id);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (!q)
|
|
|
|
return NULL;
|
|
|
|
|
2012-03-23 15:58:54 +07:00
|
|
|
q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask);
|
2011-12-14 06:33:37 +07:00
|
|
|
if (q->id < 0)
|
2014-05-27 22:35:14 +07:00
|
|
|
goto fail_q;
|
2011-12-14 06:33:37 +07:00
|
|
|
|
2015-04-24 12:37:18 +07:00
|
|
|
q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
|
|
|
|
if (!q->bio_split)
|
|
|
|
goto fail_id;
|
|
|
|
|
2017-02-02 21:56:51 +07:00
|
|
|
q->backing_dev_info = bdi_alloc_node(gfp_mask, node_id);
|
|
|
|
if (!q->backing_dev_info)
|
|
|
|
goto fail_split;
|
|
|
|
|
2017-03-22 06:20:01 +07:00
|
|
|
q->stats = blk_alloc_queue_stats();
|
|
|
|
if (!q->stats)
|
|
|
|
goto fail_stats;
|
|
|
|
|
2017-02-02 21:56:50 +07:00
|
|
|
q->backing_dev_info->ra_pages =
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 19:29:47 +07:00
|
|
|
(VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
|
2017-02-02 21:56:50 +07:00
|
|
|
q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
|
|
|
|
q->backing_dev_info->name = "block";
|
2011-11-23 16:59:13 +07:00
|
|
|
q->node = node_id;
|
2009-06-12 19:42:56 +07:00
|
|
|
|
2017-02-02 21:56:50 +07:00
|
|
|
setup_timer(&q->backing_dev_info->laptop_mode_wb_timer,
|
2010-04-06 19:25:14 +07:00
|
|
|
laptop_mode_timer_fn, (unsigned long) q);
|
2008-09-14 19:55:09 +07:00
|
|
|
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
|
2012-03-07 03:24:55 +07:00
|
|
|
INIT_LIST_HEAD(&q->queue_head);
|
2008-09-14 19:55:09 +07:00
|
|
|
INIT_LIST_HEAD(&q->timeout_list);
|
2011-12-14 06:33:41 +07:00
|
|
|
INIT_LIST_HEAD(&q->icq_list);
|
2012-03-06 04:15:18 +07:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
2012-03-06 04:15:20 +07:00
|
|
|
INIT_LIST_HEAD(&q->blkg_list);
|
2012-03-06 04:15:18 +07:00
|
|
|
#endif
|
2011-03-02 23:08:00 +07:00
|
|
|
INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);
|
2006-03-19 06:34:37 +07:00
|
|
|
|
2008-01-29 20:51:59 +07:00
|
|
|
kobject_init(&q->kobj, &blk_queue_ktype);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-03-19 06:34:37 +07:00
|
|
|
mutex_init(&q->sysfs_lock);
|
2008-05-15 06:05:54 +07:00
|
|
|
spin_lock_init(&q->__queue_lock);
|
2006-03-19 06:34:37 +07:00
|
|
|
|
2011-03-03 07:04:42 +07:00
|
|
|
/*
|
|
|
|
* By default initialize queue_lock to internal lock and driver can
|
|
|
|
* override it later if need be.
|
|
|
|
*/
|
|
|
|
q->queue_lock = &q->__queue_lock;
|
|
|
|
|
2012-04-14 03:11:31 +07:00
|
|
|
/*
|
|
|
|
* A queue starts its life with bypass turned on to avoid
|
|
|
|
* unnecessary bypass on/off overhead and nasty surprises during
|
2012-09-21 04:08:52 +07:00
|
|
|
* init. The initial bypass will be finished when the queue is
|
|
|
|
* registered by blk_register_queue().
|
2012-04-14 03:11:31 +07:00
|
|
|
*/
|
|
|
|
q->bypass_depth = 1;
|
|
|
|
__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
|
|
|
|
|
2013-10-24 15:20:05 +07:00
|
|
|
init_waitqueue_head(&q->mq_freeze_wq);
|
|
|
|
|
2015-10-22 00:20:12 +07:00
|
|
|
/*
|
|
|
|
* Init percpu_ref in atomic mode so that it's faster to shutdown.
|
|
|
|
* See blk_register_queue() for details.
|
|
|
|
*/
|
|
|
|
if (percpu_ref_init(&q->q_usage_counter,
|
|
|
|
blk_queue_usage_counter_release,
|
|
|
|
PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
|
2013-10-14 23:11:36 +07:00
|
|
|
goto fail_bdi;
|
2012-03-06 04:15:05 +07:00
|
|
|
|
2015-10-22 00:20:12 +07:00
|
|
|
if (blkcg_init_queue(q))
|
|
|
|
goto fail_ref;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
return q;
|
2011-12-14 06:33:37 +07:00
|
|
|
|
2015-10-22 00:20:12 +07:00
|
|
|
fail_ref:
|
|
|
|
percpu_ref_exit(&q->q_usage_counter);
|
2013-10-14 23:11:36 +07:00
|
|
|
fail_bdi:
|
2017-03-22 06:20:01 +07:00
|
|
|
blk_free_queue_stats(q->stats);
|
|
|
|
fail_stats:
|
2017-02-02 21:56:51 +07:00
|
|
|
bdi_put(q->backing_dev_info);
|
2015-04-24 12:37:18 +07:00
|
|
|
fail_split:
|
|
|
|
bioset_free(q->bio_split);
|
2011-12-14 06:33:37 +07:00
|
|
|
fail_id:
|
|
|
|
ida_simple_remove(&blk_queue_ida, q->id);
|
|
|
|
fail_q:
|
|
|
|
kmem_cache_free(blk_requestq_cachep, q);
|
|
|
|
return NULL;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2005-06-23 14:08:19 +07:00
|
|
|
EXPORT_SYMBOL(blk_alloc_queue_node);
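blk_alloc_queue_node() acquires its resources in order and, on failure, jumps to a label that releases everything acquired so far in reverse order. The sketch below shows that unwinding pattern in userspace. `toy_dev`, the `fail_at` test hook and the allocation counter are hypothetical and exist only to make the behaviour observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int toy_live;  /* allocations currently outstanding */

static void *toy_calloc(size_t size)
{
	void *p = calloc(1, size);

	if (p)
		toy_live++;
	return p;
}

static void toy_free(void *p)
{
	if (p) {
		toy_live--;
		free(p);
	}
}

int toy_live_count(void)
{
	return toy_live;
}

struct toy_dev {
	int *id;
	int *split;
	int *bdi;
};

/* fail_at: 0 means no failure, N means step N pretends to fail (test hook). */
static int *toy_step(int step, int fail_at)
{
	return step == fail_at ? NULL : toy_calloc(sizeof(int));
}

struct toy_dev *toy_dev_alloc(int fail_at)
{
	struct toy_dev *d = toy_calloc(sizeof(*d));

	if (!d)
		return NULL;
	d->id = toy_step(1, fail_at);
	if (!d->id)
		goto fail_dev;
	d->split = toy_step(2, fail_at);
	if (!d->split)
		goto fail_id;
	d->bdi = toy_step(3, fail_at);
	if (!d->bdi)
		goto fail_split;
	return d;

	/* Each label releases what was acquired before the failing step. */
fail_split:
	toy_free(d->split);
fail_id:
	toy_free(d->id);
fail_dev:
	toy_free(d);
	return NULL;
}

void toy_dev_free(struct toy_dev *d)
{
	toy_free(d->bdi);
	toy_free(d->split);
	toy_free(d->id);
	toy_free(d);
}

void example_unwind(void)
{
	struct toy_dev *d;
	int step;

	for (step = 1; step <= 3; step++) {
		assert(toy_dev_alloc(step) == NULL);
		assert(toy_live_count() == 0);  /* nothing leaked on any path */
	}
	d = toy_dev_alloc(0);
	assert(d && toy_live_count() == 4);
	toy_dev_free(d);
	assert(toy_live_count() == 0);
}
```

Because the labels fall through into one another, adding a new resource needs only one new label and one new release line, placed so that the release order stays the reverse of the acquire order.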
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_init_queue - prepare a request queue for use with a block device
|
|
|
|
* @rfn: The function to be called to process requests that have been
|
|
|
|
* placed on the queue.
|
|
|
|
* @lock: Request queue spin lock
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* If a block device wishes to use the standard request handling procedures,
|
|
|
|
* which sorts requests and coalesces adjacent requests, then it must
|
|
|
|
* call blk_init_queue(). The function @rfn will be called when there
|
|
|
|
* are requests on the queue that need to be processed. If the device
|
|
|
|
* supports plugging, then @rfn may not be called immediately when requests
|
|
|
|
* are available on the queue, but may be called at some time later instead.
|
|
|
|
* Plugged queues are generally unplugged when a buffer belonging to one
|
|
|
|
* of the requests on the queue is needed, or due to memory pressure.
|
|
|
|
*
|
|
|
|
* @rfn is not required, or even expected, to remove all requests off the
|
|
|
|
* queue, but only as many as it can handle at a time. If it does leave
|
|
|
|
* requests on the queue, it is responsible for arranging that the requests
|
|
|
|
* get dealt with eventually.
|
|
|
|
*
|
|
|
|
* The queue spin lock must be held while manipulating the requests on the
|
2006-06-05 17:09:01 +07:00
|
|
|
* request queue; this lock will be taken also from interrupt context, so irq
|
|
|
|
* disabling is needed for it.
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
2008-08-20 01:13:11 +07:00
|
|
|
* Function returns a pointer to the initialized request queue, or %NULL if
|
2005-04-17 05:20:36 +07:00
|
|
|
* it didn't succeed.
|
|
|
|
*
|
|
|
|
* Note:
|
|
|
|
* blk_init_queue() must be paired with a blk_cleanup_queue() call
|
|
|
|
* when the block device is deactivated (such as at module unload).
|
|
|
|
**/
|
2005-06-23 14:08:19 +07:00
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-11-10 16:39:44 +07:00
|
|
|
return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);
|
2005-06-23 14:08:19 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_init_queue);
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *
|
2005-06-23 14:08:19 +07:00
|
|
|
blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
|
|
|
|
{
|
2017-01-03 18:52:44 +07:00
|
|
|
struct request_queue *q;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-01-03 18:52:44 +07:00
|
|
|
q = blk_alloc_queue_node(GFP_KERNEL, node_id);
|
|
|
|
if (!q)
|
2010-06-04 00:34:52 +07:00
|
|
|
return NULL;
|
|
|
|
|
2017-01-03 18:52:44 +07:00
|
|
|
q->request_fn = rfn;
|
|
|
|
if (lock)
|
|
|
|
q->queue_lock = lock;
|
|
|
|
if (blk_init_allocated_queue(q) < 0) {
|
|
|
|
blk_cleanup_queue(q);
|
|
|
|
return NULL;
|
|
|
|
}
|
2014-02-10 23:29:00 +07:00
|
|
|
|
2014-03-09 07:20:01 +07:00
|
|
|
return q;
|
2010-05-11 13:57:42 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_init_queue_node);
|
|
|
|
|
2015-11-06 00:41:16 +07:00
|
|
|
static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio);
|
2015-05-12 01:06:32 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-01-03 18:52:44 +07:00
|
|
|
int blk_init_allocated_queue(struct request_queue *q)
|
|
|
|
{
|
2017-01-27 23:51:45 +07:00
|
|
|
q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, q->cmd_size);
|
2014-09-25 22:23:44 +07:00
|
|
|
if (!q->fq)
|
2017-01-03 18:52:44 +07:00
|
|
|
return -ENOMEM;
|
2014-03-09 07:20:01 +07:00
|
|
|
|
2017-01-27 23:51:45 +07:00
|
|
|
if (q->init_rq_fn && q->init_rq_fn(q, q->fq->flush_rq, GFP_KERNEL))
|
|
|
|
goto out_free_flush_queue;
|
2014-03-09 07:20:01 +07:00
|
|
|
|
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 05:05:44 +07:00
|
|
|
if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
|
2017-01-27 23:51:45 +07:00
|
|
|
goto out_exit_flush_rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-10-30 19:57:30 +07:00
|
|
|
INIT_WORK(&q->timeout_work, blk_timeout_work);
|
2012-09-21 04:09:30 +07:00
|
|
|
q->queue_flags |= QUEUE_FLAG_DEFAULT;
|
2011-03-03 07:04:42 +07:00
|
|
|
|
2009-03-06 14:48:33 +07:00
|
|
|
/*
|
|
|
|
* This also sets hw/phys segments, boundary and size
|
|
|
|
*/
|
2011-09-12 17:03:37 +07:00
|
|
|
blk_queue_make_request(q, blk_queue_bio);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-02-20 23:01:57 +07:00
|
|
|
q->sg_reserved_size = INT_MAX;
|
|
|
|
|
2013-10-16 05:42:16 +07:00
|
|
|
/* Protect q->elevator from elevator_change */
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
|
|
|
|
2012-04-14 03:11:31 +07:00
|
|
|
/* init elevator */
|
2013-10-16 05:42:16 +07:00
|
|
|
if (elevator_init(q, NULL)) {
|
|
|
|
mutex_unlock(&q->sysfs_lock);
|
2017-01-27 23:51:45 +07:00
|
|
|
goto out_exit_flush_rq;
|
2013-10-16 05:42:16 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&q->sysfs_lock);
|
2017-01-03 18:52:44 +07:00
|
|
|
return 0;
|
2013-10-16 05:42:16 +07:00
|
|
|
|
2017-01-27 23:51:45 +07:00
|
|
|
out_exit_flush_rq:
|
|
|
|
if (q->exit_rq_fn)
|
|
|
|
q->exit_rq_fn(q, q->fq->flush_rq);
|
|
|
|
out_free_flush_queue:
|
2014-09-25 22:23:44 +07:00
|
|
|
blk_free_flush_queue(q->fq);
|
2017-01-03 18:52:44 +07:00
|
|
|
return -ENOMEM;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2011-11-23 16:59:13 +07:00
|
|
|
EXPORT_SYMBOL(blk_init_allocated_queue);
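The error handling in blk_init_allocated_queue() acquires resources in order and releases them in reverse through labelled exits. A minimal userspace sketch of that pattern follows. The names `demo_queue`, `demo_queue_init` and the `fail_at` knob are hypothetical and exist only for illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queue that owns two resources. */
struct demo_queue {
	void *flush;     /* first resource, loosely models q->fq */
	void *req_list;  /* second resource, loosely models q->root_rl */
};

/*
 * Acquire both resources in order. fail_at forces a failure at step 1 or 2
 * (0 means no forced failure); it is an illustration-only parameter.
 * Returns 0 on success or -ENOMEM after releasing what was acquired.
 */
int demo_queue_init(struct demo_queue *q, int fail_at)
{
	q->flush = (fail_at == 1) ? NULL : malloc(16);
	if (!q->flush)
		return -ENOMEM;  /* nothing acquired yet, nothing to undo */

	q->req_list = (fail_at == 2) ? NULL : malloc(16);
	if (!q->req_list)
		goto out_free_flush;

	return 0;

out_free_flush:
	/* Undo step 1 only; later steps never succeeded. */
	free(q->flush);
	q->flush = NULL;
	return -ENOMEM;
}
```

Each label undoes exactly the steps that completed before the failing one, so adding a third resource only requires one more label above `out_free_flush`.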
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-12-14 06:33:38 +07:00
|
|
|
bool blk_get_queue(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-11-28 19:42:38 +07:00
|
|
|
if (likely(!blk_queue_dying(q))) {
|
2011-12-14 06:33:38 +07:00
|
|
|
__blk_get_queue(q);
|
|
|
|
return true;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2011-12-14 06:33:38 +07:00
|
|
|
return false;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2011-05-27 12:44:43 +07:00
|
|
|
EXPORT_SYMBOL(blk_get_queue);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
static inline void blk_free_request(struct request_list *rl, struct request *rq)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2016-10-20 20:12:13 +07:00
|
|
|
if (rq->rq_flags & RQF_ELVPRIV) {
|
2012-06-05 10:40:59 +07:00
|
|
|
elv_put_request(rl->q, rq);
|
2011-12-14 06:33:42 +07:00
|
|
|
if (rq->elv.icq)
|
2012-02-07 13:51:30 +07:00
|
|
|
put_io_context(rq->elv.icq->ioc);
|
2011-12-14 06:33:42 +07:00
|
|
|
}
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
mempool_free(rq, rl->rq_pool);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ioc_batching returns true if the ioc is a valid batching request and
|
|
|
|
* should be given priority access to a request.
|
|
|
|
*/
|
2007-07-24 14:28:11 +07:00
|
|
|
static inline int ioc_batching(struct request_queue *q, struct io_context *ioc)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
if (!ioc)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure the process is able to allocate at least 1 request
|
|
|
|
* even if the batch times out, otherwise we could theoretically
|
|
|
|
* lose wakeups.
|
|
|
|
*/
|
|
|
|
return ioc->nr_batch_requests == q->nr_batching ||
|
|
|
|
(ioc->nr_batch_requests > 0
|
|
|
|
&& time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
|
|
|
|
}
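The batching predicate above can be modelled in userspace as a rough sketch. `DEMO_BATCH_REQUESTS`, `DEMO_BATCH_TIME`, `demo_ioc` and the explicit `now` argument are assumptions made for this example; the real values and the jiffies counter come from the kernel. The helper `demo_time_before` imitates the wraparound-safe comparison that time_before() provides.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; real values live in the kernel headers. */
#define DEMO_BATCH_REQUESTS 32u   /* models q->nr_batching */
#define DEMO_BATCH_TIME     20ul  /* models BLK_BATCH_TIME, in ticks */

struct demo_ioc {
	unsigned int nr_batch_requests;
	unsigned long last_waited;
};

/* Wraparound-safe "a is before b", in the spirit of time_before(). */
static bool demo_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

bool demo_ioc_batching(const struct demo_ioc *ioc, unsigned long now)
{
	if (!ioc)
		return false;
	/* An untouched batch always allows at least one request. */
	if (ioc->nr_batch_requests == DEMO_BATCH_REQUESTS)
		return true;
	/* Otherwise the batch must have budget left and not be timed out. */
	return ioc->nr_batch_requests > 0 &&
	       demo_time_before(now, ioc->last_waited + DEMO_BATCH_TIME);
}
```

The first check is what guarantees one allocation even after the time window has expired, which matches the rationale given in the comment inside ioc_batching().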
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ioc_set_batching sets ioc to be a new "batcher" if it is not one. This
|
|
|
|
* will cause the process to be a "batcher" on all queues in the system. This
|
|
|
|
* is the behaviour we want though - once it gets a wakeup it should be given
|
|
|
|
* a nice run.
|
|
|
|
*/
|
2007-07-24 14:28:11 +07:00
|
|
|
static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
if (!ioc || ioc_batching(q, ioc))
|
|
|
|
return;
|
|
|
|
|
|
|
|
ioc->nr_batch_requests = q->nr_batching;
|
|
|
|
ioc->last_waited = jiffies;
|
|
|
|
}
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
static void __freed_request(struct request_list *rl, int sync)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
struct request_queue *q = rl->q;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-05-23 04:13:42 +07:00
|
|
|
if (rl->count[sync] < queue_congestion_off_threshold(q))
|
|
|
|
blk_clear_congested(rl, sync);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-04-06 19:48:01 +07:00
|
|
|
if (rl->count[sync] + 1 <= q->nr_requests) {
|
|
|
|
if (waitqueue_active(&rl->wait[sync]))
|
|
|
|
wake_up(&rl->wait[sync]);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
blk_clear_rl_full(rl, sync);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A request has just been released. Account for it, update the full and
|
|
|
|
* congestion status, wake up any waiters. Called under q->queue_lock.
|
|
|
|
*/
|
2016-10-20 20:12:13 +07:00
|
|
|
static void freed_request(struct request_list *rl, bool sync,
|
|
|
|
req_flags_t rq_flags)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
struct request_queue *q = rl->q;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-06-05 10:40:58 +07:00
|
|
|
q->nr_rqs[sync]--;
|
2009-04-06 19:48:01 +07:00
|
|
|
rl->count[sync]--;
|
2016-10-20 20:12:13 +07:00
|
|
|
if (rq_flags & RQF_ELVPRIV)
|
2012-06-05 10:40:58 +07:00
|
|
|
q->nr_rqs_elvpriv--;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
__freed_request(rl, sync);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-04-06 19:48:01 +07:00
|
|
|
if (unlikely(rl->starved[sync ^ 1]))
|
2012-06-05 10:40:59 +07:00
|
|
|
__freed_request(rl, sync ^ 1);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2014-05-21 00:49:02 +07:00
|
|
|
int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
|
|
|
|
{
|
|
|
|
struct request_list *rl;
|
2015-05-23 04:13:42 +07:00
|
|
|
int on_thresh, off_thresh;
|
2014-05-21 00:49:02 +07:00
|
|
|
|
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
q->nr_requests = nr;
|
|
|
|
blk_queue_congestion_threshold(q);
|
2015-05-23 04:13:42 +07:00
|
|
|
on_thresh = queue_congestion_on_threshold(q);
|
|
|
|
off_thresh = queue_congestion_off_threshold(q);
|
2014-05-21 00:49:02 +07:00
|
|
|
|
2015-05-23 04:13:42 +07:00
|
|
|
blk_queue_for_each_rl(rl, q) {
|
|
|
|
if (rl->count[BLK_RW_SYNC] >= on_thresh)
|
|
|
|
blk_set_congested(rl, BLK_RW_SYNC);
|
|
|
|
else if (rl->count[BLK_RW_SYNC] < off_thresh)
|
|
|
|
blk_clear_congested(rl, BLK_RW_SYNC);
|
2014-05-21 00:49:02 +07:00
|
|
|
|
2015-05-23 04:13:42 +07:00
|
|
|
if (rl->count[BLK_RW_ASYNC] >= on_thresh)
|
|
|
|
blk_set_congested(rl, BLK_RW_ASYNC);
|
|
|
|
else if (rl->count[BLK_RW_ASYNC] < off_thresh)
|
|
|
|
blk_clear_congested(rl, BLK_RW_ASYNC);
|
2014-05-21 00:49:02 +07:00
|
|
|
|
|
|
|
if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
|
|
|
|
blk_set_rl_full(rl, BLK_RW_SYNC);
|
|
|
|
} else {
|
|
|
|
blk_clear_rl_full(rl, BLK_RW_SYNC);
|
|
|
|
wake_up(&rl->wait[BLK_RW_SYNC]);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
|
|
|
|
blk_set_rl_full(rl, BLK_RW_ASYNC);
|
|
|
|
} else {
|
|
|
|
blk_clear_rl_full(rl, BLK_RW_ASYNC);
|
|
|
|
wake_up(&rl->wait[BLK_RW_ASYNC]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
return 0;
|
|
|
|
}
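The set/clear logic in blk_update_nr_requests() is a hysteresis: a list becomes congested at the "on" threshold and is cleared only once it drops below the lower "off" threshold. A small sketch, assuming a hypothetical `demo_rl` structure and caller-supplied thresholds (how the kernel derives the thresholds is not shown in this section):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one direction of a request list. */
struct demo_rl {
	int count;       /* requests currently allocated */
	bool congested;  /* current congestion state */
};

/* Assumes on_thresh > off_thresh; both are illustrative inputs. */
void demo_update_congestion(struct demo_rl *rl, int on_thresh, int off_thresh)
{
	if (rl->count >= on_thresh)
		rl->congested = true;
	else if (rl->count < off_thresh)
		rl->congested = false;
	/* Between the two thresholds the previous state is kept. */
}
```

Keeping the old state inside the band avoids toggling the congestion flag on every allocation and release when the count hovers near a single limit.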
|
|
|
|
|
2011-10-19 19:33:05 +07:00
|
|
|
/**
|
2012-06-05 10:40:55 +07:00
|
|
|
* __get_request - get a free request
|
2012-06-05 10:40:59 +07:00
|
|
|
* @rl: request list to allocate from
|
2016-10-28 21:48:16 +07:00
|
|
|
* @op: operation and flags
|
2011-10-19 19:33:05 +07:00
|
|
|
* @bio: bio to allocate request for (can be %NULL)
|
|
|
|
* @gfp_mask: allocation mask
|
|
|
|
*
|
|
|
|
* Get a free request from @rl. This function may fail under memory
|
|
|
|
* pressure or if @q is dead.
|
|
|
|
*
|
2014-09-08 23:27:23 +07:00
|
|
|
* Must be called with @q->queue_lock held.
|
2014-08-28 21:15:21 +07:00
|
|
|
* Returns ERR_PTR on failure, with @q->queue_lock held.
|
|
|
|
* Returns request pointer on success, with @q->queue_lock *not held*.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2016-10-28 21:48:16 +07:00
|
|
|
static struct request *__get_request(struct request_list *rl, unsigned int op,
|
|
|
|
struct bio *bio, gfp_t gfp_mask)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
struct request_queue *q = rl->q;
|
2012-03-06 04:15:23 +07:00
|
|
|
struct request *rq;
|
2012-06-05 10:40:56 +07:00
|
|
|
struct elevator_type *et = q->elevator->type;
|
|
|
|
struct io_context *ioc = rq_ioc(bio);
|
2011-12-14 06:33:42 +07:00
|
|
|
struct io_cq *icq = NULL;
|
2016-10-28 21:48:16 +07:00
|
|
|
const bool is_sync = op_is_sync(op);
|
2011-10-19 19:31:22 +07:00
|
|
|
int may_queue;
|
2016-10-20 20:12:13 +07:00
|
|
|
req_flags_t rq_flags = RQF_ALLOCED;
|
2005-11-12 17:09:12 +07:00
|
|
|
|
2012-11-28 19:42:38 +07:00
|
|
|
if (unlikely(blk_queue_dying(q)))
|
2014-08-28 21:15:21 +07:00
|
|
|
return ERR_PTR(-ENODEV);
|
2011-10-19 19:33:05 +07:00
|
|
|
|
2016-10-28 21:48:16 +07:00
|
|
|
may_queue = elv_may_queue(q, op);
|
2005-11-12 17:09:12 +07:00
|
|
|
if (may_queue == ELV_MQUEUE_NO)
|
|
|
|
goto rq_starved;
|
|
|
|
|
2009-04-06 19:48:01 +07:00
|
|
|
if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
|
|
|
|
if (rl->count[is_sync]+1 >= q->nr_requests) {
|
2005-11-12 17:09:12 +07:00
|
|
|
/*
|
|
|
|
* The queue will fill after this allocation, so set
|
|
|
|
* it as full, and mark this process as "batching".
|
|
|
|
* This process will be allowed to complete a batch of
|
|
|
|
* requests, others will be blocked.
|
|
|
|
*/
|
2012-06-05 10:40:59 +07:00
|
|
|
if (!blk_rl_full(rl, is_sync)) {
|
2005-11-12 17:09:12 +07:00
|
|
|
ioc_set_batching(q, ioc);
|
2012-06-05 10:40:59 +07:00
|
|
|
blk_set_rl_full(rl, is_sync);
|
2005-11-12 17:09:12 +07:00
|
|
|
} else {
|
|
|
|
if (may_queue != ELV_MQUEUE_MUST
|
|
|
|
&& !ioc_batching(q, ioc)) {
|
|
|
|
/*
|
|
|
|
* The queue is full and the allocating
|
|
|
|
* process is not a "batcher", and not
|
|
|
|
* exempted by the IO scheduler
|
|
|
|
*/
|
2014-08-28 21:15:21 +07:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2005-11-12 17:09:12 +07:00
|
|
|
}
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2015-05-23 04:13:42 +07:00
|
|
|
blk_set_congested(rl, is_sync);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2005-06-28 21:35:11 +07:00
|
|
|
/*
|
|
|
|
* Only allow batching queuers to allocate up to 50% over the defined
|
|
|
|
* limit of requests, otherwise we could have thousands of requests
|
|
|
|
* allocated with any setting of ->nr_requests
|
|
|
|
*/
|
2009-04-06 19:48:01 +07:00
|
|
|
if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
|
2014-08-28 21:15:21 +07:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2005-06-29 21:15:40 +07:00
|
|
|
|
2012-06-05 10:40:58 +07:00
|
|
|
q->nr_rqs[is_sync]++;
|
2009-04-06 19:48:01 +07:00
|
|
|
rl->count[is_sync]++;
|
|
|
|
rl->starved[is_sync] = 0;
|
2005-10-28 13:29:39 +07:00
|
|
|
|
2011-12-14 06:33:42 +07:00
|
|
|
/*
|
|
|
|
* Decide whether the new request will be managed by elevator. If
|
2016-10-20 20:12:13 +07:00
|
|
|
* so, mark @rq_flags and increment elvpriv. Non-zero elvpriv will
|
2011-12-14 06:33:42 +07:00
|
|
|
* prevent the current elevator from being destroyed until the new
|
|
|
|
* request is freed. This guarantees icq's won't be destroyed and
|
|
|
|
* makes creating new ones safe.
|
|
|
|
*
|
2017-01-25 17:17:11 +07:00
|
|
|
* Flush requests do not use the elevator so skip initialization.
|
|
|
|
* This allows a request to share the flush and elevator data.
|
|
|
|
*
|
2011-12-14 06:33:42 +07:00
|
|
|
* Also, lookup icq while holding queue_lock. If it doesn't exist,
|
|
|
|
* it will be created after releasing queue_lock.
|
|
|
|
*/
|
2017-01-25 17:17:11 +07:00
|
|
|
if (!op_is_flush(op) && !blk_queue_bypass(q)) {
|
2016-10-20 20:12:13 +07:00
|
|
|
rq_flags |= RQF_ELVPRIV;
|
2012-06-05 10:40:58 +07:00
|
|
|
q->nr_rqs_elvpriv++;
|
2011-12-14 06:33:42 +07:00
|
|
|
if (et->icq_cache && ioc)
|
|
|
|
icq = ioc_lookup_icq(ioc, q);
|
2011-02-11 17:05:46 +07:00
|
|
|
}
|
2005-10-28 13:29:39 +07:00
|
|
|
|
2010-10-25 03:06:02 +07:00
|
|
|
if (blk_queue_io_stat(q))
|
2016-10-20 20:12:13 +07:00
|
|
|
rq_flags |= RQF_IO_STAT;
|
2005-04-17 05:20:36 +07:00
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
|
2012-04-20 06:29:21 +07:00
|
|
|
/* allocate and init request */
|
2012-06-05 10:40:59 +07:00
|
|
|
rq = mempool_alloc(rl->rq_pool, gfp_mask);
|
2012-04-20 06:29:21 +07:00
|
|
|
if (!rq)
|
2012-03-06 04:15:23 +07:00
|
|
|
goto fail_alloc;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-04-20 06:29:21 +07:00
|
|
|
blk_rq_init(q, rq);
|
2012-06-27 05:05:44 +07:00
|
|
|
blk_rq_set_rl(rq, rl);
|
2016-10-28 21:48:16 +07:00
|
|
|
rq->cmd_flags = op;
|
2016-10-20 20:12:13 +07:00
|
|
|
rq->rq_flags = rq_flags;
|
2012-04-20 06:29:21 +07:00
|
|
|
|
2012-04-20 06:29:22 +07:00
|
|
|
/* init elvpriv */
|
2016-10-20 20:12:13 +07:00
|
|
|
if (rq_flags & RQF_ELVPRIV) {
|
2012-04-20 06:29:22 +07:00
|
|
|
if (unlikely(et->icq_cache && !icq)) {
|
2012-06-05 10:40:56 +07:00
|
|
|
if (ioc)
|
|
|
|
icq = ioc_create_icq(ioc, q, gfp_mask);
|
2012-04-20 06:29:22 +07:00
|
|
|
if (!icq)
|
|
|
|
goto fail_elvpriv;
|
2012-04-20 06:29:21 +07:00
|
|
|
}
|
2012-04-20 06:29:22 +07:00
|
|
|
|
|
|
|
rq->elv.icq = icq;
|
|
|
|
if (unlikely(elv_set_request(q, rq, bio, gfp_mask)))
|
|
|
|
goto fail_elvpriv;
|
|
|
|
|
|
|
|
/* @rq->elv.icq holds io_context until @rq is freed */
|
2012-04-20 06:29:21 +07:00
|
|
|
if (icq)
|
|
|
|
get_io_context(icq->ioc);
|
|
|
|
}
|
2012-04-20 06:29:22 +07:00
|
|
|
out:
|
2005-11-12 17:09:12 +07:00
|
|
|
/*
|
|
|
|
* ioc may be NULL here, and ioc_batching will be false. That's
|
|
|
|
* OK, if the queue is under the request limit then requests need
|
|
|
|
* not count toward the nr_batch_requests limit. There will always
|
|
|
|
* be some limit enforced by BLK_BATCH_TIME.
|
|
|
|
*/
|
2005-04-17 05:20:36 +07:00
|
|
|
if (ioc_batching(q, ioc))
|
|
|
|
ioc->nr_batch_requests--;
|
2008-01-31 19:03:55 +07:00
|
|
|
|
2016-06-06 02:32:11 +07:00
|
|
|
trace_block_getrq(q, bio, op);
|
2005-04-17 05:20:36 +07:00
|
|
|
return rq;
|
2012-03-06 04:15:23 +07:00
|
|
|
|
2012-04-20 06:29:22 +07:00
|
|
|
fail_elvpriv:
|
|
|
|
/*
|
|
|
|
* elvpriv init failed. ioc, icq and elvpriv aren't mempool backed
|
|
|
|
* and may fail indefinitely under memory pressure and thus
|
|
|
|
* shouldn't stall IO. Treat this request as !elvpriv. This will
|
|
|
|
* disturb iosched and blkcg but weird is better than dead.
|
|
|
|
*/
|
2014-08-27 22:50:36 +07:00
|
|
|
printk_ratelimited(KERN_WARNING "%s: dev %s: request aux data allocation failed, iosched may be disturbed\n",
|
2017-02-02 21:56:50 +07:00
|
|
|
__func__, dev_name(q->backing_dev_info->dev));
|
2012-04-20 06:29:22 +07:00
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
rq->rq_flags &= ~RQF_ELVPRIV;
|
2012-04-20 06:29:22 +07:00
|
|
|
rq->elv.icq = NULL;
|
|
|
|
|
|
|
|
spin_lock_irq(q->queue_lock);
|
2012-06-05 10:40:58 +07:00
|
|
|
q->nr_rqs_elvpriv--;
|
2012-04-20 06:29:22 +07:00
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
goto out;
|
|
|
|
|
2012-03-06 04:15:23 +07:00
|
|
|
fail_alloc:
|
|
|
|
/*
|
|
|
|
* Allocation failed presumably due to memory. Undo anything we
|
|
|
|
* might have messed up.
|
|
|
|
*
|
|
|
|
* Allocating task should really be put onto the front of the wait
|
|
|
|
* queue, but this is pretty rare.
|
|
|
|
*/
|
|
|
|
spin_lock_irq(q->queue_lock);
|
2016-10-20 20:12:13 +07:00
|
|
|
freed_request(rl, is_sync, rq_flags);
|
2012-03-06 04:15:23 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* in the very unlikely event that allocation failed and no
|
|
|
|
* requests for this direction were pending, mark us starved so that
|
|
|
|
* freeing of a request in the other direction will notice
|
|
|
|
* us. another possible fix would be to split the rq mempool into
|
|
|
|
* READ and WRITE
|
|
|
|
*/
|
|
|
|
rq_starved:
|
|
|
|
if (unlikely(rl->count[is_sync] == 0))
|
|
|
|
rl->starved[is_sync] = 1;
|
2014-08-28 21:15:21 +07:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
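The admission rules at the top of __get_request() can be sketched as a simplified userspace model. It deliberately leaves out the elevator override (ELV_MQUEUE_MUST), the congestion flag, and the batch budget and timeout, so a batcher here stays a batcher; `demo_rl`, `demo_admit` and the `batcher` flag are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one direction of a request list. */
struct demo_rl {
	unsigned int count;  /* requests currently allocated */
	bool full;           /* set once the nominal limit is reached */
};

enum demo_verdict { DEMO_ADMIT, DEMO_REJECT };

/* *batcher models whether the calling process is a "batcher". */
enum demo_verdict demo_admit(struct demo_rl *rl, unsigned int nr_requests,
			     bool *batcher)
{
	if (rl->count + 1 >= nr_requests) {
		if (!rl->full) {
			/* This caller fills the list and becomes the batcher. */
			rl->full = true;
			*batcher = true;
		} else if (!*batcher) {
			/* List is full and the caller has no batch rights. */
			return DEMO_REJECT;
		}
	}

	/* Even batchers are capped at 150% of the nominal limit. */
	if (rl->count >= 3 * nr_requests / 2)
		return DEMO_REJECT;

	rl->count++;
	return DEMO_ADMIT;
}
```

With `nr_requests` set to 4, the caller that takes the fourth request becomes the batcher and may continue up to six outstanding requests, while any other caller is rejected as soon as the list is marked full.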
|
|
|
|
|
2011-10-19 19:33:05 +07:00
|
|
|
/**
|
2012-06-05 10:40:55 +07:00
|
|
|
* get_request - get a free request
|
2011-10-19 19:33:05 +07:00
|
|
|
* @q: request_queue to allocate request from
|
2016-10-28 21:48:16 +07:00
|
|
|
* @op: operation and flags
|
2011-10-19 19:33:05 +07:00
|
|
|
* @bio: bio to allocate request for (can be %NULL)
|
2012-06-05 10:40:55 +07:00
|
|
|
* @gfp_mask: allocation mask
|
2011-10-19 19:33:05 +07:00
|
|
|
*
|
2015-11-07 07:28:21 +07:00
|
|
|
* Get a free request from @q. If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
|
|
|
|
* this function keeps retrying under memory pressure and fails iff @q is dead.
|
2005-06-29 10:45:14 +07:00
|
|
|
*
|
2014-09-08 23:27:23 +07:00
|
|
|
* Must be called with @q->queue_lock held.
|
2014-08-28 21:15:21 +07:00
|
|
|
* Returns ERR_PTR on failure, with @q->queue_lock held.
|
|
|
|
* Returns request pointer on success, with @q->queue_lock *not held*.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2016-10-28 21:48:16 +07:00
|
|
|
static struct request *get_request(struct request_queue *q, unsigned int op,
|
|
|
|
struct bio *bio, gfp_t gfp_mask)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2016-10-28 21:48:16 +07:00
|
|
|
const bool is_sync = op_is_sync(op);
|
2012-06-05 10:40:55 +07:00
|
|
|
DEFINE_WAIT(wait);
|
2012-06-27 05:05:44 +07:00
|
|
|
struct request_list *rl;
|
2005-04-17 05:20:36 +07:00
|
|
|
struct request *rq;
|
2012-06-27 05:05:44 +07:00
|
|
|
|
|
|
|
rl = blk_get_rl(q, bio); /* transferred to @rq on success */
|
2012-06-05 10:40:55 +07:00
|
|
|
retry:
|
2016-10-28 21:48:16 +07:00
|
|
|
rq = __get_request(rl, op, bio, gfp_mask);
|
2014-08-28 21:15:21 +07:00
|
|
|
if (!IS_ERR(rq))
|
2012-06-05 10:40:55 +07:00
|
|
|
return rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-11-07 07:28:21 +07:00
|
|
|
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
|
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit harier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description udpated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
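The isolation described in this commit message can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct pool`, `pick_pool` and `pool_alloc` are hypothetical names invented here, not kernel APIs, and the pool is reduced to a counter. It shows each group drawing from its own pool, with a failed group lookup falling back to the root pool as the v3 note describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request_list: a bounded pool of requests. */
struct pool {
	int in_use;	/* requests currently handed out */
	int limit;	/* maximum requests this pool may hand out */
};

/*
 * Pick the pool for an IO. A NULL @grp models a failed group lookup,
 * in which case the root pool is used instead.
 */
struct pool *pick_pool(struct pool *root, struct pool *grp)
{
	assert(root != NULL);
	return grp ? grp : root;
}

/* Returns 1 on success, 0 if this pool (and only this pool) is exhausted. */
int pool_alloc(struct pool *p)
{
	assert(p != NULL);
	if (p->in_use >= p->limit)
		return 0;
	p->in_use++;
	return 1;
}
```

In this model a low-weight group that exhausts its own pool cannot consume entries from the root pool or from any other group, which is the property the patch is after.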
2012-06-27 05:05:44 +07:00
		blk_put_rl(rl);
2014-08-28 21:15:21 +07:00
		return rq;
2012-06-27 05:05:44 +07:00
	}
2005-04-17 05:20:36 +07:00

2012-06-05 10:40:55 +07:00
	/* wait on @rl and retry */
	prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
				  TASK_UNINTERRUPTIBLE);
2005-04-17 05:20:36 +07:00

2016-06-06 02:32:11 +07:00
	trace_block_sleeprq(q, bio, op);
2005-04-17 05:20:36 +07:00

2012-06-05 10:40:55 +07:00
	spin_unlock_irq(q->queue_lock);
	io_schedule();
2005-06-29 10:45:14 +07:00

2012-06-05 10:40:55 +07:00
	/*
	 * After sleeping, we become a "batching" process and will be able
	 * to allocate at least one request, and up to a big batch of them
	 * for a small period time. See ioc_batching, ioc_set_batching
	 */
	ioc_set_batching(q, current->io_context);
2008-05-22 20:13:29 +07:00

2012-06-05 10:40:55 +07:00
	spin_lock_irq(q->queue_lock);
	finish_wait(&rl->wait[is_sync], &wait);
2005-04-17 05:20:36 +07:00

2012-06-05 10:40:55 +07:00
	goto retry;
2005-04-17 05:20:36 +07:00
}

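The wait-and-retry shape of the code above (register as a waiter, drop the lock, sleep, retake the lock, try again) has a close userspace analogue in a condition variable. The following is a minimal sketch under that assumption, not the kernel implementation: `struct pool`, `pool_get` and `pool_put` are hypothetical names, and the "batching" grant after wakeup is deliberately not modeled.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical bounded pool standing in for a request_list. */
struct pool {
	pthread_mutex_t lock;	/* plays the role of q->queue_lock */
	pthread_cond_t wait;	/* plays the role of rl->wait[] */
	int free;		/* requests still available */
};

/*
 * Take one request. Returns 0 on success, or -1 when the pool is empty
 * and the caller is not allowed to block.
 */
int pool_get(struct pool *p, int may_block)
{
	pthread_mutex_lock(&p->lock);
	while (p->free == 0) {
		if (!may_block) {
			pthread_mutex_unlock(&p->lock);
			return -1;
		}
		/* Drops the lock while asleep and retakes it before returning. */
		pthread_cond_wait(&p->wait, &p->lock);
	}
	p->free--;
	pthread_mutex_unlock(&p->lock);
	return 0;
}

/* Return one request and wake a single waiter so it can retry. */
void pool_put(struct pool *p)
{
	pthread_mutex_lock(&p->lock);
	p->free++;
	assert(p->free > 0);
	pthread_cond_signal(&p->wait);
	pthread_mutex_unlock(&p->lock);
}
```

Waking a single waiter with `pthread_cond_signal` mirrors the exclusive wait used above; the `while` loop corresponds to `goto retry`, since a woken waiter still has to re-check that a request is actually available.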
2013-10-24 15:20:05 +07:00
static struct request *blk_old_get_request(struct request_queue *q, int rw,
		gfp_t gfp_mask)
2005-04-17 05:20:36 +07:00
{
	struct request *rq;

2012-06-05 10:40:56 +07:00
	/* create ioc upfront */
	create_io_context(gfp_mask, q->node);

2005-06-29 10:45:14 +07:00
	spin_lock_irq(q->queue_lock);
2016-10-28 21:48:16 +07:00
	rq = get_request(q, rw, NULL, gfp_mask);
2016-07-19 16:31:50 +07:00
	if (IS_ERR(rq)) {
2011-10-19 19:33:05 +07:00
		spin_unlock_irq(q->queue_lock);
2016-07-19 16:31:50 +07:00
		return rq;
	}
2005-04-17 05:20:36 +07:00

2016-07-19 16:31:50 +07:00
	/* q->queue_lock is unlocked at this point */
	rq->__data_len = 0;
	rq->__sector = (sector_t) -1;
	rq->bio = rq->biotail = NULL;
2005-04-17 05:20:36 +07:00
	return rq;
}
2013-10-24 15:20:05 +07:00

struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
	if (q->mq_ops)
2015-11-26 15:13:05 +07:00
		return blk_mq_alloc_request(q, rw,
			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
				0 : BLK_MQ_REQ_NOWAIT);
2013-10-24 15:20:05 +07:00
	else
		return blk_old_get_request(q, rw, gfp_mask);
}
2005-04-17 05:20:36 +07:00
EXPORT_SYMBOL(blk_get_request);

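The only translation `blk_get_request()` performs on the blk-mq path is turning "the caller may not sleep" into a no-wait allocation flag. A standalone sketch of that mapping follows; the two bit values are placeholders chosen for illustration and are not the kernel's actual `__GFP_DIRECT_RECLAIM` or `BLK_MQ_REQ_NOWAIT` values, and `mq_alloc_flags` is a hypothetical helper name.

```c
#include <assert.h>

/* Placeholder bits; the real kernel values differ. */
#define MODEL_GFP_DIRECT_RECLAIM	0x1u	/* caller is allowed to sleep */
#define MODEL_REQ_NOWAIT		0x1u	/* allocation must not sleep */

/* The no-wait flag must be distinguishable from "no flags". */
static_assert(MODEL_REQ_NOWAIT != 0, "NOWAIT flag must be nonzero");

/* Map an allocation mask to blk-mq style allocation flags. */
unsigned int mq_alloc_flags(unsigned int gfp_mask)
{
	return (gfp_mask & MODEL_GFP_DIRECT_RECLAIM) ? 0 : MODEL_REQ_NOWAIT;
}
```

A mask that permits direct reclaim yields no flags (the allocation may block); any mask without it yields the no-wait flag.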
/**
 * blk_requeue_request - put a request back on queue
 * @q: request queue where request should be inserted
 * @rq: request to be inserted
 *
 * Description:
 * Drivers often keep queueing requests until the hardware cannot accept
 * more, when that condition happens we need to put the request back
 * on the queue. Must be called with queue lock held.
 */
2007-07-24 14:28:11 +07:00
void blk_requeue_request(struct request_queue *q, struct request *rq)
2005-04-17 05:20:36 +07:00
{
2008-09-14 19:55:09 +07:00
	blk_delete_timer(rq);
	blk_clear_rq_complete(rq);
2008-10-30 14:34:33 +07:00
	trace_block_rq_requeue(q, rq);
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration from the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
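To make the windowed check in this commit message concrete, here is a deliberately small model of one branch only: when the minimum latency seen in a window exceeds the target, the scale count goes up and the allowed depth shrinks. Everything here is an assumption made for illustration: `struct wb_model`, its fields and the three functions are hypothetical, and halving the depth per step is a guess at a shrink rule, not what blk-wb actually computes. Only the default targets (2 msec and 75 msec) and the "0 disables" rule come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical throttle state; not the kernel's structure. */
struct wb_model {
	unsigned long target_usec;	/* latency target, 0 = disabled */
	unsigned int base_depth;	/* assumed unthrottled queue depth */
	int scale_step;			/* 0 = stable state */
};

/* Defaults stated in the commit text. */
unsigned long default_target_usec(int rotational)
{
	return rotational ? 75000UL : 2000UL;
}

/* Assumed shrink rule: halve the depth per positive step, never below 1. */
unsigned int current_depth(const struct wb_model *m)
{
	unsigned int d;

	assert(m != NULL);
	if (m->scale_step <= 0)
		return m->base_depth ? m->base_depth : 1;
	if (m->scale_step >= 31)
		return 1;
	d = m->base_depth >> m->scale_step;
	return d ? d : 1;
}

/* Called once per window with the smallest latency observed in it. */
void window_done(struct wb_model *m, unsigned long min_lat_usec)
{
	assert(m != NULL);
	if (m->target_usec && min_lat_usec > m->target_usec)
		m->scale_step++;
}
```

Using the minimum rather than the average latency means a single slow request does not trigger throttling; the whole window has to be slow.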
2016-11-10 02:38:14 +07:00
	wbt_requeue(q->rq_wb, &rq->issue_stat);
2006-03-24 02:00:26 +07:00

2016-10-20 20:12:13 +07:00
	if (rq->rq_flags & RQF_QUEUED)
2005-04-17 05:20:36 +07:00
		blk_queue_end_tag(q, rq);

2009-05-27 19:17:08 +07:00
	BUG_ON(blk_queued_rq(rq));

2005-04-17 05:20:36 +07:00
	elv_requeue_request(q, rq);
}
EXPORT_SYMBOL(blk_requeue_request);

2011-03-08 19:19:51 +07:00
static void add_acct_request(struct request_queue *q, struct request *rq,
			     int where)
{
2013-10-24 15:20:05 +07:00
	blk_account_io_start(rq, true);
2011-03-10 14:52:07 +07:00
	__elv_add_request(q, rq, where);
2011-03-08 19:19:51 +07:00
}

2008-08-25 17:56:14 +07:00
static void part_round_stats_single(int cpu, struct hd_struct *part,
				    unsigned long now)
{
2014-05-10 04:48:23 +07:00
	int inflight;

2008-08-25 17:56:14 +07:00
	if (now == part->stamp)
		return;

2014-05-10 04:48:23 +07:00
	inflight = part_in_flight(part);
	if (inflight) {
2008-08-25 17:56:14 +07:00
		__part_stat_add(cpu, part, time_in_queue,
2014-05-10 04:48:23 +07:00
				inflight * (now - part->stamp));
2008-08-25 17:56:14 +07:00
		__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
	}
	part->stamp = now;
}

/**
2008-10-16 12:46:23 +07:00
 * part_round_stats() - Round off the performance stats on a struct disk_stats.
 * @cpu: cpu number for stats access
 * @part: target partition
2005-04-17 05:20:36 +07:00
 *
 * The average IO queue length and utilisation statistics are maintained
 * by observing the current state of the queue length and the amount of
 * time it has been in this state for.
 *
 * Normally, that accounting is done on IO completion, but that can result
 * in more than a second's worth of IO being accounted for within any one
 * second, leading to >100% utilisation. To deal with that, we call this
 * function to do a round-off before returning the results when reading
 * /proc/diskstats. This accounts immediately for all queue usage up to
 * the current jiffies and restarts the counters again.
 */
2008-08-25 17:47:21 +07:00
void part_round_stats(int cpu, struct hd_struct *part)
2008-02-08 17:04:35 +07:00
{
	unsigned long now = jiffies;

2008-08-25 17:56:14 +07:00
	if (part->partno)
		part_round_stats_single(cpu, &part_to_disk(part)->part0, now);
	part_round_stats_single(cpu, part, now);
2008-02-08 17:04:35 +07:00
}
2008-08-25 17:56:14 +07:00
EXPORT_SYMBOL_GPL(part_round_stats);
2008-02-08 17:04:35 +07:00

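The round-off arithmetic above is easy to check by hand with a userspace model. The sketch below assumes a simplified, hypothetical `struct part_model` with plain counters in place of the per-cpu statistics, and a caller-supplied tick value in place of `jiffies`; `round_stats` is likewise an invented name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, flattened view of the per-partition counters. */
struct part_model {
	unsigned long stamp;		/* tick of the last round-off */
	unsigned long time_in_queue;	/* in-flight count integrated over time */
	unsigned long io_ticks;		/* ticks during which any IO was in flight */
	int in_flight;			/* requests currently outstanding */
};

/* Account for everything between the last round-off and @now. */
void round_stats(struct part_model *p, unsigned long now)
{
	assert(p != NULL);
	if (now == p->stamp)
		return;

	if (p->in_flight) {
		p->time_in_queue += p->in_flight * (now - p->stamp);
		p->io_ticks += now - p->stamp;
	}
	p->stamp = now;
}
```

With three requests in flight for 10 ticks, `time_in_queue` grows by 30 while `io_ticks` grows by only 10; an idle interval advances `stamp` without touching either counter, which is what keeps utilisation from exceeding 100%.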
2014-12-04 07:00:23 +07:00
#ifdef CONFIG_PM
2013-03-23 10:42:27 +07:00
static void blk_pm_put_request(struct request *rq)
{
2016-10-20 20:12:13 +07:00
	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
2013-03-23 10:42:27 +07:00
		pm_runtime_mark_last_busy(rq->q->dev);
}
#else
static inline void blk_pm_put_request(struct request *rq) {}
#endif

2005-04-17 05:20:36 +07:00
/*
 * queue lock must be held
 */
2007-07-24 14:28:11 +07:00
void __blk_put_request(struct request_queue *q, struct request *req)
2005-04-17 05:20:36 +07:00
{
2016-10-20 20:12:13 +07:00
	req_flags_t rq_flags = req->rq_flags;

2005-04-17 05:20:36 +07:00
	if (unlikely(!q))
		return;

2014-02-08 01:22:37 +07:00
	if (q->mq_ops) {
		blk_mq_free_request(req);
		return;
	}

2013-03-23 10:42:27 +07:00
	blk_pm_put_request(req);

2005-10-20 21:23:44 +07:00
	elv_completed_request(q, req);

2009-03-24 18:35:07 +07:00
	/* this is a bio leak */
	WARN_ON(req->bio != NULL);

2016-11-10 02:38:14 +07:00
	wbt_done(q->rq_wb, &req->issue_stat);

2005-04-17 05:20:36 +07:00
	/*
	 * Request may not have originated from ll_rw_blk. if not,
	 * it didn't come out of our reserved rq pools
	 */
2016-10-20 20:12:13 +07:00
	if (rq_flags & RQF_ALLOCED) {
2012-06-27 05:05:44 +07:00
		struct request_list *rl = blk_rq_rl(req);
2016-10-28 21:48:16 +07:00
		bool sync = op_is_sync(req->cmd_flags);
2005-04-17 05:20:36 +07:00

		BUG_ON(!list_empty(&req->queuelist));
2014-04-10 09:27:01 +07:00
		BUG_ON(ELV_ON_HASH(req));
2005-04-17 05:20:36 +07:00

blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit harier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description udpated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
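The root_rl fallback described under v3 can be illustrated with a small userspace sketch. Everything named `toy_*` below is a hypothetical stand-in invented for illustration and not the kernel's API; the real blk_get_rl() works on a bio and its blkcg. The sketch only shows the shape of the idea: each group owns a pool, and a failed group lookup (modelled as NULL) falls back to the queue's embedded root pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request_list: a counted pool of requests. */
struct toy_rl {
	int in_use;	/* requests currently allocated from this pool */
	int limit;	/* maximum requests this pool may hand out */
};

/* Hypothetical stand-in for a blkg: one pool per (cgroup, queue) pair. */
struct toy_group {
	struct toy_rl rl;
};

/* Hypothetical stand-in for a request_queue with its embedded root pool. */
struct toy_queue {
	struct toy_rl root_rl;
};

/*
 * Pick the pool an IO should allocate from. A NULL group models a failed
 * group lookup; in that case use the queue's root pool rather than
 * dereferencing a bad pointer.
 */
struct toy_rl *toy_get_rl(struct toy_queue *q, struct toy_group *grp)
{
	assert(q != NULL);
	if (grp == NULL)
		return &q->root_rl;
	return &grp->rl;
}

/* Take one request from a pool: 0 on success, -1 if the pool is exhausted. */
int toy_alloc(struct toy_rl *rl)
{
	if (rl->in_use >= rl->limit)
		return -1;
	rl->in_use++;
	return 0;
}
```

Because each group has its own `limit`, exhausting one group's pool leaves allocations from another group, or from the root pool, unaffected. That is the isolation property the commit message describes.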
2012-06-27 05:05:44 +07:00
|
|
|
blk_free_request(rl, req);
|
2016-10-20 20:12:13 +07:00
|
|
|
freed_request(rl, sync, rq_flags);
|
blkcg: implement per-blkg request allocation
2012-06-27 05:05:44 +07:00
|
|
|
blk_put_rl(rl);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
}
|
2005-11-11 18:30:24 +07:00
|
|
|
EXPORT_SYMBOL_GPL(__blk_put_request);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
void blk_put_request(struct request *req)
|
|
|
|
{
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *q = req->q;
|
2005-10-20 21:23:44 +07:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 15:20:05 +07:00
|
|
|
if (q->mq_ops)
|
|
|
|
blk_mq_free_request(req);
|
|
|
|
else {
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(q->queue_lock, flags);
|
|
|
|
__blk_put_request(q, req);
|
|
|
|
spin_unlock_irqrestore(q->queue_lock, flags);
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_put_request);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 15:20:05 +07:00
|
|
|
bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
|
|
|
|
struct bio *bio)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
2016-08-06 04:35:16 +07:00
|
|
|
const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
if (!ll_back_merge_fn(q, req, bio))
|
|
|
|
return false;
|
|
|
|
|
2013-01-12 04:06:34 +07:00
|
|
|
trace_block_bio_backmerge(q, req, bio);
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
|
|
|
|
blk_rq_set_mixed_merge(req);
|
|
|
|
|
|
|
|
req->biotail->bi_next = bio;
|
|
|
|
req->biotail = bio;
|
2013-10-12 05:44:27 +07:00
|
|
|
req->__data_len += bio->bi_iter.bi_size;
|
2011-03-08 19:19:51 +07:00
|
|
|
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 15:20:05 +07:00
|
|
|
blk_account_io_start(req, false);
|
2011-03-08 19:19:51 +07:00
|
|
|
return true;
|
|
|
|
}
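The failfast comparison in bio_attempt_back_merge() can be sketched in isolation. The flag values and the `toy_note_merge()` helper below are hypothetical (the kernel's REQ_FAILFAST_* bits have different values, and blk_rq_set_mixed_merge() does more than set one bit). The sketch assumes a "mixed merge" mark is a single flag that gets set once an incoming bio's failfast bits differ from the request's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define TOY_FAILFAST_DEV	(1u << 0)
#define TOY_FAILFAST_TRANSPORT	(1u << 1)
#define TOY_FAILFAST_DRIVER	(1u << 2)
#define TOY_FAILFAST_MASK \
	(TOY_FAILFAST_DEV | TOY_FAILFAST_TRANSPORT | TOY_FAILFAST_DRIVER)
#define TOY_MIXED_MERGE		(1u << 3)

/*
 * Compare the failfast bits of an incoming bio with those of the request.
 * If they differ, mark the request as a mixed merge. Returns whether the
 * request carries the mixed-merge mark after the call.
 */
bool toy_note_merge(unsigned int *req_flags, unsigned int bio_flags)
{
	unsigned int ff = bio_flags & TOY_FAILFAST_MASK;

	assert(req_flags != NULL);
	if ((*req_flags & TOY_FAILFAST_MASK) != ff)
		*req_flags |= TOY_MIXED_MERGE;
	return (*req_flags & TOY_MIXED_MERGE) != 0;
}
```

Once set, the mark is sticky in this sketch: later bios with matching failfast bits do not clear it.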
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 15:20:05 +07:00
|
|
|
bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
|
|
|
|
struct bio *bio)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
2016-08-06 04:35:16 +07:00
|
|
|
const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
if (!ll_front_merge_fn(q, req, bio))
|
|
|
|
return false;
|
|
|
|
|
2013-01-12 04:06:34 +07:00
|
|
|
trace_block_bio_frontmerge(q, req, bio);
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
|
|
|
|
blk_rq_set_mixed_merge(req);
|
|
|
|
|
|
|
|
bio->bi_next = req->bio;
|
|
|
|
req->bio = bio;
|
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
req->__sector = bio->bi_iter.bi_sector;
|
|
|
|
req->__data_len += bio->bi_iter.bi_size;
|
2011-03-08 19:19:51 +07:00
|
|
|
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 15:20:05 +07:00
|
|
|
blk_account_io_start(req, false);
|
2011-03-08 19:19:51 +07:00
|
|
|
return true;
|
|
|
|
}
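The pointer and length updates performed by the back and front merge paths above can be modelled in a few lines of userspace C. The `toy_bio` and `toy_req` types are hypothetical simplifications that keep only the fields the two functions touch; accounting, tracing and the ll_*_merge_fn() limit checks are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down bio: position, length and chain link. */
struct toy_bio {
	unsigned long long sector;	/* first sector covered by this bio */
	unsigned int size;		/* payload length in bytes */
	struct toy_bio *next;
};

/* Hypothetical, stripped-down request: a chain of bios plus totals. */
struct toy_req {
	unsigned long long sector;	/* first sector of the whole request */
	unsigned int data_len;		/* total bytes across all bios */
	struct toy_bio *bio;		/* head of the chain */
	struct toy_bio *biotail;	/* tail of the chain */
};

/* Back merge: append at the tail; the request's start sector is unchanged. */
void toy_back_merge(struct toy_req *req, struct toy_bio *bio)
{
	assert(req->biotail != NULL);
	bio->next = NULL;	/* the sketch terminates the chain explicitly */
	req->biotail->next = bio;
	req->biotail = bio;
	req->data_len += bio->size;
}

/* Front merge: prepend at the head; the request now starts at bio's sector. */
void toy_front_merge(struct toy_req *req, struct toy_bio *bio)
{
	bio->next = req->bio;
	req->bio = bio;
	req->sector = bio->sector;
	req->data_len += bio->size;
}
```

The asymmetry is the point: only a front merge moves the request's starting sector, which matches the extra `req->__sector` assignment in bio_attempt_front_merge().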
|
|
|
|
|
2017-02-08 20:46:49 +07:00
|
|
|
bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
unsigned short segments = blk_rq_nr_discard_segments(req);
|
|
|
|
|
|
|
|
if (segments >= queue_max_discard_segments(q))
|
|
|
|
goto no_merge;
|
|
|
|
if (blk_rq_sectors(req) + bio_sectors(bio) >
|
|
|
|
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
|
|
|
|
goto no_merge;
|
|
|
|
|
|
|
|
req->biotail->bi_next = bio;
|
|
|
|
req->biotail = bio;
|
|
|
|
req->__data_len += bio->bi_iter.bi_size;
|
|
|
|
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
|
|
|
|
req->nr_phys_segments = segments + 1;
|
|
|
|
|
|
|
|
blk_account_io_start(req, false);
|
|
|
|
return true;
|
|
|
|
no_merge:
|
|
|
|
req_set_nomerge(q, req);
|
|
|
|
return false;
|
|
|
|
}
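The two limit checks in bio_attempt_discard_merge() can be exercised in a standalone sketch. The `toy_discard_req` type and the `toy_discard_merge()` helper are hypothetical, and the limits are passed in as plain parameters rather than read from a queue. As in the function above, a refused merge marks the request so that later attempts can be skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a discard request for this sketch. */
struct toy_discard_req {
	unsigned short segments;	/* discard ranges already in the request */
	unsigned int sectors;		/* total sectors already in the request */
	bool nomerge;			/* set once a merge has been refused */
};

/*
 * Try to add one more discard range of bio_sectors sectors. The merge is
 * refused if the segment limit is already reached or the combined size
 * would exceed max_sectors.
 */
bool toy_discard_merge(struct toy_discard_req *req, unsigned int bio_sectors,
		       unsigned short max_segments, unsigned int max_sectors)
{
	assert(req != NULL);
	if (req->segments >= max_segments ||
	    req->sectors + bio_sectors > max_sectors) {
		req->nomerge = true;
		return false;
	}
	req->segments += 1;
	req->sectors += bio_sectors;
	return true;
}
```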
|
|
|
|
|
2011-10-19 19:33:08 +07:00
|
|
|
/**
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 15:20:05 +07:00
|
|
|
* blk_attempt_plug_merge - try to merge with %current's plugged list
|
2011-10-19 19:33:08 +07:00
|
|
|
* @q: request_queue new bio is being queued at
|
|
|
|
* @bio: new bio being queued
|
|
|
|
* @request_count: out parameter for number of traversed plugged requests
|
2015-10-31 08:36:16 +07:00
|
|
|
* @same_queue_rq: pointer to &struct request that gets filled in when
|
|
|
|
* another request associated with @q is found on the plug list
|
|
|
|
* (optional, may be %NULL)
|
2011-10-19 19:33:08 +07:00
|
|
|
*
|
|
|
|
* Determine whether @bio being queued on @q can be merged with a request
|
|
|
|
* on %current's plugged list. Returns %true if merge was successful,
|
|
|
|
* otherwise %false.
|
|
|
|
*
|
block: don't call elevator callbacks for plug merges
Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although
attempt_plug_merge() suggests that elevator is guaranteed to be there
through the existing request on the plug list, nothing prevents plug
merge from calling into dying or initializing elevator.
For regular merges, bypass ensures that the elvpriv count reaches zero,
which in turn prevents merges because all !ELVPRIV requests get
REQ_SOFTBARRIER from forced back insertion. Plug merge doesn't check
ELVPRIV and, as the requests haven't gone through elevator insertion yet,
they don't have SOFTBARRIER set, which allows merges on a bypassed queue.
This, for example, leads to the following crash during elevator
switch.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
PGD 112cbc067 PUD 115d5c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 1
Modules linked in: deadline_iosched
Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
Stack:
ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
Call Trace:
[<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
[<ffffffff81391bf1>] elv_try_merge+0x21/0x40
[<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
[<ffffffff81396a5a>] generic_make_request+0xca/0x100
[<ffffffff81396b04>] submit_bio+0x74/0x100
[<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
[<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
[<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff811986b2>] do_sync_read+0xe2/0x120
[<ffffffff81199345>] vfs_read+0xc5/0x180
[<ffffffff81199501>] sys_read+0x51/0x90
[<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
There are multiple ways to fix this including making plug merge check
ELVPRIV; however,
* Calling into elevator outside queue lock is confusing and
error-prone.
* Requests on plug list aren't known to the elevator. They aren't on
the elevator yet, so there's no elevator specific state to update.
* Given the nature of plug merges - collecting bio's for the same
purpose from the same issuer - elevator specific restrictions aren't
applicable.
So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Note that this makes per-cgroup merged stats skip plug merging.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 15:19:42 +07:00
|
|
|
* Plugging coalesces IOs from the same issuer for the same purpose without
|
|
|
|
* going through @q->queue_lock. As such it's more of an issuing mechanism
|
|
|
|
* than scheduling, and the request, while it may have elvpriv data, is not
|
|
|
|
* added on the elevator at this point. In addition, we don't have
|
|
|
|
* reliable access to the elevator outside queue lock. Only check basic
|
|
|
|
* merging parameters without querying the elevator.
|
2014-05-21 04:46:26 +07:00
|
|
|
*
|
|
|
|
* Caller must ensure !blk_queue_nomerges(q) beforehand.
|
2011-03-08 19:19:51 +07:00
|
|
|
*/
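The scan that the comment above describes can be sketched over a plain array. All `toy_*` names are hypothetical, the plug list is modelled as an array walked from its newest entry backwards, and the only merge condition modelled is back-merge contiguity (the request ends exactly where the new IO starts). No elevator callback is consulted, in line with the comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plugged request: owning queue and the sector range it covers. */
struct toy_plug_rq {
	int queue_id;
	unsigned long long start;
	unsigned int nr_sectors;
};

/*
 * Walk the plugged requests from newest to oldest. Count every request that
 * belongs to queue_id in *request_count, and back-merge the new IO into the
 * first contiguous one found. Returns the index merged into, or -1.
 */
int toy_plug_scan(struct toy_plug_rq *list, size_t n, int queue_id,
		  unsigned long long bio_start, unsigned int bio_sectors,
		  unsigned int *request_count)
{
	assert(request_count != NULL);
	*request_count = 0;

	for (size_t i = n; i-- > 0; ) {
		struct toy_plug_rq *rq = &list[i];

		if (rq->queue_id != queue_id)
			continue;
		(*request_count)++;

		if (rq->start + rq->nr_sectors == bio_start) {
			rq->nr_sectors += bio_sectors;
			return (int)i;
		}
	}
	return -1;
}
```

Scanning newest-first favours the request most recently added by the same issuer, which is the likeliest merge candidate for sequential IO.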
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 15:20:05 +07:00
|
|
|
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
|
2015-05-09 00:51:33 +07:00
|
|
|
unsigned int *request_count,
|
|
|
|
struct request **same_queue_rq)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
|
|
|
struct blk_plug *plug;
|
|
|
|
struct request *rq;
|
2013-10-30 01:01:03 +07:00
|
|
|
struct list_head *plug_list;
|
2011-03-08 19:19:51 +07:00
|
|
|
|
2011-10-19 19:33:08 +07:00
|
|
|
plug = current->plug;
|
2011-03-08 19:19:51 +07:00
|
|
|
if (!plug)
|
2017-02-08 20:46:48 +07:00
|
|
|
return false;
|
2011-08-24 21:04:34 +07:00
|
|
|
*request_count = 0;
|
2011-03-08 19:19:51 +07:00
|
|
|
|
2013-10-30 01:01:03 +07:00
|
|
|
if (q->mq_ops)
|
|
|
|
plug_list = &plug->mq_list;
|
|
|
|
else
|
|
|
|
plug_list = &plug->list;
|
|
|
|
|
|
|
|
list_for_each_entry_reverse(rq, plug_list, queuelist) {
|
2017-02-08 20:46:48 +07:00
|
|
|
bool merged = false;
|
2011-03-08 19:19:51 +07:00
|
|
|
|
2015-05-09 00:51:33 +07:00
|
|
|
if (rq->q == q) {
|
2012-04-07 00:37:47 +07:00
|
|
|
(*request_count)++;
|
2015-05-09 00:51:33 +07:00
|
|
|
/*
|
|
|
|
* Only blk-mq multiple hardware queues case checks the
|
|
|
|
* rq in the same queue, there should be only one such
|
|
|
|
* rq in a queue
|
|
|
|
*/
|
|
|
|
if (same_queue_rq)
|
|
|
|
*same_queue_rq = rq;
|
|
|
|
}
|
2011-08-24 21:04:34 +07:00
|
|
|
|
block: don't call elevator callbacks for plug merges
Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although
attempt_plug_merge() suggests that elevator is guaranteed to be there
through the existing request on the plug list, nothing prevents plug
merge from calling into dying or initializing elevator.
For regular merges, bypass ensures that the elvpriv count reaches zero, which
in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
from forced back insertion. Plug merge doesn't check ELVPRIV, and, as
the requests haven't gone through elevator insertion yet, it doesn't
have SOFTBARRIER set allowing merges on a bypassed queue.
This, for example, leads to the following crash during elevator
switch.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
PGD 112cbc067 PUD 115d5c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 1
Modules linked in: deadline_iosched
Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
Stack:
ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
Call Trace:
[<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
[<ffffffff81391bf1>] elv_try_merge+0x21/0x40
[<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
[<ffffffff81396a5a>] generic_make_request+0xca/0x100
[<ffffffff81396b04>] submit_bio+0x74/0x100
[<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
[<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
[<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff811986b2>] do_sync_read+0xe2/0x120
[<ffffffff81199345>] vfs_read+0xc5/0x180
[<ffffffff81199501>] sys_read+0x51/0x90
[<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
There are multiple ways to fix this including making plug merge check
ELVPRIV; however,
* Calling into elevator outside queue lock is confusing and
error-prone.
* Requests on plug list aren't known to the elevator. They aren't on
the elevator yet, so there's no elevator specific state to update.
* Given the nature of plug merges - collecting bio's for the same
purpose from the same issuer - elevator specific restrictions aren't
applicable.
So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Note that this makes per-cgroup merged stats skip plug merging.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 15:19:42 +07:00
|
|
|
if (rq->q != q || !blk_rq_merge_ok(rq, bio))
|
2011-03-08 19:19:51 +07:00
|
|
|
continue;
|
|
|
|
|
2017-02-08 20:46:48 +07:00
|
|
|
switch (blk_try_merge(rq, bio)) {
|
|
|
|
case ELEVATOR_BACK_MERGE:
|
|
|
|
merged = bio_attempt_back_merge(q, rq, bio);
|
|
|
|
break;
|
|
|
|
case ELEVATOR_FRONT_MERGE:
|
|
|
|
merged = bio_attempt_front_merge(q, rq, bio);
|
|
|
|
break;
|
2017-02-08 20:46:49 +07:00
|
|
|
case ELEVATOR_DISCARD_MERGE:
|
|
|
|
merged = bio_attempt_discard_merge(q, rq, bio);
|
|
|
|
break;
|
2017-02-08 20:46:48 +07:00
|
|
|
default:
|
|
|
|
break;
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
2017-02-08 20:46:48 +07:00
|
|
|
|
|
|
|
if (merged)
|
|
|
|
return true;
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
2017-02-08 20:46:48 +07:00
|
|
|
|
|
|
|
return false;
|
2011-03-08 19:19:51 +07:00
|
|
|
}
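The back/front decision that the switch above relies on can be sketched in isolation. The types and names below (struct extent, try_merge, enum merge_kind) are simplified, hypothetical stand-ins rather than the kernel's struct request, struct bio, or blk_try_merge(); the sketch only shows the sector-adjacency test and ignores discard merges and all mergeability checks.

```c
#include <stdint.h>

enum merge_kind { NO_MERGE, BACK_MERGE, FRONT_MERGE };

/* Simplified stand-in for a request or bio: a contiguous run of sectors. */
struct extent {
	uint64_t start;		/* first sector */
	uint64_t nr_sectors;	/* length in sectors */
};

static enum merge_kind try_merge(const struct extent *rq,
				 const struct extent *bio)
{
	/* bio begins exactly where the request ends: append to the back */
	if (rq->start + rq->nr_sectors == bio->start)
		return BACK_MERGE;
	/* bio ends exactly where the request begins: prepend to the front */
	if (bio->start + bio->nr_sectors == rq->start)
		return FRONT_MERGE;
	return NO_MERGE;
}
```

Walking the plug list in reverse, as the function above does, tests the most recently queued request first, which is the likeliest back-merge candidate for sequential IO.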
|
|
|
|
|
2015-10-20 22:13:51 +07:00
|
|
|
unsigned int blk_plug_queued_count(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug;
|
|
|
|
struct request *rq;
|
|
|
|
struct list_head *plug_list;
|
|
|
|
unsigned int ret = 0;
|
|
|
|
|
|
|
|
plug = current->plug;
|
|
|
|
if (!plug)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (q->mq_ops)
|
|
|
|
plug_list = &plug->mq_list;
|
|
|
|
else
|
|
|
|
plug_list = &plug->list;
|
|
|
|
|
|
|
|
list_for_each_entry(rq, plug_list, queuelist) {
|
|
|
|
if (rq->q == q)
|
|
|
|
ret++;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-01-29 20:53:40 +07:00
|
|
|
void init_request_from_bio(struct request *req, struct bio *bio)
|
2006-01-06 15:49:58 +07:00
|
|
|
{
|
2016-08-06 04:35:16 +07:00
|
|
|
if (bio->bi_opf & REQ_RAHEAD)
|
2009-07-03 15:48:16 +07:00
|
|
|
req->cmd_flags |= REQ_FAILFAST_MASK;
|
2006-06-13 13:26:10 +07:00
|
|
|
|
2006-01-06 15:49:58 +07:00
|
|
|
req->errors = 0;
|
2013-10-12 05:44:27 +07:00
|
|
|
req->__sector = bio->bi_iter.bi_sector;
|
2017-04-04 22:25:14 +07:00
|
|
|
blk_rq_set_prio(req, rq_ioc(bio));
|
2016-10-18 01:27:28 +07:00
|
|
|
if (ioprio_valid(bio_prio(bio)))
|
|
|
|
req->ioprio = bio_prio(bio);
|
2007-08-16 18:31:30 +07:00
|
|
|
blk_rq_bio_prep(req->q, req, bio);
|
2006-01-06 15:49:58 +07:00
|
|
|
}
|
|
|
|
|
2015-11-06 00:41:16 +07:00
|
|
|
static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2011-03-08 19:19:51 +07:00
|
|
|
struct blk_plug *plug;
|
2017-02-08 20:46:48 +07:00
|
|
|
int where = ELEVATOR_INSERT_SORT;
|
2017-02-03 23:48:28 +07:00
|
|
|
struct request *req, *free;
|
2011-08-24 21:04:34 +07:00
|
|
|
unsigned int request_count = 0;
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration from the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 02:38:14 +07:00
|
|
|
unsigned int wb_acct;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* low level driver can indicate that it wants pages above a
|
|
|
|
* certain limit bounced to low memory (ie for highmem, or even
|
|
|
|
* ISA dma in theory)
|
|
|
|
*/
|
|
|
|
blk_queue_bounce(q, &bio);
|
|
|
|
|
2015-12-23 00:23:44 +07:00
|
|
|
blk_queue_split(q, &bio, q->bio_split);
|
|
|
|
|
2013-02-22 07:42:55 +07:00
|
|
|
if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
|
2015-07-20 20:29:37 +07:00
|
|
|
bio->bi_error = -EIO;
|
|
|
|
bio_endio(bio);
|
2015-11-06 00:41:16 +07:00
|
|
|
return BLK_QC_T_NONE;
|
2013-02-22 07:42:55 +07:00
|
|
|
}
|
|
|
|
|
2017-01-27 22:30:47 +07:00
|
|
|
if (op_is_flush(bio->bi_opf)) {
|
2011-03-08 19:19:51 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
2011-01-25 18:43:54 +07:00
|
|
|
where = ELEVATOR_INSERT_FLUSH;
|
2010-09-03 16:56:16 +07:00
|
|
|
goto get_rq;
|
|
|
|
}
|
|
|
|
|
2011-03-08 19:19:51 +07:00
|
|
|
/*
|
|
|
|
* Check if we can merge with the plugged list before grabbing
|
|
|
|
* any locks.
|
|
|
|
*/
|
2015-10-20 22:13:51 +07:00
|
|
|
if (!blk_queue_nomerges(q)) {
|
|
|
|
if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
|
2015-11-06 00:41:16 +07:00
|
|
|
return BLK_QC_T_NONE;
|
2015-10-20 22:13:51 +07:00
|
|
|
} else
|
|
|
|
request_count = blk_plug_queued_count(q);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-03-08 19:19:51 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
2006-03-24 02:00:26 +07:00
|
|
|
|
2017-02-08 20:46:48 +07:00
|
|
|
switch (elv_merge(q, &req, bio)) {
|
|
|
|
case ELEVATOR_BACK_MERGE:
|
|
|
|
if (!bio_attempt_back_merge(q, req, bio))
|
|
|
|
break;
|
|
|
|
elv_bio_merged(q, req, bio);
|
|
|
|
free = attempt_back_merge(q, req);
|
|
|
|
if (free)
|
|
|
|
__blk_put_request(q, free);
|
|
|
|
else
|
|
|
|
elv_merged_request(q, req, ELEVATOR_BACK_MERGE);
|
|
|
|
goto out_unlock;
|
|
|
|
case ELEVATOR_FRONT_MERGE:
|
|
|
|
if (!bio_attempt_front_merge(q, req, bio))
|
|
|
|
break;
|
|
|
|
elv_bio_merged(q, req, bio);
|
|
|
|
free = attempt_front_merge(q, req);
|
|
|
|
if (free)
|
|
|
|
__blk_put_request(q, free);
|
|
|
|
else
|
|
|
|
elv_merged_request(q, req, ELEVATOR_FRONT_MERGE);
|
|
|
|
goto out_unlock;
|
|
|
|
default:
|
|
|
|
break;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2005-06-29 10:45:13 +07:00
|
|
|
get_rq:
|
block: hook up writeback throttling
2016-11-10 02:38:14 +07:00
|
|
|
wb_acct = wbt_wait(q->rq_wb, bio, q->queue_lock);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2005-06-29 10:45:13 +07:00
|
|
|
* Grab a free request. This might sleep but cannot fail.
|
2005-06-29 10:45:14 +07:00
|
|
|
* Returns with the queue unlocked.
|
2005-06-29 10:45:13 +07:00
|
|
|
*/
|
2016-10-28 21:48:16 +07:00
|
|
|
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
|
2014-08-28 21:15:21 +07:00
|
|
|
if (IS_ERR(req)) {
|
block: hook up writeback throttling
2016-11-10 02:38:14 +07:00
|
|
|
__wbt_done(q->rq_wb, wb_acct);
|
2015-07-20 20:29:37 +07:00
|
|
|
bio->bi_error = PTR_ERR(req);
|
|
|
|
bio_endio(bio);
|
2011-10-19 19:33:05 +07:00
|
|
|
goto out_unlock;
|
|
|
|
}
|
2005-06-29 10:45:14 +07:00
|
|
|
|
block: hook up writeback throttling
2016-11-10 02:38:14 +07:00
|
|
|
wbt_track(&req->issue_stat, wb_acct);
|
|
|
|
|
2005-06-29 10:45:13 +07:00
|
|
|
/*
|
|
|
|
* After dropping the lock and possibly sleeping here, our request
|
|
|
|
* may now be mergeable after it had proven unmergeable (above).
|
|
|
|
* We don't worry about that case for efficiency. It won't happen
|
|
|
|
* often, and the elevators are able to handle it.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2006-01-06 15:49:58 +07:00
|
|
|
init_request_from_bio(req, bio);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-10-24 21:11:30 +07:00
|
|
|
if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
|
2011-07-26 20:01:15 +07:00
|
|
|
req->cpu = raw_smp_processor_id();
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
plug = current->plug;
|
2011-03-09 17:56:30 +07:00
|
|
|
if (plug) {
|
2011-04-12 15:28:28 +07:00
|
|
|
/*
|
|
|
|
* If this is the first request added after a plug, fire
|
2013-09-12 02:21:07 +07:00
|
|
|
* off a plug trace.
|
2016-11-16 17:07:05 +07:00
|
|
|
*
|
|
|
|
* @request_count may become stale because of schedule
|
|
|
|
* out, so check plug list again.
|
2011-04-12 15:28:28 +07:00
|
|
|
*/
|
2016-11-16 17:07:05 +07:00
|
|
|
if (!request_count || list_empty(&plug->list))
|
2011-04-12 15:28:28 +07:00
|
|
|
trace_block_plug(q);
|
2011-11-16 15:21:50 +07:00
|
|
|
else {
|
2016-11-04 07:03:53 +07:00
|
|
|
struct request *last = list_entry_rq(plug->list.prev);
|
|
|
|
if (request_count >= BLK_MAX_REQUEST_COUNT ||
|
|
|
|
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE) {
|
2011-11-16 15:21:50 +07:00
|
|
|
blk_flush_plug_list(plug, false);
|
2011-11-16 15:21:50 +07:00
|
|
|
trace_block_plug(q);
|
|
|
|
}
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
|
|
|
list_add_tail(&req->queuelist, &plug->list);
|
blk-mq: new multi-queue block IO queueing mechanism
2013-10-24 15:20:05 +07:00
|
|
|
blk_account_io_start(req, true);
|
2011-03-08 19:19:51 +07:00
|
|
|
} else {
|
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
add_acct_request(q, req, where);
|
2011-04-18 16:41:33 +07:00
|
|
|
__blk_run_queue(q);
|
2011-03-08 19:19:51 +07:00
|
|
|
out_unlock:
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
}
|
2015-11-06 00:41:16 +07:00
|
|
|
|
|
|
|
return BLK_QC_T_NONE;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
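The plug-flush condition used in the plugged path above can be read as a small predicate. The sketch below is a standalone restatement under assumptions: plug_should_flush is a hypothetical name, and the two SKETCH_ constants are illustrative values standing in for BLK_MAX_REQUEST_COUNT and BLK_PLUG_FLUSH_SIZE, whose real definitions live elsewhere in the kernel tree.

```c
#include <stdbool.h>

/* Illustrative thresholds only; see the kernel headers for the real ones. */
#define SKETCH_MAX_REQUEST_COUNT 16u
#define SKETCH_PLUG_FLUSH_SIZE	 (128u * 1024u)

/*
 * Flush the plug list when it already holds many requests for this
 * queue, or when the most recently plugged request has grown large.
 */
static bool plug_should_flush(unsigned int request_count,
			      unsigned int last_rq_bytes)
{
	return request_count >= SKETCH_MAX_REQUEST_COUNT ||
	       last_rq_bytes >= SKETCH_PLUG_FLUSH_SIZE;
}
```

Either condition alone is enough: a long list of small requests and a short list ending in one large request both trigger the flush.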
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If bio->bi_bdev is a partition, remap the location
|
|
|
|
*/
|
|
|
|
static inline void blk_partition_remap(struct bio *bio)
|
|
|
|
{
|
|
|
|
struct block_device *bdev = bio->bi_bdev;
|
|
|
|
|
2016-11-22 04:52:23 +07:00
|
|
|
/*
|
|
|
|
* Zone reset does not include bi_size so bio_sectors() is always 0.
|
|
|
|
* Include a test for the reset op code and perform the remap if needed.
|
|
|
|
*/
|
|
|
|
if (bdev != bdev->bd_contains &&
|
|
|
|
(bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)) {
|
2005-04-17 05:20:36 +07:00
|
|
|
struct hd_struct *p = bdev->bd_part;
|
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
bio->bi_iter.bi_sector += p->start_sect;
|
2005-04-17 05:20:36 +07:00
|
|
|
bio->bi_bdev = bdev->bd_contains;
|
2007-08-07 20:30:23 +07:00
|
|
|
|
2010-11-16 18:52:38 +07:00
|
|
|
trace_block_bio_remap(bdev_get_queue(bio->bi_bdev), bio,
|
|
|
|
bdev->bd_dev,
|
2013-10-12 05:44:27 +07:00
|
|
|
bio->bi_iter.bi_sector - p->start_sect);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
}
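The arithmetic behind the remap is a single offset addition. The following sketch uses a hypothetical struct part in place of struct hd_struct and a hypothetical helper remap_sector; it shows only the sector translation, not the device switch or the tracing.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct hd_struct: only the start offset. */
struct part {
	uint64_t start_sect;	/* first sector of the partition on the disk */
};

/* Translate a partition-relative sector to a whole-disk sector. */
static uint64_t remap_sector(uint64_t sector_in_part, const struct part *p)
{
	return sector_in_part + p->start_sect;
}
```

For a partition starting at sector 2048, sector 10 of the partition is sector 2058 of the underlying disk.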
|
|
|
|
|
|
|
|
static void handle_bad_sector(struct bio *bio)
|
|
|
|
{
|
|
|
|
char b[BDEVNAME_SIZE];
|
|
|
|
|
|
|
|
printk(KERN_INFO "attempt to access beyond end of device\n");
|
2016-06-06 02:32:21 +07:00
|
|
|
printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n",
|
2005-04-17 05:20:36 +07:00
|
|
|
bdevname(bio->bi_bdev, b),
|
2016-08-06 04:35:16 +07:00
|
|
|
bio->bi_opf,
|
2012-09-26 05:05:12 +07:00
|
|
|
(unsigned long long)bio_end_sector(bio),
|
2010-11-08 20:39:12 +07:00
|
|
|
(unsigned long long)(i_size_read(bio->bi_bdev->bd_inode) >> 9));
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2006-12-08 17:39:46 +07:00
|
|
|
#ifdef CONFIG_FAIL_MAKE_REQUEST
|
|
|
|
|
|
|
|
static DECLARE_FAULT_ATTR(fail_make_request);
|
|
|
|
|
|
|
|
static int __init setup_fail_make_request(char *str)
|
|
|
|
{
|
|
|
|
return setup_fault_attr(&fail_make_request, str);
|
|
|
|
}
|
|
|
|
__setup("fail_make_request=", setup_fail_make_request);
|
|
|
|
|
2011-07-27 06:09:03 +07:00
|
|
|
static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
|
2006-12-08 17:39:46 +07:00
|
|
|
{
|
2011-07-27 06:09:03 +07:00
|
|
|
return part->make_it_fail && should_fail(&fail_make_request, bytes);
|
2006-12-08 17:39:46 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static int __init fail_make_request_debugfs(void)
|
|
|
|
{
|
2011-08-04 06:21:01 +07:00
|
|
|
struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
|
|
|
|
NULL, &fail_make_request);
|
|
|
|
|
2014-04-11 14:58:56 +07:00
|
|
|
return PTR_ERR_OR_ZERO(dir);
|
2006-12-08 17:39:46 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
late_initcall(fail_make_request_debugfs);
|
|
|
|
|
|
|
|
#else /* CONFIG_FAIL_MAKE_REQUEST */
|
|
|
|
|
2011-07-27 06:09:03 +07:00
|
|
|
static inline bool should_fail_request(struct hd_struct *part,
|
|
|
|
unsigned int bytes)
|
2006-12-08 17:39:46 +07:00
|
|
|
{
|
2011-07-27 06:09:03 +07:00
|
|
|
return false;
|
2006-12-08 17:39:46 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_FAIL_MAKE_REQUEST */
|
|
|
|
|
2007-07-18 18:27:58 +07:00
|
|
|
/*
|
|
|
|
* Check whether this bio extends beyond the end of the device.
|
|
|
|
*/
|
|
|
|
static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
|
|
|
|
{
|
|
|
|
sector_t maxsector;
|
|
|
|
|
|
|
|
if (!nr_sectors)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Test device or partition size, when known. */
|
2010-11-08 20:39:12 +07:00
|
|
|
maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9;
|
2007-07-18 18:27:58 +07:00
|
|
|
if (maxsector) {
|
2013-10-12 05:44:27 +07:00
|
|
|
sector_t sector = bio->bi_iter.bi_sector;
|
2007-07-18 18:27:58 +07:00
|
|
|
|
|
|
|
if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
|
|
|
|
/*
|
|
|
|
* This may well happen - the kernel calls bread()
|
|
|
|
* without checking the size of the device, e.g., when
|
|
|
|
* mounting a device.
|
|
|
|
*/
|
|
|
|
handle_bad_sector(bio);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
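The comparison in bio_check_eod() is written so that it cannot wrap around: it subtracts from maxsector instead of adding sector and nr_sectors. A standalone sketch of the same test, with the hypothetical name beyond_end and uint64_t standing in for sector_t, is:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [sector, sector + nr_sectors) runs past maxsector.
 * A maxsector of 0 is treated as "size unknown" and never fails,
 * mirroring the function above.
 */
static bool beyond_end(uint64_t sector, uint64_t nr_sectors,
		       uint64_t maxsector)
{
	if (nr_sectors == 0 || maxsector == 0)
		return false;
	/* Subtract rather than add so a huge sector value cannot overflow. */
	return maxsector < nr_sectors || maxsector - nr_sectors < sector;
}
```

A naive `sector + nr_sectors > maxsector` would wrap for sector values near the top of the range and wrongly accept the IO; the subtractive form rejects it.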
|
|
|
|
|
2011-09-15 19:01:40 +07:00
|
|
|
static noinline_for_stack bool
|
|
|
|
generic_make_request_checks(struct bio *bio)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *q;
|
2011-09-12 17:12:01 +07:00
|
|
|
int nr_sectors = bio_sectors(bio);
|
2007-11-02 14:49:08 +07:00
|
|
|
int err = -EIO;
|
2011-09-12 17:12:01 +07:00
|
|
|
char b[BDEVNAME_SIZE];
|
|
|
|
struct hd_struct *part;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
might_sleep();
|
|
|
|
|
2007-07-18 18:27:58 +07:00
|
|
|
if (bio_check_eod(bio, nr_sectors))
|
|
|
|
goto end_io;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-09-12 17:12:01 +07:00
|
|
|
q = bdev_get_queue(bio->bi_bdev);
|
|
|
|
if (unlikely(!q)) {
|
|
|
|
printk(KERN_ERR
|
|
|
|
"generic_make_request: Trying to access "
|
|
|
|
"nonexistent block-device %s (%Lu)\n",
|
|
|
|
bdevname(bio->bi_bdev, b),
|
2013-10-12 05:44:27 +07:00
|
|
|
(unsigned long long) bio->bi_iter.bi_sector);
|
2011-09-12 17:12:01 +07:00
|
|
|
goto end_io;
|
|
|
|
}
|
2006-12-08 17:39:46 +07:00
|
|
|
|
2011-09-12 17:12:01 +07:00
|
|
|
part = bio->bi_bdev->bd_part;
|
2013-10-12 05:44:27 +07:00
|
|
|
if (should_fail_request(part, bio->bi_iter.bi_size) ||
|
2011-09-12 17:12:01 +07:00
|
|
|
should_fail_request(&part_to_disk(part)->part0,
|
2013-10-12 05:44:27 +07:00
|
|
|
bio->bi_iter.bi_size))
|
2011-09-12 17:12:01 +07:00
|
|
|
goto end_io;
|
2006-03-24 02:00:26 +07:00
|
|
|
|
2011-09-12 17:12:01 +07:00
|
|
|
/*
|
|
|
|
* If this device has partitions, remap block n
|
|
|
|
* of partition p to block n+start(p) of the disk.
|
|
|
|
*/
|
|
|
|
blk_partition_remap(bio);
|
2006-03-24 02:00:26 +07:00
|
|
|
|
2011-09-12 17:12:01 +07:00
|
|
|
if (bio_check_eod(bio, nr_sectors))
|
|
|
|
goto end_io;
|
2010-09-03 16:56:17 +07:00
|
|
|
|
2011-09-12 17:12:01 +07:00
|
|
|
/*
|
|
|
|
* Filter flush bio's early so that make_request based
|
|
|
|
* drivers without flush support don't have to worry
|
|
|
|
* about them.
|
|
|
|
*/
|
2017-01-27 23:08:23 +07:00
|
|
|
if (op_is_flush(bio->bi_opf) &&
|
2016-04-14 02:33:19 +07:00
|
|
|
!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
|
2016-08-06 04:35:16 +07:00
|
|
|
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
|
2011-09-12 17:12:01 +07:00
|
|
|
if (!nr_sectors) {
|
|
|
|
err = 0;
|
2007-11-02 14:49:08 +07:00
|
|
|
goto end_io;
|
|
|
|
}
|
2011-09-12 17:12:01 +07:00
|
|
|
}
|
2006-10-31 13:07:21 +07:00
|
|
|
|
2016-06-09 21:00:36 +07:00
|
|
|
switch (bio_op(bio)) {
|
|
|
|
case REQ_OP_DISCARD:
|
|
|
|
if (!blk_queue_discard(q))
|
|
|
|
goto not_supported;
|
|
|
|
break;
|
|
|
|
case REQ_OP_SECURE_ERASE:
|
|
|
|
if (!blk_queue_secure_erase(q))
|
|
|
|
goto not_supported;
|
|
|
|
break;
|
|
|
|
case REQ_OP_WRITE_SAME:
|
|
|
|
if (!bdev_write_same(bio->bi_bdev))
|
|
|
|
goto not_supported;
|
2016-12-04 20:56:39 +07:00
|
|
|
break;
|
2016-10-18 13:40:32 +07:00
|
|
|
case REQ_OP_ZONE_REPORT:
|
|
|
|
case REQ_OP_ZONE_RESET:
|
|
|
|
if (!bdev_is_zoned(bio->bi_bdev))
|
|
|
|
goto not_supported;
|
2016-06-09 21:00:36 +07:00
|
|
|
break;
|
2016-12-01 03:28:59 +07:00
|
|
|
case REQ_OP_WRITE_ZEROES:
|
|
|
|
if (!bdev_write_zeroes_sectors(bio->bi_bdev))
|
|
|
|
goto not_supported;
|
|
|
|
break;
|
2016-06-09 21:00:36 +07:00
|
|
|
default:
|
|
|
|
break;
|
2011-09-12 17:12:01 +07:00
|
|
|
}
|
2009-09-09 02:56:38 +07:00
|
|
|
|
2012-06-05 10:40:56 +07:00
|
|
|
/*
|
|
|
|
* Various block parts want %current->io_context and lazy ioc
|
|
|
|
* allocation ends up trading a lot of pain for a small amount of
|
|
|
|
* memory. Just allocate it upfront. This may fail and block
|
|
|
|
* layer knows how to live with it.
|
|
|
|
*/
|
|
|
|
create_io_context(GFP_ATOMIC, q->node);
|
|
|
|
|
2015-08-19 04:55:20 +07:00
|
|
|
if (!blkcg_bio_issue_check(q, bio))
|
|
|
|
return false;
|
2011-09-15 19:01:40 +07:00
|
|
|
|
block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handled directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this we can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it is re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bio_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 22:40:52 +07:00
|
|
|
if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
|
|
|
|
trace_block_bio_queue(q, bio);
|
|
|
|
/* Now that enqueuing has been traced, we need to trace
|
|
|
|
* completion as well.
|
|
|
|
*/
|
|
|
|
bio_set_flag(bio, BIO_TRACE_COMPLETION);
|
|
|
|
}
|
2011-09-15 19:01:40 +07:00
|
|
|
return true;
|
2008-11-28 11:32:03 +07:00
|
|
|
|
2016-06-09 21:00:36 +07:00
|
|
|
not_supported:
|
|
|
|
err = -EOPNOTSUPP;
|
2008-11-28 11:32:03 +07:00
|
|
|
end_io:
|
2015-07-20 20:29:37 +07:00
|
|
|
bio->bi_error = err;
|
|
|
|
bio_endio(bio);
|
2011-09-15 19:01:40 +07:00
|
|
|
return false;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
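The range checks in the function above can be sketched in isolation. This is only an illustration under stated assumptions: `toy_part`, `toy_check_eod` and `toy_remap` are hypothetical names, not kernel APIs, and the sketch covers just the "check against the partition, remap block n to n+start(p), check against the whole disk" sequence, leaving out flush filtering, fault injection and the per-operation support checks.

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical stand-in for a partition: start offset on the disk and length. */
struct toy_part {
	sector_t start_sect;
	sector_t nr_sects;
};

/*
 * Return 1 if [sector, sector + nr_sectors) extends past capacity.
 * The comparison uses a subtraction so sector + nr_sectors cannot overflow.
 * Empty requests and a capacity of 0 ("unknown") are not checked.
 */
static int toy_check_eod(sector_t sector, sector_t nr_sectors, sector_t capacity)
{
	if (!nr_sectors || !capacity)
		return 0;
	return nr_sectors > capacity || sector > capacity - nr_sectors;
}

/*
 * Check against the partition, remap to a disk-relative sector, then check
 * against the whole disk. Returns 0 and stores the remapped sector in *out,
 * or -1 if either range check fails.
 */
static int toy_remap(const struct toy_part *p, sector_t disk_sectors,
		     sector_t sector, sector_t nr_sectors, sector_t *out)
{
	if (toy_check_eod(sector, nr_sectors, p->nr_sects))
		return -1;
	sector += p->start_sect;	/* block n of partition p -> n + start(p) */
	if (toy_check_eod(sector, nr_sectors, disk_sectors))
		return -1;
	*out = sector;
	return 0;
}
```

With a partition starting at sector 2048 and 1000 sectors long, an 8-sector request at partition sector 10 maps to disk sector 2058, while the same request at partition sector 996 is rejected because it would run past the end of the partition.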
|
|
|
|
|
2011-09-15 19:01:40 +07:00
|
|
|
/**
|
|
|
|
* generic_make_request - hand a buffer to its device driver for I/O
|
|
|
|
* @bio: The bio describing the location in memory and on the device.
|
|
|
|
*
|
|
|
|
* generic_make_request() is used to make I/O requests of block
|
|
|
|
* devices. It is passed a &struct bio, which describes the I/O that needs
|
|
|
|
* to be done.
|
|
|
|
*
|
|
|
|
* generic_make_request() does not return any status. The
|
|
|
|
* success/failure status of the request, along with notification of
|
|
|
|
* completion, is delivered asynchronously through the bio->bi_end_io
|
|
|
|
* function described (one day) elsewhere.
|
|
|
|
*
|
|
|
|
* The caller of generic_make_request must make sure that bi_io_vec
|
|
|
|
* is set to describe the memory buffer, and that bi_bdev and bi_sector are
|
|
|
|
* set to describe the device address, and the
|
|
|
|
* bi_end_io and optionally bi_private are set to describe how
|
|
|
|
* completion notification should be signaled.
|
|
|
|
*
|
|
|
|
* generic_make_request and the drivers it calls may use bi_next if this
|
|
|
|
* bio happens to be merged with someone else, and may resubmit the bio to
|
|
|
|
* a lower device by calling into generic_make_request recursively, which
|
|
|
|
* means the bio should NOT be touched after the call to ->make_request_fn.
|
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 14:53:42 +07:00
|
|
|
*/
|
2015-11-06 00:41:16 +07:00
|
|
|
blk_qc_t generic_make_request(struct bio *bio)
|
2007-05-01 14:53:42 +07:00
|
|
|
{
|
2017-03-10 13:00:47 +07:00
|
|
|
/*
|
|
|
|
* bio_list_on_stack[0] contains bios submitted by the current
|
|
|
|
* make_request_fn.
|
|
|
|
* bio_list_on_stack[1] contains bios that were submitted before
|
|
|
|
* the current make_request_fn, but that haven't been processed
|
|
|
|
* yet.
|
|
|
|
*/
|
|
|
|
struct bio_list bio_list_on_stack[2];
|
2015-11-06 00:41:16 +07:00
|
|
|
blk_qc_t ret = BLK_QC_T_NONE;
|
2010-02-23 14:55:42 +07:00
|
|
|
|
2011-09-15 19:01:40 +07:00
|
|
|
if (!generic_make_request_checks(bio))
|
2015-11-06 00:41:16 +07:00
|
|
|
goto out;
|
2011-09-15 19:01:40 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We only want one ->make_request_fn to be active at a time, else
|
|
|
|
* stack usage with stacked devices could be a problem. So use
|
|
|
|
* current->bio_list to keep a list of requests submitted by a
|
|
|
|
* make_request_fn function. current->bio_list is also used as a
|
|
|
|
* flag to say if generic_make_request is currently active in this
|
|
|
|
* task or not. If it is NULL, then no make_request is active. If
|
|
|
|
* it is non-NULL, then a make_request is active, and new requests
|
|
|
|
* should be added at the tail.
|
|
|
|
*/
|
2010-02-23 14:55:42 +07:00
|
|
|
if (current->bio_list) {
|
2017-03-10 13:00:47 +07:00
|
|
|
bio_list_add(&current->bio_list[0], bio);
|
2015-11-06 00:41:16 +07:00
|
|
|
goto out;
|
2007-05-01 14:53:42 +07:00
|
|
|
}
|
2011-09-15 19:01:40 +07:00
|
|
|
|
2007-05-01 14:53:42 +07:00
|
|
|
/* following loop may be a bit non-obvious, and so deserves some
|
|
|
|
* explanation.
|
|
|
|
* Before entering the loop, bio->bi_next is NULL (as all callers
|
|
|
|
* ensure that) so we have a list with a single bio.
|
|
|
|
* We pretend that we have just taken it off a longer list, so
|
2010-02-23 14:55:42 +07:00
|
|
|
* we assign bio_list to a pointer to the bio_list_on_stack,
|
|
|
|
* thus initialising the bio_list of new bios to be
|
2011-09-15 19:01:40 +07:00
|
|
|
* added. ->make_request() may indeed add some more bios
|
2007-05-01 14:53:42 +07:00
|
|
|
* through a recursive call to generic_make_request. If it
|
|
|
|
* did, we find a non-NULL value in bio_list and re-enter the loop
|
|
|
|
* from the top. In this case we really did just take the bio
|
2010-02-23 14:55:42 +07:00
|
|
|
* off the top of the list (no pretending) and so remove it from
|
2011-09-15 19:01:40 +07:00
|
|
|
* bio_list, and call into ->make_request() again.
|
2007-05-01 14:53:42 +07:00
|
|
|
*/
|
|
|
|
BUG_ON(bio->bi_next);
|
2017-03-10 13:00:47 +07:00
|
|
|
bio_list_init(&bio_list_on_stack[0]);
|
|
|
|
current->bio_list = bio_list_on_stack;
|
2007-05-01 14:53:42 +07:00
|
|
|
do {
|
2011-09-15 19:01:40 +07:00
|
|
|
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
|
|
|
|
|
2015-11-26 15:13:05 +07:00
|
|
|
if (likely(blk_queue_enter(q, false) == 0)) {
|
blk: improve order of bio handling in generic_make_request()
To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.
If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.
This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.
These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().
An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.
This, by itself, is not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.
To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in the one call to a make_request_fn.
A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.
With this in place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.
Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-08 03:38:05 +07:00
|
|
|
struct bio_list lower, same;
|
|
|
|
|
|
|
|
/* Create a fresh bio_list for all subordinate requests */
|
2017-03-10 13:00:47 +07:00
|
|
|
bio_list_on_stack[1] = bio_list_on_stack[0];
|
|
|
|
bio_list_init(&bio_list_on_stack[0]);
|
2015-11-06 00:41:16 +07:00
|
|
|
ret = q->make_request_fn(q, bio);
|
2015-10-22 00:20:12 +07:00
|
|
|
|
|
|
|
blk_queue_exit(q);
|
2011-09-15 19:01:40 +07:00
|
|
|
|
blk: improve order of bio handling in generic_make_request()
To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.
If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.
This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.
These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().
An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.
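The reordering described above can be modelled in user space. The following is a minimal sketch, not the kernel code: `struct toy_bio`, `struct toy_list` and `toy_reorder()` are hypothetical names, and a plain integer stands in for the target device. It assumes `fresh` holds the bios submitted by the make_request_fn that has just run for `cur_dev`, and `older` holds whatever was queued before that call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct bio: a device id and a list link only. */
struct toy_bio {
	int dev;
	struct toy_bio *next;
};

/* Singly linked FIFO with head and tail pointers. */
struct toy_list {
	struct toy_bio *head, *tail;
};

static void toy_list_init(struct toy_list *l)
{
	l->head = l->tail = NULL;
}

/* Append one bio to the tail of the list. */
static void toy_list_add(struct toy_list *l, struct toy_bio *b)
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

/* Remove and return the bio at the head, or NULL if the list is empty. */
static struct toy_bio *toy_list_pop(struct toy_list *l)
{
	struct toy_bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
		b->next = NULL;
	}
	return b;
}

/* Append all of src to dst and leave src empty. */
static void toy_list_merge(struct toy_list *dst, struct toy_list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	toy_list_init(src);
}

/*
 * Rebuild the pending list in fresh: bios for other (lower) devices
 * first, then bios for cur_dev, then everything that was queued before.
 * Relative order inside each group is preserved.
 */
static void toy_reorder(struct toy_list *fresh, struct toy_list *older,
			int cur_dev)
{
	struct toy_list lower, same;
	struct toy_bio *b;

	assert(fresh && older);
	toy_list_init(&lower);
	toy_list_init(&same);
	while ((b = toy_list_pop(fresh)) != NULL)
		toy_list_add(b->dev == cur_dev ? &same : &lower, b);

	toy_list_merge(fresh, &lower);
	toy_list_merge(fresh, &same);
	toy_list_merge(fresh, older);
}
```

Because each group is a FIFO and the groups are concatenated rather than reversed, two bios submitted back to back for the same device keep their relative order, which is the property a plain LIFO stack would lose.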
This, by itself, is not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.
To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in a single call to a make_request_fn.
A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.
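The split pattern recommended here can be sketched as a small user-space model. Everything in it is an assumption made for illustration: `TOY_MAX_CHUNK` is a hypothetical per-request size limit, the fixed array in `struct toy_ctx` stands in for a mempool, and `toy_make_request()` plays the role of a driver's make_request_fn. The point it tries to show is that each call takes at most one element from the pool, queues the remainder, handles the first part and returns.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHUNK 4		/* hypothetical size limit, in sectors */
#define TOY_POOL_SIZE 16	/* hypothetical pool capacity */

struct toy_req {
	unsigned start, len;
	struct toy_req *next;
};

struct toy_ctx {
	struct toy_req pool[TOY_POOL_SIZE];	/* stand-in for a mempool */
	size_t pool_used;
	struct toy_req *head, *tail;		/* pending queue */
	unsigned handled_start[TOY_POOL_SIZE + 1];
	unsigned handled_len[TOY_POOL_SIZE + 1];
	size_t handled;
};

/* Append a request to the pending queue. */
static void toy_queue(struct toy_ctx *c, struct toy_req *r)
{
	r->next = NULL;
	if (c->tail)
		c->tail->next = r;
	else
		c->head = r;
	c->tail = r;
}

/* Take the next pending request, or NULL if none is left. */
static struct toy_req *toy_pop(struct toy_ctx *c)
{
	struct toy_req *r = c->head;

	if (r) {
		c->head = r->next;
		if (!c->head)
			c->tail = NULL;
	}
	return r;
}

/* One pool allocation at most: queue the tail, handle the head, return. */
static void toy_make_request(struct toy_ctx *c, struct toy_req *r)
{
	if (r->len > TOY_MAX_CHUNK) {
		struct toy_req *rest;

		assert(c->pool_used < TOY_POOL_SIZE);
		rest = &c->pool[c->pool_used++];
		rest->start = r->start + TOY_MAX_CHUNK;
		rest->len = r->len - TOY_MAX_CHUNK;
		toy_queue(c, rest);		/* second part, handled later */
		r->len = TOY_MAX_CHUNK;
	}
	assert(c->handled <= TOY_POOL_SIZE);
	c->handled_start[c->handled] = r->start;	/* "handle" first part */
	c->handled_len[c->handled] = r->len;
	c->handled++;
}

/* Drain loop: each request is fully handled before the next is popped. */
static void toy_submit(struct toy_ctx *c, struct toy_req *r)
{
	toy_queue(c, r);
	while ((r = toy_pop(c)) != NULL)
		toy_make_request(c, r);
}
```

Submitting a 10-sector request through this model handles the ranges (0,4), (4,4) and (8,2) in that order, and no single call to `toy_make_request()` ever waits on, or allocates for, more than one split.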
With this in place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.
Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-08 03:38:05 +07:00
|
|
|
/* sort new bios into those for a lower level
|
|
|
|
* and those for the same level
|
|
|
|
*/
|
|
|
|
bio_list_init(&lower);
|
|
|
|
bio_list_init(&same);
|
2017-03-10 13:00:47 +07:00
|
|
|
while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
|
2017-03-08 03:38:05 +07:00
|
|
|
if (q == bdev_get_queue(bio->bi_bdev))
|
|
|
|
bio_list_add(&same, bio);
|
|
|
|
else
|
|
|
|
bio_list_add(&lower, bio);
|
|
|
|
/* now assemble so we handle the lowest level first */
|
2017-03-10 13:00:47 +07:00
|
|
|
bio_list_merge(&bio_list_on_stack[0], &lower);
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &same);
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
|
2015-10-22 00:20:12 +07:00
|
|
|
} else {
|
|
|
|
bio_io_error(bio);
|
|
|
|
}
|
2017-03-10 13:00:47 +07:00
|
|
|
bio = bio_list_pop(&bio_list_on_stack[0]);
|
When stacked block devices are in use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of stack space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
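The idea of turning recursion into iteration with a per-thread list can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: `struct toy_task` stands in for the per-thread state, `toy_make_request()` models a stacked driver that only remaps the bio to the device below it, and `max_depth` exists purely to show that the nesting never grows.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev {
	struct toy_dev *lower;	/* NULL means this is the bottom device */
};

struct toy_bio {
	struct toy_dev *dev;
	struct toy_bio *next;	/* reused to link pending bios */
};

/* Hypothetical per-thread state, one instance per submitting thread. */
struct toy_task {
	int active;			/* inside toy_submit() already? */
	struct toy_bio *head, *tail;	/* bios deferred by nested calls */
	int depth, max_depth;		/* nesting of toy_make_request() */
	int completed;
};

static void toy_submit(struct toy_task *t, struct toy_bio *bio);

/* Stacked driver: remap to the lower device and resubmit. */
static void toy_make_request(struct toy_task *t, struct toy_bio *bio)
{
	t->depth++;
	if (t->depth > t->max_depth)
		t->max_depth = t->depth;

	if (bio->dev->lower) {
		bio->dev = bio->dev->lower;
		toy_submit(t, bio);	/* looks recursive, is only queued */
	} else {
		t->completed++;		/* bottom device: I/O "done" */
	}
	t->depth--;
}

static void toy_submit(struct toy_task *t, struct toy_bio *bio)
{
	assert(t && bio);
	if (t->active) {
		/* Nested call: defer the bio and return immediately. */
		bio->next = NULL;
		if (t->tail)
			t->tail->next = bio;
		else
			t->head = bio;
		t->tail = bio;
		return;
	}

	t->active = 1;
	do {
		bio->next = NULL;	/* clear the link before the real call */
		toy_make_request(t, bio);
		bio = t->head;		/* pick up anything that was deferred */
		if (bio) {
			t->head = bio->next;
			if (!t->head)
				t->tail = NULL;
		}
	} while (bio);
	t->active = 0;			/* deactivate */
}
```

With three stacked devices a single submission still reaches the bottom device, but `max_depth` stays at 1 because each nested call only appends to the list owned by the submitting thread. Since that list is private to one thread, the model needs no locking either.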
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 14:53:42 +07:00
|
|
|
} while (bio);
|
2010-02-23 14:55:42 +07:00
|
|
|
current->bio_list = NULL; /* deactivate */
|
2015-11-06 00:41:16 +07:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
2007-05-01 14:53:42 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
EXPORT_SYMBOL(generic_make_request);
|
|
|
|
|
|
|
|
/**
|
2008-08-20 01:13:11 +07:00
|
|
|
* submit_bio - submit a bio to the block device layer for I/O
|
2005-04-17 05:20:36 +07:00
|
|
|
* @bio: The &struct bio which describes the I/O
|
|
|
|
*
|
|
|
|
* submit_bio() is very similar in purpose to generic_make_request(), and
|
|
|
|
* uses that function to do most of the work. Both are fairly rough
|
2008-08-20 01:13:11 +07:00
|
|
|
* interfaces; @bio must be presetup and ready for I/O.
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
*/
|
2016-06-06 02:31:41 +07:00
|
|
|
blk_qc_t submit_bio(struct bio *bio)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2007-09-27 18:01:25 +07:00
|
|
|
/*
|
|
|
|
* If it's a regular read/write or a barrier with data attached,
|
|
|
|
* go through the normal accounting stuff before submission.
|
|
|
|
*/
|
2012-09-18 23:19:25 +07:00
|
|
|
if (bio_has_data(bio)) {
|
2012-09-18 23:19:27 +07:00
|
|
|
unsigned int count;
|
|
|
|
|
2016-06-06 02:31:48 +07:00
|
|
|
if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
|
2012-09-18 23:19:27 +07:00
|
|
|
count = bdev_logical_block_size(bio->bi_bdev) >> 9;
|
|
|
|
else
|
|
|
|
count = bio_sectors(bio);
|
|
|
|
|
2016-06-06 02:31:45 +07:00
|
|
|
if (op_is_write(bio_op(bio))) {
|
2007-09-27 18:01:25 +07:00
|
|
|
count_vm_events(PGPGOUT, count);
|
|
|
|
} else {
|
2013-10-12 05:44:27 +07:00
|
|
|
task_io_account_read(bio->bi_iter.bi_size);
|
2007-09-27 18:01:25 +07:00
|
|
|
count_vm_events(PGPGIN, count);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (unlikely(block_dump)) {
|
|
|
|
char b[BDEVNAME_SIZE];
|
2010-09-14 13:48:01 +07:00
|
|
|
printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n",
|
2007-10-19 13:40:40 +07:00
|
|
|
current->comm, task_pid_nr(current),
|
2016-06-06 02:31:45 +07:00
|
|
|
op_is_write(bio_op(bio)) ? "WRITE" : "READ",
|
2013-10-12 05:44:27 +07:00
|
|
|
(unsigned long long)bio->bi_iter.bi_sector,
|
2010-09-14 13:48:01 +07:00
|
|
|
bdevname(bio->bi_bdev, b),
|
|
|
|
count);
|
2007-09-27 18:01:25 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2015-11-06 00:41:16 +07:00
|
|
|
return generic_make_request(bio);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(submit_bio);
|
|
|
|
|
2008-09-18 21:45:38 +07:00
|
|
|
/**
|
2015-11-26 14:46:57 +07:00
|
|
|
* blk_cloned_rq_check_limits - Helper function to check a cloned request
|
|
|
|
* for the new queue limits
|
2008-09-18 21:45:38 +07:00
|
|
|
* @q: the queue
|
|
|
|
* @rq: the request being checked
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* @rq may have been made based on weaker limitations of upper-level queues
|
|
|
|
* in request stacking drivers, and it may violate the limitation of @q.
|
|
|
|
* Since the block layer and the underlying device driver trust @rq
|
|
|
|
* after it is inserted to @q, it should be checked against @q before
|
|
|
|
* the insertion using this generic function.
|
|
|
|
*
|
|
|
|
* Request stacking drivers like request-based dm may change the queue
|
2015-11-26 14:46:57 +07:00
|
|
|
* limits when retrying requests on other queues. Those requests need
|
|
|
|
* to be checked against the new queue limits again during dispatch.
|
2008-09-18 21:45:38 +07:00
|
|
|
*/
|
2015-11-26 14:46:57 +07:00
|
|
|
static int blk_cloned_rq_check_limits(struct request_queue *q,
|
|
|
|
struct request *rq)
|
2008-09-18 21:45:38 +07:00
|
|
|
{
|
2016-06-06 02:32:15 +07:00
|
|
|
if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
|
2008-09-18 21:45:38 +07:00
|
|
|
printk(KERN_ERR "%s: over max size limit.\n", __func__);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* queue's settings related to segment counting like q->bounce_pfn
|
|
|
|
* may differ from that of other stacking queues.
|
|
|
|
* Recalculate it to check the request correctly on this queue's
|
|
|
|
* limitation.
|
|
|
|
*/
|
|
|
|
blk_recalc_rq_segments(rq);
|
2010-02-26 12:20:39 +07:00
|
|
|
if (rq->nr_phys_segments > queue_max_segments(q)) {
|
2008-09-18 21:45:38 +07:00
|
|
|
printk(KERN_ERR "%s: over max segments limit.\n", __func__);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_insert_cloned_request - Helper for stacking drivers to submit a request
|
|
|
|
* @q: the queue to submit the request
|
|
|
|
* @rq: the request being queued
|
|
|
|
*/
|
|
|
|
int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
block: fix flush machinery for stacking drivers with differing flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
makes it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-16 02:37:25 +07:00
|
|
|
int where = ELEVATOR_INSERT_BACK;
|
2008-09-18 21:45:38 +07:00
|
|
|
|
2015-11-26 14:46:57 +07:00
|
|
|
if (blk_cloned_rq_check_limits(q, rq))
|
2008-09-18 21:45:38 +07:00
|
|
|
return -EIO;
|
|
|
|
|
2011-07-27 06:09:03 +07:00
|
|
|
if (rq->rq_disk &&
|
|
|
|
should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
|
2008-09-18 21:45:38 +07:00
|
|
|
return -EIO;
|
|
|
|
|
2014-10-18 06:46:38 +07:00
|
|
|
if (q->mq_ops) {
|
|
|
|
if (blk_queue_io_stat(q))
|
|
|
|
blk_account_io_start(rq, true);
|
2017-01-27 15:00:47 +07:00
|
|
|
blk_mq_sched_insert_request(rq, false, true, false, false);
|
2014-10-18 06:46:38 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-09-18 21:45:38 +07:00
|
|
|
spin_lock_irqsave(q->queue_lock, flags);
|
2012-11-28 19:42:38 +07:00
|
|
|
if (unlikely(blk_queue_dying(q))) {
|
2011-12-14 06:33:37 +07:00
|
|
|
spin_unlock_irqrestore(q->queue_lock, flags);
|
|
|
|
return -ENODEV;
|
|
|
|
}
|
2008-09-18 21:45:38 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Submitting request must be dequeued before calling this function
|
|
|
|
* because it will be linked to another request_queue
|
|
|
|
*/
|
|
|
|
BUG_ON(blk_queued_rq(rq));
|
|
|
|
|
2017-01-27 22:30:47 +07:00
|
|
|
if (op_is_flush(rq->cmd_flags))
|
2011-08-16 02:37:25 +07:00
|
|
|
where = ELEVATOR_INSERT_FLUSH;
|
|
|
|
|
|
|
|
add_acct_request(q, rq, where);
|
2011-10-17 17:57:23 +07:00
|
|
|
if (where == ELEVATOR_INSERT_FLUSH)
|
|
|
|
__blk_run_queue(q);
|
2008-09-18 21:45:38 +07:00
|
|
|
spin_unlock_irqrestore(q->queue_lock, flags);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
|
|
|
|
|
2009-07-03 15:48:17 +07:00
|
|
|
/**
|
|
|
|
* blk_rq_err_bytes - determine number of bytes till the next failure boundary
|
|
|
|
* @rq: request to examine
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* A request could be a merge of IOs which require different failure
|
|
|
|
* handling. This function determines the number of bytes which
|
|
|
|
* can be failed from the beginning of the request without
|
|
|
|
* crossing into an area which needs to be retried further.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* The number of bytes to fail.
|
|
|
|
*
|
|
|
|
* Context:
|
|
|
|
* queue_lock must be held.
|
|
|
|
*/
|
|
|
|
unsigned int blk_rq_err_bytes(const struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned int ff = rq->cmd_flags & REQ_FAILFAST_MASK;
|
|
|
|
unsigned int bytes = 0;
|
|
|
|
struct bio *bio;
|
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
if (!(rq->rq_flags & RQF_MIXED_MERGE))
|
2009-07-03 15:48:17 +07:00
|
|
|
return blk_rq_bytes(rq);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Currently the only 'mixing' which can happen is between
|
|
|
|
* different failfast types. We can safely fail portions
|
|
|
|
* which have all the failfast bits that the first one has -
|
|
|
|
* the ones which are at least as eager to fail as the first
|
|
|
|
* one.
|
|
|
|
*/
|
|
|
|
for (bio = rq->bio; bio; bio = bio->bi_next) {
|
2016-08-06 04:35:16 +07:00
|
|
|
if ((bio->bi_opf & ff) != ff)
|
2009-07-03 15:48:17 +07:00
|
|
|
break;
|
2013-10-12 05:44:27 +07:00
|
|
|
bytes += bio->bi_iter.bi_size;
|
2009-07-03 15:48:17 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* this could lead to infinite loop */
|
|
|
|
BUG_ON(blk_rq_bytes(rq) && !bytes);
|
|
|
|
return bytes;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
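As a rough illustration of the 1:1 versus N:M mapping mentioned above, the sketch below assigns each per-cpu queue to a hardware queue by simple modulo. This policy and the name `toy_map_queues()` are assumptions chosen for brevity; the text does not say how blk-mq actually builds its mapping, and a real implementation would likely take CPU topology into account.

```c
#include <assert.h>

/*
 * Fill map[cpu] with the index of the hardware queue that the per-cpu
 * software queue of that CPU funnels into. With nr_hw == nr_cpus this
 * is a 1:1 mapping; with fewer hardware queues several CPUs share one.
 */
static void toy_map_queues(unsigned nr_cpus, unsigned nr_hw, unsigned *map)
{
	unsigned cpu;

	assert(nr_hw > 0);
	assert(map);
	for (cpu = 0; cpu < nr_cpus; cpu++)
		map[cpu] = cpu % nr_hw;
}
```

With 4 CPUs and 4 hardware queues every CPU gets its own submission queue; with 8 CPUs and 2 hardware queues the even-numbered CPUs share queue 0 and the odd-numbered ones share queue 1.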
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 15:20:05 +07:00
|
|
|
void blk_account_io_completion(struct request *req, unsigned int bytes)
|
2009-01-23 16:54:44 +07:00
|
|
|
{
|
2009-04-24 13:10:11 +07:00
|
|
|
if (blk_do_io_stat(req)) {
|
2009-01-23 16:54:44 +07:00
|
|
|
const int rw = rq_data_dir(req);
|
|
|
|
struct hd_struct *part;
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
cpu = part_stat_lock();
|
2011-01-05 22:57:38 +07:00
|
|
|
part = req->part;
|
2009-01-23 16:54:44 +07:00
|
|
|
part_stat_add(cpu, part, sectors[rw], bytes >> 9);
|
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-10-24 15:20:05 +07:00
|
|
|
void blk_account_io_done(struct request *req)
|
2009-01-23 16:54:44 +07:00
|
|
|
{
|
|
|
|
/*
|
2010-09-03 16:56:16 +07:00
|
|
|
* Account IO completion. flush_rq isn't accounted as a
|
|
|
|
* normal IO on queueing nor completion. Accounting the
|
|
|
|
* containing request is enough.
|
2009-01-23 16:54:44 +07:00
|
|
|
*/
|
2016-10-20 20:12:13 +07:00
|
|
|
if (blk_do_io_stat(req) && !(req->rq_flags & RQF_FLUSH_SEQ)) {
|
2009-01-23 16:54:44 +07:00
|
|
|
unsigned long duration = jiffies - req->start_time;
|
|
|
|
const int rw = rq_data_dir(req);
|
|
|
|
struct hd_struct *part;
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
cpu = part_stat_lock();
|
2011-01-05 22:57:38 +07:00
|
|
|
part = req->part;
|
2009-01-23 16:54:44 +07:00
|
|
|
|
|
|
|
part_stat_inc(cpu, part, ios[rw]);
|
|
|
|
part_stat_add(cpu, part, ticks[rw], duration);
|
|
|
|
part_round_stats(cpu, part);
|
block: Separate read and write statistics of in_flight requests v2
Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
and write statistics of in_flight requests, and exported the number
of read and write requests in progress separately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.
So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b
The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-07 01:16:55 +07:00
|
|
|
part_dec_in_flight(part, rw);
|
2009-01-23 16:54:44 +07:00
|
|
|
|
2011-01-07 14:43:37 +07:00
|
|
|
hd_struct_put(part);
|
2009-01-23 16:54:44 +07:00
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-12-04 07:00:23 +07:00
|
|
|
#ifdef CONFIG_PM
|
2013-03-23 10:42:27 +07:00
|
|
|
/*
|
|
|
|
* Don't process normal requests when queue is suspended
|
|
|
|
* or in the process of suspending/resuming
|
|
|
|
*/
|
|
|
|
static struct request *blk_pm_peek_request(struct request_queue *q,
|
|
|
|
struct request *rq)
|
|
|
|
{
|
|
|
|
if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
|
2016-10-20 20:12:13 +07:00
|
|
|
(q->rpm_status != RPM_ACTIVE && !(rq->rq_flags & RQF_PM))))
|
2013-03-23 10:42:27 +07:00
|
|
|
return NULL;
|
|
|
|
else
|
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline struct request *blk_pm_peek_request(struct request_queue *q,
|
|
|
|
struct request *rq)
|
|
|
|
{
|
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2013-10-24 15:20:05 +07:00
|
|
|
void blk_account_io_start(struct request *rq, bool new_io)
|
|
|
|
{
|
|
|
|
struct hd_struct *part;
|
|
|
|
int rw = rq_data_dir(rq);
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
if (!blk_do_io_stat(rq))
|
|
|
|
return;
|
|
|
|
|
|
|
|
cpu = part_stat_lock();
|
|
|
|
|
|
|
|
if (!new_io) {
|
|
|
|
part = rq->part;
|
|
|
|
part_stat_inc(cpu, part, merges[rw]);
|
|
|
|
} else {
|
|
|
|
part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
|
|
|
|
if (!hd_struct_try_get(part)) {
|
|
|
|
/*
|
|
|
|
* The partition is already being removed,
|
|
|
|
* the request will be accounted on the disk only
|
|
|
|
*
|
|
|
|
* We take a reference on disk->part0 although that
|
|
|
|
* partition will never be deleted, so we can treat
|
|
|
|
* it as any other partition.
|
|
|
|
*/
|
|
|
|
part = &rq->rq_disk->part0;
|
|
|
|
hd_struct_get(part);
|
|
|
|
}
|
|
|
|
part_round_stats(cpu, part);
|
|
|
|
part_inc_in_flight(part, rw);
|
|
|
|
rq->part = part;
|
|
|
|
}
|
|
|
|
|
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
|
2007-12-12 05:52:28 +07:00
|
|
|
/**
|
2009-05-08 09:54:16 +07:00
|
|
|
* blk_peek_request - peek at the top of a request queue
|
|
|
|
* @q: request queue to peek at
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Return the request at the top of @q. The returned request
|
|
|
|
* should be started using blk_start_request() before LLD starts
|
|
|
|
* processing it.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* Pointer to the request at the top of @q if available. Null
|
|
|
|
* otherwise.
|
|
|
|
*
|
|
|
|
* Context:
|
|
|
|
* queue_lock must be held.
|
|
|
|
*/
|
|
|
|
struct request *blk_peek_request(struct request_queue *q)
|
2009-04-23 09:05:18 +07:00
|
|
|
{
|
|
|
|
struct request *rq;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
while ((rq = __elv_next_request(q)) != NULL) {
|
2013-03-23 10:42:27 +07:00
|
|
|
|
|
|
|
rq = blk_pm_peek_request(q, rq);
|
|
|
|
if (!rq)
|
|
|
|
break;
|
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
if (!(rq->rq_flags & RQF_STARTED)) {
|
2009-04-23 09:05:18 +07:00
|
|
|
/*
|
|
|
|
* This is the first time the device driver
|
|
|
|
* sees this request (possibly after
|
|
|
|
* requeueing). Notify IO scheduler.
|
|
|
|
*/
|
2016-10-20 20:12:13 +07:00
|
|
|
if (rq->rq_flags & RQF_SORTED)
|
2009-04-23 09:05:18 +07:00
|
|
|
elv_activate_rq(q, rq);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* just mark as started even if we don't start
|
|
|
|
* it, a request that has been delayed should
|
|
|
|
* not be passed by new incoming requests
|
|
|
|
*/
|
2016-10-20 20:12:13 +07:00
|
|
|
rq->rq_flags |= RQF_STARTED;
|
2009-04-23 09:05:18 +07:00
|
|
|
trace_block_rq_issue(q, rq);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!q->boundary_rq || q->boundary_rq == rq) {
|
|
|
|
q->end_sector = rq_end_sector(rq);
|
|
|
|
q->boundary_rq = NULL;
|
|
|
|
}
|
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
if (rq->rq_flags & RQF_DONTPREP)
|
2009-04-23 09:05:18 +07:00
|
|
|
break;
|
|
|
|
|
2009-05-07 20:24:41 +07:00
|
|
|
if (q->dma_drain_size && blk_rq_bytes(rq)) {
|
2009-04-23 09:05:18 +07:00
|
|
|
/*
|
|
|
|
* make sure space for the drain appears; we
|
|
|
|
* know we can do this because max_hw_segments
|
|
|
|
* has been adjusted to be one fewer than the
|
|
|
|
* device can handle
|
|
|
|
*/
|
|
|
|
rq->nr_phys_segments++;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!q->prep_rq_fn)
|
|
|
|
break;
|
|
|
|
|
|
|
|
ret = q->prep_rq_fn(q, rq);
|
|
|
|
if (ret == BLKPREP_OK) {
|
|
|
|
break;
|
|
|
|
} else if (ret == BLKPREP_DEFER) {
|
|
|
|
/*
|
|
|
|
* the request may have been (partially) prepped.
|
|
|
|
* we need to keep this request in the front to
|
2016-10-20 20:12:13 +07:00
|
|
|
* avoid resource deadlock. RQF_STARTED will
|
2009-04-23 09:05:18 +07:00
|
|
|
* prevent other fs requests from passing this one.
|
|
|
|
*/
|
2009-05-07 20:24:41 +07:00
|
|
|
if (q->dma_drain_size && blk_rq_bytes(rq) &&
|
2016-10-20 20:12:13 +07:00
|
|
|
!(rq->rq_flags & RQF_DONTPREP)) {
|
2009-04-23 09:05:18 +07:00
|
|
|
/*
|
|
|
|
* remove the space for the drain we added
|
|
|
|
* so that we don't add it again
|
|
|
|
*/
|
|
|
|
--rq->nr_phys_segments;
|
|
|
|
}
|
|
|
|
|
|
|
|
rq = NULL;
|
|
|
|
break;
|
2016-02-04 12:52:12 +07:00
|
|
|
} else if (ret == BLKPREP_KILL || ret == BLKPREP_INVALID) {
|
|
|
|
int err = (ret == BLKPREP_INVALID) ? -EREMOTEIO : -EIO;
|
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
rq->rq_flags |= RQF_QUIET;
|
2009-05-30 11:43:49 +07:00
|
|
|
/*
|
|
|
|
* Mark this request as started so we don't trigger
|
|
|
|
* any debug logic in the end I/O path.
|
|
|
|
*/
|
|
|
|
blk_start_request(rq);
|
2016-02-04 12:52:12 +07:00
|
|
|
__blk_end_request_all(rq, err);
|
2009-04-23 09:05:18 +07:00
|
|
|
} else {
|
|
|
|
printk(KERN_ERR "%s: bad return=%d\n", __func__, ret);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return rq;
|
|
|
|
}
|
2009-05-08 09:54:16 +07:00
|
|
|
EXPORT_SYMBOL(blk_peek_request);
|
2009-04-23 09:05:18 +07:00
|
|
|
|
2009-05-08 09:54:16 +07:00
|
|
|
void blk_dequeue_request(struct request *rq)
|
2009-04-23 09:05:18 +07:00
|
|
|
{
|
2009-05-08 09:54:16 +07:00
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
BUG_ON(list_empty(&rq->queuelist));
|
|
|
|
BUG_ON(ELV_ON_HASH(rq));
|
|
|
|
|
|
|
|
list_del_init(&rq->queuelist);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the time frame between a request being removed from the lists
|
|
|
|
* and it being freed is accounted as io that is in progress at
|
|
|
|
* the driver side.
|
|
|
|
*/
|
2010-04-02 05:01:41 +07:00
|
|
|
if (blk_account_rq(rq)) {
|
2009-05-20 13:54:31 +07:00
|
|
|
q->in_flight[rq_is_sync(rq)]++;
|
2010-04-02 05:01:41 +07:00
|
|
|
set_io_start_time_ns(rq);
|
|
|
|
}
|
2009-04-23 09:05:18 +07:00
|
|
|
}
|
|
|
|
|
2009-05-08 09:54:16 +07:00
|
|
|
/**
|
|
|
|
* blk_start_request - start request processing on the driver
|
|
|
|
* @req: request to dequeue
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Dequeue @req and start timeout timer on it. This hands off the
|
|
|
|
* request to the driver.
|
|
|
|
*
|
|
|
|
* Block internal functions which don't want to start timer should
|
|
|
|
* call blk_dequeue_request().
|
|
|
|
*
|
|
|
|
* Context:
|
|
|
|
* queue_lock must be held.
|
|
|
|
*/
|
|
|
|
void blk_start_request(struct request *req)
|
|
|
|
{
|
|
|
|
blk_dequeue_request(req);
|
|
|
|
|
2016-11-08 11:32:37 +07:00
|
|
|
if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
|
2017-03-28 05:19:41 +07:00
|
|
|
blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
|
2016-11-08 11:32:37 +07:00
|
|
|
req->rq_flags |= RQF_STATS;
|
block: hook up writeback throttling
Enable throttling of buffered writeback so that it is much
smoother and has far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration from the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 02:38:14 +07:00
|
|
|
wbt_issue(req->q->rq_wb, &req->issue_stat);
|
2016-11-08 11:32:37 +07:00
|
|
|
}
|
|
|
|
|
block: fix race between request completion and timeout handling
crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
RIP: 0010:[<ffffffff8124e424>] [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
RSP: 0018:ffff881057eefd60 EFLAGS: 00010012
RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
Stack:
0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
<0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
<0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
Call Trace:
[<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
[<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
[<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
[<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
[<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
[<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
[<ffffffff810908c6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090830>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
RIP [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
RSP <ffff881057eefd60>
The RIP is this line:
BUG_ON(blk_queued_rq(rq));
After digging through the code, I think there may be a race between the
request completion and the timer handler running.
A timer is started for each request put on the device's queue (see
blk_start_request->blk_add_timer). If the request does not complete
before the timer expires, the timer handler (blk_rq_timed_out_timer)
will mark the request complete atomically:
static inline int blk_mark_rq_complete(struct request *rq)
{
return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}
and then call blk_rq_timed_out. The latter function will call
scsi_times_out, which will return one of BLK_EH_HANDLED,
BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
returned, blk_clear_rq_complete is called, and blk_add_timer is again
called to simply wait longer for the request to complete.
Now, if the request happens to complete while this is going on, what
happens? Given that we know the completion handler will bail if it
finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
handler running after that bit is cleared. So, from the above
paragraph, after the call to blk_clear_rq_complete. If the completion
sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
there (I haven't seen this in the cores). Next, if we get the
completion before the call to list_add_tail, then the timer will
eventually fire for an old req, which may either be freed or reallocated
(there is evidence that this might be the case). Finally, if the
completion comes in *after* the addition to the timeout list, I think
it's harmless. The request will be removed from the timeout list,
REQ_ATOM_COMPLETE will be set, and all will be well.
This will only actually explain the coredumps *IF* the request
structure was freed, reallocated *and* queued before the error handler
thread had a chance to process it. That is possible, but it may make
sense to keep digging for another race. I think that if this is what
was happening, we would see other instances of this problem showing up
as null pointer or garbage pointer dereferences, for example when the
request structure was not re-used. It looks like we actually do run
into that situation in other reports.
This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
&req->atomic_flags)); from blk_add_timer to the only caller that could
trip over it (blk_start_request). It then inverts the calls to
blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
the race. I've boot tested this patch, but nothing more.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-09 01:36:41 +07:00
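The commit message above relies on the completion flag being claimed atomically, so that exactly one of the completion path and the timeout path owns the request at a time. The following is a minimal user-space sketch of that idea only, not the kernel implementation: `struct toy_atomic_rq` and the `toy_*` helpers are hypothetical names, and C11 `atomic_flag` stands in for `test_and_set_bit()` on `REQ_ATOM_COMPLETE`. Note that `toy_claim_completion()` returns true when the caller wins, which is the inverse of the return value of `blk_mark_rq_complete()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct request; only the completion flag is modelled. */
struct toy_atomic_rq {
	atomic_flag complete;	/* plays the role of REQ_ATOM_COMPLETE */
};

/* Start with the flag clear: nobody owns the completion yet. */
static void toy_rq_init(struct toy_atomic_rq *rq)
{
	atomic_flag_clear(&rq->complete);
}

/*
 * Atomically set the flag and report whether this caller was the one
 * that set it. Only one of two racing callers can get "true".
 */
static bool toy_claim_completion(struct toy_atomic_rq *rq)
{
	return !atomic_flag_test_and_set(&rq->complete);
}

/*
 * Give ownership back, as blk_clear_rq_complete() does when the timeout
 * handler decides to wait longer. After this, a late completion can
 * claim the request again.
 */
static void toy_clear_completion(struct toy_atomic_rq *rq)
{
	atomic_flag_clear(&rq->complete);
}
```

In this model, the fix described in the message corresponds to re-arming the timer before calling `toy_clear_completion()`, so that a completion that claims the flag immediately afterwards always finds the request on the timeout list.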
|
|
|
BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags));
|
2009-05-08 09:54:16 +07:00
|
|
|
blk_add_timer(req);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_start_request);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_fetch_request - fetch a request from a request queue
|
|
|
|
* @q: request queue to fetch a request from
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Return the request at the top of @q. The request is started on
|
|
|
|
* return and LLD can start processing it immediately.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* Pointer to the request at the top of @q if available. Null
|
|
|
|
* otherwise.
|
|
|
|
*
|
|
|
|
* Context:
|
|
|
|
* queue_lock must be held.
|
|
|
|
*/
|
|
|
|
struct request *blk_fetch_request(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct request *rq;
|
|
|
|
|
|
|
|
rq = blk_peek_request(q);
|
|
|
|
if (rq)
|
|
|
|
blk_start_request(rq);
|
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_fetch_request);
|
|
|
|
|
2007-12-12 05:52:28 +07:00
|
|
|
/**
|
2009-04-23 09:05:18 +07:00
|
|
|
* blk_update_request - Special helper function for request stacking drivers
|
2009-06-12 10:00:41 +07:00
|
|
|
* @req: the request being processed
|
2008-08-20 01:13:11 +07:00
|
|
|
* @error: %0 for success, < %0 for error
|
2009-06-12 10:00:41 +07:00
|
|
|
* @nr_bytes: number of bytes to complete @req
|
2007-12-12 05:52:28 +07:00
|
|
|
*
|
|
|
|
* Description:
|
2009-06-12 10:00:41 +07:00
|
|
|
* Ends I/O on a number of bytes attached to @req, but doesn't complete
|
|
|
|
* the request structure even if @req doesn't have leftover.
|
|
|
|
* If @req has leftover, sets it up for the next range of segments.
|
2009-04-23 09:05:18 +07:00
|
|
|
*
|
|
|
|
* This special helper function is only for request stacking drivers
|
|
|
|
* (e.g. request-based dm) so that they can handle partial completion.
|
|
|
|
* Actual device drivers should use blk_end_request instead.
|
|
|
|
*
|
|
|
|
* Passing the result of blk_rq_bytes() as @nr_bytes guarantees
|
|
|
|
* %false return from this function.
|
2007-12-12 05:52:28 +07:00
|
|
|
*
|
|
|
|
* Return:
|
2009-04-23 09:05:18 +07:00
|
|
|
* %false - this request doesn't have any more data
|
|
|
|
* %true - this request has more data
|
2007-12-12 05:52:28 +07:00
|
|
|
**/
|
2009-04-23 09:05:18 +07:00
|
|
|
bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-09-21 06:38:30 +07:00
|
|
|
int total_bytes;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-10-01 19:32:31 +07:00
|
|
|
trace_block_rq_complete(req->q, req, nr_bytes);
|
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
if (!req->bio)
|
|
|
|
return false;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2009-04-19 05:00:41 +07:00
|
|
|
* For fs requests, rq is just carrier of independent bio's
|
|
|
|
* and each partial completion should be handled separately.
|
|
|
|
* Reset per-request error on each partial completion.
|
|
|
|
*
|
|
|
|
* TODO: tj: This is too subtle. It would be better to let
|
|
|
|
* low level drivers do what they see fit.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2017-01-31 22:57:29 +07:00
|
|
|
if (!blk_rq_is_passthrough(req))
|
2005-04-17 05:20:36 +07:00
|
|
|
req->errors = 0;
|
|
|
|
|
2017-01-31 22:57:29 +07:00
|
|
|
if (error && !blk_rq_is_passthrough(req) &&
|
2016-10-20 20:12:13 +07:00
|
|
|
!(req->rq_flags & RQF_QUIET)) {
|
2011-01-18 16:13:13 +07:00
|
|
|
char *error_type;
|
|
|
|
|
|
|
|
switch (error) {
|
|
|
|
case -ENOLINK:
|
|
|
|
error_type = "recoverable transport";
|
|
|
|
break;
|
|
|
|
case -EREMOTEIO:
|
|
|
|
error_type = "critical target";
|
|
|
|
break;
|
|
|
|
case -EBADE:
|
|
|
|
error_type = "critical nexus";
|
|
|
|
break;
|
2013-01-30 16:26:16 +07:00
|
|
|
case -ETIMEDOUT:
|
|
|
|
error_type = "timeout";
|
|
|
|
break;
|
2013-07-01 20:16:25 +07:00
|
|
|
case -ENOSPC:
|
|
|
|
error_type = "critical space allocation";
|
|
|
|
break;
|
2013-07-01 20:16:26 +07:00
|
|
|
case -ENODATA:
|
|
|
|
error_type = "critical medium";
|
|
|
|
break;
|
2011-01-18 16:13:13 +07:00
|
|
|
case -EIO:
|
|
|
|
default:
|
|
|
|
error_type = "I/O";
|
|
|
|
break;
|
|
|
|
}
|
block: make blk_update_request print prefix match ratelimited prefix
In blk_update_request, change the printk_ratelimited
prefix from end_request to blk_update_request so it
matches the name printed if rate limiting occurs.
Old:
[10234.933106] blk_update_request: 174 callbacks suppressed
[10234.934940] end_request: critical target error, dev sdr, sector 16
[10234.949788] end_request: critical target error, dev sdr, sector 16
New:
[16863.445173] blk_update_request: 398 callbacks suppressed
[16863.447029] blk_update_request: critical target error, dev sdr, sector
1442066176
[16863.449383] blk_update_request: critical target error, dev sdr, sector
802802888
[16863.451680] blk_update_request: critical target error, dev sdr, sector
1609535456
Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-27 22:50:31 +07:00
|
|
|
printk_ratelimited(KERN_ERR "%s: %s error, dev %s, sector %llu\n",
|
|
|
|
__func__, error_type, req->rq_disk ?
|
2012-08-31 06:26:25 +07:00
|
|
|
req->rq_disk->disk_name : "?",
|
|
|
|
(unsigned long long)blk_rq_pos(req));
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2009-01-23 16:54:44 +07:00
|
|
|
blk_account_io_completion(req, nr_bytes);
|
2005-11-01 14:35:42 +07:00
|
|
|
|
2012-09-21 06:38:30 +07:00
|
|
|
total_bytes = 0;
|
|
|
|
while (req->bio) {
|
|
|
|
struct bio *bio = req->bio;
|
2013-10-12 05:44:27 +07:00
|
|
|
unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
if (bio_bytes == bio->bi_iter.bi_size)
|
2005-04-17 05:20:36 +07:00
|
|
|
req->bio = bio->bi_next;
|
|
|
|
|
block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handled directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this we can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it is re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bio_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 22:40:52 +07:00
|
|
|
/* Completion has already been traced */
|
|
|
|
bio_clear_flag(bio, BIO_TRACE_COMPLETION);
|
2012-09-21 06:38:30 +07:00
|
|
|
req_bio_endio(req, bio, bio_bytes, error);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-09-21 06:38:30 +07:00
|
|
|
total_bytes += bio_bytes;
|
|
|
|
nr_bytes -= bio_bytes;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-09-21 06:38:30 +07:00
|
|
|
if (!nr_bytes)
|
|
|
|
break;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* completely done
|
|
|
|
*/
|
2009-04-23 09:05:18 +07:00
|
|
|
if (!req->bio) {
|
|
|
|
/*
|
|
|
|
* Reset counters so that the request stacking driver
|
|
|
|
* can find how many bytes remain in the request
|
|
|
|
* later.
|
|
|
|
*/
|
2009-05-07 20:24:44 +07:00
|
|
|
req->__data_len = 0;
|
2009-04-23 09:05:18 +07:00
|
|
|
return false;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2016-12-09 05:20:32 +07:00
|
|
|
WARN_ON_ONCE(req->rq_flags & RQF_SPECIAL_PAYLOAD);
|
|
|
|
|
2009-05-07 20:24:44 +07:00
|
|
|
req->__data_len -= total_bytes;
|
2009-05-07 20:24:41 +07:00
|
|
|
|
|
|
|
/* update sector only for requests with clear definition of sector */
|
2017-01-31 22:57:29 +07:00
|
|
|
if (!blk_rq_is_passthrough(req))
|
2009-05-07 20:24:44 +07:00
|
|
|
req->__sector += total_bytes >> 9;
|
2009-05-07 20:24:41 +07:00
|
|
|
|
2009-07-03 15:48:17 +07:00
|
|
|
/* mixed attributes always follow the first bio */
|
2016-10-20 20:12:13 +07:00
|
|
|
if (req->rq_flags & RQF_MIXED_MERGE) {
|
2009-07-03 15:48:17 +07:00
|
|
|
req->cmd_flags &= ~REQ_FAILFAST_MASK;
|
2016-08-06 04:35:16 +07:00
|
|
|
req->cmd_flags |= req->bio->bi_opf & REQ_FAILFAST_MASK;
|
2009-07-03 15:48:17 +07:00
|
|
|
}
|
|
|
|
|
2009-05-07 20:24:41 +07:00
|
|
|
/*
|
|
|
|
* If total number of sectors is less than the first segment
|
|
|
|
* size, something has gone terribly wrong.
|
|
|
|
*/
|
|
|
|
if (blk_rq_bytes(req) < blk_rq_cur_bytes(req)) {
|
2011-03-30 14:51:33 +07:00
|
|
|
blk_dump_rq_flags(req, "request botched");
|
2009-05-07 20:24:44 +07:00
|
|
|
req->__data_len = blk_rq_cur_bytes(req);
|
2009-05-07 20:24:41 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* recalculate the number of segments */
|
2005-04-17 05:20:36 +07:00
|
|
|
blk_recalc_rq_segments(req);
|
2009-05-07 20:24:41 +07:00
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
return true;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2009-04-23 09:05:18 +07:00
|
|
|
EXPORT_SYMBOL_GPL(blk_update_request);
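The byte and sector bookkeeping at the end of blk_update_request() can be sketched in userspace. This is a minimal model under stated assumptions, not the kernel code: struct model_request and model_update() are hypothetical names, a zero remaining length stands in for the real `!req->bio` check, and the "request botched" clamp and segment recalculation are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct request fields used here. */
struct model_request {
	uint64_t sector;       /* models req->__sector, in 512-byte units */
	unsigned int data_len; /* models req->__data_len, bytes still pending */
	bool passthrough;      /* models blk_rq_is_passthrough(req) */
};

/*
 * Account for total_bytes of completed data.
 * Returns true while bytes remain, false once the request is fully done,
 * mirroring the return convention of the function above.
 */
bool model_update(struct model_request *rq, unsigned int total_bytes)
{
	/* Assumption of this model: callers never over-complete. */
	assert(total_bytes <= rq->data_len);

	rq->data_len -= total_bytes;
	if (rq->data_len == 0)
		return false; /* completely done; sector is left untouched */

	/* Only requests with a well-defined sector advance it. */
	if (!rq->passthrough)
		rq->sector += total_bytes >> 9; /* bytes to 512-byte sectors */

	return true;
}
```

Completing 4096 bytes of an 8192-byte request that starts at sector 100 would leave 4096 bytes pending at sector 108, since 4096 >> 9 is 8.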
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
static bool blk_update_bidi_request(struct request *rq, int error,
|
|
|
|
unsigned int nr_bytes,
|
|
|
|
unsigned int bidi_bytes)
|
2009-04-23 09:05:18 +07:00
|
|
|
{
|
2009-04-23 09:05:18 +07:00
|
|
|
if (blk_update_request(rq, error, nr_bytes))
|
|
|
|
return true;
|
2009-04-23 09:05:18 +07:00
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
/* Bidi request must be completed as a whole */
|
|
|
|
if (unlikely(blk_bidi_rq(rq)) &&
|
|
|
|
blk_update_request(rq->next_rq, error, bidi_bytes))
|
|
|
|
return true;
|
2009-04-23 09:05:18 +07:00
|
|
|
|
2010-06-09 15:42:09 +07:00
|
|
|
if (blk_queue_add_random(rq->q))
|
|
|
|
add_disk_randomness(rq->rq_disk);
|
2009-04-23 09:05:18 +07:00
|
|
|
|
|
|
|
return false;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2010-07-01 17:49:17 +07:00
|
|
|
/**
|
|
|
|
* blk_unprep_request - unprepare a request
|
|
|
|
* @req: the request
|
|
|
|
*
|
|
|
|
* This function makes a request ready for complete resubmission (or
|
|
|
|
* completion). It happens only after all error handling is complete,
|
|
|
|
* so represents the appropriate moment to deallocate any resources
|
|
|
|
* that were allocated to the request in the prep_rq_fn. The queue
|
|
|
|
* lock is held when calling this.
|
|
|
|
*/
|
|
|
|
void blk_unprep_request(struct request *req)
|
|
|
|
{
|
|
|
|
struct request_queue *q = req->q;
|
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
req->rq_flags &= ~RQF_DONTPREP;
|
2010-07-01 17:49:17 +07:00
|
|
|
if (q->unprep_rq_fn)
|
|
|
|
q->unprep_rq_fn(q, req);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_unprep_request);
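The pattern used by blk_unprep_request() (clear a "prepared" flag, then call an optional per-queue hook) can be illustrated outside the kernel. The sketch below is an assumption-laden model: every toy_* name and the TOY_DONTPREP bit value are hypothetical, and no locking is modelled even though the real function runs with the queue lock held.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DONTPREP (1u << 0) /* hypothetical flag bit, not the kernel value */

struct toy_queue;
struct toy_req;

/* Optional per-queue hook, modelled on q->unprep_rq_fn. */
typedef void (*toy_unprep_fn)(struct toy_queue *q, struct toy_req *rq);

struct toy_queue {
	toy_unprep_fn unprep_fn; /* may be NULL */
	int unprep_calls;        /* bookkeeping for this example only */
};

struct toy_req {
	struct toy_queue *q;
	unsigned int flags;
};

/* Example hook: just counts how often it was invoked. */
void toy_count_unprep(struct toy_queue *q, struct toy_req *rq)
{
	(void)rq;
	q->unprep_calls++;
}

/* Clear the flag first, then give the queue a chance to release resources. */
void toy_unprep(struct toy_req *rq)
{
	assert(rq != NULL && rq->q != NULL);

	struct toy_queue *q = rq->q;

	rq->flags &= ~TOY_DONTPREP;
	if (q->unprep_fn)
		q->unprep_fn(q, rq);
}
```

Clearing the flag before calling the hook means a resubmitted request goes through preparation again, which matches the "ready for complete resubmission" wording in the comment above.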
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* queue lock must be held
|
|
|
|
*/
|
2014-04-16 14:44:59 +07:00
|
|
|
void blk_finish_request(struct request *req, int error)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2016-11-08 11:32:37 +07:00
|
|
|
struct request_queue *q = req->q;
|
|
|
|
|
|
|
|
if (req->rq_flags & RQF_STATS)
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 22:56:08 +07:00
|
|
|
blk_stat_add(req);
|
2016-11-08 11:32:37 +07:00
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
if (req->rq_flags & RQF_QUEUED)
|
2016-11-08 11:32:37 +07:00
|
|
|
blk_queue_end_tag(q, req);
|
2007-12-12 05:53:24 +07:00
|
|
|
|
2009-05-27 19:17:08 +07:00
|
|
|
BUG_ON(blk_queued_rq(req));
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-01-31 22:57:29 +07:00
|
|
|
if (unlikely(laptop_mode) && !blk_rq_is_passthrough(req))
|
2017-02-02 21:56:50 +07:00
|
|
|
laptop_io_completion(req->q->backing_dev_info);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-10-30 16:16:20 +07:00
|
|
|
blk_delete_timer(req);
|
|
|
|
|
2016-10-20 20:12:13 +07:00
|
|
|
if (req->rq_flags & RQF_DONTPREP)
|
2010-07-01 17:49:17 +07:00
|
|
|
blk_unprep_request(req);
|
|
|
|
|
2009-01-23 16:54:44 +07:00
|
|
|
blk_account_io_done(req);
|
2007-12-12 05:53:24 +07:00
|
|
|
|
block: hook up writeback throttling
Enable throttling of buffered writeback so that it is much smoother
and has far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 02:38:14 +07:00
|
|
|
if (req->end_io) {
|
|
|
|
wbt_done(req->q->rq_wb, &req->issue_stat);
|
2006-01-06 15:49:03 +07:00
|
|
|
req->end_io(req, error);
|
2016-11-10 02:38:14 +07:00
|
|
|
} else {
|
2007-12-12 05:53:24 +07:00
|
|
|
if (blk_bidi_rq(req))
|
|
|
|
__blk_put_request(req->next_rq->q, req->next_rq);
|
|
|
|
|
2016-11-08 11:32:37 +07:00
|
|
|
__blk_put_request(q, req);
|
2007-12-12 05:53:24 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2014-04-16 14:44:59 +07:00
|
|
|
EXPORT_SYMBOL(blk_finish_request);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-12-12 05:41:17 +07:00
|
|
|
/**
|
2009-04-23 09:05:18 +07:00
|
|
|
* blk_end_bidi_request - Complete a bidi request
|
|
|
|
* @rq: the request to complete
|
|
|
|
* @error: %0 for success, < %0 for error
|
|
|
|
* @nr_bytes: number of bytes to complete @rq
|
|
|
|
* @bidi_bytes: number of bytes to complete @rq->next_rq
|
2007-09-21 15:41:07 +07:00
|
|
|
*
|
|
|
|
* Description:
|
2007-12-12 05:51:46 +07:00
|
|
|
* Ends I/O on a number of bytes attached to @rq and @rq->next_rq.
|
2009-04-23 09:05:18 +07:00
|
|
|
* Drivers that support bidi can safely call this function for any
|
|
|
|
* type of request, bidi or uni. In the latter case @bidi_bytes is
|
|
|
|
* just ignored.
|
2007-12-12 05:40:30 +07:00
|
|
|
*
|
|
|
|
* Return:
|
2009-04-23 09:05:18 +07:00
|
|
|
* %false - we are done with this request
|
|
|
|
* %true - still buffers pending for this request
|
2007-09-21 15:41:07 +07:00
|
|
|
**/
|
2009-05-11 15:56:09 +07:00
|
|
|
static bool blk_end_bidi_request(struct request *rq, int error,
|
block: add request update interface
This patch adds blk_update_request(), which updates struct request
with completing its data part, but doesn't complete the struct
request itself.
Though it looks like end_that_request_first() of older kernels,
blk_update_request() should be used only by request stacking drivers.
Request-based dm will use it in bio->bi_end_io callback to update
the original request when a data part of a cloned request completes.
The following is additional background on why request-based
dm needs this interface.
- Request stacking drivers can't use blk_end_request() directly from
the lower driver's completion context (bio->bi_end_io or rq->end_io),
because some device drivers (e.g. ide) may try to complete
their request with queue lock held, and it may cause deadlock.
See below for detailed description of possible deadlock:
<http://marc.info/?l=linux-kernel&m=120311479108569&w=2>
- To solve that, request-based dm offloads the completion of
cloned struct request to softirq context (i.e. using
blk_complete_request() from rq->end_io).
- Though it is possible to use the same solution from bio->bi_end_io,
it will delay the notification of bio completion to the original
submitter. Also, it will cause inefficient partial completion,
because the lower driver can't perform the cloned request anymore
and request-based dm needs to requeue and redispatch it to
the lower driver again later. That's not good.
- So request-based dm needs blk_update_request() to perform the bio
completion in the lower driver's completion context, which is more
efficient.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-09-18 21:45:09 +07:00
|
|
|
unsigned int nr_bytes, unsigned int bidi_bytes)
|
|
|
|
{
|
2007-12-12 05:40:30 +07:00
|
|
|
struct request_queue *q = rq->q;
|
2009-04-23 09:05:18 +07:00
|
|
|
unsigned long flags;
|
2008-09-18 21:45:09 +07:00
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
|
|
|
|
return true;
|
2008-09-18 21:45:09 +07:00
|
|
|
|
2007-12-12 05:40:30 +07:00
|
|
|
spin_lock_irqsave(q->queue_lock, flags);
|
2009-04-23 09:05:18 +07:00
|
|
|
blk_finish_request(rq, error);
|
2007-12-12 05:40:30 +07:00
|
|
|
spin_unlock_irqrestore(q->queue_lock, flags);
|
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
return false;
|
2008-09-18 21:45:09 +07:00
|
|
|
}
|
|
|
|
|
2007-12-12 05:40:30 +07:00
|
|
|
/**
|
2009-04-23 09:05:18 +07:00
|
|
|
* __blk_end_bidi_request - Complete a bidi request with queue lock held
|
|
|
|
* @rq: the request to complete
|
2008-08-20 01:13:11 +07:00
|
|
|
* @error: %0 for success, < %0 for error
|
2007-12-12 05:51:46 +07:00
|
|
|
* @nr_bytes: number of bytes to complete @rq
|
|
|
|
* @bidi_bytes: number of bytes to complete @rq->next_rq
|
2007-12-12 05:40:30 +07:00
|
|
|
*
|
|
|
|
* Description:
|
2009-04-23 09:05:18 +07:00
|
|
|
* Identical to blk_end_bidi_request() except that queue lock is
|
|
|
|
* assumed to be locked on entry and remains so on return.
|
2007-12-12 05:40:30 +07:00
|
|
|
*
|
|
|
|
* Return:
|
2009-04-23 09:05:18 +07:00
|
|
|
* %false - we are done with this request
|
|
|
|
* %true - still buffers pending for this request
|
2007-12-12 05:40:30 +07:00
|
|
|
**/
|
block: fix flush machinery for stacking drivers with differring flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-16 02:37:25 +07:00
|
|
|
bool __blk_end_bidi_request(struct request *rq, int error,
|
2009-05-11 15:56:09 +07:00
|
|
|
unsigned int nr_bytes, unsigned int bidi_bytes)
|
2007-12-12 05:40:30 +07:00
|
|
|
{
|
2009-04-23 09:05:18 +07:00
|
|
|
if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
|
|
|
|
return true;
|
2007-12-12 05:40:30 +07:00
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
blk_finish_request(rq, error);
|
2007-12-12 05:40:30 +07:00
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
return false;
|
2007-12-12 05:40:30 +07:00
|
|
|
}
|
2007-12-12 05:51:02 +07:00
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_end_request - Helper function for drivers to complete the request.
|
|
|
|
* @rq: the request being processed
|
2008-08-20 01:13:11 +07:00
|
|
|
* @error: %0 for success, < %0 for error
|
2007-12-12 05:51:02 +07:00
|
|
|
* @nr_bytes: number of bytes to complete
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Ends I/O on a number of bytes attached to @rq.
|
|
|
|
* If @rq has leftover, sets it up for the next range of segments.
|
|
|
|
*
|
|
|
|
* Return:
|
2009-05-11 15:56:09 +07:00
|
|
|
* %false - we are done with this request
|
|
|
|
* %true - still buffers pending for this request
|
2007-12-12 05:51:02 +07:00
|
|
|
**/
|
2009-05-11 15:56:09 +07:00
|
|
|
bool blk_end_request(struct request *rq, int error, unsigned int nr_bytes)
|
2007-12-12 05:51:02 +07:00
|
|
|
{
|
2009-05-11 15:56:09 +07:00
|
|
|
return blk_end_bidi_request(rq, error, nr_bytes, 0);
|
2007-12-12 05:51:02 +07:00
|
|
|
}
|
2009-07-29 03:11:24 +07:00
|
|
|
EXPORT_SYMBOL(blk_end_request);
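The return convention documented above (%true while buffers are still pending, %false once the request is done) lends itself to a simple completion loop on the caller's side. The following userspace sketch only models that convention: struct toy_request, toy_end_request() and toy_complete_in_chunks() are hypothetical names, and the error argument, locking and bio handling of the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature request: only a byte counter and a done marker. */
struct toy_request {
	unsigned int bytes_left;
	bool finished;
};

/*
 * Complete up to nr_bytes of the request.
 * Returns true while data is still pending, false once the request is done.
 */
bool toy_end_request(struct toy_request *rq, unsigned int nr_bytes)
{
	assert(!rq->finished); /* a finished request must not be completed again */

	if (nr_bytes < rq->bytes_left) {
		rq->bytes_left -= nr_bytes;
		return true;
	}

	rq->bytes_left = 0;
	rq->finished = true;
	return false;
}

/* Complete the request chunk by chunk; returns how many calls it took. */
unsigned int toy_complete_in_chunks(struct toy_request *rq, unsigned int chunk)
{
	unsigned int calls = 1;

	assert(chunk > 0);
	while (toy_end_request(rq, chunk))
		calls++;
	return calls;
}
```

With a 10240-byte request and 4096-byte chunks, the loop makes three calls: the first two report pending data and the third reports completion.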
|
2007-12-12 05:40:30 +07:00
|
|
|
|
|
|
|
/**
|
2009-05-11 15:56:09 +07:00
|
|
|
* blk_end_request_all - Helper function for drivers to finish the request.
|
|
|
|
* @rq: the request to finish
|
2009-06-12 10:00:41 +07:00
|
|
|
* @error: %0 for success, < %0 for error
|
2007-12-12 05:40:30 +07:00
|
|
|
*
|
|
|
|
* Description:
|
2009-05-11 15:56:09 +07:00
|
|
|
* Completely finish @rq.
|
|
|
|
*/
|
|
|
|
void blk_end_request_all(struct request *rq, int error)
|
2007-12-12 05:40:30 +07:00
|
|
|
{
|
2009-05-11 15:56:09 +07:00
|
|
|
bool pending;
|
|
|
|
unsigned int bidi_bytes = 0;
|
2007-12-12 05:40:30 +07:00
|
|
|
|
2009-05-11 15:56:09 +07:00
|
|
|
if (unlikely(blk_bidi_rq(rq)))
|
|
|
|
bidi_bytes = blk_rq_bytes(rq->next_rq);
|
2007-12-12 05:40:30 +07:00
|
|
|
|
2009-05-11 15:56:09 +07:00
|
|
|
pending = blk_end_bidi_request(rq, error, blk_rq_bytes(rq), bidi_bytes);
|
|
|
|
BUG_ON(pending);
|
|
|
|
}
|
2009-07-29 03:11:24 +07:00
|
|
|
EXPORT_SYMBOL(blk_end_request_all);
|
2007-12-12 05:40:30 +07:00
|
|
|
|
2009-05-11 15:56:09 +07:00
|
|
|
/**
|
|
|
|
* blk_end_request_cur - Helper function to finish the current request chunk.
|
|
|
|
* @rq: the request to finish the current chunk for
|
2009-06-12 10:00:41 +07:00
|
|
|
* @error: %0 for success, < %0 for error
|
2009-05-11 15:56:09 +07:00
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Complete the current consecutively mapped chunk from @rq.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* %false - we are done with this request
|
|
|
|
* %true - still buffers pending for this request
|
|
|
|
*/
|
|
|
|
bool blk_end_request_cur(struct request *rq, int error)
|
|
|
|
{
|
|
|
|
return blk_end_request(rq, error, blk_rq_cur_bytes(rq));
|
2007-12-12 05:40:30 +07:00
|
|
|
}
|
2009-07-29 03:11:24 +07:00
|
|
|
EXPORT_SYMBOL(blk_end_request_cur);
|
2007-12-12 05:40:30 +07:00
|
|
|
|
2009-07-03 15:48:17 +07:00
|
|
|
/**
|
|
|
|
* blk_end_request_err - Finish a request till the next failure boundary.
|
|
|
|
* @rq: the request to finish till the next failure boundary for
|
|
|
|
* @error: must be negative errno
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Complete @rq till the next failure boundary.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* %false - we are done with this request
|
|
|
|
* %true - still buffers pending for this request
|
|
|
|
*/
|
|
|
|
bool blk_end_request_err(struct request *rq, int error)
|
|
|
|
{
|
|
|
|
WARN_ON(error >= 0);
|
|
|
|
return blk_end_request(rq, error, blk_rq_err_bytes(rq));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_end_request_err);
|
|
|
|
|
2007-12-12 05:51:46 +07:00
|
|
|
/**
|
2009-05-11 15:56:09 +07:00
|
|
|
* __blk_end_request - Helper function for drivers to complete the request.
|
|
|
|
* @rq: the request being processed
|
|
|
|
* @error: %0 for success, < %0 for error
|
|
|
|
* @nr_bytes: number of bytes to complete
|
2007-12-12 05:51:46 +07:00
|
|
|
*
|
|
|
|
* Description:
|
2009-05-11 15:56:09 +07:00
|
|
|
* Must be called with queue lock held unlike blk_end_request().
|
2007-12-12 05:51:46 +07:00
|
|
|
*
|
|
|
|
* Return:
|
2009-05-11 15:56:09 +07:00
|
|
|
* %false - we are done with this request
|
|
|
|
* %true - still buffers pending for this request
|
2007-12-12 05:51:46 +07:00
|
|
|
**/
|
2009-05-11 15:56:09 +07:00
|
|
|
bool __blk_end_request(struct request *rq, int error, unsigned int nr_bytes)
|
2007-12-12 05:51:46 +07:00
|
|
|
{
|
2009-05-11 15:56:09 +07:00
|
|
|
return __blk_end_bidi_request(rq, error, nr_bytes, 0);
|
2007-12-12 05:51:46 +07:00
|
|
|
}
|
2009-07-29 03:11:24 +07:00
|
|
|
EXPORT_SYMBOL(__blk_end_request);
|
2007-12-12 05:51:46 +07:00
|
|
|
|
2008-09-18 21:45:09 +07:00
|
|
|
/**
|
2009-05-11 15:56:09 +07:00
|
|
|
* __blk_end_request_all - Helper function for drivers to finish the request.
|
|
|
|
* @rq: the request to finish
|
2009-06-12 10:00:41 +07:00
|
|
|
* @error: %0 for success, < %0 for error
|
2008-09-18 21:45:09 +07:00
|
|
|
*
|
|
|
|
* Description:
|
2009-05-11 15:56:09 +07:00
|
|
|
* Completely finish @rq. Must be called with queue lock held.
|
2008-09-18 21:45:09 +07:00
|
|
|
*/
|
2009-05-11 15:56:09 +07:00
|
|
|
void __blk_end_request_all(struct request *rq, int error)
|
2008-09-18 21:45:09 +07:00
|
|
|
{
|
2009-05-11 15:56:09 +07:00
|
|
|
bool pending;
|
|
|
|
unsigned int bidi_bytes = 0;
|
|
|
|
|
|
|
|
if (unlikely(blk_bidi_rq(rq)))
|
|
|
|
bidi_bytes = blk_rq_bytes(rq->next_rq);
|
|
|
|
|
|
|
|
pending = __blk_end_bidi_request(rq, error, blk_rq_bytes(rq), bidi_bytes);
|
|
|
|
BUG_ON(pending);
|
2008-09-18 21:45:09 +07:00
|
|
|
}
|
2009-07-29 03:11:24 +07:00
|
|
|
EXPORT_SYMBOL(__blk_end_request_all);
|
2008-09-18 21:45:09 +07:00
|
|
|
|
2007-12-12 05:51:02 +07:00
|
|
|
/**
|
2009-05-11 15:56:09 +07:00
|
|
|
* __blk_end_request_cur - Helper function to finish the current request chunk.
|
|
|
|
* @rq: the request to finish the current chunk for
|
2009-06-12 10:00:41 +07:00
|
|
|
* @error: %0 for success, < %0 for error
|
2007-12-12 05:51:02 +07:00
|
|
|
*
|
|
|
|
* Description:
|
2009-05-11 15:56:09 +07:00
|
|
|
* Complete the current consecutively mapped chunk from @rq. Must
|
|
|
|
* be called with queue lock held.
|
2007-12-12 05:51:02 +07:00
|
|
|
*
|
|
|
|
* Return:
|
2009-05-11 15:56:09 +07:00
|
|
|
* %false - we are done with this request
|
|
|
|
* %true - buffers are still pending for this request
|
|
|
|
*/
|
|
|
|
bool __blk_end_request_cur(struct request *rq, int error)
|
2007-12-12 05:51:02 +07:00
|
|
|
{
|
2009-05-11 15:56:09 +07:00
|
|
|
return __blk_end_request(rq, error, blk_rq_cur_bytes(rq));
|
2007-12-12 05:51:02 +07:00
|
|
|
}
|
2009-07-29 03:11:24 +07:00
|
|
|
EXPORT_SYMBOL(__blk_end_request_cur);
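The completion helpers above share one return convention: `true` means part of the request is still pending, `false` means it is finished. The following is a minimal userspace sketch of that convention, not kernel code. `struct toy_request` and every `toy_*` function are hypothetical stand-ins invented for illustration, and the fixed chunk size is an assumption made to keep the model small.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct request: only byte accounting is modelled. */
struct toy_request {
	unsigned int remaining;	/* bytes not yet completed */
	unsigned int chunk;	/* size of one contiguous chunk (assumed fixed) */
};

/* Size of the current chunk, clamped to what is left (cf. blk_rq_cur_bytes()). */
static unsigned int toy_rq_cur_bytes(const struct toy_request *rq)
{
	return rq->remaining < rq->chunk ? rq->remaining : rq->chunk;
}

/* Complete nr_bytes; returns true while data is still pending. */
static bool toy_end_request(struct toy_request *rq, unsigned int nr_bytes)
{
	if (nr_bytes > rq->remaining)
		nr_bytes = rq->remaining;
	rq->remaining -= nr_bytes;
	return rq->remaining != 0;
}

/* Complete only the current chunk, like __blk_end_request_cur(). */
static bool toy_end_request_cur(struct toy_request *rq)
{
	return toy_end_request(rq, toy_rq_cur_bytes(rq));
}

/* Complete everything; nothing may remain, mirroring BUG_ON(pending). */
static void toy_end_request_all(struct toy_request *rq)
{
	bool pending = toy_end_request(rq, rq->remaining);

	assert(!pending);
}
```

A driver-style loop would call `toy_end_request_cur()` repeatedly until it returns `false`, while `toy_end_request_all()` suits callers that know the whole request is done.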
|
2007-12-12 05:51:02 +07:00
|
|
|
|
2009-07-03 15:48:17 +07:00
|
|
|
/**
|
|
|
|
* __blk_end_request_err - Finish a request till the next failure boundary.
|
|
|
|
* @rq: the request to finish till the next failure boundary for
|
|
|
|
* @error: must be negative errno
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Complete @rq till the next failure boundary. Must be called
|
|
|
|
* with queue lock held.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* %false - we are done with this request
|
|
|
|
* %true - buffers are still pending for this request
|
|
|
|
*/
|
|
|
|
bool __blk_end_request_err(struct request *rq, int error)
|
|
|
|
{
|
|
|
|
WARN_ON(error >= 0);
|
|
|
|
return __blk_end_request(rq, error, blk_rq_err_bytes(rq));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__blk_end_request_err);
|
|
|
|
|
2008-01-29 20:53:40 +07:00
|
|
|
void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
|
|
|
|
struct bio *bio)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-04-10 22:46:28 +07:00
|
|
|
if (bio_has_data(bio))
|
2008-08-06 00:01:53 +07:00
|
|
|
rq->nr_phys_segments = bio_phys_segments(q, bio);
|
2014-04-10 22:46:28 +07:00
|
|
|
|
2013-10-12 05:44:27 +07:00
|
|
|
rq->__data_len = bio->bi_iter.bi_size;
|
2005-04-17 05:20:36 +07:00
|
|
|
rq->bio = rq->biotail = bio;
|
|
|
|
|
2007-08-16 18:31:28 +07:00
|
|
|
if (bio->bi_bdev)
|
|
|
|
rq->rq_disk = bio->bi_bdev->bd_disk;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-11-26 15:16:19 +07:00
|
|
|
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
|
|
|
|
/**
|
|
|
|
* rq_flush_dcache_pages - Helper function to flush all pages in a request
|
|
|
|
* @rq: the request to be flushed
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Flush all pages in @rq.
|
|
|
|
*/
|
|
|
|
void rq_flush_dcache_pages(struct request *rq)
|
|
|
|
{
|
|
|
|
struct req_iterator iter;
|
2013-11-24 08:19:00 +07:00
|
|
|
struct bio_vec bvec;
|
2009-11-26 15:16:19 +07:00
|
|
|
|
|
|
|
rq_for_each_segment(bvec, rq, iter)
|
2013-11-24 08:19:00 +07:00
|
|
|
flush_dcache_page(bvec.bv_page);
|
2009-11-26 15:16:19 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rq_flush_dcache_pages);
|
|
|
|
#endif
|
|
|
|
|
2008-10-01 21:12:15 +07:00
|
|
|
/**
|
|
|
|
* blk_lld_busy - Check if underlying low-level drivers of a device are busy
|
|
|
|
* @q: the queue of the device being checked
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Check if underlying low-level drivers of a device are busy.
|
|
|
|
* If the drivers want to export their busy state, they must set their own
|
|
|
|
* exporting function using blk_queue_lld_busy() first.
|
|
|
|
*
|
|
|
|
* Basically, this function is used only by request stacking drivers
|
|
|
|
* to stop dispatching requests to underlying devices when underlying
|
|
|
|
* devices are busy. This behavior allows more I/O merging on the queue
|
|
|
|
* of the request stacking driver and prevents I/O throughput regression
|
|
|
|
* on burst I/O load.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* 0 - Not busy (The request stacking driver should dispatch request)
|
|
|
|
* 1 - Busy (The request stacking driver should stop dispatching request)
|
|
|
|
*/
|
|
|
|
int blk_lld_busy(struct request_queue *q)
|
|
|
|
{
|
|
|
|
if (q->lld_busy_fn)
|
|
|
|
return q->lld_busy_fn(q);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_lld_busy);
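The busy check above is an optional-callback pattern: a queue with no registered function is treated as never busy. The sketch below models this in userspace together with a stacking-driver dispatch loop that backs off when the lower device reports busy. All `toy_*` names, the `in_flight` counter, and the depth-based busy rule are assumptions made for illustration and are not part of the kernel API.

```c
#include <assert.h>
#include <stddef.h>

struct toy_queue;
typedef int (toy_busy_fn)(struct toy_queue *q);

/* Hypothetical stand-in for struct request_queue. */
struct toy_queue {
	toy_busy_fn *busy_fn;	/* NULL when the driver exports no busy state */
	int in_flight;		/* requests currently handed to the device */
	int max_in_flight;	/* assumed device depth limit */
};

/* Same shape as blk_lld_busy(): ask the driver if it registered a hook. */
static int toy_lld_busy(struct toy_queue *q)
{
	if (q->busy_fn)
		return q->busy_fn(q);

	return 0;
}

/* Example hook: busy once the assumed depth limit is reached. */
static int toy_depth_busy(struct toy_queue *q)
{
	return q->in_flight >= q->max_in_flight;
}

/*
 * Stacking-driver side: hand over up to nr requests, but stop as soon as
 * the lower queue reports busy so the rest can keep merging upstream.
 * Returns the number of requests actually dispatched.
 */
static int toy_dispatch(struct toy_queue *lower, int nr)
{
	int sent = 0;

	assert(nr >= 0);
	while (sent < nr && !toy_lld_busy(lower)) {
		lower->in_flight++;
		sent++;
	}
	return sent;
}
```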
|
|
|
|
|
2015-06-26 21:01:13 +07:00
|
|
|
/**
|
|
|
|
* blk_rq_unprep_clone - Helper function to free all bios in a cloned request
|
|
|
|
* @rq: the clone request to be cleaned up
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Free all bios in @rq for a cloned request.
|
|
|
|
*/
|
|
|
|
void blk_rq_unprep_clone(struct request *rq)
|
|
|
|
{
|
|
|
|
struct bio *bio;
|
|
|
|
|
|
|
|
while ((bio = rq->bio) != NULL) {
|
|
|
|
rq->bio = bio->bi_next;
|
|
|
|
|
|
|
|
bio_put(bio);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Copy attributes of the original request to the clone request.
|
|
|
|
* The actual data parts (e.g. ->cmd, ->sense) are not copied.
|
|
|
|
*/
|
|
|
|
static void __blk_rq_prep_clone(struct request *dst, struct request *src)
|
block: add request clone interface (v2)
This patch adds the following 2 interfaces for request-stacking drivers:
- blk_rq_prep_clone(struct request *clone, struct request *orig,
struct bio_set *bs, gfp_t gfp_mask,
int (*bio_ctr)(struct bio *, struct bio*, void *),
void *data)
* Clones bios in the original request to the clone request
(bio_ctr is called for each cloned bios.)
* Copies attributes of the original request to the clone request.
The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
copied.
- blk_rq_unprep_clone(struct request *clone)
* Frees cloned bios from the clone request.
Request stacking drivers (e.g. request-based dm) need to make a clone
request for a submitted request and dispatch it to other devices.
To allocate request for the clone, request stacking drivers may not
be able to use blk_get_request() because the allocation may be done
in an irq-disabled context.
So blk_rq_prep_clone() takes a request allocated by the caller
as an argument.
For each clone bio in the clone request, request stacking drivers
should be able to set up their own completion handler.
So blk_rq_prep_clone() takes a callback function which is called
for each clone bio, and a pointer for private data which is passed
to the callback.
NOTE:
blk_rq_prep_clone() doesn't copy any actual data of the original
request. Pages are shared between original bios and cloned bios.
So caller must not complete the original request before the clone
request.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-11 18:10:16 +07:00
|
|
|
{
|
|
|
|
dst->cpu = src->cpu;
|
|
|
|
dst->__sector = blk_rq_pos(src);
|
|
|
|
dst->__data_len = blk_rq_bytes(src);
|
|
|
|
dst->nr_phys_segments = src->nr_phys_segments;
|
|
|
|
dst->ioprio = src->ioprio;
|
|
|
|
dst->extra_len = src->extra_len;
|
2015-06-26 21:01:13 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_rq_prep_clone - Helper function to set up a clone request
|
|
|
|
* @rq: the request to be set up
|
|
|
|
* @rq_src: original request to be cloned
|
|
|
|
* @bs: bio_set that bios for clone are allocated from
|
|
|
|
* @gfp_mask: memory allocation mask for bio
|
|
|
|
* @bio_ctr: setup function to be called for each clone bio.
|
|
|
|
* Returns %0 for success, non %0 for failure.
|
|
|
|
* @data: private data to be passed to @bio_ctr
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
|
|
|
|
* The actual data parts of @rq_src (e.g. ->cmd, ->sense)
|
|
|
|
* are not copied, and copying such parts is the caller's responsibility.
|
|
|
|
* Also, pages which the original bios are pointing to are not copied
|
|
|
|
* and the cloned bios just point to the same pages.
|
|
|
|
* So cloned bios must be completed before original bios, which means
|
|
|
|
* the caller must complete @rq before @rq_src.
|
|
|
|
*/
|
|
|
|
int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
|
|
|
|
struct bio_set *bs, gfp_t gfp_mask,
|
|
|
|
int (*bio_ctr)(struct bio *, struct bio *, void *),
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct bio *bio, *bio_src;
|
|
|
|
|
|
|
|
if (!bs)
|
|
|
|
bs = fs_bio_set;
|
|
|
|
|
|
|
|
__rq_for_each_bio(bio_src, rq_src) {
|
|
|
|
bio = bio_clone_fast(bio_src, gfp_mask, bs);
|
|
|
|
if (!bio)
|
|
|
|
goto free_and_out;
|
|
|
|
|
|
|
|
if (bio_ctr && bio_ctr(bio, bio_src, data))
|
|
|
|
goto free_and_out;
|
|
|
|
|
|
|
|
if (rq->bio) {
|
|
|
|
rq->biotail->bi_next = bio;
|
|
|
|
rq->biotail = bio;
|
|
|
|
} else
|
|
|
|
rq->bio = rq->biotail = bio;
|
|
|
|
}
|
|
|
|
|
|
|
|
__blk_rq_prep_clone(rq, rq_src);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
free_and_out:
|
|
|
|
if (bio)
|
|
|
|
bio_put(bio);
|
|
|
|
blk_rq_unprep_clone(rq);
|
|
|
|
|
|
|
|
return -ENOMEM;
|
2009-06-11 18:10:16 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
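The clone and unprep pair above follows a build-with-rollback pattern: clone each element, run an optional per-element constructor, append at the tail, and tear down the partial chain on any failure. The sketch below models that flow in userspace. `struct toy_bio`, `struct toy_request`, `toy_budget_ctr` and the use of `malloc`/`free` in place of bio_set allocation and `bio_put()` are all assumptions made for illustration. Unlike the kernel code, the sketch frees the failing element at the failure site and resets the tail pointer during teardown.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins: a bio is a list node pointing at shared pages. */
struct toy_bio {
	struct toy_bio *next;
	const char *pages;	/* shared with the original, never copied */
};

struct toy_request {
	struct toy_bio *bio, *biotail;
};

/* Free every cloned bio, like blk_rq_unprep_clone(). */
static void toy_rq_unprep_clone(struct toy_request *rq)
{
	struct toy_bio *bio;

	while ((bio = rq->bio) != NULL) {
		rq->bio = bio->next;
		free(bio);
	}
	rq->biotail = NULL;
}

/* Clone src's chain into rq; on any failure roll back and return -ENOMEM. */
static int toy_rq_prep_clone(struct toy_request *rq,
			     const struct toy_request *src,
			     int (*ctr)(struct toy_bio *, const struct toy_bio *, void *),
			     void *data)
{
	const struct toy_bio *bio_src;
	struct toy_bio *bio;

	assert(rq->bio == NULL);	/* sketch precondition: clone starts empty */

	for (bio_src = src->bio; bio_src; bio_src = bio_src->next) {
		bio = malloc(sizeof(*bio));
		if (!bio)
			goto free_and_out;
		bio->next = NULL;
		bio->pages = bio_src->pages;	/* share, do not copy */

		if (ctr && ctr(bio, bio_src, data)) {
			free(bio);		/* not yet linked into rq */
			goto free_and_out;
		}

		if (rq->bio) {
			rq->biotail->next = bio;
			rq->biotail = bio;
		} else {
			rq->bio = rq->biotail = bio;
		}
	}
	return 0;

free_and_out:
	toy_rq_unprep_clone(rq);
	return -ENOMEM;
}

/* Example constructor: fails once the budget stored in *data is used up. */
static int toy_budget_ctr(struct toy_bio *bio, const struct toy_bio *src,
			  void *data)
{
	int *budget = data;

	(void)bio;
	(void)src;
	if (*budget == 0)
		return -1;
	(*budget)--;
	return 0;
}
```

Because the clone only borrows the original's pages, the caller still has to finish the clone before the original, as the kernel-doc above requires.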
|
|
|
|
|
2014-04-08 22:15:35 +07:00
|
|
|
int kblockd_schedule_work(struct work_struct *work)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
return queue_work(kblockd_workqueue, work);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_schedule_work);
|
|
|
|
|
2016-08-25 04:52:48 +07:00
|
|
|
int kblockd_schedule_work_on(int cpu, struct work_struct *work)
|
|
|
|
{
|
|
|
|
return queue_work_on(cpu, kblockd_workqueue, work);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_schedule_work_on);
|
|
|
|
|
2014-04-08 22:15:35 +07:00
|
|
|
int kblockd_schedule_delayed_work(struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
2010-09-16 04:06:35 +07:00
|
|
|
{
|
|
|
|
return queue_delayed_work(kblockd_workqueue, dwork, delay);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_schedule_delayed_work);
|
|
|
|
|
2014-04-08 22:17:40 +07:00
|
|
|
int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
|
|
|
{
|
|
|
|
return queue_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
|
|
|
|
|
2011-09-21 15:00:16 +07:00
|
|
|
/**
|
|
|
|
* blk_start_plug - initialize blk_plug and track it inside the task_struct
|
|
|
|
* @plug: The &struct blk_plug that needs to be initialized
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Tracking blk_plug inside the task_struct will help with auto-flushing the
|
|
|
|
* pending I/O should the task end up blocking between blk_start_plug() and
|
|
|
|
* blk_finish_plug(). This is important from a performance perspective, but
|
|
|
|
* also ensures that we don't deadlock. For instance, if the task is blocking
|
|
|
|
* for a memory allocation, memory reclaim could end up wanting to free a
|
|
|
|
* page belonging to that request that is currently residing in our private
|
|
|
|
* plug. By flushing the pending I/O when the process goes to sleep, we avoid
|
|
|
|
* this kind of deadlock.
|
|
|
|
*/
|
2011-03-08 19:19:51 +07:00
|
|
|
void blk_start_plug(struct blk_plug *plug)
|
|
|
|
{
|
|
|
|
struct task_struct *tsk = current;
|
|
|
|
|
2015-05-09 00:51:28 +07:00
|
|
|
/*
|
|
|
|
* If this is a nested plug, don't actually assign it.
|
|
|
|
*/
|
|
|
|
if (tsk->plug)
|
|
|
|
return;
|
|
|
|
|
2011-03-08 19:19:51 +07:00
|
|
|
INIT_LIST_HEAD(&plug->list);
|
2013-10-24 15:20:05 +07:00
|
|
|
INIT_LIST_HEAD(&plug->mq_list);
|
2011-04-18 14:52:22 +07:00
|
|
|
INIT_LIST_HEAD(&plug->cb_list);
|
2011-03-08 19:19:51 +07:00
|
|
|
/*
|
2015-05-09 00:51:28 +07:00
|
|
|
* Store ordering should not be needed here, since a potential
|
|
|
|
* preempt will imply a full memory barrier
|
2011-03-08 19:19:51 +07:00
|
|
|
*/
|
2015-05-09 00:51:28 +07:00
|
|
|
tsk->plug = plug;
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_start_plug);
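The nested-plug rule above can be modelled in a few lines: a second start while a plug is already active is a no-op, so all I/O keeps accumulating on the outermost plug. The sketch below is a userspace model with hypothetical `toy_*` names. The finish side is not shown in this section, so `toy_finish_plug()` is an assumption: it only flushes when handed the plug that is actually installed on the task.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct blk_plug and struct task_struct. */
struct toy_plug {
	int queued;		/* I/Os held back by this plug */
};

struct toy_task {
	struct toy_plug *plug;	/* active plug, or NULL */
	int submitted;		/* I/Os that reached the device */
};

static void toy_start_plug(struct toy_task *tsk, struct toy_plug *plug)
{
	assert(plug != NULL);

	/* Nested plug: keep the outer one, leave this one untouched. */
	if (tsk->plug)
		return;

	plug->queued = 0;
	tsk->plug = plug;
}

/* Submit one I/O: held on the plug if one is active, else sent directly. */
static void toy_submit(struct toy_task *tsk)
{
	if (tsk->plug)
		tsk->plug->queued++;
	else
		tsk->submitted++;
}

/* Assumed finish logic: only the installed plug flushes and unplugs. */
static void toy_finish_plug(struct toy_task *tsk, struct toy_plug *plug)
{
	if (plug != tsk->plug)
		return;

	tsk->submitted += plug->queued;
	plug->queued = 0;
	tsk->plug = NULL;
}
```

With this rule, a helper that plugs internally can be called from a caller that already holds a plug without splitting the batch.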
|
|
|
|
|
|
|
|
static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
|
|
|
|
{
|
|
|
|
struct request *rqa = container_of(a, struct request, queuelist);
|
|
|
|
struct request *rqb = container_of(b, struct request, queuelist);
|
|
|
|
|
block: Add blk_rq_pos(rq) to sort rq when plushing
My workload is a raid5 array with 16 disks, written to through our
filesystem in direct-io mode.
I used blktrace and found these messages:
8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
8,16 0 0 2.453853661 0 m N cfq2579 insert_request
8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
8,16 0 0 2.453854439 0 m N cfq2579 insert_request
8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
8,16 0 0 2.454795160 0 m N cfq schedule dispatch
From the above messages, we can see that rq[W 7493144 + 104] and rq[W
7493120 + 24] are not merged.
This is because the bio order is:
8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
bio(7493144) arrives first and bio(7493120) later, so the subsequent
bios are divided into two requests.
When flushing the plug list, elv_attempt_insert_merge only supports
back merges, not front merges,
so rq[7493120 + 24] can't merge with rq[7493144 + 104].
In my tests, I found this situation can account for 25% in our system.
With this patch, the situation no longer occurs.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
CC:Shaohua Li <shli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-26 02:58:17 +07:00
|
|
|
return !(rqa->q < rqb->q ||
|
|
|
|
(rqa->q == rqb->q && blk_rq_pos(rqa) < blk_rq_pos(rqb)));
|
2011-03-08 19:19:51 +07:00
|
|
|
}
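The comparator above orders plugged requests first by queue and then by start sector, so that requests for the same queue end up adjacent and in ascending order. It returns only 0 or 1, which is enough for a sort that just checks whether the result is greater than zero, but not for `qsort`. The sketch below is a userspace model using a full three-way comparator instead. `struct toy_rq` is hypothetical, and the queue is represented by an integer id because comparing unrelated pointers with `<` is not portable in standard C.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct request: queue identity plus position. */
struct toy_rq {
	uintptr_t q;			/* id of the owning queue */
	unsigned long long pos;		/* start sector, cf. blk_rq_pos() */
};

/* Three-way comparison on (queue, sector), as qsort() requires. */
static int toy_rq_cmp(const void *a, const void *b)
{
	const struct toy_rq *ra = a, *rb = b;

	if (ra->q != rb->q)
		return ra->q < rb->q ? -1 : 1;
	if (ra->pos != rb->pos)
		return ra->pos < rb->pos ? -1 : 1;
	return 0;
}

/* Sort a plugged batch so same-queue requests are adjacent and ascending. */
static void toy_sort_plug(struct toy_rq *rqs, size_t n)
{
	assert(rqs != NULL || n == 0);
	qsort(rqs, n, sizeof(*rqs), toy_rq_cmp);
}
```

Applied to the sectors from the commit message above, a batch submitted as 7493144 then 7493120 comes out as 7493120 then 7493144, the order in which a back merge becomes possible.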
|
|
|
|
|
2011-04-16 18:51:05 +07:00
|
|
|
/*
|
|
|
|
* If 'from_schedule' is true, then postpone the dispatch of requests
|
|
|
|
* until a safe kblockd context. We do this to avoid accidental big
|
|
|
|
* additional stack usage in driver dispatch, in places where the original
|
|
|
|
* plugger did not intend it.
|
|
|
|
*/
|
2011-04-15 20:49:07 +07:00
|
|
|
static void queue_unplugged(struct request_queue *q, unsigned int depth,
|
2011-04-16 18:51:05 +07:00
|
|
|
bool from_schedule)
|
2011-04-18 14:59:55 +07:00
|
|
|
__releases(q->queue_lock)
|
2011-04-12 15:12:19 +07:00
|
|
|
{
|
2011-04-16 18:51:05 +07:00
|
|
|
trace_block_unplug(q, depth, !from_schedule);
|
2011-04-18 14:59:55 +07:00
|
|
|
|
2012-11-28 19:45:56 +07:00
|
|
|
if (from_schedule)
|
2011-04-18 16:41:33 +07:00
|
|
|
blk_run_queue_async(q);
|
2012-11-28 19:45:56 +07:00
|
|
|
else
|
2011-04-18 16:41:33 +07:00
|
|
|
__blk_run_queue(q);
|
2012-11-28 19:45:56 +07:00
|
|
|
spin_unlock(q->queue_lock);
|
2011-04-12 15:12:19 +07:00
|
|
|
}
|
|
|
|
|
2012-07-31 14:08:15 +07:00
|
|
|
static void flush_plug_callbacks(struct blk_plug *plug, bool from_schedule)
|
2011-04-18 14:52:22 +07:00
|
|
|
{
|
|
|
|
LIST_HEAD(callbacks);
|
|
|
|
|
2012-07-31 14:08:15 +07:00
|
|
|
while (!list_empty(&plug->cb_list)) {
|
|
|
|
list_splice_init(&plug->cb_list, &callbacks);
|
2011-04-18 14:52:22 +07:00
|
|
|
|
2012-07-31 14:08:15 +07:00
|
|
|
while (!list_empty(&callbacks)) {
|
|
|
|
struct blk_plug_cb *cb = list_first_entry(&callbacks,
|
2011-04-18 14:52:22 +07:00
|
|
|
struct blk_plug_cb,
|
|
|
|
list);
|
2012-07-31 14:08:15 +07:00
|
|
|
list_del(&cb->list);
|
2012-07-31 14:08:15 +07:00
|
|
|
cb->callback(cb, from_schedule);
|
2012-07-31 14:08:15 +07:00
|
|
|
}
|
2011-04-18 14:52:22 +07:00
|
|
|
}
|
|
|
|
}
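The callback flush above uses two nested loops: the outer loop repeats as long as the plug's list is non-empty, and the inner loop drains a private batch taken from it. This matters because a callback may register new callbacks while it runs. The sketch below models that with a singly linked list in userspace. All `toy_*` names and the `followup` field are hypothetical and exist only to demonstrate re-registration.

```c
#include <assert.h>
#include <stddef.h>

struct toy_plug;

/* Hypothetical stand-in for struct blk_plug_cb. */
struct toy_cb {
	struct toy_cb *next;
	void (*fn)(struct toy_plug *plug, struct toy_cb *cb);
	struct toy_cb *followup;	/* optional callback to queue when run */
};

struct toy_plug {
	struct toy_cb *cb_list;
	int ran;			/* number of callbacks executed */
};

static void toy_add_cb(struct toy_plug *plug, struct toy_cb *cb)
{
	cb->next = plug->cb_list;
	plug->cb_list = cb;
}

static void toy_flush_callbacks(struct toy_plug *plug)
{
	/* Outer loop: callbacks may have queued more work on the plug. */
	while (plug->cb_list) {
		/* Take the whole list privately, like list_splice_init(). */
		struct toy_cb *batch = plug->cb_list;

		plug->cb_list = NULL;
		while (batch) {
			struct toy_cb *cb = batch;

			batch = cb->next;
			cb->next = NULL;	/* unlink before calling */
			assert(cb->fn != NULL);
			cb->fn(plug, cb);
		}
	}
}

/* Example callback: counts itself and queues its follow-up once. */
static void toy_count_cb(struct toy_plug *plug, struct toy_cb *cb)
{
	plug->ran++;
	if (cb->followup) {
		toy_add_cb(plug, cb->followup);
		cb->followup = NULL;
	}
}
```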
|
|
|
|
|
2012-07-31 14:08:14 +07:00
|
|
|
struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data,
|
|
|
|
int size)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = current->plug;
|
|
|
|
struct blk_plug_cb *cb;
|
|
|
|
|
|
|
|
if (!plug)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
list_for_each_entry(cb, &plug->cb_list, list)
|
|
|
|
if (cb->callback == unplug && cb->data == data)
|
|
|
|
return cb;
|
|
|
|
|
|
|
|
/* Not currently on the callback list */
|
|
|
|
BUG_ON(size < sizeof(*cb));
|
|
|
|
cb = kzalloc(size, GFP_ATOMIC);
|
|
|
|
if (cb) {
|
|
|
|
cb->data = data;
|
|
|
|
cb->callback = unplug;
|
|
|
|
list_add(&cb->list, &plug->cb_list);
|
|
|
|
}
|
|
|
|
return cb;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_check_plugged);
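The function above is a find-or-allocate lookup keyed on the (callback, data) pair, with a caller-supplied allocation size so a driver can embed the callback record at the start of a larger private structure. The sketch below models this in userspace. All `toy_*` names are hypothetical, `calloc` stands in for `kzalloc(size, GFP_ATOMIC)`, and the cast in `toy_driver_unplug` relies on the assumed layout with the record as the first member (kernel code would typically use `container_of`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_cb;
typedef void (*toy_cb_fn)(struct toy_cb *cb);

/* Hypothetical stand-in for struct blk_plug_cb. */
struct toy_cb {
	struct toy_cb *next;
	toy_cb_fn callback;
	void *data;
};

struct toy_plug {
	struct toy_cb *cb_list;
};

/* Return the existing record for (unplug, data) or allocate a zeroed one. */
static struct toy_cb *toy_check_plugged(struct toy_plug *plug, toy_cb_fn unplug,
					void *data, size_t size)
{
	struct toy_cb *cb;

	if (!plug)
		return NULL;	/* no plug active: caller must act immediately */

	for (cb = plug->cb_list; cb; cb = cb->next)
		if (cb->callback == unplug && cb->data == data)
			return cb;

	/* Not on the list yet; size must cover at least the base record. */
	assert(size >= sizeof(*cb));
	cb = calloc(1, size);
	if (cb) {
		cb->data = data;
		cb->callback = unplug;
		cb->next = plug->cb_list;
		plug->cb_list = cb;
	}
	return cb;
}

/* Hypothetical driver-private wrapper; toy_cb must be the first member. */
struct toy_driver_cb {
	struct toy_cb cb;
	int pending;
};

static void toy_driver_unplug(struct toy_cb *cb)
{
	struct toy_driver_cb *dcb = (struct toy_driver_cb *)cb;

	dcb->pending = 0;
}
```

Calling the lookup twice with the same pair returns the same record, so per-plug driver state such as `pending` survives between calls until the callback runs.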
|
|
|
|
|
2011-04-16 18:51:05 +07:00
|
|
|
void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
|
|
|
unsigned long flags;
|
|
|
|
struct request *rq;
|
2011-04-11 19:13:10 +07:00
|
|
|
LIST_HEAD(list);
|
2011-04-12 15:12:19 +07:00
|
|
|
unsigned int depth;
|
2011-03-08 19:19:51 +07:00
|
|
|
|
2012-07-31 14:08:15 +07:00
|
|
|
flush_plug_callbacks(plug, from_schedule);
|
2013-10-24 15:20:05 +07:00
|
|
|
|
|
|
|
if (!list_empty(&plug->mq_list))
|
|
|
|
blk_mq_flush_plug_list(plug, from_schedule);
|
|
|
|
|
2011-03-08 19:19:51 +07:00
|
|
|
if (list_empty(&plug->list))
|
|
|
|
return;
|
|
|
|
|
2011-04-11 19:13:10 +07:00
|
|
|
list_splice_init(&plug->list, &list);
|
|
|
|
|
2013-01-11 20:46:09 +07:00
|
|
|
list_sort(NULL, &list, plug_rq_cmp);
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
q = NULL;
|
2011-04-12 15:12:19 +07:00
|
|
|
depth = 0;
|
2011-04-12 15:11:24 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Save and disable interrupts here, to avoid doing it for every
|
|
|
|
* queue lock we have to take.
|
|
|
|
*/
|
2011-03-08 19:19:51 +07:00
|
|
|
local_irq_save(flags);
|
2011-04-11 19:13:10 +07:00
|
|
|
while (!list_empty(&list)) {
|
|
|
|
rq = list_entry_rq(list.next);
|
2011-03-08 19:19:51 +07:00
|
|
|
list_del_init(&rq->queuelist);
|
|
|
|
BUG_ON(!rq->q);
|
|
|
|
if (rq->q != q) {
|
2011-04-18 14:59:55 +07:00
|
|
|
/*
|
|
|
|
* This drops the queue lock
|
|
|
|
*/
|
|
|
|
if (q)
|
2011-04-16 18:51:05 +07:00
|
|
|
queue_unplugged(q, depth, from_schedule);
|
2011-03-08 19:19:51 +07:00
|
|
|
q = rq->q;
|
2011-04-12 15:12:19 +07:00
|
|
|
depth = 0;
|
2011-03-08 19:19:51 +07:00
|
|
|
spin_lock(q->queue_lock);
|
|
|
|
}
|
2011-12-14 06:33:37 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Short-circuit if @q is dead
|
|
|
|
*/
|
2012-11-28 19:42:38 +07:00
|
|
|
if (unlikely(blk_queue_dying(q))) {
|
2011-12-14 06:33:37 +07:00
|
|
|
__blk_end_request_all(rq, -ENODEV);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2011-03-08 19:19:51 +07:00
|
|
|
/*
|
|
|
|
* rq is already accounted, so use raw insert
|
|
|
|
*/
|
2017-01-27 22:30:47 +07:00
|
|
|
if (op_is_flush(rq->cmd_flags))
|
2011-03-25 22:57:52 +07:00
|
|
|
__elv_add_request(q, rq, ELEVATOR_INSERT_FLUSH);
|
|
|
|
else
|
|
|
|
__elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);
|
2011-04-12 15:12:19 +07:00
|
|
|
|
|
|
|
depth++;
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
|
|
|
|
2011-04-18 14:59:55 +07:00
|
|
|
/*
|
|
|
|
* This drops the queue lock
|
|
|
|
*/
|
|
|
|
if (q)
|
2011-04-16 18:51:05 +07:00
|
|
|
queue_unplugged(q, depth, from_schedule);
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
local_irq_restore(flags);
|
|
|
|
}
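The loop in blk_flush_plug_list() relies on the sorted list keeping requests for the same queue adjacent, so each queue lock is taken once per batch rather than once per request. The following is a minimal user-space sketch of that batching idea, not kernel code. Every `plug_model_*` name is hypothetical, and the comparison function only assumes that plug_rq_cmp groups requests by queue and then by position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct request_queue. */
struct plug_model_queue {
	int id;
	int locked;      /* 1 while the model "queue lock" is held */
	int lock_count;  /* how many times the lock was taken */
	int inserted;    /* requests handed to this queue */
	int last_depth;  /* depth reported by the most recent unplug */
};

/* Hypothetical stand-in for struct request. */
struct plug_model_request {
	struct plug_model_queue *q;
	long sector;
};

/* Order by queue first, then by sector (assumed role of plug_rq_cmp). */
static int plug_model_cmp(const void *a, const void *b)
{
	const struct plug_model_request *ra = a, *rb = b;

	if (ra->q != rb->q)
		return ra->q->id < rb->q->id ? -1 : 1;
	return (ra->sector > rb->sector) - (ra->sector < rb->sector);
}

/* Stand-in for queue_unplugged(): record the depth and drop the lock. */
static void plug_model_unplug(struct plug_model_queue *q, int depth)
{
	q->last_depth = depth;
	q->locked = 0;
}

/* Flush n requests; returns the number of lock acquisitions performed. */
static int plug_model_flush(struct plug_model_request *rqs, size_t n)
{
	struct plug_model_queue *q = NULL;
	int depth = 0, locks = 0;
	size_t i;

	qsort(rqs, n, sizeof(*rqs), plug_model_cmp);

	for (i = 0; i < n; i++) {
		if (rqs[i].q != q) {
			/* Queue changed: finish the previous batch first. */
			if (q)
				plug_model_unplug(q, depth);
			q = rqs[i].q;
			depth = 0;
			q->locked = 1;
			q->lock_count++;
			locks++;
		}
		q->inserted++;
		depth++;
	}
	/* The last batch still holds its lock; release it. */
	if (q)
		plug_model_unplug(q, depth);
	return locks;
}
```

With five requests spread over two queues, the model takes two locks in total, which is the property the sort is there to provide.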
|
|
|
|
|
|
|
|
void blk_finish_plug(struct blk_plug *plug)
|
|
|
|
{
|
2015-05-09 00:51:28 +07:00
|
|
|
if (plug != current->plug)
|
|
|
|
return;
|
2011-04-15 20:49:07 +07:00
|
|
|
blk_flush_plug_list(plug, false);
|
2011-03-08 19:19:51 +07:00
|
|
|
|
2015-05-09 00:51:28 +07:00
|
|
|
current->plug = NULL;
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
2011-04-15 20:20:10 +07:00
|
|
|
EXPORT_SYMBOL(blk_finish_plug);
|
2011-03-08 19:19:51 +07:00
|
|
|
|
2014-12-04 07:00:23 +07:00
|
|
|
#ifdef CONFIG_PM
|
2013-03-23 10:42:26 +07:00
|
|
|
/**
|
|
|
|
* blk_pm_runtime_init - Block layer runtime PM initialization routine
|
|
|
|
* @q: the queue of the device
|
|
|
|
* @dev: the device the queue belongs to
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Initialize runtime-PM-related fields for @q and start auto suspend for
|
|
|
|
* @dev. Drivers that want to take advantage of request-based runtime PM
|
|
|
|
* should call this function after @dev has been initialized, and its
|
|
|
|
* request queue @q has been allocated, and runtime PM for it cannot happen
|
|
|
|
* yet (either because it is disabled/forbidden or its usage_count > 0). In most
|
|
|
|
* cases, the driver should call this function before any I/O has taken place.
|
|
|
|
*
|
|
|
|
* This function takes care of setting up autosuspend for the device;
|
|
|
|
* the autosuspend delay is set to -1 to make runtime suspend impossible
|
|
|
|
* until an updated value is set by either the user or the driver. Drivers do
|
|
|
|
* not need to touch other autosuspend settings.
|
|
|
|
*
|
|
|
|
* Block layer runtime PM is request based, so it only works for drivers
|
|
|
|
* that use requests as their I/O unit, not for those that use bios directly.
|
|
|
|
*/
|
|
|
|
void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
|
|
|
|
{
|
|
|
|
q->dev = dev;
|
|
|
|
q->rpm_status = RPM_ACTIVE;
|
|
|
|
pm_runtime_set_autosuspend_delay(q->dev, -1);
|
|
|
|
pm_runtime_use_autosuspend(q->dev);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_pm_runtime_init);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_pre_runtime_suspend - Pre runtime suspend check
|
|
|
|
* @q: the queue of the device
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* This function will check if runtime suspend is allowed for the device
|
|
|
|
* by examining if there are any requests pending in the queue. If there
|
|
|
|
* are requests pending, the device cannot be runtime suspended; otherwise,
|
|
|
|
* the queue's status will be updated to SUSPENDING and the driver can
|
|
|
|
* proceed to suspend the device.
|
|
|
|
*
|
|
|
|
* If suspend is not allowed, we mark last busy for the device so that
|
|
|
|
* the runtime PM core will try to autosuspend it some time later.
|
|
|
|
*
|
|
|
|
* This function should be called near the start of the device's
|
|
|
|
* runtime_suspend callback.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* 0 - OK to runtime suspend the device
|
|
|
|
* -EBUSY - Device should not be runtime suspended
|
|
|
|
*/
|
|
|
|
int blk_pre_runtime_suspend(struct request_queue *q)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
2015-12-01 13:45:46 +07:00
|
|
|
if (!q->dev)
|
|
|
|
return ret;
|
|
|
|
|
2013-03-23 10:42:26 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
if (q->nr_pending) {
|
|
|
|
ret = -EBUSY;
|
|
|
|
pm_runtime_mark_last_busy(q->dev);
|
|
|
|
} else {
|
|
|
|
q->rpm_status = RPM_SUSPENDING;
|
|
|
|
}
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_pre_runtime_suspend);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_post_runtime_suspend - Post runtime suspend processing
|
|
|
|
* @q: the queue of the device
|
|
|
|
* @err: return value of the device's runtime_suspend function
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Update the queue's runtime status according to the return value of the
|
|
|
|
* device's runtime suspend function and mark last busy for the device so
|
|
|
|
* that PM core will try to auto suspend the device at a later time.
|
|
|
|
*
|
|
|
|
* This function should be called near the end of the device's
|
|
|
|
* runtime_suspend callback.
|
|
|
|
*/
|
|
|
|
void blk_post_runtime_suspend(struct request_queue *q, int err)
|
|
|
|
{
|
2015-12-01 13:45:46 +07:00
|
|
|
if (!q->dev)
|
|
|
|
return;
|
|
|
|
|
2013-03-23 10:42:26 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
if (!err) {
|
|
|
|
q->rpm_status = RPM_SUSPENDED;
|
|
|
|
} else {
|
|
|
|
q->rpm_status = RPM_ACTIVE;
|
|
|
|
pm_runtime_mark_last_busy(q->dev);
|
|
|
|
}
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_post_runtime_suspend);
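The kernel-doc above says the pre/post helpers bracket the device's runtime_suspend callback. The following user-space sketch models that bracketing and the resulting status transitions. It is an illustration under assumptions, not driver code: all `susp_model_*` names are hypothetical, the `hw_result` parameter stands in for whatever the hardware-specific suspend step returns, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mirror of the queue runtime PM states used above. */
enum susp_model_status {
	SUSP_MODEL_ACTIVE,
	SUSP_MODEL_SUSPENDING,
	SUSP_MODEL_SUSPENDED,
};

struct susp_model_queue {
	enum susp_model_status rpm_status;
	int nr_pending;       /* requests still pending on the queue */
	int last_busy_marks;  /* counts modelled pm_runtime_mark_last_busy() calls */
};

/* Models blk_pre_runtime_suspend(): refuse while requests are pending. */
static int susp_model_pre(struct susp_model_queue *q)
{
	if (q->nr_pending) {
		q->last_busy_marks++;
		return -EBUSY;
	}
	q->rpm_status = SUSP_MODEL_SUSPENDING;
	return 0;
}

/* Models blk_post_runtime_suspend(): commit or roll back the status. */
static void susp_model_post(struct susp_model_queue *q, int err)
{
	if (!err) {
		q->rpm_status = SUSP_MODEL_SUSPENDED;
	} else {
		q->rpm_status = SUSP_MODEL_ACTIVE;
		q->last_busy_marks++;
	}
}

/* Hypothetical driver callback: pre check, hardware step, post update. */
static int susp_model_driver_suspend(struct susp_model_queue *q, int hw_result)
{
	int ret = susp_model_pre(q);

	if (ret)
		return ret;          /* queue busy: do not touch the hardware */
	ret = hw_result;             /* stands in for the real suspend step */
	susp_model_post(q, ret);
	return ret;
}
```

A busy queue leaves the status at ACTIVE and returns -EBUSY; a failed hardware step rolls SUSPENDING back to ACTIVE; only a successful step reaches SUSPENDED.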
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_pre_runtime_resume - Pre runtime resume processing
|
|
|
|
* @q: the queue of the device
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Update the queue's runtime status to RESUMING in preparation for the
|
|
|
|
* runtime resume of the device.
|
|
|
|
*
|
|
|
|
* This function should be called near the start of the device's
|
|
|
|
* runtime_resume callback.
|
|
|
|
*/
|
|
|
|
void blk_pre_runtime_resume(struct request_queue *q)
|
|
|
|
{
|
2015-12-01 13:45:46 +07:00
|
|
|
if (!q->dev)
|
|
|
|
return;
|
|
|
|
|
2013-03-23 10:42:26 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
q->rpm_status = RPM_RESUMING;
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_pre_runtime_resume);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_post_runtime_resume - Post runtime resume processing
|
|
|
|
* @q: the queue of the device
|
|
|
|
* @err: return value of the device's runtime_resume function
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Update the queue's runtime status according to the return value of the
|
|
|
|
* device's runtime_resume function. If it is successfully resumed, process
|
|
|
|
* the requests that were queued into the device's queue while it was resuming
|
|
|
|
* and then mark last busy and initiate autosuspend for it.
|
|
|
|
*
|
|
|
|
* This function should be called near the end of the device's
|
|
|
|
* runtime_resume callback.
|
|
|
|
*/
|
|
|
|
void blk_post_runtime_resume(struct request_queue *q, int err)
|
|
|
|
{
|
2015-12-01 13:45:46 +07:00
|
|
|
if (!q->dev)
|
|
|
|
return;
|
|
|
|
|
2013-03-23 10:42:26 +07:00
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
if (!err) {
|
|
|
|
q->rpm_status = RPM_ACTIVE;
|
|
|
|
__blk_run_queue(q);
|
|
|
|
pm_runtime_mark_last_busy(q->dev);
|
2013-05-17 14:47:20 +07:00
|
|
|
pm_request_autosuspend(q->dev);
|
2013-03-23 10:42:26 +07:00
|
|
|
} else {
|
|
|
|
q->rpm_status = RPM_SUSPENDED;
|
|
|
|
}
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_post_runtime_resume);
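The resume side follows the same bracketing: the status moves to RESUMING before the hardware step, and the outcome decides whether queued requests get processed. The following user-space sketch models that flow. All `res_model_*` names are hypothetical, the `queued`/`dispatched` counters stand in for what __blk_run_queue() would process, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mirror of the queue runtime PM states used above. */
enum res_model_status {
	RES_MODEL_SUSPENDED,
	RES_MODEL_RESUMING,
	RES_MODEL_ACTIVE,
};

struct res_model_queue {
	enum res_model_status rpm_status;
	int queued;                /* requests that arrived while not active */
	int dispatched;            /* requests processed after resume */
	int autosuspend_requests;  /* counts modelled pm_request_autosuspend() calls */
};

/* Models blk_pre_runtime_resume(). */
static void res_model_pre(struct res_model_queue *q)
{
	q->rpm_status = RES_MODEL_RESUMING;
}

/* Models blk_post_runtime_resume(): run the queue only on success. */
static void res_model_post(struct res_model_queue *q, int err)
{
	if (!err) {
		q->rpm_status = RES_MODEL_ACTIVE;
		q->dispatched += q->queued;  /* stands in for __blk_run_queue() */
		q->queued = 0;
		q->autosuspend_requests++;
	} else {
		q->rpm_status = RES_MODEL_SUSPENDED;
	}
}

/* Hypothetical driver callback: pre update, hardware step, post update. */
static int res_model_driver_resume(struct res_model_queue *q, int hw_result)
{
	res_model_pre(q);
	res_model_post(q, hw_result);  /* hw_result stands in for the real step */
	return hw_result;
}
```

A failed resume leaves the queued requests in place and the status at SUSPENDED, so a later successful resume still finds them.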
|
2016-02-18 15:54:11 +07:00
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_set_runtime_active - Force runtime status of the queue to be active
|
|
|
|
* @q: the queue of the device
|
|
|
|
*
|
|
|
|
* If the device is left runtime suspended during system suspend, the resume
|
|
|
|
* hook typically resumes the device and corrects runtime status
|
|
|
|
* accordingly. However, that does not affect the queue runtime PM status
|
|
|
|
* which is still "suspended". This prevents processing requests from the
|
|
|
|
* queue.
|
|
|
|
*
|
|
|
|
* This function can be used in driver's resume hook to correct queue
|
|
|
|
* runtime PM status and re-enable peeking requests from the queue. It
|
|
|
|
* should be called before the first request is added to the queue.
|
|
|
|
*/
|
|
|
|
void blk_set_runtime_active(struct request_queue *q)
|
|
|
|
{
|
|
|
|
spin_lock_irq(q->queue_lock);
|
|
|
|
q->rpm_status = RPM_ACTIVE;
|
|
|
|
pm_runtime_mark_last_busy(q->dev);
|
|
|
|
pm_request_autosuspend(q->dev);
|
|
|
|
spin_unlock_irq(q->queue_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_set_runtime_active);
|
2013-03-23 10:42:26 +07:00
|
|
|
#endif
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
int __init blk_dev_init(void)
|
|
|
|
{
|
2016-10-28 21:48:16 +07:00
|
|
|
BUILD_BUG_ON(REQ_OP_LAST >= (1 << REQ_OP_BITS));
|
|
|
|
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
|
2015-07-07 14:11:07 +07:00
|
|
|
FIELD_SIZEOF(struct request, cmd_flags));
|
2016-10-28 21:48:16 +07:00
|
|
|
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
|
|
|
|
FIELD_SIZEOF(struct bio, bi_opf));
|
2009-04-27 19:53:54 +07:00
|
|
|
|
2011-01-03 21:01:47 +07:00
|
|
|
/* used for unplugging and affects IO latency/throughput - HIGHPRI */
|
|
|
|
kblockd_workqueue = alloc_workqueue("kblockd",
|
2014-06-12 04:43:54 +07:00
|
|
|
WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
|
2005-04-17 05:20:36 +07:00
|
|
|
if (!kblockd_workqueue)
|
|
|
|
panic("Failed to create kblockd\n");
|
|
|
|
|
|
|
|
request_cachep = kmem_cache_create("blkdev_requests",
|
2007-07-20 08:11:58 +07:00
|
|
|
sizeof(struct request), 0, SLAB_PANIC, NULL);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2015-11-21 04:16:46 +07:00
|
|
|
blk_requestq_cachep = kmem_cache_create("request_queue",
|
2007-07-24 14:28:11 +07:00
|
|
|
sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-02-01 05:53:20 +07:00
|
|
|
#ifdef CONFIG_DEBUG_FS
|
|
|
|
blk_debugfs_root = debugfs_create_dir("block", NULL);
|
|
|
|
#endif
|
|
|
|
|
2008-01-24 14:53:35 +07:00
|
|
|
return 0;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
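The BUILD_BUG_ON() checks at the top of blk_dev_init() guard a bit budget: the operation and the flags share one integer field, so their combined width must fit in it. The following user-space sketch expresses the same kind of check with C11 `_Static_assert`. The `DEMO_*` values, `struct demo_bio`, and the pack/unpack helpers are hypothetical and chosen only for illustration; they are not taken from the kernel headers.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: low bits hold the operation, the rest hold flags. */
enum {
	DEMO_OP_BITS   = 8,
	DEMO_FLAG_BITS = 24,
	DEMO_OP_LAST   = 36,  /* illustrative "one past the last operation" */
};

/* Hypothetical structure with a combined op/flags field. */
struct demo_bio {
	uint32_t opf;
};

/* Width in bits of a structure member, computed without an instance. */
#define DEMO_FIELD_BITS(type, member) \
	((size_t)CHAR_BIT * sizeof(((type *)0)->member))

/* Compile-time checks in the spirit of the BUILD_BUG_ON() calls above. */
_Static_assert(DEMO_OP_LAST < (1 << DEMO_OP_BITS),
	       "operation numbers must fit in DEMO_OP_BITS");
_Static_assert((size_t)(DEMO_OP_BITS + DEMO_FLAG_BITS) <=
	       DEMO_FIELD_BITS(struct demo_bio, opf),
	       "op and flag bits must fit in the opf field");

/* Pack an operation and flags into one field. */
static uint32_t demo_pack(unsigned int op, uint32_t flags)
{
	return (flags << DEMO_OP_BITS) | (op & ((1u << DEMO_OP_BITS) - 1));
}

/* Extract the operation from the low bits. */
static unsigned int demo_op(uint32_t opf)
{
	return opf & ((1u << DEMO_OP_BITS) - 1);
}

/* Extract the flags from the high bits. */
static uint32_t demo_flags(uint32_t opf)
{
	return opf >> DEMO_OP_BITS;
}
```

If either budget were exceeded, for example by widening DEMO_FLAG_BITS to 25, the translation unit would fail to compile rather than silently truncating flags at run time.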
|