2005-04-17 05:20:36 +07:00
#ifndef _LINUX_BLKDEV_H
#define _LINUX_BLKDEV_H

2012-05-14 13:29:23 +07:00
#include <linux/sched.h>
2017-02-01 22:36:40 +07:00
#include <linux/sched/clock.h>
2012-05-14 13:29:23 +07:00

2007-09-21 14:19:54 +07:00
#ifdef CONFIG_BLOCK

2005-04-17 05:20:36 +07:00
#include <linux/major.h>
#include <linux/genhd.h>
#include <linux/list.h>
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking requests on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
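
The queue mapping described in the message above (per-cpu queues funnelling into a smaller or equal number of hardware submission queues) can be illustrated with a minimal userspace sketch. The function name `sketch_map_cpu_to_hwq` and the contiguous-block policy are assumptions made for illustration only; they are not the mapping blk-mq itself uses.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel API): pick a hardware queue index
 * for the software queue that belongs to a given CPU.
 */
static unsigned int sketch_map_cpu_to_hwq(unsigned int cpu,
					  unsigned int nr_cpus,
					  unsigned int nr_hw_queues)
{
	assert(nr_cpus > 0 && nr_hw_queues > 0 && cpu < nr_cpus);

	/* Enough hardware queues for every CPU: 1:1 mapping. */
	if (nr_hw_queues >= nr_cpus)
		return cpu;

	/* N:M mapping: contiguous blocks of CPUs share one hardware queue. */
	return (unsigned int)((unsigned long long)cpu * nr_hw_queues / nr_cpus);
}
```

With 8 CPUs and 2 hardware queues, this sketch sends CPUs 0-3 to queue 0 and CPUs 4-7 to queue 1; with as many hardware queues as CPUs, each CPU gets its own queue.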
2013-10-24 15:20:05 +07:00
#include <linux/llist.h>
2005-04-17 05:20:36 +07:00
#include <linux/timer.h>
#include <linux/workqueue.h>
#include <linux/pagemap.h>
2015-05-23 04:13:32 +07:00
#include <linux/backing-dev-defs.h>
2005-04-17 05:20:36 +07:00
#include <linux/wait.h>
#include <linux/mempool.h>
2016-01-16 07:56:14 +07:00
#include <linux/pfn.h>
2005-04-17 05:20:36 +07:00
#include <linux/bio.h>
#include <linux/stringify.h>
2008-09-11 15:57:55 +07:00
#include <linux/gfp.h>
2007-07-09 17:40:35 +07:00
#include <linux/bsg.h>
2008-09-14 01:26:01 +07:00
#include <linux/smp.h>
2013-01-09 23:05:13 +07:00
#include <linux/rcupdate.h>
2014-07-01 23:34:38 +07:00
#include <linux/percpu-refcount.h>
2015-05-01 17:46:15 +07:00
#include <linux/scatterlist.h>
2016-10-18 13:40:33 +07:00
#include <linux/blkzoned.h>
2005-04-17 05:20:36 +07:00

2011-05-27 00:46:22 +07:00
struct module;
2006-03-22 23:52:04 +07:00
struct scsi_ioctl_command;

2005-04-17 05:20:36 +07:00
struct request_queue;
struct elevator_queue;
2006-03-24 02:00:26 +07:00
struct blk_trace;
2007-07-09 17:38:05 +07:00
struct request;
struct sg_io_hdr;
2011-08-01 03:05:09 +07:00
struct bsg_job;
2012-04-17 03:57:25 +07:00
struct blkcg_gq;
2014-09-25 22:23:43 +07:00
struct blk_flush_queue;
2015-10-15 19:10:48 +07:00
struct pr_ops;
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
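
The windowed throttling described in the message above can be sketched in plain C. This is a simplified userspace illustration, not the blk-wb implementation: the `sk_wb` struct, the halving of the depth per scale step, and stepping back toward zero after a good window are all assumptions of this sketch, and negative scale counts are left out. Only the default targets (2 msec non-rotational, 75 msec rotational) and the meaning of a zero target come from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for one throttled queue (names are illustrative). */
struct sk_wb {
	uint64_t target_usec;	/* latency target; 0 disables throttling */
	int scale_step;		/* larger value means a smaller queue depth */
	unsigned int max_depth;	/* depth allowed at scale_step == 0 */
};

/* Default latency targets quoted in the commit message. */
static uint64_t sk_default_target_usec(bool rotational)
{
	return rotational ? 75000 : 2000;
}

/* Assumed policy: halve the allowed depth for every positive scale step. */
static unsigned int sk_depth(const struct sk_wb *wb)
{
	unsigned int depth;

	if (wb->scale_step <= 0)
		return wb->max_depth;
	if (wb->scale_step >= 31)
		return 1;
	depth = wb->max_depth >> wb->scale_step;
	return depth ? depth : 1;
}

/* Called once per monitoring window with the minimum latency observed. */
static void sk_window_done(struct sk_wb *wb, uint64_t min_lat_usec)
{
	if (wb->target_usec == 0)
		return;			/* throttling disabled */
	if (min_lat_usec > wb->target_usec)
		wb->scale_step++;	/* target missed: throttle harder */
	else if (wb->scale_step > 0)
		wb->scale_step--;	/* good window: relax one step (assumption) */
}
```

A caller would run `sk_window_done()` when each window closes and then use `sk_depth()` to bound how much buffered writeback may be in flight.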
2016-11-10 02:38:14 +07:00
struct rq_wb;
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
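
The callback model described in the message above (a user registers a bucketing function and a callback that receives one window's statistics, after which the statistics are cleared) might look roughly like the following userspace sketch. All `sk_` names are hypothetical, and the per-cpu buffers, the RCU list, and the timer are deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SK_NR_BUCKETS 2	/* read vs. write, as wbt and polling use */

struct sk_stat {
	uint64_t nr_samples;
	uint64_t sum;
};

struct sk_stat_cb {
	int (*bucket_fn)(bool is_write);		/* picks a bucket per I/O */
	void (*timer_fn)(struct sk_stat_cb *cb);	/* consumes one window */
	struct sk_stat stat[SK_NR_BUCKETS];
	void *data;					/* private to the user */
};

static int sk_rw_bucket(bool is_write)
{
	return is_write ? 1 : 0;
}

/* I/O completion side: account one latency sample. */
static void sk_stat_add(struct sk_stat_cb *cb, bool is_write, uint64_t lat)
{
	int b = cb->bucket_fn(is_write);

	if (b < 0 || b >= SK_NR_BUCKETS)
		return;
	cb->stat[b].nr_samples++;
	cb->stat[b].sum += lat;
}

/* Window end: hand the stats to the user, then clear them. */
static void sk_stat_window_end(struct sk_stat_cb *cb)
{
	cb->timer_fn(cb);
	memset(cb->stat, 0, sizeof(cb->stat));
}

/* Example user: cache the previous window's mean read latency. */
static void sk_cache_read_mean(struct sk_stat_cb *cb)
{
	uint64_t *out = cb->data;
	const struct sk_stat *rd = &cb->stat[0];

	*out = rd->nr_samples ? rd->sum / rd->nr_samples : 0;
}
```

Because the window-end function clears the buckets itself, a consumer never sees samples left over from an earlier window, which is the property the message highlights.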
2017-03-21 22:56:08 +07:00
struct blk_queue_stats;
struct blk_stat_callback;
2005-04-17 05:20:36 +07:00

#define BLKDEV_MIN_RQ 4
#define BLKDEV_MAX_RQ 128 /* Default maximum */

2012-04-14 03:11:28 +07:00
/*
 * Maximum number of blkcg policies allowed to be registered concurrently.
 * Defined here to simplify include dependency.
 */
#define BLKCG_MAX_POLS 2

2006-01-06 15:49:03 +07:00
typedef void (rq_end_io_fn)(struct request *, int);
2005-04-17 05:20:36 +07:00

2012-06-05 10:40:59 +07:00
#define BLK_RL_SYNCFULL (1U << 0)
#define BLK_RL_ASYNCFULL (1U << 1)

2005-04-17 05:20:36 +07:00
struct request_list {
2012-06-05 10:40:59 +07:00
	struct request_queue *q; /* the queue this rl belongs to */
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 05:05:44 +07:00
#ifdef CONFIG_BLK_CGROUP
	struct blkcg_gq *blkg; /* blkg this request pool belongs to */
#endif
2009-04-06 19:48:01 +07:00
	/*
	 * count[], starved[], and wait[] are indexed by
	 * BLK_RW_SYNC/BLK_RW_ASYNC
	 */
2012-06-05 10:40:58 +07:00
	int count[2];
	int starved[2];
	mempool_t *rq_pool;
	wait_queue_head_t wait[2];
2012-06-05 10:40:59 +07:00
	unsigned int flags;
2005-04-17 05:20:36 +07:00
};
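
The struct above and the per-blkg allocation message that annotates it suggest two small operations: choosing a request list (the blkg's own list, falling back to the root list when the blkg lookup fails) and tracking fullness separately for sync and async requests. The sketch below is a userspace illustration under stated assumptions: the `sk_` types are simplified stand-ins, the index values chosen for sync and async are assumed, and the fullness limit is an invented parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed index values for the two-element arrays. */
enum { SK_RW_ASYNC = 0, SK_RW_SYNC = 1 };

/* Same bit positions as BLK_RL_SYNCFULL / BLK_RL_ASYNCFULL above. */
#define SK_RL_SYNCFULL	(1U << 0)
#define SK_RL_ASYNCFULL	(1U << 1)

struct sk_request_list {
	int count[2];		/* allocated requests, by sync/async */
	unsigned int flags;	/* SK_RL_*FULL bits */
};

struct sk_blkg { struct sk_request_list rl; };
struct sk_queue { struct sk_request_list root_rl; };

/* Fall back to the queue's root list when no blkg is available. */
static struct sk_request_list *sk_get_rl(struct sk_queue *q,
					 struct sk_blkg *blkg)
{
	return blkg ? &blkg->rl : &q->root_rl;
}

static bool sk_rl_full(const struct sk_request_list *rl, bool sync)
{
	unsigned int flag = sync ? SK_RL_SYNCFULL : SK_RL_ASYNCFULL;

	return (rl->flags & flag) != 0;
}

/* Account one allocation; mark the list full once the limit is reached. */
static void sk_account_alloc(struct sk_request_list *rl, bool sync, int limit)
{
	int idx = sync ? SK_RW_SYNC : SK_RW_ASYNC;

	if (++rl->count[idx] >= limit)
		rl->flags |= sync ? SK_RL_SYNCFULL : SK_RL_ASYNCFULL;
}
```

Since every blkg carries its own list in this sketch, filling one group's sync pool leaves the root list and the group's async pool untouched, which mirrors the isolation the message describes.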
2016-10-20 20:12:13 +07:00
/*
 * request flags */
typedef __u32 __bitwise req_flags_t;

/* elevator knows about this request */
#define RQF_SORTED ((__force req_flags_t)(1 << 0))
/* drive already may have started this one */
#define RQF_STARTED ((__force req_flags_t)(1 << 1))
/* uses tagged queueing */
#define RQF_QUEUED ((__force req_flags_t)(1 << 2))
/* may not be passed by ioscheduler */
#define RQF_SOFTBARRIER ((__force req_flags_t)(1 << 3))
/* request for flush sequence */
#define RQF_FLUSH_SEQ ((__force req_flags_t)(1 << 4))
/* merge of different types, fail separately */
#define RQF_MIXED_MERGE ((__force req_flags_t)(1 << 5))
/* track inflight for MQ */
#define RQF_MQ_INFLIGHT ((__force req_flags_t)(1 << 6))
/* don't call prep for this one */
#define RQF_DONTPREP ((__force req_flags_t)(1 << 7))
/* set for "ide_preempt" requests and also for requests for which the SCSI
   "quiesce" state must be ignored. */
#define RQF_PREEMPT ((__force req_flags_t)(1 << 8))
/* contains copies of user pages */
#define RQF_COPY_USER ((__force req_flags_t)(1 << 9))
/* vaguely specified driver internal error. Ignored by the block layer */
#define RQF_FAILED ((__force req_flags_t)(1 << 10))
/* don't warn about errors */
#define RQF_QUIET ((__force req_flags_t)(1 << 11))
/* elevator private data attached */
#define RQF_ELVPRIV ((__force req_flags_t)(1 << 12))
/* account I/O stat */
#define RQF_IO_STAT ((__force req_flags_t)(1 << 13))
/* request came from our alloc pool */
#define RQF_ALLOCED ((__force req_flags_t)(1 << 14))
/* runtime pm request */
#define RQF_PM ((__force req_flags_t)(1 << 15))
/* on IO scheduler merge hash */
#define RQF_HASHED ((__force req_flags_t)(1 << 16))
2016-11-08 11:32:37 +07:00
/* IO stats tracking on */
#define RQF_STATS ((__force req_flags_t)(1 << 17))
2016-12-09 05:20:32 +07:00
/* Look at ->special_vec for the actual data payload instead of the
   bio chain. */
#define RQF_SPECIAL_PAYLOAD ((__force req_flags_t)(1 << 18))
2016-10-20 20:12:13 +07:00

/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
2016-12-09 05:20:32 +07:00
	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
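
As a small illustration of how a mask such as RQF_NOMERGE_FLAGS is typically consumed, the sketch below tests a flags word against it. It is a userspace approximation: the `SK_` macros copy the bit positions listed above, the sparse annotations (`__bitwise`, `__force`) are dropped, and `sk_rq_flags_mergeable()` is a hypothetical helper rather than a function from this header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for req_flags_t (no sparse type checking here). */
typedef uint32_t sk_req_flags_t;

/* Bit positions copied from the RQF_* definitions above. */
#define SK_RQF_STARTED		((sk_req_flags_t)(1 << 1))
#define SK_RQF_SOFTBARRIER	((sk_req_flags_t)(1 << 3))
#define SK_RQF_FLUSH_SEQ	((sk_req_flags_t)(1 << 4))
#define SK_RQF_QUIET		((sk_req_flags_t)(1 << 11))
#define SK_RQF_SPECIAL_PAYLOAD	((sk_req_flags_t)(1 << 18))

#define SK_RQF_NOMERGE_FLAGS \
	(SK_RQF_STARTED | SK_RQF_SOFTBARRIER | SK_RQF_FLUSH_SEQ | \
	 SK_RQF_SPECIAL_PAYLOAD)

/* Hypothetical helper: true when no merge-blocking flag is set. */
static bool sk_rq_flags_mergeable(sk_req_flags_t flags)
{
	return (flags & SK_RQF_NOMERGE_FLAGS) == 0;
}
```

Flags outside the mask, such as the quiet bit, do not affect the result; any single bit from the mask is enough to rule a request out.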
2016-10-20 20:12:13 +07:00

2005-04-17 05:20:36 +07:00
/*
2014-05-06 17:12:45 +07:00
 * Try to put the fields that are referenced together in the same cacheline.
 *
 * If you modify this structure, make sure to update blk_rq_init() and
 * especially blk_mq_rq_ctx_init() to take care of the added fields.
2005-04-17 05:20:36 +07:00
 */
struct request {
2014-01-31 06:45:47 +07:00
	struct list_head queuelist;
2013-10-24 15:20:05 +07:00
	union {
		struct call_single_data csd;
2016-06-28 14:03:59 +07:00
		u64 fifo_time;
2013-10-24 15:20:05 +07:00
	};
2006-01-09 22:02:34 +07:00

2007-07-24 14:28:11 +07:00
	struct request_queue *q;
2013-10-24 15:20:05 +07:00
	struct blk_mq_ctx *mq_ctx;
2006-08-10 14:00:21 +07:00

2016-06-09 21:00:35 +07:00
	int cpu;
2016-10-28 21:48:16 +07:00
	unsigned int cmd_flags; /* op and common flags */
2016-10-20 20:12:13 +07:00
	req_flags_t rq_flags;
2017-02-01 02:34:41 +07:00

	int internal_tag;

2008-09-14 19:55:09 +07:00
	unsigned long atomic_flags;
2005-04-17 05:20:36 +07:00

2009-05-07 20:24:44 +07:00
	/* the following two fields are internal, NEVER access directly */
	unsigned int __data_len; /* total data len */
2017-01-17 20:03:22 +07:00
	int tag;
2010-03-19 14:58:16 +07:00
	sector_t __sector; /* sector cursor */
2005-04-17 05:20:36 +07:00

	struct bio *bio;
	struct bio *biotail;

2014-04-10 09:27:01 +07:00
	/*
	 * The hash is used inside the scheduler, and killed once the
	 * request reaches the dispatch list. The ipi_list is only used
	 * to queue the request for softirq completion, which is long
	 * after the request has been unhashed (and even removed from
	 * the dispatch list).
	 */
	union {
		struct hlist_node hash; /* merge hash */
		struct list_head ipi_list;
	};

2006-08-10 14:00:21 +07:00
	/*
	 * The rb_node is only used inside the io scheduler, requests
	 * are pruned when moved to the dispatch queue. So let the
2011-02-11 17:08:00 +07:00
	 * completion_data share space with the rb_node.
2006-08-10 14:00:21 +07:00
	 */
	union {
		struct rb_node rb_node; /* sort/lookup */
2016-12-09 05:20:32 +07:00
		struct bio_vec special_vec;
2011-02-11 17:08:00 +07:00
		void *completion_data;
2006-08-10 14:00:21 +07:00
	};
2006-07-28 14:23:08 +07:00

2006-07-12 19:04:37 +07:00
	/*
2010-04-21 22:44:16 +07:00
	 * Three pointers are available for the IO schedulers, if they need
2011-02-11 17:08:00 +07:00
	 * more they have to dynamically allocate it. Flush requests are
	 * never put on the IO scheduler. So let the flush fields share
2011-12-14 06:33:41 +07:00
	 * space with the elevator data.
2006-07-12 19:04:37 +07:00
	 */
2011-02-11 17:08:00 +07:00
	union {
2011-12-14 06:33:41 +07:00
		struct {
			struct io_cq *icq;
			void *priv[2];
		} elv;

2011-02-11 17:08:00 +07:00
		struct {
			unsigned int seq;
			struct list_head list;
block: fix flush machinery for stacking drivers with differing flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-16 02:37:25 +07:00
			rq_end_io_fn *saved_end_io;
2011-02-11 17:08:00 +07:00
		} flush;
	};
2006-07-12 19:04:37 +07:00

2006-06-13 14:02:34 +07:00
	struct gendisk *rq_disk;
2011-01-05 22:57:38 +07:00
	struct hd_struct *part;
2005-04-17 05:20:36 +07:00
	unsigned long start_time;
2016-11-08 11:32:37 +07:00
	struct blk_issue_stat issue_stat;
2010-04-02 05:01:41 +07:00
#ifdef CONFIG_BLK_CGROUP
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
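The per-group allocation and the root fallback from the v3 note can be modelled in a few lines of C. This is a sketch under stated assumptions, not the kernel's `blk_get_rl()`: `struct queue`, `struct group`, `get_rl()`, `rl_alloc()` and `MAX_GROUPS` are hypothetical, and a NULL slot in `groups[]` stands in for a failed blkg lookup.

```c
#include <stddef.h>

#define MAX_GROUPS 4	/* hypothetical limit for the sketch */

struct request_list {
	int count;	/* requests currently allocated from this pool */
	int limit;	/* pool capacity */
};

struct group {
	struct request_list rl;	/* this group's private pool */
};

struct queue {
	struct request_list root_rl;		/* embedded pool of the root group */
	struct group *groups[MAX_GROUPS];	/* NULL models a failed lookup */
};

/* Pick the pool for group_id, falling back to the root pool on lookup failure. */
static struct request_list *get_rl(struct queue *q, int group_id)
{
	if (group_id < 0 || group_id >= MAX_GROUPS || !q->groups[group_id])
		return &q->root_rl;
	return &q->groups[group_id]->rl;
}

/* Allocate one request; returns 0 on success, -1 when the pool is exhausted. */
static int rl_alloc(struct request_list *rl)
{
	if (rl->count >= rl->limit)
		return -1;
	rl->count++;
	return 0;
}
```

With separate pools, a group that exhausts its own `request_list` only blocks itself; allocations that resolve to `root_rl` or to another group's pool still succeed.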
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 05:05:44 +07:00
|
|
|
struct request_list *rl; /* rl this rq is alloced from */
|
2010-04-02 05:01:41 +07:00
|
|
|
unsigned long long start_time_ns;
|
|
|
|
unsigned long long io_start_time_ns; /* when passed to hardware */
|
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
/* Number of scatter-gather DMA addr+len pairs after
|
|
|
|
* physical address coalescing is performed.
|
|
|
|
*/
|
|
|
|
unsigned short nr_phys_segments;
|
2010-09-11 01:50:10 +07:00
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
unsigned short nr_integrity_segments;
|
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-06-13 14:02:34 +07:00
|
|
|
unsigned short ioprio;
|
|
|
|
|
2017-04-06 01:16:38 +07:00
|
|
|
unsigned int timeout;
|
|
|
|
|
2009-04-23 09:05:20 +07:00
|
|
|
void *special; /* opaque pointer available for LLD use */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-07-28 14:32:07 +07:00
|
|
|
int errors;
|
|
|
|
|
2008-03-04 17:17:11 +07:00
|
|
|
unsigned int extra_len; /* length of alignment and padding */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-09-14 19:55:09 +07:00
|
|
|
unsigned long deadline;
|
|
|
|
struct list_head timeout_list;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
2006-10-01 01:29:12 +07:00
|
|
|
* completion callback.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
rq_end_io_fn *end_io;
|
|
|
|
void *end_io_data;
|
2007-07-16 13:52:14 +07:00
|
|
|
|
|
|
|
/* for bidi */
|
|
|
|
struct request *next_rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
2017-01-31 22:57:31 +07:00
|
|
|
static inline bool blk_rq_is_scsi(struct request *rq)
|
|
|
|
{
|
|
|
|
return req_op(rq) == REQ_OP_SCSI_IN || req_op(rq) == REQ_OP_SCSI_OUT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_rq_is_private(struct request *rq)
|
|
|
|
{
|
|
|
|
return req_op(rq) == REQ_OP_DRV_IN || req_op(rq) == REQ_OP_DRV_OUT;
|
|
|
|
}
|
|
|
|
|
2017-01-31 22:57:29 +07:00
|
|
|
static inline bool blk_rq_is_passthrough(struct request *rq)
|
|
|
|
{
|
2017-01-31 22:57:31 +07:00
|
|
|
return blk_rq_is_scsi(rq) || blk_rq_is_private(rq);
|
2017-01-31 22:57:29 +07:00
|
|
|
}
|
|
|
|
|
2008-08-14 14:59:13 +07:00
|
|
|
static inline unsigned short req_get_ioprio(struct request *req)
|
|
|
|
{
|
|
|
|
return req->ioprio;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/elevator.h>
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
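The funnelling of per-cpu queues into hardware queues can be pictured as a lookup table indexed by CPU. The sketch below is a userspace illustration only: `build_mq_map()` is a hypothetical helper, and the round-robin assignment is an assumption made for the example, not a statement of how the kernel chooses its mapping.

```c
#include <stdlib.h>

/*
 * Build a table mapping each CPU's software queue to a hardware queue.
 * Round-robin assignment is an assumption made for this sketch.
 * Returns NULL on invalid arguments or allocation failure; the caller
 * frees the table with free().
 */
static unsigned int *build_mq_map(unsigned int nr_cpus, unsigned int nr_hw_queues)
{
	unsigned int *map;
	unsigned int cpu;

	if (nr_cpus == 0 || nr_hw_queues == 0)
		return NULL;
	map = malloc(nr_cpus * sizeof(*map));
	if (!map)
		return NULL;
	for (cpu = 0; cpu < nr_cpus; cpu++)
		map[cpu] = cpu % nr_hw_queues;
	return map;
}
```

With as many hardware queues as CPUs the table is the identity (the 1:1 case); with fewer hardware queues several CPUs share one submission queue (the N:M case).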
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 15:20:05 +07:00
|
|
|
struct blk_queue_ctx;
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
typedef void (request_fn_proc) (struct request_queue *q);
|
2015-11-06 00:41:16 +07:00
|
|
|
typedef blk_qc_t (make_request_fn) (struct request_queue *q, struct bio *bio);
|
2007-07-24 14:28:11 +07:00
|
|
|
typedef int (prep_rq_fn) (struct request_queue *, struct request *);
|
2010-07-01 17:49:17 +07:00
|
|
|
typedef void (unprep_rq_fn) (struct request_queue *, struct request *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
struct bio_vec;
|
2006-01-09 22:02:34 +07:00
|
|
|
typedef void (softirq_done_fn)(struct request *);
|
2008-02-19 17:36:53 +07:00
|
|
|
typedef int (dma_drain_needed_fn)(struct request *);
|
2008-10-01 21:12:15 +07:00
|
|
|
typedef int (lld_busy_fn) (struct request_queue *q);
|
2011-08-01 03:05:09 +07:00
|
|
|
typedef int (bsg_job_fn) (struct bsg_job *);
|
2017-01-27 23:51:45 +07:00
|
|
|
typedef int (init_rq_fn)(struct request_queue *, struct request *, gfp_t);
|
|
|
|
typedef void (exit_rq_fn)(struct request_queue *, struct request *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-09-14 19:55:09 +07:00
|
|
|
enum blk_eh_timer_return {
|
|
|
|
BLK_EH_NOT_HANDLED,
|
|
|
|
BLK_EH_HANDLED,
|
|
|
|
BLK_EH_RESET_TIMER,
|
|
|
|
};
|
|
|
|
|
|
|
|
typedef enum blk_eh_timer_return (rq_timed_out_fn)(struct request *);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
enum blk_queue_state {
|
|
|
|
Queue_down,
|
|
|
|
Queue_up,
|
|
|
|
};
|
|
|
|
|
|
|
|
struct blk_queue_tag {
|
|
|
|
struct request **tag_index; /* map of busy tags */
|
|
|
|
unsigned long *tag_map; /* bit map of free/busy tags */
|
|
|
|
int max_depth; /* what we will send to device */
|
2005-08-06 03:28:11 +07:00
|
|
|
int real_max_depth; /* what the array can hold */
|
2005-04-17 05:20:36 +07:00
|
|
|
atomic_t refcnt; /* map can be shared */
|
2015-01-16 08:32:25 +07:00
|
|
|
int alloc_policy; /* tag allocation policy */
|
|
|
|
int next_tag; /* next tag */
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
2015-01-16 08:32:25 +07:00
|
|
|
#define BLK_TAG_ALLOC_FIFO 0 /* allocate starting from 0 */
|
|
|
|
#define BLK_TAG_ALLOC_RR 1 /* allocate starting from last allocated tag */
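The difference between the two policies is easiest to see on a small bitmap. The following is a userspace sketch, assuming a pool of at most 32 tags held in one word; `struct tag_pool`, `tag_alloc()`, `tag_free()` and `SKETCH_MAX_TAGS` are hypothetical names, and only the two policy constants are taken from the header.

```c
#define BLK_TAG_ALLOC_FIFO 0	/* allocate starting from 0 */
#define BLK_TAG_ALLOC_RR   1	/* allocate starting from last allocated tag */

#define SKETCH_MAX_TAGS 32	/* hypothetical fixed depth for the sketch */

struct tag_pool {
	unsigned long busy;	/* bit n set means tag n is in use */
	int max_depth;		/* number of usable tags, <= SKETCH_MAX_TAGS */
	int alloc_policy;
	int next_tag;		/* where a round-robin search starts */
};

/* Returns a free tag, or -1 when every tag is busy. */
static int tag_alloc(struct tag_pool *p)
{
	int start = p->alloc_policy == BLK_TAG_ALLOC_RR ? p->next_tag : 0;
	int i;

	for (i = 0; i < p->max_depth; i++) {
		int tag = (start + i) % p->max_depth;

		if (!(p->busy & (1UL << tag))) {
			p->busy |= 1UL << tag;
			p->next_tag = (tag + 1) % p->max_depth;
			return tag;
		}
	}
	return -1;
}

/* Release a tag so that a later allocation can reuse it. */
static void tag_free(struct tag_pool *p, int tag)
{
	p->busy &= ~(1UL << tag);
}
```

Under the FIFO policy a freed low-numbered tag is handed out again immediately, while the round-robin policy keeps advancing and only returns to it after wrapping around.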
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-08-16 12:10:05 +07:00
|
|
|
#define BLK_SCSI_MAX_CMDS (256)
|
|
|
|
#define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
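Despite its name, `BLK_SCSI_CMD_PER_LONG` evaluates to the number of `long` words needed to hold one bit per command: 256 / 64 = 4 where `long` is 64 bits, 256 / 32 = 8 where it is 32 bits. A minimal sketch of how such a bitmap might be used follows; `struct cmd_filter`, `filter_allow()`, `filter_is_allowed()` and `SKETCH_BITS_PER_LONG` are hypothetical names chosen for the example.

```c
#include <stddef.h>

#define BLK_SCSI_MAX_CMDS	(256)
#define BLK_SCSI_CMD_PER_LONG	(BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))

#define SKETCH_BITS_PER_LONG	(sizeof(long) * 8)	/* assumes 8-bit bytes */

/* Hypothetical filter: one bit per SCSI opcode. */
struct cmd_filter {
	unsigned long allowed[BLK_SCSI_CMD_PER_LONG];
};

/* Mark an opcode as permitted. */
static void filter_allow(struct cmd_filter *f, unsigned char opcode)
{
	f->allowed[opcode / SKETCH_BITS_PER_LONG] |=
		1UL << (opcode % SKETCH_BITS_PER_LONG);
}

/* Returns 1 if the opcode's bit is set, 0 otherwise. */
static int filter_is_allowed(const struct cmd_filter *f, unsigned char opcode)
{
	return (f->allowed[opcode / SKETCH_BITS_PER_LONG] >>
		(opcode % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

Since an opcode is a single byte, every possible value indexes a valid bit and no range check is needed.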
|
|
|
|
|
2016-10-18 13:40:29 +07:00
|
|
|
/*
|
|
|
|
* Zoned block device models (zoned limit).
|
|
|
|
*/
|
|
|
|
enum blk_zoned_model {
|
|
|
|
BLK_ZONED_NONE, /* Regular block device */
|
|
|
|
BLK_ZONED_HA, /* Host-aware zoned block device */
|
|
|
|
BLK_ZONED_HM, /* Host-managed zoned block device */
|
|
|
|
};
|
|
|
|
|
2009-05-23 04:17:51 +07:00
|
|
|
struct queue_limits {
|
|
|
|
unsigned long bounce_pfn;
|
|
|
|
unsigned long seg_boundary_mask;
|
2015-08-20 04:24:05 +07:00
|
|
|
unsigned long virt_boundary_mask;
|
2009-05-23 04:17:51 +07:00
|
|
|
|
|
|
|
unsigned int max_hw_sectors;
|
2015-11-14 04:46:48 +07:00
|
|
|
unsigned int max_dev_sectors;
|
2014-06-06 02:38:39 +07:00
|
|
|
unsigned int chunk_sectors;
|
2009-05-23 04:17:51 +07:00
|
|
|
unsigned int max_sectors;
|
|
|
|
unsigned int max_segment_size;
|
2009-05-23 04:17:53 +07:00
|
|
|
unsigned int physical_block_size;
|
|
|
|
unsigned int alignment_offset;
|
|
|
|
unsigned int io_min;
|
|
|
|
unsigned int io_opt;
|
2009-09-30 18:54:20 +07:00
|
|
|
unsigned int max_discard_sectors;
|
2015-07-16 22:14:26 +07:00
|
|
|
unsigned int max_hw_discard_sectors;
|
2012-09-18 23:19:27 +07:00
|
|
|
unsigned int max_write_same_sectors;
|
2016-12-01 03:28:59 +07:00
|
|
|
unsigned int max_write_zeroes_sectors;
|
2009-11-10 17:50:21 +07:00
|
|
|
unsigned int discard_granularity;
|
|
|
|
unsigned int discard_alignment;
|
2009-05-23 04:17:51 +07:00
|
|
|
|
|
|
|
unsigned short logical_block_size;
|
2010-02-26 12:20:39 +07:00
|
|
|
unsigned short max_segments;
|
2010-09-11 01:50:10 +07:00
|
|
|
unsigned short max_integrity_segments;
|
2017-02-08 20:46:49 +07:00
|
|
|
unsigned short max_discard_segments;
|
2009-05-23 04:17:51 +07:00
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
unsigned char misaligned;
|
2009-11-10 17:50:21 +07:00
|
|
|
unsigned char discard_misaligned;
|
2010-12-02 01:41:49 +07:00
|
|
|
unsigned char cluster;
|
2013-07-12 12:39:53 +07:00
|
|
|
unsigned char raid_partial_stripes_expensive;
|
2016-10-18 13:40:29 +07:00
|
|
|
enum blk_zoned_model zoned;
|
2009-05-23 04:17:51 +07:00
|
|
|
};
|
|
|
|
|
2016-10-18 13:40:33 +07:00
|
|
|
#ifdef CONFIG_BLK_DEV_ZONED
|
|
|
|
|
|
|
|
struct blk_zone_report_hdr {
|
|
|
|
unsigned int nr_zones;
|
|
|
|
u8 padding[60];
|
|
|
|
};
|
|
|
|
|
|
|
|
extern int blkdev_report_zones(struct block_device *bdev,
|
|
|
|
sector_t sector, struct blk_zone *zones,
|
|
|
|
unsigned int *nr_zones, gfp_t gfp_mask);
|
|
|
|
extern int blkdev_reset_zones(struct block_device *bdev, sector_t sectors,
|
|
|
|
sector_t nr_sectors, gfp_t gfp_mask);
|
|
|
|
|
2016-10-18 13:40:35 +07:00
|
|
|
extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
|
|
|
|
unsigned int cmd, unsigned long arg);
|
|
|
|
extern int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode,
|
|
|
|
unsigned int cmd, unsigned long arg);
|
|
|
|
|
|
|
|
#else /* CONFIG_BLK_DEV_ZONED */
|
|
|
|
|
|
|
|
static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
|
|
|
|
fmode_t mode, unsigned int cmd,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int blkdev_reset_zones_ioctl(struct block_device *bdev,
|
|
|
|
fmode_t mode, unsigned int cmd,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
|
|
|
|
2016-10-18 13:40:33 +07:00
|
|
|
#endif /* CONFIG_BLK_DEV_ZONED */
|
|
|
|
|
2011-07-14 02:17:23 +07:00
|
|
|
struct request_queue {
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Together with queue_head for cacheline sharing
|
|
|
|
*/
|
|
|
|
struct list_head queue_head;
|
|
|
|
struct request *last_merge;
|
2008-10-31 16:05:07 +07:00
|
|
|
struct elevator_queue *elevator;
|
2012-06-05 10:40:58 +07:00
|
|
|
int nr_rqs[2]; /* # allocated [a]sync rqs */
|
|
|
|
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for either of the
in-tree users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
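The callback model described above can be sketched in userspace C. This is an illustration only and deliberately leaves out the RCU list and the percpu buffers: `struct stat_callback`, `stat_add()`, `stat_window_expired()`, `rw_bucket()`, `cache_read_mean()` and `SKETCH_MAX_BUCKETS` are hypothetical names, not the kernel's interface.

```c
#include <stddef.h>

#define SKETCH_MAX_BUCKETS 4	/* hypothetical cap on buckets per callback */

struct stat_bucket {
	unsigned long long nr;		/* samples in the current window */
	unsigned long long total_ns;	/* sum of latencies in the window */
};

struct stat_callback {
	/* Chooses a bucket for a sample, e.g. 0 for reads and 1 for writes. */
	int (*bucket_fn)(int is_write);
	/* Invoked when the window ends, with that window's statistics. */
	void (*timer_fn)(struct stat_callback *cb);
	struct stat_bucket stat[SKETCH_MAX_BUCKETS];
	void *data;	/* private data of the user */
};

/* Completion path: account one finished I/O in the callback's window. */
static void stat_add(struct stat_callback *cb, int is_write, unsigned long long lat_ns)
{
	int b = cb->bucket_fn(is_write);

	if (b < 0 || b >= SKETCH_MAX_BUCKETS)
		return;
	cb->stat[b].nr++;
	cb->stat[b].total_ns += lat_ns;
}

/* Window expiry: report to the user, then clear so nothing stale carries over. */
static void stat_window_expired(struct stat_callback *cb)
{
	int b;

	cb->timer_fn(cb);
	for (b = 0; b < SKETCH_MAX_BUCKETS; b++) {
		cb->stat[b].nr = 0;
		cb->stat[b].total_ns = 0;
	}
}

/* Example user: bucket by direction and cache the mean read latency. */
static int rw_bucket(int is_write)
{
	return is_write ? 1 : 0;
}

static void cache_read_mean(struct stat_callback *cb)
{
	unsigned long long *mean = cb->data;

	if (cb->stat[0].nr)
		*mean = cb->stat[0].total_ns / cb->stat[0].nr;
}
```

The example user keeps the previous window's mean when a window sees no reads, which mirrors the idea of caching the last full window for the hybrid polling heuristic.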
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 22:56:08 +07:00
|
|
|
struct blk_queue_stats *stats;
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother and to have far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
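The defaults quoted above translate directly into microseconds, the unit of the sysfs entry. A small sketch, with hypothetical helper names `default_wb_lat_usec()` and `wb_enabled()`:

```c
#include <stdbool.h>

/* Default latency targets from the description above, in microseconds. */
static unsigned long long default_wb_lat_usec(bool rotational)
{
	return rotational ? 75000ULL : 2000ULL;	/* 75 msec vs. 2 msec */
}

/* A target of 0 switches throttling off. */
static bool wb_enabled(unsigned long long wb_lat_usec)
{
	return wb_lat_usec != 0;
}
```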
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 02:38:14 +07:00
|
|
|
struct rq_wb *rq_wb;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2012-06-27 05:05:44 +07:00
|
|
|
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
|
|
|
|
* is used, root blkg allocates from @q->root_rl and all other
|
|
|
|
* blkgs from their own blkg->rl. Which one to use should be
|
|
|
|
* determined using bio_request_list().
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2012-06-27 05:05:44 +07:00
|
|
|
struct request_list root_rl;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
request_fn_proc *request_fn;
|
|
|
|
make_request_fn *make_request_fn;
|
|
|
|
prep_rq_fn *prep_rq_fn;
|
2010-07-01 17:49:17 +07:00
|
|
|
unprep_rq_fn *unprep_rq_fn;
|
2006-01-09 22:02:34 +07:00
|
|
|
softirq_done_fn *softirq_done_fn;
|
2008-09-14 19:55:09 +07:00
|
|
|
rq_timed_out_fn *rq_timed_out_fn;
|
2008-02-19 17:36:53 +07:00
|
|
|
dma_drain_needed_fn *dma_drain_needed;
|
2008-10-01 21:12:15 +07:00
|
|
|
lld_busy_fn *lld_busy_fn;
|
2017-01-27 23:51:45 +07:00
|
|
|
init_rq_fn *init_rq_fn;
|
|
|
|
exit_rq_fn *exit_rq_fn;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2016-12-13 23:24:51 +07:00
|
|
|
const struct blk_mq_ops *mq_ops;
|
2013-10-24 15:20:05 +07:00
|
|
|
|
|
|
|
unsigned int *mq_map;
|
|
|
|
|
|
|
|
/* sw queues */
|
2014-06-03 10:24:06 +07:00
|
|
|
struct blk_mq_ctx __percpu *queue_ctx;
|
2013-10-24 15:20:05 +07:00
|
|
|
unsigned int nr_queues;
|
|
|
|
|
2016-03-30 23:21:08 +07:00
|
|
|
unsigned int queue_depth;
|
|
|
|
|
2013-10-24 15:20:05 +07:00
	/* hw dispatch queues */
	struct blk_mq_hw_ctx **queue_hw_ctx;
	unsigned int nr_hw_queues;

2005-10-20 21:23:44 +07:00
	/*
	 * Dispatch queue sorting
	 */
2005-10-20 21:37:00 +07:00
	sector_t end_sector;
2005-10-20 21:23:44 +07:00
	struct request *boundary_rq;

2005-04-17 05:20:36 +07:00
	/*
2011-03-02 23:08:00 +07:00
	 * Delayed queue handling
2005-04-17 05:20:36 +07:00
	 */
2011-03-02 23:08:00 +07:00
	struct delayed_work delay_work;
2005-04-17 05:20:36 +07:00

2017-02-02 21:56:50 +07:00
	struct backing_dev_info *backing_dev_info;
2005-04-17 05:20:36 +07:00

	/*
	 * The queue owner gets to use this for whatever they like.
	 * ll_rw_blk doesn't touch it.
	 */
	void *queuedata;

	/*
2011-07-14 02:17:23 +07:00
	 * various queue flags, see QUEUE_* below
2005-04-17 05:20:36 +07:00
	 */
2011-07-14 02:17:23 +07:00
	unsigned long queue_flags;
2005-04-17 05:20:36 +07:00

2011-12-14 06:33:37 +07:00
	/*
	 * ida allocated id for this queue. Used to index queues from
	 * ioctx.
	 */
	int id;

2005-04-17 05:20:36 +07:00
	/*
2011-07-14 02:17:23 +07:00
	 * queue needs bounce pages for pages above this limit
2005-04-17 05:20:36 +07:00
	 */
2011-07-14 02:17:23 +07:00
	gfp_t bounce_gfp;
2005-04-17 05:20:36 +07:00

	/*
2005-04-13 04:22:06 +07:00
	 * protects queue structures from reentrancy. ->__queue_lock should
	 * _never_ be used directly, it is queue private. always use
	 * ->queue_lock.
2005-04-17 05:20:36 +07:00
	 */
2005-04-13 04:22:06 +07:00
	spinlock_t __queue_lock;
2005-04-17 05:20:36 +07:00
	spinlock_t *queue_lock;

	/*
	 * queue kobject
	 */
	struct kobject kobj;

2013-10-24 15:20:05 +07:00
	/*
	 * mq queue kobject
	 */
	struct kobject mq_kobj;

2015-10-22 00:20:18 +07:00
#ifdef CONFIG_BLK_DEV_INTEGRITY
	struct blk_integrity integrity;
#endif /* CONFIG_BLK_DEV_INTEGRITY */

2014-12-04 07:00:23 +07:00
#ifdef CONFIG_PM
2013-03-23 10:42:26 +07:00
	struct device *dev;
	int rpm_status;
	unsigned int nr_pending;
#endif

2005-04-17 05:20:36 +07:00
	/*
	 * queue settings
	 */
	unsigned long nr_requests; /* Max # of requests */
	unsigned int nr_congestion_on;
	unsigned int nr_congestion_off;
	unsigned int nr_batching;

2008-01-11 00:30:36 +07:00
	unsigned int dma_drain_size;
2011-07-14 02:17:23 +07:00
	void *dma_drain_buffer;
2008-03-04 17:18:17 +07:00
	unsigned int dma_pad_mask;
2005-04-17 05:20:36 +07:00
	unsigned int dma_alignment;

	struct blk_queue_tag *queue_tags;
2007-10-25 15:14:47 +07:00
	struct list_head tag_busy_list;
2005-04-17 05:20:36 +07:00

2005-11-10 14:52:05 +07:00
	unsigned int nr_sorted;
2009-05-20 13:54:31 +07:00
	unsigned int in_flight[2];
2016-11-08 11:32:37 +07:00

2012-11-28 19:46:45 +07:00
	/*
	 * Number of active block driver functions for which blk_drain_queue()
	 * must wait. Must be incremented around functions that unlock the
	 * queue_lock internally, e.g. scsi_request_fn().
	 */
	unsigned int request_fn_active;
2005-04-17 05:20:36 +07:00

2008-09-14 19:55:09 +07:00
	unsigned int rq_timeout;
2016-11-15 03:03:03 +07:00
	int poll_nsec;
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 22:56:08 +07:00

	struct blk_stat_callback *poll_cb;
	struct blk_rq_stat poll_stat[2];

2008-09-14 19:55:09 +07:00
	struct timer_list timeout;
2015-10-30 19:57:30 +07:00
	struct work_struct timeout_work;
2008-09-14 19:55:09 +07:00
	struct list_head timeout_list;

2011-12-14 06:33:41 +07:00
	struct list_head icq_list;
2012-03-06 04:15:18 +07:00
#ifdef CONFIG_BLK_CGROUP
2012-04-14 03:11:33 +07:00
	DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS);
2012-04-17 03:57:25 +07:00
	struct blkcg_gq *root_blkg;
2012-03-06 04:15:19 +07:00
	struct list_head blkg_list;
2012-03-06 04:15:18 +07:00
#endif
2011-12-14 06:33:41 +07:00

2009-05-23 04:17:51 +07:00
	struct queue_limits limits;

2005-04-17 05:20:36 +07:00
	/*
	 * sg stuff
	 */
	unsigned int sg_timeout;
	unsigned int sg_reserved_size;
2005-06-23 14:08:19 +07:00
	int node;
2006-09-29 15:59:40 +07:00
#ifdef CONFIG_BLK_DEV_IO_TRACE
2006-03-24 02:00:26 +07:00
	struct blk_trace *blk_trace;
2006-09-29 15:59:40 +07:00
#endif
2005-04-17 05:20:36 +07:00
	/*
2010-09-03 16:56:16 +07:00
	 * for flush operations
2005-04-17 05:20:36 +07:00
	 */
2014-09-25 22:23:43 +07:00
	struct blk_flush_queue *fq;
2006-03-19 06:34:37 +07:00

2014-05-28 21:08:02 +07:00
	struct list_head requeue_list;
	spinlock_t requeue_lock;
2016-09-15 00:28:30 +07:00
	struct delayed_work requeue_work;
2014-05-28 21:08:02 +07:00

2006-03-19 06:34:37 +07:00
	struct mutex sysfs_lock;
2007-07-09 17:40:35 +07:00

2012-03-06 04:14:58 +07:00
	int bypass_depth;
2015-05-07 14:38:13 +07:00
	atomic_t mq_freeze_depth;
2012-03-06 04:14:58 +07:00

2007-07-09 17:40:35 +07:00
#if defined(CONFIG_BLK_DEV_BSG)
2011-08-01 03:05:09 +07:00
	bsg_job_fn *bsg_job_fn;
	int bsg_job_size;
2007-07-09 17:40:35 +07:00
	struct bsg_class_device bsg_dev;
#endif
2010-09-16 04:06:35 +07:00

#ifdef CONFIG_BLK_DEV_THROTTLING
	/* Throttle data */
	struct throtl_data *td;
#endif
2013-01-09 23:05:13 +07:00
	struct rcu_head rcu_head;
2013-10-24 15:20:05 +07:00
	wait_queue_head_t mq_freeze_wq;
2015-10-22 00:20:12 +07:00
	struct percpu_ref q_usage_counter;
2013-10-24 15:20:05 +07:00
	struct list_head all_q_node;
2014-05-14 04:10:52 +07:00

	struct blk_mq_tag_set *tag_set;
	struct list_head tag_set_list;
2015-04-24 12:37:18 +07:00
	struct bio_set *bio_split;
2015-09-27 00:09:20 +07:00

2017-02-01 05:53:18 +07:00
#ifdef CONFIG_BLK_DEBUG_FS
2017-01-25 23:06:40 +07:00
	struct dentry *debugfs_dir;
	struct dentry *mq_debugfs_dir;
#endif

2015-09-27 00:09:20 +07:00
	bool mq_sysfs_init_done;
2017-01-27 23:51:45 +07:00

	size_t cmd_size;
	void *rq_alloc_data;
2005-04-17 05:20:36 +07:00
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
#define QUEUE_FLAG_STOPPED 2 /* queue is stopped */
2009-04-06 19:48:01 +07:00
#define QUEUE_FLAG_SYNCFULL 3 /* read queue has been filled */
#define QUEUE_FLAG_ASYNCFULL 4 /* write queue has been filled */
2012-11-28 19:42:38 +07:00
#define QUEUE_FLAG_DYING 5 /* queue being torn down */
2012-03-06 04:14:58 +07:00
#define QUEUE_FLAG_BYPASS 6 /* act as dumb FIFO queue */
2011-04-19 18:32:46 +07:00
#define QUEUE_FLAG_BIDI 7 /* queue supports bidi requests */
#define QUEUE_FLAG_NOMERGES 8 /* disable merge attempts */
2011-07-24 01:44:25 +07:00
#define QUEUE_FLAG_SAME_COMP 9 /* complete on same CPU-group */
2011-04-19 18:32:46 +07:00
#define QUEUE_FLAG_FAIL_IO 10 /* fake timeout */
#define QUEUE_FLAG_STACKABLE 11 /* supports request stacking */
#define QUEUE_FLAG_NONROT 12 /* non-rotational device (SSD) */
2008-10-27 16:44:46 +07:00
#define QUEUE_FLAG_VIRT QUEUE_FLAG_NONROT /* paravirt device */
2011-04-19 18:32:46 +07:00
#define QUEUE_FLAG_IO_STAT 13 /* do IO stats */
#define QUEUE_FLAG_DISCARD 14 /* supports DISCARD */
#define QUEUE_FLAG_NOXMERGES 15 /* No extended merges */
#define QUEUE_FLAG_ADD_RANDOM 16 /* Contributes to random pool */
2016-06-09 21:00:36 +07:00
#define QUEUE_FLAG_SECERASE 17 /* supports secure erase */
2011-07-24 01:44:25 +07:00
#define QUEUE_FLAG_SAME_FORCE 18 /* force complete on same CPU */
2012-12-06 20:32:01 +07:00
#define QUEUE_FLAG_DEAD 19 /* queue tear-down finished */
2013-10-24 15:20:05 +07:00
#define QUEUE_FLAG_INIT_DONE 20 /* queue is initialized */
2014-05-29 22:53:32 +07:00
#define QUEUE_FLAG_NO_SG_MERGE 21 /* don't attempt to merge SG segments */
2015-11-06 00:44:55 +07:00
#define QUEUE_FLAG_POLL 22 /* IO polling enabled if set */
2016-04-13 01:32:46 +07:00
#define QUEUE_FLAG_WC 23 /* Write back caching */
#define QUEUE_FLAG_FUA 24 /* device supports FUA writes */
2016-04-14 02:33:19 +07:00
#define QUEUE_FLAG_FLUSH_NQ 25 /* flush not queueable */
2016-06-24 04:05:50 +07:00
#define QUEUE_FLAG_DAX 26 /* device supports DAX */
2016-11-08 11:32:37 +07:00
#define QUEUE_FLAG_STATS 27 /* track rq completion times */
2017-04-08 01:45:20 +07:00
#define QUEUE_FLAG_POLL_STATS 28 /* collecting stats for hybrid polling */
#define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
2009-01-23 16:54:44 +07:00

#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
2009-09-04 01:06:47 +07:00
				 (1 << QUEUE_FLAG_STACKABLE) | \
2010-06-09 15:42:09 +07:00
				 (1 << QUEUE_FLAG_SAME_COMP) | \
				 (1 << QUEUE_FLAG_ADD_RANDOM))
2006-01-06 15:51:03 +07:00

2013-11-19 23:25:07 +07:00
#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
2014-12-17 00:54:25 +07:00
				 (1 << QUEUE_FLAG_STACKABLE) | \
2016-03-03 22:04:03 +07:00
				 (1 << QUEUE_FLAG_SAME_COMP) | \
				 (1 << QUEUE_FLAG_POLL))
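The QUEUE_FLAG_* values listed above are bit positions rather than masks, which is why both default sets are built with `1 << ...`. The following is a minimal userspace sketch of that relationship, not kernel code: it copies only the five bit numbers the two defaults use, and `flag_is_set()` is a hypothetical stand-in for the kernel's `test_bit()` operating on a plain `unsigned long`.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions copied from the defines above. */
#define QUEUE_FLAG_SAME_COMP	9
#define QUEUE_FLAG_STACKABLE	11
#define QUEUE_FLAG_IO_STAT	13
#define QUEUE_FLAG_ADD_RANDOM	16
#define QUEUE_FLAG_POLL		22

/* Default masks, composed exactly as in the header. */
#define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |	\
				 (1 << QUEUE_FLAG_STACKABLE) |	\
				 (1 << QUEUE_FLAG_SAME_COMP) |	\
				 (1 << QUEUE_FLAG_ADD_RANDOM))

#define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |	\
				 (1 << QUEUE_FLAG_STACKABLE) |	\
				 (1 << QUEUE_FLAG_SAME_COMP) |	\
				 (1 << QUEUE_FLAG_POLL))

/*
 * Hypothetical helper (an assumption for illustration, not the kernel's
 * test_bit()): reports whether bit number 'flag' is set in 'flags'.
 */
static bool flag_is_set(unsigned int flag, unsigned long flags)
{
	assert(flag < 8 * sizeof(flags));
	return (flags >> flag) & 1UL;
}
```

Under these assumptions the two defaults differ in exactly two bits: the legacy default sets ADD_RANDOM, while the blk-mq default sets POLL instead.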
2013-11-19 23:25:07 +07:00

2012-03-30 17:33:28 +07:00
static inline void queue_lockdep_assert_held(struct request_queue *q)
2008-04-30 00:16:38 +07:00
{
2012-03-30 17:33:28 +07:00
	if (q->queue_lock)
		lockdep_assert_held(q->queue_lock);
2008-04-30 00:16:38 +07:00
}

2008-04-29 19:48:33 +07:00
static inline void queue_flag_set_unlocked(unsigned int flag,
					   struct request_queue *q)
{
	__set_bit(flag, &q->queue_flags);
}

2008-07-03 18:18:54 +07:00
static inline int queue_flag_test_and_clear(unsigned int flag,
					    struct request_queue *q)
{
2012-03-30 17:33:28 +07:00
	queue_lockdep_assert_held(q);
2008-07-03 18:18:54 +07:00

	if (test_bit(flag, &q->queue_flags)) {
		__clear_bit(flag, &q->queue_flags);
		return 1;
	}

	return 0;
}

static inline int queue_flag_test_and_set(unsigned int flag,
					  struct request_queue *q)
{
2012-03-30 17:33:28 +07:00
	queue_lockdep_assert_held(q);
2008-07-03 18:18:54 +07:00

	if (!test_bit(flag, &q->queue_flags)) {
		__set_bit(flag, &q->queue_flags);
		return 0;
	}

	return 1;
}
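Both queue_flag_test_and_set() and queue_flag_test_and_clear() above return the state of the bit before the call (1 if it was set, 0 if it was clear), so a caller can tell whether it caused the transition. The sketch below mirrors that contract in userspace. It is an illustration under assumptions, not the kernel implementation: `struct fake_queue` and the `fake_flag_*` names are hypothetical, the lockdep assertion is omitted, and plain shifts replace `__set_bit()` and `__clear_bit()`. Like the originals, these helpers are not atomic and rely on the caller for serialization.

```c
#include <assert.h>

/* Hypothetical stand-in for struct request_queue: only the flags word. */
struct fake_queue {
	unsigned long queue_flags;
};

/* Set bit 'flag'; return 1 if it was already set, 0 if this call set it. */
static int fake_flag_test_and_set(unsigned int flag, struct fake_queue *q)
{
	unsigned long mask;

	assert(flag < 8 * sizeof(q->queue_flags));
	mask = 1UL << flag;

	if (!(q->queue_flags & mask)) {
		q->queue_flags |= mask;
		return 0;
	}

	return 1;
}

/* Clear bit 'flag'; return 1 if it was set before, 0 if already clear. */
static int fake_flag_test_and_clear(unsigned int flag, struct fake_queue *q)
{
	unsigned long mask;

	assert(flag < 8 * sizeof(q->queue_flags));
	mask = 1UL << flag;

	if (q->queue_flags & mask) {
		q->queue_flags &= ~mask;
		return 1;
	}

	return 0;
}
```

A caller that wants to act only on the first transition would check for a 0 return from the set variant, or a 1 return from the clear variant.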

2008-04-29 19:48:33 +07:00
static inline void queue_flag_set(unsigned int flag, struct request_queue *q)
{
2012-03-30 17:33:28 +07:00
	queue_lockdep_assert_held(q);
2008-04-29 19:48:33 +07:00
	__set_bit(flag, &q->queue_flags);
}

static inline void queue_flag_clear_unlocked(unsigned int flag,
					     struct request_queue *q)
{
	__clear_bit(flag, &q->queue_flags);
}

2009-05-20 13:54:31 +07:00
static inline int queue_in_flight(struct request_queue *q)
{
	return q->in_flight[0] + q->in_flight[1];
}

2008-04-29 19:48:33 +07:00
static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
{
2012-03-30 17:33:28 +07:00
	queue_lockdep_assert_held(q);
2008-04-29 19:48:33 +07:00
	__clear_bit(flag, &q->queue_flags);
}

2005-04-17 05:20:36 +07:00
#define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
2012-11-28 19:42:38 +07:00
#define blk_queue_dying(q) test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
2012-12-06 20:32:01 +07:00
#define blk_queue_dead(q) test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)
2012-03-06 04:14:58 +07:00
#define blk_queue_bypass(q) test_bit(QUEUE_FLAG_BYPASS, &(q)->queue_flags)
2013-10-24 15:20:05 +07:00
#define blk_queue_init_done(q) test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags)
2008-04-29 19:44:19 +07:00
#define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
2010-01-29 15:04:08 +07:00
#define blk_queue_noxmerges(q) \
	test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
2008-09-24 18:03:33 +07:00
#define blk_queue_nonrot(q) test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
2009-01-23 16:54:44 +07:00
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
2010-06-09 15:42:09 +07:00
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
block: add a queue flag for request stacking support
This patch adds a queue flag to indicate the block device can be
used for request stacking.
Request stacking drivers need to stack their devices on top of
only devices of which q->request_fn is functional.
Since bio stacking drivers (e.g. md, loop) basically initialize
their queue using blk_alloc_queue() and don't set q->request_fn,
the check of (q->request_fn == NULL) looks enough for that purpose.
However, dm will become both types of stacking driver (bio-based and
request-based). And dm will always set q->request_fn even if the dm
device is bio-based of which q->request_fn is not functional actually.
So we need something else to distinguish the type of the device.
Adding a queue flag is a solution for that.
The reason why dm always sets q->request_fn is to keep
the compatibility of dm user-space tools.
Currently, all dm user-space tools are using bio-based dm without
specifying the type of the dm device they use.
To use request-based dm without changing such tools, the kernel
must decide the type of the dm device automatically.
The automatic type decision can't be done at the device creation time
and needs to be deferred until such tools load a mapping table,
since the actual type is decided by dm target type included in
the mapping table.
So a dm device has to be initialized using blk_init_queue()
so that we can load either type of table.
Then, all queue stuffs are set (e.g. q->request_fn) and we have
no element to distinguish that it is bio-based or request-based,
even after a table is loaded and the type of the device is decided.
By the way, some stuffs of the queue (e.g. request_list, elevator)
are needless when the dm device is used as bio-based.
But the memory size is not so large (about 20[KB] per queue on ia64),
so I hope the memory loss can be acceptable for bio-based dm users.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-09-18 21:46:13 +07:00
|
|
|
#define blk_queue_stackable(q) \
|
|
|
|
test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)
|
2009-09-30 18:52:12 +07:00
|
|
|
#define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
|
2016-06-09 21:00:36 +07:00
|
|
|
#define blk_queue_secure_erase(q) \
|
|
|
|
(test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags))
|
2016-06-24 04:05:50 +07:00
|
|
|
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-08-07 23:17:56 +07:00
|
|
|
#define blk_noretry_request(rq) \
|
|
|
|
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
|
|
|
|
REQ_FAILFAST_DRIVER))
|
|
|
|
|
2017-01-31 22:57:29 +07:00
|
|
|
static inline bool blk_account_rq(struct request *rq)
|
|
|
|
{
|
|
|
|
return (rq->rq_flags & RQF_STARTED) && !blk_rq_is_passthrough(rq);
|
|
|
|
}
|
2010-08-07 23:17:56 +07:00
|
|
|
|
2008-08-26 15:25:02 +07:00
|
|
|
#define blk_rq_cpu_valid(rq) ((rq)->cpu != -1)
|
2007-07-16 13:52:14 +07:00
|
|
|
#define blk_bidi_rq(rq) ((rq)->next_rq != NULL)
|
2007-12-12 05:40:30 +07:00
|
|
|
/* rq->queuelist of dequeued request must be list_empty() */
|
|
|
|
#define blk_queued_rq(rq) (!list_empty(&(rq)->queuelist))
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
#define list_entry_rq(ptr) list_entry((ptr), struct request, queuelist)
|
|
|
|
|
2016-06-06 02:32:22 +07:00
|
|
|
#define rq_data_dir(rq) (op_is_write(req_op(rq)) ? WRITE : READ)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-04-16 23:57:18 +07:00
|
|
|
/*
|
|
|
|
* Driver can handle struct request, if it either has an old style
|
|
|
|
* request_fn defined, or is blk-mq based.
|
|
|
|
*/
|
|
|
|
static inline bool queue_is_rq_based(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->request_fn || q->mq_ops;
|
|
|
|
}
|
|
|
|
|
2010-12-02 01:41:49 +07:00
|
|
|
static inline unsigned int blk_queue_cluster(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.cluster;
|
|
|
|
}
|
|
|
|
|
2016-10-18 13:40:29 +07:00
|
|
|
static inline enum blk_zoned_model
|
|
|
|
blk_queue_zoned_model(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.zoned;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_queue_is_zoned(struct request_queue *q)
|
|
|
|
{
|
|
|
|
switch (blk_queue_zoned_model(q)) {
|
|
|
|
case BLK_ZONED_HA:
|
|
|
|
case BLK_ZONED_HM:
|
|
|
|
return true;
|
|
|
|
default:
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-01-12 21:58:32 +07:00
|
|
|
static inline unsigned int blk_queue_zone_sectors(struct request_queue *q)
|
2016-10-18 13:40:33 +07:00
|
|
|
{
|
|
|
|
return blk_queue_is_zoned(q) ? q->limits.chunk_sectors : 0;
|
|
|
|
}
|
|
|
|
|
2009-04-06 19:48:01 +07:00
|
|
|
static inline bool rq_is_sync(struct request *rq)
|
|
|
|
{
|
2016-10-28 21:48:16 +07:00
|
|
|
return op_is_sync(rq->cmd_flags);
|
2009-04-06 19:48:01 +07:00
|
|
|
}
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
static inline bool blk_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
return rl->flags & flag;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
static inline void blk_set_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
rl->flags |= flag;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
static inline void blk_clear_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
rl->flags &= ~flag;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
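The three request-list helpers above share one pattern: pick a flag bit from `sync`, then test, set, or clear it. A minimal user-space sketch of that pattern follows. The struct `rl_sketch`, the `RL_*` bit values, and the function names are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for BLK_RL_SYNCFULL / BLK_RL_ASYNCFULL. */
#define RL_SYNCFULL	(1U << 0)
#define RL_ASYNCFULL	(1U << 1)

/* Hypothetical stand-in for struct request_list; only the flags word. */
struct rl_sketch {
	unsigned int flags;
};

/* Select the bit that tracks "full" for the sync or async direction. */
static unsigned int rl_flag(bool sync)
{
	return sync ? RL_SYNCFULL : RL_ASYNCFULL;
}

static bool rl_full(const struct rl_sketch *rl, bool sync)
{
	return rl->flags & rl_flag(sync);
}

static void rl_set_full(struct rl_sketch *rl, bool sync)
{
	rl->flags |= rl_flag(sync);
}

static void rl_clear_full(struct rl_sketch *rl, bool sync)
{
	rl->flags &= ~rl_flag(sync);
}

/* Small self-check: the sync and async bits are independent. */
static void rl_flags_example(void)
{
	struct rl_sketch rl = { 0 };

	rl_set_full(&rl, true);
	assert(rl_full(&rl, true) && !rl_full(&rl, false));
	rl_clear_full(&rl, true);
	assert(rl.flags == 0);
}
```

Calling `rl_flags_example()` from a test program exercises all three helpers. Keeping the two directions in separate bits lets one direction be marked full without touching the other.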
|
|
|
|
|
2012-09-18 23:19:25 +07:00
|
|
|
static inline bool rq_mergeable(struct request *rq)
|
|
|
|
{
|
2017-01-31 22:57:29 +07:00
|
|
|
if (blk_rq_is_passthrough(rq))
|
2012-09-18 23:19:25 +07:00
|
|
|
return false;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2016-06-06 02:32:23 +07:00
|
|
|
if (req_op(rq) == REQ_OP_FLUSH)
|
|
|
|
return false;
|
|
|
|
|
2016-12-01 03:28:59 +07:00
|
|
|
if (req_op(rq) == REQ_OP_WRITE_ZEROES)
|
|
|
|
return false;
|
|
|
|
|
2012-09-18 23:19:25 +07:00
|
|
|
if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
|
2016-10-20 20:12:13 +07:00
|
|
|
return false;
|
|
|
|
if (rq->rq_flags & RQF_NOMERGE_FLAGS)
|
2012-09-18 23:19:25 +07:00
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-09-18 23:19:27 +07:00
|
|
|
static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
|
|
|
|
{
|
|
|
|
if (bio_data(a) == bio_data(b))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2016-03-30 23:21:08 +07:00
|
|
|
static inline unsigned int blk_queue_depth(struct request_queue *q)
|
|
|
|
{
|
|
|
|
if (q->queue_depth)
|
|
|
|
return q->queue_depth;
|
|
|
|
|
|
|
|
return q->nr_requests;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* q->prep_rq_fn return values
|
|
|
|
*/
|
2016-02-04 12:52:12 +07:00
|
|
|
enum {
|
|
|
|
BLKPREP_OK, /* serve it */
|
|
|
|
BLKPREP_KILL, /* fatal error, kill, return -EIO */
|
|
|
|
BLKPREP_DEFER, /* leave on queue */
|
|
|
|
BLKPREP_INVALID, /* invalid command, kill, return -EREMOTEIO */
|
|
|
|
};
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
extern unsigned long blk_max_low_pfn, blk_max_pfn;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* standard bounce addresses:
|
|
|
|
*
|
|
|
|
* BLK_BOUNCE_HIGH : bounce all highmem pages
|
|
|
|
* BLK_BOUNCE_ANY : don't bounce anything
|
|
|
|
* BLK_BOUNCE_ISA : bounce pages above ISA DMA boundary
|
|
|
|
*/
|
2008-04-21 14:51:05 +07:00
|
|
|
|
|
|
|
#if BITS_PER_LONG == 32
|
2005-04-17 05:20:36 +07:00
|
|
|
#define BLK_BOUNCE_HIGH ((u64)blk_max_low_pfn << PAGE_SHIFT)
|
2008-04-21 14:51:05 +07:00
|
|
|
#else
|
|
|
|
#define BLK_BOUNCE_HIGH (-1ULL)
|
|
|
|
#endif
|
|
|
|
#define BLK_BOUNCE_ANY (-1ULL)
|
2010-05-31 13:59:03 +07:00
|
|
|
#define BLK_BOUNCE_ISA (DMA_BIT_MASK(24))
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-09 17:38:05 +07:00
|
|
|
/*
|
|
|
|
* default timeout for SG_IO if none specified
|
|
|
|
*/
|
|
|
|
#define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
|
2008-12-06 05:49:18 +07:00
|
|
|
#define BLK_MIN_SG_TIMEOUT (7 * HZ)
|
2007-07-09 17:38:05 +07:00
|
|
|
|
2007-07-17 18:03:37 +07:00
|
|
|
#ifdef CONFIG_BOUNCE
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int init_emergency_isa_pool(void);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
|
2005-04-17 05:20:36 +07:00
|
|
|
#else
|
|
|
|
static inline int init_emergency_isa_pool(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2007-07-24 14:28:11 +07:00
|
|
|
static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_BOUNCE */
|
|
|
|
|
2008-08-28 14:17:06 +07:00
|
|
|
struct rq_map_data {
|
|
|
|
struct page **pages;
|
|
|
|
int page_order;
|
|
|
|
int nr_entries;
|
2008-12-18 12:49:37 +07:00
|
|
|
unsigned long offset;
|
2008-12-18 12:49:38 +07:00
|
|
|
int null_mapped;
|
2009-07-09 19:46:53 +07:00
|
|
|
int from_user;
|
2008-08-28 14:17:06 +07:00
|
|
|
};
|
|
|
|
|
2007-09-25 17:35:59 +07:00
|
|
|
struct req_iterator {
|
2013-11-24 08:19:00 +07:00
|
|
|
struct bvec_iter iter;
|
2007-09-25 17:35:59 +07:00
|
|
|
struct bio *bio;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* This should not be used directly - use rq_for_each_segment */
|
2009-02-23 15:03:10 +07:00
|
|
|
#define for_each_bio(_bio) \
|
|
|
|
for (; _bio; _bio = _bio->bi_next)
|
2007-09-25 17:35:59 +07:00
|
|
|
#define __rq_for_each_bio(_bio, rq) \
|
2005-04-17 05:20:36 +07:00
|
|
|
if ((rq)->bio) \
|
|
|
|
for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)
|
|
|
|
|
2007-09-25 17:35:59 +07:00
|
|
|
#define rq_for_each_segment(bvl, _rq, _iter) \
|
|
|
|
__rq_for_each_bio(_iter.bio, _rq) \
|
2013-11-24 08:19:00 +07:00
|
|
|
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
|
2007-09-25 17:35:59 +07:00
|
|
|
|
2013-08-08 04:26:21 +07:00
|
|
|
#define rq_iter_last(bvec, _iter) \
|
2013-11-24 08:19:00 +07:00
|
|
|
(_iter.bio->bi_next == NULL && \
|
2013-08-08 04:26:21 +07:00
|
|
|
bio_iter_last(bvec, _iter.iter))
|
2007-09-25 17:35:59 +07:00
|
|
|
|
2009-11-26 15:16:19 +07:00
|
|
|
#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
|
|
|
|
# error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform"
|
|
|
|
#endif
|
|
|
|
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
|
|
|
|
extern void rq_flush_dcache_pages(struct request *rq);
|
|
|
|
#else
|
|
|
|
static inline void rq_flush_dcache_pages(struct request *rq)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2016-05-10 23:23:52 +07:00
|
|
|
#ifdef CONFIG_PRINTK
|
|
|
|
#define vfs_msg(sb, level, fmt, ...) \
|
|
|
|
__vfs_msg(sb, level, fmt, ##__VA_ARGS__)
|
|
|
|
#else
|
|
|
|
#define vfs_msg(sb, level, fmt, ...) \
|
|
|
|
do { \
|
|
|
|
no_printk(fmt, ##__VA_ARGS__); \
|
|
|
|
__vfs_msg(sb, "", " "); \
|
|
|
|
} while (0)
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int blk_register_queue(struct gendisk *disk);
|
|
|
|
extern void blk_unregister_queue(struct gendisk *disk);
|
2015-11-06 00:41:16 +07:00
|
|
|
extern blk_qc_t generic_make_request(struct bio *bio);
|
2008-04-29 14:54:36 +07:00
|
|
|
extern void blk_rq_init(struct request_queue *q, struct request *rq);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void blk_put_request(struct request *);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void __blk_put_request(struct request_queue *, struct request *);
|
|
|
|
extern struct request *blk_get_request(struct request_queue *, int, gfp_t);
|
|
|
|
extern void blk_requeue_request(struct request_queue *, struct request *);
|
2008-10-01 21:12:15 +07:00
|
|
|
extern int blk_lld_busy(struct request_queue *q);
|
2015-06-26 21:01:13 +07:00
|
|
|
extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
|
|
|
|
struct bio_set *bs, gfp_t gfp_mask,
|
|
|
|
int (*bio_ctr)(struct bio *, struct bio *, void *),
|
|
|
|
void *data);
|
|
|
|
extern void blk_rq_unprep_clone(struct request *rq);
|
2008-09-18 21:45:38 +07:00
|
|
|
extern int blk_insert_cloned_request(struct request_queue *q,
|
|
|
|
struct request *rq);
|
2016-07-19 16:31:51 +07:00
|
|
|
extern int blk_rq_append_bio(struct request *rq, struct bio *bio);
|
2011-03-02 23:08:00 +07:00
|
|
|
extern void blk_delay_queue(struct request_queue *, unsigned long);
|
2015-04-24 12:37:18 +07:00
|
|
|
extern void blk_queue_split(struct request_queue *, struct bio **,
|
|
|
|
struct bio_set *);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_recount_segments(struct request_queue *, struct bio *);
|
2012-01-12 22:01:28 +07:00
|
|
|
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
|
2012-01-12 22:01:27 +07:00
|
|
|
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
|
|
|
|
unsigned int, void __user *);
|
2007-08-28 02:38:10 +07:00
|
|
|
extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|
|
|
|
unsigned int, void __user *);
|
2008-09-03 04:16:41 +07:00
|
|
|
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|
|
|
|
struct scsi_ioctl_command __user *);
|
2006-10-20 13:28:16 +07:00
|
|
|
|
2015-11-26 15:13:05 +07:00
|
|
|
extern int blk_queue_enter(struct request_queue *q, bool nowait);
|
2015-11-20 04:29:28 +07:00
|
|
|
extern void blk_queue_exit(struct request_queue *q);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_start_queue(struct request_queue *q);
|
2015-12-29 03:01:22 +07:00
|
|
|
extern void blk_start_queue_async(struct request_queue *q);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_stop_queue(struct request_queue *q);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void blk_sync_queue(struct request_queue *q);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void __blk_stop_queue(struct request_queue *q);
|
2011-04-18 16:41:33 +07:00
|
|
|
extern void __blk_run_queue(struct request_queue *q);
|
2015-04-18 03:37:20 +07:00
|
|
|
extern void __blk_run_queue_uncond(struct request_queue *q);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_run_queue(struct request_queue *);
|
2011-04-19 18:32:46 +07:00
|
|
|
extern void blk_run_queue_async(struct request_queue *q);
|
2016-11-02 23:09:51 +07:00
|
|
|
extern void blk_mq_quiesce_queue(struct request_queue *q);
|
2008-08-28 14:17:05 +07:00
|
|
|
extern int blk_rq_map_user(struct request_queue *, struct request *,
|
2008-08-28 14:17:06 +07:00
|
|
|
struct rq_map_data *, void __user *, unsigned long,
|
|
|
|
gfp_t);
|
2006-12-19 17:12:46 +07:00
|
|
|
extern int blk_rq_unmap_user(struct bio *);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
|
|
|
|
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
|
2015-01-18 22:16:31 +07:00
|
|
|
struct rq_map_data *, const struct iov_iter *,
|
|
|
|
gfp_t);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern int blk_execute_rq(struct request_queue *, struct gendisk *,
|
2005-06-20 19:11:09 +07:00
|
|
|
struct request *, int);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
|
2006-01-06 16:00:50 +07:00
|
|
|
struct request *, int, rq_end_io_fn *);
|
2005-11-11 18:30:24 +07:00
|
|
|
|
2016-11-04 22:34:34 +07:00
|
|
|
bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
|
2015-11-06 00:44:55 +07:00
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2014-09-08 06:03:56 +07:00
|
|
|
return bdev->bd_disk->queue; /* this is never NULL */
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
/*
|
2009-07-03 15:48:17 +07:00
|
|
|
* blk_rq_pos() : the current sector
|
|
|
|
* blk_rq_bytes() : bytes left in the entire request
|
|
|
|
* blk_rq_cur_bytes() : bytes left in the current segment
|
|
|
|
* blk_rq_err_bytes() : bytes left till the next error boundary
|
|
|
|
* blk_rq_sectors() : sectors left in the entire request
|
|
|
|
* blk_rq_cur_sectors() : sectors left in the current segment
|
2009-04-23 09:05:18 +07:00
|
|
|
*/
|
2009-05-07 20:24:38 +07:00
|
|
|
static inline sector_t blk_rq_pos(const struct request *rq)
|
|
|
|
{
|
2009-05-07 20:24:44 +07:00
|
|
|
return rq->__sector;
|
2009-05-07 20:24:41 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blk_rq_bytes(const struct request *rq)
|
|
|
|
{
|
2009-05-07 20:24:44 +07:00
|
|
|
return rq->__data_len;
|
2009-05-07 20:24:38 +07:00
|
|
|
}
|
|
|
|
|
2009-05-07 20:24:41 +07:00
|
|
|
static inline int blk_rq_cur_bytes(const struct request *rq)
|
|
|
|
{
|
|
|
|
return rq->bio ? bio_cur_bytes(rq->bio) : 0;
|
|
|
|
}
|
2009-04-23 09:05:18 +07:00
|
|
|
|
2009-07-03 15:48:17 +07:00
|
|
|
extern unsigned int blk_rq_err_bytes(const struct request *rq);
|
|
|
|
|
2009-05-07 20:24:38 +07:00
|
|
|
static inline unsigned int blk_rq_sectors(const struct request *rq)
|
|
|
|
{
|
2009-05-07 20:24:41 +07:00
|
|
|
return blk_rq_bytes(rq) >> 9;
|
2009-05-07 20:24:38 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
|
|
|
|
{
|
2009-05-07 20:24:41 +07:00
|
|
|
return blk_rq_cur_bytes(rq) >> 9;
|
2009-05-07 20:24:38 +07:00
|
|
|
}
|
|
|
|
|
2017-01-13 18:29:10 +07:00
|
|
|
/*
|
|
|
|
* Some commands like WRITE SAME have a payload or data transfer size which
|
|
|
|
* is different from the size of the request. Any driver that supports such
|
|
|
|
* commands using the RQF_SPECIAL_PAYLOAD flag needs to use this helper to
|
|
|
|
* calculate the data transfer size.
|
|
|
|
*/
|
|
|
|
static inline unsigned int blk_rq_payload_bytes(struct request *rq)
|
|
|
|
{
|
|
|
|
if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
|
|
|
|
return rq->special_vec.bv_len;
|
|
|
|
return blk_rq_bytes(rq);
|
|
|
|
}
|
|
|
|
|
2012-09-18 23:19:26 +07:00
|
|
|
static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
|
2016-06-06 02:32:15 +07:00
|
|
|
int op)
|
2012-09-18 23:19:26 +07:00
|
|
|
{
|
2016-08-16 14:59:35 +07:00
|
|
|
if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
|
block: fix max discard sectors limit
linux-v3.8-rc1 and later support for plug for blkdev_issue_discard with
commit 0cfbcafcae8b7364b5fa96c2b26ccde7a3a296a9
(block: add plug for blkdev_issue_discard )
For example,
1) DISCARD rq-1 with size 4GB
2) DISCARD rq-2 with size 1GB
If these 2 discard requests get merged, final request size will be 5GB.
In this case, request's __data_len field may overflow as it can store
max 4GB(unsigned int).
This issue was observed while doing mkfs.f2fs on 5GB SD card:
https://lkml.org/lkml/2013/4/1/292
Info: sector size = 512
Info: total sectors = 11370496 (in 512bytes)
Info: zone aligned segment0 blkaddr: 512
[ 257.789764] blk_update_request: bio idx 0 >= vcnt 0
mkfs process gets stuck in D state and I see the following in the dmesg:
[ 257.789733] __end_that: dev mmcblk0: type=1, flags=122c8081
[ 257.789764] sector 4194304, nr/cnr 2981888/4294959104
[ 257.789764] bio df3840c0, biotail df3848c0, buffer (null), len
1526726656
[ 257.789764] blk_update_request: bio idx 0 >= vcnt 0
[ 257.794921] request botched: dev mmcblk0: type=1, flags=122c8081
[ 257.794921] sector 4194304, nr/cnr 2981888/4294959104
[ 257.794921] bio df3840c0, biotail df3848c0, buffer (null), len
1526726656
This patch fixes this issue.
Reported-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Tested-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-24 21:52:50 +07:00
|
|
|
return min(q->limits.max_discard_sectors, UINT_MAX >> 9);
|
2012-09-18 23:19:26 +07:00
|
|
|
|
2016-06-06 02:32:15 +07:00
|
|
|
if (unlikely(op == REQ_OP_WRITE_SAME))
|
2012-09-18 23:19:27 +07:00
|
|
|
return q->limits.max_write_same_sectors;
|
|
|
|
|
2016-12-01 03:28:59 +07:00
|
|
|
if (unlikely(op == REQ_OP_WRITE_ZEROES))
|
|
|
|
return q->limits.max_write_zeroes_sectors;
|
|
|
|
|
2012-09-18 23:19:26 +07:00
|
|
|
return q->limits.max_sectors;
|
|
|
|
}
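The discard clamp in the function above follows from the overflow described in the commit message: a request length kept in a 32-bit byte counter wraps once merged discards exceed 4 GiB. The sketch below reproduces that arithmetic in user space. It assumes a 32-bit length field (modelled with `uint32_t`) and 512-byte sectors; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the ">> 9" above */

/* Model a 32-bit byte-length field: the sum is truncated modulo 2^32. */
static uint32_t merged_len_bytes(uint64_t a_bytes, uint64_t b_bytes)
{
	return (uint32_t)(a_bytes + b_bytes);
}

/* Cap a discard limit so that (sectors << 9) still fits in 32 bits. */
static uint32_t clamp_discard_sectors(uint32_t max_discard_sectors)
{
	uint32_t cap = UINT32_MAX >> SECTOR_SHIFT;

	return max_discard_sectors < cap ? max_discard_sectors : cap;
}

/* Worked example: 4 GiB + 1 GiB wraps to 1 GiB without the clamp. */
static void discard_overflow_example(void)
{
	uint64_t four_gib = 4ULL << 30;
	uint64_t one_gib = 1ULL << 30;

	assert(merged_len_bytes(four_gib, one_gib) == (uint32_t)one_gib);

	/* With the clamp, the largest request is just under 4 GiB. */
	assert(((uint64_t)clamp_discard_sectors(UINT32_MAX) << SECTOR_SHIFT)
	       <= UINT32_MAX);
}
```

The cap works out to 8388607 sectors, so a request at the limit is 4294966784 bytes, which still fits in the 32-bit field.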
|
|
|
|
|
2014-06-06 02:38:39 +07:00
|
|
|
/*
|
|
|
|
* Return maximum size of a request at given offset. Only valid for
|
|
|
|
* file system requests.
|
|
|
|
*/
|
|
|
|
static inline unsigned int blk_max_size_offset(struct request_queue *q,
|
|
|
|
sector_t offset)
|
|
|
|
{
|
|
|
|
if (!q->limits.chunk_sectors)
|
2014-06-18 12:09:29 +07:00
|
|
|
return q->limits.max_sectors;
|
2014-06-06 02:38:39 +07:00
|
|
|
|
|
|
|
return q->limits.chunk_sectors -
|
|
|
|
(offset & (q->limits.chunk_sectors - 1));
|
|
|
|
}
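The mask expression in `blk_max_size_offset()` gives the number of sectors left before the next chunk boundary, and it is only correct when `chunk_sectors` is a power of two. The following user-space sketch works through that arithmetic. The function name and plain-integer parameters are assumptions for illustration and stand in for the `q->limits` fields.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sectors available at 'offset' before crossing a chunk boundary.
 * chunk_sectors == 0 means "no chunking": fall back to max_sectors.
 * The mask trick requires chunk_sectors to be a power of two.
 */
static unsigned int max_size_at_offset(unsigned int chunk_sectors,
				       unsigned int max_sectors,
				       uint64_t offset)
{
	if (!chunk_sectors)
		return max_sectors;

	/* Power-of-two check: exactly one bit set. */
	assert((chunk_sectors & (chunk_sectors - 1)) == 0);

	/* offset & (chunk - 1) is the position inside the current chunk. */
	return chunk_sectors - (unsigned int)(offset & (chunk_sectors - 1));
}
```

With 256-sector chunks, an I/O starting at sector 1000 sits 232 sectors into its chunk, so only 24 sectors remain before the boundary. An I/O starting exactly on a boundary, such as sector 1024, may span the full 256 sectors.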
|
|
|
|
|
2016-07-21 10:40:47 +07:00
|
|
|
static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
|
|
|
|
sector_t offset)
|
2012-09-18 23:19:26 +07:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2017-01-31 22:57:29 +07:00
|
|
|
if (blk_rq_is_passthrough(rq))
|
2012-09-18 23:19:26 +07:00
|
|
|
return q->limits.max_hw_sectors;
|
|
|
|
|
2016-08-16 14:59:35 +07:00
|
|
|
if (!q->limits.chunk_sectors ||
|
|
|
|
req_op(rq) == REQ_OP_DISCARD ||
|
|
|
|
req_op(rq) == REQ_OP_SECURE_ERASE)
|
2016-06-06 02:32:15 +07:00
|
|
|
return blk_queue_get_max_sectors(q, req_op(rq));
|
2014-06-06 02:38:39 +07:00
|
|
|
|
2016-07-21 10:40:47 +07:00
|
|
|
return min(blk_max_size_offset(q, offset),
|
2016-06-06 02:32:15 +07:00
|
|
|
blk_queue_get_max_sectors(q, req_op(rq)));
|
2012-09-18 23:19:26 +07:00
|
|
|
}
|
|
|
|
|
2013-09-22 02:57:47 +07:00
|
|
|
static inline unsigned int blk_rq_count_bios(struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned int nr_bios = 0;
|
|
|
|
struct bio *bio;
|
|
|
|
|
|
|
|
__rq_for_each_bio(bio, rq)
|
|
|
|
nr_bios++;
|
|
|
|
|
|
|
|
return nr_bios;
|
|
|
|
}
|
|
|
|
|
2016-10-18 01:27:28 +07:00
|
|
|
/*
|
|
|
|
* blk_rq_set_prio - associate a request with prio from ioc
|
|
|
|
* @rq: request of interest
|
|
|
|
* @ioc: target iocontext
|
|
|
|
*
|
|
|
|
* Associate request prio with ioc prio so request based drivers
|
|
|
|
* can leverage priority information.
|
|
|
|
*/
|
|
|
|
static inline void blk_rq_set_prio(struct request *rq, struct io_context *ioc)
|
|
|
|
{
|
|
|
|
if (ioc)
|
|
|
|
rq->ioprio = ioc->ioprio;
|
|
|
|
}
|
|
|
|
|
2009-05-08 09:54:16 +07:00
|
|
|
/*
|
|
|
|
* Request issue related functions.
|
|
|
|
*/
|
|
|
|
extern struct request *blk_peek_request(struct request_queue *q);
|
|
|
|
extern void blk_start_request(struct request *rq);
|
|
|
|
extern struct request *blk_fetch_request(struct request_queue *q);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2009-04-23 09:05:18 +07:00
|
|
|
* Request completion related functions.
|
|
|
|
*
|
|
|
|
* blk_update_request() completes given number of bytes and updates
|
|
|
|
* the request without completing it.
|
|
|
|
*
|
2009-04-23 09:05:19 +07:00
|
|
|
* blk_end_request() and friends. __blk_end_request() must be called
|
|
|
|
* with the request queue spinlock acquired.
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Several drivers define their own end_request and call
|
2007-12-12 05:52:28 +07:00
|
|
|
* blk_end_request() for parts of the original function.
|
|
|
|
* This prevents code duplication in drivers.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-04-23 09:05:18 +07:00
|
|
|
extern bool blk_update_request(struct request *rq, int error,
|
|
|
|
unsigned int nr_bytes);
|
2014-04-16 14:44:59 +07:00
|
|
|
extern void blk_finish_request(struct request *rq, int error);
|
2009-05-11 15:56:09 +07:00
|
|
|
extern bool blk_end_request(struct request *rq, int error,
|
|
|
|
unsigned int nr_bytes);
|
|
|
|
extern void blk_end_request_all(struct request *rq, int error);
|
|
|
|
extern bool blk_end_request_cur(struct request *rq, int error);
|
2009-07-03 15:48:17 +07:00
|
|
|
extern bool blk_end_request_err(struct request *rq, int error);
|
2009-05-11 15:56:09 +07:00
|
|
|
extern bool __blk_end_request(struct request *rq, int error,
|
|
|
|
unsigned int nr_bytes);
|
|
|
|
extern void __blk_end_request_all(struct request *rq, int error);
|
|
|
|
extern bool __blk_end_request_cur(struct request *rq, int error);
|
2009-07-03 15:48:17 +07:00
|
|
|
extern bool __blk_end_request_err(struct request *rq, int error);
|
2009-04-23 09:05:18 +07:00
|
|
|
|
2006-01-09 22:02:34 +07:00
|
|
|
extern void blk_complete_request(struct request *);
|
2008-09-14 19:55:09 +07:00
|
|
|
extern void __blk_complete_request(struct request *);
|
|
|
|
extern void blk_abort_request(struct request *);
|
2010-07-01 17:49:17 +07:00
|
|
|
extern void blk_unprep_request(struct request *);
|
2006-01-09 22:02:34 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Access functions for manipulating queue properties
|
|
|
|
*/
|
2007-07-24 14:28:11 +07:00
|
|
|
extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
|
2005-06-23 14:08:19 +07:00
|
|
|
spinlock_t *lock, int node_id);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
|
2017-01-03 18:52:44 +07:00
|
|
|
extern int blk_init_allocated_queue(struct request_queue *);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_cleanup_queue(struct request_queue *);
|
|
|
|
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
|
|
|
|
extern void blk_queue_bounce_limit(struct request_queue *, u64);
|
2010-02-26 12:20:38 +07:00
|
|
|
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
|
2014-06-06 02:38:39 +07:00
|
|
|
extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
|
2010-02-26 12:20:39 +07:00
|
|
|
extern void blk_queue_max_segments(struct request_queue *, unsigned short);
|
2017-02-08 20:46:49 +07:00
|
|
|
extern void blk_queue_max_discard_segments(struct request_queue *,
|
|
|
|
unsigned short);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
|
2009-09-30 18:54:20 +07:00
|
|
|
extern void blk_queue_max_discard_sectors(struct request_queue *q,
|
|
|
|
unsigned int max_discard_sectors);
|
2012-09-18 23:19:27 +07:00
|
|
|
extern void blk_queue_max_write_same_sectors(struct request_queue *q,
|
|
|
|
unsigned int max_write_same_sectors);
|
2016-12-01 03:28:59 +07:00
|
|
|
extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
|
|
|
|
unsigned int max_write_zeroes_sectors);
|
2009-05-23 04:17:49 +07:00
|
|
|
extern void blk_queue_logical_block_size(struct request_queue *, unsigned short);
|
2010-10-14 02:18:03 +07:00
|
|
|
extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern void blk_queue_alignment_offset(struct request_queue *q,
|
|
|
|
unsigned int alignment);
|
2009-07-31 22:49:11 +07:00
|
|
|
extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
|
2009-09-12 02:54:52 +07:00
|
|
|
extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
|
2016-03-30 23:21:08 +07:00
|
|
|
extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
|
2009-06-16 13:23:52 +07:00
|
|
|
extern void blk_set_default_limits(struct queue_limits *lim);
|
2012-01-11 22:27:11 +07:00
|
|
|
extern void blk_set_stacking_limits(struct queue_limits *lim);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
|
|
|
|
sector_t offset);
|
2010-01-11 15:21:49 +07:00
|
|
|
extern int bdev_stack_limits(struct queue_limits *t, struct block_device *bdev,
|
|
|
|
sector_t offset);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
|
|
|
|
sector_t offset);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b);
|
2008-03-04 17:18:17 +07:00
|
|
|
extern void blk_queue_dma_pad(struct request_queue *, unsigned int);
|
2008-07-04 14:30:03 +07:00
|
|
|
extern void blk_queue_update_dma_pad(struct request_queue *, unsigned int);
|
2008-02-19 17:36:53 +07:00
|
|
|
extern int blk_queue_dma_drain(struct request_queue *q,
|
|
|
|
dma_drain_needed_fn *dma_drain_needed,
|
|
|
|
void *buf, unsigned int size);
|
2008-10-01 21:12:15 +07:00
|
|
|
extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
|
2015-08-20 04:24:05 +07:00
|
|
|
extern void blk_queue_virt_boundary(struct request_queue *, unsigned long);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
|
2010-07-01 17:49:17 +07:00
|
|
|
extern void blk_queue_unprep_rq(struct request_queue *, unprep_rq_fn *ufn);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_dma_alignment(struct request_queue *, int);
|
2008-01-01 05:37:00 +07:00
|
|
|
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
|
2008-09-14 19:55:09 +07:00
|
|
|
extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
|
|
|
|
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
|
2011-05-07 00:34:32 +07:00
|
|
|
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
|
2016-04-13 01:32:46 +07:00
|
|
|
extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2017-02-08 20:46:49 +07:00
|
|
|
/*
|
|
|
|
* Number of physical segments as sent to the device.
|
|
|
|
*
|
|
|
|
* Normally this is the number of discontiguous data segments sent by the
|
|
|
|
* submitter. But for data-less commands like discard we might have no
|
|
|
|
* actual data segments submitted, but the driver might have to add its
|
|
|
|
* own special payload. In that case we still return 1 here so that this
|
|
|
|
* special payload will be mapped.
|
|
|
|
*/
|
2016-12-09 05:20:32 +07:00
|
|
|
static inline unsigned short blk_rq_nr_phys_segments(struct request *rq)
|
|
|
|
{
|
|
|
|
if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
|
|
|
|
return 1;
|
|
|
|
return rq->nr_phys_segments;
|
|
|
|
}
|
|
|
|
|
2017-02-08 20:46:49 +07:00
|
|
|
/*
|
|
|
|
* Number of discard segments (or ranges) the driver needs to fill in.
|
|
|
|
* Each discard bio merged into a request is counted as one segment.
|
|
|
|
*/
|
|
|
|
static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
|
|
|
|
{
|
|
|
|
return max_t(unsigned short, rq->nr_phys_segments, 1);
|
|
|
|
}
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void blk_dump_rq_flags(struct request *, char *);
|
|
|
|
extern long nr_blockdev_pages(void);
|
|
|
|
|
2011-12-14 06:33:38 +07:00
|
|
|
bool __must_check blk_get_queue(struct request_queue *);
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *blk_alloc_queue(gfp_t);
|
|
|
|
struct request_queue *blk_alloc_queue_node(gfp_t, int);
|
|
|
|
extern void blk_put_queue(struct request_queue *);
|
2015-06-05 23:57:37 +07:00
|
|
|
extern void blk_set_queue_dying(struct request_queue *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2013-03-23 10:42:26 +07:00
|
|
|
/*
|
|
|
|
* block layer runtime pm functions
|
|
|
|
*/
|
2014-12-04 07:00:23 +07:00
|
|
|
#ifdef CONFIG_PM
|
2013-03-23 10:42:26 +07:00
|
|
|
extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
|
|
|
|
extern int blk_pre_runtime_suspend(struct request_queue *q);
|
|
|
|
extern void blk_post_runtime_suspend(struct request_queue *q, int err);
|
|
|
|
extern void blk_pre_runtime_resume(struct request_queue *q);
|
|
|
|
extern void blk_post_runtime_resume(struct request_queue *q, int err);
|
2016-02-18 15:54:11 +07:00
|
|
|
extern void blk_set_runtime_active(struct request_queue *q);
|
2013-03-23 10:42:26 +07:00
|
|
|
#else
|
|
|
|
static inline void blk_pm_runtime_init(struct request_queue *q,
|
|
|
|
struct device *dev) {}
|
|
|
|
static inline int blk_pre_runtime_suspend(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
static inline void blk_post_runtime_suspend(struct request_queue *q, int err) {}
|
|
|
|
static inline void blk_pre_runtime_resume(struct request_queue *q) {}
|
|
|
|
static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
|
2016-11-18 21:16:06 +07:00
|
|
|
static inline void blk_set_runtime_active(struct request_queue *q) {}
|
2013-03-23 10:42:26 +07:00
|
|
|
#endif
|
|
|
|
|
2011-07-08 13:19:21 +07:00
|
|
|
/*
|
2011-09-21 15:00:16 +07:00
|
|
|
* blk_plug permits building a queue of related requests by holding the I/O
|
|
|
|
* fragments for a short period. This allows merging of sequential requests
|
|
|
|
* into a single larger request. As the requests are moved from a per-task list to
|
|
|
|
* the device's request_queue in a batch, this results in improved scalability
|
|
|
|
* as the lock contention for request_queue lock is reduced.
|
|
|
|
*
|
|
|
|
* It is ok not to disable preemption when adding the request to the plug list
|
|
|
|
* or when attempting a merge, because blk_schedule_flush_plug() will only flush
|
|
|
|
* the plug list when the task sleeps by itself. For details, please see
|
|
|
|
* schedule() where blk_schedule_flush_plug() is called.
|
2011-07-08 13:19:21 +07:00
|
|
|
*/
|
2011-03-08 19:19:51 +07:00
|
|
|
struct blk_plug {
|
2011-09-21 15:00:16 +07:00
|
|
|
struct list_head list; /* requests */
|
2013-10-24 15:20:05 +07:00
|
|
|
struct list_head mq_list; /* blk-mq requests */
|
2011-09-21 15:00:16 +07:00
|
|
|
struct list_head cb_list; /* md requires an unplug callback */
|
2011-03-08 19:19:51 +07:00
|
|
|
};
|
2011-07-08 13:19:20 +07:00
|
|
|
#define BLK_MAX_REQUEST_COUNT 16
|
2016-11-04 07:03:53 +07:00
|
|
|
#define BLK_PLUG_FLUSH_SIZE (128 * 1024)
|
2011-07-08 13:19:20 +07:00
|
|
|
|
2012-07-31 14:08:14 +07:00
|
|
|
struct blk_plug_cb;
|
2012-07-31 14:08:15 +07:00
|
|
|
typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool);
|
2011-04-18 14:52:22 +07:00
|
|
|
struct blk_plug_cb {
|
|
|
|
struct list_head list;
|
2012-07-31 14:08:14 +07:00
|
|
|
blk_plug_cb_fn callback;
|
|
|
|
void *data;
|
2011-04-18 14:52:22 +07:00
|
|
|
};
|
2012-07-31 14:08:14 +07:00
|
|
|
extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
|
|
|
|
void *data, int size);
|
2011-03-08 19:19:51 +07:00
|
|
|
extern void blk_start_plug(struct blk_plug *);
|
|
|
|
extern void blk_finish_plug(struct blk_plug *);
|
2011-04-15 20:49:07 +07:00
|
|
|
extern void blk_flush_plug_list(struct blk_plug *, bool);
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
static inline void blk_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2011-04-16 18:27:55 +07:00
|
|
|
if (plug)
|
|
|
|
blk_flush_plug_list(plug, false);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_schedule_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2011-04-15 20:20:10 +07:00
|
|
|
if (plug)
|
2011-04-15 20:49:07 +07:00
|
|
|
blk_flush_plug_list(plug, true);
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_needs_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2013-10-24 15:20:05 +07:00
|
|
|
return plug &&
|
|
|
|
(!list_empty(&plug->list) ||
|
|
|
|
!list_empty(&plug->mq_list) ||
|
|
|
|
!list_empty(&plug->cb_list));
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* tag stuff
|
|
|
|
*/
|
2007-07-24 14:28:11 +07:00
|
|
|
extern int blk_queue_start_tag(struct request_queue *, struct request *);
|
|
|
|
extern struct request *blk_queue_find_tag(struct request_queue *, int);
|
|
|
|
extern void blk_queue_end_tag(struct request_queue *, struct request *);
|
2015-01-16 08:32:25 +07:00
|
|
|
extern int blk_queue_init_tags(struct request_queue *, int, struct blk_queue_tag *, int);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_free_tags(struct request_queue *);
|
|
|
|
extern int blk_queue_resize_tags(struct request_queue *, int);
|
|
|
|
extern void blk_queue_invalidate_tags(struct request_queue *);
|
2015-01-16 08:32:25 +07:00
|
|
|
extern struct blk_queue_tag *blk_init_tags(int, int);
|
2006-08-31 02:48:45 +07:00
|
|
|
extern void blk_free_tags(struct blk_queue_tag *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-10-04 13:27:25 +07:00
|
|
|
static inline struct request *blk_map_queue_find_tag(struct blk_queue_tag *bqt,
|
|
|
|
int tag)
|
|
|
|
{
|
|
|
|
if (unlikely(bqt == NULL || tag >= bqt->real_max_depth))
|
|
|
|
return NULL;
|
|
|
|
return bqt->tag_index[tag];
|
|
|
|
}
|
2010-09-17 01:51:46 +07:00
|
|
|
|
2017-04-06 00:21:08 +07:00
|
|
|
extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
|
|
|
|
extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, struct page *page);
|
2016-07-19 16:23:33 +07:00
|
|
|
|
|
|
|
#define BLKDEV_DISCARD_SECURE (1 << 0) /* issue a secure erase */
|
2010-09-17 01:51:46 +07:00
|
|
|
|
2010-04-28 20:55:06 +07:00
|
|
|
extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
|
2016-04-17 01:55:28 +07:00
|
|
|
extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
|
2016-06-09 21:00:36 +07:00
|
|
|
sector_t nr_sects, gfp_t gfp_mask, int flags,
|
2016-06-06 02:31:49 +07:00
|
|
|
struct bio **biop);
|
2017-04-06 00:21:08 +07:00
|
|
|
|
|
|
|
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
|
2017-04-06 00:21:10 +07:00
|
|
|
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
|
2017-04-06 00:21:08 +07:00
|
|
|
|
2016-12-01 03:28:58 +07:00
|
|
|
extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
|
2017-04-06 00:21:08 +07:00
|
|
|
unsigned flags);
|
2010-04-28 20:55:09 +07:00
|
|
|
extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
|
2017-04-06 00:21:08 +07:00
|
|
|
sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
|
|
|
|
|
2010-08-18 16:29:10 +07:00
|
|
|
static inline int sb_issue_discard(struct super_block *sb, sector_t block,
|
|
|
|
sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
|
2008-08-06 00:01:53 +07:00
|
|
|
{
|
2010-08-18 16:29:10 +07:00
|
|
|
return blkdev_issue_discard(sb->s_bdev, block << (sb->s_blocksize_bits - 9),
|
|
|
|
nr_blocks << (sb->s_blocksize_bits - 9),
|
|
|
|
gfp_mask, flags);
|
2008-08-06 00:01:53 +07:00
|
|
|
}
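The shift in sb_issue_discard() converts filesystem blocks to 512-byte sectors. A minimal user-space sketch of that arithmetic follows; `demo_block_to_sector` is a hypothetical name used only for illustration and is not part of the kernel API.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): convert a filesystem block
 * number into a count of 512-byte sectors. Since 512 == 1 << 9, the shift
 * (blocksize_bits - 9) is log2 of the number of sectors per block.
 * Assumes blocksize_bits >= 9.
 */
static inline uint64_t demo_block_to_sector(uint64_t block,
					    unsigned int blocksize_bits)
{
	return block << (blocksize_bits - 9);
}
```

With 4096-byte blocks (blocksize_bits == 12) each block covers 8 sectors, so block 10 starts at sector 80.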
|
2010-10-28 08:30:04 +07:00
|
|
|
static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
|
2010-10-28 10:44:47 +07:00
|
|
|
sector_t nr_blocks, gfp_t gfp_mask)
|
2010-10-28 08:30:04 +07:00
|
|
|
{
|
|
|
|
return blkdev_issue_zeroout(sb->s_bdev,
|
|
|
|
block << (sb->s_blocksize_bits - 9),
|
|
|
|
nr_blocks << (sb->s_blocksize_bits - 9),
|
2017-04-06 00:21:08 +07:00
|
|
|
gfp_mask, 0);
|
2010-10-28 08:30:04 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-06-26 21:27:10 +07:00
|
|
|
extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
|
2008-06-26 18:48:27 +07:00
|
|
|
|
2010-02-26 12:20:37 +07:00
|
|
|
enum blk_default_limits {
|
|
|
|
BLK_MAX_SEGMENTS = 128,
|
|
|
|
BLK_SAFE_MAX_SECTORS = 255,
|
2015-08-14 01:57:57 +07:00
|
|
|
BLK_DEF_MAX_SECTORS = 2560,
|
2010-02-26 12:20:37 +07:00
|
|
|
BLK_MAX_SEGMENT_SIZE = 65536,
|
|
|
|
BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL,
|
|
|
|
};
|
2008-12-03 18:55:08 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#define blkdev_entry_to_request(entry) list_entry((entry), struct request, queuelist)
|
|
|
|
|
2009-05-23 04:17:50 +07:00
|
|
|
static inline unsigned long queue_bounce_pfn(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.bounce_pfn;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long queue_segment_boundary(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.seg_boundary_mask;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
2015-08-20 04:24:05 +07:00
|
|
|
static inline unsigned long queue_virt_boundary(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.virt_boundary_mask;
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:50 +07:00
|
|
|
static inline unsigned int queue_max_sectors(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.max_sectors;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int queue_max_hw_sectors(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.max_hw_sectors;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
2010-02-26 12:20:39 +07:00
|
|
|
static inline unsigned short queue_max_segments(struct request_queue *q)
|
2009-05-23 04:17:50 +07:00
|
|
|
{
|
2010-02-26 12:20:39 +07:00
|
|
|
return q->limits.max_segments;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
2017-02-08 20:46:49 +07:00
|
|
|
static inline unsigned short queue_max_discard_segments(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.max_discard_segments;
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:50 +07:00
|
|
|
static inline unsigned int queue_max_segment_size(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.max_segment_size;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:49 +07:00
|
|
|
static inline unsigned short queue_logical_block_size(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int retval = 512;
|
|
|
|
|
2009-05-23 04:17:51 +07:00
|
|
|
if (q && q->limits.logical_block_size)
|
|
|
|
retval = q->limits.logical_block_size;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:49 +07:00
|
|
|
static inline unsigned short bdev_logical_block_size(struct block_device *bdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2009-05-23 04:17:49 +07:00
|
|
|
return queue_logical_block_size(bdev_get_queue(bdev));
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
static inline unsigned int queue_physical_block_size(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.physical_block_size;
|
|
|
|
}
|
|
|
|
|
2010-10-14 02:18:03 +07:00
|
|
|
static inline unsigned int bdev_physical_block_size(struct block_device *bdev)
|
2009-10-04 01:52:01 +07:00
|
|
|
{
|
|
|
|
return queue_physical_block_size(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
static inline unsigned int queue_io_min(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.io_min;
|
|
|
|
}
|
|
|
|
|
2009-10-04 01:52:01 +07:00
|
|
|
static inline int bdev_io_min(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
return queue_io_min(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
static inline unsigned int queue_io_opt(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.io_opt;
|
|
|
|
}
|
|
|
|
|
2009-10-04 01:52:01 +07:00
|
|
|
static inline int bdev_io_opt(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
return queue_io_opt(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
static inline int queue_alignment_offset(struct request_queue *q)
|
|
|
|
{
|
2009-10-04 01:52:01 +07:00
|
|
|
if (q->limits.misaligned)
|
2009-05-23 04:17:53 +07:00
|
|
|
return -1;
|
|
|
|
|
2009-10-04 01:52:01 +07:00
|
|
|
return q->limits.alignment_offset;
|
2009-05-23 04:17:53 +07:00
|
|
|
}
|
|
|
|
|
2010-01-11 15:21:51 +07:00
|
|
|
static inline int queue_limit_alignment_offset(struct queue_limits *lim, sector_t sector)
|
2009-12-29 14:35:35 +07:00
|
|
|
{
|
|
|
|
unsigned int granularity = max(lim->physical_block_size, lim->io_min);
|
2014-10-09 05:26:13 +07:00
|
|
|
unsigned int alignment = sector_div(sector, granularity >> 9) << 9;
|
2009-12-29 14:35:35 +07:00
|
|
|
|
2014-10-09 05:26:13 +07:00
|
|
|
return (granularity + lim->alignment_offset - alignment) % granularity;
|
2009-05-23 04:17:53 +07:00
|
|
|
}
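The arithmetic in queue_limit_alignment_offset() can be tried outside the kernel. The sketch below is a user-space model under stated assumptions: `struct demo_limits` and `demo_limit_alignment_offset` are hypothetical names, the plain `%` operator stands in for the remainder that sector_div() returns, and the granularity is assumed to be at least 512 bytes.

```c
#include <stdint.h>

/* Hypothetical stand-in for the three queue_limits fields used here (bytes). */
struct demo_limits {
	unsigned int physical_block_size;
	unsigned int io_min;
	unsigned int alignment_offset;
};

static inline unsigned int
demo_limit_alignment_offset(const struct demo_limits *lim, uint64_t sector)
{
	/* Same as max(physical_block_size, io_min) in the header. */
	unsigned int granularity = lim->physical_block_size > lim->io_min ?
				   lim->physical_block_size : lim->io_min;
	/* Byte position of the start sector inside one granularity unit. */
	unsigned int alignment =
		(unsigned int)(sector % (granularity >> 9)) << 9;

	/* Bytes from the start sector to the next naturally aligned boundary. */
	return (granularity + lim->alignment_offset - alignment) % granularity;
}
```

For a device with 4096-byte physical blocks and no device-level offset, a partition starting at sector 63 is 512 bytes short of the next aligned boundary, while one starting at sector 64 is aligned.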
|
|
|
|
|
2009-10-04 01:52:01 +07:00
|
|
|
static inline int bdev_alignment_offset(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q->limits.misaligned)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
if (bdev != bdev->bd_contains)
|
|
|
|
return bdev->bd_part->alignment_offset;
|
|
|
|
|
|
|
|
return q->limits.alignment_offset;
|
|
|
|
}
|
|
|
|
|
2009-11-10 17:50:21 +07:00
|
|
|
static inline int queue_discard_alignment(struct request_queue *q)
|
|
|
|
{
|
|
|
|
if (q->limits.discard_misaligned)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
return q->limits.discard_alignment;
|
|
|
|
}
|
|
|
|
|
2010-01-11 15:21:51 +07:00
|
|
|
static inline int queue_limit_discard_alignment(struct queue_limits *lim, sector_t sector)
|
2009-11-10 17:50:21 +07:00
|
|
|
{
|
2012-12-19 22:18:35 +07:00
|
|
|
unsigned int alignment, granularity, offset;
|
2010-01-11 15:21:48 +07:00
|
|
|
|
2011-05-18 15:37:35 +07:00
|
|
|
if (!lim->max_discard_sectors)
|
|
|
|
return 0;
|
|
|
|
|
2012-12-19 22:18:35 +07:00
|
|
|
/* Why are these in bytes, not sectors? */
|
|
|
|
alignment = lim->discard_alignment >> 9;
|
|
|
|
granularity = lim->discard_granularity >> 9;
|
|
|
|
if (!granularity)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Offset of the partition start in 'granularity' sectors */
|
|
|
|
offset = sector_div(sector, granularity);
|
|
|
|
|
|
|
|
/* And why do we do this modulus *again* in blkdev_issue_discard()? */
|
|
|
|
offset = (granularity + alignment - offset) % granularity;
|
|
|
|
|
|
|
|
/* Turn it back into bytes, gaah */
|
|
|
|
return offset << 9;
|
2009-11-10 17:50:21 +07:00
|
|
|
}
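The byte/sector round trip in queue_limit_discard_alignment() is easier to follow with concrete numbers. The following is a user-space sketch only: `struct demo_discard_limits` and `demo_limit_discard_alignment` are hypothetical names, and `%` replaces the remainder returned by sector_div().

```c
#include <stdint.h>

/* Hypothetical stand-in for the queue_limits fields used by the helper. */
struct demo_discard_limits {
	unsigned int max_discard_sectors;  /* 0 means discard is unsupported */
	unsigned int discard_alignment;    /* bytes */
	unsigned int discard_granularity;  /* bytes */
};

static inline unsigned int
demo_limit_discard_alignment(const struct demo_discard_limits *lim,
			     uint64_t sector)
{
	unsigned int alignment, granularity, offset;

	if (!lim->max_discard_sectors)
		return 0;

	/* Convert the byte-based limits to 512-byte sectors. */
	alignment = lim->discard_alignment >> 9;
	granularity = lim->discard_granularity >> 9;
	if (!granularity)
		return 0;

	/* Position of the start sector inside one discard granule. */
	offset = (unsigned int)(sector % granularity);

	/* Sectors from the start sector to the next aligned granule. */
	offset = (granularity + alignment - offset) % granularity;

	/* Report the result in bytes, as the header does. */
	return offset << 9;
}
```

With a 32768-byte granularity (64 sectors) and a start at sector 2, the next granule boundary is 62 sectors away, so the function returns 62 * 512 = 31744 bytes.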
|
|
|
|
|
block: split discard into aligned requests
When a disk has large discard_granularity and small max_discard_sectors,
discards are not split with optimal alignment. In the limit case of
discard_granularity == max_discard_sectors, no request could be aligned
correctly, so in fact you might end up with no discarded logical blocks
at all.
Another example that helps showing the condition in the patch is with
discard_granularity == 64, max_discard_sectors == 128. A request that is
submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
However, only 2 aligned blocks out of 3 are included in the request;
128..191 may be left intact and not discarded. With this patch, the
first request will be truncated to ensure good alignment of what's left,
and the split will be 2..127, 128..255, 256..257. The patch will also
take into account the discard_alignment.
At most one extra request will be introduced, because the first request
will be reduced by at most granularity-1 sectors, and granularity
must be less than max_discard_sectors. Subsequent requests will run
on round_down(max_discard_sectors, granularity) sectors, as in the
current code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02 14:48:50 +07:00
|
|
|
static inline int bdev_discard_alignment(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (bdev != bdev->bd_contains)
|
|
|
|
return bdev->bd_part->discard_alignment;
|
|
|
|
|
|
|
|
return q->limits.discard_alignment;
|
|
|
|
}
|
|
|
|
|
2012-09-18 23:19:27 +07:00
|
|
|
static inline unsigned int bdev_write_same(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return q->limits.max_write_same_sectors;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-12-01 03:28:59 +07:00
|
|
|
static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return q->limits.max_write_zeroes_sectors;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-10-18 13:40:29 +07:00
|
|
|
static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return blk_queue_zoned_model(q);
|
|
|
|
|
|
|
|
return BLK_ZONED_NONE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool bdev_is_zoned(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return blk_queue_is_zoned(q);
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2017-01-12 21:58:32 +07:00
|
|
|
static inline unsigned int bdev_zone_sectors(struct block_device *bdev)
|
2016-10-18 13:40:33 +07:00
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
2017-01-12 21:58:32 +07:00
|
|
|
return blk_queue_zone_sectors(q);
|
2016-10-18 13:40:33 +07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
static inline int queue_dma_alignment(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2008-01-01 22:23:02 +07:00
|
|
|
return q ? q->dma_alignment : 511;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2010-09-15 18:08:27 +07:00
|
|
|
static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
|
2008-08-28 13:05:58 +07:00
|
|
|
unsigned int len)
|
|
|
|
{
|
|
|
|
unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;
|
2010-09-15 18:08:27 +07:00
|
|
|
return !(addr & alignment) && !(len & alignment);
|
2008-08-28 13:05:58 +07:00
|
|
|
}
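The mask test in blk_rq_aligned() can be illustrated with a small user-space model. This is a sketch, not kernel code: `demo_rq_aligned` is a hypothetical name, and the two masks are passed in directly instead of being read from a request_queue.

```c
/*
 * Hypothetical model of the alignment check: both masks have the form
 * (power of two - 1), so OR-ing them yields the stricter of the two.
 * A buffer passes only if neither its address nor its length has any of
 * the masked low bits set.
 */
static inline int demo_rq_aligned(unsigned long addr, unsigned int len,
				  unsigned int dma_alignment,
				  unsigned int dma_pad_mask)
{
	unsigned int alignment = dma_alignment | dma_pad_mask;

	return !(addr & alignment) && !(len & alignment);
}
```

With the default mask of 511, a 512-byte buffer at address 0x1000 passes, while the same buffer at 0x1004 does not.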
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* assumes size > 256 */
|
|
|
|
static inline unsigned int blksize_bits(unsigned int size)
|
|
|
|
{
|
|
|
|
unsigned int bits = 8;
|
|
|
|
do {
|
|
|
|
bits++;
|
|
|
|
size >>= 1;
|
|
|
|
} while (size > 256);
|
|
|
|
return bits;
|
|
|
|
}
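For reference, the loop in blksize_bits() computes log2 of a power-of-two block size of at least 512. The copy below is a user-space sketch with the hypothetical name `demo_blksize_bits`, included only so the behaviour can be checked with concrete values.

```c
/*
 * Hypothetical user-space copy of the loop. It starts at 8 bits (256) and
 * counts one more bit per halving until the size drops to 256, so the
 * result is only meaningful for power-of-two sizes of 512 or more.
 */
static inline unsigned int demo_blksize_bits(unsigned int size)
{
	unsigned int bits = 8;

	do {
		bits++;
		size >>= 1;
	} while (size > 256);
	return bits;
}
```

A 512-byte block size gives 9 and a 4096-byte block size gives 12.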
|
|
|
|
|
2005-09-10 14:27:17 +07:00
|
|
|
static inline unsigned int block_size(struct block_device *bdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
return bdev->bd_block_size;
|
|
|
|
}
|
|
|
|
|
2011-05-07 00:34:32 +07:00
|
|
|
static inline bool queue_flush_queueable(struct request_queue *q)
|
|
|
|
{
|
2016-04-14 02:33:19 +07:00
|
|
|
return !test_bit(QUEUE_FLAG_FLUSH_NQ, &q->queue_flags);
|
2011-05-07 00:34:32 +07:00
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
typedef struct {struct page *v;} Sector;
|
|
|
|
|
|
|
|
unsigned char *read_dev_sector(struct block_device *, sector_t, Sector *);
|
|
|
|
|
|
|
|
static inline void put_dev_sector(Sector p)
|
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 19:29:47 +07:00
|
|
|
put_page(p.v);
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2016-02-26 22:40:51 +07:00
|
|
|
static inline bool __bvec_gap_to_prev(struct request_queue *q,
|
|
|
|
struct bio_vec *bprv, unsigned int offset)
|
|
|
|
{
|
|
|
|
return offset ||
|
|
|
|
((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
|
|
|
|
}
|
|
|
|
|
2015-08-20 04:24:05 +07:00
|
|
|
/*
|
|
|
|
* Check if adding a bio_vec after bprv with offset would create a gap in
|
|
|
|
* the SG list. Most drivers don't care about this, but some do.
|
|
|
|
*/
|
|
|
|
static inline bool bvec_gap_to_prev(struct request_queue *q,
|
|
|
|
struct bio_vec *bprv, unsigned int offset)
|
|
|
|
{
|
|
|
|
if (!queue_virt_boundary(q))
|
|
|
|
return false;
|
2016-02-26 22:40:51 +07:00
|
|
|
return __bvec_gap_to_prev(q, bprv, offset);
|
2015-08-20 04:24:05 +07:00
|
|
|
}
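The gap check can be modelled in user space to see which vector pairs count as a gap. This is a sketch under stated assumptions: `struct demo_bvec` and `demo_bvec_gap_to_prev` are hypothetical names, and the virt boundary mask is passed in directly instead of being read from the queue limits.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two bio_vec fields the check uses. */
struct demo_bvec {
	unsigned int bv_offset;  /* offset of the data within its page */
	unsigned int bv_len;     /* length of the data in bytes */
};

/*
 * A gap exists when the next vector does not start at offset 0, or when
 * the previous vector does not end exactly on a boundary described by the
 * mask. A zero mask means the device has no such restriction.
 */
static inline bool demo_bvec_gap_to_prev(unsigned long virt_boundary_mask,
					 const struct demo_bvec *prev,
					 unsigned int offset)
{
	if (!virt_boundary_mask)
		return false;
	return offset ||
	       ((prev->bv_offset + prev->bv_len) & virt_boundary_mask);
}
```

With a 4 KiB boundary (mask 0xfff), a previous vector covering a full page followed by a vector at offset 0 has no gap, whereas a previous vector of 512 bytes leaves one.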
|
|
|
|
|
2016-12-17 17:49:09 +07:00
|
|
|
/*
|
|
|
|
* Check if the two bvecs from two bios can be merged to one segment.
|
|
|
|
* If yes, no need to check gap between the two bios since the 1st bio
|
|
|
|
* and the 1st bvec in the 2nd bio can be handled in one segment.
|
|
|
|
*/
|
|
|
|
static inline bool bios_segs_mergeable(struct request_queue *q,
|
|
|
|
struct bio *prev, struct bio_vec *prev_last_bv,
|
|
|
|
struct bio_vec *next_first_bv)
|
|
|
|
{
|
|
|
|
if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv))
|
|
|
|
return false;
|
|
|
|
if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv))
|
|
|
|
return false;
|
|
|
|
if (prev->bi_seg_back_size + next_first_bv->bv_len >
|
|
|
|
queue_max_segment_size(q))
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2015-09-03 23:28:20 +07:00
|
|
|
static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
|
|
|
|
struct bio *next)
|
|
|
|
{
|
2016-02-26 22:40:52 +07:00
|
|
|
if (bio_has_data(prev) && queue_virt_boundary(q)) {
|
|
|
|
struct bio_vec pb, nb;
|
|
|
|
|
|
|
|
bio_get_last_bvec(prev, &pb);
|
|
|
|
bio_get_first_bvec(next, &nb);
|
2015-09-03 23:28:20 +07:00
|
|
|
|
2016-12-17 17:49:09 +07:00
|
|
|
if (!bios_segs_mergeable(q, prev, &pb, &nb))
|
|
|
|
return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
|
2016-02-26 22:40:52 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
2015-09-03 23:28:20 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
|
|
|
|
{
|
|
|
|
return bio_will_gap(req->q, req->biotail, bio);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
|
|
|
|
{
|
|
|
|
return bio_will_gap(req->q, bio, req->bio);
|
|
|
|
}
|
|
|
|
|
2014-04-08 22:15:35 +07:00
|
|
|
int kblockd_schedule_work(struct work_struct *work);
|
2016-08-25 04:52:48 +07:00
|
|
|
int kblockd_schedule_work_on(int cpu, struct work_struct *work);
|
2014-04-08 22:15:35 +07:00
|
|
|
int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long delay);
|
2014-04-08 22:17:40 +07:00
|
|
|
int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-04-02 05:01:41 +07:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
2010-06-01 17:23:18 +07:00
|
|
|
/*
|
|
|
|
* This should not be using sched_clock(). A real patch is in progress
|
|
|
|
* to fix this up, until that is in place we need to disable preemption
|
|
|
|
* around sched_clock() in this function and set_io_start_time_ns().
|
|
|
|
*/
|
2010-04-02 05:01:41 +07:00
|
|
|
static inline void set_start_time_ns(struct request *req)
|
|
|
|
{
|
2010-06-01 17:23:18 +07:00
|
|
|
preempt_disable();
|
2010-04-02 05:01:41 +07:00
|
|
|
req->start_time_ns = sched_clock();
|
2010-06-01 17:23:18 +07:00
|
|
|
preempt_enable();
|
2010-04-02 05:01:41 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void set_io_start_time_ns(struct request *req)
|
|
|
|
{
|
2010-06-01 17:23:18 +07:00
|
|
|
preempt_disable();
|
2010-04-02 05:01:41 +07:00
|
|
|
req->io_start_time_ns = sched_clock();
|
2010-06-01 17:23:18 +07:00
|
|
|
preempt_enable();
|
2010-04-02 05:01:41 +07:00
|
|
|
}
|
2010-04-09 13:31:19 +07:00
|
|
|
|
|
|
|
static inline uint64_t rq_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return req->start_time_ns;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline uint64_t rq_io_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return req->io_start_time_ns;
|
|
|
|
}
|
2010-04-02 05:01:41 +07:00
|
|
|
#else
|
|
|
|
static inline void set_start_time_ns(struct request *req) {}
|
|
|
|
static inline void set_io_start_time_ns(struct request *req) {}
|
2010-04-09 13:31:19 +07:00
|
|
|
static inline uint64_t rq_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline uint64_t rq_io_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2010-04-02 05:01:41 +07:00
|
|
|
#endif
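The CONFIG_BLK_CGROUP block above follows a common pattern: real accessors when the option is on, empty or zero-returning inline stubs when it is off, so callers never need their own #ifdef. A minimal userspace sketch of that pattern follows. Every name prefixed with sketch_ or SKETCH_ is hypothetical, and the timestamp is passed in explicitly (an assumption made for testability) instead of being read from sched_clock().

```c
#include <stdint.h>

/* Hypothetical stand-in for the two timestamp fields of struct request. */
struct sketch_request {
	uint64_t start_time_ns;
	uint64_t io_start_time_ns;
};

/* Remove this define to compile the stub variants instead. */
#define SKETCH_CONFIG_TIMING 1

#ifdef SKETCH_CONFIG_TIMING
/* Option enabled: record and report the timestamp. */
static inline void sketch_set_start_time_ns(struct sketch_request *req,
					    uint64_t now)
{
	req->start_time_ns = now;
}

static inline uint64_t sketch_rq_start_time_ns(const struct sketch_request *req)
{
	return req->start_time_ns;
}
#else
/* Option disabled: same signatures, no work done, reads yield 0. */
static inline void sketch_set_start_time_ns(struct sketch_request *req,
					    uint64_t now)
{
	(void)req;
	(void)now;
}

static inline uint64_t sketch_rq_start_time_ns(const struct sketch_request *req)
{
	(void)req;
	return 0;
}
#endif
```

Because both branches expose identical signatures, calling code compiles unchanged in either configuration, and the stub calls can be optimised away.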
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#define MODULE_ALIAS_BLOCKDEV(major,minor) \
|
|
|
|
MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
|
|
|
|
#define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \
|
|
|
|
MODULE_ALIAS("block-major-" __stringify(major) "-*")
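The two alias macros above rely on string-literal concatenation plus two-level stringification, so that a macro constant passed as the major number is expanded before it is turned into text. The sketch below reproduces only the string-building part in userspace. The SKETCH_ names are hypothetical, MODULE_ALIAS() itself is left out, and the major number 7 is an illustrative value.

```c
#include <string.h>

/* Two-level stringification: the outer macro expands its argument first. */
#define SKETCH_STRINGIFY_1(x) #x
#define SKETCH_STRINGIFY(x) SKETCH_STRINGIFY_1(x)

/* Hypothetical stand-ins that build only the alias string. */
#define SKETCH_ALIAS_BLOCKDEV(major, minor) \
	"block-major-" SKETCH_STRINGIFY(major) "-" SKETCH_STRINGIFY(minor)
#define SKETCH_ALIAS_BLOCKDEV_MAJOR(major) \
	"block-major-" SKETCH_STRINGIFY(major) "-*"

/* Illustrative constant; it is expanded to 7 before being stringified. */
#define SKETCH_EXAMPLE_MAJOR 7

/* Alias matching one specific major/minor pair. */
static const char *sketch_alias_exact(void)
{
	return SKETCH_ALIAS_BLOCKDEV(SKETCH_EXAMPLE_MAJOR, 0);
}

/* Alias matching every minor of the given major. */
static const char *sketch_alias_any_minor(void)
{
	return SKETCH_ALIAS_BLOCKDEV_MAJOR(SKETCH_EXAMPLE_MAJOR);
}
```

With a single-level `#x` the result would contain the macro's name instead of its value, which is why the indirection is needed.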
|
|
|
|
|
2008-07-01 01:04:41 +07:00
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
|
2014-09-27 06:20:02 +07:00
|
|
|
enum blk_integrity_flags {
|
|
|
|
BLK_INTEGRITY_VERIFY = 1 << 0,
|
|
|
|
BLK_INTEGRITY_GENERATE = 1 << 1,
|
2014-09-27 06:20:03 +07:00
|
|
|
BLK_INTEGRITY_DEVICE_CAPABLE = 1 << 2,
|
2014-09-27 06:20:05 +07:00
|
|
|
BLK_INTEGRITY_IP_CHECKSUM = 1 << 3,
|
2014-09-27 06:20:02 +07:00
|
|
|
};
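Each enumerator above occupies its own bit, so several capabilities can be combined in one flags word and tested with bitwise operations. The following userspace sketch copies the four values from the enum. The helper functions and the sketch_ prefix are hypothetical and are not part of the block layer.

```c
#include <stdbool.h>

/* Userspace copy of the flag values from the enum above. */
enum sketch_integrity_flags {
	SKETCH_INTEGRITY_VERIFY         = 1 << 0,
	SKETCH_INTEGRITY_GENERATE       = 1 << 1,
	SKETCH_INTEGRITY_DEVICE_CAPABLE = 1 << 2,
	SKETCH_INTEGRITY_IP_CHECKSUM    = 1 << 3,
};

/* Hypothetical helper: true only if every bit in 'wanted' is set. */
static bool sketch_flags_all_set(unsigned int flags, unsigned int wanted)
{
	return (flags & wanted) == wanted;
}

/* Hypothetical helper: return 'flags' with the given bits cleared. */
static unsigned int sketch_flags_clear(unsigned int flags, unsigned int bits)
{
	return flags & ~bits;
}
```

Comparing against `wanted` instead of testing for a non-zero result matters when more than one bit is requested at once.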
|
2008-07-01 01:04:41 +07:00
|
|
|
|
2014-09-27 06:20:01 +07:00
|
|
|
struct blk_integrity_iter {
|
2008-07-01 01:04:41 +07:00
|
|
|
void *prot_buf;
|
|
|
|
void *data_buf;
|
2014-09-27 06:19:59 +07:00
|
|
|
sector_t seed;
|
2008-07-01 01:04:41 +07:00
|
|
|
unsigned int data_size;
|
2014-09-27 06:19:59 +07:00
|
|
|
unsigned short interval;
|
2008-07-01 01:04:41 +07:00
|
|
|
const char *disk_name;
|
|
|
|
};
|
|
|
|
|
2014-09-27 06:20:01 +07:00
|
|
|
typedef int (integrity_processing_fn) (struct blk_integrity_iter *);
|
2008-07-01 01:04:41 +07:00
|
|
|
|
2015-10-22 00:19:33 +07:00
|
|
|
struct blk_integrity_profile {
|
|
|
|
integrity_processing_fn *generate_fn;
|
|
|
|
integrity_processing_fn *verify_fn;
|
|
|
|
const char *name;
|
|
|
|
};
|
2008-07-01 01:04:41 +07:00
|
|
|
|
2015-10-22 00:19:49 +07:00
|
|
|
extern void blk_integrity_register(struct gendisk *, struct blk_integrity *);
|
2008-07-01 01:04:41 +07:00
|
|
|
extern void blk_integrity_unregister(struct gendisk *);
|
2008-10-01 14:38:39 +07:00
|
|
|
extern int blk_integrity_compare(struct gendisk *, struct gendisk *);
|
2010-09-11 01:50:10 +07:00
|
|
|
extern int blk_rq_map_integrity_sg(struct request_queue *, struct bio *,
|
|
|
|
struct scatterlist *);
|
|
|
|
extern int blk_rq_count_integrity_sg(struct request_queue *, struct bio *);
|
2014-09-27 06:20:06 +07:00
|
|
|
extern bool blk_integrity_merge_rq(struct request_queue *, struct request *,
|
|
|
|
struct request *);
|
|
|
|
extern bool blk_integrity_merge_bio(struct request_queue *, struct request *,
|
|
|
|
struct bio *);
|
2008-07-01 01:04:41 +07:00
|
|
|
|
2015-10-22 00:19:49 +07:00
|
|
|
static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
|
2008-10-02 17:53:22 +07:00
|
|
|
{
|
2015-10-22 00:20:18 +07:00
|
|
|
struct blk_integrity *bi = &disk->queue->integrity;
|
2015-10-22 00:19:49 +07:00
|
|
|
|
|
|
|
if (!bi->profile)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return bi;
|
2008-10-02 17:53:22 +07:00
|
|
|
}
|
|
|
|
|
2015-10-22 00:19:49 +07:00
|
|
|
static inline
|
|
|
|
struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
|
2008-10-02 23:47:49 +07:00
|
|
|
{
|
2015-10-22 00:19:49 +07:00
|
|
|
return blk_get_integrity(bdev->bd_disk);
|
2008-10-02 23:47:49 +07:00
|
|
|
}
|
|
|
|
|
2014-09-27 06:19:56 +07:00
|
|
|
static inline bool blk_integrity_rq(struct request *rq)
|
2008-07-01 01:04:41 +07:00
|
|
|
{
|
2014-09-27 06:19:56 +07:00
|
|
|
return rq->cmd_flags & REQ_INTEGRITY;
|
2008-07-01 01:04:41 +07:00
|
|
|
}
|
|
|
|
|
2010-09-11 01:50:10 +07:00
|
|
|
static inline void blk_queue_max_integrity_segments(struct request_queue *q,
|
|
|
|
unsigned int segs)
|
|
|
|
{
|
|
|
|
q->limits.max_integrity_segments = segs;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned short
|
|
|
|
queue_max_integrity_segments(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.max_integrity_segments;
|
|
|
|
}
|
|
|
|
|
2015-09-11 22:03:04 +07:00
|
|
|
static inline bool integrity_req_gap_back_merge(struct request *req,
|
|
|
|
struct bio *next)
|
|
|
|
{
|
|
|
|
struct bio_integrity_payload *bip = bio_integrity(req->bio);
|
|
|
|
struct bio_integrity_payload *bip_next = bio_integrity(next);
|
|
|
|
|
|
|
|
return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
|
|
|
|
bip_next->bip_vec[0].bv_offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool integrity_req_gap_front_merge(struct request *req,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
struct bio_integrity_payload *bip = bio_integrity(bio);
|
|
|
|
struct bio_integrity_payload *bip_next = bio_integrity(req->bio);
|
|
|
|
|
|
|
|
return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
|
|
|
|
bip_next->bip_vec[0].bv_offset);
|
|
|
|
}
|
|
|
|
|
2008-07-01 01:04:41 +07:00
|
|
|
#else /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2012-01-12 15:17:30 +07:00
|
|
|
struct bio;
|
|
|
|
struct block_device;
|
|
|
|
struct gendisk;
|
|
|
|
struct blk_integrity;
|
|
|
|
|
|
|
|
static inline bool blk_integrity_rq(struct request *rq)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_rq_count_integrity_sg(struct request_queue *q,
|
|
|
|
struct bio *b)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_rq_map_integrity_sg(struct request_queue *q,
|
|
|
|
struct bio *b,
|
|
|
|
struct scatterlist *s)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline struct blk_integrity *bdev_get_integrity(struct block_device *b)
|
|
|
|
{
|
2014-10-10 05:30:17 +07:00
|
|
|
return NULL;
|
2012-01-12 15:17:30 +07:00
|
|
|
}
|
|
|
|
static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
static inline int blk_integrity_compare(struct gendisk *a, struct gendisk *b)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2015-10-22 00:19:49 +07:00
|
|
|
static inline void blk_integrity_register(struct gendisk *d,
|
2012-01-12 15:17:30 +07:00
|
|
|
struct blk_integrity *b)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void blk_integrity_unregister(struct gendisk *d)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void blk_queue_max_integrity_segments(struct request_queue *q,
|
|
|
|
unsigned int segs)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline unsigned short queue_max_integrity_segments(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2014-09-27 06:20:06 +07:00
|
|
|
static inline bool blk_integrity_merge_rq(struct request_queue *rq,
|
|
|
|
struct request *r1,
|
|
|
|
struct request *r2)
|
2012-01-12 15:17:30 +07:00
|
|
|
{
|
2014-10-29 09:27:43 +07:00
|
|
|
return true;
|
2012-01-12 15:17:30 +07:00
|
|
|
}
|
2014-09-27 06:20:06 +07:00
|
|
|
static inline bool blk_integrity_merge_bio(struct request_queue *rq,
|
|
|
|
struct request *r,
|
|
|
|
struct bio *b)
|
2012-01-12 15:17:30 +07:00
|
|
|
{
|
2014-10-29 09:27:43 +07:00
|
|
|
return true;
|
2012-01-12 15:17:30 +07:00
|
|
|
}
|
2015-10-22 00:19:49 +07:00
|
|
|
|
2015-09-11 22:03:04 +07:00
|
|
|
static inline bool integrity_req_gap_back_merge(struct request *req,
|
|
|
|
struct bio *next)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
static inline bool integrity_req_gap_front_merge(struct request *req,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2008-07-01 01:04:41 +07:00
|
|
|
|
|
|
|
#endif /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2016-01-16 07:55:59 +07:00
|
|
|
/**
|
|
|
|
* struct blk_dax_ctl - control and output parameters for ->direct_access
|
|
|
|
* @sector: (input) offset relative to a block_device
|
|
|
|
* @addr: (output) kernel virtual address for @sector populated by driver
|
|
|
|
* @pfn: (output) page frame number for @addr populated by driver
|
|
|
|
* @size: (input) number of bytes requested
|
|
|
|
*/
|
|
|
|
struct blk_dax_ctl {
|
|
|
|
sector_t sector;
|
2016-06-04 08:06:47 +07:00
|
|
|
void *addr;
|
2016-01-16 07:55:59 +07:00
|
|
|
long size;
|
2016-01-16 07:56:14 +07:00
|
|
|
pfn_t pfn;
|
2016-01-16 07:55:59 +07:00
|
|
|
};
|
|
|
|
|
2007-10-09 00:26:20 +07:00
|
|
|
struct block_device_operations {
|
[PATCH] beginning of methods conversion
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-03-02 21:09:22 +07:00
|
|
|
int (*open) (struct block_device *, fmode_t);
|
2013-05-06 08:52:57 +07:00
|
|
|
void (*release) (struct gendisk *, fmode_t);
|
2016-08-05 21:11:04 +07:00
|
|
|
int (*rw_page)(struct block_device *, sector_t, struct page *, bool);
|
2008-03-02 21:09:22 +07:00
|
|
|
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
|
|
|
|
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
|
2016-06-04 08:06:47 +07:00
|
|
|
long (*direct_access)(struct block_device *, sector_t, void **, pfn_t *,
|
|
|
|
long);
|
implement in-kernel gendisk events handling
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
a single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such
command sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether any events are pending and return them if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called in parallel.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-09 02:57:37 +07:00
|
|
|
unsigned int (*check_events) (struct gendisk *disk,
|
|
|
|
unsigned int clearing);
|
|
|
|
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
|
2007-10-09 00:26:20 +07:00
|
|
|
int (*media_changed) (struct gendisk *);
|
2010-05-16 01:09:29 +07:00
|
|
|
void (*unlock_native_capacity) (struct gendisk *);
|
2007-10-09 00:26:20 +07:00
|
|
|
int (*revalidate_disk) (struct gendisk *);
|
|
|
|
int (*getgeo)(struct block_device *, struct hd_geometry *);
|
2010-05-17 12:32:43 +07:00
|
|
|
/* this callback is called with swap_lock and sometimes the page table lock held */
|
|
|
|
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
|
2007-10-09 00:26:20 +07:00
|
|
|
struct module *owner;
|
2015-10-15 19:10:48 +07:00
|
|
|
const struct pr_ops *pr_ops;
|
2007-10-09 00:26:20 +07:00
|
|
|
};
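struct block_device_operations is a table of optional callbacks, and the commit message quoted above describes ->check_events() as returning a mask of pending events, with @clearing passed only as a hint. A minimal userspace sketch of that contract follows. The sketch_ types, the event bit values, and the consume-on-read behaviour are assumptions made for illustration and are not taken from any in-tree driver.

```c
/* Mock event bits; the values are assumptions for this sketch. */
#define SKETCH_EVENT_MEDIA_CHANGE  (1u << 0)
#define SKETCH_EVENT_EJECT_REQUEST (1u << 1)

/* Hypothetical stand-in for struct gendisk: only the pending-event mask. */
struct sketch_disk {
	unsigned int pending;
};

/* Hypothetical ops table with a single optional callback. */
struct sketch_bdev_ops {
	unsigned int (*check_events)(struct sketch_disk *disk,
				     unsigned int clearing);
};

/* Report the pending events and consume them. */
static unsigned int sketch_check_events(struct sketch_disk *disk,
					unsigned int clearing)
{
	unsigned int events = disk->pending;

	(void)clearing; /* a hint only; this sketch ignores it */
	disk->pending = 0;
	return events;
}

static const struct sketch_bdev_ops sketch_ops = {
	.check_events = sketch_check_events,
};

/* Caller side: treat a missing callback as "no events". */
static unsigned int sketch_poll(const struct sketch_bdev_ops *ops,
				struct sketch_disk *disk)
{
	if (!ops->check_events)
		return 0;
	return ops->check_events(disk, 0);
}
```

Checking the function pointer before calling it mirrors how optional members of an ops table are usually handled, since a driver may leave any of them unset.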
|
|
|
|
|
2007-08-30 07:34:12 +07:00
|
|
|
extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
|
|
|
|
unsigned long);
|
2014-06-05 06:07:46 +07:00
|
|
|
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
|
|
|
|
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
|
|
|
|
struct writeback_control *);
|
2016-01-16 07:55:59 +07:00
|
|
|
extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
|
2016-05-10 23:23:53 +07:00
|
|
|
extern int bdev_dax_supported(struct super_block *, int);
|
2016-05-10 23:23:57 +07:00
|
|
|
extern bool bdev_dax_capable(struct block_device *);
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 01:45:40 +07:00
|
|
|
#else /* CONFIG_BLOCK */
|
2014-06-05 06:06:27 +07:00
|
|
|
|
|
|
|
struct block_device;
|
|
|
|
|
2006-10-01 01:45:40 +07:00
|
|
|
/*
|
|
|
|
* stubs for when the block layer is configured out
|
|
|
|
*/
|
|
|
|
#define buffer_heads_over_limit 0
|
|
|
|
|
|
|
|
static inline long nr_blockdev_pages(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-03-12 02:17:08 +07:00
|
|
|
struct blk_plug {
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline void blk_start_plug(struct blk_plug *plug)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-03-12 02:17:08 +07:00
|
|
|
static inline void blk_finish_plug(struct blk_plug *plug)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-03-12 02:17:08 +07:00
|
|
|
static inline void blk_flush_plug(struct task_struct *task)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-04-16 18:27:55 +07:00
|
|
|
static inline void blk_schedule_flush_plug(struct task_struct *task)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2011-03-08 19:19:51 +07:00
|
|
|
static inline bool blk_needs_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2014-06-05 06:06:27 +07:00
|
|
|
static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
|
|
|
|
sector_t *error_sector)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-10-01 01:45:40 +07:00
|
|
|
#endif /* CONFIG_BLOCK */
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
|