2005-04-17 05:20:36 +07:00
|
|
|
#ifndef _LINUX_BLKDEV_H
|
|
|
|
#define _LINUX_BLKDEV_H
|
|
|
|
|
2012-05-14 13:29:23 +07:00
|
|
|
#include <linux/sched.h>
|
|
|
|
|
2007-09-21 14:19:54 +07:00
|
|
|
#ifdef CONFIG_BLOCK
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/major.h>
|
|
|
|
#include <linux/genhd.h>
|
|
|
|
#include <linux/list.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking requests on a per-device
basis. Basically the driver should be able to get a notification
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
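As a rough user-space illustration of two ideas from the commit message above (funnelling per-CPU queues into a smaller number of hardware queues, and attaching a driver-private payload to each request), here is a minimal sketch in C. All `sketch_` names are hypothetical, and the modulo mapping is an assumption chosen for brevity, not necessarily the mapping blk-mq uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct request. */
struct sketch_request {
	int tag;            /* unique per-request identifier */
	unsigned int cpu;   /* CPU the request was submitted from */
};

/*
 * Map a submitting CPU to a hardware queue index. With as many hardware
 * queues as CPUs this is a 1:1 mapping; with fewer hardware queues several
 * CPUs share one (N:M). Modulo spreading is an assumption of this sketch.
 */
static unsigned int sketch_map_cpu_to_hwq(unsigned int cpu, unsigned int nr_hw)
{
	assert(nr_hw > 0);
	return cpu % nr_hw;
}

/*
 * Allocate a zeroed request with cmd_size bytes of driver-private payload
 * placed directly behind it, so a single allocation covers both.
 * The sketch assumes the payload needs no stricter alignment than the
 * request structure itself.
 */
static struct sketch_request *sketch_alloc_request(size_t cmd_size)
{
	return calloc(1, sizeof(struct sketch_request) + cmd_size);
}

/* Return the driver-private payload that follows the request in memory. */
static void *sketch_rq_to_pdu(struct sketch_request *rq)
{
	return rq + 1;
}
```

A driver following this pattern would state its payload size once at init time and then call something like `sketch_rq_to_pdu()` on every request it receives, instead of allocating a command structure per IO.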
2013-10-24 15:20:05 +07:00
|
|
|
#include <linux/llist.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
#include <linux/timer.h>
|
|
|
|
#include <linux/workqueue.h>
|
|
|
|
#include <linux/pagemap.h>
|
|
|
|
#include <linux/backing-dev.h>
|
|
|
|
#include <linux/wait.h>
|
|
|
|
#include <linux/mempool.h>
|
|
|
|
#include <linux/bio.h>
|
|
|
|
#include <linux/stringify.h>
|
2008-09-11 15:57:55 +07:00
|
|
|
#include <linux/gfp.h>
|
2007-07-09 17:40:35 +07:00
|
|
|
#include <linux/bsg.h>
|
2008-09-14 01:26:01 +07:00
|
|
|
#include <linux/smp.h>
|
2013-01-09 23:05:13 +07:00
|
|
|
#include <linux/rcupdate.h>
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
#include <asm/scatterlist.h>
|
|
|
|
|
2011-05-27 00:46:22 +07:00
|
|
|
struct module;
|
2006-03-22 23:52:04 +07:00
|
|
|
struct scsi_ioctl_command;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
struct request_queue;
|
|
|
|
struct elevator_queue;
|
|
|
|
struct request_pm_state;
|
2006-03-24 02:00:26 +07:00
|
|
|
struct blk_trace;
|
2007-07-09 17:38:05 +07:00
|
|
|
struct request;
|
|
|
|
struct sg_io_hdr;
|
2011-08-01 03:05:09 +07:00
|
|
|
struct bsg_job;
|
2012-04-17 03:57:25 +07:00
|
|
|
struct blkcg_gq;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
#define BLKDEV_MIN_RQ 4
|
|
|
|
#define BLKDEV_MAX_RQ 128 /* Default maximum */
|
|
|
|
|
2012-04-14 03:11:28 +07:00
|
|
|
/*
|
|
|
|
* Maximum number of blkcg policies allowed to be registered concurrently.
|
|
|
|
* Defined here to simplify include dependency.
|
|
|
|
*/
|
|
|
|
#define BLKCG_MAX_POLS 2
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
struct request;
|
2006-01-06 15:49:03 +07:00
|
|
|
typedef void (rq_end_io_fn)(struct request *, int);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
#define BLK_RL_SYNCFULL (1U << 0)
|
|
|
|
#define BLK_RL_ASYNCFULL (1U << 1)
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
struct request_list {
|
2012-06-05 10:40:59 +07:00
|
|
|
struct request_queue *q; /* the queue this rl belongs to */
|
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup with a very low
weight, putting a program that issues many random direct IOs there,
and running sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
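The lookup-with-fallback behaviour described in the v3 note above can be sketched in user-space C as follows. This is only an illustration under stated assumptions: the `sketch_` types, the fixed-size group table, and the integer group id are all hypothetical and stand in for the kernel's blkg lookup.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_GROUPS 4   /* hypothetical table size for this sketch */

/* Hypothetical stand-in for struct request_list. */
struct sketch_request_list {
	int count[2];   /* allocated requests per direction */
};

/* Hypothetical stand-in for struct request_queue. */
struct sketch_queue {
	/* request list embedded in the queue, used by the root group */
	struct sketch_request_list root_rl;
	/* per-group request lists; NULL models a failed group lookup */
	struct sketch_request_list *group_rl[SKETCH_MAX_GROUPS];
};

/*
 * Pick the request list an IO should allocate from. If the group cannot be
 * resolved, fall back to the root list instead of dereferencing NULL.
 */
static struct sketch_request_list *sketch_get_rl(struct sketch_queue *q,
						 int group_id)
{
	if (group_id >= 0 && group_id < SKETCH_MAX_GROUPS &&
	    q->group_rl[group_id] != NULL)
		return q->group_rl[group_id];
	return &q->root_rl;
}
```

Because each group has its own pool in this model, one group exhausting its list does not prevent another group from allocating requests.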
2012-06-27 05:05:44 +07:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
|
|
|
struct blkcg_gq *blkg; /* blkg this request pool belongs to */
|
|
|
|
#endif
|
2009-04-06 19:48:01 +07:00
|
|
|
/*
|
|
|
|
* count[], starved[], and wait[] are indexed by
|
|
|
|
* BLK_RW_SYNC/BLK_RW_ASYNC
|
|
|
|
*/
|
2012-06-05 10:40:58 +07:00
|
|
|
int count[2];
|
|
|
|
int starved[2];
|
|
|
|
mempool_t *rq_pool;
|
|
|
|
wait_queue_head_t wait[2];
|
2012-06-05 10:40:59 +07:00
|
|
|
unsigned int flags;
|
2005-04-17 05:20:36 +07:00
|
|
|
};
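One way to read the `flags` field together with the two `BLK_RL_*FULL` bits is as a per-direction "pool is full" marker. The user-space sketch below illustrates that reading; the `sketch_` names are hypothetical and the helpers are not presented as the kernel's own accessors.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit layout as BLK_RL_SYNCFULL / BLK_RL_ASYNCFULL above. */
#define SKETCH_RL_SYNCFULL  (1U << 0)
#define SKETCH_RL_ASYNCFULL (1U << 1)

/* Hypothetical, reduced stand-in for struct request_list. */
struct sketch_rl {
	int count[2];          /* per-direction allocation counts */
	unsigned int flags;    /* SKETCH_RL_*FULL bits */
};

/* Select the flag bit that belongs to the given direction. */
static unsigned int sketch_full_flag(bool sync)
{
	return sync ? SKETCH_RL_SYNCFULL : SKETCH_RL_ASYNCFULL;
}

/* Report whether the pool for this direction is marked full. */
static bool sketch_rl_full(const struct sketch_rl *rl, bool sync)
{
	return (rl->flags & sketch_full_flag(sync)) != 0;
}

/* Mark the pool for this direction as full. */
static void sketch_rl_set_full(struct sketch_rl *rl, bool sync)
{
	rl->flags |= sketch_full_flag(sync);
}

/* Clear the full marker for this direction. */
static void sketch_rl_clear_full(struct sketch_rl *rl, bool sync)
{
	rl->flags &= ~sketch_full_flag(sync);
}
```

Keeping one bit per direction lets sync and async fullness be tracked independently.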
|
|
|
|
|
2006-08-10 13:44:47 +07:00
|
|
|
/*
|
|
|
|
* request command types
|
|
|
|
*/
|
|
|
|
enum rq_cmd_type_bits {
|
|
|
|
REQ_TYPE_FS = 1, /* fs request */
|
|
|
|
REQ_TYPE_BLOCK_PC, /* scsi command */
|
|
|
|
REQ_TYPE_SENSE, /* sense request */
|
|
|
|
REQ_TYPE_PM_SUSPEND, /* suspend request */
|
|
|
|
REQ_TYPE_PM_RESUME, /* resume request */
|
|
|
|
REQ_TYPE_PM_SHUTDOWN, /* shutdown request */
|
|
|
|
REQ_TYPE_SPECIAL, /* driver defined type */
|
|
|
|
/*
|
|
|
|
* for ATA/ATAPI devices. this really doesn't belong here, ide should
|
|
|
|
* use REQ_TYPE_SPECIAL and use rq->cmd[0] with the range of driver
|
|
|
|
* private REQ_LB opcodes to differentiate what type of request this is
|
|
|
|
*/
|
|
|
|
REQ_TYPE_ATA_TASKFILE,
|
2006-10-12 20:08:45 +07:00
|
|
|
REQ_TYPE_ATA_PC,
|
2006-08-10 13:44:47 +07:00
|
|
|
};
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#define BLK_MAX_CDB 16
|
|
|
|
|
|
|
|
/*
|
2014-05-06 17:12:45 +07:00
|
|
|
* Try to put the fields that are referenced together in the same cacheline.
|
|
|
|
*
|
|
|
|
* If you modify this structure, make sure to update blk_rq_init() and
|
|
|
|
* especially blk_mq_rq_ctx_init() to take care of the added fields.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
struct request {
|
2014-01-31 06:45:47 +07:00
|
|
|
struct list_head queuelist;
|
2013-10-24 15:20:05 +07:00
|
|
|
union {
|
|
|
|
struct call_single_data csd;
|
2014-02-24 22:39:52 +07:00
|
|
|
unsigned long fifo_time;
|
2013-10-24 15:20:05 +07:00
|
|
|
};
|
2006-01-09 22:02:34 +07:00
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *q;
|
2013-10-24 15:20:05 +07:00
|
|
|
struct blk_mq_ctx *mq_ctx;
|
2006-08-10 14:00:21 +07:00
|
|
|
|
2013-05-23 17:25:08 +07:00
|
|
|
u64 cmd_flags;
|
2006-08-10 13:44:47 +07:00
|
|
|
enum rq_cmd_type_bits cmd_type;
|
2008-09-14 19:55:09 +07:00
|
|
|
unsigned long atomic_flags;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-03-19 14:58:16 +07:00
|
|
|
int cpu;
|
|
|
|
|
2009-05-07 20:24:44 +07:00
|
|
|
/* the following two fields are internal, NEVER access directly */
|
|
|
|
unsigned int __data_len; /* total data len */
|
2010-03-19 14:58:16 +07:00
|
|
|
sector_t __sector; /* sector cursor */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
struct bio *bio;
|
|
|
|
struct bio *biotail;
|
|
|
|
|
2014-04-10 09:27:01 +07:00
|
|
|
/*
|
|
|
|
* The hash is used inside the scheduler, and killed once the
|
|
|
|
* request reaches the dispatch list. The ipi_list is only used
|
|
|
|
* to queue the request for softirq completion, which is long
|
|
|
|
* after the request has been unhashed (and even removed from
|
|
|
|
* the dispatch list).
|
|
|
|
*/
|
|
|
|
union {
|
|
|
|
struct hlist_node hash; /* merge hash */
|
|
|
|
struct list_head ipi_list;
|
|
|
|
};
|
|
|
|
|
2006-08-10 14:00:21 +07:00
|
|
|
/*
|
|
|
|
* The rb_node is only used inside the io scheduler, requests
|
|
|
|
* are pruned when moved to the dispatch queue. So let the
|
2011-02-11 17:08:00 +07:00
|
|
|
* completion_data share space with the rb_node.
|
2006-08-10 14:00:21 +07:00
|
|
|
*/
|
|
|
|
union {
|
|
|
|
struct rb_node rb_node; /* sort/lookup */
|
2011-02-11 17:08:00 +07:00
|
|
|
void *completion_data;
|
2006-08-10 14:00:21 +07:00
|
|
|
};
|
2006-07-28 14:23:08 +07:00
|
|
|
|
2006-07-12 19:04:37 +07:00
|
|
|
/*
|
2010-04-21 22:44:16 +07:00
|
|
|
* Three pointers are available for the IO schedulers, if they need
|
2011-02-11 17:08:00 +07:00
|
|
|
* more they have to dynamically allocate it. Flush requests are
|
|
|
|
* never put on the IO scheduler. So let the flush fields share
|
2011-12-14 06:33:41 +07:00
|
|
|
* space with the elevator data.
|
2006-07-12 19:04:37 +07:00
|
|
|
*/
|
2011-02-11 17:08:00 +07:00
|
|
|
union {
|
2011-12-14 06:33:41 +07:00
|
|
|
struct {
|
|
|
|
struct io_cq *icq;
|
|
|
|
void *priv[2];
|
|
|
|
} elv;
|
|
|
|
|
2011-02-11 17:08:00 +07:00
|
|
|
struct {
|
|
|
|
unsigned int seq;
|
|
|
|
struct list_head list;
|
block: fix flush machinery for stacking drivers with differring flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
	struct request *rq;
	while (1) {
-		while (!list_empty(&q->queue_head)) {
+		if (!list_empty(&q->queue_head)) {
			rq = list_entry_rq(q->queue_head.next);
-			if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-			    (rq->cmd_flags & REQ_FLUSH_SEQ))
-				return rq;
-			rq = blk_do_flush(q, rq);
-			if (rq)
-				return rq;
+			return rq;
		}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
	unsigned int fflags = q->flush_flags; /* may change, cache it */
	bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
	bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
	bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
		REQ_FUA);
	unsigned skip = 0;
	...
	if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
		rq->cmd_flags &= ~REQ_FLUSH;
		if (!has_fua)
			rq->cmd_flags &= ~REQ_FUA;
		return rq;
	}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution and
makes it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
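The decision logic quoted from `blk_do_flush()` in the message above can be restated as a small, self-contained user-space function. This is a sketch only: the `SKETCH_` flag values and the `sketch_flush_plan` structure are hypothetical, and `has_data` stands in for the `blk_rq_sectors(rq)` check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for REQ_FLUSH and REQ_FUA. */
#define SKETCH_REQ_FLUSH (1U << 0)
#define SKETCH_REQ_FUA   (1U << 1)

/* Result of evaluating one request against the queue's capabilities. */
struct sketch_flush_plan {
	bool do_preflush;        /* issue a cache flush before the data */
	bool do_postflush;       /* emulate FUA with a flush after the data */
	unsigned int cmd_flags;  /* request flags after stripping */
};

/*
 * Decide which flush steps a request needs. When neither step applies to a
 * request carrying data, strip the bits the queue cannot honor so the
 * request can bypass the flush machinery.
 */
static struct sketch_flush_plan sketch_plan_flush(unsigned int cmd_flags,
						  unsigned int queue_flags,
						  bool has_data)
{
	struct sketch_flush_plan plan;
	bool has_flush = (queue_flags & SKETCH_REQ_FLUSH) != 0;
	bool has_fua = (queue_flags & SKETCH_REQ_FUA) != 0;

	plan.do_preflush = has_flush && (cmd_flags & SKETCH_REQ_FLUSH);
	plan.do_postflush = has_flush && !has_fua &&
			    (cmd_flags & SKETCH_REQ_FUA);

	if (has_data && !plan.do_preflush && !plan.do_postflush) {
		cmd_flags &= ~SKETCH_REQ_FLUSH;
		if (!has_fua)
			cmd_flags &= ~SKETCH_REQ_FUA;
	}
	plan.cmd_flags = cmd_flags;
	return plan;
}
```

With `queue_flags == 0`, a data request that arrives with both bits set leaves this function with neither, which is the bypass case the message describes.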
2011-08-16 02:37:25 +07:00
|
|
|
rq_end_io_fn *saved_end_io;
|
2011-02-11 17:08:00 +07:00
|
|
|
} flush;
|
|
|
|
};
|
2006-07-12 19:04:37 +07:00
|
|
|
|
2006-06-13 14:02:34 +07:00
|
|
|
struct gendisk *rq_disk;
|
2011-01-05 22:57:38 +07:00
|
|
|
struct hd_struct *part;
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned long start_time;
|
2010-04-02 05:01:41 +07:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
2012-06-27 05:05:44 +07:00
|
|
|
struct request_list *rl; /* rl this rq is alloced from */
|
2010-04-02 05:01:41 +07:00
|
|
|
unsigned long long start_time_ns;
|
|
|
|
unsigned long long io_start_time_ns; /* when passed to hardware */
|
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
/* Number of scatter-gather DMA addr+len pairs after
|
|
|
|
* physical address coalescing is performed.
|
|
|
|
*/
|
|
|
|
unsigned short nr_phys_segments;
|
2010-09-11 01:50:10 +07:00
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
unsigned short nr_integrity_segments;
|
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-06-13 14:02:34 +07:00
|
|
|
unsigned short ioprio;
|
|
|
|
|
2009-04-23 09:05:20 +07:00
|
|
|
void *special; /* opaque pointer available for LLD use */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-07-28 14:32:07 +07:00
|
|
|
int tag;
|
|
|
|
int errors;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* when request is used as a packet command carrier
|
|
|
|
*/
|
2008-04-29 14:54:39 +07:00
|
|
|
unsigned char __cmd[BLK_MAX_CDB];
|
|
|
|
unsigned char *cmd;
|
2010-03-19 14:58:16 +07:00
|
|
|
unsigned short cmd_len;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-03-04 17:17:11 +07:00
|
|
|
unsigned int extra_len; /* length of alignment and padding */
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned int sense_len;
|
2009-05-07 20:24:37 +07:00
|
|
|
unsigned int resid_len; /* residual count */
|
2005-04-17 05:20:36 +07:00
|
|
|
void *sense;
|
|
|
|
|
2008-09-14 19:55:09 +07:00
|
|
|
unsigned long deadline;
|
|
|
|
struct list_head timeout_list;
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned int timeout;
|
2005-11-11 18:31:37 +07:00
|
|
|
int retries;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
2006-10-01 01:29:12 +07:00
|
|
|
* completion callback.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
|
|
|
rq_end_io_fn *end_io;
|
|
|
|
void *end_io_data;
|
2007-07-16 13:52:14 +07:00
|
|
|
|
|
|
|
/* for bidi */
|
|
|
|
struct request *next_rq;
|
2005-04-17 05:20:36 +07:00
|
|
|
};
|
|
|
|
|
2008-08-14 14:59:13 +07:00
|
|
|
static inline unsigned short req_get_ioprio(struct request *req)
|
|
|
|
{
|
|
|
|
return req->ioprio;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2006-08-10 13:44:47 +07:00
|
|
|
* State information carried for REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME
|
2005-04-17 05:20:36 +07:00
|
|
|
* requests. Some step values could eventually be made generic.
|
|
|
|
*/
|
|
|
|
struct request_pm_state
|
|
|
|
{
|
|
|
|
/* PM state machine step value, currently driver specific */
|
|
|
|
int pm_step;
|
|
|
|
/* requested PM state value (S1, S2, S3, S4, ...) */
|
|
|
|
u32 pm_state;
|
|
|
|
void* data; /* for driver use */
|
|
|
|
};
|
|
|
|
|
|
|
|
#include <linux/elevator.h>
|
|
|
|
|
2013-10-24 15:20:05 +07:00
|
|
|
struct blk_queue_ctx;
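The commit message above describes per-CPU software queues that funnel into some number of hardware submission queues, either 1:1 or N:M. The following is a minimal userspace sketch of one way such a map could be filled. The helper `demo_build_mq_map` and its spread-evenly policy are assumptions made for illustration; they are not taken from the kernel's mapping code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: fill map[cpu] with the index of the hardware queue
 * that this CPU's software queue feeds. With nr_cpus == nr_hw_queues the
 * result is a 1:1 mapping; otherwise CPUs are spread evenly (N:M).
 */
static void demo_build_mq_map(unsigned int *map, unsigned int nr_cpus,
			      unsigned int nr_hw_queues)
{
	unsigned int cpu;

	assert(map != NULL && nr_hw_queues > 0);
	for (cpu = 0; cpu < nr_cpus; cpu++)
		map[cpu] = (unsigned int)
			(((unsigned long long)cpu * nr_hw_queues) / nr_cpus);
}
```

With 8 CPUs and 2 hardware queues, this sketch sends CPUs 0 to 3 to queue 0 and CPUs 4 to 7 to queue 1.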
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
typedef void (request_fn_proc) (struct request_queue *q);
|
2011-09-12 17:12:01 +07:00
|
|
|
typedef void (make_request_fn) (struct request_queue *q, struct bio *bio);
|
2007-07-24 14:28:11 +07:00
|
|
|
typedef int (prep_rq_fn) (struct request_queue *, struct request *);
|
2010-07-01 17:49:17 +07:00
|
|
|
typedef void (unprep_rq_fn) (struct request_queue *, struct request *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
struct bio_vec;
|
2008-07-03 14:53:43 +07:00
|
|
|
struct bvec_merge_data {
|
|
|
|
struct block_device *bi_bdev;
|
|
|
|
sector_t bi_sector;
|
|
|
|
unsigned bi_size;
|
|
|
|
unsigned long bi_rw;
|
|
|
|
};
|
|
|
|
typedef int (merge_bvec_fn) (struct request_queue *, struct bvec_merge_data *,
|
|
|
|
struct bio_vec *);
|
2006-01-09 22:02:34 +07:00
|
|
|
typedef void (softirq_done_fn)(struct request *);
|
2008-02-19 17:36:53 +07:00
|
|
|
typedef int (dma_drain_needed_fn)(struct request *);
|
2008-10-01 21:12:15 +07:00
|
|
|
typedef int (lld_busy_fn) (struct request_queue *q);
|
2011-08-01 03:05:09 +07:00
|
|
|
typedef int (bsg_job_fn) (struct bsg_job *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-09-14 19:55:09 +07:00
|
|
|
enum blk_eh_timer_return {
|
|
|
|
BLK_EH_NOT_HANDLED,
|
|
|
|
BLK_EH_HANDLED,
|
|
|
|
BLK_EH_RESET_TIMER,
|
|
|
|
};
|
|
|
|
|
|
|
|
typedef enum blk_eh_timer_return (rq_timed_out_fn)(struct request *);
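A driver supplies an `rq_timed_out_fn` and tells the block layer what to do next through the return value. The sketch below is a userspace mock: the reduced `struct request`, the `demo_timed_out` policy, and the reading of `BLK_EH_RESET_TIMER` as "give the request more time" are assumptions drawn from the enumerator names, not from this header.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; the real definitions live in the kernel headers. */
enum blk_eh_timer_return {
	BLK_EH_NOT_HANDLED,
	BLK_EH_HANDLED,
	BLK_EH_RESET_TIMER,
};

/* Hypothetical, much-reduced request: only what this sketch needs. */
struct request {
	int hw_still_busy;	/* device says the command is in progress */
	int retries_left;	/* further timer periods we are willing to grant */
};

typedef enum blk_eh_timer_return (rq_timed_out_fn)(struct request *);

/* Hypothetical driver policy: wait a bit longer while retries remain. */
static enum blk_eh_timer_return demo_timed_out(struct request *rq)
{
	assert(rq != NULL);
	if (rq->hw_still_busy && rq->retries_left > 0) {
		rq->retries_left--;
		return BLK_EH_RESET_TIMER;
	}
	return BLK_EH_NOT_HANDLED;
}
```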
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
enum blk_queue_state {
|
|
|
|
Queue_down,
|
|
|
|
Queue_up,
|
|
|
|
};
|
|
|
|
|
|
|
|
struct blk_queue_tag {
|
|
|
|
struct request **tag_index; /* map of busy tags */
|
|
|
|
unsigned long *tag_map; /* bit map of free/busy tags */
|
|
|
|
int busy; /* current depth */
|
|
|
|
int max_depth; /* what we will send to device */
|
2005-08-06 03:28:11 +07:00
|
|
|
int real_max_depth; /* what the array can hold */
|
2005-04-17 05:20:36 +07:00
|
|
|
atomic_t refcnt; /* map can be shared */
|
|
|
|
};
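The `tag_map` bitmap and `max_depth` fields suggest a simple scheme: find a clear bit below the depth limit, set it, and hand its index out as the tag. The sketch below shows that idea in userspace. `struct demo_tag_map` and both helpers are hypothetical names, and the linear scan is a simplification, not the kernel's implementation.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical cut-down tag map: a bitmap plus the depth limit. */
struct demo_tag_map {
	unsigned long *tag_map;	/* bit set = tag busy */
	int busy;		/* tags currently in use */
	int max_depth;		/* number of tags that may be handed out */
};

/* Return the lowest free tag and mark it busy, or -1 if none is free. */
static int demo_tag_alloc(struct demo_tag_map *t)
{
	int tag;

	for (tag = 0; tag < t->max_depth; tag++) {
		unsigned long *word = &t->tag_map[tag / DEMO_BITS_PER_LONG];
		unsigned long mask = 1UL << (tag % DEMO_BITS_PER_LONG);

		if (!(*word & mask)) {
			*word |= mask;
			t->busy++;
			return tag;
		}
	}
	return -1;
}

/* Mark a previously allocated tag free again. */
static void demo_tag_free(struct demo_tag_map *t, int tag)
{
	unsigned long mask;

	assert(tag >= 0 && tag < t->max_depth);
	mask = 1UL << (tag % DEMO_BITS_PER_LONG);
	assert(t->tag_map[tag / DEMO_BITS_PER_LONG] & mask);
	t->tag_map[tag / DEMO_BITS_PER_LONG] &= ~mask;
	t->busy--;
}
```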
|
|
|
|
|
2008-08-16 12:10:05 +07:00
|
|
|
#define BLK_SCSI_MAX_CMDS (256)
|
|
|
|
#define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
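Despite its name, `BLK_SCSI_CMD_PER_LONG` evaluates to the number of `long` words needed to hold one bit per command: 256 / 64 = 4 where `long` is 8 bytes, and 256 / 32 = 8 where it is 4 bytes. The sketch below uses it to size a per-opcode bitmap. `struct demo_cmd_filter` and its helpers are hypothetical and exist only to show the arithmetic.

```c
#include <assert.h>

#define BLK_SCSI_MAX_CMDS	(256)
#define BLK_SCSI_CMD_PER_LONG	(BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))

/* Hypothetical filter: one bit per possible SCSI opcode (0..255). */
struct demo_cmd_filter {
	unsigned long allowed[BLK_SCSI_CMD_PER_LONG];
};

/* Set the bit that belongs to this opcode. */
static void demo_filter_allow(struct demo_cmd_filter *f, unsigned char opcode)
{
	assert(f != 0);
	f->allowed[opcode / (sizeof(long) * 8)] |=
		1UL << (opcode % (sizeof(long) * 8));
}

/* Return 1 if the opcode's bit is set, 0 otherwise. */
static int demo_filter_test(const struct demo_cmd_filter *f,
			    unsigned char opcode)
{
	return (int)((f->allowed[opcode / (sizeof(long) * 8)] >>
		      (opcode % (sizeof(long) * 8))) & 1UL);
}
```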
|
|
|
|
|
2009-05-23 04:17:51 +07:00
|
|
|
struct queue_limits {
|
|
|
|
unsigned long bounce_pfn;
|
|
|
|
unsigned long seg_boundary_mask;
|
|
|
|
|
|
|
|
unsigned int max_hw_sectors;
|
|
|
|
unsigned int max_sectors;
|
|
|
|
unsigned int max_segment_size;
|
2009-05-23 04:17:53 +07:00
|
|
|
unsigned int physical_block_size;
|
|
|
|
unsigned int alignment_offset;
|
|
|
|
unsigned int io_min;
|
|
|
|
unsigned int io_opt;
|
2009-09-30 18:54:20 +07:00
|
|
|
unsigned int max_discard_sectors;
|
2012-09-18 23:19:27 +07:00
|
|
|
unsigned int max_write_same_sectors;
|
2009-11-10 17:50:21 +07:00
|
|
|
unsigned int discard_granularity;
|
|
|
|
unsigned int discard_alignment;
|
2009-05-23 04:17:51 +07:00
|
|
|
|
|
|
|
unsigned short logical_block_size;
|
2010-02-26 12:20:39 +07:00
|
|
|
unsigned short max_segments;
|
2010-09-11 01:50:10 +07:00
|
|
|
unsigned short max_integrity_segments;
|
2009-05-23 04:17:51 +07:00
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
unsigned char misaligned;
|
2009-11-10 17:50:21 +07:00
|
|
|
unsigned char discard_misaligned;
|
2010-12-02 01:41:49 +07:00
|
|
|
unsigned char cluster;
|
2011-05-18 15:37:35 +07:00
|
|
|
unsigned char discard_zeroes_data;
|
2013-07-12 12:39:53 +07:00
|
|
|
unsigned char raid_partial_stripes_expensive;
|
2009-05-23 04:17:51 +07:00
|
|
|
};
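When one block device is stacked on top of others, the top device's limits have to respect every device below it. The following sketch folds a bottom device's limits into the top device's for three of the fields shown above. The `demo_` names and the exact combination rules (smallest non-zero `max_sectors`, largest `io_min` and `logical_block_size`) are simplifying assumptions for illustration, not the kernel's full stacking logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the limits shown above. */
struct demo_limits {
	unsigned int max_sectors;
	unsigned int io_min;
	unsigned short logical_block_size;
};

/* Smaller of two values, treating 0 as "no limit set". */
static unsigned int demo_min_not_zero(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Fold the bottom device's limits into the top (stacked) device's. */
static void demo_stack_limits(struct demo_limits *top,
			      const struct demo_limits *bottom)
{
	assert(top != NULL && bottom != NULL);
	top->max_sectors = demo_min_not_zero(top->max_sectors,
					     bottom->max_sectors);
	if (bottom->io_min > top->io_min)
		top->io_min = bottom->io_min;
	if (bottom->logical_block_size > top->logical_block_size)
		top->logical_block_size = bottom->logical_block_size;
}
```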
|
|
|
|
|
2011-07-14 02:17:23 +07:00
|
|
|
struct request_queue {
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Together with queue_head for cacheline sharing
|
|
|
|
*/
|
|
|
|
struct list_head queue_head;
|
|
|
|
struct request *last_merge;
|
2008-10-31 16:05:07 +07:00
|
|
|
struct elevator_queue *elevator;
|
2012-06-05 10:40:58 +07:00
|
|
|
int nr_rqs[2]; /* # allocated [a]sync rqs */
|
|
|
|
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, putting a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit hairier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description updated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-27 05:05:44 +07:00
|
|
|
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
|
|
|
|
* is used, root blkg allocates from @q->root_rl and all other
|
|
|
|
* blkgs from their own blkg->rl. Which one to use should be
|
|
|
|
* determined using bio_request_list().
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2012-06-27 05:05:44 +07:00
|
|
|
struct request_list root_rl;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
request_fn_proc *request_fn;
|
|
|
|
make_request_fn *make_request_fn;
|
|
|
|
prep_rq_fn *prep_rq_fn;
|
2010-07-01 17:49:17 +07:00
|
|
|
unprep_rq_fn *unprep_rq_fn;
|
2005-04-17 05:20:36 +07:00
|
|
|
merge_bvec_fn *merge_bvec_fn;
|
2006-01-09 22:02:34 +07:00
|
|
|
softirq_done_fn *softirq_done_fn;
|
2008-09-14 19:55:09 +07:00
|
|
|
rq_timed_out_fn *rq_timed_out_fn;
|
2008-02-19 17:36:53 +07:00
|
|
|
dma_drain_needed_fn *dma_drain_needed;
|
2008-10-01 21:12:15 +07:00
|
|
|
lld_busy_fn *lld_busy_fn;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2013-10-24 15:20:05 +07:00
|
|
|
struct blk_mq_ops *mq_ops;
|
|
|
|
|
|
|
|
unsigned int *mq_map;
|
|
|
|
|
|
|
|
/* sw queues */
|
|
|
|
struct blk_mq_ctx *queue_ctx;
|
|
|
|
unsigned int nr_queues;
|
|
|
|
|
|
|
|
/* hw dispatch queues */
|
|
|
|
struct blk_mq_hw_ctx **queue_hw_ctx;
|
|
|
|
unsigned int nr_hw_queues;
|
|
|
|
|
2005-10-20 21:23:44 +07:00
|
|
|
/*
|
|
|
|
* Dispatch queue sorting
|
|
|
|
*/
|
2005-10-20 21:37:00 +07:00
|
|
|
sector_t end_sector;
|
2005-10-20 21:23:44 +07:00
|
|
|
struct request *boundary_rq;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2011-03-02 23:08:00 +07:00
|
|
|
* Delayed queue handling
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2011-03-02 23:08:00 +07:00
|
|
|
struct delayed_work delay_work;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
struct backing_dev_info backing_dev_info;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The queue owner gets to use this for whatever they like.
|
|
|
|
* ll_rw_blk doesn't touch it.
|
|
|
|
*/
|
|
|
|
void *queuedata;
|
|
|
|
|
|
|
|
/*
|
2011-07-14 02:17:23 +07:00
|
|
|
* various queue flags, see QUEUE_* below
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2011-07-14 02:17:23 +07:00
|
|
|
unsigned long queue_flags;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2011-12-14 06:33:37 +07:00
|
|
|
/*
|
|
|
|
* ida allocated id for this queue. Used to index queues from
|
|
|
|
* ioctx.
|
|
|
|
*/
|
|
|
|
int id;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2011-07-14 02:17:23 +07:00
|
|
|
* queue needs bounce pages for pages above this limit
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2011-07-14 02:17:23 +07:00
|
|
|
gfp_t bounce_gfp;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/*
|
2005-04-13 04:22:06 +07:00
|
|
|
* protects queue structures from reentrancy. ->__queue_lock should
|
|
|
|
* _never_ be used directly, it is queue private. always use
|
|
|
|
* ->queue_lock.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2005-04-13 04:22:06 +07:00
|
|
|
spinlock_t __queue_lock;
|
2005-04-17 05:20:36 +07:00
|
|
|
spinlock_t *queue_lock;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* queue kobject
|
|
|
|
*/
|
|
|
|
struct kobject kobj;
|
|
|
|
|
2013-10-24 15:20:05 +07:00
|
|
|
/*
|
|
|
|
* mq queue kobject
|
|
|
|
*/
|
|
|
|
struct kobject mq_kobj;
|
|
|
|
|
2013-03-23 10:42:26 +07:00
|
|
|
#ifdef CONFIG_PM_RUNTIME
|
|
|
|
struct device *dev;
|
|
|
|
int rpm_status;
|
|
|
|
unsigned int nr_pending;
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* queue settings
|
|
|
|
*/
|
|
|
|
unsigned long nr_requests; /* Max # of requests */
|
|
|
|
unsigned int nr_congestion_on;
|
|
|
|
unsigned int nr_congestion_off;
|
|
|
|
unsigned int nr_batching;
|
|
|
|
|
2008-01-11 00:30:36 +07:00
|
|
|
unsigned int dma_drain_size;
|
2011-07-14 02:17:23 +07:00
|
|
|
void *dma_drain_buffer;
|
2008-03-04 17:18:17 +07:00
|
|
|
unsigned int dma_pad_mask;
|
2005-04-17 05:20:36 +07:00
|
|
|
unsigned int dma_alignment;
|
|
|
|
|
|
|
|
struct blk_queue_tag *queue_tags;
|
2007-10-25 15:14:47 +07:00
|
|
|
struct list_head tag_busy_list;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2005-11-10 14:52:05 +07:00
|
|
|
unsigned int nr_sorted;
|
2009-05-20 13:54:31 +07:00
|
|
|
unsigned int in_flight[2];
|
2012-11-28 19:46:45 +07:00
|
|
|
/*
|
|
|
|
* Number of active block driver functions for which blk_drain_queue()
|
|
|
|
* must wait. Must be incremented around functions that unlock the
|
|
|
|
* queue_lock internally, e.g. scsi_request_fn().
|
|
|
|
*/
|
|
|
|
unsigned int request_fn_active;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-09-14 19:55:09 +07:00
|
|
|
unsigned int rq_timeout;
|
|
|
|
struct timer_list timeout;
|
|
|
|
struct list_head timeout_list;
|
|
|
|
|
2011-12-14 06:33:41 +07:00
|
|
|
struct list_head icq_list;
|
2012-03-06 04:15:18 +07:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
2012-04-14 03:11:33 +07:00
|
|
|
DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS);
|
2012-04-17 03:57:25 +07:00
|
|
|
struct blkcg_gq *root_blkg;
|
2012-03-06 04:15:19 +07:00
|
|
|
struct list_head blkg_list;
|
2012-03-06 04:15:18 +07:00
|
|
|
#endif
|
2011-12-14 06:33:41 +07:00
|
|
|
|
2009-05-23 04:17:51 +07:00
|
|
|
struct queue_limits limits;
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* sg stuff
|
|
|
|
*/
|
|
|
|
unsigned int sg_timeout;
|
|
|
|
unsigned int sg_reserved_size;
|
2005-06-23 14:08:19 +07:00
|
|
|
int node;
|
2006-09-29 15:59:40 +07:00
|
|
|
#ifdef CONFIG_BLK_DEV_IO_TRACE
|
2006-03-24 02:00:26 +07:00
|
|
|
struct blk_trace *blk_trace;
|
2006-09-29 15:59:40 +07:00
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2010-09-03 16:56:16 +07:00
|
|
|
* for flush operations
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2010-09-03 16:56:16 +07:00
|
|
|
unsigned int flush_flags;
|
2011-05-07 00:34:32 +07:00
|
|
|
unsigned int flush_not_queueable:1;
|
block: hold queue if flush is running for non-queueable flush drive
In some drives, flush requests are non-queueable. While a flush request is
running, normal read/write requests can't run. If the block layer dispatches
such a request, the driver can't handle it and requeues it. Tejun suggested
holding the queue while a flush is running. This avoids unnecessary
requeues and can also improve performance. For example, take requests
flush1, write1, flush2. flush1 is dispatched, then the queue is
held, so write1 isn't inserted into the queue. After flush1 finishes, flush2
is dispatched. Since the disk cache is already clean, flush2 finishes
very quickly, so it looks as if flush2 is folded into flush1.
In my test, the queue holding completely solves a regression introduced by
commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:
block: make the flush insertion use the tail of the dispatch list
It's not a preempt type request, in fact we have to insert it
behind requests that do specify INSERT_FRONT.
which causes about 20% regression running a sysbench fileio
workload.
Stable: 2.6.39 only
Cc: stable@kernel.org
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-07 00:34:41 +07:00
|
|
|
unsigned int flush_queue_delayed:1;
|
2011-01-25 18:43:54 +07:00
|
|
|
unsigned int flush_pending_idx:1;
|
|
|
|
unsigned int flush_running_idx:1;
|
|
|
|
unsigned long flush_pending_since;
|
|
|
|
struct list_head flush_queue[2];
|
|
|
|
struct list_head flush_data_in_flight;
|
2014-02-10 23:29:00 +07:00
|
|
|
struct request *flush_rq;
|
|
|
|
spinlock_t mq_flush_lock;
|
2006-03-19 06:34:37 +07:00
|
|
|
|
2014-05-28 21:08:02 +07:00
|
|
|
struct list_head requeue_list;
|
|
|
|
spinlock_t requeue_lock;
|
|
|
|
struct work_struct requeue_work;
|
|
|
|
|
2006-03-19 06:34:37 +07:00
|
|
|
struct mutex sysfs_lock;
|
2007-07-09 17:40:35 +07:00
|
|
|
|
2012-03-06 04:14:58 +07:00
|
|
|
int bypass_depth;
|
|
|
|
|
2007-07-09 17:40:35 +07:00
|
|
|
#if defined(CONFIG_BLK_DEV_BSG)
|
2011-08-01 03:05:09 +07:00
|
|
|
bsg_job_fn *bsg_job_fn;
|
|
|
|
int bsg_job_size;
|
2007-07-09 17:40:35 +07:00
|
|
|
struct bsg_class_device bsg_dev;
|
|
|
|
#endif
|
2010-09-16 04:06:35 +07:00
|
|
|
|
|
|
|
#ifdef CONFIG_BLK_DEV_THROTTLING
|
|
|
|
/* Throttle data */
|
|
|
|
struct throtl_data *td;
|
|
|
|
#endif
|
2013-01-09 23:05:13 +07:00
|
|
|
struct rcu_head rcu_head;
|
2013-10-24 15:20:05 +07:00
|
|
|
wait_queue_head_t mq_freeze_wq;
|
|
|
|
struct percpu_counter mq_usage_counter;
|
|
|
|
struct list_head all_q_node;
|
2014-05-14 04:10:52 +07:00
|
|
|
|
|
|
|
struct blk_mq_tag_set *tag_set;
|
|
|
|
struct list_head tag_set_list;
|
2005-04-17 05:20:36 +07:00
|
|
|
};
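The comment on `__queue_lock` says callers must always go through `->queue_lock`. One reading of that rule is that `queue_lock` points either at a lock the driver supplies or at the embedded private lock, so users never need to know which. The sketch below shows that indirection in userspace. `demo_lock_t`, `struct demo_queue`, and `demo_queue_init` are hypothetical stand-ins, and `embedded_lock` plays the role of `__queue_lock`.

```c
#include <assert.h>
#include <stddef.h>

typedef int demo_lock_t;	/* stand-in for spinlock_t in this sketch */

struct demo_queue {
	demo_lock_t embedded_lock;	/* private fallback, like __queue_lock */
	demo_lock_t *queue_lock;	/* the lock everyone should use */
};

/* Use the driver's lock when one is supplied, else the embedded one. */
static void demo_queue_init(struct demo_queue *q, demo_lock_t *driver_lock)
{
	assert(q != NULL);
	q->embedded_lock = 0;
	q->queue_lock = driver_lock ? driver_lock : &q->embedded_lock;
}
```

Code that takes the lock then dereferences `q->queue_lock` in both cases, which is why the embedded lock never needs to be named directly.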
|
|
|
|
|
|
|
|
#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
|
|
|
|
#define QUEUE_FLAG_STOPPED 2 /* queue is stopped */
|
2009-04-06 19:48:01 +07:00
|
|
|
#define QUEUE_FLAG_SYNCFULL 3 /* read queue has been filled */
|
|
|
|
#define QUEUE_FLAG_ASYNCFULL 4 /* write queue has been filled */
|
2012-11-28 19:42:38 +07:00
|
|
|
#define QUEUE_FLAG_DYING 5 /* queue being torn down */
|
2012-03-06 04:14:58 +07:00
|
|
|
#define QUEUE_FLAG_BYPASS 6 /* act as dumb FIFO queue */
|
2011-04-19 18:32:46 +07:00
|
|
|
#define QUEUE_FLAG_BIDI 7 /* queue supports bidi requests */
|
|
|
|
#define QUEUE_FLAG_NOMERGES 8 /* disable merge attempts */
|
2011-07-24 01:44:25 +07:00
|
|
|
#define QUEUE_FLAG_SAME_COMP 9 /* complete on same CPU-group */
|
2011-04-19 18:32:46 +07:00
|
|
|
#define QUEUE_FLAG_FAIL_IO 10 /* fake timeout */
|
|
|
|
#define QUEUE_FLAG_STACKABLE 11 /* supports request stacking */
|
|
|
|
#define QUEUE_FLAG_NONROT 12 /* non-rotational device (SSD) */
|
2008-10-27 16:44:46 +07:00
|
|
|
#define QUEUE_FLAG_VIRT QUEUE_FLAG_NONROT /* paravirt device */
|
2011-04-19 18:32:46 +07:00
|
|
|
#define QUEUE_FLAG_IO_STAT 13 /* do IO stats */
|
|
|
|
#define QUEUE_FLAG_DISCARD 14 /* supports DISCARD */
|
|
|
|
#define QUEUE_FLAG_NOXMERGES 15 /* No extended merges */
|
|
|
|
#define QUEUE_FLAG_ADD_RANDOM 16 /* Contributes to random pool */
|
|
|
|
#define QUEUE_FLAG_SECDISCARD 17 /* supports SECDISCARD */
|
2011-07-24 01:44:25 +07:00
|
|
|
#define QUEUE_FLAG_SAME_FORCE 18 /* force complete on same CPU */
|
2012-12-06 20:32:01 +07:00
|
|
|
#define QUEUE_FLAG_DEAD 19 /* queue tear-down finished */
|
2013-10-24 15:20:05 +07:00
|
|
|
#define QUEUE_FLAG_INIT_DONE 20 /* queue is initialized */
|
2014-05-29 22:53:32 +07:00
|
|
|
#define QUEUE_FLAG_NO_SG_MERGE 21 /* don't attempt to merge SG segments */
|
2009-01-23 16:54:44 +07:00
|
|
|
|
|
|
|
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
|
2009-09-04 01:06:47 +07:00
|
|
|
(1 << QUEUE_FLAG_STACKABLE) | \
|
2010-06-09 15:42:09 +07:00
|
|
|
(1 << QUEUE_FLAG_SAME_COMP) | \
|
|
|
|
(1 << QUEUE_FLAG_ADD_RANDOM))
|
2006-01-06 15:51:03 +07:00
|
|
|
|
2013-11-19 23:25:07 +07:00
|
|
|
#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
|
|
|
|
(1 << QUEUE_FLAG_SAME_COMP))
|
|
|
|
|
2012-03-30 17:33:28 +07:00
|
|
|
static inline void queue_lockdep_assert_held(struct request_queue *q)
|
2008-04-30 00:16:38 +07:00
|
|
|
{
|
2012-03-30 17:33:28 +07:00
|
|
|
if (q->queue_lock)
|
|
|
|
lockdep_assert_held(q->queue_lock);
|
2008-04-30 00:16:38 +07:00
|
|
|
}
|
|
|
|
|
2008-04-29 19:48:33 +07:00
|
|
|
static inline void queue_flag_set_unlocked(unsigned int flag,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
__set_bit(flag, &q->queue_flags);
|
|
|
|
}
|
|
|
|
|
2008-07-03 18:18:54 +07:00
|
|
|
static inline int queue_flag_test_and_clear(unsigned int flag,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
2012-03-30 17:33:28 +07:00
|
|
|
queue_lockdep_assert_held(q);
|
2008-07-03 18:18:54 +07:00
|
|
|
|
|
|
|
if (test_bit(flag, &q->queue_flags)) {
|
|
|
|
__clear_bit(flag, &q->queue_flags);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int queue_flag_test_and_set(unsigned int flag,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
2012-03-30 17:33:28 +07:00
|
|
|
queue_lockdep_assert_held(q);
|
2008-07-03 18:18:54 +07:00
|
|
|
|
|
|
|
if (!test_bit(flag, &q->queue_flags)) {
|
|
|
|
__set_bit(flag, &q->queue_flags);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
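The return-value convention of the two test-and-modify helpers above is easy to mix up: both return the previous state of the bit. The following is a minimal user-space sketch of that convention, not the kernel implementation. `struct demo_queue` and the `demo_` functions are hypothetical stand-ins for `q->queue_flags` and the helpers, and there is no locking or lockdep check here.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the queue_flags word; not a kernel type. */
struct demo_queue {
	unsigned long flags;
};

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Returns the previous value of the bit (0 or 1) and leaves it set. */
static int demo_flag_test_and_set(unsigned int flag, struct demo_queue *q)
{
	unsigned long mask;

	assert(flag < DEMO_BITS_PER_LONG);
	mask = 1UL << flag;

	if (!(q->flags & mask)) {
		q->flags |= mask;	/* non-atomic, like __set_bit() */
		return 0;
	}
	return 1;
}

/* Returns the previous value of the bit (0 or 1) and leaves it clear. */
static int demo_flag_test_and_clear(unsigned int flag, struct demo_queue *q)
{
	unsigned long mask;

	assert(flag < DEMO_BITS_PER_LONG);
	mask = 1UL << flag;

	if (q->flags & mask) {
		q->flags &= ~mask;	/* non-atomic, like __clear_bit() */
		return 1;
	}
	return 0;
}
```

Because the read and the write are separate steps, the sketch is only safe when the caller serializes access, which is what the lock assertion in the helpers above is there to enforce.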
|
|
|
|
|
2008-04-29 19:48:33 +07:00
|
|
|
static inline void queue_flag_set(unsigned int flag, struct request_queue *q)
|
|
|
|
{
|
2012-03-30 17:33:28 +07:00
|
|
|
queue_lockdep_assert_held(q);
|
2008-04-29 19:48:33 +07:00
|
|
|
__set_bit(flag, &q->queue_flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void queue_flag_clear_unlocked(unsigned int flag,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
__clear_bit(flag, &q->queue_flags);
|
|
|
|
}
|
|
|
|
|
2009-05-20 13:54:31 +07:00
|
|
|
static inline int queue_in_flight(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->in_flight[0] + q->in_flight[1];
|
|
|
|
}
|
|
|
|
|
2008-04-29 19:48:33 +07:00
|
|
|
static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
|
|
|
|
{
|
2012-03-30 17:33:28 +07:00
|
|
|
queue_lockdep_assert_held(q);
|
2008-04-29 19:48:33 +07:00
|
|
|
__clear_bit(flag, &q->queue_flags);
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
|
|
|
|
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
|
2012-11-28 19:42:38 +07:00
|
|
|
#define blk_queue_dying(q) test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
|
2012-12-06 20:32:01 +07:00
|
|
|
#define blk_queue_dead(q) test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)
|
2012-03-06 04:14:58 +07:00
|
|
|
#define blk_queue_bypass(q) test_bit(QUEUE_FLAG_BYPASS, &(q)->queue_flags)
|
2013-10-24 15:20:05 +07:00
|
|
|
#define blk_queue_init_done(q) test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags)
|
2008-04-29 19:44:19 +07:00
|
|
|
#define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
|
2010-01-29 15:04:08 +07:00
|
|
|
#define blk_queue_noxmerges(q) \
|
|
|
|
test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
|
2008-09-24 18:03:33 +07:00
|
|
|
#define blk_queue_nonrot(q) test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
|
2009-01-23 16:54:44 +07:00
|
|
|
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
|
2010-06-09 15:42:09 +07:00
|
|
|
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
|
block: add a queue flag for request stacking support
This patch adds a queue flag to indicate the block device can be
used for request stacking.
Request stacking drivers need to stack their devices on top of
only devices of which q->request_fn is functional.
Since bio stacking drivers (e.g. md, loop) basically initialize
their queue using blk_alloc_queue() and don't set q->request_fn,
the check of (q->request_fn == NULL) looks enough for that purpose.
However, dm will become both types of stacking driver (bio-based and
request-based). And dm will always set q->request_fn even if the dm
device is bio-based of which q->request_fn is not functional actually.
So we need something else to distinguish the type of the device.
Adding a queue flag is a solution for that.
The reason why dm always sets q->request_fn is to keep
the compatibility of dm user-space tools.
Currently, all dm user-space tools are using bio-based dm without
specifying the type of the dm device they use.
To use request-based dm without changing such tools, the kernel
must decide the type of the dm device automatically.
The automatic type decision can't be done at the device creation time
and needs to be deferred until such tools load a mapping table,
since the actual type is decided by dm target type included in
the mapping table.
So a dm device has to be initialized using blk_init_queue()
so that we can load either type of table.
Then all queue members are set (e.g. q->request_fn) and nothing
distinguishes a bio-based device from a request-based one,
even after a table is loaded and the type of the device is decided.
By the way, some parts of the queue (e.g. request_list, elevator)
are unnecessary when the dm device is used as bio-based.
But the memory size is not so large (about 20[KB] per queue on ia64),
so I hope the memory loss can be acceptable for bio-based dm users.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-09-18 21:46:13 +07:00
|
|
|
#define blk_queue_stackable(q) \
|
|
|
|
test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)
|
2009-09-30 18:52:12 +07:00
|
|
|
#define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
|
2010-08-12 04:17:49 +07:00
|
|
|
#define blk_queue_secdiscard(q) (blk_queue_discard(q) && \
|
|
|
|
test_bit(QUEUE_FLAG_SECDISCARD, &(q)->queue_flags))
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-08-07 23:17:56 +07:00
|
|
|
#define blk_noretry_request(rq) \
|
|
|
|
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
|
|
|
|
REQ_FAILFAST_DRIVER))
|
|
|
|
|
|
|
|
#define blk_account_rq(rq) \
|
|
|
|
(((rq)->cmd_flags & REQ_STARTED) && \
|
2012-09-18 23:19:25 +07:00
|
|
|
((rq)->cmd_type == REQ_TYPE_FS))
|
2010-08-07 23:17:56 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#define blk_pm_request(rq) \
|
2010-08-07 23:17:56 +07:00
|
|
|
((rq)->cmd_type == REQ_TYPE_PM_SUSPEND || \
|
|
|
|
(rq)->cmd_type == REQ_TYPE_PM_RESUME)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2008-08-26 15:25:02 +07:00
|
|
|
#define blk_rq_cpu_valid(rq) ((rq)->cpu != -1)
|
2007-07-16 13:52:14 +07:00
|
|
|
#define blk_bidi_rq(rq) ((rq)->next_rq != NULL)
|
2007-12-12 05:40:30 +07:00
|
|
|
/* rq->queuelist of dequeued request must be list_empty() */
|
|
|
|
#define blk_queued_rq(rq) (!list_empty(&(rq)->queuelist))
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
#define list_entry_rq(ptr) list_entry((ptr), struct request, queuelist)
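list_entry_rq() recovers the enclosing request from a pointer to its embedded list node. A minimal user-space sketch of that pointer arithmetic follows, assuming the usual container-of technique built on `offsetof`. All `demo_` types and macros are hypothetical and much simpler than the kernel's `struct request` and `list_entry()`.

```c
#include <stddef.h>

/* Hypothetical miniature of a doubly linked list node. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Hypothetical request with an embedded list node that is not the first member. */
struct demo_request {
	int tag;
	struct demo_list_head queuelist;
};

/* Step back from the member address to the start of the enclosing struct. */
#define demo_list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define demo_list_entry_rq(ptr) \
	demo_list_entry((ptr), struct demo_request, queuelist)
```

Given `struct demo_request rq`, `demo_list_entry_rq(&rq.queuelist)` yields `&rq` regardless of where `queuelist` sits inside the structure.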
|
|
|
|
|
2013-05-23 17:25:08 +07:00
|
|
|
#define rq_data_dir(rq) (((rq)->cmd_flags & 1) != 0)
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-12-02 01:41:49 +07:00
|
|
|
static inline unsigned int blk_queue_cluster(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.cluster;
|
|
|
|
}
|
|
|
|
|
2006-07-28 14:26:13 +07:00
|
|
|
/*
|
2009-04-06 19:48:01 +07:00
|
|
|
* We regard a request as sync if it is either a read or a sync write
|
2006-07-28 14:26:13 +07:00
|
|
|
*/
|
2009-04-06 19:48:01 +07:00
|
|
|
static inline bool rw_is_sync(unsigned int rw_flags)
|
|
|
|
{
|
2010-08-07 23:20:39 +07:00
|
|
|
return !(rw_flags & REQ_WRITE) || (rw_flags & REQ_SYNC);
|
2009-04-06 19:48:01 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool rq_is_sync(struct request *rq)
|
|
|
|
{
|
|
|
|
return rw_is_sync(rq->cmd_flags);
|
|
|
|
}
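The sync rule above can be read as a small truth table: every read is sync, and a write is sync only when it carries the sync flag. A minimal user-space sketch follows. `DEMO_REQ_WRITE` uses bit 0, matching the `cmd_flags & 1` test in rq_data_dir() above, while the bit chosen for `DEMO_REQ_SYNC` is purely an assumption for illustration.

```c
#include <stdbool.h>

#define DEMO_REQ_WRITE	(1u << 0)	/* bit 0, as tested by rq_data_dir() */
#define DEMO_REQ_SYNC	(1u << 4)	/* assumed bit position, illustration only */

/* Sync if it is not a write (i.e. a read), or if the sync flag is set. */
static bool demo_rw_is_sync(unsigned int rw_flags)
{
	return !(rw_flags & DEMO_REQ_WRITE) || (rw_flags & DEMO_REQ_SYNC);
}
```

A plain write (`DEMO_REQ_WRITE` alone) is the only combination of these two flags that is treated as async.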
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
static inline bool blk_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
return rl->flags & flag;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
static inline void blk_set_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
rl->flags |= flag;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2012-06-05 10:40:59 +07:00
|
|
|
static inline void blk_clear_rl_full(struct request_list *rl, bool sync)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2012-06-05 10:40:59 +07:00
|
|
|
unsigned int flag = sync ? BLK_RL_SYNCFULL : BLK_RL_ASYNCFULL;
|
|
|
|
|
|
|
|
rl->flags &= ~flag;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2012-09-18 23:19:25 +07:00
|
|
|
static inline bool rq_mergeable(struct request *rq)
|
|
|
|
{
|
|
|
|
if (rq->cmd_type != REQ_TYPE_FS)
|
|
|
|
return false;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-09-18 23:19:25 +07:00
|
|
|
if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2012-09-18 23:19:26 +07:00
|
|
|
static inline bool blk_check_merge_flags(unsigned int flags1,
|
|
|
|
unsigned int flags2)
|
|
|
|
{
|
|
|
|
if ((flags1 & REQ_DISCARD) != (flags2 & REQ_DISCARD))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if ((flags1 & REQ_SECURE) != (flags2 & REQ_SECURE))
|
|
|
|
return false;
|
|
|
|
|
2012-09-18 23:19:27 +07:00
|
|
|
if ((flags1 & REQ_WRITE_SAME) != (flags2 & REQ_WRITE_SAME))
|
|
|
|
return false;
|
|
|
|
|
2012-09-18 23:19:26 +07:00
|
|
|
return true;
|
|
|
|
}
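blk_check_merge_flags() above requires two requests to agree on each of three flags before they may be merged. The same condition can be written as a single masked XOR, which is a different formulation from the one used above and is shown only to make the rule explicit. The flag values and `demo_` names below are hypothetical, not the kernel's `REQ_*` bit assignments.

```c
#include <stdbool.h>

/* Hypothetical bit assignments, for illustration only. */
#define DEMO_REQ_DISCARD	(1u << 0)
#define DEMO_REQ_SECURE		(1u << 1)
#define DEMO_REQ_WRITE_SAME	(1u << 2)
#define DEMO_REQ_OTHER		(1u << 3)	/* a flag that does not affect merging */

#define DEMO_MERGE_MASK \
	(DEMO_REQ_DISCARD | DEMO_REQ_SECURE | DEMO_REQ_WRITE_SAME)

/* Mergeable only if the two flag words agree on every bit in the mask. */
static bool demo_check_merge_flags(unsigned int flags1, unsigned int flags2)
{
	return ((flags1 ^ flags2) & DEMO_MERGE_MASK) == 0;
}
```

Bits outside the mask, such as `DEMO_REQ_OTHER`, may differ freely without blocking a merge in this sketch.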
|
|
|
|
|
2012-09-18 23:19:27 +07:00
|
|
|
static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
|
|
|
|
{
|
|
|
|
if (bio_data(a) == bio_data(b))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* q->prep_rq_fn return values
|
|
|
|
*/
|
|
|
|
#define BLKPREP_OK 0 /* serve it */
|
|
|
|
#define BLKPREP_KILL 1 /* fatal error, kill */
|
|
|
|
#define BLKPREP_DEFER 2 /* leave on queue */
|
|
|
|
|
|
|
|
extern unsigned long blk_max_low_pfn, blk_max_pfn;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* standard bounce addresses:
|
|
|
|
*
|
|
|
|
* BLK_BOUNCE_HIGH : bounce all highmem pages
|
|
|
|
* BLK_BOUNCE_ANY : don't bounce anything
|
|
|
|
* BLK_BOUNCE_ISA : bounce pages above ISA DMA boundary
|
|
|
|
*/
|
2008-04-21 14:51:05 +07:00
|
|
|
|
|
|
|
#if BITS_PER_LONG == 32
|
2005-04-17 05:20:36 +07:00
|
|
|
#define BLK_BOUNCE_HIGH ((u64)blk_max_low_pfn << PAGE_SHIFT)
|
2008-04-21 14:51:05 +07:00
|
|
|
#else
|
|
|
|
#define BLK_BOUNCE_HIGH (-1ULL)
|
|
|
|
#endif
|
|
|
|
#define BLK_BOUNCE_ANY (-1ULL)
|
2010-05-31 13:59:03 +07:00
|
|
|
#define BLK_BOUNCE_ISA (DMA_BIT_MASK(24))
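To make the bounce limits above concrete: BLK_BOUNCE_ISA is a 24-bit mask, so it covers addresses up to 16 MiB minus one byte, while BLK_BOUNCE_ANY is all ones and therefore never triggers bouncing. The following is a user-space sketch, assuming DMA_BIT_MASK(n) sets the low n bits. `demo_needs_bounce()` is a hypothetical, simplified check on raw addresses and is not how the block layer decides to bounce.

```c
#include <stdbool.h>

/* Low n bits set; n == 64 is special-cased to avoid an out-of-range shift. */
#define DEMO_DMA_BIT_MASK(n) \
	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

#define DEMO_BOUNCE_ANY	(~0ULL)			/* no address is above this limit */
#define DEMO_BOUNCE_ISA	(DEMO_DMA_BIT_MASK(24))	/* 16 MiB - 1 */

/* Hypothetical helper: a buffer needs bouncing if it lies above the limit. */
static bool demo_needs_bounce(unsigned long long addr, unsigned long long limit)
{
	return addr > limit;
}
```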
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-07-09 17:38:05 +07:00
|
|
|
/*
|
|
|
|
* default timeout for SG_IO if none specified
|
|
|
|
*/
|
|
|
|
#define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
|
2008-12-06 05:49:18 +07:00
|
|
|
#define BLK_MIN_SG_TIMEOUT (7 * HZ)
|
2007-07-09 17:38:05 +07:00
|
|
|
|
2007-07-17 18:03:37 +07:00
|
|
|
#ifdef CONFIG_BOUNCE
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int init_emergency_isa_pool(void);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
|
2005-04-17 05:20:36 +07:00
|
|
|
#else
|
|
|
|
static inline int init_emergency_isa_pool(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2007-07-24 14:28:11 +07:00
|
|
|
static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_BOUNCE */
|
|
|
|
|
2008-08-28 14:17:06 +07:00
|
|
|
struct rq_map_data {
|
|
|
|
struct page **pages;
|
|
|
|
int page_order;
|
|
|
|
int nr_entries;
|
2008-12-18 12:49:37 +07:00
|
|
|
unsigned long offset;
|
2008-12-18 12:49:38 +07:00
|
|
|
int null_mapped;
|
2009-07-09 19:46:53 +07:00
|
|
|
int from_user;
|
2008-08-28 14:17:06 +07:00
|
|
|
};
|
|
|
|
|
2007-09-25 17:35:59 +07:00
|
|
|
struct req_iterator {
|
2013-11-24 08:19:00 +07:00
|
|
|
struct bvec_iter iter;
|
2007-09-25 17:35:59 +07:00
|
|
|
struct bio *bio;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* This should not be used directly - use rq_for_each_segment */
|
2009-02-23 15:03:10 +07:00
|
|
|
#define for_each_bio(_bio) \
|
|
|
|
for (; _bio; _bio = _bio->bi_next)
|
2007-09-25 17:35:59 +07:00
|
|
|
#define __rq_for_each_bio(_bio, rq) \
|
2005-04-17 05:20:36 +07:00
|
|
|
if ((rq)->bio) \
|
|
|
|
for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)
|
|
|
|
|
2007-09-25 17:35:59 +07:00
|
|
|
#define rq_for_each_segment(bvl, _rq, _iter) \
|
|
|
|
__rq_for_each_bio(_iter.bio, _rq) \
|
2013-11-24 08:19:00 +07:00
|
|
|
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
|
2007-09-25 17:35:59 +07:00
|
|
|
|
2013-08-08 04:26:21 +07:00
|
|
|
#define rq_iter_last(bvec, _iter) \
|
2013-11-24 08:19:00 +07:00
|
|
|
(_iter.bio->bi_next == NULL && \
|
2013-08-08 04:26:21 +07:00
|
|
|
bio_iter_last(bvec, _iter.iter))
|
2007-09-25 17:35:59 +07:00
|
|
|
|
2009-11-26 15:16:19 +07:00
|
|
|
#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
|
|
|
|
# error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform"
|
|
|
|
#endif
|
|
|
|
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
|
|
|
|
extern void rq_flush_dcache_pages(struct request *rq);
|
|
|
|
#else
|
|
|
|
static inline void rq_flush_dcache_pages(struct request *rq)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
extern int blk_register_queue(struct gendisk *disk);
|
|
|
|
extern void blk_unregister_queue(struct gendisk *disk);
|
|
|
|
extern void generic_make_request(struct bio *bio);
|
2008-04-29 14:54:36 +07:00
|
|
|
extern void blk_rq_init(struct request_queue *q, struct request *rq);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void blk_put_request(struct request *);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void __blk_put_request(struct request_queue *, struct request *);
|
|
|
|
extern struct request *blk_get_request(struct request_queue *, int, gfp_t);
|
2009-05-17 22:57:15 +07:00
|
|
|
extern struct request *blk_make_request(struct request_queue *, struct bio *,
|
|
|
|
gfp_t);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_requeue_request(struct request_queue *, struct request *);
|
2010-06-18 21:59:42 +07:00
|
|
|
extern void blk_add_request_payload(struct request *rq, struct page *page,
|
|
|
|
unsigned int len);
|
2008-09-18 21:45:38 +07:00
|
|
|
extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
|
2008-10-01 21:12:15 +07:00
|
|
|
extern int blk_lld_busy(struct request_queue *q);
|
block: add request clone interface (v2)
This patch adds the following 2 interfaces for request-stacking drivers:
- blk_rq_prep_clone(struct request *clone, struct request *orig,
struct bio_set *bs, gfp_t gfp_mask,
int (*bio_ctr)(struct bio *, struct bio*, void *),
void *data)
* Clones bios in the original request to the clone request
(bio_ctr is called for each cloned bio.)
* Copies attributes of the original request to the clone request.
The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
copied.
- blk_rq_unprep_clone(struct request *clone)
* Frees cloned bios from the clone request.
Request stacking drivers (e.g. request-based dm) need to make a clone
request for a submitted request and dispatch it to other devices.
To allocate request for the clone, request stacking drivers may not
be able to use blk_get_request() because the allocation may be done
in an irq-disabled context.
So blk_rq_prep_clone() takes a request allocated by the caller
as an argument.
For each clone bio in the clone request, request stacking drivers
should be able to set up their own completion handler.
So blk_rq_prep_clone() takes a callback function which is called
for each clone bio, and a pointer for private data which is passed
to the callback.
NOTE:
blk_rq_prep_clone() doesn't copy any actual data of the original
request. Pages are shared between original bios and cloned bios.
So the caller must not complete the original request before the clone
request.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-11 18:10:16 +07:00
|
|
|
extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
|
|
|
|
struct bio_set *bs, gfp_t gfp_mask,
|
|
|
|
int (*bio_ctr)(struct bio *, struct bio *, void *),
|
|
|
|
void *data);
|
|
|
|
extern void blk_rq_unprep_clone(struct request *rq);
|
2008-09-18 21:45:38 +07:00
|
|
|
extern int blk_insert_cloned_request(struct request_queue *q,
|
|
|
|
struct request *rq);
|
2011-03-02 23:08:00 +07:00
|
|
|
extern void blk_delay_queue(struct request_queue *, unsigned long);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_recount_segments(struct request_queue *, struct bio *);
|
2012-01-12 22:01:28 +07:00
|
|
|
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
|
2012-01-12 22:01:27 +07:00
|
|
|
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
|
|
|
|
unsigned int, void __user *);
|
2007-08-28 02:38:10 +07:00
|
|
|
extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|
|
|
|
unsigned int, void __user *);
|
2008-09-03 04:16:41 +07:00
|
|
|
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|
|
|
|
struct scsi_ioctl_command __user *);
|
2006-10-20 13:28:16 +07:00
|
|
|
|
2011-09-12 17:12:01 +07:00
|
|
|
extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
|
2011-09-12 17:08:27 +07:00
|
|
|
|
2006-10-20 13:28:16 +07:00
|
|
|
/*
|
|
|
|
* A queue has just exited congestion. Note this in the global counter of
|
|
|
|
* congested queues, and wake up anyone who was waiting for requests to be
|
|
|
|
* put back.
|
|
|
|
*/
|
2009-07-09 19:52:32 +07:00
|
|
|
static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
|
2006-10-20 13:28:16 +07:00
|
|
|
{
|
2009-07-09 19:52:32 +07:00
|
|
|
clear_bdi_congested(&q->backing_dev_info, sync);
|
2006-10-20 13:28:16 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A queue has just entered congestion. Flag that in the queue's VM-visible
|
|
|
|
* state flags and increment the global counter of congested queues.
|
|
|
|
*/
|
2009-07-09 19:52:32 +07:00
|
|
|
static inline void blk_set_queue_congested(struct request_queue *q, int sync)
|
2006-10-20 13:28:16 +07:00
|
|
|
{
|
2009-07-09 19:52:32 +07:00
|
|
|
set_bdi_congested(&q->backing_dev_info, sync);
|
2006-10-20 13:28:16 +07:00
|
|
|
}
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_start_queue(struct request_queue *q);
|
|
|
|
extern void blk_stop_queue(struct request_queue *q);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void blk_sync_queue(struct request_queue *q);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void __blk_stop_queue(struct request_queue *q);
|
2011-04-18 16:41:33 +07:00
|
|
|
extern void __blk_run_queue(struct request_queue *q);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_run_queue(struct request_queue *);
|
2011-04-19 18:32:46 +07:00
|
|
|
extern void blk_run_queue_async(struct request_queue *q);
|
2008-08-28 14:17:05 +07:00
|
|
|
extern int blk_rq_map_user(struct request_queue *, struct request *,
|
2008-08-28 14:17:06 +07:00
|
|
|
struct rq_map_data *, void __user *, unsigned long,
|
|
|
|
gfp_t);
|
2006-12-19 17:12:46 +07:00
|
|
|
extern int blk_rq_unmap_user(struct bio *);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
|
|
|
|
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
|
2014-02-09 08:42:52 +07:00
|
|
|
struct rq_map_data *, const struct sg_iovec *,
|
|
|
|
int, unsigned int, gfp_t);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern int blk_execute_rq(struct request_queue *, struct gendisk *,
|
2005-06-20 19:11:09 +07:00
|
|
|
struct request *, int);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
|
2006-01-06 16:00:50 +07:00
|
|
|
struct request *, int, rq_end_io_fn *);
|
2005-11-11 18:30:24 +07:00
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
return bdev->bd_disk->queue;
|
|
|
|
}
|
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
/*
|
2009-07-03 15:48:17 +07:00
|
|
|
* blk_rq_pos() : the current sector
|
|
|
|
* blk_rq_bytes() : bytes left in the entire request
|
|
|
|
* blk_rq_cur_bytes() : bytes left in the current segment
|
|
|
|
* blk_rq_err_bytes() : bytes left till the next error boundary
|
|
|
|
* blk_rq_sectors() : sectors left in the entire request
|
|
|
|
* blk_rq_cur_sectors() : sectors left in the current segment
|
2009-04-23 09:05:18 +07:00
|
|
|
*/
|
2009-05-07 20:24:38 +07:00
|
|
|
static inline sector_t blk_rq_pos(const struct request *rq)
|
|
|
|
{
|
2009-05-07 20:24:44 +07:00
|
|
|
return rq->__sector;
|
2009-05-07 20:24:41 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blk_rq_bytes(const struct request *rq)
|
|
|
|
{
|
2009-05-07 20:24:44 +07:00
|
|
|
return rq->__data_len;
|
2009-05-07 20:24:38 +07:00
|
|
|
}
|
|
|
|
|
2009-05-07 20:24:41 +07:00
|
|
|
static inline int blk_rq_cur_bytes(const struct request *rq)
|
|
|
|
{
|
|
|
|
return rq->bio ? bio_cur_bytes(rq->bio) : 0;
|
|
|
|
}
|
2009-04-23 09:05:18 +07:00
|
|
|
|
2009-07-03 15:48:17 +07:00
|
|
|
extern unsigned int blk_rq_err_bytes(const struct request *rq);
|
|
|
|
|
2009-05-07 20:24:38 +07:00
|
|
|
static inline unsigned int blk_rq_sectors(const struct request *rq)
|
|
|
|
{
|
2009-05-07 20:24:41 +07:00
|
|
|
return blk_rq_bytes(rq) >> 9;
|
2009-05-07 20:24:38 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
|
|
|
|
{
|
2009-05-07 20:24:41 +07:00
|
|
|
return blk_rq_cur_bytes(rq) >> 9;
|
2009-05-07 20:24:38 +07:00
|
|
|
}
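The two sector helpers above convert byte counts with `>> 9`, i.e. they assume 512-byte sectors and round down. A minimal user-space sketch of that conversion, with hypothetical `demo_` names:

```c
#define DEMO_SECTOR_SHIFT	9
#define DEMO_SECTOR_SIZE	(1u << DEMO_SECTOR_SHIFT)	/* 512 bytes */

/* Whole 512-byte sectors contained in a byte count; any remainder is dropped. */
static unsigned int demo_bytes_to_sectors(unsigned int bytes)
{
	return bytes >> DEMO_SECTOR_SHIFT;
}
```

For example, 4096 bytes map to 8 sectors, while a residual of fewer than 512 bytes maps to 0.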
|
|
|
|
|
2012-09-18 23:19:26 +07:00
|
|
|
static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
|
|
|
|
unsigned int cmd_flags)
|
|
|
|
{
|
|
|
|
if (unlikely(cmd_flags & REQ_DISCARD))
|
block: fix max discard sectors limit
linux-v3.8-rc1 and later support plugging for blkdev_issue_discard with
commit 0cfbcafcae8b7364b5fa96c2b26ccde7a3a296a9
(block: add plug for blkdev_issue_discard)
For example,
1) DISCARD rq-1 with size 4GB
2) DISCARD rq-2 with size 1GB
If these 2 discard requests get merged, final request size will be 5GB.
In this case, request's __data_len field may overflow as it can store
max 4GB(unsigned int).
This issue was observed while doing mkfs.f2fs on 5GB SD card:
https://lkml.org/lkml/2013/4/1/292
Info: sector size = 512
Info: total sectors = 11370496 (in 512bytes)
Info: zone aligned segment0 blkaddr: 512
[ 257.789764] blk_update_request: bio idx 0 >= vcnt 0
mkfs process gets stuck in D state and I see the following in the dmesg:
[ 257.789733] __end_that: dev mmcblk0: type=1, flags=122c8081
[ 257.789764] sector 4194304, nr/cnr 2981888/4294959104
[ 257.789764] bio df3840c0, biotail df3848c0, buffer (null), len
1526726656
[ 257.789764] blk_update_request: bio idx 0 >= vcnt 0
[ 257.794921] request botched: dev mmcblk0: type=1, flags=122c8081
[ 257.794921] sector 4194304, nr/cnr 2981888/4294959104
[ 257.794921] bio df3840c0, biotail df3848c0, buffer (null), len
1526726656
This patch fixes this issue.
Reported-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Tested-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-24 21:52:50 +07:00
|
|
|
return min(q->limits.max_discard_sectors, UINT_MAX >> 9);
|
2012-09-18 23:19:26 +07:00
|
|
|
|
2012-09-18 23:19:27 +07:00
|
|
|
if (unlikely(cmd_flags & REQ_WRITE_SAME))
|
|
|
|
return q->limits.max_write_same_sectors;
|
|
|
|
|
2012-09-18 23:19:26 +07:00
|
|
|
return q->limits.max_sectors;
|
|
|
|
}
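The `UINT_MAX >> 9` clamp in the discard branch above follows from the commit message's arithmetic: a request length kept in a 32-bit byte counter wraps once merged discards exceed 4 GiB. The following is a user-space sketch of both the wrap and the clamp, assuming a 32-bit length field (modelled here with `uint32_t`). The `demo_` helpers are hypothetical.

```c
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9

/* Byte length as it would be stored in a 32-bit field: high bits are lost. */
static uint32_t demo_sectors_to_len32(uint64_t sectors)
{
	return (uint32_t)(sectors << DEMO_SECTOR_SHIFT);
}

/* Clamp a sector limit so that its byte length still fits in 32 bits. */
static uint32_t demo_max_discard_sectors(uint32_t limit)
{
	uint32_t cap = UINT32_MAX >> DEMO_SECTOR_SHIFT;

	return limit < cap ? limit : cap;
}
```

In this model, merging a 4 GiB and a 1 GiB discard (10485760 sectors in total) would be stored as a length of only 1 GiB, whereas a request capped at `UINT32_MAX >> 9` sectors always has a byte length that fits.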
|
|
|
|
|
|
|
|
static inline unsigned int blk_rq_get_max_sectors(struct request *rq)
|
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
|
|
|
if (unlikely(rq->cmd_type == REQ_TYPE_BLOCK_PC))
|
|
|
|
return q->limits.max_hw_sectors;
|
|
|
|
|
|
|
|
return blk_queue_get_max_sectors(q, rq->cmd_flags);
|
|
|
|
}
|
|
|
|
|
2013-09-22 02:57:47 +07:00
|
|
|
static inline unsigned int blk_rq_count_bios(struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned int nr_bios = 0;
|
|
|
|
struct bio *bio;
|
|
|
|
|
|
|
|
__rq_for_each_bio(bio, rq)
|
|
|
|
nr_bios++;
|
|
|
|
|
|
|
|
return nr_bios;
|
|
|
|
}
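blk_rq_count_bios() above is a plain walk over the request's singly linked bio chain. A minimal user-space sketch of the same traversal follows. `struct demo_bio`, `struct demo_request` and the iteration macro are hypothetical miniatures. The sketch omits the outer `if` guard because the `for` condition already handles an empty chain.

```c
#include <stddef.h>

/* Hypothetical miniature of a bio: only the chain pointer is modelled. */
struct demo_bio {
	struct demo_bio *next;
};

/* Hypothetical miniature of a request: only the head of the chain. */
struct demo_request {
	struct demo_bio *bio;
};

/* Visit every bio in the chain; the body runs zero times for an empty chain. */
#define demo_rq_for_each_bio(_bio, rq) \
	for (_bio = (rq)->bio; _bio; _bio = _bio->next)

static unsigned int demo_rq_count_bios(struct demo_request *rq)
{
	unsigned int nr_bios = 0;
	struct demo_bio *bio;

	demo_rq_for_each_bio(bio, rq)
		nr_bios++;

	return nr_bios;
}
```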
|
|
|
|
|
2009-05-08 09:54:16 +07:00
|
|
|
/*
|
|
|
|
* Request issue related functions.
|
|
|
|
*/
|
|
|
|
extern struct request *blk_peek_request(struct request_queue *q);
|
|
|
|
extern void blk_start_request(struct request *rq);
|
|
|
|
extern struct request *blk_fetch_request(struct request_queue *q);
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
2009-04-23 09:05:18 +07:00
|
|
|
* Request completion related functions.
|
|
|
|
*
|
|
|
|
* blk_update_request() completes given number of bytes and updates
|
|
|
|
* the request without completing it.
|
|
|
|
*
|
2009-04-23 09:05:19 +07:00
|
|
|
* blk_end_request() and friends. __blk_end_request() must be called
|
|
|
|
* with the request queue spinlock acquired.
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Several drivers define their own end_request and call
|
2007-12-12 05:52:28 +07:00
|
|
|
* blk_end_request() for parts of the original function.
|
|
|
|
* This prevents code duplication in drivers.
|
2005-04-17 05:20:36 +07:00
|
|
|
*/
|
2009-04-23 09:05:18 +07:00
|
|
|
extern bool blk_update_request(struct request *rq, int error,
|
|
|
|
unsigned int nr_bytes);
|
2014-04-16 14:44:59 +07:00
|
|
|
extern void blk_finish_request(struct request *rq, int error);
|
2009-05-11 15:56:09 +07:00
|
|
|
extern bool blk_end_request(struct request *rq, int error,
|
|
|
|
unsigned int nr_bytes);
|
|
|
|
extern void blk_end_request_all(struct request *rq, int error);
|
|
|
|
extern bool blk_end_request_cur(struct request *rq, int error);
|
2009-07-03 15:48:17 +07:00
|
|
|
extern bool blk_end_request_err(struct request *rq, int error);
|
2009-05-11 15:56:09 +07:00
|
|
|
extern bool __blk_end_request(struct request *rq, int error,
|
|
|
|
unsigned int nr_bytes);
|
|
|
|
extern void __blk_end_request_all(struct request *rq, int error);
|
|
|
|
extern bool __blk_end_request_cur(struct request *rq, int error);
|
2009-07-03 15:48:17 +07:00
|
|
|
extern bool __blk_end_request_err(struct request *rq, int error);
|
2009-04-23 09:05:18 +07:00
|
|
|
|
2006-01-09 22:02:34 +07:00
|
|
|
extern void blk_complete_request(struct request *);
|
2008-09-14 19:55:09 +07:00
|
|
|
extern void __blk_complete_request(struct request *);
|
|
|
|
extern void blk_abort_request(struct request *);
|
2010-07-01 17:49:17 +07:00
|
|
|
extern void blk_unprep_request(struct request *);
|
2006-01-09 22:02:34 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* Access functions for manipulating queue properties
|
|
|
|
*/
|
2007-07-24 14:28:11 +07:00
|
|
|
extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
|
2005-06-23 14:08:19 +07:00
|
|
|
spinlock_t *lock, int node_id);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
|
2010-05-11 13:57:42 +07:00
|
|
|
extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
|
|
|
|
request_fn_proc *, spinlock_t *);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_cleanup_queue(struct request_queue *);
|
|
|
|
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
|
|
|
|
extern void blk_queue_bounce_limit(struct request_queue *, u64);
|
2010-12-17 14:34:20 +07:00
|
|
|
extern void blk_limits_max_hw_sectors(struct queue_limits *, unsigned int);
|
2010-02-26 12:20:38 +07:00
|
|
|
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
|
2010-02-26 12:20:39 +07:00
|
|
|
extern void blk_queue_max_segments(struct request_queue *, unsigned short);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
|
2009-09-30 18:54:20 +07:00
|
|
|
extern void blk_queue_max_discard_sectors(struct request_queue *q,
|
|
|
|
unsigned int max_discard_sectors);
|
2012-09-18 23:19:27 +07:00
|
|
|
extern void blk_queue_max_write_same_sectors(struct request_queue *q,
|
|
|
|
unsigned int max_write_same_sectors);
|
2009-05-23 04:17:49 +07:00
|
|
|
extern void blk_queue_logical_block_size(struct request_queue *, unsigned short);
|
2010-10-14 02:18:03 +07:00
|
|
|
extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern void blk_queue_alignment_offset(struct request_queue *q,
|
|
|
|
unsigned int alignment);
|
2009-07-31 22:49:11 +07:00
|
|
|
extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
|
2009-09-12 02:54:52 +07:00
|
|
|
extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
|
2009-06-16 13:23:52 +07:00
|
|
|
extern void blk_set_default_limits(struct queue_limits *lim);
|
2012-01-11 22:27:11 +07:00
|
|
|
extern void blk_set_stacking_limits(struct queue_limits *lim);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
|
|
|
|
sector_t offset);
|
2010-01-11 15:21:49 +07:00
|
|
|
extern int bdev_stack_limits(struct queue_limits *t, struct block_device *bdev,
|
|
|
|
sector_t offset);
|
2009-05-23 04:17:53 +07:00
|
|
|
extern void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
|
|
|
|
sector_t offset);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b);
|
2008-03-04 17:18:17 +07:00
|
|
|
extern void blk_queue_dma_pad(struct request_queue *, unsigned int);
|
2008-07-04 14:30:03 +07:00
|
|
|
extern void blk_queue_update_dma_pad(struct request_queue *, unsigned int);
|
2008-02-19 17:36:53 +07:00
|
|
|
extern int blk_queue_dma_drain(struct request_queue *q,
|
|
|
|
dma_drain_needed_fn *dma_drain_needed,
|
|
|
|
void *buf, unsigned int size);
|
2008-10-01 21:12:15 +07:00
|
|
|
extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
|
|
|
|
extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
|
2010-07-01 17:49:17 +07:00
|
|
|
extern void blk_queue_unprep_rq(struct request_queue *, unprep_rq_fn *ufn);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_merge_bvec(struct request_queue *, merge_bvec_fn *);
|
|
|
|
extern void blk_queue_dma_alignment(struct request_queue *, int);
|
2008-01-01 05:37:00 +07:00
|
|
|
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
|
2007-07-24 14:28:11 +07:00
|
|
|
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
|
2008-09-14 19:55:09 +07:00
|
|
|
extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
|
|
|
|
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
|
2010-09-03 16:56:16 +07:00
|
|
|
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
|
2011-05-07 00:34:32 +07:00
|
|
|
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
|
2012-08-03 04:42:04 +07:00
|
|
|
extern int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
|
|
|
|
struct scatterlist *sglist);
|
2005-04-17 05:20:36 +07:00
|
|
|
extern void blk_dump_rq_flags(struct request *, char *);
|
|
|
|
extern long nr_blockdev_pages(void);
|
|
|
|
|
2011-12-14 06:33:38 +07:00
|
|
|
bool __must_check blk_get_queue(struct request_queue *);
|
2007-07-24 14:28:11 +07:00
|
|
|
struct request_queue *blk_alloc_queue(gfp_t);
|
|
|
|
struct request_queue *blk_alloc_queue_node(gfp_t, int);
|
|
|
|
extern void blk_put_queue(struct request_queue *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2013-03-23 10:42:26 +07:00
|
|
|
/*
|
|
|
|
* block layer runtime pm functions
|
|
|
|
*/
|
|
|
|
#ifdef CONFIG_PM_RUNTIME
|
|
|
|
extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
|
|
|
|
extern int blk_pre_runtime_suspend(struct request_queue *q);
|
|
|
|
extern void blk_post_runtime_suspend(struct request_queue *q, int err);
|
|
|
|
extern void blk_pre_runtime_resume(struct request_queue *q);
|
|
|
|
extern void blk_post_runtime_resume(struct request_queue *q, int err);
|
|
|
|
#else
|
|
|
|
static inline void blk_pm_runtime_init(struct request_queue *q,
|
|
|
|
struct device *dev) {}
|
|
|
|
static inline int blk_pre_runtime_suspend(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
static inline void blk_post_runtime_suspend(struct request_queue *q, int err) {}
|
|
|
|
static inline void blk_pre_runtime_resume(struct request_queue *q) {}
|
|
|
|
static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
|
|
|
|
#endif
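The `#ifdef`/`#else` pair above follows a common header pattern: when the feature is compiled out, callers still link against no-op `static inline` stubs with the same signatures. The sketch below illustrates that pattern in userspace under stated assumptions: `TOY_CONFIG_PM`, `toy_queue`, `toy_pre_runtime_suspend`, and `toy_try_suspend` are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_queue;	/* opaque; hypothetical stand-in for struct request_queue */

#ifdef TOY_CONFIG_PM
/* A real implementation would live in a separate .c file. */
int toy_pre_runtime_suspend(struct toy_queue *q);
#else
/* Stub with the same signature: reports "not implemented". */
static inline int toy_pre_runtime_suspend(struct toy_queue *q)
{
	(void)q;
	return -ENOSYS;
}
#endif

/* The caller compiles unchanged whether or not TOY_CONFIG_PM is defined. */
static inline int toy_try_suspend(struct toy_queue *q)
{
	int err = toy_pre_runtime_suspend(q);

	assert(err <= 0);	/* convention assumed here: 0 or a negative errno */
	return err;
}
```

With `TOY_CONFIG_PM` undefined, `toy_try_suspend(NULL)` returns `-ENOSYS`, which matches the value the stubbed `blk_pre_runtime_suspend()` returns above.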
|
|
|
|
|
2011-07-08 13:19:21 +07:00
|
|
|
/*
|
2011-09-21 15:00:16 +07:00
|
|
|
* blk_plug permits building a queue of related requests by holding the I/O
|
|
|
|
* fragments for a short period. This allows merging of sequential requests
|
|
|
|
* into a single larger request. As the requests are moved from a per-task list to
|
|
|
|
* the device's request_queue in a batch, this results in improved scalability
|
|
|
|
* as the lock contention for request_queue lock is reduced.
|
|
|
|
*
|
|
|
|
* It is ok not to disable preemption when adding the request to the plug list
|
|
|
|
* or when attempting a merge, because blk_schedule_flush_plug() will only flush
|
|
|
|
* the plug list when the task sleeps by itself. For details, please see
|
|
|
|
* schedule() where blk_schedule_flush_plug() is called.
|
2011-07-08 13:19:21 +07:00
|
|
|
*/
|
2011-03-08 19:19:51 +07:00
|
|
|
struct blk_plug {
|
2011-09-21 15:00:16 +07:00
|
|
|
struct list_head list; /* requests */
|
2013-10-24 15:20:05 +07:00
|
|
|
struct list_head mq_list; /* blk-mq requests */
|
2011-09-21 15:00:16 +07:00
|
|
|
struct list_head cb_list; /* md requires an unplug callback */
|
2011-03-08 19:19:51 +07:00
|
|
|
};
|
2011-07-08 13:19:20 +07:00
|
|
|
#define BLK_MAX_REQUEST_COUNT 16
|
|
|
|
|
2012-07-31 14:08:14 +07:00
|
|
|
struct blk_plug_cb;
|
2012-07-31 14:08:15 +07:00
|
|
|
typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool);
|
2011-04-18 14:52:22 +07:00
|
|
|
struct blk_plug_cb {
|
|
|
|
struct list_head list;
|
2012-07-31 14:08:14 +07:00
|
|
|
blk_plug_cb_fn callback;
|
|
|
|
void *data;
|
2011-04-18 14:52:22 +07:00
|
|
|
};
|
2012-07-31 14:08:14 +07:00
|
|
|
extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
|
|
|
|
void *data, int size);
|
2011-03-08 19:19:51 +07:00
|
|
|
extern void blk_start_plug(struct blk_plug *);
|
|
|
|
extern void blk_finish_plug(struct blk_plug *);
|
2011-04-15 20:49:07 +07:00
|
|
|
extern void blk_flush_plug_list(struct blk_plug *, bool);
|
2011-03-08 19:19:51 +07:00
|
|
|
|
|
|
|
static inline void blk_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2011-04-16 18:27:55 +07:00
|
|
|
if (plug)
|
|
|
|
blk_flush_plug_list(plug, false);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_schedule_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2011-04-15 20:20:10 +07:00
|
|
|
if (plug)
|
2011-04-15 20:49:07 +07:00
|
|
|
blk_flush_plug_list(plug, true);
|
2011-03-08 19:19:51 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_needs_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2013-10-24 15:20:05 +07:00
|
|
|
return plug &&
|
|
|
|
(!list_empty(&plug->list) ||
|
|
|
|
!list_empty(&plug->mq_list) ||
|
|
|
|
!list_empty(&plug->cb_list));
|
2011-03-08 19:19:51 +07:00
|
|
|
}
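The predicate above can be tried outside the kernel with a toy model. This is only a userspace sketch: `toy_list`, `toy_plug`, and `toy_needs_flush` are hypothetical names, and the list helpers merely imitate the circular-list behaviour of `list_head`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's circular list_head. */
struct toy_list {
	struct toy_list *next, *prev;
};

static inline void toy_list_init(struct toy_list *h)
{
	h->next = h;
	h->prev = h;
}

static inline bool toy_list_empty(const struct toy_list *h)
{
	return h->next == h;
}

static inline void toy_list_add_tail(struct toy_list *n, struct toy_list *h)
{
	assert(n != NULL && h != NULL);
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Mirrors the three lists carried by struct blk_plug. */
struct toy_plug {
	struct toy_list list;
	struct toy_list mq_list;
	struct toy_list cb_list;
};

static inline void toy_plug_init(struct toy_plug *p)
{
	toy_list_init(&p->list);
	toy_list_init(&p->mq_list);
	toy_list_init(&p->cb_list);
}

/* True only if a plug exists and at least one of its lists is non-empty. */
static inline bool toy_needs_flush(const struct toy_plug *plug)
{
	return plug &&
		(!toy_list_empty(&plug->list) ||
		 !toy_list_empty(&plug->mq_list) ||
		 !toy_list_empty(&plug->cb_list));
}
```

A freshly initialised plug reports nothing to flush. Queuing a single entry on any one of the three lists is enough to flip the result.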
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/*
|
|
|
|
* tag stuff
|
|
|
|
*/
|
2014-04-14 15:30:12 +07:00
|
|
|
#define blk_rq_tagged(rq) \
|
|
|
|
((rq)->mq_ctx || ((rq)->cmd_flags & REQ_QUEUED))
|
2007-07-24 14:28:11 +07:00
|
|
|
extern int blk_queue_start_tag(struct request_queue *, struct request *);
|
|
|
|
extern struct request *blk_queue_find_tag(struct request_queue *, int);
|
|
|
|
extern void blk_queue_end_tag(struct request_queue *, struct request *);
|
|
|
|
extern int blk_queue_init_tags(struct request_queue *, int, struct blk_queue_tag *);
|
|
|
|
extern void blk_queue_free_tags(struct request_queue *);
|
|
|
|
extern int blk_queue_resize_tags(struct request_queue *, int);
|
|
|
|
extern void blk_queue_invalidate_tags(struct request_queue *);
|
2006-08-31 02:48:45 +07:00
|
|
|
extern struct blk_queue_tag *blk_init_tags(int);
|
|
|
|
extern void blk_free_tags(struct blk_queue_tag *);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-10-04 13:27:25 +07:00
|
|
|
static inline struct request *blk_map_queue_find_tag(struct blk_queue_tag *bqt,
|
|
|
|
int tag)
|
|
|
|
{
|
|
|
|
if (unlikely(bqt == NULL || tag >= bqt->real_max_depth))
|
|
|
|
return NULL;
|
|
|
|
return bqt->tag_index[tag];
|
|
|
|
}
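The lookup above is a bounds-checked array index. A minimal userspace sketch of the same shape follows. `toy_request`, `toy_tag_map`, and `toy_find_tag` are hypothetical names, and the struct carries only the two fields the helper reads.

```c
#include <assert.h>
#include <stddef.h>

struct toy_request {
	int id;
};

/* Hypothetical, trimmed-down analogue of struct blk_queue_tag. */
struct toy_tag_map {
	struct toy_request **tag_index;	/* slot per tag, NULL if unused */
	int real_max_depth;		/* number of slots in tag_index */
};

/*
 * Like the helper above, only the upper bound is checked, so callers
 * are expected to pass tag >= 0.
 */
static inline struct toy_request *toy_find_tag(struct toy_tag_map *bqt, int tag)
{
	if (bqt == NULL || tag >= bqt->real_max_depth)
		return NULL;
	return bqt->tag_index[tag];
}

/* Small self-check showing the three possible outcomes. */
static inline void toy_find_tag_examples(void)
{
	struct toy_request r = { 7 };
	struct toy_request *slots[2] = { &r, NULL };
	struct toy_tag_map m = { slots, 2 };

	assert(toy_find_tag(&m, 0) == &r);	/* occupied slot */
	assert(toy_find_tag(&m, 1) == NULL);	/* empty slot */
	assert(toy_find_tag(&m, 2) == NULL);	/* out of range */
}
```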
|
2010-09-17 01:51:46 +07:00
|
|
|
|
|
|
|
#define BLKDEV_DISCARD_SECURE 0x01 /* secure discard */
|
|
|
|
|
|
|
|
extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
|
2010-04-28 20:55:06 +07:00
|
|
|
extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
|
2012-09-18 23:19:27 +07:00
|
|
|
extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, struct page *page);
|
2010-04-28 20:55:09 +07:00
|
|
|
extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
|
2010-09-17 01:51:46 +07:00
|
|
|
sector_t nr_sects, gfp_t gfp_mask);
|
2010-08-18 16:29:10 +07:00
|
|
|
static inline int sb_issue_discard(struct super_block *sb, sector_t block,
|
|
|
|
sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
|
2008-08-06 00:01:53 +07:00
|
|
|
{
|
2010-08-18 16:29:10 +07:00
|
|
|
return blkdev_issue_discard(sb->s_bdev, block << (sb->s_blocksize_bits - 9),
|
|
|
|
nr_blocks << (sb->s_blocksize_bits - 9),
|
|
|
|
gfp_mask, flags);
|
2008-08-06 00:01:53 +07:00
|
|
|
}
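The shift by `s_blocksize_bits - 9` above converts filesystem blocks into 512-byte sectors. The sketch below isolates that arithmetic in userspace. `toy_fs_block_to_sector` is a hypothetical helper name, and `uint64_t` stands in for `sector_t`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a filesystem block number to a 512-byte sector number.
 * blocksize_bits is log2 of the filesystem block size in bytes,
 * for example 12 for 4096-byte blocks.
 */
static inline uint64_t toy_fs_block_to_sector(uint64_t block,
					      unsigned char blocksize_bits)
{
	assert(blocksize_bits >= 9);	/* blocks are at least one sector */
	return block << (blocksize_bits - 9);
}
```

With 4096-byte blocks, each block spans 8 sectors, so block 10 starts at sector 80.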
|
2010-10-28 08:30:04 +07:00
|
|
|
static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
|
2010-10-28 10:44:47 +07:00
|
|
|
sector_t nr_blocks, gfp_t gfp_mask)
|
2010-10-28 08:30:04 +07:00
|
|
|
{
|
|
|
|
return blkdev_issue_zeroout(sb->s_bdev,
|
|
|
|
block << (sb->s_blocksize_bits - 9),
|
|
|
|
nr_blocks << (sb->s_blocksize_bits - 9),
|
2010-10-28 10:44:47 +07:00
|
|
|
gfp_mask);
|
2010-10-28 08:30:04 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-06-26 21:27:10 +07:00
|
|
|
extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
|
2008-06-26 18:48:27 +07:00
|
|
|
|
2010-02-26 12:20:37 +07:00
|
|
|
enum blk_default_limits {
|
|
|
|
BLK_MAX_SEGMENTS = 128,
|
|
|
|
BLK_SAFE_MAX_SECTORS = 255,
|
|
|
|
BLK_DEF_MAX_SECTORS = 1024,
|
|
|
|
BLK_MAX_SEGMENT_SIZE = 65536,
|
|
|
|
BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL,
|
|
|
|
};
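The sector limits in this enum are easier to compare once converted to bytes. The sketch below does that conversion, assuming the usual 512-byte sector unit. The `TOY_` names are hypothetical copies of two values from the enum above.

```c
#include <assert.h>

enum toy_default_limits {
	TOY_SAFE_MAX_SECTORS = 255,
	TOY_DEF_MAX_SECTORS = 1024,
};

/* Convert a count of 512-byte sectors to bytes. */
static inline unsigned long toy_sectors_to_bytes(unsigned long sectors)
{
	assert(sectors <= (~0UL >> 9));	/* guard against shift overflow */
	return sectors << 9;
}
```

Under that assumption, 1024 sectors correspond to 524288 bytes (512 KiB) and 255 sectors to 130560 bytes.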
|
2008-12-03 18:55:08 +07:00
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#define blkdev_entry_to_request(entry) list_entry((entry), struct request, queuelist)
|
|
|
|
|
2009-05-23 04:17:50 +07:00
|
|
|
static inline unsigned long queue_bounce_pfn(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.bounce_pfn;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long queue_segment_boundary(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.seg_boundary_mask;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int queue_max_sectors(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.max_sectors;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int queue_max_hw_sectors(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.max_hw_sectors;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
2010-02-26 12:20:39 +07:00
|
|
|
static inline unsigned short queue_max_segments(struct request_queue *q)
|
2009-05-23 04:17:50 +07:00
|
|
|
{
|
2010-02-26 12:20:39 +07:00
|
|
|
return q->limits.max_segments;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int queue_max_segment_size(struct request_queue *q)
|
|
|
|
{
|
2009-05-23 04:17:51 +07:00
|
|
|
return q->limits.max_segment_size;
|
2009-05-23 04:17:50 +07:00
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:49 +07:00
|
|
|
static inline unsigned short queue_logical_block_size(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
int retval = 512;
|
|
|
|
|
2009-05-23 04:17:51 +07:00
|
|
|
if (q && q->limits.logical_block_size)
|
|
|
|
retval = q->limits.logical_block_size;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:49 +07:00
|
|
|
static inline unsigned short bdev_logical_block_size(struct block_device *bdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2009-05-23 04:17:49 +07:00
|
|
|
return queue_logical_block_size(bdev_get_queue(bdev));
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
static inline unsigned int queue_physical_block_size(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.physical_block_size;
|
|
|
|
}
|
|
|
|
|
2010-10-14 02:18:03 +07:00
|
|
|
static inline unsigned int bdev_physical_block_size(struct block_device *bdev)
|
2009-10-04 01:52:01 +07:00
|
|
|
{
|
|
|
|
return queue_physical_block_size(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
static inline unsigned int queue_io_min(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.io_min;
|
|
|
|
}
|
|
|
|
|
2009-10-04 01:52:01 +07:00
|
|
|
static inline int bdev_io_min(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
return queue_io_min(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
static inline unsigned int queue_io_opt(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.io_opt;
|
|
|
|
}
|
|
|
|
|
2009-10-04 01:52:01 +07:00
|
|
|
static inline int bdev_io_opt(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
return queue_io_opt(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2009-05-23 04:17:53 +07:00
|
|
|
static inline int queue_alignment_offset(struct request_queue *q)
|
|
|
|
{
|
2009-10-04 01:52:01 +07:00
|
|
|
if (q->limits.misaligned)
|
2009-05-23 04:17:53 +07:00
|
|
|
return -1;
|
|
|
|
|
2009-10-04 01:52:01 +07:00
|
|
|
return q->limits.alignment_offset;
|
2009-05-23 04:17:53 +07:00
|
|
|
}
|
|
|
|
|
2010-01-11 15:21:51 +07:00
|
|
|
static inline int queue_limit_alignment_offset(struct queue_limits *lim, sector_t sector)
|
2009-12-29 14:35:35 +07:00
|
|
|
{
|
|
|
|
unsigned int granularity = max(lim->physical_block_size, lim->io_min);
|
2010-01-11 15:21:51 +07:00
|
|
|
unsigned int alignment = (sector << 9) & (granularity - 1);
|
2009-12-29 14:35:35 +07:00
|
|
|
|
2010-01-11 15:21:51 +07:00
|
|
|
return (granularity + lim->alignment_offset - alignment)
|
|
|
|
& (granularity - 1);
|
2009-05-23 04:17:53 +07:00
|
|
|
}
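The masking arithmetic above gives the number of bytes from a starting sector to the next naturally aligned boundary. The userspace sketch below reproduces it for experimentation. `toy_limits` and `toy_alignment_offset` are hypothetical names, and the explicit power-of-two check is an added assumption that the mask arithmetic relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of struct queue_limits, all values in bytes. */
struct toy_limits {
	unsigned int physical_block_size;
	unsigned int io_min;
	unsigned int alignment_offset;
};

static inline int toy_alignment_offset(const struct toy_limits *lim,
				       uint64_t sector)
{
	unsigned int granularity = lim->physical_block_size > lim->io_min ?
				   lim->physical_block_size : lim->io_min;
	unsigned int alignment;

	/* The "& (granularity - 1)" trick only works for powers of two. */
	assert(granularity && (granularity & (granularity - 1)) == 0);

	/* Byte position of the start sector within one granule. */
	alignment = (unsigned int)((sector << 9) & (granularity - 1));

	return (granularity + lim->alignment_offset - alignment)
		& (granularity - 1);
}
```

For a device with 4096-byte physical blocks and no inherent offset, a partition starting at sector 63 begins at byte 32256, which is 512 bytes short of the next 4096-byte boundary, so the function returns 512.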
|
|
|
|
|
2009-10-04 01:52:01 +07:00
|
|
|
static inline int bdev_alignment_offset(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q->limits.misaligned)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
if (bdev != bdev->bd_contains)
|
|
|
|
return bdev->bd_part->alignment_offset;
|
|
|
|
|
|
|
|
return q->limits.alignment_offset;
|
|
|
|
}
|
|
|
|
|
2009-11-10 17:50:21 +07:00
|
|
|
static inline int queue_discard_alignment(struct request_queue *q)
|
|
|
|
{
|
|
|
|
if (q->limits.discard_misaligned)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
return q->limits.discard_alignment;
|
|
|
|
}
|
|
|
|
|
2010-01-11 15:21:51 +07:00
|
|
|
static inline int queue_limit_discard_alignment(struct queue_limits *lim, sector_t sector)
|
2009-11-10 17:50:21 +07:00
|
|
|
{
|
2012-12-19 22:18:35 +07:00
|
|
|
unsigned int alignment, granularity, offset;
|
2010-01-11 15:21:48 +07:00
|
|
|
|
2011-05-18 15:37:35 +07:00
|
|
|
if (!lim->max_discard_sectors)
|
|
|
|
return 0;
|
|
|
|
|
2012-12-19 22:18:35 +07:00
|
|
|
/* Why are these in bytes, not sectors? */
|
|
|
|
alignment = lim->discard_alignment >> 9;
|
|
|
|
granularity = lim->discard_granularity >> 9;
|
|
|
|
if (!granularity)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Offset of the partition start in 'granularity' sectors */
|
|
|
|
offset = sector_div(sector, granularity);
|
|
|
|
|
|
|
|
/* And why do we do this modulus *again* in blkdev_issue_discard()? */
|
|
|
|
offset = (granularity + alignment - offset) % granularity;
|
|
|
|
|
|
|
|
/* Turn it back into bytes, gaah */
|
|
|
|
return offset << 9;
|
2009-11-10 17:50:21 +07:00
|
|
|
}
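The discard variant above does the same job with a plain modulus, working in sectors and converting back to bytes at the end. The sketch below mirrors those steps in userspace. `toy_discard_alignment` is a hypothetical name, the limits are passed as plain arguments, and `%` replaces `sector_div()`, which returns the remainder.

```c
#include <assert.h>
#include <stdint.h>

/*
 * discard_alignment and discard_granularity are in bytes;
 * max_discard_sectors and sector are in 512-byte sectors.
 */
static inline int toy_discard_alignment(unsigned int discard_alignment,
					unsigned int discard_granularity,
					unsigned int max_discard_sectors,
					uint64_t sector)
{
	unsigned int alignment, granularity, offset;

	if (!max_discard_sectors)	/* device does not support discard */
		return 0;

	alignment = discard_alignment >> 9;
	granularity = discard_granularity >> 9;
	if (!granularity)		/* sub-sector granularity: nothing to align */
		return 0;

	/* Position of the start sector within one granule, in sectors. */
	offset = (unsigned int)(sector % granularity);

	/* Sectors from the start to the next aligned boundary. */
	offset = (granularity + alignment - offset) % granularity;

	assert(offset < granularity);
	return offset << 9;		/* back to bytes */
}
```

With a granularity of 64 sectors (32768 bytes) and no alignment offset, a start at sector 2 is 62 sectors short of the next boundary, which is 31744 bytes.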
|
|
|
|
|
block: split discard into aligned requests
When a disk has large discard_granularity and small max_discard_sectors,
discards are not split with optimal alignment. In the limit case of
discard_granularity == max_discard_sectors, no request could be aligned
correctly, so in fact you might end up with no discarded logical blocks
at all.
Another example that helps showing the condition in the patch is with
discard_granularity == 64, max_discard_sectors == 128. A request that is
submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
However, only 2 aligned blocks out of 3 are included in the request;
128..191 may be left intact and not discarded. With this patch, the
first request will be truncated to ensure good alignment of what's left,
and the split will be 2..127, 128..255, 256..257. The patch will also
take into account the discard_alignment.
At most one extra request will be introduced, because the first request
will be reduced by at most granularity-1 sectors, and granularity
must be less than max_discard_sectors. Subsequent requests will run
on round_down(max_discard_sectors, granularity) sectors, as in the
current code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02 14:48:50 +07:00
|
|
|
static inline int bdev_discard_alignment(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (bdev != bdev->bd_contains)
|
|
|
|
return bdev->bd_part->discard_alignment;
|
|
|
|
|
|
|
|
return q->limits.discard_alignment;
|
|
|
|
}
|
|
|
|
|
2009-12-03 15:24:48 +07:00
|
|
|
static inline unsigned int queue_discard_zeroes_data(struct request_queue *q)
|
|
|
|
{
|
2011-05-18 15:37:35 +07:00
|
|
|
if (q->limits.max_discard_sectors && q->limits.discard_zeroes_data == 1)
|
2009-12-03 15:24:48 +07:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int bdev_discard_zeroes_data(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
return queue_discard_zeroes_data(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2012-09-18 23:19:27 +07:00
|
|
|
static inline unsigned int bdev_write_same(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
|
|
|
return q->limits.max_write_same_sectors;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2007-07-24 14:28:11 +07:00
|
|
|
static inline int queue_dma_alignment(struct request_queue *q)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
2008-01-01 22:23:02 +07:00
|
|
|
return q ? q->dma_alignment : 511;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
|
|
|
|
2010-09-15 18:08:27 +07:00
|
|
|
static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
|
2008-08-28 13:05:58 +07:00
|
|
|
unsigned int len)
|
|
|
|
{
|
|
|
|
unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;
|
2010-09-15 18:08:27 +07:00
|
|
|
return !(addr & alignment) && !(len & alignment);
|
2008-08-28 13:05:58 +07:00
|
|
|
}
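The check above treats the DMA alignment and the pad mask as bit masks and requires both the address and the length to have none of those bits set. A minimal userspace sketch of the same test follows. `toy_rq_aligned` is a hypothetical name, and the two masks are passed in directly instead of being read from a queue.

```c
#include <assert.h>

/* Returns 1 if both addr and len are multiples of (mask + 1), else 0. */
static inline int toy_rq_aligned(unsigned int dma_alignment,
				 unsigned int dma_pad_mask,
				 unsigned long addr, unsigned int len)
{
	unsigned int alignment = dma_alignment | dma_pad_mask;

	return !(addr & alignment) && !(len & alignment);
}

/* Examples with the default mask of 511 (512-byte alignment). */
static inline void toy_rq_aligned_examples(void)
{
	assert(toy_rq_aligned(511, 0, 0x1000, 1024) == 1);
	assert(toy_rq_aligned(511, 0, 0x1000, 100) == 0);	/* bad length */
	assert(toy_rq_aligned(511, 0, 0x1004, 512) == 0);	/* bad address */
}
```

OR-ing the two masks means the stricter of the two requirements wins, so a pad mask of 7 overrides a DMA alignment mask of 3.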
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
/* assumes size > 256 */
|
|
|
|
static inline unsigned int blksize_bits(unsigned int size)
|
|
|
|
{
|
|
|
|
unsigned int bits = 8;
|
|
|
|
do {
|
|
|
|
bits++;
|
|
|
|
size >>= 1;
|
|
|
|
} while (size > 256);
|
|
|
|
return bits;
|
|
|
|
}
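For power-of-two sizes, the loop above computes the base-2 logarithm of the block size. The userspace copy below makes that easy to verify. `toy_blksize_bits` is a hypothetical name, and the assertion simply encodes the precondition stated in the comment above the original.

```c
#include <assert.h>

static inline unsigned int toy_blksize_bits(unsigned int size)
{
	unsigned int bits = 8;

	assert(size > 256);	/* precondition from the original comment */
	do {
		bits++;
		size >>= 1;
	} while (size > 256);
	return bits;
}
```

A 512-byte block yields 9 bits and a 4096-byte block yields 12.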
|
|
|
|
|
2005-09-10 14:27:17 +07:00
|
|
|
static inline unsigned int block_size(struct block_device *bdev)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
return bdev->bd_block_size;
|
|
|
|
}
|
|
|
|
|
2011-05-07 00:34:32 +07:00
|
|
|
static inline bool queue_flush_queueable(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return !q->flush_not_queueable;
|
|
|
|
}
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
typedef struct {struct page *v;} Sector;
|
|
|
|
|
|
|
|
unsigned char *read_dev_sector(struct block_device *, sector_t, Sector *);
|
|
|
|
|
|
|
|
static inline void put_dev_sector(Sector p)
|
|
|
|
{
|
|
|
|
page_cache_release(p.v);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct work_struct;
|
2014-04-08 22:15:35 +07:00
|
|
|
int kblockd_schedule_work(struct work_struct *work);
|
|
|
|
int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long delay);
|
2014-04-08 22:17:40 +07:00
|
|
|
int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2010-04-02 05:01:41 +07:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
2010-06-01 17:23:18 +07:00
|
|
|
/*
|
|
|
|
* This should not be using sched_clock(). A real patch is in progress
|
|
|
|
* to fix this up; until that is in place, we need to disable preemption
|
|
|
|
* around sched_clock() in this function and set_io_start_time_ns().
|
|
|
|
*/
|
2010-04-02 05:01:41 +07:00
|
|
|
static inline void set_start_time_ns(struct request *req)
|
|
|
|
{
|
2010-06-01 17:23:18 +07:00
|
|
|
preempt_disable();
|
2010-04-02 05:01:41 +07:00
|
|
|
req->start_time_ns = sched_clock();
|
2010-06-01 17:23:18 +07:00
|
|
|
preempt_enable();
|
2010-04-02 05:01:41 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void set_io_start_time_ns(struct request *req)
|
|
|
|
{
|
2010-06-01 17:23:18 +07:00
|
|
|
preempt_disable();
|
2010-04-02 05:01:41 +07:00
|
|
|
req->io_start_time_ns = sched_clock();
|
2010-06-01 17:23:18 +07:00
|
|
|
preempt_enable();
|
2010-04-02 05:01:41 +07:00
|
|
|
}
|
2010-04-09 13:31:19 +07:00
|
|
|
|
|
|
|
static inline uint64_t rq_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return req->start_time_ns;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline uint64_t rq_io_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return req->io_start_time_ns;
|
|
|
|
}
|
2010-04-02 05:01:41 +07:00
|
|
|
#else
|
|
|
|
static inline void set_start_time_ns(struct request *req) {}
|
|
|
|
static inline void set_io_start_time_ns(struct request *req) {}
|
2010-04-09 13:31:19 +07:00
|
|
|
static inline uint64_t rq_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline uint64_t rq_io_start_time_ns(struct request *req)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2010-04-02 05:01:41 +07:00
|
|
|
#endif
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#define MODULE_ALIAS_BLOCKDEV(major,minor) \
|
|
|
|
MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
|
|
|
|
#define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \
|
|
|
|
MODULE_ALIAS("block-major-" __stringify(major) "-*")
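The two macros above rely on adjacent string-literal concatenation plus a two-level stringify, so macro arguments are expanded before being turned into text. The sketch below reproduces only the string building in userspace. All `toy_`/`TOY_` names are hypothetical, and major number 8 is used purely as an example value.

```c
#include <assert.h>
#include <string.h>

/* Two levels so that a macro argument is expanded before '#' applies. */
#define toy_stringify_1(x) #x
#define toy_stringify(x) toy_stringify_1(x)

#define TOY_BLOCKDEV_ALIAS(major, minor) \
	"block-major-" toy_stringify(major) "-" toy_stringify(minor)
#define TOY_BLOCKDEV_MAJOR_ALIAS(major) \
	"block-major-" toy_stringify(major) "-*"

#define TOY_EXAMPLE_MAJOR 8

static inline const char *toy_example_alias(void)
{
	return TOY_BLOCKDEV_ALIAS(TOY_EXAMPLE_MAJOR, 0);
}

static inline void toy_alias_examples(void)
{
	assert(strcmp(toy_example_alias(), "block-major-8-0") == 0);
	assert(strcmp(TOY_BLOCKDEV_MAJOR_ALIAS(TOY_EXAMPLE_MAJOR),
		      "block-major-8-*") == 0);
}
```

With a single-level `#x`, the first string would read `block-major-TOY_EXAMPLE_MAJOR-0` instead, which is why the indirection matters.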
|
|
|
|
|
2008-07-01 01:04:41 +07:00
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
|
2008-06-27 14:12:09 +07:00
|
|
|
#define INTEGRITY_FLAG_READ 2 /* verify data integrity on read */
|
|
|
|
#define INTEGRITY_FLAG_WRITE 4 /* generate data integrity on write */
|
2008-07-01 01:04:41 +07:00
|
|
|
|
|
|
|
struct blk_integrity_exchg {
|
|
|
|
void *prot_buf;
|
|
|
|
void *data_buf;
|
|
|
|
sector_t sector;
|
|
|
|
unsigned int data_size;
|
|
|
|
unsigned short sector_size;
|
|
|
|
const char *disk_name;
|
|
|
|
};
|
|
|
|
|
|
|
|
typedef void (integrity_gen_fn) (struct blk_integrity_exchg *);
|
|
|
|
typedef int (integrity_vrfy_fn) (struct blk_integrity_exchg *);
|
|
|
|
typedef void (integrity_set_tag_fn) (void *, void *, unsigned int);
|
|
|
|
typedef void (integrity_get_tag_fn) (void *, void *, unsigned int);
|
|
|
|
|
|
|
|
struct blk_integrity {
|
|
|
|
integrity_gen_fn *generate_fn;
|
|
|
|
integrity_vrfy_fn *verify_fn;
|
|
|
|
integrity_set_tag_fn *set_tag_fn;
|
|
|
|
integrity_get_tag_fn *get_tag_fn;
|
|
|
|
|
|
|
|
unsigned short flags;
|
|
|
|
unsigned short tuple_size;
|
|
|
|
unsigned short sector_size;
|
|
|
|
unsigned short tag_size;
|
|
|
|
|
|
|
|
const char *name;
|
|
|
|
|
|
|
|
struct kobject kobj;
|
|
|
|
};
|
|
|
|
|
2011-04-02 02:02:31 +07:00
|
|
|
extern bool blk_integrity_is_initialized(struct gendisk *);
|
2008-07-01 01:04:41 +07:00
|
|
|
extern int blk_integrity_register(struct gendisk *, struct blk_integrity *);
|
|
|
|
extern void blk_integrity_unregister(struct gendisk *);
|
2008-10-01 14:38:39 +07:00
|
|
|
extern int blk_integrity_compare(struct gendisk *, struct gendisk *);
|
2010-09-11 01:50:10 +07:00
|
|
|
extern int blk_rq_map_integrity_sg(struct request_queue *, struct bio *,
|
|
|
|
struct scatterlist *);
|
|
|
|
extern int blk_rq_count_integrity_sg(struct request_queue *, struct bio *);
|
|
|
|
extern int blk_integrity_merge_rq(struct request_queue *, struct request *,
|
|
|
|
struct request *);
|
|
|
|
extern int blk_integrity_merge_bio(struct request_queue *, struct request *,
|
|
|
|
struct bio *);
|
2008-07-01 01:04:41 +07:00
|
|
|
|
2008-10-02 17:53:22 +07:00
|
|
|
static inline
|
|
|
|
struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
|
|
|
|
{
|
|
|
|
return bdev->bd_disk->integrity;
|
|
|
|
}
|
|
|
|
|
2008-10-02 23:47:49 +07:00
|
|
|
static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
return disk->integrity;
|
|
|
|
}
|
|
|
|
|
2008-07-01 01:04:41 +07:00
|
|
|
static inline int blk_integrity_rq(struct request *rq)
|
|
|
|
{
|
2008-07-17 03:09:06 +07:00
|
|
|
if (rq->bio == NULL)
|
|
|
|
return 0;
|
|
|
|
|
2008-07-01 01:04:41 +07:00
|
|
|
return bio_integrity(rq->bio);
|
|
|
|
}
|
|
|
|
|
2010-09-11 01:50:10 +07:00
|
|
|
static inline void blk_queue_max_integrity_segments(struct request_queue *q,
|
|
|
|
unsigned int segs)
|
|
|
|
{
|
|
|
|
q->limits.max_integrity_segments = segs;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned short
|
|
|
|
queue_max_integrity_segments(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return q->limits.max_integrity_segments;
|
|
|
|
}
|
|
|
|
|
2008-07-01 01:04:41 +07:00
|
|
|
#else /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2012-01-12 15:17:30 +07:00
|
|
|
struct bio;
|
|
|
|
struct block_device;
|
|
|
|
struct gendisk;
|
|
|
|
struct blk_integrity;
|
|
|
|
|
|
|
|
static inline int blk_integrity_rq(struct request *rq)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_rq_count_integrity_sg(struct request_queue *q,
|
|
|
|
struct bio *b)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_rq_map_integrity_sg(struct request_queue *q,
|
|
|
|
struct bio *b,
|
|
|
|
struct scatterlist *s)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline struct blk_integrity *bdev_get_integrity(struct block_device *b)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
static inline int blk_integrity_compare(struct gendisk *a, struct gendisk *b)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_integrity_register(struct gendisk *d,
|
|
|
|
struct blk_integrity *b)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline void blk_integrity_unregister(struct gendisk *d)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void blk_queue_max_integrity_segments(struct request_queue *q,
|
|
|
|
unsigned int segs)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline unsigned short queue_max_integrity_segments(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_integrity_merge_rq(struct request_queue *rq,
|
|
|
|
struct request *r1,
|
|
|
|
struct request *r2)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int blk_integrity_merge_bio(struct request_queue *rq,
|
|
|
|
struct request *r,
|
|
|
|
struct bio *b)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline bool blk_integrity_is_initialized(struct gendisk *g)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2008-07-01 01:04:41 +07:00
|
|
|
|
|
|
|
#endif /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2007-10-09 00:26:20 +07:00
|
|
|
struct block_device_operations {
|
[PATCH] beginning of methods conversion
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-03-02 21:09:22 +07:00
|
|
|
int (*open) (struct block_device *, fmode_t);
|
2013-05-06 08:52:57 +07:00
|
|
|
void (*release) (struct gendisk *, fmode_t);
|
2008-03-02 21:09:22 +07:00
|
|
|
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
|
|
|
|
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
|
2007-10-09 00:26:20 +07:00
|
|
|
int (*direct_access) (struct block_device *, sector_t,
|
|
|
|
void **, unsigned long *);
|
implement in-kernel gendisk events handling
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried such command
sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there are any pending events and return them if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called parallelly.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-09 02:57:37 +07:00
|
|
|
unsigned int (*check_events) (struct gendisk *disk,
|
|
|
|
unsigned int clearing);
|
|
|
|
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
|
2007-10-09 00:26:20 +07:00
|
|
|
int (*media_changed) (struct gendisk *);
|
2010-05-16 01:09:29 +07:00
|
|
|
void (*unlock_native_capacity) (struct gendisk *);
|
2007-10-09 00:26:20 +07:00
|
|
|
int (*revalidate_disk) (struct gendisk *);
|
|
|
|
int (*getgeo)(struct block_device *, struct hd_geometry *);
|
2010-05-17 12:32:43 +07:00
|
|
|
/* this callback is called with swap_lock and sometimes the page table lock held */
|
|
|
|
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
|
2007-10-09 00:26:20 +07:00
|
|
|
struct module *owner;
|
|
|
|
};
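The `->check_events()` contract described in the commit message (report whatever events are pending, with `clearing` passed only as a hint) can be sketched in userspace as follows. Everything here is a stand-in: `struct gendisk` and the operations table are trimmed to the one member each that the example needs, the event bit values are assumed for illustration, and `toy_drive`, `toy_check_events` and `toy_poll` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Event bits; values assumed for this sketch. */
#define DISK_EVENT_MEDIA_CHANGE		(1u << 0)
#define DISK_EVENT_EJECT_REQUEST	(1u << 1)

/* Trimmed stand-ins for the kernel structures. */
struct gendisk {
	void *private_data;
};

struct block_device_operations {
	unsigned int (*check_events) (struct gendisk *disk,
				      unsigned int clearing);
};

/* Hypothetical per-drive state: events noticed but not yet reported. */
struct toy_drive {
	unsigned int pending;
};

/* Return the pending event mask once, then forget it. */
static unsigned int toy_check_events(struct gendisk *disk,
				     unsigned int clearing)
{
	struct toy_drive *drv = disk->private_data;
	unsigned int events;

	(void)clearing;		/* only a hint; unused in this sketch */
	assert(drv != NULL);
	events = drv->pending;
	drv->pending = 0;
	return events;
}

static const struct block_device_operations toy_fops = {
	.check_events = toy_check_events,
};

/* Plays the role of the in-kernel poller calling through the ops table. */
static unsigned int toy_poll(struct toy_drive *drv)
{
	struct gendisk disk = { .private_data = drv };

	return toy_fops.check_events(&disk, 0);
}
```

With `pending` set to `DISK_EVENT_MEDIA_CHANGE`, the first `toy_poll()` call reports that bit and the second reports nothing, matching the "return pending events" behaviour the commit message asks of drivers.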
|
|
|
|
|
2007-08-30 07:34:12 +07:00
|
|
|
extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
|
|
|
|
unsigned long);
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 01:45:40 +07:00
|
|
|
#else /* CONFIG_BLOCK */
|
|
|
|
/*
|
|
|
|
* stubs for when the block layer is configured out
|
|
|
|
*/
|
|
|
|
#define buffer_heads_over_limit 0
|
|
|
|
|
|
|
|
static inline long nr_blockdev_pages(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-03-12 02:17:08 +07:00
|
|
|
struct blk_plug {
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline void blk_start_plug(struct blk_plug *plug)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-03-12 02:17:08 +07:00
|
|
|
static inline void blk_finish_plug(struct blk_plug *plug)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-03-12 02:17:08 +07:00
|
|
|
static inline void blk_flush_plug(struct task_struct *task)
|
2011-03-08 19:19:51 +07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-04-16 18:27:55 +07:00
|
|
|
static inline void blk_schedule_flush_plug(struct task_struct *task)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2011-03-08 19:19:51 +07:00
|
|
|
static inline bool blk_needs_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2006-10-01 01:45:40 +07:00
|
|
|
#endif /* CONFIG_BLOCK */
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
#endif
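The stub section above exists so that callers can use the plugging helpers unconditionally, whether or not the block layer is built in. A minimal userspace sketch of that pattern follows. `TOY_CONFIG_BLOCK` is a hypothetical switch standing in for `CONFIG_BLOCK`, the "enabled" branch is invented purely for contrast, and `toy_submit_batch` is a made-up caller. The header's stub uses an empty `struct blk_plug`, which relies on a GNU C extension, so the sketch adds a dummy member to stay portable.

```c
#include <assert.h>

#ifdef TOY_CONFIG_BLOCK
/* Invented "real" variant, present only to contrast with the stubs. */
struct blk_plug {
	int active;
};

static inline void blk_start_plug(struct blk_plug *plug)
{
	plug->active = 1;
}

static inline void blk_finish_plug(struct blk_plug *plug)
{
	plug->active = 0;
}
#else
/* Stub variant: same names and signatures, no behaviour. */
struct blk_plug {
	char unused;	/* dummy member; the header's stub struct is empty */
};

static inline void blk_start_plug(struct blk_plug *plug)
{
	(void)plug;
}

static inline void blk_finish_plug(struct blk_plug *plug)
{
	(void)plug;
}
#endif

/*
 * Hypothetical caller: compiles unchanged against either variant and
 * counts the non-negative request ids it would have submitted.
 */
static int toy_submit_batch(const int *reqs, int n)
{
	struct blk_plug plug;
	int submitted = 0;
	int i;

	assert(n >= 0);
	blk_start_plug(&plug);
	for (i = 0; i < n; i++)
		if (reqs[i] >= 0)
			submitted++;
	blk_finish_plug(&plug);
	return submitted;
}
```

Because the stubs are empty `static inline` functions, a compiler can drop the calls entirely, so the caller needs no `#ifdef` of its own.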
|