[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 01:45:40 +07:00
|
|
|
if BLOCK
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
menu "IO Schedulers"
|
|
|
|
|
|
|
|
config IOSCHED_NOOP
|
|
|
|
bool
|
|
|
|
default y
|
|
|
|
---help---
|
|
|
|
The no-op I/O scheduler is a minimal scheduler that does basic merging
|
|
|
|
and sorting. Its main uses include non-disk based block devices like
|
|
|
|
memory devices, and specialised software or hardware environments
|
|
|
|
that do their own scheduling and require only minimal assistance from
|
|
|
|
the kernel.
|
|
|
|
|
|
|
|
config IOSCHED_DEADLINE
|
|
|
|
tristate "Deadline I/O scheduler"
|
|
|
|
default y
|
|
|
|
---help---
|
2009-10-03 14:37:51 +07:00
|
|
|
The deadline I/O scheduler is simple and compact. It will provide
|
|
|
|
CSCAN service with FIFO expiration of requests, switching to
|
|
|
|
a new point in the service tree and doing a batch of IO from there
|
|
|
|
in case of expiry.
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
config IOSCHED_CFQ
|
|
|
|
tristate "CFQ I/O scheduler"
|
|
|
|
default y
|
|
|
|
---help---
|
|
|
|
The CFQ I/O scheduler tries to distribute bandwidth equally
|
|
|
|
among all processes in the system. It should provide a fair
|
2009-10-03 14:40:47 +07:00
|
|
|
and low latency working environment, suitable for both desktop
|
|
|
|
and server systems.
|
|
|
|
|
2007-02-18 02:08:22 +07:00
|
|
|
This is the default I/O scheduler.
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2009-12-04 00:59:43 +07:00
|
|
|
config CFQ_GROUP_IOSCHED
|
|
|
|
bool "CFQ Group Scheduling support"
|
2010-04-27 00:27:56 +07:00
|
|
|
depends on IOSCHED_CFQ && BLK_CGROUP
|
2009-12-04 00:59:43 +07:00
|
|
|
default n
|
|
|
|
---help---
|
|
|
|
Enable group IO scheduling in CFQ.
|
|
|
|
|
2005-10-31 06:02:19 +07:00
|
|
|
choice
|
block, bfq: add full hierarchical scheduling and cgroups support
Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.
Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.
Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-12 23:23:08 +07:00
|
|
|
|
2005-10-31 06:02:19 +07:00
|
|
|
prompt "Default I/O scheduler"
|
2006-06-19 15:06:48 +07:00
|
|
|
default DEFAULT_CFQ
|
2005-10-31 06:02:19 +07:00
|
|
|
help
|
|
|
|
Select the I/O scheduler which will be used by default for all
|
|
|
|
block devices.
|
|
|
|
|
|
|
|
config DEFAULT_DEADLINE
|
2005-11-04 14:44:58 +07:00
|
|
|
bool "Deadline" if IOSCHED_DEADLINE=y
|
2005-10-31 06:02:19 +07:00
|
|
|
|
|
|
|
config DEFAULT_CFQ
|
2005-11-04 14:44:58 +07:00
|
|
|
bool "CFQ" if IOSCHED_CFQ=y
|
2005-10-31 06:02:19 +07:00
|
|
|
|
|
|
|
config DEFAULT_NOOP
|
|
|
|
bool "No-op"
|
|
|
|
|
|
|
|
endchoice
|
|
|
|
|
|
|
|
config DEFAULT_IOSCHED
|
|
|
|
string
|
|
|
|
default "deadline" if DEFAULT_DEADLINE
|
|
|
|
default "cfq" if DEFAULT_CFQ
|
|
|
|
default "noop" if DEFAULT_NOOP
|
|
|
|
|
2017-01-14 04:43:58 +07:00
|
|
|
config MQ_IOSCHED_DEADLINE
|
|
|
|
tristate "MQ deadline I/O scheduler"
|
|
|
|
default y
|
|
|
|
---help---
|
|
|
|
MQ version of the deadline IO scheduler.
|
|
|
|
|
2017-04-14 15:00:02 +07:00
|
|
|
config MQ_IOSCHED_KYBER
|
|
|
|
tristate "Kyber I/O scheduler"
|
|
|
|
default y
|
|
|
|
---help---
|
|
|
|
The Kyber I/O scheduler is a low-overhead scheduler suitable for
|
|
|
|
multiqueue and other fast devices. Given target latencies for reads and
|
|
|
|
synchronous writes, it will self-tune queue depths to achieve that
|
|
|
|
goal.
|
|
|
|
|
block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
We tag as v0 the version of BFQ containing only BFQ's engine plus
hierarchical support. BFQ's engine is introduced by this commit, while
hierarchical support is added by next commit. We use the v0 tag to
distinguish this minimal version of BFQ from the versions containing
also the features and the improvements added by next commits. BFQ-v0
coincides with the version of BFQ submitted a few years ago [1], apart
from the introduction of preemption, described below.
BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.
- Each process doing I/O on a device is associated with a weight and a
(bfq_)queue.
- BFQ grants exclusive access to the device, for a while, to one queue
(process) at a time, and implements this service model by
associating every queue with a budget, measured in number of
sectors.
- After a queue is granted access to the device, the budget of the
queue is decremented, on each request dispatch, by the size of the
request.
- The in-service queue is expired, i.e., its service is suspended,
only if one of the following events occurs: 1) the queue finishes
its budget, 2) the queue empties, 3) a "budget timeout" fires.
- The budget timeout prevents processes doing random I/O from
holding the device for too long and dramatically reducing
throughput.
- Actually, as in CFQ, a queue associated with a process issuing
sync requests may not be expired immediately when it empties. In
contrast, BFQ may idle the device for a short time interval,
giving the process the chance to go on being served if it issues
a new request in time. Device idling typically boosts the
throughput on rotational devices, if processes do synchronous
and sequential I/O. In addition, under BFQ, device idling is
also instrumental in guaranteeing the desired throughput
fraction to processes issuing sync requests (see [2] for
details).
- With respect to idling for service guarantees, if several
processes are competing for the device at the same time, but
all processes (and groups, after the following commit) have
the same weight, then BFQ guarantees the expected throughput
distribution without ever idling the device. Throughput is
thus as high as possible in this common scenario.
- Queues are scheduled according to a variant of WF2Q+, named
B-WF2Q+, and implemented using an augmented rb-tree to preserve an
O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
also ready for hierarchical scheduling. However, for a cleaner
logical breakdown, the code that enables and completes
hierarchical support is provided in the next commit, which focuses
exactly on this feature.
- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
perfectly fair, and smooth service. In particular, B-WF2Q+
guarantees that each queue receives a fraction of the device
throughput proportional to its weight, even if the throughput
fluctuates, and regardless of: the device parameters, the current
workload and the budgets assigned to the queue.
- The last, budget-independence, property (although probably
counterintuitive in the first place) is definitely beneficial, for
the following reasons:
- First, with any proportional-share scheduler, the maximum
deviation with respect to an ideal service is proportional to
the maximum budget (slice) assigned to queues. As a consequence,
BFQ can keep this deviation tight not only because of the
accurate service of B-WF2Q+, but also because BFQ *does not*
need to assign a larger budget to a queue to let the queue
receive a higher fraction of the device throughput.
- Second, BFQ is free to choose, for every process (queue), the
budget that best fits the needs of the process, or best
leverages the I/O pattern of the process. In particular, BFQ
updates queue budgets with a simple feedback-loop algorithm that
allows a high throughput to be achieved, while still providing
tight latency guarantees to time-sensitive applications. When
the in-service queue expires, this algorithm computes the next
budget of the queue so as to:
- Let large budgets be eventually assigned to the queues
associated with I/O-bound applications performing sequential
I/O: in fact, the longer these applications are served once
got access to the device, the higher the throughput is.
- Let small budgets be eventually assigned to the queues
associated with time-sensitive applications (which typically
perform sporadic and short I/O), because, the smaller the
budget assigned to a queue waiting for service is, the sooner
B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
- Weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = 10 * (IOPRIO_BE_NR - ioprio).
The next patch provides, instead, a cgroups interface through which
weights can be assigned explicitly.
- If several processes are competing for the device at the same time,
but all processes and groups have the same weight, then BFQ
guarantees the expected throughput distribution without ever idling
the device. It uses preemption instead. Throughput is then much
higher in this common scenario.
- ioprio classes are served in strict priority order, i.e.,
lower-priority queues are not served as long as there are
higher-priority queues. Among queues in the same class, the
bandwidth is distributed in proportion to the weight of each
queue. A very thin extra bandwidth is however guaranteed to the Idle
class, to prevent it from starving.
- If the strict_guarantees parameter is set (default: unset), then BFQ
- always performs idling when the in-service queue becomes empty;
- forces the device to serve one I/O request at a time, by
dispatching a new request only if there is no outstanding
request.
In the presence of differentiated weights or I/O-request sizes,
both the above conditions are needed to guarantee that every
queue receives its allotted share of the bandwidth (see
Documentation/block/bfq-iosched.txt for more details). Setting
strict_guarantees may evidently affect throughput.
[1] https://lkml.org/lkml/2008/4/1/234
https://lkml.org/lkml/2008/11/11/148
[2] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19 21:29:02 +07:00
|
|
|
config IOSCHED_BFQ
|
|
|
|
tristate "BFQ I/O scheduler"
|
|
|
|
default n
|
|
|
|
---help---
|
|
|
|
BFQ I/O scheduler for BLK-MQ. BFQ distributes the bandwidth of
|
|
|
|
of the device among all processes according to their weights,
|
|
|
|
regardless of the device parameters and with any workload. It
|
|
|
|
also guarantees a low latency to interactive and soft
|
|
|
|
real-time applications. Details in
|
|
|
|
Documentation/block/bfq-iosched.txt
|
|
|
|
|
block, bfq: add full hierarchical scheduling and cgroups support
Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.
Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.
Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-12 23:23:08 +07:00
|
|
|
config BFQ_GROUP_IOSCHED
|
|
|
|
bool "BFQ hierarchical scheduling support"
|
|
|
|
depends on IOSCHED_BFQ && BLK_CGROUP
|
|
|
|
default n
|
|
|
|
---help---
|
|
|
|
|
|
|
|
Enable hierarchical scheduling in BFQ, using the blkio
|
|
|
|
(cgroups-v1) or io (cgroups-v2) controller.
|
|
|
|
|
2005-04-17 05:20:36 +07:00
|
|
|
endmenu
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 01:45:40 +07:00
|
|
|
|
|
|
|
endif
|