linux_dsm_epyc7002/Documentation
Paolo Valente aee69d78de block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
We tag as v0 the version of BFQ containing only BFQ's engine plus
hierarchical support. BFQ's engine is introduced by this commit, while
hierarchical support is added by next commit. We use the v0 tag to
distinguish this minimal version of BFQ from the versions containing
also the features and the improvements added by next commits. BFQ-v0
coincides with the version of BFQ submitted a few years ago [1], apart
from the introduction of preemption, described below.

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  (bfq_)queue.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties. In
      contrast, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices, if processes do synchronous
      and sequential I/O. In addition, under BFQ, device idling is
      also instrumental in guaranteeing the desired throughput
      fraction to processes issuing sync requests (see [2] for
      details).

      - With respect to idling for service guarantees, if several
        processes are competing for the device at the same time, but
        all processes (and groups, after the following commit) have
        the same weight, then BFQ guarantees the expected throughput
        distribution without ever idling the device. Throughput is
        thus as high as possible in this common scenario.

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in the next commit, which focuses
    exactly on this feature.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last, budget-independence, property (although probably
    counterintuitive in the first place) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once
        got access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because, the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

- Weights can be assigned to processes only indirectly, through I/O
  priorities, and according to the relation:
  weight = 10 * (IOPRIO_BE_NR - ioprio).
  The next patch provides, instead, a cgroups interface through which
  weights can be assigned explicitly.

- If several processes are competing for the device at the same time,
  but all processes and groups have the same weight, then BFQ
  guarantees the expected throughput distribution without ever idling
  the device. It uses preemption instead. Throughput is then much
  higher in this common scenario.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues.  Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to the Idle
  class, to prevent it from starving.

- If the strict_guarantees parameter is set (default: unset), then BFQ
     - always performs idling when the in-service queue becomes empty;
     - forces the device to serve one I/O request at a time, by
       dispatching a new request only if there is no outstanding
       request.
  In the presence of differentiated weights or I/O-request sizes,
  both the above conditions are needed to guarantee that every
  queue receives its allotted share of the bandwidth (see
  Documentation/block/bfq-iosched.txt for more details). Setting
  strict_guarantees may evidently affect throughput.

[1] https://lkml.org/lkml/2008/4/1/234
    https://lkml.org/lkml/2008/11/11/148

[2] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
							results.pdf

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19 08:29:02 -06:00
..
ABI block: remove the discard_zeroes_data flag 2017-04-08 11:25:38 -06:00
accounting tools: move accounting tool from Documentation 2016-09-23 13:07:15 -06:00
acpi scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
admin-guide Power management updates for v4.11-rc2 2017-03-09 16:30:37 -08:00
aoe
arm arm: sunxi: add support for V3s SoC 2017-01-20 21:31:34 +01:00
arm64 irqchip/gicv3-its: Add workaround for QDF2400 ITS erratum 0065 2017-03-07 14:34:27 +00:00
auxdisplay samples: move auxdisplay example code from Documentation 2016-09-23 11:52:32 -06:00
backlight
blackfin samples: move blackfin gptimers-example from Documentation 2016-10-10 07:12:02 -06:00
block block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler 2017-04-19 08:29:02 -06:00
blockdev remove the mg_disk driver 2017-04-14 14:00:49 -06:00
bus-devices
cdrom cdrom: Make device operations read-only 2017-02-14 08:29:56 -07:00
cgroup-v1 Merge branch 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2017-02-27 21:41:08 -08:00
cma
connector samples: connector: from Documentation to samples directory 2016-04-28 07:47:35 -06:00
console
core-api Documentation: Update CPU hotplug and move it to core-api 2017-01-13 10:32:32 -07:00
cpu-freq A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
cpuidle
cris
crypto crypto: doc - fix typo 2017-02-15 13:23:49 +08:00
dev-tools scripts/spelling.txt: add "disble(d)" pattern and fix typo instances 2017-03-09 17:01:09 -08:00
device-mapper scripts/spelling.txt: add "explictely" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
devicetree USB/PHY fixes for 4.11-rc4 2017-03-26 10:52:52 -07:00
dmaengine dmaengine: Documentation: Fix typo in pxa_dma.txt 2016-11-14 08:14:24 +05:30
doc-guide docs-rst: parse-headers.pl: cleanup the documentation 2016-11-30 17:08:09 -07:00
DocBook A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
driver-api Less anger inducing pull request for 4.11 2017-02-23 18:58:18 -08:00
driver-model irqdesc: Add a resource managed version of irq_alloc_descs() 2017-02-10 14:39:20 +01:00
early-userspace
EDID
extcon extcon: int3496: Add GPIO ACPI mapping table 2017-03-22 18:29:46 +09:00
fault-injection
fb Documentation: fb: fix spelling mistakes 2016-05-10 12:05:27 +03:00
features 2nd round of ARC udpates for 4.10rc1 2016-12-23 10:22:47 -08:00
filesystems statx: Add a system call to make enhanced file info available 2017-03-02 20:51:15 -05:00
firmware_class firmware: revamp firmware documentation 2017-01-11 09:42:59 +01:00
fmc
fpga fpga: Add scatterlist based programming 2017-02-10 15:20:44 +01:00
frv docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
gpio gpio: random documentation update 2017-01-31 15:43:05 +01:00
gpu Less anger inducing pull request for 4.11 2017-02-23 18:58:18 -08:00
hid Documentation: HID: Intel ISH HID document 2016-08-17 11:13:07 +02:00
hwmon A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
i2c i2c: i801: Add support for Intel Gemini Lake 2017-02-09 17:39:16 +01:00
ia64 selftests: move ia64 tests from Documentation/ia64 2016-09-20 09:58:12 -06:00
ide
iio iio: Documentation: Correct the path used to create triggers. 2016-10-01 00:49:58 -06:00
infiniband IB/hfi1: Document new sysfs entries for hfi1 driver 2016-10-02 08:42:19 -04:00
input Documentation: input: fix path to input code definitions 2017-02-12 15:19:00 -07:00
ioctl rpmsg updates for v4.11 2017-02-23 09:41:03 -08:00
isdn docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
kbuild Kconfig: Introduce the "imply" keyword 2016-11-16 09:26:33 +01:00
kdump Documentation: kdump: Add description of enable multi-cpus support 2016-09-20 18:02:54 -06:00
laptops platform/x86: thinkpad_acpi: Add support for X1 Yoga (2016) Tablet Mode 2016-12-13 09:29:06 -08:00
leds leds: class: Add new optional brightness_hw_changed attribute 2017-01-29 19:59:42 +01:00
lightnvm lightnvm: physical block device (pblk) target 2017-04-16 10:06:33 -06:00
livepatch A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
locking locking/ww_mutex/Documentation: Update the design document 2017-01-14 11:14:55 +01:00
m68k docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
md MD: add doc for raid5-cache 2017-02-13 09:17:54 -08:00
media A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
memory-devices
metag
mic samples: move mic/mpssd example code from Documentation 2016-09-20 12:38:48 -06:00
mips
misc-devices samples: move misc-devices/mei example code from Documentation 2016-09-23 11:51:43 -06:00
mmc mmc: core: Extend sysfs with DSR register 2016-07-25 10:34:51 +02:00
mn10300
mtd spi-nor: Add support for Intel SPI serial flash controller 2017-01-03 17:33:36 +00:00
namespaces
netlabel
networking Make IP 'forwarding' doc more precise 2017-03-12 23:28:43 -07:00
nfc
nios2
nvdimm libnvdimm, btt: update the usage section in Documentation 2016-06-17 16:23:23 -07:00
nvmem
parisc
PCI A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
pcmcia tools: move pcmcia crc32hash tool from Documentation 2016-09-23 13:07:27 -06:00
perf perf: add qcom l2 cache perf events driver 2017-02-08 19:32:24 +00:00
phy
platform
power More power management updates for v4.11-rc1 2017-03-02 17:33:52 -08:00
powerpc powerpc updates for 4.9 2016-10-07 20:19:31 -07:00
pps Doc: clarify source of jitter in USB1.1, and USB2.0 2017-01-04 14:40:52 -07:00
prctl selftests: move prctl tests from Documentation/prctl 2016-09-20 09:09:09 -06:00
process Doc: Correct typo, "Introdution" => "Introduction" 2016-12-01 10:44:08 -07:00
pti
ptp selftests: move ptp tests from Documentation/ptp 2016-09-20 09:54:38 -06:00
rapidio rapidio/documentation/mport_cdev: add missing parameter description 2016-09-01 17:52:02 -07:00
RCU Merge branches 'doc.2017.01.15b', 'dyntick.2017.01.23a', 'fixes.2017.01.23a', 'srcu.2017.01.25a' and 'torture.2017.01.15b' into HEAD 2017-01-25 12:56:05 -08:00
s390 Documentation: Update path to sysrq.txt 2017-03-03 15:48:38 -07:00
scheduler sched/Documentation/sched-rt-group: Fix incorrect example 2017-01-22 10:34:17 +01:00
scsi scripts/spelling.txt: add "varible" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
security KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload() 2017-03-02 10:09:00 +11:00
serial Documentation: rs485: Do not define manually the ioctl 2016-08-18 11:08:33 -06:00
sh
sound scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
sparc Documentation/sparc: Steps for sending break on sunhv console 2017-02-23 08:27:25 -08:00
sphinx docs: sphinx-extensions: make rstFlatTable work with docutils 0.13 2016-12-18 13:30:29 -07:00
sphinx-static This is the documentation update pull for the 4.9 merge window. 2016-10-04 13:54:07 -07:00
spi spi: spi-ep93xx: simplify GPIO chip selects 2017-02-16 20:10:26 +00:00
sysctl A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
target target: make close_session optional 2016-05-10 01:19:26 -07:00
thermal Documentation: fix spelling mistakes of "Celcius" -- > "Celsius" 2017-01-04 14:36:17 -07:00
timers time: Remove CONFIG_TIMER_STATS 2017-02-10 11:15:08 +01:00
trace perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS 2017-03-01 10:26:39 +01:00
translations doc/ko_KR/memory-barriers: Update control-dependencies section 2017-03-03 15:54:55 -07:00
usb A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
virtual KVM: Documentation: document MCE ioctls 2017-03-20 16:25:06 +01:00
vm userfaultfd: non-cooperative: rollback userfaultfd_exit 2017-03-09 17:01:09 -08:00
w1 w1: add ability to set (SRAM) and store (EEPROM) configuration for temp sensors like DS18B20 2016-05-01 14:37:49 -07:00
watchdog watchdog: Introduce watchdog_stop_on_unregister helper 2017-02-24 14:00:23 -08:00
wimax
x86 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-02-28 11:46:00 -08:00
xtensa xtensa: cleanup MMU setup and kernel layout macros 2016-07-24 06:33:58 +03:00
.gitignore Add .pyc files to .gitignore 2016-06-30 13:07:33 -06:00
00-INDEX Documentation: move MD related doc into a separate dir 2017-02-13 09:17:53 -08:00
bcache.txt bcache: documentation formatting, edited for clarity, stripe alignment notes 2016-06-23 07:58:38 -06:00
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
cachetlb.txt
cgroup-v2.txt cgroup: Fix indenting in PID controller documentation 2017-03-06 14:46:27 -05:00
Changes docs: add back 'Documentation/Changes' file (as symlink) 2016-12-14 16:30:12 -08:00
circular-buffers.txt Documentation: circular-buffers: use READ_ONCE() 2016-11-16 16:17:45 -07:00
clk.txt Documentation: clk: update file names containing referenced structures 2016-08-14 12:12:36 -06:00
CodingStyle doc: re-add CodingStyle and SubmittingPatches 2016-10-24 08:12:35 -02:00
conf.py Documentation/sphinx: fix primary_domain configuration 2017-03-03 16:12:30 -07:00
cpu-load.txt
cputopology.txt topology/sysfs: provide drawer id and siblings attributes 2016-06-13 15:58:27 +02:00
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
digsig.txt
DMA-API-HOWTO.txt Documentation: DMA-API-HOWTO: Fix a typo 2016-09-20 17:58:46 -06:00
DMA-API.txt dma-mapping: add dma_{map,unmap}_resource 2016-09-26 22:16:41 +05:30
DMA-attributes.txt common: DMA-mapping: add DMA_ATTR_PRIVILEGED attribute 2017-01-19 15:56:19 +00:00
DMA-ISA-LPC.txt Documentation: DMA-ISA-LPC.txt 2017-02-12 15:20:07 -07:00
docutils.conf doc-rst: add docutils config file 2016-08-14 11:52:40 -06:00
dontdiff Documentation: dontdiff: Update with additional entries 2017-01-26 15:08:06 -07:00
efi-stub.txt
eisa.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcc-plugins.txt gcc-plugins: update architecture list in documentation 2017-03-21 22:20:05 +11:00
highuid.txt
hw_random.txt
hwspinlock.txt
index.rst docs/zh_CN: Add coding-style into docs build system 2017-01-26 15:30:34 -07:00
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt Documentation: Fix a typo in IPMI.txt. 2017-01-05 15:01:54 -06:00
IRQ-affinity.txt
IRQ-domain.txt Documentation/IRQ-domain.txt: Document irq_domain_create_{linear, tree} 2016-03-31 00:32:59 -06:00
IRQ.txt
irqflags-tracing.txt
isa.txt Documentation: Add ISA bus driver documentation 2016-05-02 09:32:04 -07:00
isapnp.txt
kernel-doc-nano-HOWTO.txt docs-rst: doc-guide: split the kernel-documentation.rst contents 2016-11-19 10:22:04 -07:00
kernel-per-CPU-kthreads.txt docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
kobject.txt
kprobes.txt Documentation: kprobes: Document jprobes stack copying limitations 2016-08-15 10:19:11 -06:00
kref.txt
kselftest.txt scripts/spelling.txt: add "an user" pattern and fix typo instances 2017-02-27 18:43:46 -08:00
ldm.txt
lockup-watchdogs.txt docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
logo.gif
logo.txt
lzo.txt Documentation: lzo: fix spelling mistakes 2016-04-28 07:23:11 -06:00
mailbox.txt
Makefile samples: move blackfin gptimers-example from Documentation 2016-10-10 07:12:02 -06:00
Makefile.sphinx Add a target to check broken external links in the Documentation 2017-02-15 15:22:47 -07:00
memory-barriers.txt doc: Update control-dependencies section of memory-barriers.txt 2017-01-14 21:29:15 -08:00
memory-hotplug.txt scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
men-chameleon-bus.txt
nommu-mmap.txt
ntb.txt
numastat.txt
padata.txt
parport-lowlevel.txt
percpu-rw-semaphore.txt
phy.txt phy: core: Allow children node to be overridden 2016-04-29 16:39:39 +02:00
pi-futex.txt
pinctrl.txt pinctrl: core: Fix regression caused by delayed work for hogs 2017-01-13 16:25:17 +01:00
pnp.txt
preempt-locking.txt
printk-formats.txt
pwm.txt pwm: Update documentation 2016-05-17 14:48:04 +02:00
rbtree.txt
remoteproc.txt remoteproc: Split driver and consumer dereferencing 2016-10-02 22:50:21 -07:00
rfkill.txt docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
robust-futex-ABI.txt
robust-futexes.txt Documentation: robust-futexes: fix spelling mistakes 2016-04-28 07:26:41 -06:00
rpmsg.txt rpmsg: use module_rpmsg_driver in existing drivers and examples 2016-05-06 11:09:01 -07:00
rtc.txt
SAK.txt
sgi-ioc4.txt
siphash.txt siphash: implement HalfSipHash1-3 for hash tables 2017-01-09 13:58:57 -05:00
SM501.txt
smsc_ece1099.txt
static-keys.txt jump_label: Reduce the size of struct static_key 2017-02-15 09:02:26 -05:00
SubmittingPatches doc: re-add CodingStyle and SubmittingPatches 2016-10-24 08:12:35 -02:00
svga.txt
sync_file.txt dma-buf: Rename struct fence to dma_fence 2016-10-25 14:40:39 +02:00
this_cpu_ops.txt
unaligned-memory-access.txt Documentation/unaligned-memory-access.txt: fix incorrect comparison operator 2016-12-27 13:08:42 -07:00
unshare.txt
vfio-mediated-device.txt vfio-mdev: Make mdev_parent private 2016-12-30 08:13:41 -07:00
vfio.txt
video-output.txt
xillybus.txt Documentation: xillybus: fix spelling mistake 2016-04-28 07:44:54 -06:00
xz.txt
zorro.txt