-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
J0oh0FHBDg==
=LRTT
-----END PGP SIGNATURE-----
Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"This is the main pull request for block changes for 4.20. This
contains:
- Series enabling runtime PM for blk-mq (Bart).
- Two pull requests from Christoph for NVMe, with items such as;
- Better AEN tracking
- Multipath improvements
- RDMA fixes
- Rework of FC for target removal
- Fixes for issues identified by static checkers
- Fabric cleanups, as prep for TCP transport
- Various cleanups and bug fixes
- Block merging cleanups (Christoph)
- Conversion of drivers to generic DMA mapping API (Christoph)
- Series fixing ref count issues with blkcg (Dennis)
- Series improving BFQ heuristics (Paolo, et al)
- Series improving heuristics for the Kyber IO scheduler (Omar)
- Removal of dangerous bio_rewind_iter() API (Ming)
- Apply single queue IPI redirection logic to blk-mq (Ming)
- Set of fixes and improvements for bcache (Coly et al)
- Series closing a hotplug race with sysfs group attributes (Hannes)
- Set of patches for lightnvm:
- pblk trace support (Hans)
- SPDX license header update (Javier)
- Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
specs behind a common core interface. (Javier, Matias)
- Enable pblk to use a common interface to retrieve chunk metadata
(Matias)
- Bug fixes (Various)
- Set of fixes and updates to the blk IO latency target (Josef)
- blk-mq queue number updates fixes (Jianchao)
- Convert a bunch of drivers from the old legacy IO interface to
blk-mq. This will conclude with the removal of the legacy IO
interface itself in 4.21, with the rest of the drivers (me, Omar)
- Removal of the DAC960 driver. The SCSI tree will introduce two
replacement drivers for this (Hannes)"
* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
block: setup bounce bio_sets properly
blkcg: reassociate bios when make_request() is called recursively
blkcg: fix edge case for blk_get_rl() under memory pressure
nvme-fabrics: move controller options matching to fabrics
nvme-rdma: always have a valid trsvcid
mtip32xx: fully switch to the generic DMA API
rsxx: switch to the generic DMA API
umem: switch to the generic DMA API
sx8: switch to the generic DMA API
sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
skd: switch to the generic DMA API
ubd: remove use of blk_rq_map_sg
nvme-pci: remove duplicate check
drivers/block: Remove DAC960 driver
nvme-pci: fix hot removal during error handling
nvmet-fcloop: suppress a compiler warning
nvme-core: make implicit seed truncation explicit
nvmet-fc: fix kernel-doc headers
nvme-fc: rework the request initialization code
nvme-fc: introduce struct nvme_fcp_op_w_sgl
...
We're only setting up the bounce bio sets if we happen
to need bouncing for regular HIGHMEM, not if we only need
it for ISA devices.
Protect the ISA bounce setup with a mutex, since it's
being invoked from driver init functions and can thus be
called in parallel.
Cc: stable@vger.kernel.org
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Tested-by: Ondrej Zary <linux@rainbow-software.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When submitting a bio, multiple recursive calls to make_request() may
occur. This causes the initial associate done in blkcg_bio_issue_check()
to be incorrect and reference the prior request_queue. This introduces
a helper to do reassociation when make_request() is recursively called.
Fixes: a7b39b4e96 ("blkcg: always associate a bio with a blkg")
Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_queue_split() does respect this limit via bio splitting, so no
need to do that in blkdev_issue_discard(), then we can align to
normal bio submit(bio_add_page() & submit_bio()).
More importantly, this patch fixes one issue introduced in a22c4d7e34
("block: re-add discard_granularity and alignment checks"), in which
zero discard bio may be generated in case of zero alignment.
Fixes: a22c4d7e34 ("block: re-add discard_granularity and alignment checks")
Cc: stable@vger.kernel.org
Cc: Ming Lin <ming.l@ssi.samsung.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Tested-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This pattern is repeated throughout all the blk-mq conversions.
Provide a basic helper to get it done.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We just allocated the queue and haven't even set it up yet,
hence we know that checking if ->mq_ops is NULL is always
going to be true.
In fact we do need to assign a lock to ->queue_lock always,
as we need it for the queue flags modifications.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the hw queues and mq_map are updated, a hctx could be mapped
to a different numa node. At this moment, we need to realloc the
hctx. If fail to do that, go on using previous hctx.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_realloc_hw_ctxs could be invoked during update hw queues.
At the momemt, IO is blocked. Change the gfp flags from GFP_KERNEL
to GFP_NOIO to avoid forever hang during memory allocation in
blk_mq_realloc_hw_ctxs.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-mq debugfs and sysfs entries need to be removed before updating
queue map, otherwise, we get get wrong result there. This patch fixes
it and remove the redundant debugfs and sysfs register/unregister
operations during __blk_mq_update_nr_hw_queues.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bfq defines as asymmetric a scenario where an active entity, say E
(representing either a single bfq_queue or a group of other entities),
has a higher weight than some other entities. If the entity E does sync
I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
other entities in the following situation: E is in service but
temporarily has no pending I/O request. In fact, without this plugging,
all the times that E stops being temporarily idle, it may find the
internal queues of the storage device already filled with an
out-of-control number of extra requests, from other entities. So E may
have to wait for the service of these extra requests, before finally
having its own requests served. This may easily break service
guarantees, with E getting less than its fair share of the device
throughput. Usually, the end result is that E gets the same fraction of
the throughput as the other entities, instead of getting more, according
to its higher weight.
Yet there are two other more subtle cases where E, even if its weight is
actually equal to or even lower than the weight of any other active
entities, may get less than its fair share of the throughput in case the
above I/O plugging is not performed:
1. other entities issue larger requests than E;
2. other entities contain more active child entities than E (or in
general tend to have more backlog than E).
In the first case, other entities may get more service than E because
they get larger requests, than those of E, served during the temporary
idle periods of E. In the second case, other entities get more service
because, by having many child entities, they have many requests ready
for dispatching while E is temporarily idle.
This commit addresses this issue by extending the definition of
asymmetric scenario: a scenario is asymmetric when
- active entities representing bfq_queues have differentiated weights,
as in the original definition
or (inclusive)
- one or more entities representing groups of entities are active.
This broader definition makes sure that I/O plugging will be performed
in all the above cases, provided that there is at least one active
group. Of course, this definition is very coarse, so it will trigger
I/O plugging also in cases where it is not needed, such as, e.g.,
multiple active entities with just one child each, and all with the same
I/O-request size. The reason for this coarse definition is just that a
finer-grained definition would be rather heavy to compute.
On the opposite end, even this new definition does not trigger I/O
plugging in all cases where there is no active group, and all bfq_queues
have the same weight. So, in these cases some unfairness may occur if
there are asymmetries in I/O-request sizes. We made this choice because
I/O plugging may lower throughput, and probably a user that has not
created any group cares more about throughput than about perfect
fairness. At any rate, as for possible applications that may care about
service guarantees, bfq already guarantees a high responsiveness and a
low latency to soft real-time applications automatically.
Signed-off-by: Federico Motta <federico@willer.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tetsuo brought to my attention that I screwed up the scale_up/scale_down
helpers when I factored out the rq-qos code. We need to wake up all the
waiters when we add slots for requests to make, not when we shrink the
slots. Otherwise we'll end up things waiting forever. This was a
mistake and simply puts everything back the way it was.
cc: stable@vger.kernel.org
Fixes: a79050434b ("blk-rq-qos: refactor out common elements of blk-wbt")
eported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ is already doing a similar thing in its .pd_offline_fn() method
implementation.
While it seems that after commit 4c6994806f
("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
was reverted leaving these pointers intact no longer causes crashes
clearing them is still a sensible thing to do to make the code more robust.
Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'default n' is the default value for any bool or tristate Kconfig
setting so there is no need to write it explicitly.
Also since commit f467c5640c ("kconfig: only write '# CONFIG_FOO
is not set' for visible symbols") the Kconfig behavior is the same
regardless of 'default n' being present or not:
...
One side effect of (and the main motivation for) this change is making
the following two definitions behave exactly the same:
config FOO
bool
config FOO
bool
default n
With this change, neither of these will generate a
'# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied).
That might make it clearer to people that a bare 'default n' is
redundant.
...
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lot of controllers may have only one irq vector for completing IO
request. And usually affinity of the only irq vector is all possible
CPUs, however, on most of ARCH, there may be only one specific CPU
for handling this interrupt.
So if all IOs are completed in hardirq context, it is inevitable to
degrade IO performance because of increased irq latency.
This patch tries to address this issue by allowing to complete request
in softirq context, like the legacy IO path.
IOPS is observed as ~13%+ in the following randread test on raid0 over
virtio-scsi.
mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Zach Marano <zmarano@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When debugging e.g. the SCSI timeout handler it is important that
requests that have not yet been started or that already have
completed are also reported through debugfs.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
=dMl4
-----END PGP SIGNATURE-----
Merge tag 'v4.19-rc6' into for-4.20/block
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We apply a smoothing to the scale changes in order to keep sawtoothy
behavior from occurring. However our window for checking if we've
missed our target can sometimes be lower than the smoothing interval
(500ms), especially on faster drives like ssd's. In order to deal with
this keep track of the running tally of the previous intervals that we
threw away because we had already done a scale event recently.
This is needed for the ssd case as these low latency drives will have
bursts of latency, and if it happens to be ok for the window that
directly follows the opening of the scale window we could unthrottle
when previous windows we were missing our target.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use an average latency approach for determining if we're missing our
latency target. This works well for rotational storage where we have
generally consistent latencies, but for ssd's and other low latency
devices you have more of a spikey behavior, which means we often won't
throttle misbehaving groups because a lot of IO completes at drastically
faster times than our latency target. Instead keep track of how many
IO's miss our target and how many IO's are done in our time window. If
the p(90) latency is above our target then we know we need to throttle.
With this change in place we are seeing the same throttling behavior
with our testcase on ssd's as we see with rotational drives.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is logic to keep cgroups that haven't done a lot of IO in the most
recent scale window from being punished for over-active higher priority
groups. However for things like ssd's where the windows are pretty
short we'll end up with small numbers of samples, so 5% of samples will
come out to 0 if there aren't enough. Make the floor 1 sample to keep
us from improperly bailing out of scaling down.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hitting the case where blk_queue_depth() returned 1 uncovered the fact
that iolatency doesn't actually handle this case properly, it simply
doesn't scale down anybody. For this case we should go straight into
applying the time delay, which we weren't doing. Since we already limit
the floor at 1 request this if statement is not needed, and this allows
us to set our depth to 1 which allows us to apply the delay if needed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We were using blk_queue_depth() assuming that it would return
nr_requests, but we hit a case in production on drives that had to have
NCQ turned off in order for them to not shit the bed which resulted in a
qd of 1, even though the nr_requests was much larger. iolatency really
only cares about requests we are allowed to queue up, as any io that
get's onto the request list is going to be serviced soonish, so we want
to be throttling before the bio gets onto the request list. To make
iolatency work as expected, simply use q->nr_requests instead of
blk_queue_depth() as that is what we actually care about.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
32-bit long. Make sure all of the targets are calculated as 64-bit
values.
Fixes: 6e25cb01ea ("kyber: implement improved heuristics")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Update device_add_disk() to take an 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().
Signed-off-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When debugging Kyber, it's really useful to know what latencies we've
been having, how the domain depths have been adjusted, and if we've
actually been throttling. Add three tracepoints, kyber_latency,
kyber_adjust, and kyber_throttled, to record that.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kyber's current heuristics have a few flaws:
- It's based on the mean latency, but p99 latency tends to be more
meaningful to anyone who cares about latency. The mean can also be
skewed by rare outliers that the scheduler can't do anything about.
- The statistics calculations are purely time-based with a short window.
This works for steady, high load, but is more sensitive to outliers
with bursty workloads.
- It only considers the latency once an I/O has been submitted to the
device, but the user cares about the time spent in the kernel, as
well.
These are shortcomings of the generic blk-stat code which doesn't quite
fit the ideal use case for Kyber. So, this replaces the statistics with
a histogram used to calculate percentiles of total latency and I/O
latency, which we then use to adjust depths in a slightly more
intelligent manner:
- Sync and async writes are now the same domain.
- Discards are a separate domain.
- Domain queue depths are scaled by the ratio of the p99 total latency
to the target latency (e.g., if the p99 latency is double the target
latency, we will double the queue depth; if the p99 latency is half of
the target latency, we can halve the queue depth).
- We use the I/O latency to determine whether we should scale queue
depths down: we will only scale down if any domain's I/O latency
exceeds the target latency, which is an indicator of congestion in the
device.
These new heuristics are just as scalable as the heuristics they
replace.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The domain token sbitmaps are currently initialized to the device queue
depth or 256, whichever is larger, and immediately resized to the
maximum depth for that domain (256, 128, or 64 for read, write, and
other, respectively). The sbitmap is never resized larger than that, so
it's unnecessary to allocate a bitmap larger than the maximum depth.
Let's just allocate it to the maximum depth to begin with. This will use
marginally less memory, and more importantly, give us a more appropriate
number of bits per sbitmap word.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kyber will need this in a future change if it is built as a module.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 4bc6339a58 ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
trace_block_unplug() takes true for explicit unplugs and false for
implicit unplugs. schedule() unplugs are implicit and should be
reported as timer unplugs. While correct in the legacy code, this has
been inverted in blk-mq since 4.11.
Cc: stable@vger.kernel.org
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the deadline scheduler is used with a zoned block device, writes
to a zone will be dispatched one at a time. This causes the warning
message:
deadline: forced dispatching is broken (nr_sorted=X), please report this
to be displayed when switching to another elevator with the legacy I/O
path while write requests to a zone are being retained in the scheduler
queue.
Prevent this message from being displayed when executing
elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
loop until all writes are dispatched and completed, resulting in the
desired elevator queue drain without extensive modifications to the
deadline code itself to handle forced-dispatch calls.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: 8dc8146f9c ("deadline-iosched: Introduce zone locking support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable
runtime power management for blk-mq.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
For blk-mq, instead of maintaining the q->nr_pending counter, rely
on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A later patch will call blk_freeze_queue_start() followed by
blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop
to zero. Make sure that this doesn't cause a kernel warning to appear
by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The
former namely requires that the refcount it operates on is zero.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
two new functions and thereby separate legacy block layer code from code
that works for both the legacy block layer and blk-mq. A later patch will
add calls to the new functions in the blk-mq code.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
<linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file <linux/blkdev.h>
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Take the Xen check into the core code instead of delegating it to
the architectures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.
Fixes: f5bbbbe4d6 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only use it in biovec_phys_mergeable and a m68k paravirt driver,
so just opencode it there. Also remove the pointless unsigned long cast
for the offset in the opencoded instances.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These two checks should always be performed together, so merge them into
a single helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The actual recaculation of segments in __blk_recalc_rq_segments will
do this check, so there is no point in forcing it if we know it won't
succeed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Turn the macro into an inline, move it to blk.h and simplify the
arch hooks a bit.
Also rename the function to biovec_phys_mergeable as there is no need
to shout.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keep it close to the actual users instead of exposing the function to all
drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make it easier to understand the purpose of the functions that iterate
over requests by documenting their purpose. Fix several minor spelling
and grammer mistakes in comments in these functions.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg reference counting now uses percpu_ref rather than atomic_t. Let's
make this consistent with css_tryget. This renames blkg_try_get to
blkg_tryget and now returns a bool rather than the blkg or NULL.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that every bio is associated with a blkg, this puts the use of
blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over
the refcnt in blkg to use percpu_ref.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>