License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 21:07:57 +07:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2008-01-29 20:51:59 +07:00
|
|
|
#ifndef BLK_INTERNAL_H
|
|
|
|
#define BLK_INTERNAL_H
|
|
|
|
|
2011-12-14 06:33:37 +07:00
|
|
|
#include <linux/idr.h>
|
2014-09-25 22:23:47 +07:00
|
|
|
#include <linux/blk-mq.h>
|
2020-03-25 22:48:42 +07:00
|
|
|
#include <linux/part_stat.h>
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 07:37:18 +07:00
|
|
|
#include <linux/blk-crypto.h>
|
2018-09-26 03:30:08 +07:00
|
|
|
#include <xen/xen.h>
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 07:37:18 +07:00
|
|
|
#include "blk-crypto-internal.h"
|
2014-09-25 22:23:47 +07:00
|
|
|
#include "blk-mq.h"
|
block: free sched's request pool in blk_cleanup_queue
In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.
However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags is definitely host wide,
and doesn't belong to any request queue, same with its request pool.
So we need tagset instance for freeing request of sched tags.
Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
tags to be freed before calling blk_mq_free_tag_set().
Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue for simplying the fast
path in generic_make_request(), then causes oops during freeing requests
of sched tags in __blk_release_queue().
Fix the above issue by move freeing request pool of sched tags into
blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
in-queue requests at that time. Freeing sched tags has to be kept in queue's
release handler becasue there might be un-completed dispatch activity
which might refer to sched tags.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-04 20:08:02 +07:00
|
|
|
#include "blk-mq-sched.h"
|
2011-12-14 06:33:37 +07:00
|
|
|
|
2014-05-14 04:10:52 +07:00
|
|
|
/* Max future timer expiry for timeouts */
|
|
|
|
#define BLK_MAX_TIMEOUT (5 * HZ)
|
|
|
|
|
2017-02-01 05:53:20 +07:00
|
|
|
extern struct dentry *blk_debugfs_root;
|
|
|
|
|
2014-09-25 22:23:43 +07:00
|
|
|
struct blk_flush_queue {
|
|
|
|
unsigned int flush_pending_idx:1;
|
|
|
|
unsigned int flush_running_idx:1;
|
block: fix null pointer dereference in blk_mq_rq_timed_out()
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:
[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
The bug is caused by the race between timeout handle and completion for
flush request.
When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.
However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.
We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-27 15:19:55 +07:00
|
|
|
blk_status_t rq_status;
|
2014-09-25 22:23:43 +07:00
|
|
|
unsigned long flush_pending_since;
|
|
|
|
struct list_head flush_queue[2];
|
|
|
|
struct list_head flush_data_in_flight;
|
|
|
|
struct request *flush_rq;
|
2015-08-09 14:41:51 +07:00
|
|
|
|
2019-12-18 07:24:35 +07:00
|
|
|
struct lock_class_key key;
|
2014-09-25 22:23:43 +07:00
|
|
|
spinlock_t mq_flush_lock;
|
|
|
|
};
|
|
|
|
|
2020-08-28 09:52:56 +07:00
|
|
|
enum bio_merge_status {
|
|
|
|
BIO_MERGE_OK,
|
|
|
|
BIO_MERGE_NONE,
|
|
|
|
BIO_MERGE_FAILED,
|
|
|
|
};
|
|
|
|
|
2008-01-29 20:51:59 +07:00
|
|
|
extern struct kmem_cache *blk_requestq_cachep;
|
|
|
|
extern struct kobj_type blk_queue_ktype;
|
2011-12-14 06:33:37 +07:00
|
|
|
extern struct ida blk_queue_ida;
|
2008-01-29 20:51:59 +07:00
|
|
|
|
2018-10-30 02:11:38 +07:00
|
|
|
static inline struct blk_flush_queue *
|
|
|
|
blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
|
2014-09-25 22:23:43 +07:00
|
|
|
{
|
2019-01-24 17:25:32 +07:00
|
|
|
return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx)->fq;
|
2014-09-25 22:23:43 +07:00
|
|
|
}
|
|
|
|
|
2011-12-14 06:33:38 +07:00
|
|
|
static inline void __blk_get_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
kobject_get(&q->kobj);
|
|
|
|
}
|
|
|
|
|
block: fix null pointer dereference in blk_mq_rq_timed_out()
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:
[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
The bug is caused by the race between timeout handle and completion for
flush request.
When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.
However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.
We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-27 15:19:55 +07:00
|
|
|
static inline bool
|
|
|
|
is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
return hctx->fq->flush_rq == req;
|
|
|
|
}
|
|
|
|
|
2020-03-10 04:41:37 +07:00
|
|
|
struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
|
|
|
|
gfp_t flags);
|
2014-09-25 22:23:47 +07:00
|
|
|
void blk_free_flush_queue(struct blk_flush_queue *q);
|
2014-09-25 22:23:40 +07:00
|
|
|
|
2015-10-22 00:20:12 +07:00
|
|
|
void blk_freeze_queue(struct request_queue *q);
|
|
|
|
|
2018-09-24 14:43:52 +07:00
|
|
|
static inline bool biovec_phys_mergeable(struct request_queue *q,
|
|
|
|
struct bio_vec *vec1, struct bio_vec *vec2)
|
2018-09-24 14:43:50 +07:00
|
|
|
{
|
2018-09-24 14:43:52 +07:00
|
|
|
unsigned long mask = queue_segment_boundary(q);
|
2018-09-24 14:43:53 +07:00
|
|
|
phys_addr_t addr1 = page_to_phys(vec1->bv_page) + vec1->bv_offset;
|
|
|
|
phys_addr_t addr2 = page_to_phys(vec2->bv_page) + vec2->bv_offset;
|
2018-09-24 14:43:52 +07:00
|
|
|
|
|
|
|
if (addr1 + vec1->bv_len != addr2)
|
2018-09-24 14:43:50 +07:00
|
|
|
return false;
|
2019-03-29 14:07:54 +07:00
|
|
|
if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2->bv_page))
|
2018-09-24 14:43:50 +07:00
|
|
|
return false;
|
2018-09-24 14:43:52 +07:00
|
|
|
if ((addr1 | mask) != ((addr2 + vec2->bv_len - 1) | mask))
|
|
|
|
return false;
|
2018-09-24 14:43:50 +07:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2018-09-24 14:43:49 +07:00
|
|
|
static inline bool __bvec_gap_to_prev(struct request_queue *q,
|
|
|
|
struct bio_vec *bprv, unsigned int offset)
|
|
|
|
{
|
2018-11-07 20:58:14 +07:00
|
|
|
return (offset & queue_virt_boundary(q)) ||
|
2018-09-24 14:43:49 +07:00
|
|
|
((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if adding a bio_vec after bprv with offset would create a gap in
|
|
|
|
* the SG list. Most drivers don't care about this, but some do.
|
|
|
|
*/
|
|
|
|
static inline bool bvec_gap_to_prev(struct request_queue *q,
|
|
|
|
struct bio_vec *bprv, unsigned int offset)
|
|
|
|
{
|
|
|
|
if (!queue_virt_boundary(q))
|
|
|
|
return false;
|
|
|
|
return __bvec_gap_to_prev(q, bprv, offset);
|
|
|
|
}
|
|
|
|
|
2019-06-06 17:29:04 +07:00
|
|
|
static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
|
|
|
|
unsigned int nr_segs)
|
|
|
|
{
|
|
|
|
rq->nr_phys_segments = nr_segs;
|
|
|
|
rq->__data_len = bio->bi_iter.bi_size;
|
|
|
|
rq->bio = rq->biotail = bio;
|
|
|
|
rq->ioprio = bio_prio(bio);
|
|
|
|
|
|
|
|
if (bio->bi_disk)
|
|
|
|
rq->rq_disk = bio->bi_disk;
|
|
|
|
}
|
|
|
|
|
2015-10-22 00:20:23 +07:00
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
void blk_flush_integrity(void);
|
2017-07-04 05:58:43 +07:00
|
|
|
bool __bio_integrity_endio(struct bio *);
|
2019-12-05 09:09:01 +07:00
|
|
|
void bio_integrity_free(struct bio *bio);
|
2017-07-04 05:58:43 +07:00
|
|
|
static inline bool bio_integrity_endio(struct bio *bio)
|
|
|
|
{
|
|
|
|
if (bio_integrity(bio))
|
|
|
|
return __bio_integrity_endio(bio);
|
|
|
|
return true;
|
|
|
|
}
|
2018-09-24 14:43:47 +07:00
|
|
|
|
|
|
|
static inline bool integrity_req_gap_back_merge(struct request *req,
|
|
|
|
struct bio *next)
|
|
|
|
{
|
|
|
|
struct bio_integrity_payload *bip = bio_integrity(req->bio);
|
|
|
|
struct bio_integrity_payload *bip_next = bio_integrity(next);
|
|
|
|
|
|
|
|
return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
|
|
|
|
bip_next->bip_vec[0].bv_offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool integrity_req_gap_front_merge(struct request *req,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
struct bio_integrity_payload *bip = bio_integrity(bio);
|
|
|
|
struct bio_integrity_payload *bip_next = bio_integrity(req->bio);
|
|
|
|
|
|
|
|
return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
|
|
|
|
bip_next->bip_vec[0].bv_offset);
|
|
|
|
}
|
2020-03-25 22:48:41 +07:00
|
|
|
|
|
|
|
void blk_integrity_add(struct gendisk *);
|
|
|
|
void blk_integrity_del(struct gendisk *);
|
2018-09-24 14:43:47 +07:00
|
|
|
#else /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
static inline bool integrity_req_gap_back_merge(struct request *req,
|
|
|
|
struct bio *next)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
static inline bool integrity_req_gap_front_merge(struct request *req,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2015-10-22 00:20:23 +07:00
|
|
|
static inline void blk_flush_integrity(void)
|
|
|
|
{
|
|
|
|
}
|
2017-07-04 05:58:43 +07:00
|
|
|
static inline bool bio_integrity_endio(struct bio *bio)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
2019-12-05 09:09:01 +07:00
|
|
|
static inline void bio_integrity_free(struct bio *bio)
|
|
|
|
{
|
|
|
|
}
|
2020-03-25 22:48:41 +07:00
|
|
|
static inline void blk_integrity_add(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void blk_integrity_del(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
}
|
2018-09-24 14:43:47 +07:00
|
|
|
#endif /* CONFIG_BLK_DEV_INTEGRITY */
|
2008-01-29 20:51:59 +07:00
|
|
|
|
2014-05-14 04:10:52 +07:00
|
|
|
unsigned long blk_rq_timeout(unsigned long timeout);
|
2014-04-24 21:51:47 +07:00
|
|
|
void blk_add_timer(struct request *req);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 15:20:05 +07:00
|
|
|
|
2020-08-28 09:52:56 +07:00
|
|
|
enum bio_merge_status bio_attempt_front_merge(struct request *req,
|
|
|
|
struct bio *bio,
|
|
|
|
unsigned int nr_segs);
|
|
|
|
enum bio_merge_status bio_attempt_back_merge(struct request *req,
|
|
|
|
struct bio *bio,
|
|
|
|
unsigned int nr_segs);
|
|
|
|
enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
|
|
|
|
struct request *req,
|
|
|
|
struct bio *bio);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 15:20:05 +07:00
|
|
|
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
|
2019-06-06 17:29:01 +07:00
|
|
|
unsigned int nr_segs, struct request **same_queue_rq);
|
2020-08-28 09:52:55 +07:00
|
|
|
bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
|
|
|
|
struct bio *bio, unsigned int nr_segs);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 15:20:05 +07:00
|
|
|
|
2020-05-27 12:24:16 +07:00
|
|
|
void blk_account_io_start(struct request *req);
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 16:08:53 +07:00
|
|
|
void blk_account_io_done(struct request *req, u64 now);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 15:20:05 +07:00
|
|
|
|
2009-04-23 09:05:18 +07:00
|
|
|
/*
|
|
|
|
* Internal elevator interface
|
|
|
|
*/
|
2016-10-20 20:12:13 +07:00
|
|
|
#define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
|
2009-04-23 09:05:18 +07:00
|
|
|
|
2011-01-25 18:43:54 +07:00
|
|
|
void blk_insert_flush(struct request *rq);
|
2010-09-03 16:56:16 +07:00
|
|
|
|
2019-09-05 16:51:30 +07:00
|
|
|
void elevator_init_mq(struct request_queue *q);
|
2018-08-21 14:15:03 +07:00
|
|
|
int elevator_switch_mq(struct request_queue *q,
|
|
|
|
struct elevator_type *new_e);
|
block: free sched's request pool in blk_cleanup_queue
In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.
However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags is definitely host wide,
and doesn't belong to any request queue, same with its request pool.
So we need tagset instance for freeing request of sched tags.
Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
tags to be freed before calling blk_mq_free_tag_set().
Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue for simplying the fast
path in generic_make_request(), then causes oops during freeing requests
of sched tags in __blk_release_queue().
Fix the above issue by move freeing request pool of sched tags into
blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
in-queue requests at that time. Freeing sched tags has to be kept in queue's
release handler becasue there might be un-completed dispatch activity
which might refer to sched tags.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-04 20:08:02 +07:00
|
|
|
void __elevator_exit(struct request_queue *, struct elevator_queue *);
|
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
required.
However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This way causes AB-BA lock because the kernfs built-in lock of
'kn-count' is required inside kobject_del() too, see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing REGISTERED flag prevents storing to 'queue/scheduler'
from being happened. Also sysfs write(store) is exclusive, so no
necessary to hold the lock for elv_unregister_queue() when it is
called in switching elevator path.
So split .sysfs_lock into two: one is still named as .sysfs_lock for
covering sync .store, the other one is named as .sysfs_dir_lock
for covering kobjects and related status change.
sysfs itself can handle the race between add/remove kobjects and
showing/storing attributes under kobjects. For switching scheduler
via storing to 'queue/scheduler', we use the queue flag of
QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then
we can avoid to hold .sysfs_lock during removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 18:01:48 +07:00
|
|
|
int elv_register_queue(struct request_queue *q, bool uevent);
|
2018-01-18 02:48:08 +07:00
|
|
|
void elv_unregister_queue(struct request_queue *q);
|
|
|
|
|
block: free sched's request pool in blk_cleanup_queue
In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.
However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags is definitely host wide,
and doesn't belong to any request queue, same with its request pool.
So we need tagset instance for freeing request of sched tags.
Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
tags to be freed before calling blk_mq_free_tag_set().
Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue for simplying the fast
path in generic_make_request(), then causes oops during freeing requests
of sched tags in __blk_release_queue().
Fix the above issue by move freeing request pool of sched tags into
blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
in-queue requests at that time. Freeing sched tags has to be kept in queue's
release handler becasue there might be un-completed dispatch activity
which might refer to sched tags.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-04 20:08:02 +07:00
|
|
|
static inline void elevator_exit(struct request_queue *q,
|
|
|
|
struct elevator_queue *e)
|
|
|
|
{
|
2019-09-26 05:23:54 +07:00
|
|
|
lockdep_assert_held(&q->sysfs_lock);
|
|
|
|
|
block: free sched's request pool in blk_cleanup_queue
In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.
However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags is definitely host wide,
and doesn't belong to any request queue, same with its request pool.
So we need tagset instance for freeing request of sched tags.
Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
tags to be freed before calling blk_mq_free_tag_set().
Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue for simplying the fast
path in generic_make_request(), then causes oops during freeing requests
of sched tags in __blk_release_queue().
Fix the above issue by move freeing request pool of sched tags into
blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
in-queue requests at that time. Freeing sched tags has to be kept in queue's
release handler becasue there might be un-completed dispatch activity
which might refer to sched tags.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-04 20:08:02 +07:00
|
|
|
blk_mq_sched_free_requests(q);
|
|
|
|
__elevator_exit(q, e);
|
|
|
|
}
|
|
|
|
|
2017-08-24 00:10:30 +07:00
|
|
|
struct hd_struct *__disk_get_part(struct gendisk *disk, int partno);
|
|
|
|
|
2020-03-24 14:25:13 +07:00
|
|
|
ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
|
|
|
|
char *buf);
|
|
|
|
ssize_t part_stat_show(struct device *dev, struct device_attribute *attr,
|
|
|
|
char *buf);
|
|
|
|
ssize_t part_inflight_show(struct device *dev, struct device_attribute *attr,
|
|
|
|
char *buf);
|
|
|
|
ssize_t part_fail_show(struct device *dev, struct device_attribute *attr,
|
|
|
|
char *buf);
|
|
|
|
ssize_t part_fail_store(struct device *dev, struct device_attribute *attr,
|
|
|
|
const char *buf, size_t count);
|
2008-09-14 19:56:33 +07:00
|
|
|
ssize_t part_timeout_show(struct device *, struct device_attribute *, char *);
|
|
|
|
ssize_t part_timeout_store(struct device *, struct device_attribute *,
|
|
|
|
const char *, size_t);
|
|
|
|
|
2020-07-01 15:59:39 +07:00
|
|
|
void __blk_queue_split(struct bio **bio, unsigned int *nr_segs);
|
2019-06-06 17:29:01 +07:00
|
|
|
int ll_back_merge_fn(struct request *req, struct bio *bio,
|
|
|
|
unsigned int nr_segs);
|
|
|
|
int ll_front_merge_fn(struct request *req, struct bio *bio,
|
|
|
|
unsigned int nr_segs);
|
2017-02-02 22:54:40 +07:00
|
|
|
struct request *attempt_back_merge(struct request_queue *q, struct request *rq);
|
|
|
|
struct request *attempt_front_merge(struct request_queue *q, struct request *rq);
|
2011-03-21 16:14:27 +07:00
|
|
|
int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
|
|
|
|
struct request *next);
|
2019-06-06 17:29:02 +07:00
|
|
|
unsigned int blk_recalc_rq_segments(struct request *rq);
|
2009-07-03 15:48:17 +07:00
|
|
|
void blk_rq_set_mixed_merge(struct request *rq);
|
2012-02-08 15:19:38 +07:00
|
|
|
bool blk_rq_merge_ok(struct request *rq, struct bio *bio);
|
2017-02-08 20:46:48 +07:00
|
|
|
enum elv_merge blk_try_merge(struct request *rq, struct bio *bio);
|
2008-01-29 20:04:06 +07:00
|
|
|
|
2008-03-04 17:23:45 +07:00
|
|
|
int blk_dev_init(void);
|
|
|
|
|
2009-04-24 13:10:11 +07:00
|
|
|
/*
|
|
|
|
* Contribute to IO statistics IFF:
|
|
|
|
*
|
|
|
|
* a) it's attached to a gendisk, and
|
2019-10-11 06:36:26 +07:00
|
|
|
* b) the queue had IO stats enabled when this request was started
|
2009-04-24 13:10:11 +07:00
|
|
|
*/
|
2018-08-16 21:51:40 +07:00
|
|
|
static inline bool blk_do_io_stat(struct request *rq)
|
2009-02-02 14:42:32 +07:00
|
|
|
{
|
2019-10-11 06:36:26 +07:00
|
|
|
return rq->rq_disk && (rq->rq_flags & RQF_IO_STAT);
|
2009-02-02 14:42:32 +07:00
|
|
|
}
|
|
|
|
|
2017-02-08 20:46:47 +07:00
|
|
|
static inline void req_set_nomerge(struct request_queue *q, struct request *req)
|
|
|
|
{
|
|
|
|
req->cmd_flags |= REQ_NOMERGE;
|
|
|
|
if (req == q->last_merge)
|
|
|
|
q->last_merge = NULL;
|
|
|
|
}
|
|
|
|
|
2018-10-29 19:57:17 +07:00
|
|
|
/*
|
|
|
|
* The max size one bio can handle is UINT_MAX becasue bvec_iter.bi_size
|
|
|
|
* is defined as 'unsigned int', meantime it has to aligned to with logical
|
|
|
|
* block size which is the minimum accepted unit by hardware.
|
|
|
|
*/
|
|
|
|
static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
|
|
|
|
{
|
|
|
|
return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
|
|
|
|
}
|
|
|
|
|
block: improve discard bio alignment in __blkdev_issue_discard()
This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.
Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.
Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
/dev/sdb, fill data into the first 50GB by,
# dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
# blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
# sg_get_lba_status /dev/sdb -m 1048 --lba=0
descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated
Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.
The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.
25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
26 sector_t nr_sects, gfp_t gfp_mask, int flags,
27 struct bio **biop)
28 {
29 struct request_queue *q = bdev_get_queue(bdev);
[snipped]
56
57 while (nr_sects) {
58 sector_t req_sects = min_t(sector_t, nr_sects,
59 bio_allowed_max_sectors(q));
60
61 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
62
63 bio = blk_next_bio(bio, 0, gfp_mask);
64 bio->bi_iter.bi_sector = sector;
65 bio_set_dev(bio, bdev);
66 bio_set_op_attrs(bio, op, 0);
67
68 bio->bi_iter.bi_size = req_sects << 9;
69 sector += req_sects;
70 nr_sects -= req_sects;
[snipped]
79 }
80
81 *biop = bio;
82 return 0;
83 }
84 EXPORT_SYMBOL(__blkdev_issue_discard);
At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.
If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.
This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.
But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked (including partition offset), if it is not
aligned to discard granularity, the first split location should make
sure following bio has bi_sector aligned to discard granularity. Then
there won't be still-mapped fragment in the middle of the discard range.
The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated
We an see there is no 2048 sectors segment anymore, everything is clean.
Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 09:42:30 +07:00
|
|
|
/*
|
|
|
|
* The max bio size which is aligned to q->limits.discard_granularity. This
|
|
|
|
* is a hint to split large discard bio in generic block layer, then if device
|
|
|
|
* driver needs to split the discard bio into smaller ones, their bi_size can
|
|
|
|
* be very probably and easily aligned to discard_granularity of the device's
|
|
|
|
* queue.
|
|
|
|
*/
|
|
|
|
static inline unsigned int bio_aligned_discard_max_sectors(
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
return round_down(UINT_MAX, q->limits.discard_granularity) >>
|
|
|
|
SECTOR_SHIFT;
|
|
|
|
}
|
|
|
|
|
2011-12-14 06:33:40 +07:00
|
|
|
/*
|
|
|
|
* Internal io_context interface
|
|
|
|
*/
|
|
|
|
void get_io_context(struct io_context *ioc);
|
2011-12-14 06:33:42 +07:00
|
|
|
struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q);
|
2012-03-06 04:15:24 +07:00
|
|
|
struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
|
|
|
|
gfp_t gfp_mask);
|
2011-12-14 06:33:42 +07:00
|
|
|
void ioc_clear_queue(struct request_queue *q);
|
2011-12-14 06:33:40 +07:00
|
|
|
|
2012-03-06 04:15:24 +07:00
|
|
|
int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
|
2011-12-14 06:33:40 +07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Internal throttling interface
|
|
|
|
*/
|
2011-10-19 19:31:18 +07:00
|
|
|
#ifdef CONFIG_BLK_DEV_THROTTLING
|
|
|
|
extern int blk_throtl_init(struct request_queue *q);
|
|
|
|
extern void blk_throtl_exit(struct request_queue *q);
|
2017-03-28 00:51:38 +07:00
|
|
|
extern void blk_throtl_register_queue(struct request_queue *q);
|
2020-06-27 14:31:58 +07:00
|
|
|
bool blk_throtl_bio(struct bio *bio);
|
2011-10-19 19:31:18 +07:00
|
|
|
#else /* CONFIG_BLK_DEV_THROTTLING */
|
|
|
|
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
|
|
|
|
static inline void blk_throtl_exit(struct request_queue *q) { }
|
2017-03-28 00:51:38 +07:00
|
|
|
static inline void blk_throtl_register_queue(struct request_queue *q) { }
|
2020-06-27 14:31:58 +07:00
|
|
|
static inline bool blk_throtl_bio(struct bio *bio) { return false; }
|
2011-10-19 19:31:18 +07:00
|
|
|
#endif /* CONFIG_BLK_DEV_THROTTLING */
|
2017-03-28 00:51:37 +07:00
|
|
|
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
|
|
|
|
extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
|
|
|
|
extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
|
|
|
|
const char *page, size_t count);
|
blk-throttle: add a simple idle detection
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.
We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truely idle.
Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.
We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.
The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, should not a big deal.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 00:51:41 +07:00
|
|
|
extern void blk_throtl_bio_endio(struct bio *bio);
|
blk-throttle: add a mechanism to estimate IO latency
User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.
To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.
But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.
Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 05:19:42 +07:00
|
|
|
extern void blk_throtl_stat_add(struct request *rq, u64 time);
|
blk-throttle: add a simple idle detection
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.
We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truely idle.
Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.
We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.
The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, should not a big deal.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 00:51:41 +07:00
|
|
|
#else
|
|
|
|
static inline void blk_throtl_bio_endio(struct bio *bio) { }
|
blk-throttle: add a mechanism to estimate IO latency
User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.
To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.
But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.
Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 05:19:42 +07:00
|
|
|
static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
|
2017-03-28 00:51:37 +07:00
|
|
|
#endif
|
2011-10-19 19:31:18 +07:00
|
|
|
|
2017-06-19 14:26:21 +07:00
|
|
|
#ifdef CONFIG_BOUNCE
|
|
|
|
extern int init_emergency_isa_pool(void);
|
|
|
|
extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
|
|
|
|
#else
|
|
|
|
static inline int init_emergency_isa_pool(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_BOUNCE */
|
|
|
|
|
2018-07-03 22:15:01 +07:00
|
|
|
#ifdef CONFIG_BLK_CGROUP_IOLATENCY
|
|
|
|
extern int blk_iolatency_init(struct request_queue *q);
|
|
|
|
#else
|
|
|
|
static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
|
|
|
|
#endif
|
|
|
|
|
2018-10-12 17:08:47 +07:00
|
|
|
struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
|
|
|
|
|
2018-10-12 17:08:50 +07:00
|
|
|
#ifdef CONFIG_BLK_DEV_ZONED
|
|
|
|
void blk_queue_free_zone_bitmaps(struct request_queue *q);
|
|
|
|
#else
|
|
|
|
static inline void blk_queue_free_zone_bitmaps(struct request_queue *q) {}
|
|
|
|
#endif
|
|
|
|
|
2020-03-25 22:48:41 +07:00
|
|
|
struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector);
|
|
|
|
|
|
|
|
int blk_alloc_devt(struct hd_struct *part, dev_t *devt);
|
|
|
|
void blk_free_devt(dev_t devt);
|
|
|
|
void blk_invalidate_devt(dev_t devt);
|
|
|
|
char *disk_name(struct gendisk *hd, int partno, char *buf);
|
|
|
|
#define ADDPART_FLAG_NONE 0
|
|
|
|
#define ADDPART_FLAG_RAID 1
|
|
|
|
#define ADDPART_FLAG_WHOLEDISK 2
|
2020-04-14 14:28:54 +07:00
|
|
|
void delete_partition(struct gendisk *disk, struct hd_struct *part);
|
2020-04-14 14:28:53 +07:00
|
|
|
int bdev_add_partition(struct block_device *bdev, int partno,
|
|
|
|
sector_t start, sector_t length);
|
|
|
|
int bdev_del_partition(struct block_device *bdev, int partno);
|
|
|
|
int bdev_resize_partition(struct block_device *bdev, int partno,
|
|
|
|
sector_t start, sector_t length);
|
2020-03-25 22:48:41 +07:00
|
|
|
int disk_expand_part_tbl(struct gendisk *disk, int target);
|
2020-04-14 14:28:55 +07:00
|
|
|
int hd_ref_init(struct hd_struct *part);
|
2020-03-25 22:48:41 +07:00
|
|
|
|
2020-05-08 15:17:58 +07:00
|
|
|
/* no need to get/put refcount of part0 */
|
2020-03-25 22:48:41 +07:00
|
|
|
static inline int hd_struct_try_get(struct hd_struct *part)
|
|
|
|
{
|
2020-05-08 15:17:58 +07:00
|
|
|
if (part->partno)
|
|
|
|
return percpu_ref_tryget_live(&part->ref);
|
|
|
|
return 1;
|
2020-03-25 22:48:41 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void hd_struct_put(struct hd_struct *part)
|
|
|
|
{
|
2020-05-08 15:17:58 +07:00
|
|
|
if (part->partno)
|
|
|
|
percpu_ref_put(&part->ref);
|
2020-03-25 22:48:41 +07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void hd_free_part(struct hd_struct *part)
|
|
|
|
{
|
2020-05-27 12:24:14 +07:00
|
|
|
free_percpu(part->dkstats);
|
2020-03-25 22:48:41 +07:00
|
|
|
kfree(part->info);
|
|
|
|
percpu_ref_exit(&part->ref);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Any access of part->nr_sects which is not protected by partition
|
|
|
|
* bd_mutex or gendisk bdev bd_mutex, should be done using this
|
|
|
|
* accessor function.
|
|
|
|
*
|
|
|
|
* Code written along the lines of i_size_read() and i_size_write().
|
|
|
|
* CONFIG_PREEMPTION case optimizes the case of UP kernel with preemption
|
|
|
|
* on.
|
|
|
|
*/
|
|
|
|
static inline sector_t part_nr_sects_read(struct hd_struct *part)
|
|
|
|
{
|
|
|
|
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
|
|
|
|
sector_t nr_sects;
|
|
|
|
unsigned seq;
|
|
|
|
do {
|
|
|
|
seq = read_seqcount_begin(&part->nr_sects_seq);
|
|
|
|
nr_sects = part->nr_sects;
|
|
|
|
} while (read_seqcount_retry(&part->nr_sects_seq, seq));
|
|
|
|
return nr_sects;
|
|
|
|
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPTION)
|
|
|
|
sector_t nr_sects;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
nr_sects = part->nr_sects;
|
|
|
|
preempt_enable();
|
|
|
|
return nr_sects;
|
|
|
|
#else
|
|
|
|
return part->nr_sects;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Should be called with mutex lock held (typically bd_mutex) of partition
|
|
|
|
* to provide mutual exlusion among writers otherwise seqcount might be
|
|
|
|
* left in wrong state leaving the readers spinning infinitely.
|
|
|
|
*/
|
|
|
|
static inline void part_nr_sects_write(struct hd_struct *part, sector_t size)
|
|
|
|
{
|
|
|
|
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
|
2020-06-03 21:49:48 +07:00
|
|
|
preempt_disable();
|
2020-03-25 22:48:41 +07:00
|
|
|
write_seqcount_begin(&part->nr_sects_seq);
|
|
|
|
part->nr_sects = size;
|
|
|
|
write_seqcount_end(&part->nr_sects_seq);
|
2020-06-03 21:49:48 +07:00
|
|
|
preempt_enable();
|
2020-03-25 22:48:41 +07:00
|
|
|
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPTION)
|
|
|
|
preempt_disable();
|
|
|
|
part->nr_sects = size;
|
|
|
|
preempt_enable();
|
|
|
|
#else
|
|
|
|
part->nr_sects = size;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2020-05-12 15:55:46 +07:00
|
|
|
int bio_add_hw_page(struct request_queue *q, struct bio *bio,
|
2020-03-28 00:48:37 +07:00
|
|
|
struct page *page, unsigned int len, unsigned int offset,
|
2020-05-12 15:55:46 +07:00
|
|
|
unsigned int max_sectors, bool *same_page);
|
2020-03-28 00:48:37 +07:00
|
|
|
|
2011-10-19 19:31:18 +07:00
|
|
|
#endif /* BLK_INTERNAL_H */
|