blkcg is gonna switch to cgroup common weight range as defined by
CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg interface grew to be the biggest of all controllers and
unfortunately most inconsistent too. The interface files are
inconsistent with a number of cloes duplicates. Some files have
recursive variants while others don't. There's distinction between
normal and leaf weights which isn't intuitive and there are a lot of
stat knobs which don't make much sense outside of debugging and expose
too much implementation details to userland.
In the unified hierarchy, everything is always hierarchical and
internal nodes can't have tasks rendering the two structural issues
twisting the current interface. The interface has to be updated in a
significant anyway and this is a good chance to revamp it as a whole.
This patch implements blkcg interface for the unified hierarchy.
* (from a previous patch) blkcg is identified by "io" instead of
"blkio" on the unified hierarchy. Given that the whole interface is
updated anyway, the rename shouldn't carry noticeable conversion
overhead.
* The original interface consisted of 27 files is replaced with the
following three files.
blkio.stat : per-blkcg stats
blkio.weight : per-cgroup and per-cgroup-queue weight settings
blkio.max : per-cgroup-queue bps and iops max limits
Documentation/cgroups/unified-hierarchy.txt updated accordingly.
v2: blkcg_policy->dfl_cftypes wasn't removed on
blkcg_policy_unregister() corrupting the cftypes list. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
tg_set_conf() is largely consisted of parsing and setting the new
config and the follow-up application and propagation. This patch
separates out the latter part into tg_conf_updated(). This will be
used to implement interface for the unified hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blkg_conf_prep() expects input to be of the following form
MAJ:MIN NUM
and reads the NUM part into blkg_conf_ctx->v. This is quite
restrictive and gets in the way in implementing blkcg interface for
the unified hierarchy. This patch updates blkg_conf_prep() so that it
expects
MAJ:MIN BODY_STR
where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
with ->body which is a char pointer pointing to the start of BODY_STR.
Parsing of the body is moved to blkg_conf_prep()'s callers.
To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
non-const pointer and to accommodate that const is dropped from @input
too.
This doesn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkio interface has become messy over time and is currently the
largest. In addition to the inconsistent naming scheme, it has
multiple stat files which report more or less the same thing, a number
of debug stat files which expose internal details which shouldn't have
been part of the public interface in the first place, recursive and
non-recursive stats and leaf and non-leaf knobs.
Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
don't make any sense on the unified hierarchy as only leaf cgroups can
contain processes. cgroups is going through a major interface
revision with the unified hierarchy involving significant fundamental
usage changes and given that a significant portion of the interface
doesn't make sense anymore, it's a good time to reorganize the
interface.
As the first step, this patch renames the external visible subsystem
name from "blkio" to "io". This is more concise, matches the other
two major subsystem names, "cpu" and "memory", and better suited as
blkcg will be involved in anything writeback related too whether an
actual block device is involved or not.
As the subsystem legacy_name is set to "blkio", the only userland
visible change outside the unified hierarchy is that blkcg is reported
as "io" instead of "blkio" in the subsystem initialized message during
boot. On the unified hierarchy, blkcg now appears as "io".
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg currently returns -EINVAL for most errors which can be pretty
confusing given that the failure modes are quite varied. Update the
error returns so that
* -EINVAL only for syntactic errors.
* -ERANGE if the value is out of range.
* -ENODEV if the target device can't be found.
* -EOPNOTSUPP if the policy is not enabled on the target device.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
enabled are guaranteed to return non-NULL and the counterpart in
blk-throttle doesn't have these checks either. Remove the spurious
NULL checks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
The recent percpu conversion of blkg_rwstat triggered the following
warning in certain configurations.
block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes
This is because blkg_rwstat now contains four percpu_counter which can
be pretty big depending on debug options although it shouldn't be a
problem in production configs. This patch removes one of the two
local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
reduce stack usage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835
Signed-off-by: Jens Axboe <axboe@fb.com>
cfq_stats->sectors is a blkg_stat which keeps track of the total
number of sectors serviced; however, this can be trivially calculated
from blkcg_gq->stat_bytes. The only thing necessary is adding up
READs and WRITEs and then dividing by sector size.
Remove cfqg_stats->sectors and make cfq print "sectors" and
"sectors_recursive" from stat_bytes.
While this is a bit more code, it removes duplicate stat allocations
and updates and ensures that the reported stats stay in tune with each
other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, both cfq-iosched and blk-throttle keep track of
io_service_bytes and io_serviced stats. While keeping track of them
separately may be useful during development, it doesn't make much
sense otherwise. Also, blk-throttle was counting bio's as IOs while
cfq-iosched request's, which is more confusing than informative.
This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
removes the counterparts from cfq-iosched and blk-throttle and let
them print from the common blkg counters. The common counters are
incremented during bio issue in blkcg_bio_issue_check().
The outputs are still filtered by whether the policy has
blkg_policy_data on a given blkg, so cfq's output won't show up if it
has never been used for a given blkg. The only times when the outputs
would differ significantly are when policies are attached on the fly
or elevators are switched back and forth. Those are quite exceptional
operations and I don't think they warrant keeping separate counters.
v3: Update blkio-controller.txt accordingly.
v2: Account IOs during bio issues instead of request completions so
that bio-based drivers can be handled the same way.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blkg_[rw]stat_recursive_sum() assume that the target
counter is located in pd (blkg_policy_data); however, some counters
are planned to be moved to blkg (blkcg_gq).
This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
blkg_policy pointers instead of pd. If policy is NULL, it indexes
into blkg. If non-NULL, into the blkg's pd of the policy.
The existing usages are updated to maintain the current behaviors.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg_[rw]stat are used as stat counters for blkcg policies. It isn't
per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
it. This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
per-cpu wrapping in blk-throttle.
* blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
percpu_counter. This makes syncp unnecessary as remote accesses are
handled by percpu_counter itself.
* blkg_[rw]stat_init() can now fail due to percpu allocation failure
and thus are updated to return int.
* percpu_counters need explicit freeing. blkg_[rw]stat_exit() added.
* As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
and summing results are stored in ->aux_cnt[] instead.
* Custom per-cpu stat implementation in blk-throttle is removed.
This makes all blkcg stat counters per-cpu without complicating policy
implmentations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup stats are local to each cgroup and doesn't propagate to
ancestors by default. When recursive stats are necessary, the sum is
calculated over all the descendants. This initially was for backward
compatibility to support both group-local and recursive stats but this
mode of operation makes general sense as stat update is much hotter
thafn reporting those stats.
This however ends up losing recursive stats when a child is removed.
To work around this, cfq-iosched adds its stats to its parent
cfq_group->dead_stats which is summed up together when calculating
recursive stats.
It's planned that the core stats will be moved to blkcg_gq, so we want
to move the mechanism for keeping track of the stats of dead children
from cfq to blkcg core. This patch adds blkg_[rw]stat->aux_cnt which
are atomic64_t's keeping track of auxiliary counts which are excluded
when reading local counts but included for recursive.
blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
a dead cgroup to the aux counts of parent->stats instead of separate
->dead_stats.
This will also help making blkg_[rw]stats per-cpu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkg (blkcg_gq) currently is created by blkcg policies invoking
blkg_lookup_create() which ends up repeating about the same code in
different policies. Theoretically, this can avoid the overhead of
looking and/or creating blkg's if blkcg is enabled but no policy is in
use; however, the cost of blkg lookup / creation is very low
especially if only the root blkcg is in use which is highly likely if
no blkcg policy is in active use - it boils down to a single very
predictable conditional and surrounding RCU protection.
This patch consolidates blkg creation to a new function
blkcg_bio_issue_check() which is called during bio issue from
generic_make_request_checks(). blkcg_bio_issue_check() is now the
only function which tries to create missing blkg's. The subsequent
policy and request_list operations just perform blkg_lookup() and if
missing falls back to the root.
* blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
instead of blkg_lookup_create().
* blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
read locked and blkg already looked up. Both throtl_lookup_tg() and
throtl_lookup_create_tg() are dropped.
* cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
cfq_lookup_cfqg()which uses blkg_lookup().
This consolidates blkg handling and avoids unnecessary blkg creation
retries under memory pressure. In addition, this provides a common
bio entry point into blkcg where things like common accounting can be
performed.
v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
!CONFIG_BLK_DEV_THROTTLING.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If a queue is bypassing, all blkcg policies should become noops but
blk-throttle wasn't. It only became noop if the queue was dying.
While this wouldn't lead to an oops as falling back to the root blkg
is safe in this case, this can be a bit surprising - a bypassing queue
could still be applying throttle limits.
Fix it by removing blk_queue_dying() test in throtl_lookup_create_tg()
and testing blk_queue_bypass() in blk_throtl_bio() and bypassing
before doing anything else.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, both throttle and cfq policies implement their own root
blkg (blkcg_gq) lookup fast path. This patch moves root blkg
optimization from throtl_lookup_tg() to __blkg_lookup(). cfq-iosched
currently doesn't use blkg_lookup() but will be converted and drop the
optimization too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkg_lookup() checks whether the target queue is bypassing and, if
not, calls __blkg_lookup() which first checks the lookup hint and then
performs radix tree walk. The operations upto hint checking are
trivial and there are many users of this function. This patch inlines
blkg_lookup() and the fast path part of __blkg_lookup(). The radix
tree lookup and hint update are now in blkg_lookup_slowpath().
This will help consolidating blkg handling by easing moving root blkcg
short-circuit to inlined lookup fast path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Each active policy has a cpd (blkcg_policy_data) on each blkcg. The
cpd's were allocated by blkcg core and each policy could request to
allocate extra space at the end by setting blkcg_policy->cpd_size
larger than the size of cpd.
This is a bit unusual but blkg (blkcg_gq) policy data used to be
handled this way too so it made sense to be consistent; however, blkg
policy data switched to alloc/free callbacks.
This patch makes similar changes to cpd handling.
blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As
cpd allocation is now done from policy side, it can simply allocate a
larger area which embeds cpd at the beginning.
As ->cpd_alloc_fn() may be able to perform all necessary
initializations, this patch makes ->cpd_init_fn() optional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
for blkcg_policy_data.
* Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
blkcg. This makes it consistent with blkg_policy_data methods and
to-be-added cpd alloc/free methods.
* blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
cpd_init_fn() can determine the associated blkcg from
blkcg_policy_data.
v2: blkcg_policy_data->blkcg initializations were missing. Added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
(blkg_policy_data) while the older ones use blkg (blkcg_gq). As using
blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
can always be mapped to blkg and given that these are policy-specific
methods, it makes sense to converge on pd.
This patch makes all methods deal with pd instead of blkg. Most
conversions are trivial. In blk-cgroup.c, a couple method invocation
sites now test whether pd exists instead of policy state for
consistency. This shouldn't cause any behavioral differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
With the recent addition of alloc and free methods, things became
messier. This patch reorganizes them according to the followings.
* ->pd_alloc_fn()
Responsible for allocation and static initializations - the ones
which can be done independent of where the pd might be attached.
* ->pd_init_fn()
Initializations which require the knowledge of where the pd is
attached.
* ->pd_free_fn()
The counter part of pd_alloc_fn(). Static de-init and freeing.
This leaves ->pd_exit_fn() without any users. Removed.
While at it, collapse an one liner function throtl_pd_exit(), which
has only one user, into its user.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Because percpu allocator couldn't do non-blocking allocations,
blk-throttle was forced to implement an ad-hoc asynchronous allocation
mechanism for its percpu stats for cases where blkg's (blkcg_gq's) are
allocated from an IO path without sleepable context.
Now that percpu allocator can handle gfp_mask and blkg_policy_data
alloc / free are handled by policy methods, the ad-hoc asynchronous
allocation mechanism can be replaced with direct allocation from
tg_stats_alloc_fn(). Rit it out.
This ensures that an active throtl_grp always has valid non-NULL
->stats_cpu. Remove checks on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A blkg (blkcg_gq) represents the relationship between a cgroup and
request_queue. Each active policy has a pd (blkg_policy_data) on each
blkg. The pd's were allocated by blkcg core and each policy could
request to allocate extra space at the end by setting
blkcg_policy->pd_size larger than the size of pd.
This is a bit unusual but was done this way mostly to simplify error
handling and all the existing use cases could be handled this way;
however, this is becoming too restrictive now that percpu memory can
be allocated without blocking.
This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
and pd_free_fn() - which are used to allocate and release pd for a
given policy. As pd allocation is now done from policy side, it can
simply allocate a larger area which embeds pd at the beginning. This
change makes ->pd_size pointless. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
doesn't. As both in-kernel policies implement ->pd_init_fn, it
currently doesn't break anything. Update blkcg_activate_policy() so
that its behavior is consistent with blkg_create().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a policy gets activated, it needs to allocate and install its
policy data on all existing blkg's (blkcg_gq's). Because blkg
iteration is protected by a spinlock, it currently counts the total
number of blkg's in the system, allocates the matching number of
policy data on a list and installs them during a single iteration.
This can be simplified by using speculative GFP_NOWAIT allocations
while iterating and falling back to a preallocated policy data on
failure. If the preallocated one has already been consumed, it
releases the lock, preallocate with GFP_KERNEL and then restarts the
iteration. This can be a bit more expensive than before but policy
activation is a very cold path and shouldn't matter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
bypasses policy data and blkcg freeing for blkcg_root. There's no
reason to to treat policy data any differently for blkcg_root. If the
root css gets allocated after policies are registered, policy
registration path will add policy data; otherwise, the alloc path
will. The free path isn't never invoked for root csses.
This patch removes the unnecessary special handling of blkcg_root from
css_alloc/free paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When blkcg_init_queue() fails midway after creating a new blkg, it
performs kfree() directly; however, this doesn't free the policy data
areas. Make it use blkg_free() instead. In turn, blkg_free() is
updated to handle root request_list special case.
While this fixes a possible memory leak, it's on an unlikely failure
path of an already cold path and the size leaked per occurrence is
miniscule too. I don't think it needs to be tagged for -stable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Up until now, all async IOs were queued to async queues which are
shared across the whole request_queue, which means that blkcg resource
control is completely void on async IOs including all writeback IOs.
It was done this way because writeback didn't support writeback and
there was no way of telling which writeback IO belonged to which
cgroup; however, writeback recently became cgroup aware and writeback
bio's are sent down properly tagged with the blkcg's to charge them
against.
This patch makes async cfq_queues per-cfq_cgroup instead of
per-cfq_data so that each async IO is charged to the blkcg that it was
tagged for instead of unconditionally attributing it to root.
* cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group
and alloc / destroy paths are updated accordingly.
* cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async
queues.
* check_blkcg_changed() now also invalidates async queues as they no
longer stay the same across cgroups.
After this patch, cfq's proportional IO control through blkio.weight
works correctly when cgroup writeback is in use.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cfq_find_alloc_queue() checks whether a queue actually needs to be
allocated, which is unnecessary as its sole caller, cfq_get_queue(),
only calls it if so. Also, the oom queue fallback logic is scattered
between cfq_get_queue() and cfq_find_alloc_queue(). There really
isn't much going on in the latter and things can be made simpler by
folding it into cfq_get_queue().
This patch collapses cfq_find_alloc_queue() into cfq_get_queue(). The
change is fairly straight-forward with one exception - async_cfqq is
now initialized to NULL and the "!is_sync" test in the last if
conditional is replaced with "async_cfqq" test. This is because gcc
(5.1.1) gets confused for some reason and warns that async_cfqq may be
used uninitialized otherwise. Oh well, the code isn't necessarily
worse this way.
This patch doesn't cause any functional difference.
v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is necessary for making async cfq_cgroups per-cfq_group instead
of per-cfq_data. While this change makes cfq_get_queue() perform RCU
locking and look up cfq_group even when it reuses async queue, the
extra overhead is extremely unlikely to be noticeable given that this
is already sitting behind cic->cfqq[] cache and the overall cost of
cfq operation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Even when allocations fail, cfq_find_alloc_queue() always returns a
valid cfq_queue by falling back to the oom cfq_queue. As such, there
isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT
is set. GFP_NOWAIT allocations don't fail often and even when they do
the degraded behavior is acceptable and temporary.
After all, the only reason get_request(), which ultimately determines
the gfp_mask, cares about __GFP_WAIT is to guarantee request
allocation, assuming IO forward progress, for callers which are
willing to wait. There's no reason for cfq_find_alloc_queue() to
behave differently on __GFP_WAIT when it already has a fallback
mechanism.
Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes
to its callers. This simplifies the function quite a bit and will
help making async queues per-cfq_group.
v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg performs several allocations to track IOs per cgroup and enforce
resource control. Most of these allocations are performed lazily on
demand in the IO path and thus can't involve reclaim path. Currently,
these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
with occassional failures of these allocations by punting IOs to the
root cgroup and there's no reason to reach into the emergency reserve.
This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
allocations.
* bdi_writeback_congested and blkcg_gq allocations in blkg_create().
* radix tree node allocations for blkcg->blkg_tree.
* cfq_queue allocation on ioprio changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* Some were accessing cic->cfqq[] directly. Always use cic_to_cfqq()
and cic_set_cfqq().
* check_ioprio_changed() doesn't need to verify cfq_get_queue()'s
return for NULL. It's always non-NULL. Simplify accordingly.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request()
replaces it by invoking cfq_get_queue() again without putting the oom
queue leaking the reference it was holding. While oom queues are not
released through reference counting, they're still reference counted
and this can theoretically lead to the reference count overflowing and
incorrectly invoke the usual release path on it.
Fix it by making cfq_set_request() put the ref it was holding.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Async cfqq's (cfq_queue's) are shared across cfq_data. When
cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it
stashes the pointer in cfq_data and reuses it from then on; however,
the function doesn't consider that cfq_find_alloc_queue() may return
the oom_cfqq under memory pressure and installs the returned queue
unconditionally.
If the oom_cfqq is installed as an async cfqq, cfq_set_request() will
continue calling cfq_get_queue() hoping to replace it with a proper
queue; however, cfq_get_queue() will keep returning the cached queue
for the slot - the oom_cfqq.
Fix it by skipping caching if the queue is the oom one.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cfq_get_queue()'s control flow looks like the following.
async_cfqq = NULL;
cfqq = NULL;
if (!is_sync) {
...
async_cfqq = ...;
cfqq = *async_cfqq;
}
if (!cfqq)
cfqq = ...;
if (!is_sync && !(*async_cfqq))
...;
The only thing the local variable init, the second if, and the
async_cfqq test in the third if achieves is to skip cfqq creation and
installation if *async_cfqq was already non-NULL. This is needlessly
complicated with different tests examining the same condition.
Simplify it to the following.
if (!is_sync) {
...
async_cfqq = ...;
cfqq = *async_cfqq;
if (cfqq)
goto out;
}
cfqq = ...;
if (!is_sync)
...;
out:
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This reverts commit 34b48db66e.
That commit caused performance regressions for streaming I/O
workloads on a number of different storage devices, from
SATA disks to external RAID arrays. It also managed to
trip up some buggy firmware in at least one drive, causing
data corruption.
The next patch will bump the default max_sectors_kb value to
1280, which will accommodate a 10-data-disk stripe write
with chunk size 128k. In the testing I've done using iozone,
fio, and aio-stress, a value of 1280 does not show a big
performance difference from 512. This will hopefully still
help the software RAID setup that Christoph saw the original
performance gains with while still not regressing other
storage configurations.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[hch: split from a larger patch by Dan]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch consists of 2 libfc fixes causing rare crashes, one iscsi one
causing a potential hang on shutdown, an I/O blocksize issue which caused a
regression and a memory leak in scsi-mq.
Signed-off-by: James Bottomley <JBottomley@Odin.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJVzsOiAAoJEDeqqVYsXL0M4wQIAIUe1SGv66OcAQLBcxxJUL2T
i+Ph2RE/er1iyzJepN56i77Mn2hnBgnQtmB/ibnjQfUsz/zo5PANjIfy+eFcG53G
qcb8l7a/BFCH3JHWXL7rJYN9G64sirADDL6SDLpX1JFsL21bAGdcEgbmefysvDmr
qFkiGH0Ty9YH58W+6j1pzQhh437rRgcM1KuY08sJsbKmyCVdzG5ketzBkONmBcTh
OTfPQjL32L4KR3THDUbpCiK6YAUtDvHjVB51lwoiB1ER7Ke+E+nqlCuxhXvZfMoD
lGfvWgwaQnoN1fun6c85zcqnl72kymropWWiJUhhPPjeZEj8sgn/eaaAAST9mlQ=
=Fztd
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This has two libfc fixes for bugs causing rare crashes, one iscsi fix
for a potential hang on shutdown, and a fix for an I/O blocksize issue
which caused a regression"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
sd: Fix maximum I/O size for BLOCK_PC requests
libfc: Fix fc_fcp_cleanup_each_cmd()
libfc: Fix fc_exch_recv_req() error path
libiscsi: Fix host busy blocking during connection teardown
There may be lots of pending requests so that the buffer of PAGE_SIZE
can't hold them at all.
One typical example is scsi-mq, the queue depth(.can_queue) of
scsi_host and blk-mq is quite big but scsi_device's queue_depth
is a bit small(.cmd_per_lun), then it is quite easy to have lots
of pending requests in hw queue.
This patch fixes the following warning and the related memory
destruction.
[ 359.025101] fill_read_buffer: blk_mq_hw_sysfs_show+0x0/0x7d returned bad count^M
[ 359.055595] irq event stamp: 15537^M
[ 359.055606] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[ 359.055614] Dumping ftrace buffer:^M
[ 359.055660] (ftrace buffer empty)^M
[ 359.055672] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[ 359.055678] CPU: 4 PID: 21631 Comm: stress-ng-sysfs Not tainted 4.2.0-rc5-next-20150805 #434^M
[ 359.055679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[ 359.055682] task: ffff8802161cc000 ti: ffff88021b4a8000 task.ti: ffff88021b4a8000^M
[ 359.055693] RIP: 0010:[<ffffffff811541c5>] [<ffffffff811541c5>] __kmalloc+0xe8/0x152^M
Cc: <stable@vger.kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The split code in blkdev_issue_{discard,write_same} can go away
now that any driver that cares does the split. We have to make
sure bio size doesn't overflow.
For discard, we set max discard sectors to (1<<31)>>9 to ensure
it doesn't overflow bi_size and hopefully it is of the proper
granularity as long as the granularity is a power of two.
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.
__bio_add_page() doesn't need to call ->merge_bvec_fn(), where
we can get rid of unnecessary code paths.
Removing the call to ->merge_bvec_fn() is also fine, as no driver that
implements support for BLOCK_PC commands even has a ->merge_bvec_fn()
method.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: rebase and resolve merge conflicts, change a couple of comments,
make bio_add_page() warn once upon a cloned bio.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.
But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.
We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.
Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:
* nfhd_make_request (arch/m68k/emu/nfblock.c)
* axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
* simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
* brd_make_request (ramdisk - drivers/block/brd.c)
* mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
* loop_make_request
* null_queue_bio
* bcache's make_request fns
Some others are almost certainly safe to remove now, but will be left
for future patches.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit bcdb247c6b ("sd: Limit transfer length") clamped the maximum
size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK
LIMITS VPD. This had the unfortunate effect of also limiting the maximum
size of non-filesystem requests sent to the device through sg/bsg.
Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue
limit directly.
Also update the comment in blk_limits_max_hw_sectors() to clarify that
max_hw_sectors defines the limit for the I/O controller only.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Brian King <brking@linux.vnet.ibm.com>
Tested-by: Brian King <brking@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: James Bottomley <JBottomley@Odin.com>
Some places use helpers now, others don't. We only have the 'is set'
helper, add helpers for setting and clearing flags too.
It was a bit of a mess of atomic vs non-atomic access. With
BIO_UPTODATE gone, we don't have any risk of concurrent access to the
flags. So relax the restriction and don't make any of them atomic. The
flags that do have serialization issues (reffed and chained), we
already handle those separately.
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This fixes a data corruption bug when using discard on top of MD linear,
raid0 and raid10 personalities.
Commit 20d0189b10 "block: Introduce new bio_split()" permits sharing
the bio_vec between the two resulting bios. That is fine for read/write
requests where the bio_vec is immutable. For discards, however, we need
to be able to attach a payload and update the bio_vec so the page can
get mapped to a scatterlist entry. Therefore the bio_vec can not be
shared when splitting discards and we must do a full clone.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Seunguk Shin <seunguk.shin@samsung.com>
Tested-by: Seunguk Shin <seunguk.shin@samsung.com>
Cc: Seunguk Shin <seunguk.shin@samsung.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: <stable@vger.kernel.org> # v3.14+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
are used to implement cgroup writeback support for filesystems and
thus need to be exported. Export them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@fb.com>
JBD layer wrote back data buffers without setting PageWriteback bit.
Thus standard mechanism for guaranteeing stable pages under IO did not
work. Since JBD is gone now and there is no other user of the
functionality, just remove it.
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
When a blkcg configuration is targeted to a partition rather than a
whole device, blkg_conf_prep fails with -EINVAL; unfortunately, it
forgets to put the gendisk ref in that case. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Lots of devices support huge discard sizes these days. Depending
on how the device handles them internally, huge discards can
introduce massive latencies (hundreds of msec) on the device side.
We have a sysfs file, discard_max_bytes, that advertises the max
hardware supported discard size. Make this writeable, and split
the settings into a soft and hard limit. This can be set from
'discard_granularity' and up to the hardware limit.
Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw
set limit.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Percpu refcount is the perfect match for partition's case,
and the conversion is quite straight.
With the convertion, one pair of atomic inc/dec can be saved
for accounting block I/O, which is run in hot path of block I/O.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
So the helper can be used in both generic partition
case and part0 case.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It is reasonable to set default timeout of request as 30 seconds instead of
30000 ticks, which may be 300 seconds if HZ is 100, for example, some arm64
based systems may choose 100 HZ.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Fixes: c76cbbcf40 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()"
Signed-off-by: Jens Axboe <axboe@fb.com>
e48453c386 ("block, cgroup: implement policy-specific per-blkcg
data") updated per-blkcg policy data to be dynamically allocated.
When a policy is registered, its policy data aren't created. Instead,
when the policy is activated on a queue, the policy data are allocated
if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
This is buggy. Consider the following scenario.
1. A blkcg is created. No blkg's attached yet.
2. The policy is registered. No policy data is allocated.
3. The policy is activated on a queue. As the above blkcg doesn't
have any blkg's, it won't allocate the matching blkcg_policy_data.
4. An IO is issued from the blkcg and blkg is created and the blkcg
still doesn't have the matching policy data allocated.
With cfq-iosched, this leads to an oops.
It also doesn't free policy data on policy unregistration assuming
that freeing of all policy data on blkcg destruction should take care
of it; however, this also is incorrect.
1. A blkcg has policy data.
2. The policy gets unregistered but the policy data remains.
3. Another policy gets registered on the same slot.
4. Later, the new policy tries to allocate policy data on the previous
blkcg but the slot is already occupied and gets skipped. The
policy ends up operating on the policy data of the previous policy.
There's no reason to manage blkcg_policy_data lazily. The reason we
do lazy allocation of blkg's is that the number of all possible blkg's
is the product of cgroups and block devices which can reach a
surprising level. blkcg_policy_data is contrained by the number of
cgroups and shouldn't be a problem.
This patch makes blkcg_policy_data to be allocated for all existing
blkcg's on policy registration and freed on unregistration and removes
blkcg_policy_data handling from policy [de]activation paths. This
makes that blkcg_policy_data are created and removed with the policy
they belong to and fixes the above described problems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: e48453c386 ("block, cgroup: implement policy-specific per-blkcg data")
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add all_blkcgs list goes through blkcg->all_blkcgs_node and is
protected by blkcg_pol_mutex. This will be used to fix
blkcg_policy_data allocation bug.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
An entry in blkcg_policy[] is stable while there are non-bypassing
in-flight IOs on a request_queue which has the policy activated. This
is why most derefs of blkcg_policy[] don't need explicit locking;
however, blkcg_css_alloc() isn't invoked from IO path and thus doesn't
have this protection and may race policies being added and removed.
Fix it by adding explicit blkcg_pol_mutex protection around
blkcg_policy[] iteration in blkcg_css_alloc().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: e48453c386 ("block, cgroup: implement policy-specific per-blkcg data")
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg_pol_mutex primarily protects the blkcg_policy array. It also
protects cgroup file type [un]registration during policy addition /
removal. This puts blkcg_pol_mutex outside cgroup internal
synchronization and in turn makes it impossible to grab from blkcg's
cgroup methods as that leads to cyclic dependency.
Another problematic dependency arising from this is through cgroup
interface file deactivation. Removing a cftype requires removing all
files of the type which in turn involves draining all on-going
invocations of the file methods. This means that an interface file
implementation can't grab blkcg_pol_mutex as draining can lead to AA
deadlock.
blkcg_reset_stats() is already in this situation. It currently
trylocks blkcg_pol_mutex and then unwinds and retries the whole
operation on failure, which is cumbersome at best. It has a lengthy
comment explaining how cgroup internal synchronization is involved and
expected to be updated but as explained above this doesn't need cgroup
internal locking to deadlock. It's a self-contained AA deadlock.
The described circular dependencies can be easily broken by moving
cftype [un]registration out of blkcg_pol_mutex and protect them with
an outer mutex. This patch introduces blkcg_pol_register_mutex which
wraps entire policy [un]registration including cftype operations and
shrinks blkcg_pol_mutex critical section. This also makes the trylock
dancing in blkcg_reset_stats() unnecessary. Removed.
This patch is necessary for the following blkcg_policy_data allocation
bug fixes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, per-blkcg data is freed each time a policy is deactivated,
that is also upon scheduler switch. However, when switching from a
scheduler implementing a policy which requires per-blkcg data to
another one, that same policy might be active on other devices, and
therefore those same per-blkcg data could be still in use.
This commit lets per-blkcg data be freed when the blkcg is freed
instead of on policy deactivation.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Reported-and-tested-by: Michael Kaminsky <kaminsky@cs.cmu.edu>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: Jens Axboe <axboe@fb.com>
bio_integrity_alloc() and bio_integrity_free() assume that if a bio was
allocated from a bioset that that bioset also had its bio_integrity_pool
allocated using bioset_integrity_create(). This is a very bad
assumption given that bioset_create() and bioset_integrity_create() are
completely disjoint. Not all callers of bioset_create() have been
trained to also call bioset_integrity_create() -- and they may not care
to be.
Fix this by falling back to kmalloc'ing 'struct bio_integrity_payload'
rather than force all bioset consumers to (wastefully) preallocate a
bio_integrity_pool that they very likely won't actually need (given the
niche nature of the current block integrity support).
Otherwise, a NULL pointer "Kernel BUG" with a trace like the following
will be observed (as seen on s390x using zfcp storage) because dm-io
doesn't use bioset_integrity_create() when creating its bioset:
[ 791.643338] Call Trace:
[ 791.643339] ([<00000003df98b848>] 0x3df98b848)
[ 791.643341] [<00000000002c5de8>] bio_integrity_alloc+0x48/0xf8
[ 791.643348] [<00000000002c6486>] bio_integrity_prep+0xae/0x2f0
[ 791.643349] [<0000000000371e38>] blk_queue_bio+0x1c8/0x3d8
[ 791.643355] [<000000000036f8d0>] generic_make_request+0xc0/0x100
[ 791.643357] [<000000000036f9b2>] submit_bio+0xa2/0x198
[ 791.643406] [<000003ff801f9774>] dispatch_io+0x15c/0x3b0 [dm_mod]
[ 791.643419] [<000003ff801f9b3e>] dm_io+0x176/0x2f0 [dm_mod]
[ 791.643423] [<000003ff8074b28a>] do_reads+0x13a/0x1a8 [dm_mirror]
[ 791.643425] [<000003ff8074b43a>] do_mirror+0x142/0x298 [dm_mirror]
[ 791.643428] [<0000000000154fca>] process_one_work+0x18a/0x3f8
[ 791.643432] [<000000000015598a>] worker_thread+0x132/0x3b0
[ 791.643435] [<000000000015d49a>] kthread+0xd2/0xd8
[ 791.643438] [<00000000005bc0ca>] kernel_thread_starter+0x6/0xc
[ 791.643446] [<00000000005bc0c4>] kernel_thread_starter+0x0/0xc
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Whenever blk_fill_sghdr_rq fails, its errno code is ignored and changed to
EFAULT. This can cause very confusing errors:
$ sg_persist -k /dev/sda
persistent reservation in: pass through os error: Bad address
The fix is trivial, just propagate the return value from
blk_fill_sghdr_rq.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
ability to handle partial request completions -- otherwise with the
current SCSI LLDs these changes could lead to silent data corruption.
- Fix two DM version bumps that were missing from the initial 4.2 DM
pull request (enabled userspace lvm2 to know certain changes have been
made).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVjWetAAoJEMUj8QotnQNaEngIAMVwExw0u04jqoW9rUwLDbpr
PS2A4lh/MGtMqGGPwJp5qiwnKkgQ5/FcxRpslNQYqA6KrIlnjWJhacWl7tOrwqxn
+WBsHIUwjcpwK2RqxSS3Petb6xDd7A3LfTQVhKV9xKZpZp8Y25a+1MPmUYKsFLBH
DJ1d9bXPMdN1qjBXBU1rKkVxj6z8iNz/lv24eN0MGyWhfUUTc8lQg3eey3L0BzCc
siOuupFQXaWIkbawLZrmvPPNm1iMoABC1OPZCTB1AYZYx1rqzEGUR1nZN+qWf6Wf
rZtAPZehbRzvOaf5jC6tEfAcTF23aPEyp4LD+aAQpbuC/1IBi8a3S8z6PvR5EjA=
=QY48
-----END PGP SIGNATURE-----
Merge tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Apologies for not pressing this request-based DM partial completion
issue further, it was an oversight on my part. We'll have to get it
fixed up properly and revisit for a future release.
- Revert block and DM core changes the removed request-based DM's
ability to handle partial request completions -- otherwise with the
current SCSI LLDs these changes could lead to silent data
corruption.
- Fix two DM version bumps that were missing from the initial 4.2 DM
pull request (enabled userspace lvm2 to know certain changes have
been made)"
* tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache policy smq: fix "default" version to be 1.4.0
dm: bump the ioctl version to 4.32.0
Revert "block, dm: don't copy bios for request clones"
Revert "dm: do not allocate any mempools for blk-mq request-based DM"
This reverts commit 5f1b670d0b.
Justification for revert as reported in this dm-devel post:
https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html
this change should not be pushed to mainline yet.
Firstly, Christoph has a newer version of the patch that fixes silent
data corruption problem:
https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html
And the new version still depends on LLDDs to always complete requests
to the end when error happens, while block API doesn't enforce such a
requirement. If the assumption is ever broken, the inconsistency between
request and bio (e.g. rq->__sector and rq->bio) will cause silent data
corruption:
https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html
Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
Pull core block IO update from Jens Axboe:
"Nothing really major in here, mostly a collection of smaller
optimizations and cleanups, mixed with various fixes. In more detail,
this contains:
- Addition of policy specific data to blkcg for block cgroups. From
Arianna Avanzini.
- Various cleanups around command types from Christoph.
- Cleanup of the suspend block I/O path from Christoph.
- Plugging updates from Shaohua and Jeff Moyer, for blk-mq.
- Eliminating atomic inc/dec of both remaining IO count and reference
count in a bio. From me.
- Fixes for SG gap and chunk size support for data-less (discards)
IO, so we can merge these better. From me.
- Small restructuring of blk-mq shared tag support, freeing drivers
from iterating hardware queues. From Keith Busch.
- A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
IOPS mode the default for non-rotational storage"
* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
cfq-iosched: move group scheduling functions under ifdef
cfq-iosched: fix the setting of IOPS mode on SSDs
blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
block, cgroup: implement policy-specific per-blkcg data
block: Make CFQ default to IOPS mode on SSDs
block: add blk_set_queue_dying() to blkdev.h
blk-mq: Shared tag enhancements
block: don't honor chunk sizes for data-less IO
block: only honor SG gap prevention for merges that contain data
block: fix returnvar.cocci warnings
block, dm: don't copy bios for request clones
block: remove management of bi_remaining when restoring original bi_end_io
block: replace trylock with mutex_lock in blkdev_reread_part()
block: export blkdev_reread_part() and __blkdev_reread_part()
suspend: simplify block I/O handling
block: collapse bio bit space
block: remove unused BIO_RW_BLOCK and BIO_EOF flags
block: remove BIO_EOPNOTSUPP
...
Pull scheduler updates from Ingo Molnar:
"The main changes are:
- lockless wakeup support for futexes and IPC message queues
(Davidlohr Bueso, Peter Zijlstra)
- Replace spinlocks with atomics in thread_group_cputimer(), to
improve scalability (Jason Low)
- NUMA balancing improvements (Rik van Riel)
- SCHED_DEADLINE improvements (Wanpeng Li)
- clean up and reorganize preemption helpers (Frederic Weisbecker)
- decouple page fault disabling machinery from the preemption
counter, to improve debuggability and robustness (David
Hildenbrand)
- SCHED_DEADLINE documentation updates (Luca Abeni)
- topology CPU masks cleanups (Bartosz Golaszewski)
- /proc/sched_debug improvements (Srikar Dronamraju)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
sched/deadline: Remove needless parameter in dl_runtime_exceeded()
sched: Remove superfluous resetting of the p->dl_throttled flag
sched/deadline: Drop duplicate init_sched_dl_class() declaration
sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
sched/deadline: Make init_sched_dl_class() __init
sched/deadline: Optimize pull_dl_task()
sched/preempt: Add static_key() to preempt_notifiers
sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
sched/debug: Properly format runnable tasks in /proc/sched_debug
sched/numa: Only consider less busy nodes as numa balancing destinations
Revert 095bebf61a ("sched/numa: Do not move past the balance point if unbalanced")
sched/fair: Prevent throttling in early pick_next_task_fair()
preempt: Reorganize the notrace definitions a bit
preempt: Use preempt_schedule_context() as the official tracing preemption point
sched: Make preempt_schedule_context() function-tracing safe
x86: Remove cpu_sibling_mask() and cpu_core_mask()
x86: Replace cpu_**_mask() with topology_**_cpumask()
...
Commit 9470e4a693 only covered the initial bug report, there are
other spots in CFQ where we need to check that we actually have
a valid cfq_group_data structure.
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: Jens Axboe <axboe@fb.com>
If none of the devices in the system are using CFQ, then attempting to
read:
/sys/fs/cgroup/blkio/blkio.leaf_weight
will results in a NULL dereference. Check for a valid cfq_group_data
struct before attempting to dereference it.
Reported-by: Andrey Wagin <avagin@gmail.com>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: Jens Axboe <axboe@fb.com>
If CFQ_GROUP_IOSCHED is not set, the compiler produces the
following warning:
CC block/cfq-iosched.o
linux/block/cfq-iosched.c:469:2:
warning: 'cpd_to_cfqgd' defined but not used [-Wunused-function]
*cpd_to_cfqgd(struct blkcg_policy_data *cpd)
^
In reality, two other lookup functions aren't used either if
CFQ_GROUP_IOSCHED isn't set. Move all three under one of the
CFQ_GROUP_IOSCHED sections in the code.
Reported-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
=================================
[ INFO: inconsistent lock state ]
4.1.0-rc7+ #217 Tainted: G O
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
[<ffffffff810c1947>] lock_acquire+0xb7/0x290
[<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
[<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0 <-- take the lock in process context
[..]
[<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
[<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
[<ffffffff810c1947>] lock_acquire+0xb7/0x290
[<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
[<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
[<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
[<ffffffff8143a60c>] blk_free_devt+0x3c/0x70 <-- take the lock in softirq
[<ffffffff8143bfec>] part_release+0x1c/0x50
[<ffffffff8158edf6>] device_release+0x36/0xb0
[<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8145aad0>] kobject_put+0x30/0x70
[<ffffffff8158f147>] put_device+0x17/0x20
[<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
[<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
[<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
[<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
[<ffffffff81067e2e>] __do_softirq+0xde/0x600
Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests. This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092dd "block:
Fix dev_t minor allocation lifetime". Both this and 2da78092dd are
candidates for -stable.
Fixes: 2da78092dd ("block: Fix dev_t minor allocation lifetime")
Cc: <stable@vger.kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A previous commit wanted to make CFQ default to IOPS mode on
non-rotational storage, however it did so when the queue was
initialized and the non-rotational flag is only set later on
in the probe.
Add an elevator hook that gets called off the add_disk() path,
at that point we know that feature probing has finished, and
we can reliably check for the various flags that drivers can
set.
Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
Tested-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now blk_cleanup_queue() can be called before calling
del_gendisk()[1], inside which hctx->ctxs is touched
from blk_mq_unregister_hctx(), but the variable has
been freed by blk_cleanup_queue() at that time.
So this patch moves freeing of hctx->ctxs into queue's
release handler for fixing the oops reported by Stefan.
[1], 6cd18e711d (block: destroy bdi before blockdev is
unregistered)
Reported-by: Stefan Seyfried <stefan.seyfried@googlemail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The block IO (blkio) controller enables the block layer to provide service
guarantees in a hierarchical fashion. Specifically, service guarantees
are provided by registered request-accounting policies. As of now, a
proportional-share and a throttling policy are available. They are
implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
subsystem. Unfortunately, as for adding new policies, the current
implementation of the block IO controller is only halfway ready to allow
new policies to be plugged in. This commit provides a solution to make
the block IO controller fully ready to handle new policies.
In what follows, we first describe briefly the current state, and then
list the changes made by this commit.
The throttling policy does not need any per-cgroup information to perform
its task. In contrast, the proportional share policy uses, for each cgroup,
both the weight assigned by the user to the cgroup, and a set of dynamically-
computed weights, one for each device.
The first, user-defined weight is stored in the blkcg data structure: the
block IO controller allocates a private blkcg data structure for each
cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
In other words, the block IO controller internally mirrors the blkio cgroups
with private blkcg data structures.
On the other hand, for each cgroup and device, the corresponding dynamically-
computed weight is maintained in the following, different way. For each device,
the block IO controller keeps a private blkcg_gq structure for each cgroup in
blkio. In other words, block IO also keeps one private mirror copy of the blkio
cgroups hierarchy for each device, made of blkcg_gq structures.
Each blkcg_gq structure keeps per-policy information in a generic array of
dynamically-allocated 'dedicated' data structures, one for each registered
policy (so currently the array contains two elements). To be inserted into the
generic array, each dedicated data structure embeds a generic blkg_policy_data
structure. Consider now the array contained in the blkcg_gq structure
corresponding to a given pair of cgroup and device: one of the elements
of the array contains the dedicated data structure for the proportional-share
policy, and this dedicated data structure contains the dynamically-computed
weight for that pair of cgroup and device.
The generic strategy adopted for storing per-policy data in blkcg_gq structures
is already capable of handling new policies, whereas the one adopted with blkcg
structures is not, because per-policy data are hard-coded in the blkcg
structures themselves (currently only data related to the proportional-
share policy).
This commit addresses the above issues through the following changes:
. It generalizes blkcg structures so that per-policy data are stored in the same
way as in blkcg_gq structures.
Specifically, it lets also the blkcg structure store per-policy data in a
generic array of dynamically-allocated dedicated data structures. We will
refer to these data structures as blkcg dedicated data structures, to
distinguish them from the dedicated data structures inserted in the generic
arrays kept by blkcg_gq structures.
To allow blkcg dedicated data structures to be inserted in the generic array
inside a blkcg structure, this commit also introduces a new blkcg_policy_data
structure, which is the equivalent of blkg_policy_data for blkcg dedicated
data structures.
. It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
cpd_size field and a cpd_init field, to be initialized by the policy with,
respectively, the size of the blkcg dedicated data structures, and the
address of a constructor function for blkcg dedicated data structures.
. It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
the fields related to the proportional-share policy), into a new blkcg
dedicated data structure called cfq_group_data.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
CFQ idling causes reduced IOPS throughput on non-rotational disks.
Since disk head seeking is not applicable to SSDs, it doesn't really
help performance by anticipating future near-by IO requests.
By turning off idling (and switching to IOPS mode), we allow other
processes to dispatch IO requests down to the driver and so increase IO
throughput.
Following FIO benchmark results were taken on a cloud SSD offering with
idling on and off:
Idling iops avg-lat(ms) stddev bw
------------------------------------------------------
On 7054 90.107 38.697 28217KB/s
Off 29255 21.836 11.730 117022KB/s
fio --name=temp --size=100G --time_based --ioengine=libaio \
--randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
--verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
--filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10
And the following is from a local SSD run:
Idling iops avg-lat(ms) stddev bw
------------------------------------------------------
On 19320 33.043 14.068 77281KB/s
Off 21626 29.465 12.662 86507KB/s
fio --name=temp --size=5G --time_based --ioengine=libaio \
--randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
--verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
--filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10
Reviewed-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that bdi layer can handle per-blkcg bdi_writeback_congested state,
blk_{set|clear}_congested() can propagate non-root blkcg congestion
state to them.
This can be easily achieved by disabling the root_rl tests in
blk_{set|clear}_congested(). Note that we still need those tests when
!CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
wb's congestion state for events happening on other blkcgs.
v2: Updated for bdi_writeback_congested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_{set|clear}_queue_congested() take @q and set or clear,
respectively, the congestion state of its bdi's root wb. Because bdi
used to be able to handle congestion state only on the root wb, the
callers of those functions tested whether the congestion is on the
root blkcg and skipped if not.
This is cumbersome and makes implementation of per cgroup
bdi_writeback congestion state propagation difficult. This patch
renames blk_{set|clear}_queue_congested() to
blk_{set|clear}_congested(), and makes them take request_list instead
of request_queue and test whether the specified request_list is the
root one before updating bdi_writeback congestion state. This makes
the tests in the callers unnecessary and simplifies them.
As there are no external users of these functions, the definitions are
moved from include/linux/blkdev.h to block/blk-core.c.
This patch doesn't introduce any noticeable behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A blkg (blkcg_gq) can be congested and decongested independently from
other blkgs on the same request_queue. Accordingly, for cgroup
writeback support, the congestion status at bdi (backing_dev_info)
should be split and updated separately from matching blkg's.
This patch prepares by adding blkg->wb_congested and associating a
blkg with its matching per-blkcg bdi_writeback_congested on creation.
v2: Updated to associate bdi_writeback_congested instead of
bdi_writeback.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback). This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).
On the default hierarchy, blkcg implicitly enables memcg. This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg. This means that there may be multiple
wb's of a bdi mapped to the same blkcg. As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state. This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.
bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
by its memcg id. Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().
Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.
v3: inode_attach_wb() in account_page_dirtied() moved inside
mapping_cap_account_dirty() block where it's known to be !NULL.
Also, an unnecessary NULL check before kfree() removed. Both
detected by the kbuild bot.
v2: Updated so that wb association is per inode and wb is per memcg
rather than blkcg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.
inode_cgwb_enabled() which determines whether a given inode's both bdi
and fs support cgroup writeback is added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.
This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.
v2: fs/fat build failure fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->state into wb.
* enum bdi_state is renamed to wb_state and the prefix of all enums is
changed from BDI_ to WB_.
* Explicit zeroing of bdi->state is removed without adding zeoring of
wb->state as the whole data structure is zeroed on init anyway.
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: drbd-dev@lists.linbit.com
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, a bio can only be associated with the io_context and blkcg
of %current using bio_associate_current(). This is too restrictive
for cgroup writeback support. Implement bio_associate_blkcg() which
associates a bio with the specified blkcg.
bio_associate_blkcg() leaves the io_context unassociated.
bio_associate_current() is updated so that it considers a bio as
already associated if it has a blkcg_css, instead of an io_context,
associated with it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
bio_associate_current() currently open codes task_css() and
css_tryget_online() to find and pin $current's blkcg css. Abstract it
into task_get_css() which is implemented from cgroup side. As a task
is always associated with an online css for every subsystem except
while the css_set update is propagating, task_get_css() retries till
css_tryget_online() succeeds.
This is a cleanup and shouldn't lead to noticeable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add global constant blkcg_root_css which points to &blkcg_root.css.
This will be used by cgroup writeback support. If blkcg is disabled,
it's defined as ERR_PTR(-EINVAL).
v2: The declarations moved to include/linux/blk-cgroup.h as suggested
by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blkcg does a minor optimization where the root blkcg is
created when the first blkcg policy is activated on a queue and
destroyed on the deactivation of the last. On systems where blkcg is
configured but not used, this saves one blkcg_gq struct per queue. On
systems where blkcg is actually used, there's no difference. The only
case where this can lead to any meaninful, albeit still minute, save
in memory consumption is when all blkcg policies are deactivated after
being widely used in the system, which is a hihgly unlikely scenario.
The conditional existence of root blkcg_gq has already created several
bugs in blkcg and became an issue once again for the new per-cgroup
wb_congested mechanism for cgroup writeback support leading to a NULL
dereference when no blkcg policy is active. This is really not worth
bothering with. This patch makes blkcg always allocate and link the
root blkcg_gq and release it only on queue destruction.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup aware writeback support will require exposing some of blkcg
details. In preprataion, move block/blk-cgroup.h to
include/linux/blk-cgroup.h. This patch is pure file move.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can safely merge anything that wont generate an SG list entry,
so if the bio is data-less (discard), don't look at potential
SG gaps.
Signed-off-by: Jens Axboe <axboe@fb.com>
bdi_unregister() now contains very little functionality.
It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no
real consequence as bdi->dev isn't needed by anything else in the function,
and it triggers if
blk_cleanup_queue() -> bdi_destroy()
is called before bdi_unregister, which happens since
Commit: 6cd18e711d ("block: destroy bdi before blockdev is unregistered.")
So this isn't wanted.
It also calls bdi_set_min_ratio(). This needs to be called after
writes through the bdi have all been flushed, and before the bdi is destroyed.
Calling it early is better than calling it late as it frees up a global
resource.
Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
perfectly fits these requirements.
So bdi_unregister() can be discarded with the important content moved to
bdi_destroy(), as can the
writeback_bdi_unregister
event which is already not used.
Reported-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org (v4.0)
Fixes: c4db59d31e ("fs: don't reassign dirty inodes to default_backing_dev_info")
Fixes: 6cd18e711d ("block: destroy bdi before blockdev is unregistered.")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.
This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence. With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.
I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io
The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).
Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.
Also, the bio_inc_remaining() interface has been moved local to bio.c.
Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The only possible problem of using mutex_lock() instead of trylock
is about deadlock.
If there aren't any locks held before calling blkdev_reread_part(),
deadlock can't be caused by this conversion.
If there are locks held before calling blkdev_reread_part(),
and if these locks arn't required in open, close handler and I/O
path, deadlock shouldn't be caused too.
Both user space's ioctl(BLKRRPART) and md_setup_drive() from
init/do_mounts_md.c belongs to the 1st case, so the conversion is safe
for the two cases.
For loop, the previous patches in this pathset has fixed the ABBA lock
dependency, so the conversion is OK.
For nbd, tx_lock is held when calling the function:
- both open and release won't hold the lock
- when blkdev_reread_part() is run, I/O thread has been stopped
already, so tx_lock won't be acquired in I/O path at that time.
- so the conversion won't cause deadlock for nbd
For dasd, both dasd_open(), dasd_release() and request function don't
acquire any mutex/semphone, so the conversion should be safe.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch exports blkdev_reread_part() for block drivers, also
introduce __blkdev_reread_part().
For some drivers, such as loop, reread of partitions can be run
from the release path, and bd_mutex may already be held prior to
calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
__blkdev_reread_part for use in such cases.
CC: Christoph Hellwig <hch@lst.de>
CC: Jens Axboe <axboe@kernel.dk>
CC: Tejun Heo <tj@kernel.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Markus Pargmann <mpa@pengutronix.de>
CC: Stefan Weinhuber <wein@de.ibm.com>
CC: Stefan Haberland <stefan.haberland@de.ibm.com>
CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
CC: Fabian Frederick <fabf@skynet.be>
CC: Ming Lei <ming.lei@canonical.com>
CC: David Herrmann <dh.herrmann@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: nbd-general@lists.sourceforge.net
CC: linux-s390@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
lockdep gets unhappy about the not disabling irqs when using the queue_lock
around it. Instead of trying to fix that up just switch to an atomic_t
and get rid of the lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
With commit ff36ab345 ("dm: remove request-based logic from
make_request_fn wrapper") DM no longer calls blk_queue_bio() directly,
so remove its export. Doing so required a forward declaration in
blk-core.c.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Last patch makes plug work for multiple queue case. However it only
works for single disk case, because it assumes only one request in the
plug list. If a task is accessing multiple disks, eg MD/DM, the
assumption is wrong. Let blk_attempt_plug_merge() record request from
the same queue.
V2: use NULL parameter in !mq case. Fix a bug. Add comments in
blk_attempt_plug_merge to make it less (hopefully) confusion.
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
plug is still helpful for workload with IO merge, but it can be harmful
otherwise especially with multiple hardware queues, as there is
(supposed) no lock contention in this case and plug can introduce
latency. For multiple queues, we do limited plug, eg plug only if there
is request merge. If a request doesn't have merge with following
request, the requet will be dispatched immediately.
V2: check blk_queue_nomerges() as suggested by Jeff.
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we directly issue a request and it fails, we use
blk_mq_merge_queue_io(). But we already assigned bio to a request in
blk_mq_bio_to_request. blk_mq_merge_queue_io shouldn't run
blk_mq_bio_to_request again.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The following appears in blk_sq_make_request:
/*
* If we have multiple hardware queues, just go directly to
* one of those for sync IO.
*/
We clearly don't have multiple hardware queues, here! This comment was
introduced with this commit 07068d5b8e (blk-mq: split make request
handler for multi and single queue):
We want slightly different behavior from them:
- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.
- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.
The old code had this:
use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
and that was converted to:
use_plug = !is_flush_fua && !is_sync;
which is not equivalent. For the single queue case, that second half of
the && expression is always true. So, what I think was actually inteded
follows (and this more closely matches what is done in blk_queue_bio).
V2: delete the 'likely', which should not be a big deal
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Current code looks like inner plug gets flushed with a
blk_finish_plug(). Actually it's a nop. All requests/callbacks are added
to current->plug, while only outmost plug is assigned to current->plug.
So inner plug always has empty request/callback list, which makes
blk_flush_plug_list() a nop. This tries to make the code more clear.
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This removes the request types and hacks from the block code and into the
old IDE driver. There is a small amunt of code duplication due to this,
but it's not too bad.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.
If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.
Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.
For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.
Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Normally if driver is busy to dispatch a request the logic is like below:
block layer: driver:
__blk_mq_run_hw_queue
a. blk_mq_stop_hw_queue
b. rq add to ctx->dispatch
later:
1. blk_mq_start_hw_queue
2. __blk_mq_run_hw_queue
But it's possible step 1-2 runs between a and b. And since rq isn't in
ctx->dispatch yet, step 2 will not run rq. The rq might get lost if
there are no subsequent requests kick in.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Because of the peculiar way that md devices are created (automatically
when the device node is opened), a new device can be created and
registered immediately after the
blk_unregister_region(disk_devt(disk), disk->minors);
call in del_gendisk().
Therefore it is important that all visible artifacts of the previous
device are removed before this call. In particular, the 'bdi'.
Since:
commit c4db59d31e
Author: Christoph Hellwig <hch@lst.de>
fs: don't reassign dirty inodes to default_backing_dev_info
moved the
device_unregister(bdi->dev);
call from bdi_unregister() to bdi_destroy() it has been quite easy to
lose a race and have a new (e.g.) "md127" be created after the
blk_unregister_region() call and before bdi_destroy() is ultimately
called by the final 'put_disk', which must come after del_gendisk().
The new device finds that the bdi name is already registered in sysfs
and complains
> [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
> [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'
We can fix this by moving the bdi_destroy() call out of
blk_release_queue() (which can happen very late when a refcount
reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
device driver calls it.
Then it is only necessary for md to call blk_cleanup_queue() before
del_gendisk(). As loop.c devices are also created on demand by
opening the device node, we make the same change there.
Fixes: c4db59d31e
Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Our issue is descripted in below call path:
->elevator_init
->elevator_init_fn
->{cfq,deadline,noop}_init_queue
->elevator_alloc
->kzalloc_node
fail to call kzalloc_node and then put module in elevator_alloc;
fail to call elevator_init_fn and then put module again in elevator_init.
Remove elevator_put invoking in error path of elevator_alloc to avoid
double release issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
hctx->tags has to be set as NULL in case that it is to be unmapped
no matter if set->tags[hctx->queue_num] is NULL or not in blk_mq_map_swqueue()
because shared tags can be freed already from another request queue.
The same situation has to be considered during handling CPU online too.
Unmapped hw queue can be remapped after CPU topo is changed, so we need
to allocate tags for the hw queue in blk_mq_map_swqueue(). Then tags
allocation for hw queue can be removed in hctx cpu online notifier, and it
is reasonable to do that after mapping is updated.
Cc: <stable@vger.kernel.org>
Reported-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Firstly during CPU hotplug, even queue is freezed, timeout
handler still may come and access hctx->tags, which may cause
use after free, so this patch deactivates timeout handler
inside CPU hotplug notifier.
Secondly, tags can be shared by more than one queues, so we
have to check if the hctx has been unmapped, otherwise
still use-after-free on tags can be triggered.
Cc: <stable@vger.kernel.org>
Reported-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 889fa31f00 was a bit too eager in reducing the loop count,
so we ended up missing queues in some configurations. Ensure that
our division rounds up, so that's not the case.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 889fa31f00 ("blk-mq: reduce unnecessary software queue looping")
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block layer core bits from Jens Axboe:
"This is the core pull request for 4.1. Not a lot of stuff in here for
this round, mostly little fixes or optimizations. This pull request
contains:
- An optimization that speeds up queue runs on blk-mq, especially for
the case where there's a large difference between nr_cpu_ids and
the actual mapped software queues on a hardware queue. From Chong
Yuan.
- Honor node local allocations for requests on legacy devices. From
David Rientjes.
- Cleanup of blk_mq_rq_to_pdu() from me.
- exit_aio() fixup from me, greatly speeding up exiting multiple IO
contexts off exit_group(). For my particular test case, fio exit
took ~6 seconds. A typical case of both exposing RCU grace periods
to user space, and serializing exit of them.
- Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
wait if __GFP_WAIT is set. From Keith Busch.
- blk-mq exports and two added helpers from Mike Snitzer, which will
be used by the dm-mq code.
- Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"
* 'for-4.1/core' of git://git.kernel.dk/linux-block:
blk-mq: reduce unnecessary software queue looping
aio: fix serial draining in exit_aio()
blk-mq: cleanup blk_mq_rq_to_pdu()
blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
block: allocate request memory local to request queue
blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
blk-mq: export blk_mq_run_hw_queues
blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many
ctxs assigned to one hctx, they will all loop hctx->ctx_map.map_size
times. Here hctx->ctx_map.map_size is a const ALIGN(nr_cpu_ids, 8) / 8.
Especially, flush_busy_ctxs() is in hot code path. And it's unnecessary.
Change ->map_size to contain the actually mapped software queues, so we
only loop for as many iterations as we have to.
And remove cpumask setting and nr_ctx count in blk_mq_init_cpu_queues()
since they are all re-done in blk_mq_map_swqueue().
blk_mq_map_swqueue().
Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com>
Updated by me for formatting and commenting.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull vfs update from Al Viro:
"Part one:
- struct filename-related cleanups
- saner iov_iter_init() replacements (and switching the syscalls to
use of those)
- ntfs switch to ->write_iter() (Anton)
- aio cleanups and splitting iocb into common and async parts
(Christoph)
- assorted fixes (me, bfields, Andrew Elble)
There's a lot more, including the completion of switchover to
->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
race fixes, etc, but that goes after #for-davem merge. David has
pulled it, and once it's in I'll send the next vfs pull request"
* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
sg_start_req(): use import_iovec()
sg_start_req(): make sure that there's not too many elements in iovec
blk_rq_map_user(): use import_single_range()
sg_io(): use import_iovec()
process_vm_access: switch to {compat_,}import_iovec()
switch keyctl_instantiate_key_common() to iov_iter
switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
vmsplice_to_user(): switch to import_iovec()
kill aio_setup_single_vector()
aio: simplify arguments of aio_setup_..._rw()
aio: lift iov_iter_init() into aio_setup_..._rw()
lift iov_iter into {compat_,}do_readv_writev()
NFS: fix BUG() crash in notify_change() with patch to chown_common()
dcache: return -ESTALE not -EBUSY on distributed fs race
NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
VFS: Add iov_iter_fault_in_multipages_readable()
drop bogus check in file_open_root()
switch security_inode_getattr() to struct path *
constify tomoyo_realpath_from_path()
...
Jan Engelhardt reports a strange oops with an invalid ->sense_buffer
pointer in scsi_init_cmd_errh() with the blk-mq code.
The sense_buffer pointer should have been initialized by the call to
scsi_init_request() from blk_mq_init_rq_map(), but there seems to be
some non-repeatable memory corruptor.
This patch makes sure we initialize the whole struct request allocation
(and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by
using __GFP_ZERO in the allocation. The old code initialized a couple
of individual fields, leaving the rest undefined (although many of them
are then initialized in later phases, like blk_mq_rq_ctx_init() etc.
It's not entirely clear why this matters, but it's the rigth thing to do
regardless, and with 4.0 imminent this is the defensive "let's just make
sure everything is initialized properly" patch.
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
caused blk_stack_limits() to not properly stack queue_limits for stacked
devices (e.g. DM).
Fix this regression by establishing lcm_not_zero() and switching
blk_stack_limits() over to using it.
DM uses blk_set_stacking_limits() to establish the initial top-level
queue_limits that are then built up based on underlying devices' limits
using blk_stack_limits(). In the case of optimal_io_size (io_opt)
blk_set_stacking_limits() establishes a default value of 0. With commit
69c953c, lcm(0, n) is no longer n, which compromises proper stacking of
the underlying devices' io_opt.
Test:
$ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
$ cat /sys/block/sde/queue/optimal_io_size
786432
$ dmsetup create node --table "0 100 linear /dev/sde 0"
Before this fix:
$ cat /sys/block/dm-5/queue/optimal_io_size
0
After this fix:
$ cat /sys/block/dm-5/queue/optimal_io_size
786432
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
At the beginning of blk_mq_alloc_tag_set(), we have already checked whether
'set->nr_hw_queues' is zero, so here remove this redundant check.
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_init_rl() allocates a mempool using mempool_create_node() with node
local memory. This only allocates the mempool and element list locally
to the requeue queue node.
What we really want to do is allocate the request itself local to the
queue. To do this, we need our own alloc and free functions that will
allocate from request_cachep and pass the request queue node in to prefer
node local memory.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use the right array index to reference the last
element of rq->biotail->bi_io_vec[]
Signed-off-by: Wenbo Wang <wenbo.wang@memblaze.com>
Reviewed-by: Chong Yuan <chong.yuan@memblaze.com>
Fixes: 66cb45aa41 ("block: add support for limiting gaps in SG lists")
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
When allocating from the reserved tags pool, bt_get() is called with
a NULL hctx. If all tags are in use, the hw queue is kicked to push
out any pending IO, potentially freeing tags, and tag allocation is
retried. The problem is that blk_mq_run_hw_queue() doesn't check for
a NULL hctx. So we avoid it with a simple NULL hctx test.
Tested by hammering mtip32xx with concurrent smartctl/hdparm.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Fixes: b32232073e ("blk-mq: fix hang in bt_get()")
Cc: stable@kernel.org
Added appropriate comment.
Signed-off-by: Jens Axboe <axboe@fb.com>
Return -EBUSY if we're unable to enter a queue immediately when
allocating a blk-mq request without __GFP_WAIT.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Rename blk_mq_run_queues to blk_mq_run_hw_queues, add async argument,
and export it.
DM's suspend support must be able to run the queue without starting
stopped hw queues.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a variant of blk_mq_init_queue that allows a previously allocated
queue to be initialized. blk_mq_init_allocated_queue models
blk_init_allocated_queue -- which was also created for DM's use.
DM's approach to device creation requires a placeholder request_queue be
allocated for use with alloc_dev() but the decision about what type of
request_queue will be ultimately created is deferred until all component
devices referenced in the DM table are processed to determine the table
type (request-based, blk-mq request-based, or bio-based).
Also, because of DM's late finalization of the request_queue type
the call to blk_mq_register_disk() doesn't happen during alloc_dev().
Must export blk_mq_register_disk() so that DM can backfill the 'mq' dir
once the blk-mq queue is fully allocated.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If percpu_ref_init() fails the allocated q and hctxs must get cleaned
up; using 'err_map' doesn't allow that to happen.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block driver changes from Jens Axboe:
"This contains:
- The 4k/partition fixes for brd from Boaz/Matthew.
- A few xen front/back block fixes from David Vrabel and Roger Pau
Monne.
- Floppy changes from Takashi, cleaning the device file creation.
- Switching libata to use the new blk-mq tagging policy, removing
code (and a suboptimal implementation) from libata. This will
throw you a merge conflict, since a bug in the original libata
tagging code was fixed since this code was branched. Trivial.
From Shaohua.
- Conversion of loop to blk-mq, from Ming Lei.
- Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
He claims it improves on unreadable code, which will cost him a
beer.
- Maintainer update or NDB, now handled by Markus Pargmann.
- NVMe:
- Optimization from me that avoids a kmalloc/kfree per IO for
smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU
overhead.
- Removal of (now) dead RCU code, a relic from before NVMe was
converted to blk-mq"
* 'for-3.20/drivers' of git://git.kernel.dk/linux-block:
xen-blkback: default to X86_32 ABI on x86
xen-blkfront: fix accounting of reqs when migrating
xen-blkback,xen-blkfront: add myself as maintainer
block: Simplify bsg complete all
floppy: Avoid manual call of device_create_file()
NVMe: avoid kmalloc/kfree for smaller IO
MAINTAINERS: Update NBD maintainer
libata: make sata_sil24 use fifo tag allocator
libata: move sas ata tag allocation to libata-scsi.c
libata: use blk taging
NVMe: within nvme_free_queues(), delete RCU sychro/deferred free
null_blk: suppress invalid partition info
brd: Request from fdisk 4k alignment
brd: Fix all partitions BUGs
axonram: Fix bug in direct_access
loop: add blk-mq.h include
block: loop: don't handle REQ_FUA explicitly
block: loop: introduce lo_discard() and lo_req_flush()
block: loop: say goodby to bio
block: loop: improve performance via blk-mq
Pull core block IO changes from Jens Axboe:
"This contains:
- A series from Christoph that cleans up and refactors various parts
of the REQ_BLOCK_PC handling. Contributions in that series from
Dongsu Park and Kent Overstreet as well.
- CFQ:
- A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
- A stable patch fixing a potential crash in CFQ in OOM
situations. From Konstantin Khlebnikov.
- blk-mq:
- Add support for tag allocation policies, from Shaohua. This is
a prep patch enabling libata (and other SCSI parts) to use the
blk-mq tagging, instead of rolling their own.
- Various little tweaks from Keith and Mike, in preparation for
DM blk-mq support.
- Minor little fixes or tweaks from me.
- A double free error fix from Tony Battersby.
- The partition 4k issue fixes from Matthew and Boaz.
- Add support for zero+unprovision for blkdev_issue_zeroout() from
Martin"
* 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
block: remove unused function blk_bio_map_sg
block: handle the null_mapped flag correctly in blk_rq_map_user_iov
blk-mq: fix double-free in error path
block: prevent request-to-request merging with gaps if not allowed
blk-mq: make blk_mq_run_queues() static
dm: fix multipath regression due to initializing wrong request
cfq-iosched: handle failure of cfq group allocation
block: Quiesce zeroout wrapper
block: rewrite and split __bio_copy_iov()
block: merge __bio_map_user_iov into bio_map_user_iov
block: merge __bio_map_kern into bio_map_kern
block: pass iov_iter to the BLOCK_PC mapping functions
block: add a helper to free bio bounce buffer pages
block: use blk_rq_map_user_iov to implement blk_rq_map_user
block: simplify bio_map_kern
block: mark blk-mq devices as stackable
block: keep established cmd_flags when cloning into a blk-mq request
block: add blk-mq support to blk_insert_cloned_request()
block: require blk_rq_prep_clone() be given an initialized clone request
blk-mq: add tag allocation policy
...
Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.
Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"
* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info
The tape drivers (and the sg driver in a special case that doesn't matter
here) use the null_mapped flag to tell blk_rq_map_user to not copy around
any data into or out of the bounce buffers. blk_rq_map_user_iov never
got that treatment, which didn't matter until I refactored blk_rq_map_user
to be implemented in terms of blk_rq_map_user_iov.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: ddad8dd0a1 ("block: use blk_rq_map_user_iov to implement blk_rq_map_user")
Signed-off-by: Jens Axboe <axboe@fb.com>
If the allocation of bt->bs fails, then bt->map can be freed twice, once
in blk_mq_init_bitmap_tags() -> bt_alloc(), and once in
blk_mq_init_bitmap_tags() -> bt_free(). Fix by setting the pointer to
NULL after the first free.
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the queue has SG_GAPS set, we must not merge across an sg gap.
This is caught for the bio case, but currently not for the
more rare case of merging two requests directly.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Cut the dm bits, those will go through the dm tree, and fixed
the test_bit() test.
Signed-off-by: Jens Axboe <axboe@fb.com>
We no longer use it outside of blk-mq.c, so we can make it static
and stop exporting it. Additionally, kill the 'async' argument, as
there's only one used of it.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull EFI updates from Ingo Molnar:
"Main changes:
- Move efivarfs from the misc filesystem section to pseudo filesystem
- Expose firmware platform size in sysfs
- Improve robustness of get_memory_map() by removing assumptions on
the size of efi_memory_desc_t.
- various cleanups and fixes
The biggest risk is the get_memory_map() change, which changes the way
that both the arm64 and x86 EFI boot stub build the early memory map.
There are no known regressions with it at the moment, BYMMV"
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Don't look for chosen@0 node on DT platforms
firmware: efi: Remove unneeded guid unparse
efi/libstub: Call get_memory_map() to obtain map and desc sizes
efi: Small leak on error in runtime map code
efi: rtc-efi: Mark UIE as unsupported
arm64/efi: efistub: Apply __init annotation
efi: Expose underlying UEFI firmware platform size to userland
efi: Rename efi_guid_unparse to efi_guid_to_str
efi: Update the URLs for efibootmgr
fs: Make efivarfs a pseudo filesystem, built by default with EFI
Cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC.
In cfq_find_alloc_queue() possible allocation failure is not handled.
As a result kernel oopses on NULL pointer dereference when
cfq_link_cfqq_cfqg() calls cfqg_get() for NULL pointer.
Bug was introduced in v3.5 in commit cd1604fab4 ("blkcg: factor
out blkio_group creation"). Prior to that commit cfq group lookup
had returned pointer to root group as fallback.
This patch handles this error using existing fallback oom_cfqq.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: cd1604fab4 ("blkcg: factor out blkio_group creation")
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
blkdev_issue_zeroout() printed a warning if a device failed a discard or
write same request despite advertising support for these. That's fine
for SCSI since we'll disable these commands if we get an error back from
the disk saying that they are not supported. And consequently the
warning only gets printed once.
There are other types of block devices that support discard, however,
and these may return -EOPNOTSUPP for each command but leave discard
enabled in the queue limits. This will cause a warning message for every
blkdev_issue_zeroout() invocation.
Remove the offending warning messages.
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Rewrite __bio_copy_iov using the copy_page_{from,to}_iter helpers, and
split it into two simpler functions.
This commit should contain only literal replacements, without
functional changes.
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dongsu Park <dongsu.park@profitbricks.com>
[hch: removed the __bio_copy_iov wrapper]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
And also remove the unused bdev argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This saves a little code, and allow to simplify the error handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Make use of a new interface provided by iov_iter, backed by
scatter-gather list of iovec, instead of the old interface based on
sg_iovec. Also use iov_iter_advance() instead of manual iteration.
This commit should contain only literal replacements, without
functional changes.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dongsu.park@profitbricks.com>
[hch: fixed to do a deep clone of the iov_iter, and to properly use
the iov_iter direction]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The code sniplet to walk all bio_vecs and free their pages is opencoded in
way to many places, so factor it into a helper. Also convert the slightly
more complex cases in bio_kern_endio and __bio_copy_iov where we break
the freeing from an existing loop into a separate one.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Just open code the trivial mapping from a kernel virtual address to
a bio instead of going through the complex user address mapping
machinery.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It took me a few tries to figure out what this code did; lets rewrite
it into a more regular form.
The thing that makes this one 'special' is the BSG_F_BLOCK flag, if
that is not set we're not supposed/allowed to block and should spin
wait for completion.
The (new) io_wait_event() will never see a false condition in case of
the spinning and we will therefore not block.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
since that's a more logical and accurate place - Leif Lindholm
* Update efibootmgr URL in Kconfig help - Peter Jones
* Improve accuracy of EFI guid function names - Borislav Petkov
* Expose firmware platform size in sysfs for the benefit of EFI boot
loader installers and other utilities - Steve McIntyre
* Cleanup __init annotations for arm64/efi code - Ard Biesheuvel
* Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel
* Fix memory leak in error code path of runtime map code - Dan Carpenter
* Improve robustness of get_memory_map() by removing assumptions on the
size of efi_memory_desc_t (which could change in future spec
versions) and querying the firmware instead of guessing about the
memmap size - Ard Biesheuvel
* Remove superfluous guid unparse calls - Ivan Khoronzhuk
* Delete unnecessary chosen@0 DT node FDT code since was duplicated
from code in drivers/of and is entirely unnecessary - Leif Lindholm
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUv69oAAoJEC84WcCNIz1VEYgP/1b27WRfCXs4q/8FP+UheSDS
nAFbGe9PjVPnxo5pA9VwPP6eNQ2zYiyNGEK1BlbQlFPZdSD1updIraA78CiF5iys
iSYyG9xVIcTB23RZI8aJLnBXbosIUKPJZ3FORv1LPhI6Mz1rCpraEaaUlv67rUKr
FLBG9cR7t9f/f+fJw6LOAAISGIG/4s0wQdA5/noaYkj5R5bICl2UTGtbwa0oNstb
NUO93aKDgaG/VljpIEeG6XV96Ioz7cHjQsEaX8sTrvT0n7nPNIqSDjFJOqWKJOXl
RsFrzyl8fFIbMuQatYv1f3efPvyH+iKOfHnHrvcjUNje0xhm7F0Bd86BkOw1a3JQ
pNb0YUWecI0Z/8GSzN8X0JQ7cowa3wI15Z/Hfs03odTXiM6VqwFAhuz/s5DEUdKS
U+rOPjU0ezt3G4oBB/VGgF9w5JWKfsMcsHgmLX9P+JYzKFrxggo1SXAtXUeRAqQp
agKmUB+k6Y1baQO8efkoM7rKL2F0q1SR9QiK+16BHCCkevD23v7IFGrHm2r1xKil
kvWlY4MkRVa4KGPxEFEDVty0HjXxImwYsxTaYVHTS7SMeoP41f6koHKB19NaB3No
5fqn/rT1KcJuhQj/I+vAixIX4WMJkX/MQVbtKfqSaKlAiRg3eRY6ONYr0jOglfF6
gaMuvmDd0HlV6UJvH/9L
=iPpM
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi
Pull EFI updates from Matt Fleming:
" - Move efivarfs from the misc filesystem section to pseudo filesystem,
since that's a more logical and accurate place - Leif Lindholm
- Update efibootmgr URL in Kconfig help - Peter Jones
- Improve accuracy of EFI guid function names - Borislav Petkov
- Expose firmware platform size in sysfs for the benefit of EFI boot
loader installers and other utilities - Steve McIntyre
- Cleanup __init annotations for arm64/efi code - Ard Biesheuvel
- Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel
- Fix memory leak in error code path of runtime map code - Dan Carpenter
- Improve robustness of get_memory_map() by removing assumptions on the
size of efi_memory_desc_t (which could change in future spec
versions) and querying the firmware instead of guessing about the
memmap size - Ard Biesheuvel
- Remove superfluous guid unparse calls - Ivan Khoronzhuk
- Delete unnecessary chosen@0 DT node FDT code since was duplicated
from code in drivers/of and is entirely unnecessary - Leif Lindholm
There's nothing super scary, mainly cleanups, and a merge from Ricardo who
kindly picked up some patches from the linux-efi mailing list while I
was out on annual leave in December.
Perhaps the biggest risk is the get_memory_map() change from Ard, which
changes the way that both the arm64 and x86 EFI boot stub build the
early memory map. It would be good to have it bake in linux-next for a
while.
"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The kobject memory inside blk-mq hctx/ctx shouldn't have been freed
before the kobject is released because driver core can access it freely
before its release.
We can't do that in all ctx/hctx/mq_kobj's release handler because
it can be run before blk_cleanup_queue().
Given mq_kobj shouldn't have been introduced, this patch simply moves
mq's release into blk_release_queue().
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This reverts commit 76d697d107.
The commit 76d697d107 causes general protection fault
reported from Bart Van Assche:
https://lkml.org/lkml/2015/1/28/334
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq_alloc_request() may establish REQ_MQ_INFLIGHT in addition to
incrementing the hctx->nr_active count. Any cmd_flags that are
established in the newly allocated clone request must be preserved in
addition to the cmd_flags that are later copied over from the original
request as part of blk_rq_prep_clone().
Otherwise, if REQ_MQ_INFLIGHT isn't set in the clone request the
hctx->nr_active count won't get decremented via blk_mq_free_request().
The only consumer of blk_rq_prep_clone() is request-based DM, which uses
blk_rq_init() prior to calling blk_rq_prep_clone() for the non-blk-mq
case. Given the cloned request's cmd_flags will be 0 it is safe to OR
them with the original request's cmd_flags for both the non-blk-mq and
blk-mq cases.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the request passed to blk_insert_cloned_request() was allocated by
a blk-mq device it must be submitted using blk_mq_insert_request().
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Prepare to allow blk_rq_prep_clone() to accept clone requests that were
allocated from blk-mq request queues. As such the blk_rq_prep_clone()
caller must first initialize the clone request.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is the blk-mq part to support tag allocation policy. The default
allocation policy isn't changed (though it's not a strict FIFO). The new
policy is round-robin for libata. But it's a try-best implementation. If
multiple tasks are competing, the tags returned will be mixed (which is
unavoidable even with !mq, as requests from different tasks can be
mixed in queue)
Cc: Jens Axboe <axboe@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The libata tag allocation is using a round-robin policy. Next patch will
make libata use block generic tag allocation, so let's add a policy to
tag allocation.
Currently two policies: FIFO (default) and round-robin.
Cc: Jens Axboe <axboe@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As Christoph put it:
Can we just get rid of the warnings? It's fairly annoying as devices
without partitions are perfectly fine and very useful.
Me too I see this message every VM boot for ages on all my
devices. Would love to just remove it. For me a partition-table
is only needed for a booting BIOS, grub, and stuff.
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkdev_issue_discard() will zero a given block range. This is done by
way of explicit writing, thus provisioning or allocating the blocks on
disk.
There are use cases where the desired behavior is to zero the blocks but
unprovision them if possible. The blocks must deterministically contain
zeroes when they are subsequently read back.
This patch adds a flag to blkdev_issue_zeroout() that provides this
variant. If the discard flag is set and a block device guarantees
discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
the device does not support discard_zeroes_data or if the discard
request fails we will fall back to first REQ_WRITE_SAME and then a
regular REQ_WRITE.
Also update the callers of blkdev_issue_zero() to reflect the new flag
and make sb_issue_zeroout() prefer the discard approach.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Hi,
If you can manage to submit an async write as the first async I/O from
the context of a process with realtime scheduling priority, then a
cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It
ends up in the best effort array, but actually has realtime I/O
scheduling priority set in cfqq->ioprio.
The reason is that cfq_get_queue assumes the default scheduling class and
priority when there is no information present (i.e. when the async cfqq
is created):
static struct cfq_queue *
cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
struct bio *bio, gfp_t gfp_mask)
{
const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);
cic->ioprio starts out as 0, which is "invalid". So, class of 0
(IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:
async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
static struct cfq_queue **
cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
{
switch (ioprio_class) {
case IOPRIO_CLASS_RT:
return &cfqd->async_cfqq[0][ioprio];
case IOPRIO_CLASS_NONE:
ioprio = IOPRIO_NORM;
/* fall through */
case IOPRIO_CLASS_BE:
return &cfqd->async_cfqq[1][ioprio];
case IOPRIO_CLASS_IDLE:
return &cfqd->async_idle_cfqq;
default:
BUG();
}
}
Here, instead of returning a class mapped from the process' scheduling
priority, we get back the bucket associated with IOPRIO_CLASS_BE.
Now, there is no queue allocated there yet, so we create it:
cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);
That function ends up doing this:
cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
cfq_init_prio_data(cfqq, cic);
cfq_init_cfqq marks the priority as having changed. Then, cfq_init_prio
data does this:
ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
switch (ioprio_class) {
default:
printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
case IOPRIO_CLASS_NONE:
/*
* no prio set, inherit CPU scheduling settings
*/
cfqq->ioprio = task_nice_ioprio(tsk);
cfqq->ioprio_class = task_nice_ioclass(tsk);
break;
So we basically have two code paths that treat IOPRIO_CLASS_NONE
differently, which results in an RT async cfqq filed into a best effort
bucket.
Attached is a patch which fixes the problem. I'm not sure how to make
it cleaner. Suggestions would be welcome.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to it's original purpose.
Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead. Splitting this from the backing_dev_info
structure allows to remove lots of backing_dev_info instance that aren't
otherwise needed, and entirely gets rid of the concept of providing a
backing_dev_info for a character device. It also removes the need for
the mtd_inodefs filesystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The kobject memory shouldn't have been freed before the kobject
is released because driver core can access it freely before its
release.
This patch frees hctx in its release callback. For ctx, they
share one single per-cpu variable which is associated with
the request queue, so free ctx in q->mq_kobj's release handler.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(fix ctx kobjects)
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The blk-mq tagging tries to maintain some locality between CPUs and
the tags issued. The tags are split into groups of words, and the
words may not be fully populated. When searching for a new free tag,
blk-mq may look at partial words, hence it passes in an offset/size
to find_next_zero_bit(). However, it does that wrong, the size must
always be the full length of the number of tags in that word,
otherwise we'll potentially miss some near the end.
Another issue is when __bt_get() goes from one word set to the next.
It bumps the index, but not the last_tag associated with the
previous index. Bump that to be in the range of the new word.
Finally, clean up __bt_get() and __bt_get_word() a bit and get
rid of the goto in there, and the unnecessary 'wrap' variable.
Signed-off-by: Jens Axboe <axboe@fb.com>
Requests that haven't been started prior to a queue dying can be ended
in error without waiting for them to start and time out.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Added code comment to explain why this is done.
Signed-off-by: Jens Axboe <axboe@fb.com>
Some types of requests may be started that are not gauranteed to ever
complete. This adds a request flag that a driver can use so mark the
request as such.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Adds a helper function a driver can use to abort requeued requests in
case any are pending when h/w queues are being removed.
Signed-off-by: Jens Axboe <axboe@fb.com>
Kicking requeued requests will start h/w queues in a work_queue, which
may alter the driver's requested state to temporarily stop them. This
patch exports a method to cancel the q->requeue_work so a driver can be
assured stopped h/w queues won't be started up before it is ready.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Drivers can iterate over all allocated request tags, but their callback
needs a way to know if the driver started the request in the first place.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When the queue is set to dying, wake up tasks that are waiting on frozen
queue so they realize it is dying and abandon their request.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Modified by me to add a code comment on the need for the wakeup.
Signed-off-by: Jens Axboe <axboe@fb.com>
Call it what it does - "unparse" is plain-misleading.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
We store it in the tag set, we don't need it in the hardware queue.
While removing cmd_size, place ->queue_num further down to avoid
a hole on 64-bit archs. It's not used in any fast paths, so we
can safely move it.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit b4c6a02877 exported the start and unfreeze, but we need
the regular blk_mq_freeze_queue() for the loop conversion.
Signed-off-by: Jens Axboe <axboe@fb.com>
If it's dying, we can't expect new request to complete and come
in an wake up other tasks waiting for requests. So after we
have marked it as dying, wake up everybody currently waiting
for a request. Once they wake, they will retry their allocation
and fail appropriately due to the state of the queue.
Tested-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This reverts commit 52f7eb945f.
The optimization is only really safe for a single queue, otherwise
'bs' and 'bt' can indeed change, and if we don't do a finish_wait()
for each loop, we'll potentially change the wait structure and
corrupt task wait list.
Reported-by: Jan Kara <jack@suse.cz>
Pull block driver core update from Jens Axboe:
"This is the pull request for the core block IO changes for 3.19. Not
a huge round this time, mostly lots of little good fixes:
- Fix a bug in sysfs blktrace interface causing a NULL pointer
dereference, when enabled/disabled through that API. From Arianna
Avanzini.
- Various updates/fixes/improvements for blk-mq:
- A set of updates from Bart, mostly fixing buts in the tag
handling.
- Cleanup/code consolidation from Christoph.
- Extend queue_rq API to be able to handle batching issues of IO
requests. NVMe will utilize this shortly. From me.
- A few tag and request handling updates from me.
- Cleanup of the preempt handling for running queues from Paolo.
- Prevent running of unmapped hardware queues from Ming Lei.
- Move the kdump memory limiting check to be in the correct
location, from Shaohua.
- Initialize all software queues at init time from Takashi. This
prevents a kobject warning when CPUs are brought online that
weren't online when a queue was registered.
- Single writeback fix for I_DIRTY clearing from Tejun. Queued with
the core IO changes, since it's just a single fix.
- Version X of the __bio_add_page() segment addition retry from
Maurizio. Hope the Xth time is the charm.
- Documentation fixup for IO scheduler merging from Jan.
- Introduce (and use) generic IO stat accounting helpers for non-rq
drivers, from Gu Zheng.
- Kill off artificial limiting of max sectors in a request from
Christoph"
* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
bio: modify __bio_add_page() to accept pages that don't start a new segment
blk-mq: Fix uninitialized kobject at CPU hotplugging
blktrace: don't let the sysfs interface remove trace from running list
blk-mq: Use all available hardware queues
blk-mq: Micro-optimize bt_get()
blk-mq: Fix a race between bt_clear_tag() and bt_get()
blk-mq: Avoid that __bt_get_word() wraps multiple times
blk-mq: Fix a use-after-free
blk-mq: prevent unmapped hw queue from being scheduled
blk-mq: re-check for available tags after running the hardware queue
blk-mq: fix hang in bt_get()
blk-mq: move the kdump check to blk_mq_alloc_tag_set
blk-mq: cleanup tag free handling
blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
blk: introduce generic io stat accounting help function
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
genhd: check for int overflow in disk_expand_part_tbl()
blk-mq: add blk_mq_free_hctx_request()
blk-mq: export blk_mq_free_request()
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
...
The original behaviour is to refuse to add a new page if the maximum
number of segments has been reached, regardless of the fact the page we
are going to add can be merged into the last segment or not.
Unfortunately, when the system runs under heavy memory fragmentation
conditions, a driver may try to add multiple pages to the last segment.
The original code won't accept them and EBUSY will be reported to
userspace.
This patch modifies the function so it refuses to add a page only in case
the latter starts a new segment and the maximum number of segments has
already been reached.
The bug can be easily reproduced with the st driver:
1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
2) modprobe st buffer_kbs=1024
3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
dd: error writing `/dev/st0': Device or resource busy
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: Jet Chen <jet.chen@intel.com>
Cc: Tomas Henzl <thenzl@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
This time we have some more new material than we used to have during
the last couple of development cycles.
The most important part of it to me is the introduction of a unified
interface for accessing device properties provided by platform
firmware. It works with Device Trees and ACPI in a uniform way and
drivers using it need not worry about where the properties come
from as long as the platform firmware (either DT or ACPI) makes
them available. It covers both devices and "bare" device node
objects without struct device representation as that turns out to
be necessary in some cases. This has been in the works for quite
a few months (and development cycles) and has been approved by
all of the relevant maintainers.
On top of that, some drivers are switched over to the new interface
(at25, leds-gpio, gpio_keys_polled) and some additional changes are
made to the core GPIO subsystem to allow device drivers to manipulate
GPIOs in the "canonical" way on platforms that provide GPIO information
in their ACPI tables, but don't assign names to GPIO lines (in which
case the driver needs to do that on the basis of what it knows about
the device in question). That also has been approved by the GPIO
core maintainers and the rfkill driver is now going to use it.
Second is support for hardware P-states in the intel_pstate driver.
It uses CPUID to detect whether or not the feature is supported by
the processor in which case it will be enabled by default. However,
it can be disabled entirely from the kernel command line if necessary.
Next is support for a platform firmware interface based on ACPI
operation regions used by the PMIC (Power Management Integrated
Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms.
That interface is used for manipulating power resources and for
thermal management: sensor temperature reporting, trip point setting
and so on.
Also the ACPI core is now going to support the _DEP configuration
information in a limited way. Basically, _DEP it supposed to reflect
off-the-hierarchy dependencies between devices which may be very
indirect, like when AML for one device accesses locations in an
operation region handled by another device's driver (usually, the
device depended on this way is a serial bus or GPIO controller).
The support added this time is sufficient to make the ACPI battery
driver work on Asus T100A, but it is general enough to be able to
cover some other use cases in the future.
Finally, we have a new cpufreq driver for the Loongson1B processor.
In addition to the above, there are fixes and cleanups all over the
place as usual and a traditional ACPICA update to a recent upstream
release.
As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver
for Intel platforms should be able to handle power management of
the DMA engine correctly, the cpufreq-dt driver should interact
with the thermal subsystem in a better way and the ACPI backlight
driver should handle some more corner cases, among other things.
On top of the ACPICA update there are fixes for race conditions
in the ACPICA's interrupt handling code which might lead to some
random and strange looking failures on some systems.
In the cleanups department the most visible part is the series
of commits targeted at getting rid of the CONFIG_PM_RUNTIME
configuration option. That was triggered by a discussion
regarding the generic power domains code during which we realized
that trying to support certain combinations of PM config options
was painful and not really worth it, because nobody would use them
in production anyway. For this reason, we decided to make
CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and that lead to the
conclusion that the latter became redundant and CONFIG_PM could
be used instead of it. The material here makes that replacement
in a major part of the tree, but there will be at least one more
batch of that in the second part of the merge window.
Specifics:
- Support for retrieving device properties information from ACPI
_DSD device configuration objects and a unified device properties
interface for device drivers (and subsystems) on top of that.
As stated above, this works with Device Trees and ACPI and allows
device drivers to be written in a platform firmware (DT or ACPI)
agnostic way. The at25, leds-gpio and gpio_keys_polled drivers
are now going to use this new interface and the GPIO subsystem
is additionally modified to allow device drivers to assign names
to GPIO resources returned by ACPI _CRS objects (in case _DSD is
not present or does not provide the expected data). The changes
in this set are mostly from Mika Westerberg, Rafael J Wysocki,
Aaron Lu, and Darren Hart with some fixes from others (Fabio Estevam,
Geert Uytterhoeven).
- Support for Hardware Managed Performance States (HWP) as described
in Volume 3, section 14.4, of the Intel SDM in the intel_pstate
driver. CPUID is used to detect whether or not the feature is
supported by the processor. If supported, it will be enabled
automatically unless the intel_pstate=no_hwp switch is present in
the kernel command line. From Dirk Brandewie.
- New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie).
- Support for firmware interface based on ACPI operation regions
used by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR
platforms for power resource control and thermal management
(Aaron Lu).
- Limited support for retrieving off-the-hierarchy dependencies
between devices from ACPI _DEP device configuration objects
and deferred probing support for the ACPI battery driver based
on the _DEP information to make that driver work on Asus T100A
(Lan Tianyu).
- New cpufreq driver for the Loongson1B processor (Kelvin Cheung).
- ACPICA update to upstream revision 20141107 which only affects
tools (Bob Moore).
- Fixes for race conditions in the ACPICA's interrupt handling
code and in the ACPI code related to system suspend and resume
(Lv Zheng and Rafael J Wysocki).
- ACPI core fix for an RCU-related issue in the ioremap() regions
management code that slowed down significantly after CPUs had
been allowed to enter idle states even if they'd had RCU callbakcs
queued and triggered some problems in certain proprietary graphics
driver (and elsewhere). The fix replaces synchronize_rcu() in
that code with synchronize_rcu_expedited() which makes the issue
go away. From Konstantin Khlebnikov.
- ACPI LPSS (Low-Power Subsystem) driver fix to handle power
management of the DMA engine included into the LPSS correctly.
The problem is that the DMA engine doesn't have ACPI PM support
of its own and it simply is turned off when the last LPSS device
having ACPI PM support goes into D3cold. To work around that,
the PM domain used by the ACPI LPSS driver is redesigned so at
least one device with ACPI PM support will be on as long as the
DMA engine is in use. From Andy Shevchenko.
- ACPI backlight driver fix to avoid using it on "Win8-compatible"
systems where it doesn't work and where it was used by default by
mistake (Aaron Lu).
- Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki,
Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and
Ashwin Chaugule (mostly related to the upcoming ARM64 support).
- Intel RAPL (Running Average Power Limit) power capping driver
fixes and improvements including new processor IDs (Jacob Pan).
- Generic power domains modification to power up domains after
attaching devices to them to meet the expectations of device
drivers and bus types assuming devices to be accessible at
probe time (Ulf Hansson).
- Preliminary support for controlling device clocks from the
generic power domains core code and modifications of the
ARM/shmobile platform to use that feature (Ulf Hansson).
- Assorted minor fixes and cleanups of the generic power
domains core code (Ulf Hansson, Geert Uytterhoeven).
- Assorted minor fixes and cleanups of the device clocks control
code in the PM core (Geert Uytterhoeven, Grygorii Strashko).
- Consolidation of device power management Kconfig options by making
CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter
which is now redundant (Rafael J Wysocki and Kevin Hilman). That
is the first batch of the changes needed for this purpose.
- Core device runtime power management support code cleanup related
to the execution of callbacks (Andrzej Hajda).
- cpuidle ARM support improvements (Lorenzo Pieralisi).
- cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and
a new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and
Bartlomiej Zolnierkiewicz).
- New cpufreq driver callback (->ready) to be executed when the
cpufreq core is ready to use a given policy object and cpufreq-dt
driver modification to use that callback for cooling device
registration (Viresh Kumar).
- cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu,
James Geboski, Tomeu Vizoso).
- Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate,
cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao,
Stefan Wahren, Petr Cvek).
- OPP (Operating Performance Points) framework modification to
allow OPPs to be removed too and update of a few cpufreq drivers
(cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added
during initialization) on driver removal (Viresh Kumar).
- Hibernation core fixes and cleanups (Tina Ruchandani and
Markus Elfring).
- PM Kconfig fix related to CPU power management (Pankaj Dubey).
- cpupower tool fix (Prarit Bhargava).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJUhj6JAAoJEILEb/54YlRxTM4P/j5g5SfqvY0QKsn7sR7MGZ6v
nsgCBhJAqTw3ocNC7EAs8z9h2GWy1KbKpakKYWAh9Fs1yZoey7tFSlcv/Rgjlp70
uU5sDQHtpE9mHKiymdsowiQuWgpl962L4k+k8hUslhlvgk1PvVbpajR6OqG8G+pD
asuIW9eh1APNkLyXmRJ3ZPomzs0VmRdZJ0NEs0lKX9mJskqEvxPIwdaxq3iaJq9B
Fo0J345zUDcJnxWblDRdHlOigCimglElfN5qJwaC4KpwUKuBvLRKbp4f69+wfT0c
kYFiR29X5KjJ2kLfP/wKsLyuDCYYXRq3tCia5M1tAqOjZ+UA89H/GDftx/5lntmv
qUlBa35VfdS1SX4HyApZitOHiLgo+It/hl8Z9bJnhyVw66NxmMQ8JYN2imb8Lhqh
XCLR7BxLTah82AapLJuQ0ZDHPzZqMPG2veC2vAzRMYzVijict/p4Y2+qBqONltER
4rs9uRVn+hamX33lCLg8BEN8zqlnT3rJFIgGaKjq/wXHAU/zpE9CjOrKMQcAg9+s
t51XMNPwypHMAYyGVhEL89ImjXnXxBkLRuquhlmEpvQchIhR+mR3dLsarGn7da44
WPIQJXzcsojXczcwwfqsJCR4I1FTFyQIW+UNh02GkDRgRovQqo+Jk762U7vQwqH+
LBdhvVaS1VW4v+FWXEoZ
=5dox
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
"This time we have some more new material than we used to have during
the last couple of development cycles.
The most important part of it to me is the introduction of a unified
interface for accessing device properties provided by platform
firmware. It works with Device Trees and ACPI in a uniform way and
drivers using it need not worry about where the properties come from
as long as the platform firmware (either DT or ACPI) makes them
available. It covers both devices and "bare" device node objects
without struct device representation as that turns out to be necessary
in some cases. This has been in the works for quite a few months (and
development cycles) and has been approved by all of the relevant
maintainers.
On top of that, some drivers are switched over to the new interface
(at25, leds-gpio, gpio_keys_polled) and some additional changes are
made to the core GPIO subsystem to allow device drivers to manipulate
GPIOs in the "canonical" way on platforms that provide GPIO
information in their ACPI tables, but don't assign names to GPIO lines
(in which case the driver needs to do that on the basis of what it
knows about the device in question). That also has been approved by
the GPIO core maintainers and the rfkill driver is now going to use
it.
Second is support for hardware P-states in the intel_pstate driver.
It uses CPUID to detect whether or not the feature is supported by the
processor in which case it will be enabled by default. However, it
can be disabled entirely from the kernel command line if necessary.
Next is support for a platform firmware interface based on ACPI
operation regions used by the PMIC (Power Management Integrated
Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms.
That interface is used for manipulating power resources and for
thermal management: sensor temperature reporting, trip point setting
and so on.
Also the ACPI core is now going to support the _DEP configuration
information in a limited way. Basically, _DEP it supposed to reflect
off-the-hierarchy dependencies between devices which may be very
indirect, like when AML for one device accesses locations in an
operation region handled by another device's driver (usually, the
device depended on this way is a serial bus or GPIO controller). The
support added this time is sufficient to make the ACPI battery driver
work on Asus T100A, but it is general enough to be able to cover some
other use cases in the future.
Finally, we have a new cpufreq driver for the Loongson1B processor.
In addition to the above, there are fixes and cleanups all over the
place as usual and a traditional ACPICA update to a recent upstream
release.
As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for
Intel platforms should be able to handle power management of the DMA
engine correctly, the cpufreq-dt driver should interact with the
thermal subsystem in a better way and the ACPI backlight driver should
handle some more corner cases, among other things.
On top of the ACPICA update there are fixes for race conditions in the
ACPICA's interrupt handling code which might lead to some random and
strange looking failures on some systems.
In the cleanups department the most visible part is the series of
commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration
option. That was triggered by a discussion regarding the generic
power domains code during which we realized that trying to support
certain combinations of PM config options was painful and not really
worth it, because nobody would use them in production anyway. For
this reason, we decided to make CONFIG_PM_SLEEP select
CONFIG_PM_RUNTIME and that lead to the conclusion that the latter
became redundant and CONFIG_PM could be used instead of it. The
material here makes that replacement in a major part of the tree, but
there will be at least one more batch of that in the second part of
the merge window.
Specifics:
- Support for retrieving device properties information from ACPI _DSD
device configuration objects and a unified device properties
interface for device drivers (and subsystems) on top of that. As
stated above, this works with Device Trees and ACPI and allows
device drivers to be written in a platform firmware (DT or ACPI)
agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are
now going to use this new interface and the GPIO subsystem is
additionally modified to allow device drivers to assign names to
GPIO resources returned by ACPI _CRS objects (in case _DSD is not
present or does not provide the expected data). The changes in
this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron
Lu, and Darren Hart with some fixes from others (Fabio Estevam,
Geert Uytterhoeven).
- Support for Hardware Managed Performance States (HWP) as described
in Volume 3, section 14.4, of the Intel SDM in the intel_pstate
driver. CPUID is used to detect whether or not the feature is
supported by the processor. If supported, it will be enabled
automatically unless the intel_pstate=no_hwp switch is present in
the kernel command line. From Dirk Brandewie.
- New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie).
- Support for firmware interface based on ACPI operation regions used
by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR
platforms for power resource control and thermal management (Aaron
Lu).
- Limited support for retrieving off-the-hierarchy dependencies
between devices from ACPI _DEP device configuration objects and
deferred probing support for the ACPI battery driver based on the
_DEP information to make that driver work on Asus T100A (Lan
Tianyu).
- New cpufreq driver for the Loongson1B processor (Kelvin Cheung).
- ACPICA update to upstream revision 20141107 which only affects
tools (Bob Moore).
- Fixes for race conditions in the ACPICA's interrupt handling code
and in the ACPI code related to system suspend and resume (Lv Zheng
and Rafael J Wysocki).
- ACPI core fix for an RCU-related issue in the ioremap() regions
management code that slowed down significantly after CPUs had been
allowed to enter idle states even if they'd had RCU callbakcs
queued and triggered some problems in certain proprietary graphics
driver (and elsewhere). The fix replaces synchronize_rcu() in that
code with synchronize_rcu_expedited() which makes the issue go
away. From Konstantin Khlebnikov.
- ACPI LPSS (Low-Power Subsystem) driver fix to handle power
management of the DMA engine included into the LPSS correctly. The
problem is that the DMA engine doesn't have ACPI PM support of its
own and it simply is turned off when the last LPSS device having
ACPI PM support goes into D3cold. To work around that, the PM
domain used by the ACPI LPSS driver is redesigned so at least one
device with ACPI PM support will be on as long as the DMA engine is
in use. From Andy Shevchenko.
- ACPI backlight driver fix to avoid using it on "Win8-compatible"
systems where it doesn't work and where it was used by default by
mistake (Aaron Lu).
- Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki,
Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin
Chaugule (mostly related to the upcoming ARM64 support).
- Intel RAPL (Running Average Power Limit) power capping driver fixes
and improvements including new processor IDs (Jacob Pan).
- Generic power domains modification to power up domains after
attaching devices to them to meet the expectations of device
drivers and bus types assuming devices to be accessible at probe
time (Ulf Hansson).
- Preliminary support for controlling device clocks from the generic
power domains core code and modifications of the ARM/shmobile
platform to use that feature (Ulf Hansson).
- Assorted minor fixes and cleanups of the generic power domains core
code (Ulf Hansson, Geert Uytterhoeven).
- Assorted minor fixes and cleanups of the device clocks control code
in the PM core (Geert Uytterhoeven, Grygorii Strashko).
- Consolidation of device power management Kconfig options by making
CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter
which is now redundant (Rafael J Wysocki and Kevin Hilman). That
is the first batch of the changes needed for this purpose.
- Core device runtime power management support code cleanup related
to the execution of callbacks (Andrzej Hajda).
- cpuidle ARM support improvements (Lorenzo Pieralisi).
- cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a
new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and
Bartlomiej Zolnierkiewicz).
- New cpufreq driver callback (->ready) to be executed when the
cpufreq core is ready to use a given policy object and cpufreq-dt
driver modification to use that callback for cooling device
registration (Viresh Kumar).
- cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James
Geboski, Tomeu Vizoso).
- Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate,
cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao,
Stefan Wahren, Petr Cvek).
- OPP (Operating Performance Points) framework modification to allow
OPPs to be removed too and update of a few cpufreq drivers
(cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added
during initialization) on driver removal (Viresh Kumar).
- Hibernation core fixes and cleanups (Tina Ruchandani and Markus
Elfring).
- PM Kconfig fix related to CPU power management (Pankaj Dubey).
- cpupower tool fix (Prarit Bhargava)"
* tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits)
i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c
dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
tools: cpupower: fix return checks for sysfs_get_idlestate_count()
drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME
MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
leds: leds-gpio: Fix multiple instances registration without 'label' property
iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
hwrandom / exynos / PM: Use CONFIG_PM in #ifdef
block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
USB / PM: Drop CONFIG_PM_RUNTIME from the USB core
PM: Merge the SET*_RUNTIME_PM_OPS() macros
...
When a CPU is hotplugged, the current blk-mq spews a warning like:
kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong.
CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8
ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58
ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007
Call Trace:
[<ffffffff81005306>] dump_trace+0x86/0x330
[<ffffffff81005644>] show_stack_log_lvl+0x94/0x170
[<ffffffff81006d21>] show_stack+0x21/0x50
[<ffffffff81605f07>] dump_stack+0x41/0x51
[<ffffffff8132c7a0>] kobject_add+0xa0/0xb0
[<ffffffff8130aee1>] blk_mq_register_hctx+0x91/0xb0
[<ffffffff8130b82e>] blk_mq_sysfs_register+0x3e/0x60
[<ffffffff81309298>] blk_mq_queue_reinit_notify+0xf8/0x190
[<ffffffff8107cfdc>] notifier_call_chain+0x4c/0x70
[<ffffffff8105fd23>] cpu_notify+0x23/0x50
[<ffffffff81060037>] _cpu_up+0x157/0x170
[<ffffffff810600d9>] cpu_up+0x89/0xb0
[<ffffffff815fa5b5>] cpu_subsys_online+0x35/0x80
[<ffffffff814323cd>] device_online+0x5d/0xa0
[<ffffffff81432485>] online_store+0x75/0x80
[<ffffffff81236a5a>] kernfs_fop_write+0xda/0x150
[<ffffffff811c5532>] vfs_write+0xb2/0x1f0
[<ffffffff811c5f42>] SyS_write+0x42/0xb0
[<ffffffff8160c4ed>] system_call_fastpath+0x16/0x1b
[<00007f0132fb24e0>] 0x7f0132fb24e0
This is indeed because of an uninitialized kobject for blk_mq_ctx.
The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but it
goes loop over hctx_for_each_ctx(), i.e. it initializes only for
online CPUs. Thus, when a CPU is hotplugged, the ctx for the newly
onlined CPU is registered without initialization.
This patch fixes the issue by initializing the all ctx kobjects
belonging to each queue.
Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Suppose that a system has two CPU sockets, three cores per socket,
that it does not support hyperthreading and that four hardware
queues are provided by a block driver. With the current algorithm
this will lead to the following assignment of CPU cores to hardware
queues:
HWQ 0: 0 1
HWQ 1: 2 3
HWQ 2: 4 5
HWQ 3: (none)
This patch changes the queue assignment into:
HWQ 0: 0 1
HWQ 1: 2
HWQ 2: 3 4
HWQ 3: 5
In other words, this patch has the following three effects:
- All four hardware queues are used instead of only three.
- CPU cores are spread more evenly over hardware queues. For the
above example the range of the number of CPU cores associated
with a single HWQ is reduced from [0..2] to [1..2].
- If the number of HWQ's is a multiple of the number of CPU sockets
it is now guaranteed that all CPU cores associated with a single
HWQ reside on the same CPU socket.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
calls into a single call.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
What we need is the following two guarantees:
* Any thread that observes the effect of the test_and_set_bit() by
__bt_get_word() also observes the preceding addition of 'current'
to the appropriate wait list. This is guaranteed by the semantics
of the spin_unlock() operation performed by prepare_and_wait().
Hence the conversion of test_and_set_bit_lock() into
test_and_set_bit().
* The wait lists are examined by bt_clear() after the tag bit has
been cleared. clear_bit_unlock() guarantees that any thread that
observes that the bit has been cleared also observes the store
operations preceding clear_bit_unlock(). However,
clear_bit_unlock() does not prevent that the wait lists are examined
before that the tag bit is cleared. Hence the addition of a memory
barrier between clear_bit() and the wait list examination.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
If __bt_get_word() is called with last_tag != 0, if the first
find_next_zero_bit() fails, if after wrap-around the
test_and_set_bit() call fails and find_next_zero_bit() succeeds,
if the next test_and_set_bit() call fails and subsequently
find_next_zero_bit() does not find a zero bit, then another
wrap-around will occur. Avoid this by introducing an additional
local variable.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq users are allowed to free the memory request_queue.tag_set
points at after blk_cleanup_queue() has finished but before
blk_release_queue() has started. This can happen e.g. in the SCSI
core. The SCSI core namely embeds the tag_set structure in a SCSI
host structure. The SCSI host structure is freed by
scsi_host_dev_release(). This function is called after
blk_cleanup_queue() finished but can be called before
blk_release_queue().
This means that it is not safe to access request_queue.tag_set from
inside blk_release_queue(). Hence remove the blk_sync_queue() call
from blk_release_queue(). This call is not necessary - outstanding
requests must have finished before blk_release_queue() is
called. Additionally, move the blk_mq_free_queue() call from
blk_release_queue() to blk_cleanup_queue() to avoid that struct
request_queue.tag_set gets accessed after it has been freed.
This patch avoids that the following kernel oops can be triggered
when deleting a SCSI host for which scsi-mq was enabled:
Call Trace:
[<ffffffff8109a7c4>] lock_acquire+0xc4/0x270
[<ffffffff814ce111>] mutex_lock_nested+0x61/0x380
[<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180
[<ffffffff8124d654>] blk_release_queue+0x84/0xd0
[<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8126c140>] kobject_put+0x30/0x70
[<ffffffff81245895>] blk_put_queue+0x15/0x20
[<ffffffff8125c409>] disk_release+0x99/0xd0
[<ffffffff8133d056>] device_release+0x36/0xb0
[<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8126c140>] kobject_put+0x30/0x70
[<ffffffff8125a78a>] put_disk+0x1a/0x20
[<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0
[<ffffffff811d56a0>] blkdev_put+0x50/0x160
[<ffffffff81199eb4>] kill_block_super+0x44/0x70
[<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60
[<ffffffff8119a87e>] deactivate_super+0x4e/0x70
[<ffffffff811b9833>] cleanup_mnt+0x43/0x90
[<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20
[<ffffffff8107252c>] task_work_run+0xac/0xe0
[<ffffffff81002c01>] do_notify_resume+0x61/0xa0
[<ffffffff814d2c58>] int_signal+0x12/0x17
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch is the usual mix of driver updates (srp, ipr, scsi_debug, NCR5380,
fnic, 53c974, ses, wd719x, hpsa, megaraid_sas). Of those, wd7a9x is new and
53c974 is a rewrite of the old tmscsim driver and the extensive work by Finn
Thain rewrites all the NCR5380 based drivers. There's also extensive
infrastructure updates: a new logging infrastructure for sense information and
a rewrite of the tagged command queue API and an assortment of minor updates.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJUhgazAAoJEDeqqVYsXL0MxIYH/2wCs9sne5cfDTjfufLJDdu1
WLoJgNMxv+14OukknJfG2Kk1WlHgLRM5+TVWIGiG0mmjXuFyShzIqEOHKTDWqxnE
tBH4wLi/+XYqZAmAeim4/2zhvf+cUVVlIu01VERR5uwaBWYM8BeLSwjdnAAvEEwb
iV74p1WV6frXo4guADplgtkjD0YxI4MTUZ1figRMlLO6WLFFyQ+95UfY8jFs+eQv
zk63y7Mm7dQNd57/Wl3i89lw0kqlaJSZNl8Ovj1axy4rDYzT1wXhY8mEwD8fI8Ym
wahldjFE5vXgj0NpO3tB3Z+UDP2YmQduMyzTxkPNnrEPKiOCsnQo42XR6vb92cQ=
=Y+DU
-----END PGP SIGNATURE-----
Merge tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This patch is the usual mix of driver updates (srp, ipr, scsi_debug,
NCR5380, fnic, 53c974, ses, wd719x, hpsa, megaraid_sas).
Of those, wd7a9x is new and 53c974 is a rewrite of the old tmscsim
driver and the extensive work by Finn Thain rewrites all the NCR5380
based drivers.
There's also extensive infrastructure updates: a new logging
infrastructure for sense information and a rewrite of the tagged
command queue API and an assortment of minor updates"
* tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (183 commits)
scsi: set fmt to NULL scsi_extd_sense_format() by default
libsas: remove task_collector mode
wd719x: remove dma_cache_sync call
scsi_debug: add Report supported opcodes+tmfs; Compare and write
scsi_debug: change SCSI command parser to table driven
scsi_debug: add Capacity Changed Unit Attention
scsi_debug: append inject error flags onto scsi_cmnd object
scsi_debug: pinpoint invalid field in sense data
wd719x: Add firmware documentation
wd719x: Introduce Western Digital WD7193/7197/7296 PCI SCSI card driver
eeprom-93cx6: Add (read-only) support for 8-bit mode
esas2r: fix an oversight in setting return value
esas2r: fix an error path in esas2r_ioctl_handler
esas2r: fir error handling in do_fm_api
scsi: add SPC-3 command definitions
scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16
scsi: remove scsi_driver owner field
scsi: move scsi_dispatch_cmd to scsi_lib.c
scsi: stop passing a gfp_mask argument down the command setup path
scsi: remove scsi_next_command
...
When one hardware queue has no mapped software queues, it
shouldn't have been scheduled. Otherwise WARNING or OOPS
can triggered.
blk_mq_hw_queue_mapped() helper is introduce for fixing
the problem.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we run out of tags and have to sleep, we run the hardware queue
to kick pending IO into gear. During that run, we may have completed
requests, so re-check if we have free tags before going to sleep.
Signed-off-by: Jens Axboe <axboe@fb.com>
Avoid that if there are fewer hardware queues than CPU threads that
bt_get() can hang. The symptoms of the hang were as follows:
* All tags allocated for a particular hardware queue.
* (nr_tags) pending commands for that hardware queue.
* No pending commands for the software queues associated with that
hardware queue.
Signed-off-by: Jens Axboe <axboe@fb.com>
After commit b2b49ccbdd (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks
depending on CONFIG_PM_RUNTIME may now be changed to depend on
CONFIG_PM.
Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core.
Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
bio integrity handling is broken on a system with LVM layered atop a
DIF/DIX SCSI drive because device mapper clones the bio, modifies the
clone, and sends the clone to the lower layers for processing.
However, the clone bio has bi_vcnt == 0, which means that when the sd
driver calls bio_integrity_process to attach DIX data, the
for_each_segment_all() call (which uses bi_vcnt) returns immediately
and random garbage is sent to the disk on a disk write. The disk of
course returns an error.
Therefore, teach bio_integrity_process() to use bio_for_each_segment()
to iterate the bio_vecs, since the per-bio iterator tracks which
bio_vecs are associated with that particular bio. The integrity
handling code is effectively part of the "driver" (it's not the bio
owner), so it must use the correct iterator function.
v2: Fix a compiler warning about abandoned local variables. This
patch supersedes "block: bio_integrity_process uses wrong bio_vec
iterator". Patch applies against 3.18-rc6.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We call blk_mq_alloc_tag_set() first then blk_mq_init_queue(). The requests are
allocated in the former function. So the kdump check should be moved to there
to really save memory.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We only call __blk_mq_put_tag() and __blk_mq_put_reserved_tag()
from blk_mq_put_tag(), so just inline the two calls instead of
having them as separate functions.
Signed-off-by: Jens Axboe <axboe@fb.com>
We currently use num_possible_cpus(), but that breaks on sparc64 where
the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID
instead, so we don't end up reading from invalid memory.
Cc: stable@kernel.org # 3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
SPC-3 defines SERVICE ACTION IN(12) and SERVICE ACTION IN(16).
So rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 to be
consistent with SPC and to allow for better distinction.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Many block drivers accounting io stat based on bio (e.g. NVMe...),
the blk_account_io_start/end() which is based on request
does not make sense to them, so here we introduce the similar help
function named generic_start/end_io_acct base on raw sectors, and it can
simplify some driver's open io accounting code.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't duplicate the code to handle the not cpu bounce case in the
caller, do it inside blk_mq_hctx_next_cpu instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
with a user passed in partno value. If we pass in 0x7fffffff, the
new target in disk_expand_part_tbl() overflows the 'int' and we
access beyond the end of ptbl->part[] and even write to it when we
do the rcu_assign_pointer() to assign the new partition.
Reported-by: David Ramos <daramos@stanford.edu>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
It's silly to use blk_mq_free_request() which in turn maps the
request to the hardware queue, for places where we already know
what the hardware queue is. This saves us an extra mapping of a
hardware queue on request completion, if the caller knows this
information already.
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently scsi piggy backs on the block layer to define the concept
of a tagged command. But we want to be able to have block-level host-wide
tags assigned even for untagged commands like the initial INQUIRY, so add
a new SCSI-level flag for commands that are tagged at the scsi level, so
that even commands without that set can have tags assigned to them. Note
that this alredy is the case for the blk-mq code path, and this just lets
the old path catch up with it.
We also set this flag based upon sdev->simple_tags instead of the block
queue flag, so that it is entirely independent of the block layer tagging,
and thus always correct even if a driver doesn't use block level tagging
yet.
Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and
removing it forces them to look for the proper replacement.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
The queuecommand() callback functions in SCSI low-level drivers
need to know which hardware context has been selected by the
block layer. Since this information is not available in the
request structure, and since passing the hctx pointer directly to
the queuecommand callback function would require modification of
all SCSI LLDs, add a function to the block layer that allows to
query the hardware context index.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
For cloned bio, bio->bi_vcnt can't be used at all, and we
have resort to bio_segments() to figure out how many
segment there are in the bio.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq is using preempt_disable/enable in order to ensure that the
queue runners are placed on the right CPU. This does not work with
the RT patches, because __blk_mq_run_hw_queue takes a non-raw
spinlock with the preemption-disabled region. If there is contention
on the lock, this violates the rules for preemption-disabled regions.
While this should be easily fixable within the RT patches just by doing
migrate_disable/enable, we can do better and document _why_ this
particular region runs with disabled preemption. After the previous
patch, it is trivial to switch it to get/put_cpu; the RT patches then
can change it to get_cpu_light, which lets virtio-blk run under RT
kernels.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Clark Williams <williams@redhat.com>
Tested-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
preempt_disable/enable surrounds every call to blk_mq_run_hw_queue,
except the one in blk-flush.c. In fact that one is always asynchronous,
and it does not need smp_processor_id().
We can do the same for all other calls, avoiding preempt_disable when
async is true. This avoids peppering blk-mq.c with preemption-disabled
regions.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Clark Williams <williams@redhat.com>
Tested-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Fix an error path in SCSI_IOCTL_SEND_COMMAND that calls
blk_put_request(rq) on an invalid IS_ERR(rq) pointer.
Fixes: a492f07545 ("block,scsi: fixup blk_get_request dead queue scenarios")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
q->mq_usage_counter is a percpu_ref which is killed and drained when
the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit()
which involves freezing the queue is invoked on all existing queues.
Because percpu_ref killing and draining involve a RCU grace period,
doing the above on one queue after another may take a long time if
there are many queues on the system.
This patch splits out initiation of freezing and waiting for its
completion, and updates blk_mq_queue_reinit_notify() so that the
queues are frozen in parallel instead of one after another. Note that
freezing and unfreezing are moved from blk_mq_queue_reinit() to
blk_mq_queue_reinit_notify().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Priority of a merged request is computed by ioprio_best(). If one of the
requests has undefined priority (IOPRIO_CLASS_NONE) and another request
has priority from IOPRIO_CLASS_BE, the function will return the
undefined priority which is wrong. Fix the function to properly return
priority of a request with the defined priority.
Fixes: d58cdfb89c
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Drivers can now tell blk-mq if they take advantage of the deferred
issue through 'last' or not. If they do, don't do queue-direct
for sync IO. This is a preparation patch for the nvme conversion.
Signed-off-by: Jens Axboe <axboe@fb.com>
Since we have the notion of a 'last' request in a chain, we can use
this to have the hardware optimize the issuing of requests. Add
a list_head parameter to queue_rq that the driver can use to
temporarily store hw commands for issue when 'last' is true. If we
are doing a chain of requests, pass in a NULL list for the first
request to force issue of that immediately, then batch the remainder
for deferred issue until the last request has been sent.
Instead of adding yet another argument to the hot ->queue_rq path,
encapsulate the passed arguments in a blk_mq_queue_data structure.
This is passed as a constant, and has been tested as faster than
passing 4 (or even 3) args through ->queue_rq. Update drivers for
the new ->queue_rq() prototype. There are no functional changes
in this patch for drivers - if they don't use the passed in list,
then they will just queue requests individually like before.
Signed-off-by: Jens Axboe <axboe@fb.com>
while compiling integer err was showing as a set but unused variable.
elevator_init_fn can be either cfq_init_queue or deadline_init_queue
or noop_init_queue.
all three of these functions are returning -ENOMEM if they fail to
allocate the queue.
so we should actually be returning the error code rather than
returning 0 always.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
When sg_scsi_ioctl() fails to prepare request to submit in
blk_rq_map_kern() we jump to a label where we just end up copying
(luckily zeroed-out) kernel buffer to userspace instead of reporting
error. Fix the problem by jumping to the right label.
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-scsi@vger.kernel.org
CC: stable@vger.kernel.org
Coverity-id: 1226871
Signed-off-by: Jan Kara <jack@suse.cz>
Fixed up the, now unused, out label.
Signed-off-by: Jens Axboe <axboe@fb.com>
The problem is introduced by commit 764f612c6c3c231b(blk-merge:
don't compute bi_phys_segments from bi_vcnt for cloned bio),
and merge is needed if number of current segment isn't less than
max segments.
Strictly speaking, bio->bi_vcnt shouldn't be used here since
it may not be accurate in cases of both cloned bio or bio cloned
from, but bio_segments() is a bit expensive, and bi_vcnt is still
the biggest number, so the approach should work.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Set max_sectors to the value the drivers provides as hardware limit by
default. Linux had proper I/O throttling for a long time and doesn't
rely on a artifically small maximum I/O size anymore. By not limiting
the I/O size by default we remove an annoying tuning step required for
most Linux installation.
Note that both the user, and if absolutely required the driver can still
impose a limit for FS requests below max_hw_sectors_kb.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
All other allocs are done on the specific node, somehow the
cpumask for hw queue runs was missed. Fix that by using
zalloc_cpumask_var_node() in blk_mq_init_queue().
Signed-off-by: Jens Axboe <axboe@fb.com>
bip_slab is created with SLAB_PANIC, so the fail handler is unneeded.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
In __get_request calls to printk_ratelimited, include the function name so
the callbacks suppressed message matches the messages that are printed,
and add "dev" before the device name so it matches other block layer
messages.
Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
GFP_KERNEL, it tries to allocate from what's already available to
the allocator and a work item tries to keep the reserve around
certain level so that these atomic allocations usually succeed.
This will replace the ad-hoc percpu memory pool used by
blk-throttle and also be used by the planned blkcg support for
writeback IOs.
Please note that I noticed a bug in how @gfp is interpreted while
preparing this pull request and applied the fix 6ae833c7fe
("percpu: fix how @gfp is interpreted by the percpu allocator")
just now.
- percpu_ref now uses longs for percpu and global counters instead of
ints. It leads to more sparse packing of the percpu counters on
64bit machines but the overhead should be negligible and this
allows using percpu_ref for refcnting pages and in-memory objects
directly.
- The switching between percpu and single counter modes of a
percpu_ref is made independent of putting the base ref and a
percpu_ref can now optionally be initialized in single or killed
mode. This allows avoiding percpu shutdown latency for cases where
the refcounted objects may be synchronously created and destroyed
in rapid succession with only a fraction of them reaching fully
operational status (SCSI probing does this when combined with
blk-mq support). It's also planned to be used to implement forced
single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
It isn't correct to figure out req->bi_phys_segments from bio->bi_vcnt
if the bio is cloned.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The math in both blk_stack_limits() and queue_limit_alignment_offset()
assume that a block device's io_min (aka minimum_io_size) is always a
power-of-2. Fix the math such that it works for non-power-of-2 io_min.
This issue (of alignment_offset != 0) became apparent when testing
dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
block size") unlocked the potential for alignment_offset != 0 due to
the dm-thin-pool's io_min possibly being a non-power-of-2.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We currently divide the queue depth by 4 as our batch wakeup
count, but we split the wakeups over BT_WAIT_QUEUES number of
wait queues. This defaults to 8. If the product of the resulting
batch wake count and BT_WAIT_QUEUES is higher than the device
queue depth, we can get into a situation where a task goes to
sleep waiting for a request, but never gets woken up.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: 4bb659b156
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Users of bio_clone_fast() do not want bios with their own bvecs.
Allocating a bvec mempool as part of the bioset intended for such users
is a waste of memory.
bioset_create_nobvec() creates a bioset that doesn't have the bvec
mempool.
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Request cloning clones bios in the request to track the completion
of each bio.
For that purpose, we can use bio_clone_fast() instead of bio_clone()
to avoid unnecessary allocation and copy of bvecs.
This patch reduces memory footprint of request-based device-mapper
(about 1-4KB for each request) and is a preparation for further
reduction of memory usage by removing unused bvec mempool.
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The rq_complete tracepoint was never issued for empty requests,
causing the resulting blktrace information to never show any
completion for those request.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics
and a slightly buggy strncasecmp. The latter is the POSIX name, so
strnicmp was renamed to strncasecmp, and strnicmp made into a wrapper
for the new strncasecmp to avoid breaking existing users.
To allow the compat wrapper strnicmp to be removed at some point in
the future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
The T10 Protection Information format is also used by some devices that
do not go through the SCSI layer (virtual block devices, NVMe). Relocate
the relevant functions to a block layer library that can be used without
involving SCSI.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We'd occasionally merge requests with conflicting integrity flags.
Introduce a merge helper which checks that the requests have compatible
integrity payloads.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Make the choice of checksum a per-I/O property by introducing a flag
that can be inspected by the SCSI layer. There are several reasons for
this:
1. It allows us to switch choice of checksum without unloading and
reloading the HBA driver.
2. During error recovery we need to be able to tell the HBA that
checksums read from disk should not be verified and converted to IP
checksums.
3. For error injection purposes we need to be able to write a bad guard
tag to storage. Since the storage device only supports T10 CRC we
need to be able to disable IP checksum conversion on the HBA.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Move flags affecting the integrity code out of the bio bi_flags and into
the block integrity payload.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
So far we have relied on the app tag size to determine whether a disk
has been formatted with T10 protection information or not. However, not
all target devices provide application tag storage.
Add a flag to the block integrity profile that indicates whether the
disk has been formatted with protection information.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a BLK_ prefix to the integrity profile flags. Also rename the flags
to be more consistent with the generate/verify terminology in the rest
of the integrity code.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of the "operate" parameter we pass in a seed value and a pointer
to a function that can be used to process the integrity metadata. The
generation function is changed to have a return value to fit into this
scheme.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that the protection interval has been detached from the sector size
we need to be able to handle sizes that are different from 4K and
512. Make the interval calculation generic.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The protection interval is not necessarily tied to the logical block
size of a block device. Stop using the terms "sector" and "sectors".
Going forward we will use the term "seed" to describe the initial
reference tag value for a given I/O. "Interval" will be used to describe
the portion of the data buffer that a given piece of protection
information is associated with.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
bip_buf is not really needed so we can remove it.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
None of the filesystems appear interested in using the integrity tagging
feature. Potentially because very few storage devices actually permit
using the application tag space.
Remove the tagging functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For commands like REQ_COPY we need a way to pass extra information along
with each bio. Like integrity metadata this information must be
available at the bottom of the stack so bi_private does not suffice.
Rename the existing bi_integrity field to bi_special and make it a union
so we can have different bio extensions for each class of command.
We previously used bi_integrity != NULL as a way to identify whether a
bio had integrity metadata or not. Introduce a REQ_INTEGRITY to be the
indicator now that bi_special can contain different things.
In addition, bio_integrity(bio) will now return a pointer to the
integrity payload (when applicable).
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdev_integrity_enabled() is only used by bio_integrity_enabled().
Combine these two functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch supports to run one single flush machinery for
each blk-mq dispatch queue, so that:
- current init_request and exit_request callbacks can
cover flush request too, then the buggy copying way of
initializing flush request's pdu can be fixed
- flushing performance gets improved in case of multi hw-queue
In fio sync write test over virtio-blk(4 hw queues, ioengine=sync,
iodepth=64, numjobs=4, bs=4K), it is observed that througput gets
increased a lot over my test environment:
- throughput: +70% in case of virtio-blk over null_blk
- throughput: +30% in case of virtio-blk over SSD image
The multi virtqueue feature isn't merged to QEMU yet, and patches for
the feature can be found in below tree:
git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4
And simply passing 'num_queues=4 vectors=5' should be enough to
enable multi queue(quad queue) feature for QEMU virtio-blk.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(),
so that this function can find the corresponding blk_flush_queue
bound with current mq context since the flush queue will become
per hw-queue.
For legacy queue, the parameter can be simply 'NULL'.
For multiqueue case, the parameter should be set as the context
from which the related request is originated. With this context
info, the hw queue and related flush queue can be found easily.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Just figuring out flush queue at the entry of kicking off flush
machinery and request's completion handler, then pass it through.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now mission of the two helpers is over, and just call
blk_alloc_flush_queue() and blk_free_flush_queue() directly.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch introduces 'struct blk_flush_queue' and puts all
flush machinery related fields into this structure, so that
- flush implementation details aren't exposed to driver
- it is easy to convert to per dispatch-queue flush machinery
This patch is basically a mechanical replacement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch trys to use local variable to access flush request,
so that we can convert to per-queue flush machinery a bit easier.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
These fields are always used with the flush request, so
initialize them together.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
These two temporary functions are introduced for holding flush
initialization and de-initialization, so that we can
introduce 'flush queue' easier in the following patch. And
once 'flush queue' and its allocation/free functions are ready,
they will be removed for sake of code readability.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It is reasonable to allocate flush req in blk_mq_init_flush().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Failure of initializing one hctx isn't handled, so this patch
introduces blk_mq_init_hctx() and its pair to handle it explicitly.
Also this patch makes code cleaner.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.
Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take above ten seconds when
scsi-mq is used.
[ 0.949892] scsi host0: Virtio SCSI HBA
[ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz
<stall>
[ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 16.194099] osd: LOADED open-osd 0.2.1
[ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
[ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
[ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
This is also the reason why request_queues start in bypass mode which
is ended on blk_register_queue() as shutting down a fully functional
queue also involves a RCU grace period and the queues for non-existent
SCSI devices never reach registration.
blk-mq basically needs to do the same thing - start the mq in a
degraded mode which is faster to shut down and then make it fully
functional only after the queue reaches registration. percpu_ref
recently grew facilities to force atomic operation until explicitly
switched to percpu mode, which can be used for this purpose. This
patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
switch it to percpu mode only once blk_register_queue() is reached.
Note that this issue was previously worked around by 0a30288da1
("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
probe") for v3.17. The temp fix was reverted in preparation of adding
persistent atomic mode to percpu_ref by 9eca80461a ("Revert "blk-mq,
percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
This patch and the prerequisite percpu_ref changes will be merged
during v3.18 devel cycle.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda9 ("blk-mq: use percpu_ref for mq usage count")
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.
Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.
This patch adds @flags to percpu_ref_init() and implements the
following flags.
* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained
These flags should be able to serve the above two use cases.
v2: target_core_tpg.c conversion was missing. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
This reverts commit 0a30288da1, which
was a temporary fix for SCSI blk-mq stall issue. The following
patches will fix the issue properly by introducing atomic mode to
percpu_ref.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.
Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take more than ten seconds
when scsi-mq is used.
This will be properly fixed by implementing a mechanism to keep
q->mq_usage_counter in atomic mode till genhd registration; however,
that involves rather big updates to percpu_ref which is difficult to
apply late in the devel cycle (v3.17-rc6 at the moment). As a
stop-gap measure till the proper fix can be implemented in the next
cycle, this patch introduces __percpu_ref_kill_expedited() and makes
blk_mq_freeze_queue() use it. This is heavy-handed but should work
for testing the experimental SCSI blk-mq implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda9 ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 2da78092 changed the locking from a mutex to a spinlock,
so we now longer sleep in this context. But there was a leftover
might_sleep() in there, which now triggers since we do the final
free from an RCU callback. Get rid of it.
Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 8cb34819cdd5d(blk-mq: unshared timeout handler) introduces
blk-mq's own timeout handler, and removes following line:
blk_queue_rq_timed_out(q, blk_mq_rq_timed_out);
which then causes blk_add_timer() to bypass adding the timer,
since blk-mq no longer has q->rq_timed_out_fn defined.
This patch fixes the problem by bypassing the check for blk-mq,
so that both request deadlines are still set and the rolling
timer updated.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It's not uncommon for crash dump kernels to be limited to 128MB or
something low in that area. This is normally not a problem for
devices as we don't use that much memory, but for some shared SCSI
setups with huge queue depths, it can potentially fill most of
memory with tons of request allocations. blk-mq does scale back
when it fails to allocate memory, but it scales back just enough
so that blk-mq succeeds. This could still leave the system with
not enough memory to make any real progress.
Check if we are in a kdump environment and limit the hardware
queues and tag depth.
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch removes two unnecessary blk_clear_rq_complete(),
the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(),
so:
- The blk_clear_rq_complete() in blk_flush_restore_request()
needn't because the request will be freed later, and clearing
it here may open a small race window with timeout.
- The blk_clear_rq_complete() in blk_mq_requeue_request() isn't
necessary too, even though REQ_ATOM_STARTED is cleared in
__blk_mq_requeue_request(), in theory it still may cause a small
race window with timeout since the two clear_bit() may be
reordered.
Signed-off-by: Ming Lei <ming.lei@canoical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allow blk-mq to pass an argument to the timeout handler to indicate
if we're timing out a reserved or regular command. For many drivers
those need to be handled different.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Duplicate the (small) timeout handler in blk-mq so that we can pass
arguments more easily to the driver timeout handler. This enables
the next patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't do a kmalloc from timer to handle timeouts, chances are we could be
under heavy load or similar and thus just miss out on the timeouts.
Fortunately it is very easy to just iterate over all in use tags, and doing
this properly actually cleans up the blk_mq_busy_iter API as well, and
prepares us for the next patch by passing a reserved argument to the
iterator.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we've changed the driver API on the submission side use the
opportunity to fix up the name on the completion side to fit into the
general scheme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When we call blk_mq_start_request from the core blk-mq code before calling into
->queue_rq there is a racy window where the timeout handler can hit before we've
fully set up the driver specific part of the command.
Move the call to blk_mq_start_request into the driver so the driver can start
the request only once it is fully set up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pass an explicit parameter for the last request in a batch to ->queue_rq
instead of using a request flag. Besides being a cleaner and non-stateful
interface this is also required for the next patch, which fixes the blk-mq
I/O submission code to not start a time too early.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When requests are retried due to hw or sw resource shortages,
we often stop the associated hardware queue. So ensure that we
restart the queues when running the requeue work, otherwise the
queue run will be a no-op.
Signed-off-by: Jens Axboe <axboe@fb.com>
__blk_mq_alloc_rq_maps() can be invoked multiple times, if we scale
back the queue depth if we are low on memory. So don't clear
set->tags when we fail, this is handled directly in
the parent function, blk_mq_alloc_tag_set().
Reported-by: Robert Elliott <Elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We should not insert requests into the flush state machine from
blk_mq_insert_request. All incoming flush requests come through
blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait
should only be called for BLOCK_PC requests. All other callers
deal with requests that already went through the flush statemchine
and shouldn't be reinserted into it.
Reported-by: Robert Elliott <Elliott@hp.com>
Debugged-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch should fix the bug reported in
https://lkml.org/lkml/2014/9/11/249.
We have to initialize at least the atomic_flags and the cmd_flags when
allocating storage for the requests.
Otherwise blk_mq_timeout_check() might dereference uninitialized
pointers when racing with the creation of a request.
Also move the reset of cmd_flags for the initializing code to the point
where a request is freed. So we will never end up with pending flush
request indicators that might trigger dereferences of invalid pointers
in blk_mq_timeout_check().
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reported-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com>
Tested-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When we start the request, we set the deadline and flip the bits
marking the request as started and non-complete. However, it's
important that the deadline store is ordered before flipping the
bits, otherwise we could have a small window where the request is
marked started but with an invalid deadline. This can confuse the
timeout handling.
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A bit of churn on the for-linus side that would be nice to have
in the core bits for 3.18, so pull it in to catch us up and make
forward progress easier.
Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:
block/scsi_ioctl.c
If we are running in a kdump environment, resources are scarce.
For some SCSI setups with a huge set of shared tags, we run out
of memory allocating what the drivers is asking for. So implement
a scale back logic to reduce the tag depth for those cases, allowing
the driver to successfully load.
We should extend this to detect low memory situations, and implement
a sane fallback for those (1 queue, 64 tags, or something like that).
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a queue is registered, the block layer turns off the bypass
setting (because bypass is enabled when the queue is created). This
doesn't work well for queues that are unregistered and then registered
again; we get a WARNING because of the unbalanced calls to
blk_queue_bypass_end().
This patch fixes the problem by making blk_register_queue() call
blk_queue_bypass_end() only the first time the queue is registered.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Tejun Heo <tj@kernel.org>
CC: James Bottomley <James.Bottomley@HansenPartnership.com>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch fix spelling typo found in DocBook/kernel-api.xml.
It is because the file is generated from the source comments,
I have to fix the comments in source codes.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
bdev_get_queue() returns the request_queue associated with the
specified block_device. blk_get_backing_dev_info() makes use of
bdev_get_queue() to determine the associated bdi given a block_device.
All the callers of bdev_get_queue() including
blk_get_backing_dev_info() assume that bdev_get_queue() may return
NULL and implement NULL handling; however, bdev_get_queue() requires
the passed in block_device is opened and attached to its gendisk.
Because an active gendisk always has a valid request_queue associated
with it, bdev_get_queue() can never return NULL and neither can
blk_get_backing_dev_info().
Make it clear that neither of the two functions can return NULL and
remove NULL handling from all the callers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg->id is a unique id given to each blkcg; however, the
cgroup_subsys_state which each blkcg embeds already has ->serial_nr
which can be used for the same purpose. Drop blkcg->id and replace
its uses with blkcg->css.serial_nr. Rename cfq_cgroup->blkcg_id to
->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for
consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.
This patch doesn't make any functional difference.
v2: blk-mq conversion was missing. Updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>
Releases the dev_t minor when all references are closed to prevent
another device from acquiring the same major/minor.
Since the partition's release may be invoked from call_rcu's soft-irq
context, the ext_dev_idr's mutex had to be replaced with a spinlock so
as not so sleep.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
In blk-mq.c blk_mq_alloc_tag_set, if:
set->tags = kmalloc_node()
succeeds, but one of the blk_mq_init_rq_map() calls fails,
goto out_unwind;
needs to free set->tags so the caller is not obligated
to do so. None of the current callers (null_blk,
virtio_blk, virtio_blk, or the forthcoming scsi-mq)
do so.
set->tags needs to be set to NULL after doing so,
so other tag cleanup logic doesn't try to free
a stale pointer later. Also set it to NULL
in blk_mq_free_tag_set.
Tested with error injection on the forthcoming
scsi-mq + hpsa combination.
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
QUEUE_FLAG_NO_SG_MERGE is set at default for blk-mq devices,
so bio->bi_phys_segment computed may be bigger than
queue_max_segments(q) for blk-mq devices, then drivers will
fail to handle the case, for example, BUG_ON() in
virtio_queue_rq() can be triggerd for virtio-blk:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1359146
This patch fixes the issue by ignoring the QUEUE_FLAG_NO_SG_MERGE
flag if the computed bio->bi_phys_segment is bigger than
queue_max_segments(q), and the regression is caused by commit
05f1dd53152173(block: add queue flag for disabling SG merging).
Reported-by: Kick In <pierre-andre.morey@canonical.com>
Tested-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Dan writes:
block/bsg.c:327 bsg_map_hdr() error: 'next_rq' dereferencing possible
ERR_PTR().
Fix this by setting next_rq to NULL, for the case where it can be
!= NULL but an error pointer.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The blk_get_request function may fail in low-memory conditions or during
device removal (even if __GFP_WAIT is set). To distinguish between these
errors, modify the blk_get_request call stack to return the appropriate
ERR_PTR. Verify that all callers check the return status and consider
IS_ERR instead of a simple NULL pointer check.
For consistency, make a similar change to the blk_mq_alloc_request leg
of blk_get_request. It may fail if the queue is dead, or the caller was
unwilling to wait.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd]
Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Explain that weight has to be updated on activation.
This complements previous fix e15693ef18 ("cfq-iosched: Fix wrong
children_weight calculation").
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@fb.com>
The blk-core dead queue checks introduce an error scenario to
blk_get_request that returns NULL if the request queue has been
shutdown. This affects the behavior for __GFP_WAIT callers, who should
verify the return value before dereferencing.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cfq_group_service_tree_add() is applying new_weight at the beginning of
the function via cfq_update_group_weight().
This actually allows weight to change between adding it to and subtracting
it from children_weight, and triggers WARN_ON_ONCE() in
cfq_group_service_tree_del(), or even causes oops by divide error during
vfr calculation in cfq_group_service_tree_add().
The detailed scenario is as follows:
1. Create blkio cgroups X and Y as a child of X.
Set X's weight to 500 and perform some I/O to apply new_weight.
This X's I/O completes before starting Y's I/O.
2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
3. cfq_group_service_tree_add() walks up the tree during children_weight
calculation and adds parent X's weight (500) to children_weight of root.
children_weight becomes 500.
4. Set X's weight to 1000.
5. X starts I/O and cfq_group_service_tree_add() is called with X.
6. cfq_group_service_tree_add() applies its new_weight (1000).
7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
8. I/O of X completes and cfq_group_service_tree_del() is called with X.
9. cfq_group_service_tree_del() subtracts X's weight (1000) from
children_weight of root. children_weight becomes -500.
This triggers WARN_ON_ONCE().
10. Set X's weight to 500.
11. X starts I/O and cfq_group_service_tree_add() is called with X.
12. cfq_group_service_tree_add() applies its new_weight (500) and adds it
to children_weight of root. children_weight becomes 0. Calcularion of
vfr triggers oops by divide error.
weight should be updated right before adding it to children_weight.
Reported-by: Ruki Sekiya <sekiya.ruki@lab.ntt.co.jp>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Before commit 2cada584b2 ("block: cleanup error handling in sg_io"),
we had ret = 0 before entering the last big if block of sg_io.
Since 2cada584b2, ret = -EFAULT, which breaks hdparm:
/dev/sda:
setting Advanced Power Management level to 0xc8 (200)
HDIO_DRIVE_CMD failed: Bad address
APM_level = 128
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Fixes: 2cada584b2 ("block: cleanup error handling in sg_io")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_rq_set_block_pc() memsets rq->cmd to 0, so it should come
immediately after blk_get_request() to avoid overwriting the
user-supplied CDB. Also check for failure to allocate rq.
Fixes: f27b087b81 ("block: add blk_rq_set_block_pc()")
Cc: <stable@vger.kernel.org> # 3.16.x
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch fixes code such as the following with scsi-mq enabled:
rq = blk_get_request(...);
blk_rq_set_block_pc(rq);
rq->cmd = my_cmd_buffer; /* separate CDB buffer */
blk_execute_rq_nowait(...);
Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for
large CDBs only). Without this patch, scsi_mq_prep_fn() will set
rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Make sure we always clean up through the out label and just have
a single place to put the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
While converting to percpu_ref for freezing, add703fda9 ("blk-mq:
use percpu_ref for mq usage count") incorrectly made
blk_mq_freeze_queue() misbehave when freezing is nested due to
percpu_ref_kill() being invoked on an already killed ref.
Fix it by making blk_mq_freeze_queue() kill and kick the queue only
for the outermost freeze attempt. All the nested ones can simply wait
for the ref to reach zero.
While at it, remove unnecessary @wake initialization from
blk_mq_unfreeze_queue().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When getting a pi error we get to bio_integrity_end_io with
bi_remaining already decremented to 0 where we will eventually
need to call bio_endio with restored original bio completion handler.
Calling bio_endio invokes a BUG_ON(). We should call bio_endio_nodec
instead, like what is done in bio_integrity_verify_fn.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time,
to determine whether it should merge IO or not. However, this could
also be disabled by the admin, if merging is switched off through
sysfs. So check the general queue state as well before attempting
to merge IO.
Reported-by: Rob Elliott <Elliott@hp.com>
Tested-by: Rob Elliott <Elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Before doing queue release, the queue has been freezed already
by blk_cleanup_queue(), so needn't to freeze queue for deleting
tag set.
This patch fixes the WARNING of "percpu_ref_kill() called more than once!"
which is triggered during unloading block driver.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
allow thin snapshots to be larger than the external origin.
. Add support for quickly loading a repetitive pattern into the
dm-switch target.
. Use per-bio data in the dm-crypt target instead of always using a
mempool for each allocation. Required switching to kmalloc alignment
for the bio slab.
. Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag
. Fix the dm-cache and dm-thin targets' export of the minimum_io_size to
match the data block size -- this fixes an issue where mkfs.xfs would
improperly infer raid striping was in place on the underlying storage.
. Small cleanups in dm-io, dm-mpath and dm-cache
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT64yEAAoJEMUj8QotnQNatjQH/2mqm8EtPuZas70zHVDzjMlE
ZyV8xgHpU0MBmiBi+JhUBv9iKX4sVa+C25559WkKtxRVMnZmI1WDry4TagiqrhnK
9o/uvdWigJMR+uwahwe4UErEtKscOQJD30a8taN/suJ6Z2C7XJJRUZPsyL4a3Vov
w+UIi7aYDEGp/2VQ8mvTTxjdF5x5km4wKsjBTs03uTrrkEJ+bIUndl2I1X+X4bsw
kiWYOQwmcnD8GwYkSrthJYLsS3Hjur/J/My7KZwXc00ANLOexqHdKfRDwH8b36+m
olKXv3swCd8vi+jJYEYzuW9213ACsSEGP7h8NFVZ/+2FeDsSzB/C7zjW9okIUIw=
=y/3r
-----END PGP SIGNATURE-----
Merge tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper changes from Mike Snitzer:
- Allow the thin target to paired with any size external origin; also
allow thin snapshots to be larger than the external origin.
- Add support for quickly loading a repetitive pattern into the
dm-switch target.
- Use per-bio data in the dm-crypt target instead of always using a
mempool for each allocation. Required switching to kmalloc alignment
for the bio slab.
- Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag
- Fix the dm-cache and dm-thin targets' export of the minimum_io_size
to match the data block size -- this fixes an issue where mkfs.xfs
would improperly infer raid striping was in place on the underlying
storage.
- Small cleanups in dm-io, dm-mpath and dm-cache
* tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm table: propagate QUEUE_FLAG_NO_SG_MERGE
dm switch: efficiently support repetitive patterns
dm switch: factor out switch_region_table_read
dm cache: set minimum_io_size to cache's data block size
dm thin: set minimum_io_size to pool's data block size
dm crypt: use per-bio data
block: use kmalloc alignment for bio slab
dm table: make dm_table_supports_discards static
dm cache metadata: use dm-space-map-metadata.h defined size limits
dm cache: fail migrations in the do_worker error path
dm cache: simplify deferred set reference count increments
dm thin: relax external origin size constraints
dm thin: switch to an atomic_t for tracking pending new block preparations
dm mpath: eliminate pg_ready() wrapper
dm io: simplify dec_count and sync_io
Pull block driver changes from Jens Axboe:
"Nothing out of the ordinary here, this pull request contains:
- A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
and Surbhi Palande. No new features, just a lot of fixes.
- The usual round of drbd updates from Andreas Gruenbacher, Lars
Ellenberg, and Philipp Reisner.
- virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
has taken it one step further and added support for actually using
more than one queue.
- Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
compliment the the default behavior of adding to the tail of the
queue. From Douglas Gilbert"
* 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
bcache: Drop unneeded blk_sync_queue() calls
bcache: add mutex lock for bch_is_open
bcache: Correct printing of btree_gc_max_duration_ms
bcache: try to set b->parent properly
bcache: fix memory corruption in init error path
bcache: fix crash with incomplete cache set
bcache: Fix more early shutdown bugs
bcache: fix use-after-free in btree_gc_coalesce()
bcache: Fix an infinite loop in journal replay
bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
bcache: bcache_write tracepoint was crashing
bcache: fix typo in bch_bkey_equal_header
bcache: Allocate bounce buffers with GFP_NOWAIT
bcache: Make sure to pass GFP_WAIT to mempool_alloc()
bcache: fix uninterruptible sleep in writeback thread
bcache: wait for buckets when allocating new btree root
bcache: fix crash on shutdown in passthrough mode
bcache: fix lockdep warnings on shutdown
bcache allocator: send discards with correct size
bcache: Fix to remove the rcu_sched stalls.
...
Pull block core bits from Jens Axboe:
"Small round this time, after the massive blk-mq dump for 3.16. This
pull request contains:
- Fixes for max_sectors overflow in ioctls from Akinoby Mita.
- Partition off-by-one bug fix in aix partitions from Dan Carpenter.
- Various small partition cleanups from Fabian Frederick.
- Fix for the block integrity code sometimes returning the wrong
vector count from Gu Zheng.
- Cleanup an re-org of the blk-mq queue enter/exit percpu counters
from Tejun. Dependent on the percpu pull for 3.17 (which was in
the block tree too), that you have already pulled in.
- A blkcg oops fix, also from Tejun"
* 'for-3.17/core' of git://git.kernel.dk/linux-block:
partitions: aix.c: off by one bug
blkcg: don't call into policy draining if root_blkg is already gone
Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
bio: modify __bio_add_page() to accept pages that don't start a new segment
block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
block/partitions/efi.c: kerneldoc fixing
block/partitions/msdos.c: code clean-up
block/partitions/amiga.c: replace nolevel printk by pr_err
block/partitions/aix.c: replace count*size kzalloc by kcalloc
bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
blk-mq: use percpu_ref for mq usage count
blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
blk-mq: decouble blk-mq freezing from generic bypassing
block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
blk-mq: fix a memory ordering bug in blk_mq_queue_enter()
The lvip[] array has "state->limit" elements so the condition here
should be >= instead of >.
Fixes: 6ceea22bbb ('partitions: add aix lvm partition support files')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull cgroup changes from Tejun Heo:
"Mostly changes to get the v2 interface ready. The core features are
mostly ready now and I think it's reasonable to expect to drop the
devel mask in one or two devel cycles at least for a subset of
controllers.
- cgroup added a controller dependency mechanism so that block cgroup
can depend on memory cgroup. This will be used to finally support
IO provisioning on the writeback traffic, which is currently being
implemented.
- The v2 interface now uses a separate table so that the interface
files for the new interface are explicitly declared in one place.
Each controller will explicitly review and add the files for the
new interface.
- cpuset is getting ready for the hierarchical behavior which is in
the similar style with other controllers so that an ancestor's
configuration change doesn't change the descendants' configurations
irreversibly and processes aren't silently migrated when a CPU or
node goes down.
All the changes are to the new interface and no behavior changed for
the multiple hierarchies"
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
cpuset: fix the WARN_ON() in update_nodemasks_hier()
cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
cgroup: distinguish the default and legacy hierarchies when handling cftypes
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
cpuset: export effective masks to userspace
cpuset: allow writing offlined masks to cpuset.cpus/mems
cpuset: enable onlined cpu/node in effective masks
cpuset: refactor cpuset_hotplug_update_tasks()
cpuset: make cs->{cpus, mems}_allowed as user-configured masks
cpuset: apply cs->effective_{cpus,mems}
cpuset: initialize top_cpuset's configured masks at mount
cpuset: use effective cpumask to build sched domains
cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
cpuset: update cs->effective_{cpus, mems} when config changes
cpuset: update cpuset->effective_{cpus,mems} at hotplug
cpuset: add cs->effective_cpus and cs->effective_mems
cgroup: clean up sane_behavior handling
...
Various subsystems can ask the bio subsystem to create a bio slab cache
with some free space before the bio. This free space can be used for any
purpose. Device mapper uses this per-bio-data feature to place some
target-specific and device-mapper specific data before the bio, so that
the target-specific data doesn't have to be allocated separately.
This per-bio-data mechanism is used in place of kmalloc, so we need the
allocated slab to have the same memory alignment as memory allocated
with kmalloc.
Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN
alignment when creating the slab cache. This is needed so that dm-crypt
can use per-bio-data for encryption - the crypto subsystem assumes this
data will have the same alignment as kmalloc'ed memory.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Jens Axboe <axboe@fb.com>
While a queue is being destroyed, all the blkgs are destroyed and its
->root_blkg pointer is set to NULL. If someone else starts to drain
while the queue is in this state, the following oops happens.
NULL pointer dereference at 0000000000000028
IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
PGD e4a1067 PUD b773067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
Stack:
ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
Call Trace:
[<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
[<ffffffff81427641>] __blk_drain_queue+0x71/0x180
[<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
[<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
[<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
[<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
[<ffffffff8142d476>] blk_release_queue+0x26/0xd0
[<ffffffff81454968>] kobject_cleanup+0x38/0x70
[<ffffffff81454848>] kobject_put+0x28/0x60
[<ffffffff81427505>] blk_put_queue+0x15/0x20
[<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
[<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
[<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
[<ffffffff817930e2>] device_release+0x32/0xa0
[<ffffffff81454968>] kobject_cleanup+0x38/0x70
[<ffffffff81454848>] kobject_put+0x28/0x60
[<ffffffff817934d7>] put_device+0x17/0x20
[<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
[<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
[<ffffffff817d1257>] sdev_store_delete+0x27/0x30
[<ffffffff81792ca8>] dev_attr_store+0x18/0x30
[<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
[<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
[<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
[<ffffffff811f69bd>] SyS_write+0x4d/0xc0
[<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
776687bce4 ("block, blk-mq: draining can't be skipped even if
bypass_depth was non-zero") made it easier to trigger this bug by
making blk_queue_bypass_start() drain even when it loses the first
bypass test to blk_cleanup_queue(); however, the bug has always been
there even before the commit as blk_queue_bypass_start() could race
against queue destruction, win the initial bypass test but perform the
actual draining after blk_cleanup_queue() already destroyed all blkgs.
Fix it by skippping calling into policy draining if all the blkgs are
already gone.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jet Chen <jet.chen@intel.com>
Cc: stable@vger.kernel.org
Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, cftypes added by cgroup_add_cftypes() are used for both the
unified default hierarchy and legacy ones and subsystems can mark each
file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
appear only on one of them. This is quite hairy and error-prone.
Also, we may end up exposing interface files to the default hierarchy
without thinking it through.
cgroup_subsys will grow two separate cftype addition functions and
apply each only on the hierarchies of the matching type. This will
allow organizing cftypes in a lot clearer way and encourage subsystems
to scrutinize the interface which is being exposed in the new default
hierarchy.
In preparation, this patch adds cgroup_add_legacy_cftypes() which
currently is a simple wrapper around cgroup_add_cftypes() and replaces
all cgroup_add_cftypes() usages with it.
While at it, this patch drops a completely spurious return from
__hugetlb_cgroup_file_init().
This patch doesn't introduce any functional differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Currently, cgroup_subsys->base_cftypes is used for both the unified
default hierarchy and legacy ones and subsystems can mark each file
with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
only on one of them. This is quite hairy and error-prone. Also, we
may end up exposing interface files to the default hierarchy without
thinking it through.
cgroup_subsys will grow two separate cftype arrays and apply each only
on the hierarchies of the matching type. This will allow organizing
cftypes in a lot clearer way and encourage subsystems to scrutinize
the interface which is being exposed in the new default hierarchy.
In preparation, this patch renames cgroup_subsys->base_cftypes to
cgroup_subsys->legacy_cftypes. This patch is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer
to two uint64_t values, so there is no need to translate it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # 3.7+
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
While a queue is being destroyed, all the blkgs are destroyed and its
->root_blkg pointer is set to NULL. If someone else starts to drain
while the queue is in this state, the following oops happens.
NULL pointer dereference at 0000000000000028
IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
PGD e4a1067 PUD b773067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
Stack:
ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
Call Trace:
[<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
[<ffffffff81427641>] __blk_drain_queue+0x71/0x180
[<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
[<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
[<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
[<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
[<ffffffff8142d476>] blk_release_queue+0x26/0xd0
[<ffffffff81454968>] kobject_cleanup+0x38/0x70
[<ffffffff81454848>] kobject_put+0x28/0x60
[<ffffffff81427505>] blk_put_queue+0x15/0x20
[<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
[<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
[<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
[<ffffffff817930e2>] device_release+0x32/0xa0
[<ffffffff81454968>] kobject_cleanup+0x38/0x70
[<ffffffff81454848>] kobject_put+0x28/0x60
[<ffffffff817934d7>] put_device+0x17/0x20
[<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
[<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
[<ffffffff817d1257>] sdev_store_delete+0x27/0x30
[<ffffffff81792ca8>] dev_attr_store+0x18/0x30
[<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
[<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
[<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
[<ffffffff811f69bd>] SyS_write+0x4d/0xc0
[<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
776687bce4 ("block, blk-mq: draining can't be skipped even if
bypass_depth was non-zero") made it easier to trigger this bug by
making blk_queue_bypass_start() drain even when it loses the first
bypass test to blk_cleanup_queue(); however, the bug has always been
there even before the commit as blk_queue_bypass_start() could race
against queue destruction, win the initial bypass test but perform the
actual draining after blk_cleanup_queue() already destroyed all blkgs.
Fix it by skippping calling into policy draining if all the blkgs are
already gone.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jet Chen <jet.chen@intel.com>
Cc: stable@vger.kernel.org
Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
sane_behavior has been used as a development vehicle for the default
unified hierarchy. Now that the default hierarchy is in place, the
flag became redundant and confusing as its usage is allowed on all
hierarchies. There are gonna be either the default hierarchy or
legacy ones. Let's make that clear by removing sane_behavior support
on non-default hierarchies.
This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
cgroup_on_dfl() with sane_behavior specific part dropped.
On the default and legacy hierarchies w/o sane_behavior, this
shouldn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Currently, the blkio subsystem attributes all of writeback IOs to the
root. One of the issues is that there's no way to tell who originated
a writeback IO from block layer. Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages. The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.
cgroup now has a mechanism to express such dependency -
cgroup_subsys->depends_on. This patch declares that blkcg depends on
memcg so that memcg is enabled automatically on the default hierarchy
when available. Future changes will make blkcg map the memcg tag to
find out the cgroup to blame for writeback IOs.
As this means that a memcg may be made invisible, this patch also
implements css_reset() for memcg which resets its basic
configurations. This implementation will probably need to be expanded
to cover other states which are used in the default hierarchy.
v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
build failure. Reported by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
There is no inherent reason why the last put of a tag structure must be
the one for the Scsi_Host, as device model objects can be held for
arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single
funtion that just release a references and get rid of the BUG() when the
host reference wasn't the last.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
The original behaviour is to refuse to add a new page if the maximum
number of segments has been reached, regardless of the fact the page we
are going to add can be merged into the last segment or not.
Unfortunately, when the system runs under heavy memory fragmentation
conditions, a driver may try to add multiple pages to the last segment.
The original code won't accept them and EBUSY will be reported to
userspace.
This patch modifies the function so it refuses to add a page only in case
the latter starts a new segment and the maximum number of segments has
already been reached.
The bug can be easily reproduced with the st driver:
1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
2) modprobe st buffer_kbs=1024
3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
dd: error writing `/dev/st0': Device or resource busy
[ming.lei@canonical.com: update bi_iter.bi_size before recounting segments]
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
After the SG_IO ioctl was copied into the block layer and
later into the bsg driver, subtle differences emerged.
One difference is the way injected commands are queued through
the block layer (i.e. this is not SCSI device queueing nor SATA
NCQ). Summarizing:
- SG_IO on block layer device: blk_exec*(at_head=false)
- sg device SG_IO: at_head=true
- bsg device SG_IO: at_head=true
Some time ago Boaz Harrosh introduced a sg v4 flag called
BSG_FLAG_Q_AT_TAIL to override the bsg driver default. A
recent patch titled: "sg: add SG_FLAG_Q_AT_TAIL flag"
allowed the sg driver default to be overridden. This patch
allows a SG_IO ioctl sent to a block layer device to have
its default overridden.
ChangeLog:
- introduce SG_FLAG_Q_AT_HEAD flag in sg.h to cause
commands that are injected via a block layer
device SG_IO ioctl to set at_head=true
- make comments clearer about queueing in sg.h since the
header is used both by the sg device and block layer
device implementations of the SG_IO ioctl.
- introduce BSG_FLAG_Q_AT_HEAD in bsg.h for compatibility
(it does nothing) and update comments.
Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>
SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access a reserved
buffer in bytes as int type. The value needs to be capped at the request
queue's max_sectors. But integer overflow is not correctly handled in
the calculation when converting max_sectors from sectors to bytes.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
BLKSECTGET ioctl loads the request queue's max_sectors as unsigned
short value to the argument pointer. So if the max_sector is greater
than USHRT_MAX, the upper 16 bits of that is just discarded.
In such case, USHRT_MAX is more preferable than the lower 16 bits of
max_sectors.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
checkpatch fixing:
WARNING: Missing a blank line after declarations
WARNING: space prohibited between function name and open parenthesis '('
ERROR: spaces required around that '<' (ctx:VxV)
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
Also add no prefix pr_fmt to avoid any future default format update
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from
Martin introduces the function bip_integrity_vecs(get the useful vectors)
to fix the issue about nr_vecs for inline integrity vectors that reported
by David Milburn.
But it seems that bip_integrity_vecs() will return the wrong number if the
bio is not based on any bio_set for some reason(bio->bi_pool == NULL),
because in that case, the bip_inline_vecs[0] is malloced directly. So
here we add the bip_max_vcnt to record the count of vector slots, and
cleanup the function bip_integrity_vecs().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight. The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.
This type of code has relatively high chance of subtle bugs which are
extremely difficult to trigger and it's way too hairy to be open coded
in blk-mq. percpu_ref can serve the same purpose after the recent
changes. This patch replaces the open-coded per-cpu usage counting
and draining mechanism with percpu_ref.
blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put. blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero. blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
anything and it's gonna be further simplified. Let's flatten it into
its caller.
This patch doesn't make any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq freezing is entangled with generic bypassing which bypasses
blkcg and io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order. This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO which may
lead to deadlock through memory allocation.
However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypssing to
freezing, which blocks request processing and drains all the in-flight
ones. This causes problems as bypassing assumes that request
processing is online. blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.
Another weirdity is that except for during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones. This shouldn't break
anything but again highlights that something isn't quite right here.
The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms. The only intersecting purpose
that they serve is during queue cleanup. Let's properly separate
blk-mq freezing from generic bypassing and simply use it where
necessary.
* request_queue->mq_freeze_depth is added and
blk_mq_[un]freeze_queue() now operate on this counter instead of
->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
but the counter is tested directly. This will be further updated by
later changes.
* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
blk_mq_freeze_queue(). Queue cleanup path now calls
blk_mq_freeze_queue() directly.
* blk_queue_enter()'s fast path condition is simplified to simply
check @q->mq_freeze_depth. Previously, the condition was
!blk_queue_dying(q) &&
(!blk_queue_bypass(q) || !blk_queue_init_done(q))
mq_freeze_depth is incremented right after dying is set and
blk_queue_init_done() exception isn't necessary as blk-mq doesn't
start frozen, which only leaves the blk_queue_bypass() test which
can be replaced by @q->mq_freeze_depth test.
This change simplifies the code and reduces confusion in the area.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
skip queue draining if bypass_depth was already above zero. The
assumption is that the one which bumped the bypass_depth should have
performed draining already; however, there's nothing which prevents a
new instance of bypassing/freezing from starting before the previous
one finishes draining. The current code may allow the later
bypassing/freezing instances to complete while there still are
in-flight requests which haven't finished draining.
Fix it by draining regardless of bypass_depth. We still skip draining
from blk_queue_bypass_start() while the queue is initializing to avoid
introducing excessive delays during boot. INIT_DONE setting is moved
above the initial blk_queue_bypass_end() so that bypassing attempts
can't slip inbetween.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq uses a percpu_counter to keep track of how many usages are in
flight. The percpu_counter is drained while freezing to ensure that
no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism; unfortunately, it contains a subtle bug -
smp_wmb() in blk_mq_queue_enter() doesn't prevent prevent the cpu from
fetching @q->bypass_depth before incrementing @q->mq_usage_counter and
if freezing happens inbetween the caller can slip through and freezing
can be complete while there are active users.
Use smp_mb() instead so that bypass_depth and mq_usage_counter
modifications and tests are properly interlocked.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
"A small collection of fixes/changes for the current series. This
contains:
- Removal of dead code from Gu Zheng.
- Revert of two bad fixes that went in earlier in this round, marking
things as __init that were not purely used from init.
- A fix for blk_mq_start_hw_queue() using the __blk_mq_run_hw_queue(),
which could place us wrongly. Make it use the non __ variant,
which handles cases where we are called from the wrong CPU set.
From me.
- A fix for drbd, which allocates discard requests without room for
the SCSI payload. From Lars Ellenberg.
- A fix for user-after-free in the blkcg code from Tejun.
- Addition of limiting gaps in SG lists, if the hardware needs it.
This is the last pre-req patch for blk-mq to enable the full NVMe
conversion. Could wait until 3.17, but it's simple enough so would
be nice to have everything we need for the NVMe port in the 3.17
release. From me"
* 'for-linus' of git://git.kernel.dk/linux-block:
drbd: fix NULL pointer deref in blk_add_request_payload
blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()
block: add support for limiting gaps in SG lists
bio: remove unused macro bip_vec_idx()
Revert "block: add __init to elv_register"
Revert "block: add __init to blkcg_policy_register"
blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
floppy: format block0 read error message properly
Currently it calls __blk_mq_run_hw_queue(), which depends on the
CPU placement being correct. This means it's not possible to call
blk_mq_start_hw_queues(q) from a context that is correct for all
queues, leading to triggering the
WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
in __blk_mq_run_hw_queue().
Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Another restriction inherited for NVMe - those devices don't support
SG lists that have "gaps" in them. Gaps refers to cases where the
previous SG entry doesn't end on a page boundary. For NVMe, all SG
entries must start at offset 0 (except the first) and end on a page
boundary (except the last).
Signed-off-by: Jens Axboe <axboe@fb.com>
This reverts commit b5097e956a.
The original commit is buggy, we do use the registration functions
at runtime, for instance when loading IO schedulers through sysfs.
Reported-by: Damien Wyart <damien.wyart@gmail.com>
Hello,
So, this patch should do. Joe, Vivek, can one of you guys please
verify that the oops goes away with this patch?
Jens, the original thread can be read at
http://thread.gmane.org/gmane.linux.kernel/1720729
The fix converts blkg->refcnt from int to atomic_t. It does some
overhead but it should be minute compared to everything else which is
going on and the involved cacheline bouncing, so I think it's highly
unlikely to cause any noticeable difference. Also, the refcnt in
question should be converted to a perpcu_ref for blk-mq anyway, so the
atomic_t is likely to go away pretty soon anyway.
Thanks.
------- 8< -------
__blkg_release_rcu() may be invoked after the associated request_queue
is released with a RCU grace period inbetween. As such, the function
and callbacks invoked from it must not dereference the associated
request_queue. This is clearly indicated in the comment above the
function.
Unfortunately, while trying to fix a different issue, 2a4fd070ee
("blkcg: move bulk of blkcg_gq release operations to the RCU
callback") ignored this and added [un]locking of @blkg->q->queue_lock
to __blkg_release_rcu(). This of course can cause oops as the
request_queue may be long gone by the time this code gets executed.
general protection fault: 0000 [#1] SMP
CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086
RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
Stack:
ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
Call Trace:
[<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
[<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
[<ffffffff81091d81>] kthread+0xe1/0x100
[<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
+fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
+b7
RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
RSP <ffff88085403fdf0>
The request_queue locking was added because blkcg_gq->refcnt is an int
protected with the queue lock and __blkg_release_rcu() needs to put
the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and
dropping queue locking in the function.
Given the general heavy weight of the current request_queue and blkcg
operations, this is unlikely to cause any noticeable overhead.
Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
the near future, so whatever (most likely negligible) overhead it may
add is temporary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
"A smaller collection of fixes for the block core that would be nice to
have in -rc2. This pull request contains:
- Fixes for races in the wait/wakeup logic used in blk-mq from
Alexander. No issues have been observed, but it is definitely a
bit flakey currently. Alternatively, we may drop the cyclic
wakeups going forward, but that needs more testing.
- Some cleanups from Christoph.
- Fix for an oops in null_blk if queue_mode=1 and softirq completions
are used. From me.
- A fix for a regression caused by the chunk size setting. It
inadvertently used max_hw_sectors instead of max_sectors, which is
incorrect, and causes hangs on btrfs multi-disk setups (where hw
sectors apparently isn't set). From me.
- Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a
recent addition as well, but it actually breaks blk-mq which relies
on strict scheduling. If the workqueue power_efficient mode is
turned on, this breaks blk-mq. From Matias.
- null_blk module parameter description fix from Mike"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: bitmap tag: fix races in bt_get() function
blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
blk-mq: bitmap tag: fix races on shared ::wake_index fields
block: blk_max_size_offset() should check ->max_sectors
null_blk: fix softirq completions for queue_mode == 1
blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
blk-mq: properly drain stopped queues
block: remove WQ_POWER_EFFICIENT from kblockd
null_blk: fix name and description of 'queue_mode' module parameter
block: remove elv_abort_queue and blk_abort_flushes
This update fixes few issues in bt_get() function:
- list_empty(&wait.task_list) check is not protected;
- was_empty check is always true which results in *every* thread
entering the loop resets bt_wait_state::wait_cnt counter rather
than every bt->wake_cnt'th thread;
- 'bt_wait_state::wait_cnt' counter update is redundant, since
it also gets reset in bt_clear_tag() function;
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This piece of code in bt_clear_tag() function is racy:
bs = bt_wake_ptr(bt);
if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
atomic_set(&bs->wait_cnt, bt->wake_cnt);
wake_up(&bs->wait);
}
Since nothing prevents bt_wake_ptr() from returning the very
same 'bs' address on multiple CPUs, the following scenario is
possible:
CPU1 CPU2
---- ----
0. bs = bt_wake_ptr(bt); bs = bt_wake_ptr(bt);
1. atomic_dec_and_test(&bs->wait_cnt)
2. atomic_dec_and_test(&bs->wait_cnt)
3. atomic_set(&bs->wait_cnt, bt->wake_cnt);
If the decrement in [1] yields zero then for some amount of time
the decrement in [2] results in a negative/overflow value, which
is not expected. The follow-up assignment in [3] overwrites the
invalid value with the batch value (and likely prevents the issue
from being severe) which is still incorrect and should be a lesser.
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Fix racy updates of shared blk_mq_bitmap_tags::wake_index
and blk_mq_hw_ctx::wake_index fields.
Cc: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull NVMe update from Matthew Wilcox:
"Mostly bugfixes again for the NVMe driver. I'd like to call out the
exported tracepoint in the block layer; I believe Keith has cleared
this with Jens.
We've had a few reports from people who're really pounding on NVMe
devices at scale, hence the timeout changes (and new module
parameters), hotplug cpu deadlock, tracepoints, and minor performance
tweaks"
[ Jens hadn't seen that tracepoint thing, but is ok with it - it will
end up going away when mq conversion happens ]
* git://git.infradead.org/users/willy/linux-nvme: (22 commits)
NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
NVMe: Use Log Page constants in SCSI emulation
NVMe: Define Log Page constants
NVMe: Fix hot cpu notification dead lock
NVMe: Rename io_timeout to nvme_io_timeout
NVMe: Use last bytes of f/w rev SCSI Inquiry
NVMe: Adhere to request queue block accounting enable/disable
NVMe: Fix nvme get/put queue semantics
NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
NVMe: Make admin timeout a module parameter
NVMe: Make iod bio timeout a parameter
NVMe: Prevent possible NULL pointer dereference
NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
NVMe: Update data structures for NVMe 1.2
NVMe: Enable BUILD_BUG_ON checks
NVMe: Update namespace and controller identify structures to the 1.1a spec
NVMe: Flush with data support
NVMe: Configure support for block flush
NVMe: Add tracepoints
NVMe: Protect against badly formatted CQEs
...