License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner, Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, the file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet by Kate, Philippe and Thomas to determine the SPDX license
identifiers to apply to the source files, with confirmation in some cases
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.
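For illustration, the tag added is a single comment on the first line of each
file, and the comment style depends on the file type; this is the kernel's
convention for C code, shown here as an example rather than the script's
literal output:

    /* SPDX-License-Identifier: GPL-2.0 */    <- header (.h) files
    // SPDX-License-Identifier: GPL-2.0       <- source (.c) files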
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * include/linux/backing-dev.h
 *
 * low-level device information and state which is propagated up through
 * to high-level code.
 */

#ifndef _LINUX_BACKING_DEV_H
#define _LINUX_BACKING_DEV_H

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/blkdev.h>
memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that shutdown ensures
that no other work is queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
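Sketched roughly, rather than quoted from the tree, that refcounting is a
plain kref whose release callback does the unregister and free:

    void bdi_put(struct backing_dev_info *bdi)
    {
            /* last reference gone: release_bdi() unregisters and frees */
            kref_put(&bdi->refcnt, release_bdi);
    }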
Unfortunately, del_gendisk() in block/genhd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shut down. So when
one of these wb's is woken up to do delayed work, it tries to
dereference its wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
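The shape of that helper is roughly the following (a sketch of the idea; the
fallback string is illustrative):

    static inline const char *bdi_dev_name(struct backing_dev_info *bdi)
    {
            if (!bdi || !bdi->dev)
                    return "(unknown)";
            return dev_name(bdi->dev);
    }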
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

#include <linux/device.h>
#include <linux/writeback.h>
#include <linux/blk-cgroup.h>
#include <linux/backing-dev-defs.h>
#include <linux/slab.h>

static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
{
        kref_get(&bdi->refcnt);
        return bdi;
}

struct backing_dev_info *bdi_get_by_id(u64 id);
void bdi_put(struct backing_dev_info *bdi);

__printf(2, 3)
int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...);
__printf(2, 0)
int bdi_register_va(struct backing_dev_info *bdi, const char *fmt,
                    va_list args);
void bdi_set_owner(struct backing_dev_info *bdi, struct device *owner);

block: don't release bdi while request_queue has live references
bdi's are initialized in two steps, bdi_init() and bdi_register(), but
destroyed in a single step by bdi_destroy() which, for a bdi embedded
in a request_queue, is called during blk_cleanup_queue() which makes
the queue invisible and starts the draining of remaining usages.
A request_queue's user can access the congestion state of the embedded
bdi as long as it holds a reference to the queue. As such, it may
access the congested state of a queue which finished
blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
Because the congested state was embedded in backing_dev_info which in
turn is embedded in request_queue, accessing the congested state after
bdi_destroy() was called was fine. The bdi was destroyed but the
memory region for the congested state remained accessible till the
queue got released.
a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
bdi_writeback") changed the situation. Now, the root congested state
which is expected to be pinned while request_queue remains accessible
is separately reference counted and the base ref is put during
bdi_destroy(). This means that the root congested state may go away
prematurely while the queue is between bdi_destroy() and
blk_cleanup_queue(), which was detected by Andrey's KASAN tests.
The root cause of this problem is that bdi doesn't distinguish the two
steps of destruction, unregistration and release, and now the root
congested state actually requires a separate release step. To fix the
issue, this patch separates out bdi_unregister() and bdi_exit() from
bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
simple wrapper calling the two steps back-to-back.
While at it, the prototype of bdi_destroy() is moved right below
bdi_setup_and_register() so that the counterpart operations are
located together.
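In other words, after this patch the wrapper is roughly (a sketch of the
split described above):

    void bdi_destroy(struct backing_dev_info *bdi)
    {
            bdi_unregister(bdi);    /* step 1: make the bdi invisible */
            bdi_exit(bdi);          /* step 2: release its resources */
    }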
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Cc: stable@vger.kernel.org # v4.2+
Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com>
Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
Reviewed-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

void bdi_unregister(struct backing_dev_info *bdi);

struct backing_dev_info *bdi_alloc(int node_id);

void wb_start_background_writeback(struct bdi_writeback *wb);
writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback
Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.
This patch moves bdi->wb_lock and ->worklist into wb.
* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
moving, rename it to wb->work_lock as wb->wb_lock is confusing.
Also, move wb->dwork downwards so that it's colocated with the new
->work_lock and ->work_list fields.
* bdi_writeback_workfn() -> wb_workfn()
bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
__bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
get_next_work_item(bdi) -> get_next_work_item(wb)
* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
The function contained parts which belong to the containing bdi
rather than the wb itself - testing cap_writeback_dirty and
bdi_remove_from_list() invocation. Those are moved to
bdi_unregister().
* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
Initializations of the moved bdi->wb_lock and ->work_list are
relocated from bdi_init() to wb_init().
* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

void wb_workfn(struct work_struct *work);
void wb_wakeup_delayed(struct bdi_writeback *wb);

void wb_wait_for_completion(struct wb_completion *done);

extern spinlock_t bdi_lock;
extern struct list_head bdi_list;

writeback: replace custom worker pool implementation with unbound workqueue
Writeback implements its own worker pool - each bdi can be associated
with a worker thread which is created and destroyed dynamically. The
worker thread for the default bdi is always present and serves as the
"forker" thread which forks off worker threads for other bdis.
There's no reason for writeback to implement its own worker pool when
using an unbound workqueue instead is much simpler and more efficient.
This patch replaces custom worker pool implementation in writeback
with an unbound workqueue.
The conversion isn't too complicated but the following points are worth
mentioning.
* bdi_writeback->last_active, task and wakeup_timer are removed.
delayed_work ->dwork is added instead. Explicit timer handling is
no longer necessary. Everything works by either queueing / modding
/ flushing / canceling the delayed_work item.
* bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
bdi_writeback->dwork. On each execution, it processes
bdi->work_list and reschedules itself if there are more things to
do.
The function also handles low-mem condition, which used to be
handled by the forker thread. If the function is running off a
rescuer thread, it only writes out limited number of pages so that
the rescuer can serve other bdis too. This preserves the flusher
creation failure behavior of the forker thread.
* INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
bdi_writeback_workfn() about on-going bdi unregistration so that it
always drains work_list even if it's running off the rescuer. Note
that the original code was broken in this regard. Under memory
pressure, a bdi could finish unregistration with non-empty
work_list.
* The default bdi is no longer special. It now is treated the same as
any other bdi and bdi_cap_flush_forker() is removed.
* BDI_pending is no longer used. Removed.
* Some tracepoints become non-applicable. The following TPs are
removed - writeback_nothread, writeback_wake_thread,
writeback_wake_forker_thread, writeback_thread_start,
writeback_thread_stop.
Everything, including devices coming and going away and rescuer
operation under simulated memory pressure, seems to work fine in my
test setup.
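As a rough illustration of the delayed_work model described in the first
point above (a sketch, not code taken from the patch), every wakeup path
becomes a standard workqueue operation on the per-bdi work item on bdi_wq:

    mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);     /* wake up immediately */
    queue_delayed_work(bdi_wq, &bdi->wb.dwork, HZ);  /* wake up in ~1 second */
    flush_delayed_work(&bdi->wb.dwork);              /* wait for it to finish */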
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>

extern struct workqueue_struct *bdi_wq;
extern struct workqueue_struct *bdi_async_bio_wq;

static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
{
        return test_bit(WB_has_dirty_io, &wb->state);
}

static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
{
        /*
         * @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are
         * any dirty wbs.  See wb_update_write_bandwidth().
         */
        return atomic_long_read(&bdi->tot_write_bandwidth);
}

static inline void __add_wb_stat(struct bdi_writeback *wb,
                                 enum wb_stat_item item, s64 amount)
{
        percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
}

static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
        __add_wb_stat(wb, item, 1);
}

static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
        __add_wb_stat(wb, item, -1);
}

static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
        return percpu_counter_read_positive(&wb->stat[item]);
}

static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
{
        return percpu_counter_sum_positive(&wb->stat[item]);
}

extern void wb_writeout_inc(struct bdi_writeback *wb);

/*
 * maximal error of a stat counter.
 */
static inline unsigned long wb_stat_error(void)
{
#ifdef CONFIG_SMP
        return nr_cpu_ids * WB_STAT_BATCH;
#else
        return 1;
#endif
}

int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio);
int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);

/*
 * Flags in backing_dev_info::capability
 *
 * BDI_CAP_WRITEBACK:       Supports dirty page writeback, and dirty pages
 *                          should contribute to accounting
 * BDI_CAP_WRITEBACK_ACCT:  Automatically account writeback pages
 * BDI_CAP_STRICTLIMIT:     Keep number of dirty pages below bdi threshold
 */
#define BDI_CAP_WRITEBACK               (1 << 0)
#define BDI_CAP_WRITEBACK_ACCT          (1 << 1)
#define BDI_CAP_STRICTLIMIT             (1 << 2)

extern struct backing_dev_info noop_backing_dev_info;

/**
 * writeback_in_progress - determine whether there is writeback in progress
 * @wb: bdi_writeback of interest
 *
 * Determine whether there is writeback waiting to be handled against a
 * bdi_writeback.
 */
static inline bool writeback_in_progress(struct bdi_writeback *wb)
{
        return test_bit(WB_writeback_running, &wb->state);
}

static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
{
        struct super_block *sb;

        if (!inode)
                return &noop_backing_dev_info;

        sb = inode->i_sb;
#ifdef CONFIG_BLOCK
        if (sb_is_blkdev_sb(sb))
                return I_BDEV(inode)->bd_bdi;
#endif
        return sb->s_bdi;
}

static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
{
        return wb->congested & cong_bits;
}

long congestion_wait(int sync, long timeout);

mm/vmscan: don't mess with pgdat->flags in memcg reclaim
memcg reclaim may alter pgdat->flags based on the state of LRU lists in
cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep in
congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
pages. But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested(). Note that only kswapd
has the power to clear any of these bits. This might just never happen if
cgroup limits are configured that way. So all direct reclaims will stall as
long as we have some congested bdi in the system.
Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
it's reasonable to leave all decisions about node state to kswapd.
Why only kswapd? Why not allow global direct reclaim to change these
flags? It is because currently only kswapd can clear these flags. I'm
less worried about the case when PGDAT_CONGESTED is falsely not set, and
more worried about the case when it is falsely set. If a direct reclaimer
sets PGDAT_CONGESTED, do we have a guarantee that after the congestion
problem is sorted out, kswapd will be woken up and clear the flag? It
seems like there is no such guarantee. E.g. direct reclaimers may
eventually balance pgdat and kswapd simply won't wake up (see
wakeup_kswapd()).
Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
now loses its congestion throttling mechanism. Add per-cgroup
congestion state and throttle cgroup2 reclaimers if memcg is in
congestion state.
Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.
The problem could be easily demonstrated by creating heavy congestion in
one cgroup:
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/congester
echo 512M > /sys/fs/cgroup/congester/memory.max
echo $$ > /sys/fs/cgroup/congester/cgroup.procs
/* generate a lot of dirty data on slow HDD */
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
....
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
and some job in another cgroup:
mkdir /sys/fs/cgroup/victim
echo 128M > /sys/fs/cgroup/victim/memory.max
# time cat /dev/sda > /dev/null
real 10m15.054s
user 0m0.487s
sys 1m8.505s
According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of the time sleeping there.
With the patch, cat doesn't waste time anymore:
# time cat /dev/sda > /dev/null
real 5m32.911s
user 0m0.411s
sys 0m56.664s
[aryabinin@virtuozzo.com: congestion state should be per-node]
Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
[ayabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

long wait_iff_congested(int sync, long timeout);

static inline bool mapping_can_writeback(struct address_space *mapping)
{
        return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK;
}

static inline int bdi_sched_wait(void *word)
{
        schedule();
        return 0;
}

#ifdef CONFIG_CGROUP_WRITEBACK

struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
                                    struct cgroup_subsys_state *memcg_css);
struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
                                    struct cgroup_subsys_state *memcg_css,
                                    gfp_t gfp);
void wb_memcg_offline(struct mem_cgroup *memcg);
void wb_blkcg_offline(struct blkcg *blkcg);
int inode_congested(struct inode *inode, int cong_bits);

/**
 * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
 * @inode: inode of interest
 *
 * Cgroup writeback requires support from the filesystem.  Also, both memcg and
 * iocg have to be on the default hierarchy.  Test whether all conditions are
 * met.
 *
 * Note that the test result may change dynamically on the same inode
 * depending on how memcg and iocg are configured.
 */
static inline bool inode_cgwb_enabled(struct inode *inode)
{
        struct backing_dev_info *bdi = inode_to_bdi(inode);

        return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
                cgroup_subsys_on_dfl(io_cgrp_subsys) &&
                (bdi->capabilities & BDI_CAP_WRITEBACK) &&
                (inode->i_sb->s_iflags & SB_I_CGROUPWB);
}

/**
 * wb_find_current - find wb for %current on a bdi
 * @bdi: bdi of interest
 *
 * Find the wb of @bdi which matches both the memcg and blkcg of %current.
 * Must be called under rcu_read_lock() which protects the returned wb.
 * NULL if not found.
 */
static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
{
        struct cgroup_subsys_state *memcg_css;
        struct bdi_writeback *wb;

        memcg_css = task_css(current, memory_cgrp_id);
        if (!memcg_css->parent)
                return &bdi->wb;

        wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);

        /*
         * %current's blkcg equals the effective blkcg of its memcg.  No
         * need to use the relatively expensive cgroup_get_e_css().
         */
blkcg: rename subsystem name from blkio to io
blkio interface has become messy over time and is currently the
largest. In addition to the inconsistent naming scheme, it has
multiple stat files which report more or less the same thing, a number
of debug stat files which expose internal details which shouldn't have
been part of the public interface in the first place, recursive and
non-recursive stats and leaf and non-leaf knobs.
Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
don't make any sense on the unified hierarchy as only leaf cgroups can
contain processes. cgroups is going through a major interface
revision with the unified hierarchy involving significant fundamental
usage changes and given that a significant portion of the interface
doesn't make sense anymore, it's a good time to reorganize the
interface.
As the first step, this patch renames the externally visible subsystem
name from "blkio" to "io". This is more concise, matches the other
two major subsystem names, "cpu" and "memory", and better suited as
blkcg will be involved in anything writeback related too whether an
actual block device is involved or not.
As the subsystem legacy_name is set to "blkio", the only userland
visible change outside the unified hierarchy is that blkcg is reported
as "io" instead of "blkio" in the subsystem initialized message during
boot. On the unified hierarchy, blkcg now appears as "io".
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>

        if (likely(wb && wb->blkcg_css == task_css(current, io_cgrp_id)))
                return wb;
        return NULL;
}

/**
 * wb_get_create_current - get or create wb for %current on a bdi
 * @bdi: bdi of interest
 * @gfp: allocation mask
 *
 * Equivalent to wb_get_create() on %current's memcg.  This function is
 * called from a relatively hot path and optimizes the common cases using
 * wb_find_current().
 */
static inline struct bdi_writeback *
wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
{
        struct bdi_writeback *wb;

        rcu_read_lock();
        wb = wb_find_current(bdi);
        if (wb && unlikely(!wb_tryget(wb)))
                wb = NULL;
        rcu_read_unlock();

        if (unlikely(!wb)) {
                struct cgroup_subsys_state *memcg_css;

                memcg_css = task_get_css(current, memory_cgrp_id);
                wb = wb_get_create(bdi, memcg_css, gfp);
                css_put(memcg_css);
        }
        return wb;
}

/**
 * inode_to_wb_is_valid - test whether an inode has a wb associated
 * @inode: inode of interest
 *
 * Returns %true if @inode has a wb associated.  May be called without any
 * locking.
 */
static inline bool inode_to_wb_is_valid(struct inode *inode)
{
        return inode->i_wb;
}

/**
 * inode_to_wb - determine the wb of an inode
 * @inode: inode of interest
 *
 * Returns the wb @inode is currently associated with.  The caller must be
 * holding either @inode->i_lock, the i_pages lock, or the
 * associated wb's list_lock.
 */
static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
{
#ifdef CONFIG_LOCKDEP
        WARN_ON_ONCE(debug_locks &&
                     (!lockdep_is_held(&inode->i_lock) &&
                      !lockdep_is_held(&inode->i_mapping->i_pages.xa_lock) &&
                      !lockdep_is_held(&inode->i_wb->list_lock)));
#endif
        return inode->i_wb;
}

writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place. This patch build the
framework for the actual switching.
This patch adds a new inode flag I_WB_SWITCHING, which has two
functions. First, the easy one, it ensures that there's only one
switching in progress for a give inode. Second, it's used as a
mechanism to synchronize wb stat updates.
The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively. As such, when an inode is moved from one wb to another,
the inode's portion of those stats have to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some
places.
This patch solves the problem in a similar way as memcg. Each such
lockless stat updates are wrapped in a transaction surrounded by
unlocked_inode_to_wb_begin/end(). During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
mapping->tree_lock is grabbed across the transaction.
In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against
mapping->tree_lock.
This patch still doesn't implement the actual switching.
v3: Updated on top of the recent cancel_dirty_page() updates.
unlocked_inode_to_wb_begin() now nests inside
mem_cgroup_begin_page_stat() to match the locking order.
v2: The i_wb access transaction will be used for !stat accesses too.
Function names and comments updated accordingly.
s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
s/switch_wb/switch_wbs/
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
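
A typical caller then brackets its lockless wb access with the pair; a
minimal sketch (shown with the cookie-based form of the API that appears
later in this header, and a stat update picked purely as an example):

    struct wb_lock_cookie cookie = {};
    struct bdi_writeback *wb;

    wb = unlocked_inode_to_wb_begin(inode, &cookie);
    dec_wb_stat(wb, WB_RECLAIMABLE);    /* example stat update */
    unlocked_inode_to_wb_end(inode, &cookie);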

/**
 * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
 * @inode: target inode
 * @cookie: output param, to be passed to the end function
 *
 * The caller wants to access the wb associated with @inode but isn't
 * holding inode->i_lock, the i_pages lock or wb->list_lock.  This
 * function determines the wb associated with @inode and ensures that the
 * association doesn't change until the transaction is finished with
 * unlocked_inode_to_wb_end().
 *
 * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
 * can't sleep during the transaction.  IRQs may or may not be disabled on
 * return.
 */
static inline struct bdi_writeback *
unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
{
        rcu_read_lock();

        /*
         * Paired with store_release in inode_switch_wbs_work_fn() and
         * ensures that we see the new wb if we see cleared I_WB_SWITCH.
         */
        cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;

        if (unlikely(cookie->locked))
                xa_lock_irqsave(&inode->i_mapping->i_pages, cookie->flags);

        /*
         * Protected by either !I_WB_SWITCH + rcu_read_lock() or the i_pages
         * lock.  inode_to_wb() will bark.  Deref directly.
         */
        return inode->i_wb;
}

/**
 * unlocked_inode_to_wb_end - end inode wb access transaction
 * @inode: target inode
 * @cookie: @cookie from unlocked_inode_to_wb_begin()
 */
static inline void unlocked_inode_to_wb_end(struct inode *inode,
					    struct wb_lock_cookie *cookie)
{
	if (unlikely(cookie->locked))
		xa_unlock_irqrestore(&inode->i_mapping->i_pages, cookie->flags);

	rcu_read_unlock();
}

#else /* CONFIG_CGROUP_WRITEBACK */

static inline bool inode_cgwb_enabled(struct inode *inode)
{
	return false;
}

static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
{
	return &bdi->wb;
}

static inline struct bdi_writeback *
wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
{
	return &bdi->wb;
}

static inline bool inode_to_wb_is_valid(struct inode *inode)
{
	return true;
}

static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
{
	return &inode_to_bdi(inode)->wb;
}

static inline struct bdi_writeback *
unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
{
	return inode_to_wb(inode);
}

static inline void unlocked_inode_to_wb_end(struct inode *inode,
					    struct wb_lock_cookie *cookie)
{
}

static inline void wb_memcg_offline(struct mem_cgroup *memcg)
{
}

static inline void wb_blkcg_offline(struct blkcg *blkcg)
{
}

static inline int inode_congested(struct inode *inode, int cong_bits)
{
	return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
}

#endif /* CONFIG_CGROUP_WRITEBACK */

static inline int inode_read_congested(struct inode *inode)
{
	return inode_congested(inode, 1 << WB_sync_congested);
}

static inline int inode_write_congested(struct inode *inode)
{
	return inode_congested(inode, 1 << WB_async_congested);
}

static inline int inode_rw_congested(struct inode *inode)
{
	return inode_congested(inode, (1 << WB_sync_congested) |
				      (1 << WB_async_congested));
}

static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
{
	return wb_congested(&bdi->wb, cong_bits);
}

static inline int bdi_read_congested(struct backing_dev_info *bdi)
{
	return bdi_congested(bdi, 1 << WB_sync_congested);
}

static inline int bdi_write_congested(struct backing_dev_info *bdi)
{
	return bdi_congested(bdi, 1 << WB_async_congested);
}

static inline int bdi_rw_congested(struct backing_dev_info *bdi)
{
	return bdi_congested(bdi, (1 << WB_sync_congested) |
				  (1 << WB_async_congested));
}

const char *bdi_dev_name(struct backing_dev_info *bdi);

memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shut down
the bdi_writeback structure (or wb), and part of that shutdown ensures
that no other work is queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put() drops the refcount, and when it reaches zero,
release_bdi() gets called, which calls bdi_unregister()).
Unfortunately, del_gendisk() in block/genhd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister() directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shut down. So when
one of these wb's is woken up to do delayed work, it tries to
dereference its wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcgs enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hot-removal of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 13:11:04 +07:00
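
As a minimal sketch of the NULL-safe helper that commit describes (the real definition lives in mm/backing-dev.c; the "(unknown)" fallback string is an assumption here):

	const char *bdi_dev_name(struct backing_dev_info *bdi)
	{
		/* bdi->dev is cleared by bdi_unregister(), so it must not be
		 * dereferenced blindly from the writeback path. */
		if (!bdi || !bdi->dev)
			return "(unknown)";
		return dev_name(bdi->dev);
	}
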
#endif /* _LINUX_BACKING_DEV_H */