// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
 */

#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/buffer_head.h>
#include <linux/gfs2_ondisk.h>
#include <linux/bio.h>
#include <linux/posix_acl.h>
#include <linux/security.h>

#include "gfs2.h"
#include "incore.h"
#include "bmap.h"
#include "glock.h"
#include "glops.h"
#include "inode.h"
#include "log.h"
#include "meta_io.h"
#include "recovery.h"
#include "rgrp.h"
#include "util.h"
#include "trans.h"
#include "dir.h"
#include "lops.h"

struct workqueue_struct *gfs2_freeze_wq;

gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.
This patch makes file system withdraws visible to the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.
The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.
Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.
The "live" glock is now used to signal when a withdraw occurs.
When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only used to indicate that a
withdraw has occurred.
Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.
Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.
We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.
Also, before this patch, if a process encountered an error and
decided to withdraw, if another process was already withdrawing,
the second withdraw would be silently ignored, which set it free
to unlock its glocks. That's correct behavior if the original
withdrawer encounters further errors down the road. But if
secondary waiters don't wait for the journal replay, unlocking
glocks will allow other nodes to use them, despite the fact that
the journal containing those blocks is being replayed. The
replay needs to finish before our glocks are released to other
nodes. IOW, secondary withdraws need to wait for the first
withdraw to finish.
For example, if an rgrp glock is unlocked by a process that didn't
wait for the first withdraw, a journal replay could introduce file
system corruption by replaying a rgrp block that has already been
granted to a different cluster node.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-29 02:23:45 +07:00
extern struct workqueue_struct *gfs2_control_wq;

static void gfs2_ail_error(struct gfs2_glock *gl, const struct buffer_head *bh)
{
	fs_err(gl->gl_name.ln_sbd,
	       "AIL buffer %p: blocknr %llu state 0x%08lx mapping %p page "
	       "state 0x%lx\n",
	       bh, (unsigned long long)bh->b_blocknr, bh->b_state,
	       bh->b_page->mapping, bh->b_page->flags);
	fs_err(gl->gl_name.ln_sbd, "AIL glock %u:%llu mapping %p\n",
	       gl->gl_name.ln_type, gl->gl_name.ln_number,
	       gfs2_glock2aspace(gl));
	gfs2_lm(gl->gl_name.ln_sbd, "AIL error\n");
	gfs2_withdraw(gl->gl_name.ln_sbd);
}

/**
 * __gfs2_ail_flush - remove all buffers for a given lock from the AIL
 * @gl: the glock
 * @fsync: set when called from fsync (not all buffers will be clean)
 *
 * None of the buffers should be dirty, locked, or pinned.
 */

static void __gfs2_ail_flush(struct gfs2_glock *gl, bool fsync,
			     unsigned int nr_revokes)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct list_head *head = &gl->gl_ail_list;
	struct gfs2_bufdata *bd, *tmp;
	struct buffer_head *bh;
	const unsigned long b_state = (1UL << BH_Dirty)|(1UL << BH_Pinned)|(1UL << BH_Lock);

	gfs2_log_lock(sdp);
	spin_lock(&sdp->sd_ail_lock);
	list_for_each_entry_safe_reverse(bd, tmp, head, bd_ail_gl_list) {
		if (nr_revokes == 0)
			break;
		bh = bd->bd_bh;
		if (bh->b_state & b_state) {
			if (fsync)
				continue;
			gfs2_ail_error(gl, bh);
		}
		gfs2_trans_add_revoke(sdp, bd);
		nr_revokes--;
	}
	GLOCK_BUG_ON(gl, !fsync && atomic_read(&gl->gl_ail_count));
	spin_unlock(&sdp->sd_ail_lock);
	gfs2_log_unlock(sdp);
}


static int gfs2_ail_empty_gl(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct gfs2_trans tr;
	int ret;

	memset(&tr, 0, sizeof(tr));
	INIT_LIST_HEAD(&tr.tr_buf);
	INIT_LIST_HEAD(&tr.tr_databuf);
	INIT_LIST_HEAD(&tr.tr_ail1_list);
	INIT_LIST_HEAD(&tr.tr_ail2_list);
	tr.tr_revokes = atomic_read(&gl->gl_ail_count);

	if (!tr.tr_revokes) {
		bool have_revokes;
		bool log_in_flight;

		/*
		 * We have nothing on the ail, but there could be revokes on
		 * the sdp revoke queue, in which case, we still want to flush
		 * the log and wait for it to finish.
		 *
		 * If the sdp revoke list is empty too, we might still have an
		 * io outstanding for writing revokes, so we should wait for
		 * it before returning.
		 *
		 * If none of these conditions are true, our revokes are all
		 * flushed and we can return.
		 */
		gfs2_log_lock(sdp);
		have_revokes = !list_empty(&sdp->sd_log_revokes);
		log_in_flight = atomic_read(&sdp->sd_log_in_flight);
		gfs2_log_unlock(sdp);
		if (have_revokes)
			goto flush;
		if (log_in_flight)
			log_flush_wait(sdp);
		return 0;
	}

GFS2: remove transaction glock
GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.
This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In its place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed on
freezing, and actions which need to be safe from freezing, like
recovery.
When a node wants to freeze the filesystem, it grabs this glock
exclusively. When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush. gfs2_log_flush() does all the work for flushing out
and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again. Since the filesystem is stuck in
gfs2_log_flush, no new transaction can start, and nothing can be written
to disk. Unfreezing the filesystem simply involves dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.
However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.
In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem. The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.
The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-05-02 10:26:55 +07:00
	/* A shortened, inline version of gfs2_trans_begin()
	 * tr->alloced is not set since the transaction structure is
	 * on the stack */
	tr.tr_reserved = 1 + gfs2_struct2blk(sdp, tr.tr_revokes);
	tr.tr_ip = _RET_IP_;
	ret = gfs2_log_reserve(sdp, tr.tr_reserved);
	if (ret < 0)
		return ret;
	WARN_ON_ONCE(current->journal_info);
	current->journal_info = &tr;

	__gfs2_ail_flush(gl, 0, tr.tr_revokes);

	gfs2_trans_end(sdp);
flush:
	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_AIL_EMPTY_GL);
	return 0;
}

void gfs2_ail_flush(struct gfs2_glock *gl, bool fsync)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	unsigned int revokes = atomic_read(&gl->gl_ail_count);
	unsigned int max_revokes = (sdp->sd_sb.sb_bsize - sizeof(struct gfs2_log_descriptor)) / sizeof(u64);
	int ret;

	if (!revokes)
		return;

	while (revokes > max_revokes)
		max_revokes += (sdp->sd_sb.sb_bsize - sizeof(struct gfs2_meta_header)) / sizeof(u64);

	ret = gfs2_trans_begin(sdp, 0, max_revokes);
	if (ret)
		return;
	__gfs2_ail_flush(gl, fsync, max_revokes);
	gfs2_trans_end(sdp);
	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_AIL_FLUSH);
}
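
The sizing loop in gfs2_ail_flush() can be read as plain arithmetic: the first log block carries a log descriptor followed by 64-bit revoke entries, and each further block carries a metadata header followed by more entries. The following is a minimal user-space sketch of that calculation only. The helper name `revoke_capacity` and its size parameters are hypothetical stand-ins for the kernel structures, not part of GFS2.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of GFS2): returns the smallest revoke
 * capacity, counted in whole log blocks, that can hold `revokes` entries.
 * desc_size and hdr_size stand in for sizeof(struct gfs2_log_descriptor)
 * and sizeof(struct gfs2_meta_header). Both are assumed to be at least
 * 8 bytes smaller than bsize, so every block adds at least one entry.
 */
static unsigned int revoke_capacity(unsigned int revokes, size_t bsize,
				    size_t desc_size, size_t hdr_size)
{
	/* First block: descriptor, then 64-bit block numbers. */
	unsigned int max_revokes = (bsize - desc_size) / sizeof(uint64_t);

	/* Each continuation block: metadata header, then block numbers. */
	while (revokes > max_revokes)
		max_revokes += (bsize - hdr_size) / sizeof(uint64_t);

	return max_revokes;
}
```

With an assumed 4096-byte block, a 72-byte descriptor and a 24-byte header (illustrative sizes), the first block holds 503 entries and each further block holds 509, so 600 revokes round up to a capacity of 1012.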

/**
 * rgrp_go_sync - sync out the metadata for this glock
 * @gl: the glock
 *
 * Called when demoting or unlocking an EX glock. We must flush
 * to disk all dirty buffers/pages relating to this glock, and must not
 * return to caller to demote/unlock the glock until I/O is complete.
 */

static int rgrp_go_sync(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct address_space *mapping = &sdp->sd_aspace;
gfs2: Rework how rgrp buffer_heads are managed
Before this patch, the rgrp code had a serious problem related to
how it managed buffer_heads for resource groups. The problem caused
file system corruption, especially in cases of journal replay.
When an rgrp glock was demoted to transfer ownership to a
different cluster node, do_xmote() first calls rgrp_go_sync and then
rgrp_go_inval, as expected. When it calls rgrp_go_sync, that called
gfs2_rgrp_brelse() that dropped the buffer_head reference count.
In most cases, the reference count went to zero, which is right.
However, there were other places where the buffers are handled
differently.
After rgrp_go_sync, do_xmote called rgrp_go_inval which called
gfs2_rgrp_brelse a second time, then rgrp_go_inval's call to
truncate_inode_pages_range would get rid of the pages in memory,
but only if the reference count drops to 0.
Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
to the buffer_heads in cases where the reference count was still 1.
Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
it failed the check for "if (bi->bi_bh)" and thus failed to call
brelse a second time. Because of that, the reference count on those
buffers sometimes failed to drop from 1 to 0. And that caused
function truncate_inode_pages_range to keep the pages in page cache
rather than freeing them.
The next time the rgrp glock was acquired, the metadata read of
the rgrp buffers re-used the pages in memory, which were now
wrong because they were likely modified by the other node who
acquired the glock in EX (which is why we demoted the glock).
This re-use of the page cache caused corruption because changes
made by the other nodes were never seen, so the bitmaps were
inaccurate.
For some reason, the problem became most apparent when journal
replay forced the replay of rgrps in memory, which caused newer
rgrp data to be overwritten by the older in-core pages.
A big part of the problem was that the rgrp buffers were released
in multiple places: The go_unlock function would release them when
the glock was released rather than when the glock is demoted,
which is clearly wrong because our intent was to cache them until
the glock is demoted from SH or EX.
This patch attempts to clean up the mess and make one consistent
and centralized mechanism for managing the rgrp buffer_heads by
implementing several changes:
1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
We don't want to release the buffers or zero the pointers when
syncing for the reasons stated above. It only makes sense to
release them when the glock is actually invalidated (go_inval).
And when we do, then we set the bh pointers to NULL.
2. The go_unlock function (which was only used for rgrps) is
eliminated, as we've talked about doing many times before.
The go_unlock function was called too early in the glock dq
process, and should not happen until the glock is invalidated.
3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
That will now happen automatically when the rgrp glocks are
demoted, and shouldn't happen any sooner or later than that.
Instead, function gfs2_clear_rgrpd has been modified to demote
the rgrp glocks, and therefore, free those pages, before the
remaining glocks are culled by gfs2_gl_hash_clear. This
prevents the gl_object from hanging around when the glocks are
culled.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14 00:50:30 +07:00
	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
	int error;

	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
		return 0;
	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);

	gfs2_log_flush(sdp, gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_RGRP_GO_SYNC);
	filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
	error = filemap_fdatawait_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
	WARN_ON_ONCE(error);
	mapping_set_error(mapping, error);
	if (!error)
		error = gfs2_ail_empty_gl(gl);
GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme
Here is an update of Bob's original rbtree patch which, in addition, also
resolves the rather strange ref counting that was being done relating to
the bitmap blocks.
Originally we had a dual system for journaling resource groups. The metadata
blocks were journaled and also the rgrp itself was added to a list. The reason
for adding the rgrp to the list in the journal was so that the "repolish
clones" code could be run to update the free space, and potentially send any
discard requests when the log was flushed. This was done by comparing the
"cloned" bitmap with what had been written back on disk during the transaction
commit.
Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
until the journal had been flushed. For that reason, there was a rather
complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
count on the buffers.
However, the journal maintains a reference count on the buffers anyway, since
they are being journaled as metadata buffers. So by moving the code which deals
with the post-journal accounting for bitmap blocks to the metadata journaling
code, we can entirely dispense with the rather strange buffer ref counting
scheme and also the requirement to journal the rgrps.
The net result of all this is that the ->sd_rindex_spin is left to do exactly
one job, and that is to look after the rbtree of rgrps.
This patch is designed to be a stepping stone towards using RCU for the rbtree
of resource groups, however the reduction in the number of uses of the
->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
anyway.
The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
be removed in future in favour of calling the functions directly where required
in the code. That will allow locking of resource groups without needing to
actually read them in - something that could be useful in speeding up statfs.
In the meantime, though, it is valid to dereference ->bi_bh only when the rgrp
is locked. This is basically the same rule as before, modulo the references not
being valid until the following journal flush.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
2011-08-31 15:53:19 +07:00

	spin_lock(&gl->gl_lockref.lock);
	rgd = gl->gl_object;
	if (rgd)
		gfs2_free_clones(rgd);
	spin_unlock(&gl->gl_lockref.lock);
	return error;
}

/**
 * rgrp_go_inval - invalidate the metadata for this glock
 * @gl: the glock
 * @flags:
 *
 * We never used LM_ST_DEFERRED with resource groups, so that we
 * should always see the metadata flag set here.
 *
 */

static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct address_space *mapping = &sdp->sd_aspace;
	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);

	if (rgd)
		gfs2_rgrp_brelse(rgd);

	WARN_ON_ONCE(!(flags & DIO_METADATA));
	truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);

	if (rgd)
		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
}
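
The commit message attached to rgrp_go_sync() describes a failure in which the release helper cleared its buffer pointer on the first call, so a second caller found NULL and never dropped the remaining reference. The toy model below only illustrates that failure mode in user space. Every name in it (`toy_buf`, `toy_slot`, `toy_release`, `toy_release_twice`) is hypothetical, and it is not the fix the patch applies; the patch instead stops releasing the buffers from the sync step.

```c
#include <stddef.h>

/* Toy stand-ins for a buffer_head and the bitmap slot that points at it. */
struct toy_buf {
	int refcount;
};

struct toy_slot {
	struct toy_buf *bh;
};

/* Drop one reference; optionally forget the pointer afterwards. */
static void toy_release(struct toy_slot *slot, int forget_pointer)
{
	if (!slot->bh)
		return;		/* pointer already lost: nothing is released */
	slot->bh->refcount--;
	if (forget_pointer)
		slot->bh = NULL;
}

/* Two callers release in turn, as the sync and invalidate steps once did. */
static int toy_release_twice(int initial_refs, int forget_pointer)
{
	struct toy_buf buf = { .refcount = initial_refs };
	struct toy_slot slot = { .bh = &buf };

	toy_release(&slot, forget_pointer);	/* first caller */
	toy_release(&slot, forget_pointer);	/* second caller */
	return buf.refcount;
}
```

Starting from two references, forgetting the pointer after the first release leaves the count stuck at 1, which in the real code kept the pages in the page cache; keeping the pointer lets the second release bring the count to 0.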

static struct gfs2_inode *gfs2_glock2inode(struct gfs2_glock *gl)
{
	struct gfs2_inode *ip;

	spin_lock(&gl->gl_lockref.lock);
	ip = gl->gl_object;
	if (ip)
		set_bit(GIF_GLOP_PENDING, &ip->i_flags);
	spin_unlock(&gl->gl_lockref.lock);
	return ip;
}

struct gfs2_rgrpd *gfs2_glock2rgrp(struct gfs2_glock *gl)
{
	struct gfs2_rgrpd *rgd;

	spin_lock(&gl->gl_lockref.lock);
	rgd = gl->gl_object;
	spin_unlock(&gl->gl_lockref.lock);

	return rgd;
}

static void gfs2_clear_glop_pending(struct gfs2_inode *ip)
{
	if (!ip)
		return;

	clear_bit_unlock(GIF_GLOP_PENDING, &ip->i_flags);
	wake_up_bit(&ip->i_flags, GIF_GLOP_PENDING);
}

/**
 * inode_go_sync - Sync the dirty data and/or metadata for an inode glock
 * @gl: the glock protecting the inode
 *
 */

static int inode_go_sync(struct gfs2_glock *gl)
{
	struct gfs2_inode *ip = gfs2_glock2inode(gl);
	int isreg = ip && S_ISREG(ip->i_inode.i_mode);
	struct address_space *metamapping = gfs2_glock2aspace(gl);
	int error = 0, ret;

	if (isreg) {
		if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags))
			unmap_shared_mapping_range(ip->i_inode.i_mapping, 0, 0);
		inode_dio_wait(&ip->i_inode);
	}
	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
		goto out;

	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);

	gfs2_log_flush(gl->gl_name.ln_sbd, gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_INODE_GO_SYNC);
	filemap_fdatawrite(metamapping);
	if (isreg) {
		struct address_space *mapping = ip->i_inode.i_mapping;
		filemap_fdatawrite(mapping);
		error = filemap_fdatawait(mapping);
		mapping_set_error(mapping, error);
	}
	ret = filemap_fdatawait(metamapping);
	mapping_set_error(metamapping, ret);
	if (!error)
		error = ret;
	gfs2_ail_empty_gl(gl);
	/*
	 * Writeback of the data mapping may cause the dirty flag to be set
	 * so we have to clear it again here.
	 */
	smp_mb__before_atomic();
	clear_bit(GLF_DIRTY, &gl->gl_flags);

out:
	gfs2_clear_glop_pending(ip);
	return error;
}
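
inode_go_sync() waits on the data mapping first and the metadata mapping second, and reports the earliest failure: the metadata result is used only when the data wait succeeded. A minimal sketch of that "first error wins" rule follows; the helper name `first_error` is hypothetical and does not exist in the kernel.

```c
/*
 * Hypothetical helper: keep the first non-zero error code, so that a
 * later result (failure or success) never overwrites an earlier failure.
 */
static int first_error(int error, int ret)
{
	if (!error)
		error = ret;
	return error;
}
```

Under this rule a data-mapping failure is returned even if the metadata wait later succeeds or fails with a different code.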

/**
 * inode_go_inval - prepare an inode glock to be released
 * @gl: the glock
 * @flags:
 *
 * Normally we invalidate everything, but if we are moving into
 * LM_ST_DEFERRED from LM_ST_SHARED or LM_ST_EXCLUSIVE then we
 * can keep hold of the metadata, since it won't have changed.
 *
 */

static void inode_go_inval(struct gfs2_glock *gl, int flags)
{
	struct gfs2_inode *ip = gfs2_glock2inode(gl);

	if (flags & DIO_METADATA) {
		struct address_space *mapping = gfs2_glock2aspace(gl);
		truncate_inode_pages(mapping, 0);
		if (ip) {
			set_bit(GIF_INVALID, &ip->i_flags);
			forget_all_cached_acls(&ip->i_inode);
			security_inode_invalidate_secctx(&ip->i_inode);
			gfs2_dir_hash_inval(ip);
		}
	}

	if (ip == GFS2_I(gl->gl_name.ln_sbd->sd_rindex)) {
		gfs2_log_flush(gl->gl_name.ln_sbd, NULL,
			       GFS2_LOG_HEAD_FLUSH_NORMAL |
			       GFS2_LFC_INODE_GO_INVAL);
		gl->gl_name.ln_sbd->sd_rindex_uptodate = 0;
	}
	if (ip && S_ISREG(ip->i_inode.i_mode))
		truncate_inode_pages(ip->i_inode.i_mapping, 0);

	gfs2_clear_glop_pending(ip);
}

/**
 * inode_go_demote_ok - Check to see if it's ok to unlock an inode glock
 * @gl: the glock
 *
 * Returns: 1 if it's ok
 */

static int inode_go_demote_ok(const struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;

	if (sdp->sd_jindex == gl->gl_object || sdp->sd_rindex == gl->gl_object)
		return 0;

	return 1;
}

static int gfs2_dinode_in(struct gfs2_inode *ip, const void *buf)
{
	const struct gfs2_dinode *str = buf;
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
The change was made with the help of the following Coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script can be a little shorter by combining different cases.
But, this version was sufficient for my use case.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
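The conversion above is needed because, on 32-bit architectures, `struct timespec` holds seconds in a 32-bit signed field, which cannot represent times after 03:14:07 UTC on 19 January 2038. A minimal user-space sketch of that limit follows. The names `ts32`, `ts64` and `ts64_to_ts32()` are hypothetical stand-ins chosen for illustration; they are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for a 32-bit and a 64-bit timestamp layout. */
struct ts32 { int32_t tv_sec; long tv_nsec; };
struct ts64 { int64_t tv_sec; long tv_nsec; };

/*
 * Narrow a 64-bit timestamp to the 32-bit layout.
 * Returns false when the seconds value does not fit, which is the
 * y2038 case: INT32_MAX seconds after the epoch is 2038-01-19 03:14:07 UTC.
 */
static bool ts64_to_ts32(struct ts64 in, struct ts32 *out)
{
	if (in.tv_sec > INT32_MAX || in.tv_sec < INT32_MIN)
		return false;
	out->tv_sec = (int32_t)in.tv_sec;
	out->tv_nsec = in.tv_nsec;
	return true;
}
```

Keeping the wide type everywhere in the VFS and narrowing only at the boundary to unconverted code (as the script does with `timespec64_to_timespec()`) limits the places where this truncation can occur.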
2018-05-09 09:36:02 +07:00
	struct timespec64 atime;
2011-05-09 19:49:59 +07:00
	u16 height, depth;

	if (unlikely(ip->i_no_addr != be64_to_cpu(str->di_num.no_addr)))
		goto corrupt;
	ip->i_no_formal_ino = be64_to_cpu(str->di_num.no_formal_ino);
	ip->i_inode.i_mode = be32_to_cpu(str->di_mode);
	ip->i_inode.i_rdev = 0;
	switch (ip->i_inode.i_mode & S_IFMT) {
	case S_IFBLK:
	case S_IFCHR:
		ip->i_inode.i_rdev = MKDEV(be32_to_cpu(str->di_major),
					   be32_to_cpu(str->di_minor));
		break;
2019-10-04 22:55:29 +07:00
	}
2011-05-09 19:49:59 +07:00

2013-02-01 13:08:10 +07:00
	i_uid_write(&ip->i_inode, be32_to_cpu(str->di_uid));
	i_gid_write(&ip->i_inode, be32_to_cpu(str->di_gid));
2017-08-01 23:33:17 +07:00
	set_nlink(&ip->i_inode, be32_to_cpu(str->di_nlink));
2011-05-09 19:49:59 +07:00
	i_size_write(&ip->i_inode, be64_to_cpu(str->di_size));
	gfs2_set_inode_blocks(&ip->i_inode, be64_to_cpu(str->di_blocks));
	atime.tv_sec = be64_to_cpu(str->di_atime);
	atime.tv_nsec = be32_to_cpu(str->di_atime_nsec);
2018-05-09 09:36:02 +07:00
	if (timespec64_compare(&ip->i_inode.i_atime, &atime) < 0)
2011-05-09 19:49:59 +07:00
		ip->i_inode.i_atime = atime;
	ip->i_inode.i_mtime.tv_sec = be64_to_cpu(str->di_mtime);
	ip->i_inode.i_mtime.tv_nsec = be32_to_cpu(str->di_mtime_nsec);
	ip->i_inode.i_ctime.tv_sec = be64_to_cpu(str->di_ctime);
	ip->i_inode.i_ctime.tv_nsec = be32_to_cpu(str->di_ctime_nsec);

	ip->i_goal = be64_to_cpu(str->di_goal_meta);
	ip->i_generation = be64_to_cpu(str->di_generation);

	ip->i_diskflags = be32_to_cpu(str->di_flags);
2011-06-16 20:06:55 +07:00
	ip->i_eattr = be64_to_cpu(str->di_eattr);
	/* i_diskflags and i_eattr must be set before gfs2_set_inode_flags() */
2011-05-09 19:49:59 +07:00
	gfs2_set_inode_flags(&ip->i_inode);
	height = be16_to_cpu(str->di_height);
	if (unlikely(height > GFS2_MAX_META_HEIGHT))
		goto corrupt;
	ip->i_height = (u8)height;

	depth = be16_to_cpu(str->di_depth);
	if (unlikely(depth > GFS2_DIR_MAX_DEPTH))
		goto corrupt;
	ip->i_depth = (u8)depth;
	ip->i_entries = be32_to_cpu(str->di_entries);

	if (S_ISREG(ip->i_inode.i_mode))
		gfs2_set_aops(&ip->i_inode);

	return 0;
corrupt:
	gfs2_consist_inode(ip);
	return -EIO;
}

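The atime handling in the function above differs from mtime and ctime: the on-disk value is decoded into a temporary and copied into the in-core inode only if it is strictly newer. The following user-space sketch illustrates that pattern under stated assumptions. `ts64`, `get_be32()`, `get_be64()`, `ts64_compare()` and `refresh_atime()` are hypothetical names for this example; only the sign convention of the comparison is meant to match the kernel's `timespec64_compare()`.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit timestamp. */
struct ts64 { int64_t tv_sec; long tv_nsec; };

/* Decode a big-endian 32-bit value from a byte buffer. */
static uint32_t get_be32(const unsigned char *p)
{
	uint32_t v = 0;
	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Decode a big-endian 64-bit value from a byte buffer. */
static uint64_t get_be64(const unsigned char *p)
{
	uint64_t v = 0;
	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Returns <0, 0 or >0 when a is earlier than, equal to or later than b. */
static int ts64_compare(const struct ts64 *a, const struct ts64 *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/* Take the on-disk atime only when it is strictly newer than the in-core one. */
static void refresh_atime(struct ts64 *incore,
			  const unsigned char sec_be[8],
			  const unsigned char nsec_be[4])
{
	struct ts64 disk;

	disk.tv_sec = (int64_t)get_be64(sec_be);
	disk.tv_nsec = (long)get_be32(nsec_be);
	if (ts64_compare(incore, &disk) < 0)
		*incore = disk;
}
```

A plausible reason for this asymmetry is that the in-core atime may have been updated locally without being written back yet, so an older on-disk value should not overwrite it; the source itself does not state the rationale.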
/**
 * gfs2_inode_refresh - Refresh the incore copy of the dinode
 * @ip: The GFS2 inode
 *
 * Returns: errno
 */

int gfs2_inode_refresh(struct gfs2_inode *ip)
{
	struct buffer_head *dibh;
	int error;

	error = gfs2_meta_inode_buffer(ip, &dibh);
	if (error)
		return error;

	error = gfs2_dinode_in(ip, dibh->b_data);
	brelse(dibh);
	clear_bit(GIF_INVALID, &ip->i_flags);

	return error;
}

2006-01-16 23:50:04 +07:00
/**
 * inode_go_lock - operation done after an inode lock is locked by a process
 * @gl: the glock
 * @flags:
 *
 * Returns: errno
 */

static int inode_go_lock(struct gfs2_holder *gh)
{
	struct gfs2_glock *gl = gh->gh_gl;
2015-03-16 23:52:05 +07:00
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
2006-02-28 05:23:27 +07:00
	struct gfs2_inode *ip = gl->gl_object;
2006-01-16 23:50:04 +07:00
	int error = 0;

2008-04-30 00:35:48 +07:00
	if (!ip || (gh->gh_flags & GL_SKIP))
2006-01-16 23:50:04 +07:00
		return 0;

2006-11-02 04:05:38 +07:00
	if (test_bit(GIF_INVALID, &ip->i_flags)) {
2006-01-16 23:50:04 +07:00
		error = gfs2_inode_refresh(ip);
		if (error)
			return error;
	}

2013-12-19 18:04:14 +07:00
	if (gh->gh_state != LM_ST_DEFERRED)
		inode_dio_wait(&ip->i_inode);

2008-11-04 17:05:22 +07:00
	if ((ip->i_diskflags & GFS2_DIF_TRUNC_IN_PROG) &&
2006-01-16 23:50:04 +07:00
	    (gl->gl_state == LM_ST_EXCLUSIVE) &&
2008-11-18 20:38:48 +07:00
	    (gh->gh_state == LM_ST_EXCLUSIVE)) {
		spin_lock(&sdp->sd_trunc_lock);
		if (list_empty(&ip->i_trunc_list))
2017-07-21 19:40:59 +07:00
			list_add(&ip->i_trunc_list, &sdp->sd_trunc_list);
2008-11-18 20:38:48 +07:00
		spin_unlock(&sdp->sd_trunc_lock);
		wake_up(&sdp->sd_quota_wait);
		return 1;
	}
2006-01-16 23:50:04 +07:00

	return error;
}

2008-05-21 23:03:22 +07:00
/**
 * inode_go_dump - print information about an inode
 * @seq: The iterator
 * @ip: the inode
2019-05-09 21:21:48 +07:00
 * @fs_id_buf: file system id (may be empty)
2008-05-21 23:03:22 +07:00
 *
 */

2019-05-09 21:21:48 +07:00
static void inode_go_dump(struct seq_file *seq, struct gfs2_glock *gl,
			  const char *fs_id_buf)
2008-05-21 23:03:22 +07:00
{
2018-04-19 02:05:01 +07:00
	struct gfs2_inode *ip = gl->gl_object;
	struct inode *inode = &ip->i_inode;
	unsigned long nrpages;

2008-05-21 23:03:22 +07:00
	if (ip == NULL)
2014-01-16 17:31:13 +07:00
		return;
2018-04-19 02:05:01 +07:00

	xa_lock_irq(&inode->i_data.i_pages);
	nrpages = inode->i_data.nrpages;
	xa_unlock_irq(&inode->i_data.i_pages);

2019-05-09 21:21:48 +07:00
	gfs2_print_dbg(seq, "%s I: n:%llu/%llu t:%u f:0x%02lx d:0x%08x s:%llu "
		       "p:%lu\n", fs_id_buf,
2008-05-21 23:03:22 +07:00
		  (unsigned long long)ip->i_no_formal_ino,
		  (unsigned long long)ip->i_no_addr,
2008-11-10 17:10:12 +07:00
		  IF2DT(ip->i_inode.i_mode), ip->i_flags,
		  (unsigned int)ip->i_diskflags,
2018-04-19 02:05:01 +07:00
		  (unsigned long long)i_size_read(inode), nrpages);
2008-05-21 23:03:22 +07:00
}

2006-01-16 23:50:04 +07:00
/**
GFS2: remove transaction glock
GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.
This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In its place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed on
freezing, and actions which need to be safe from freezing, like
recovery.
When a node wants to freeze the filesystem, it grabs this glock
exclusively. When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush. gfs2_log_flush() does all the work for flushing out
and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again. Since the filesystem is stuck in
gfs2_log_flush(), no new transaction can start, and nothing can be written
to disk. Unfreezing the filesystem simply involves dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.
However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.
In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem. The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.
The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
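The rule stated in the message above, that a special log flush happens when the freeze glock leaves the shared state for either unlocked or exclusive, can be written as a small predicate. This is only an illustrative sketch: `lock_state`, its enumerators and `needs_freeze_flush()` are hypothetical names, not the kernel's `LM_ST_*` constants or any GFS2 function.

```c
#include <stdbool.h>

/* Hypothetical lock states, loosely modelled on the description above. */
enum lock_state { ST_UNLOCKED, ST_SHARED, ST_EXCLUSIVE };

/*
 * True when a state change of the freeze lock should trigger the
 * special log flush: shared -> unlocked, or shared -> exclusive.
 */
static bool needs_freeze_flush(enum lock_state from, enum lock_state to)
{
	return from == ST_SHARED &&
	       (to == ST_UNLOCKED || to == ST_EXCLUSIVE);
}
```

In the real code the check that the glock is currently shared appears at the start of `freeze_go_sync()`, together with further conditions (journal live, filesystem not withdrawn) that this sketch leaves out.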
2014-05-02 10:26:55 +07:00
 * freeze_go_sync - promote/demote the freeze glock
2006-01-16 23:50:04 +07:00
 * @gl: the glock
 * @state: the requested state
 * @flags:
 *
 */

2019-11-14 03:09:28 +07:00
static int freeze_go_sync(struct gfs2_glock *gl)
2006-01-16 23:50:04 +07:00
{
2014-11-14 09:42:04 +07:00
	int error = 0;
2015-03-16 23:52:05 +07:00
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
2006-01-16 23:50:04 +07:00

gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.
This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.
The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.
Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.
The "live" glock is now used to signal when a withdraw
occurs. When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only just used to indicate a
withdraw has occurred.
Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.
Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.
We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.
Also, before this patch, if a process encountered an error and
decided to withdraw, if another process was already withdrawing,
the second withdraw would be silently ignored, which set it free
to unlock its glocks. That's correct behavior if the original
withdrawer encounters further errors down the road. But if
secondary waiters don't wait for the journal replay, unlocking
glocks will allow other nodes to use them, despite the fact that
the journal containing those blocks is being replayed. The
replay needs to finish before our glocks are released to other
nodes. IOW, secondary withdraws need to wait for the first
withdraw to finish.
For example, if an rgrp glock is unlocked by a process that didn't
wait for the first withdraw, a journal replay could introduce file
system corruption by replaying a rgrp block that has already been
granted to a different cluster node.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-29 02:23:45 +07:00
	if (gl->gl_state == LM_ST_SHARED && !gfs2_withdrawn(sdp) &&
2006-01-16 23:50:04 +07:00
	    test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags)) {
2014-11-14 09:42:04 +07:00
		atomic_set(&sdp->sd_freeze_state, SFS_STARTING_FREEZE);
		error = freeze_super(sdp->sd_vfs);
		if (error) {
2019-05-13 21:42:18 +07:00
			fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
				error);
2020-01-29 02:23:45 +07:00
			if (gfs2_withdrawn(sdp)) {
				atomic_set(&sdp->sd_freeze_state, SFS_UNFROZEN);
2019-11-14 03:09:28 +07:00
				return 0;
2020-01-29 02:23:45 +07:00
			}
2014-11-14 09:42:04 +07:00
			gfs2_assert_withdraw(sdp, 0);
		}
		queue_work(gfs2_freeze_wq, &sdp->sd_freeze_work);
2018-01-08 22:34:17 +07:00
		gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_FREEZE |
			       GFS2_LFC_FREEZE_GO_SYNC);
2006-01-16 23:50:04 +07:00
	}
2019-11-14 03:09:28 +07:00
	return 0;
2006-01-16 23:50:04 +07:00
}

/**
GFS2: remove transaction glock
GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.
This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In its place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed when
freezing, and by actions which need to be safe from freezing, like
recovery.
When a node wants to freeze the filesystem, it grabs this glock
exclusively. When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush. gfs2_log_flush() does all the work for flushing out
and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again. Since the filesystem is stuck in
gfs2_log_flush(), no new transaction can start, and nothing can be written
to disk. Unfreezing the filesystem simply involves dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.
However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.
In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem. The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.
The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-05-02 10:26:55 +07:00
|
|
|
* freeze_go_xmote_bh - After promoting/demoting the freeze glock
|
2006-01-16 23:50:04 +07:00
|
|
|
* @gl: the glock
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
GFS2: remove transaction glock
2014-05-02 10:26:55 +07:00
|
|
|
static int freeze_go_xmote_bh(struct gfs2_glock *gl, struct gfs2_holder *gh)
|
2006-01-16 23:50:04 +07:00
|
|
|
{
|
2015-03-16 23:52:05 +07:00
|
|
|
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
|
2006-06-15 02:32:57 +07:00
|
|
|
struct gfs2_inode *ip = GFS2_I(sdp->sd_jdesc->jd_inode);
|
2006-02-28 05:23:27 +07:00
|
|
|
struct gfs2_glock *j_gl = ip->i_gl;
|
2006-10-14 08:47:13 +07:00
|
|
|
struct gfs2_log_header_host head;
|
2006-01-16 23:50:04 +07:00
|
|
|
int error;
|
|
|
|
|
2008-05-21 23:03:22 +07:00
|
|
|
if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags)) {
|
2006-11-20 22:37:45 +07:00
|
|
|
j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);
|
2006-01-16 23:50:04 +07:00
|
|
|
|
2019-05-03 02:17:40 +07:00
|
|
|
error = gfs2_find_jhead(sdp->sd_jdesc, &head, false);
|
2006-01-16 23:50:04 +07:00
|
|
|
if (error)
|
|
|
|
gfs2_consist(sdp);
|
|
|
|
if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT))
|
|
|
|
gfs2_consist(sdp);
|
|
|
|
|
|
|
|
/* Initialize some head of the log stuff */
|
2019-11-14 21:52:15 +07:00
|
|
|
if (!gfs2_withdrawn(sdp)) {
|
2006-01-16 23:50:04 +07:00
|
|
|
sdp->sd_log_sequence = head.lh_sequence + 1;
|
|
|
|
gfs2_log_pointers_init(sdp, head.lh_blkno);
|
|
|
|
}
|
|
|
|
}
|
2008-05-21 23:03:22 +07:00
|
|
|
return 0;
|
2006-01-16 23:50:04 +07:00
|
|
|
}
|
|
|
|
|
2008-11-20 20:39:47 +07:00
|
|
|
/**
|
|
|
|
 * freeze_go_demote_ok
|
|
|
|
* @gl: the glock
|
|
|
|
*
|
|
|
|
* Always returns 0
|
|
|
|
*/
|
|
|
|
|
GFS2: remove transaction glock
2014-05-02 10:26:55 +07:00
|
|
|
static int freeze_go_demote_ok(const struct gfs2_glock *gl)
|
2008-11-20 20:39:47 +07:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-07-24 06:52:34 +07:00
|
|
|
/**
|
|
|
|
* iopen_go_callback - schedule the dcache entry for the inode to be deleted
|
|
|
|
* @gl: the glock
|
|
|
|
*
|
2015-10-29 22:58:09 +07:00
|
|
|
 * gl_lockref.lock is held while calling this
|
2009-07-24 06:52:34 +07:00
|
|
|
*/
|
2013-04-10 16:26:55 +07:00
|
|
|
static void iopen_go_callback(struct gfs2_glock *gl, bool remote)
|
2009-07-24 06:52:34 +07:00
|
|
|
{
|
2017-06-30 19:55:08 +07:00
|
|
|
struct gfs2_inode *ip = gl->gl_object;
|
2015-03-16 23:52:05 +07:00
|
|
|
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
|
2011-03-30 20:17:51 +07:00
|
|
|
|
2017-07-17 14:45:34 +07:00
|
|
|
if (!remote || sb_rdonly(sdp->sd_vfs))
|
2011-03-30 20:17:51 +07:00
|
|
|
return;
|
2009-07-24 06:52:34 +07:00
|
|
|
|
|
|
|
if (gl->gl_demote_state == LM_ST_UNLOCKED &&
|
2009-12-08 19:12:13 +07:00
|
|
|
gl->gl_state == LM_ST_SHARED && ip) {
|
2013-10-15 21:18:08 +07:00
|
|
|
gl->gl_lockref.count++;
|
2020-01-17 02:12:26 +07:00
|
|
|
if (!queue_delayed_work(gfs2_delete_workqueue,
|
|
|
|
&gl->gl_delete, 0))
|
2013-10-15 21:18:08 +07:00
|
|
|
gl->gl_lockref.count--;
|
2009-07-24 06:52:34 +07:00
|
|
|
}
|
|
|
|
}
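The reference handling in iopen_go_callback() can be illustrated with a small user-space sketch. It assumes, as the code above implies, that the queueing call reports false when the work item was already pending. The toy_* names and the struct layout are hypothetical and are not part of GFS2.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a glock: a reference count and a pending-work flag. */
struct toy_glock {
	int refcount;
	bool work_pending;
};

/* Models the queueing call: returns false if the work was already pending. */
static bool toy_queue_work(struct toy_glock *gl)
{
	if (gl->work_pending)
		return false;
	gl->work_pending = true;
	return true;
}

/* Take a reference on behalf of the work item; drop it again if nothing was queued. */
static void toy_schedule_delete(struct toy_glock *gl)
{
	gl->refcount++;
	if (!toy_queue_work(gl))
		gl->refcount--;
}
```

Calling toy_schedule_delete() twice on the same object raises the count by one only, because the second call finds the work already pending and gives its reference back.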
|
|
|
|
|
2020-01-17 02:12:26 +07:00
|
|
|
static int iopen_go_demote_ok(const struct gfs2_glock *gl)
|
|
|
|
{
|
|
|
|
return !gfs2_delete_work_queued(gl);
|
|
|
|
}
|
|
|
|
|
gfs2: Force withdraw to replay journals and wait for it to finish
2020-01-29 02:23:45 +07:00
|
|
|
/**
|
|
|
|
* inode_go_free - wake up anyone waiting for dlm's unlock ast to free it
|
|
|
|
* @gl: glock being freed
|
|
|
|
*
|
|
|
|
* For now, this is only used for the journal inode glock. In withdraw
|
|
|
|
* situations, we need to wait for the glock to be freed so that we know
|
|
|
|
* other nodes may proceed with recovery / journal replay.
|
|
|
|
*/
|
|
|
|
static void inode_go_free(struct gfs2_glock *gl)
|
|
|
|
{
|
|
|
|
/* Note that we cannot reference gl_object because it's already set
|
|
|
|
* to NULL by this point in its lifecycle. */
|
|
|
|
if (!test_bit(GLF_FREEING, &gl->gl_flags))
|
|
|
|
return;
|
|
|
|
clear_bit_unlock(GLF_FREEING, &gl->gl_flags);
|
|
|
|
wake_up_bit(&gl->gl_flags, GLF_FREEING);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* nondisk_go_callback - used to signal when a node did a withdraw
|
|
|
|
* @gl: the nondisk glock
|
|
|
|
* @remote: true if this came from a different cluster node
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
static void nondisk_go_callback(struct gfs2_glock *gl, bool remote)
|
|
|
|
{
|
|
|
|
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
|
|
|
|
|
|
|
|
/* Ignore the callback unless it's from another node, and it's the
|
|
|
|
live lock. */
|
|
|
|
if (!remote || gl->gl_name.ln_number != GFS2_LIVE_LOCK)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* First order of business is to cancel the demote request. We don't
|
|
|
|
* really want to demote a nondisk glock. At best it's just to inform
|
|
|
|
* us of another node's withdraw. We'll keep it in SH mode. */
|
|
|
|
clear_bit(GLF_DEMOTE, &gl->gl_flags);
|
|
|
|
clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags);
|
|
|
|
|
|
|
|
/* Ignore the unlock if we're withdrawn, unmounting, or in recovery. */
|
|
|
|
if (test_bit(SDF_NORECOVERY, &sdp->sd_flags) ||
|
|
|
|
test_bit(SDF_WITHDRAWN, &sdp->sd_flags) ||
|
|
|
|
test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* We only care when a node wants us to unlock, because that means
|
|
|
|
* they want a journal recovered. */
|
|
|
|
if (gl->gl_demote_state != LM_ST_UNLOCKED)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (sdp->sd_args.ar_spectator) {
|
|
|
|
fs_warn(sdp, "Spectator node cannot recover journals.\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
fs_warn(sdp, "Some node has withdrawn; checking for recovery.\n");
|
|
|
|
set_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
|
|
|
|
/*
|
|
|
|
* We can't call remote_withdraw directly here or gfs2_recover_journal
|
|
|
|
* because this is called from the glock unlock function and the
|
|
|
|
* remote_withdraw needs to enqueue and dequeue the same "live" glock
|
|
|
|
* we were called from. So we queue it to the control work queue in
|
|
|
|
* lock_dlm.
|
|
|
|
*/
|
|
|
|
queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
|
|
|
|
}
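The chain of early returns in nondisk_go_callback() amounts to a single predicate. A minimal user-space sketch follows, under the assumption that each SDF_* bit and the spectator mount option can be modelled as a plain boolean. Every toy_* name is hypothetical, and the sketch leaves out clearing the demote flags and queueing the control work.

```c
#include <stdbool.h>

/* Hypothetical lock states, mirroring only the distinction the callback needs. */
enum toy_state { TOY_ST_UNLOCKED, TOY_ST_SHARED, TOY_ST_EXCLUSIVE };

/* Hypothetical per-mount flags, one boolean per condition tested above. */
struct toy_node {
	bool no_recovery;      /* models SDF_NORECOVERY */
	bool withdrawn;        /* models SDF_WITHDRAWN */
	bool remote_withdraw;  /* models SDF_REMOTE_WITHDRAW */
	bool spectator;        /* models the spectator mount option */
};

/*
 * Returns true when the callback should go on to schedule a recovery check,
 * and records the remote withdraw in that case.
 */
static bool toy_should_check_recovery(struct toy_node *n, bool remote,
				      bool is_live_lock,
				      enum toy_state demote_state)
{
	if (!remote || !is_live_lock)
		return false;
	if (n->no_recovery || n->withdrawn || n->remote_withdraw)
		return false;
	if (demote_state != TOY_ST_UNLOCKED)
		return false;
	if (n->spectator)
		return false;
	n->remote_withdraw = true;
	return true;
}
```

Because the function sets remote_withdraw before returning true, a second identical callback is ignored, which matches the SDF_REMOTE_WITHDRAW test in the original.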
|
|
|
|
|
2006-08-30 20:30:00 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_meta_glops = {
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_META,
|
gfs2: Allow some glocks to be used during withdraw
We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
when we're withdrawn. For example, to maintain metadata integrity, we should
disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
iopen or the transaction glocks may be safely used because none of their
metadata goes through the journal. So in general, we should disallow all
glocks with an address space, and allow all the others. One exception is:
we need to allow our active journal to be demoted so others may recover it.
Allowing glocks after withdraw gives us the ability to take appropriate
action (in a following patch) to have our journal properly replayed by
another node rather than just abandoning the current transactions and
pretending nothing bad happened, leaving the other nodes free to modify
the blocks we had in our journal, which may result in file system
corruption.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2019-06-14 01:28:45 +07:00
|
|
|
.go_flags = GLOF_NONDISK,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
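The rule stated in the commit message for GLOF_NONDISK (allow glocks whose metadata does not go through the journal, block the rest, and make an exception for the node's own journal glock) could be expressed roughly as below. This is a sketch of the stated policy only, not the kernel's implementation. The flag values, the function name, and the is_own_journal parameter are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical flag values; only the names echo the GLOF_* flags used above. */
#define TOY_GLOF_ASPACE  0x1u
#define TOY_GLOF_NONDISK 0x2u

/*
 * Returns true if a glock with the given flags should be refused
 * once the node has withdrawn.
 */
static bool toy_blocked_by_withdraw(bool withdrawn, unsigned int go_flags,
				    bool is_own_journal)
{
	if (!withdrawn)
		return false;   /* normal operation: nothing is blocked */
	if (go_flags & TOY_GLOF_NONDISK)
		return false;   /* no journaled metadata behind this glock */
	if (is_own_journal)
		return false;   /* must stay demotable so others can recover it */
	return true;
}
```

Keying the test on the NONDISK flag rather than on ASPACE means a glock type that carries neither flag is still blocked after a withdraw.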
|
|
|
|
|
2006-08-30 20:30:00 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_inode_glops = {
|
2012-10-25 01:41:05 +07:00
|
|
|
.go_sync = inode_go_sync,
|
2006-01-16 23:50:04 +07:00
|
|
|
.go_inval = inode_go_inval,
|
|
|
|
.go_demote_ok = inode_go_demote_ok,
|
|
|
|
.go_lock = inode_go_lock,
|
2008-05-21 23:03:22 +07:00
|
|
|
.go_dump = inode_go_dump,
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_INODE,
|
2020-01-14 03:21:49 +07:00
|
|
|
.go_flags = GLOF_ASPACE | GLOF_LRU | GLOF_LVB,
|
gfs2: Force withdraw to replay journals and wait for it to finish
2020-01-29 02:23:45 +07:00
|
|
|
.go_free = inode_go_free,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
|
|
|
|
|
2006-08-30 20:30:00 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_rgrp_glops = {
|
2012-10-25 01:41:05 +07:00
|
|
|
.go_sync = rgrp_go_sync,
|
2009-03-09 16:03:51 +07:00
|
|
|
.go_inval = rgrp_go_inval,
|
GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme
Here is an update of Bob's original rbtree patch which, in addition, also
resolves the rather strange ref counting that was being done relating to
the bitmap blocks.
Originally we had a dual system for journaling resource groups. The metadata
blocks were journaled and also the rgrp itself was added to a list. The reason
for adding the rgrp to the list in the journal was so that the "repolish
clones" code could be run to update the free space, and potentially send any
discard requests when the log was flushed. This was done by comparing the
"cloned" bitmap with what had been written back on disk during the transaction
commit.
Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
until the journal had been flushed. For that reason, there was a rather
complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
count on the buffers.
However, the journal maintains a reference count on the buffers anyway, since
they are being journaled as metadata buffers. So by moving the code which deals
with the post-journal accounting for bitmap blocks to the metadata journaling
code, we can entirely dispense with the rather strange buffer ref counting
scheme and also the requirement to journal the rgrps.
The net result of all this is that the ->sd_rindex_spin is left to do exactly
one job, and that is to look after the rbtree of rgrps.
This patch is designed to be a stepping stone towards using RCU for the rbtree
of resource groups, however the reduction in the number of uses of the
->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
anyway.
The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
be removed in the future in favour of calling the functions directly where required
in the code. That will allow locking of resource groups without needing to
actually read them in - something that could be useful in speeding up statfs.
In the meantime, though, it is valid to dereference ->bi_bh only when the rgrp
is locked. This is basically the same rule as before, modulo the references not
being valid until the following journal flush.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
2011-08-31 15:53:19 +07:00
|
|
|
.go_lock = gfs2_rgrp_go_lock,
|
2009-05-20 16:48:47 +07:00
|
|
|
.go_dump = gfs2_rgrp_dump,
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_RGRP,
|
2013-12-06 23:19:54 +07:00
|
|
|
.go_flags = GLOF_LVB,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
|
|
|
|
|
GFS2: remove transaction glock
2014-05-02 10:26:55 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_freeze_glops = {
|
|
|
|
.go_sync = freeze_go_sync,
|
|
|
|
.go_xmote_bh = freeze_go_xmote_bh,
|
|
|
|
.go_demote_ok = freeze_go_demote_ok,
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_NONDISK,
|
gfs2: Allow some glocks to be used during withdraw
2019-06-14 01:28:45 +07:00
|
|
|
.go_flags = GLOF_NONDISK,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
|
|
|
|
|
2006-08-30 20:30:00 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_iopen_glops = {
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_IOPEN,
|
2009-07-24 06:52:34 +07:00
|
|
|
.go_callback = iopen_go_callback,
|
2020-01-17 02:12:26 +07:00
|
|
|
.go_demote_ok = iopen_go_demote_ok,
|
gfs2: Allow some glocks to be used during withdraw
2019-06-14 01:28:45 +07:00
|
|
|
.go_flags = GLOF_LRU | GLOF_NONDISK,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
|
|
|
|
|
2006-08-30 20:30:00 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_flock_glops = {
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_FLOCK,
|
gfs2: Allow some glocks to be used during withdraw
2019-06-14 01:28:45 +07:00
|
|
|
.go_flags = GLOF_LRU | GLOF_NONDISK,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
|
|
|
|
|
2006-08-30 20:30:00 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_nondisk_glops = {
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_NONDISK,
|
gfs2: Allow some glocks to be used during withdraw
2019-06-14 01:28:45 +07:00
|
|
|
.go_flags = GLOF_NONDISK,
|
gfs2: Force withdraw to replay journals and wait for it to finish
2020-01-29 02:23:45 +07:00
|
|
|
.go_callback = nondisk_go_callback,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
|
|
|
|
|
2006-08-30 20:30:00 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_quota_glops = {
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_QUOTA,
|
gfs2: Allow some glocks to be used during withdraw
2019-06-14 01:28:45 +07:00
|
|
|
.go_flags = GLOF_LVB | GLOF_LRU | GLOF_NONDISK,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
|
|
|
|
|
2006-08-30 20:30:00 +07:00
|
|
|
const struct gfs2_glock_operations gfs2_journal_glops = {
|
2006-09-05 21:53:09 +07:00
|
|
|
.go_type = LM_TYPE_JOURNAL,
|
gfs2: Allow some glocks to be used during withdraw
2019-06-14 01:28:45 +07:00
|
|
|
.go_flags = GLOF_NONDISK,
|
2006-01-16 23:50:04 +07:00
|
|
|
};
|
|
|
|
|
GFS2: Add a "demote a glock" interface to sysfs
This adds a sysfs file called demote_rq to GFS2's
per-filesystem directory. It's possible to use this
file to demote arbitrary glocks in exactly the same
way as if a request had come in from a remote node.
This is intended for testing issues relating to caching
of data under glocks. Despite that, the interface is
generic enough to send requests to any type of glock,
but be careful, as it's not always safe to send an
arbitrary message to an arbitrary glock. For that reason
and to prevent DoS, this interface is restricted to root
only.
The messages look like this:
<type>:<glocknumber> <mode>
Example:
echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq
Which means "please demote inode glock (type 2) number 13324 so that
I can get an EX (exclusive) lock". The lock modes are those which
would normally be sent by a remote node in its callback, so if you
want to unlock a glock, use EX; to demote to shared, use SH or PR
(depending on whether you like GFS2 or DLM lock modes better!).
If the glock doesn't exist, you'll get -ENOENT returned. If the
arguments don't make sense, you'll get -EINVAL returned.
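As a rough illustration of the message format described above, the following userspace C sketch parses a "<type>:<glocknumber> <mode>" string. It is not the kernel's sysfs handler: the demo_ names, the two-value mode enum, and the tolerance for a single trailing newline are assumptions made for this example. Only the mode strings named in the message (EX, SH, PR) are recognised, and malformed input yields -EINVAL as the text describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for this sketch; not the kernel's definitions. */
enum demo_mode { DEMO_MODE_SHARED, DEMO_MODE_EXCLUSIVE };

struct demo_demote_rq {
	unsigned int type;		/* glock type, e.g. 2 for an inode glock */
	unsigned long long number;	/* glock number */
	enum demo_mode mode;		/* mode the requester wants to obtain */
};

/* Parse "<type>:<glocknumber> <mode>". Returns 0 on success, -EINVAL otherwise. */
static int demo_parse_demote_rq(const char *buf, struct demo_demote_rq *out)
{
	unsigned int type;
	unsigned long long number;
	char mode[8];
	int consumed = 0;

	assert(buf != NULL && out != NULL);

	/* All three fields must be present; %n records how much was read. */
	if (sscanf(buf, "%u:%llu %7s%n", &type, &number, mode, &consumed) != 3)
		return -EINVAL;

	/* Assumption: allow one trailing newline, reject anything else. */
	if (buf[consumed] != '\0' && strcmp(buf + consumed, "\n") != 0)
		return -EINVAL;

	if (strcmp(mode, "EX") == 0)
		out->mode = DEMO_MODE_EXCLUSIVE;
	else if (strcmp(mode, "SH") == 0 || strcmp(mode, "PR") == 0)
		out->mode = DEMO_MODE_SHARED;	/* GFS2 and DLM spellings */
	else
		return -EINVAL;

	out->type = type;
	out->number = number;
	return 0;
}
```

The output structure is only written once every field has been validated, so a caller never sees a half-filled request after an -EINVAL return.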
The plan is that this interface will be used in combination with
the blktrace patch which I recently posted for comments although
it is, of course, still useful in its own right.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-02-12 20:31:58 +07:00
|
|
|
const struct gfs2_glock_operations *gfs2_glops_list[] = {
|
|
|
|
[LM_TYPE_META] = &gfs2_meta_glops,
|
|
|
|
[LM_TYPE_INODE] = &gfs2_inode_glops,
|
|
|
|
[LM_TYPE_RGRP] = &gfs2_rgrp_glops,
|
|
|
|
[LM_TYPE_IOPEN] = &gfs2_iopen_glops,
|
|
|
|
[LM_TYPE_FLOCK] = &gfs2_flock_glops,
|
|
|
|
[LM_TYPE_NONDISK] = &gfs2_nondisk_glops,
|
|
|
|
[LM_TYPE_QUOTA] = &gfs2_quota_glops,
|
|
|
|
[LM_TYPE_JOURNAL] = &gfs2_journal_glops,
|
|
|
|
};
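The table above is indexed by lock type through C designated initializers, so any index without an entry is implicitly NULL. The sketch below shows how a lookup over such a table might guard against out-of-range and unpopulated types. It is a standalone illustration: the demo_ names and the numeric type values (other than inode being type 2, which the demote_rq commit message states) are assumptions, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct gfs2_glock_operations. */
struct demo_glops {
	const char *name;
};

/* Assumed type numbers for the sketch; only "inode is type 2" comes from the text. */
enum {
	DEMO_TYPE_INODE = 2,
	DEMO_TYPE_RGRP = 3,
	DEMO_TYPE_QUOTA = 8,
};

static const struct demo_glops demo_inode_glops = { .name = "inode" };
static const struct demo_glops demo_rgrp_glops = { .name = "rgrp" };
static const struct demo_glops demo_quota_glops = { .name = "quota" };

/* Designated initializers: indices not listed here are zero-filled (NULL). */
static const struct demo_glops *const demo_glops_list[] = {
	[DEMO_TYPE_INODE] = &demo_inode_glops,
	[DEMO_TYPE_RGRP] = &demo_rgrp_glops,
	[DEMO_TYPE_QUOTA] = &demo_quota_glops,
};

/* Return the operations for a type, or NULL if the type is unknown. */
static const struct demo_glops *demo_glops_lookup(unsigned int type)
{
	size_t count = sizeof(demo_glops_list) / sizeof(demo_glops_list[0]);

	if (type >= count)
		return NULL;		/* beyond the highest populated index */
	return demo_glops_list[type];	/* may still be NULL for a gap */
}
```

A caller that receives a type number from user input would treat a NULL result as an invalid request rather than dereferencing it.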
|
|
|
|
|