In the case of v4.0 clients, we may call into the "create" client
tracking operation multiple times (once for each openowner). Upcalling
for each one of those is wasteful and slow however. We can skip doing
further "create" operations after the first one if we know that one has
already been done.
v4.1+ clients generally only call into this function once (on
RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the
STABLE bit is set. Doing so would make it impossible for nfsdcltrack to
lift the grace period early since the timestamp has a different meaning
in the case where the client is expected to issue a RECLAIM_COMPLETE.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag,
which basically results in an upcall every time we call into the client
tracking ops.
Change it to set this bit on a successful "check" or "create" request,
and clear it on a "remove" request. Also, check to see if that bit is
set before upcalling on a "check" or "remove" request, and skip
upcalling appropriately, depending on its state.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
In a later patch, we want to add a flag that will allow us to reduce the
need for upcalls. In order to handle that correctly, we'll need to
ensure that racing upcalls for the same client can't occur. In practice
it should be rare for this to occur with a well-behaved client, but it
is possible.
Convert one of the bits in the cl_flags field to be an upcall bitlock,
and use it to ensure that upcalls for the same client are serialized.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
In order to support lifting the grace period early, we must tell
nfsdcltrack what sort of client the "create" upcall is for. We can't
reliably tell if a v4.0 client has completed reclaiming, so we can only
lift the grace period once all the v4.1+ clients have issued a
RECLAIM_COMPLETE and if there are no v4.0 clients.
Also, in order to lift the grace period, we have to tell userland when
the grace period started so that it can tell whether a RECLAIM_COMPLETE
has been issued for each client since then.
Since this is all optional info, we pass it along in environment
variables to the "init" and "create" upcalls. By doing this, we don't
need to revise the upcall format. The UMH upcall can simply make use of
this info if it happens to be present. If it's not then it can just
avoid lifting the grace period early.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Allow a privileged userland process to end the v4 grace period early.
Writing "Y", "y", or "1" to the file will cause the v4 grace period to
be lifted. The basic idea with this will be to allow the userland
client tracking program to lift the grace period once it knows that no
more clients will be reclaiming state.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Add a new procfile that will allow a (privileged) userland process to
end the NLM grace period early. The basic idea here will be to have
sm-notify write to this file, if it sent out no NOTIFY requests when
it runs. In that situation, we can generally expect that there will be
no reclaim requests so the grace period can be lifted early.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
As stated in RFC 5661, section 18.51.3:
Once a RECLAIM_COMPLETE is done, there can be no further reclaim
operations for locks whose scope is defined as having completed
recovery. Once the client sends RECLAIM_COMPLETE, the server will
not allow the client to do subsequent reclaims of locking state for
that scope and, if these are attempted, will return
NFS4ERR_NO_GRACE.
Ensure that we enforce that requirement.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Currently, all of the grace period handling is part of lockd. Eventually
though we'd like to be able to build v4-only servers, at which point
we'll need to put all of this elsewhere.
Move the code itself into fs/nfs_common and have it build a grace.ko
module. Then, rejigger the Kconfig options so that both nfsd and lockd
enable it automatically.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
When a ranged fsync finishes if there are still extent maps in the modified
list, still set the inode's logged_trans and last_log_commit. This is important
in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
inode ref gets deleted from the log and the respective dentries in its parent
are deleted too from the log (if the parent directory was fsync'ed in the same
transaction).
Instead make btrfs_inode_in_log() return false if the list of modified extent
maps isn't empty.
This is an incremental on top of the v4 version of the patch:
"Btrfs: fix fsync data loss after a ranged fsync"
which was added to its v5, but didn't make it on time.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
on large sparse files, a negative dentry hashing fix, a fix for
flock, and a bug fix relating to d_splice_alias usage. There are
also (patches 1 and 5) a couple of updates which are less
critical, but small and low risk.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJUGAT8AAoJEMrg3m4a/8jSevIQAJh0sfUAzDZ2p1etZw2OR2bm
0ritvFVkOB4CZ97TS7DvItoqcke+e2ehT8ykIzCNZhJotld7jsitzuQm0ugsMvG2
ltrUil2eqyeBFt1dJk9m366+akbdocVhSpVL/6hLRJI9GTbfWr26jHwUTIQisq5u
Pdr42lLwwAPrj9D7GQRA1jSZLbA36teRGUuV+3CFxAWVgqG0kOmq7iGd27LZZ/hP
m4hjPvdczTfhM3kjNvohicLXQiXhsJdlOZsZKfMTSGmJ2WsMRKoK2PZx2eRmDEEe
ypfkboyfcHCJlbIx+x0qPRheYMC8a2TX3JMb3l6eaV0mZoeExB2kPHeBlfu1jddp
XwLvbs/fbfC96Q6WPKM3JdBDY9DBf5t30Is0VxNJeMH2ERIubaLrOACeD4P7V5mX
PIc3/bcDtNMOirgdd2NGxGR0TxmyhKs7pbtWC+e6Wud04qgsK1JtU8cNro0j8MRE
Rfay0RY+8OjHfDn+shQYl4mh8MF2u3JF3/ThcZjtDfAB0AIhF1Bs5Ihl12B45ov1
7ah0J+8Yk7gwH++KDFNv6XGNzjpjkbBEr9FGLLN3DNDysk4ibEYP3NwEf0Vnme7A
EhYv0IFgUbgdLrU3owA7rRh38A8W23cr78qRSM66/TEGYqBi9+4/DwLE4EIorGYA
mKSpLXhcNtCIrQGNCz3Q
=5NIi
-----END PGP SIGNATURE-----
Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes
Pull gfs2 fixes from Steven Whitehouse:
"Here are a number of small fixes for GFS2.
There is a fix for FIEMAP on large sparse files, a negative dentry
hashing fix, a fix for flock, and a bug fix relating to d_splice_alias
usage.
There are also (patches 1 and 5) a couple of updates which are less
critical, but small and low risk"
* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
GFS2: fix d_splice_alias() misuses
GFS2: Don't use MAXQUOTAS value
GFS2: Hash the negative dentry during inode lookup
GFS2: Request demote when a "try" flock fails
GFS2: Change maxlen variables to size_t
GFS2: fs/gfs2/super.c: replace seq_printf by seq_puts
Commit d6bb3e9075 ("vfs: simplify and shrink stack frame of
link_path_walk()") introduced build problems with GCC versions older
than 4.6 due to the initialisation of a member of an anonymous union in
struct qstr without enclosing braces.
This hits GCC bug 10676 [1] (which was fixed in GCC 4.6 by [2]), and
causes the following build error:
fs/namei.c: In function 'link_path_walk':
fs/namei.c:1778: error: unknown field 'hash_len' specified in initializer
This is worked around by adding explicit braces.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
[2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=159206
Fixes: d6bb3e9075 (vfs: simplify and shrink stack frame of link_path_walk())
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-metag@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the mfsymlinks file size has changed (e.g. the file no longer
represents an emulated symlink) we were not returning an error properly.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
If the inode is same and its data index are needed to truncate, we can fall into
double lock for its inode page via get_dnode_of_data.
Error case is like this.
1. write data 1, 2, 3, 4, 5 in inode #4.
2. write data 100, 102, 103, 104, 105 in dnode #6 of inode #4.
3. sync
4. update data 100->106 in dnode #6.
5. fsync inode #4.
6. power-cut
-> Then,
1. go back to #3's checkpoint
2. in do_recover_data, get_dnode_of_data() gets inode #4.
3. detect 100->106 in dnode #6.
4. check_index_in_prev_nodes tries to truncate 100 in dnode #6.
5. to trigger truncate_hole, get_dnode_of_data should grab inode #4.
6. detect *kernel hang*
This patch should resolve that bug.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The nm_i->fcnt checking is executed before spin_lock, so if another
thread delete the last free_nid from the list, the wrong nid may be
gotten. So fix the race condition by moving the nm_i->fnct checking
into spin_lock.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, if there is no free nid in nm_i->free_nid_list, 0 may be saved
into next_free_nid of checkpoint, this may cause useless scanning for
next mount. nm_i->next_scan_nid should be a better default value than
0.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user wrote F2FS_IPU_FSYNC:4 in /sys/fs/f2fs/ipu_policy, f2fs_sync_file
only starts to try in-place-updates.
And, if the number of dirty pages is over /sys/fs/f2fs/min_fsync_blocks, it
keeps out-of-order manner. Otherwise, it triggers in-place-updates.
This may be used by storage showing very high random write performance.
For example, it can be used when,
Seq. writes (Data) + wait + Seq. writes (Node)
is pretty much slower than,
Rand. writes (Data)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously f2fs only counts dirty dentry pages, but there is no reason not to
expand the scope.
This patch changes the names on the management of dirty pages and to count
dirty pages in each inode info as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
cifs provides two dummy functions 'sess_auth_lanman' and
'sess_auth_kerberos' for the case in which the respective
features are not defined. However, the caller is also under
an #ifdef, so we just get warnings about unused code:
fs/cifs/sess.c:1109:1: warning: 'sess_auth_kerberos' defined but not used [-Wunused-function]
sess_auth_kerberos(struct sess_data *sess_data)
Removing the dead functions gets rid of the warnings without
any downsides that I can see.
(Yalin Wang reported the identical problem and fix so added him)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Signed-off-by: Steve French <smfrench@gmail.com>
This reverts commit 52a3624444.
Causes rmmod to fail for at least 7 seconds after unmount which
makes automated testing a little harder when reloading cifs.ko
between test runs.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
CC: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Commit 9226b5b440 ("vfs: avoid non-forwarding large load after small
store in path lookup") made link_path_walk() always access the
"hash_len" field as a single 64-bit entity, in order to avoid mixed size
accesses to the members.
However, what I didn't notice was that that effectively means that the
whole "struct qstr this" is now basically redundant. We already
explicitly track the "const char *name", and if we just use "u64
hash_len" instead of "long len", there is nothing else left of the
"struct qstr".
We do end up wanting the "struct qstr" if we have a filesystem with a
"d_hash()" function, but that's a rare case, and we might as well then
just squirrell away the name and hash_len at that point.
End result: fewer live variables in the loop, a smaller stack frame, and
better code generation. And we don't need to pass in pointers variables
to helper functions any more, because the return value contains all the
relevant information. So this removes more lines than it adds, and the
source code is clearer too.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We were not checking for symlink support properly for SMB2/SMB3
mounts so could oops when mounted with mfsymlinks when try
to create symlink when mfsymlinks on smb2/smb3 mounts
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org> # 3.14+
CC: Sachin Prabhu <sprabhu@redhat.com>
Pull vfs fixes from Al Viro:
"double iput() on failure exit in lustre, racy removal of spliced
dentries from ->s_anon in __d_materialise_dentry() plus a bunch of
assorted RCU pathwalk fixes"
The RCU pathwalk fixes end up fixing a couple of cases where we
incorrectly dropped out of RCU walking, due to incorrect initialization
and testing of the sequence locks in some corner cases. Since dropping
out of RCU walk mode forces the slow locked accesses, those corner cases
slowed down quite dramatically.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
be careful with nd->inode in path_init() and follow_dotdot_rcu()
don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()
fix bogus read_seqretry() checks introduced in b37199e
move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)
[fix] lustre: d_make_root() does iput() on dentry allocation failure
The performance regression that Josef Bacik reported in the pathname
lookup (see commit 99d263d4c5 "vfs: fix bad hashing of dentries") made
me look at performance stability of the dcache code, just to verify that
the problem was actually fixed. That turned up a few other problems in
this area.
There are a few cases where we exit RCU lookup mode and go to the slow
serializing case when we shouldn't, Al has fixed those and they'll come
in with the next VFS pull.
But my performance verification also shows that link_path_walk() turns
out to have a very unfortunate 32-bit store of the length and hash of
the name we look up, followed by a 64-bit read of the combined hash_len
field. That screws up the processor store to load forwarding, causing
an unnecessary hickup in this critical routine.
It's caused by the ugly calling convention for the "hash_name()"
function, and easily fixed by just making hash_name() fill in the whole
'struct qstr' rather than passing it a pointer to just the hash value.
With that, the profile for this function looks much smoother.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xfstest generic/258 sets the time on a file to a negative value
(before 1970) which fails since do_div can not handle negative
numbers. In addition 'normal' division of 64 bit values does
not build on 32 bit arch so have to workaround this by special
casing negative values in cifs_NTtimeToUnix
Samba server also has a bug with this (see samba bugzilla 7771)
but it works to Windows server.
Signed-off-by: Steve French <smfrench@gmail.com>
in the former we simply check if dentry is still valid after picking
its ->d_inode; in the latter we fetch ->d_inode in the same places
where we fetch dentry and its ->d_seq, under the same checks.
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
return the value instead, and have path_init() do the assignment. Broken by
"vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
which was Cc-stable with 2.6.38+ as destination. This one should go where
it went.
To avoid dummy value returned in case when root is already set (it would do
no harm, actually, since the only caller that doesn't ignore the return value
is guaranteed to have nd->root *not* set, but it's more obvious that way),
lift the check into callers. And do the same to set_root(), to keep them
in sync.
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Josef Bacik found a performance regression between 3.2 and 3.10 and
narrowed it down to commit bfcfaa77bd ("vfs: use 'unsigned long'
accesses for dcache name comparison and hashing"). He reports:
"The test case is essentially
for (i = 0; i < 1000000; i++)
mkdir("a$i");
On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
dir/sec with 3.10. This is because we spend waaaaay more time in
__d_lookup on 3.10 than in 3.2.
The new hashing function for strings is suboptimal for <
sizeof(unsigned long) string names (and hell even > sizeof(unsigned
long) string names that I've tested). I broke out the old hashing
function and the new one into a userspace helper to get real numbers
and this is what I'm getting:
Old hash table had 1000000 entries, 0 dupes, 0 max dupes
New hash table had 12628 entries, 987372 dupes, 900 max dupes
We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
My test does the hash, and then does the d_hash into a integer pointer
array the same size as the dentry hash table on my system, and then
just increments the value at the address we got to see how many
entries we overlap with.
As you can see the old hash function ended up with all 1 million
entries in their own bucket, whereas the new one they are only
distributed among ~12.5k buckets, which is why we're using so much
more CPU in __d_lookup".
The reason for this hash regression is two-fold:
- On 64-bit architectures the down-mixing of the original 64-bit
word-at-a-time hash into the final 32-bit hash value is very
simplistic and suboptimal, and just adds the two 32-bit parts
together.
In particular, because there is no bit shuffling and the mixing
boundary is also a byte boundary, similar character patterns in the
low and high word easily end up just canceling each other out.
- the old byte-at-a-time hash mixed each byte into the final hash as it
hashed the path component name, resulting in the low bits of the hash
generally being a good source of hash data. That is not true for the
word-at-a-time case, and the hash data is distributed among all the
bits.
The fix is the same in both cases: do a better job of mixing the bits up
and using as much of the hash data as possible. We already have the
"hash_32|64()" functions to do that.
Reported-by: Josef Bacik <jbacik@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Callers of d_splice_alias(dentry, inode) don't need iput(), neither
on success nor on failure. Either the reference to inode is stored
in a previously negative dentry, or it's dropped. In either case
inode reference the caller used to hold is consumed.
__gfs2_lookup() does iput() in case when d_splice_alias() has failed.
Double iput() if we ever hit that. And gfs2_create_inode() ends up
not only with double iput(), but with link count dropped to zero - on
an inode it has just found in directory.
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull btrfs fixes from Chris Mason:
"Filipe is doing a careful pass through fsync problems, and these are
the fixes so far. I'll have one more for rc6 that we're still
testing.
My big commit is fixing up some inode hash races that Al Viro found
(thanks Al)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use insert_inode_locked4 for inode creation
Btrfs: fix fsync data loss after a ranged fsync
Btrfs: kfree()ing ERR_PTRs
Btrfs: fix crash while doing a ranged fsync
Btrfs: fix corruption after write/fsync failure + fsync + log recovery
Btrfs: fix autodefrag with compression
Both blocks layout and objects layout want to use it to avoid CB_LAYOUTRECALL
but that should only happen if client is doing truncation to a smaller size.
For other cases, we let server decide if it wants to recall client's layouts.
Change PNFS_LAYOUTRET_ON_SETATTR to follow the logic and not to send
layoutreturn unnecessarily.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This code is internal to the v3 module, so other parts of the client
shouldn't have any knowledge of it.
nfs3_getxattr(), nfs3_setxattr(), and nfs3_removexattr() no longer exist
anywhere so I remove the declarations while I'm here.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This check is already performed by the module loading code - if the
module can't be found then -EPROTONOSUPPORT will be returned. Let's
handle v3 this way, too.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I am generally against the "one big header file" approach, and
everything in the client includes this file. Let's move all the NFS v3
declarations into a v3-only header file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The goal is to create a generic NFS module with code that does not
depend on what versions of NFS are enabled.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This code has been around for a while, but never was enabled, although
it is in a working shape.
Note that we implement NOTIFY_DEVICEID4_CHANGE identical to
NOTIFY_DEVICEID4_DELETE. Given that in either case we can't do anything
but preventing further lookups of a given device ID there isn't much difference
in semantics for the two. For the delete case the server MUST ensure that
there are no outstanding layouts, while for the change case it doesn't, but
that has little relevance to the client.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patches moves parsing of the GETDEVICEINFO XDR to kernel space, as well
as the management of complex devices. The reason for that is we might have
multiple outstanding complex devices after a NOTIFY_DEVICEID4_CHANGE, which
device mapper or md can't handle as they claim devices exclusively.
But as is turns out simple striping / concatenation is fairly trivial to
implement anyway, so we make our life simpler by reducing the reliance
on blkmapd. For now we still use blkmapd by feeding it synthetic SIMPLE
device XDR to translate device signatures to device numbers, but in the
long runs I have plans to eliminate it entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Create a file to house all the rpc_pipefs boilerplate code instead of
sprinkling it over a few files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Factor out a helper for all per-extent work, and merge the now trivial
functions for lseg allocation and parsing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This isn't device(id) related, so move it into the main file. Simple move
for now, the next commit will clean it up a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of overflowing the XDR send buffer with our extent list allocate
pages and pre-encode the layoutupdate payload into them. We optimistically
allocate a single page use alloc_page and only switch to vmalloc when we
have more extents outstanding. Currently there is only a single testcase
(xfstests generic/113) which can reproduce large enough extent lists for
this to occur.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The current GETDEVICELIST implementation is buggy in that it doesn't handle
cursors correctly, and in that it returns an error if the server returns
NFSERR_NOTSUPP. Given that there is no actual need for GETDEVICELIST,
it has various issues and might get removed for NFSv4.2 stop using it in
the blocklayout driver, and thus the Linux NFS client as whole.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The kbuild test robot complained about a new sparse warning in
objio_alloc_deviceid_node, but it turns out that this was just a moved
reference to an existing variable. Fix it to have the right big endian
annotated type.
Note that there are some other endianess issues in this file that I didn't
bother to sort out as they involve global headers.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The kbuild test robot complained that we got the printk format wrong.
Let's just kill these printks instead of fixing them as there is not
point after the initial tree algorithm debugging.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
commit 4fa2c54b51
NFS: nfs4_do_open should add negative results to the dcache.
used "d_drop(); d_add();" to ensure that a dentry was hashed
as a negative cached entry.
This is not safe if the dentry has an non-NULL ->d_inode.
It will trigger a BUG_ON in d_instantiate().
In that case, d_delete() is needed.
Also, only d_add if the dentry is currently unhashed, it seems
pointless removed and re-adding it unchanged.
Reported-by: Christoph Hellwig <hch@infradead.org>
Fixes: 4fa2c54b51
Cc: Jeff Layton <jeff.layton@primarydata.com>
Link: http://lkml.kernel.org/r/20140908144525.GB19811@infradead.org
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This fixes a failure in xfstests generic/313 because nfs doesn't update
mtime on a truncate. The protocol requires this to be done implicity
for a size changing setattr.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types gfs2 supports and with
addition of project quotas these two numbers stop matching. So make gfs2
use its private definition.
CC: cluster-devel@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fix a regression introduced by:
6d4ade986f GFS2: Add atomic_open support
where an early return misses d_splice_alias() which had been
adding the negative dentry.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Merge misc fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
fs/notify: don't show f_handle if exportfs_encode_inode_fh failed
fsnotify/fdinfo: use named constants instead of hardcoded values
kcmp: fix standard comparison bug
mm/mmap.c: use pr_emerg when printing BUG related information
shm: add memfd.h to UAPI export list
checkpatch: allow commit descriptions on separate line from commit id
sh: get_user_pages_fast() must flush cache
eventpoll: fix uninitialized variable in epoll_ctl
kernel/printk/printk.c: fix faulty logic in the case of recursive printk
mem-hotplug: let memblock skip the hotpluggable memory regions in __next_mem_range()
Currently we handle only ENOSPC. In case of other errors the file_handle
variable isn't filled properly and we will show a part of stack.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MAX_HANDLE_SZ is equal to 128, but currently the size of pad is only 64
bytes, so exportfs_encode_inode_fh can return an error.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
not initialized but ep_take_care_of_epollwakeup reads its event field.
When this unintialized field has EPOLLWAKEUP bit set, a capability check
is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup. This
produces unexpected messages in the audit log, such as (on a system
running SELinux):
type=AVC msg=audit(1408212798.866:410): avc: denied
{ block_suspend } for pid=7754 comm="dbus-daemon" capability=36
scontext=unconfined_u:unconfined_r:unconfined_t
tcontext=unconfined_u:unconfined_r:unconfined_t
tclass=capability2 permissive=1
type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
exe="/usr/bin/dbus-daemon"
subj=unconfined_u:unconfined_r:unconfined_t key=(null)
("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")
Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.
Fixes: 4d7e30d989 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull UDF fixes from Jan Kara:
"Fixes for UDF handling of NFS handles and one fix for proper handling
of corrupted media"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: saner calling conventions for udf_new_inode()
udf: fix the udf_iget() vs. udf_new_inode() races
udf: merge the pieces inserting a new non-directory object into directory
udf: Set i_generation field
udf: Properly detect stale inodes
udf: Make udf_read_inode() and udf_iget() return error
udf: Avoid infinite loop when processing indirect ICBs
udf: Fold udf_fill_inode() into __udf_read_inode()
udf: Avoid dir link count to go negative
sparse says:
fs/nfs/file.c:543:60: warning: incorrect type in argument 1 (different address spaces)
fs/nfs/file.c:543:60: expected struct rpc_xprt *xprt
fs/nfs/file.c:543:60: got struct rpc_xprt [noderef] <asn:4>*cl_xprt
fs/nfs/file.c:548:53: warning: incorrect type in argument 1 (different address spaces)
fs/nfs/file.c:548:53: expected struct rpc_xprt *xprt
fs/nfs/file.c:548:53: got struct rpc_xprt [noderef] <asn:4>*cl_xprt
cl_xprt is RCU-managed, so we need to take care to dereference and use
it while holding the RCU read lock.
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The VFS never calls setattr with ATTR_SIZE on anything but regular
files. Remove the if check and turn it into an assert similar to
what some other file systems do.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This will be used by the block layout driver when splitting extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At a simple helper to issue a GETDEVICELIST operation and pre-load
the device id cache based on the result.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Add support to the common pNFS core to issue GETDEVICEINFO calls on
a device ID cache miss. The code is taken from the well debugged
file layout implementation and calls out to the layoutdriver through
a new alloc_deviceid_node method. The calling conventions for
nfs4_find_get_deviceid are changed so that all information needed to
send a GETDEVICEINFO request is passed to the common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This speads up truncate-heavy workloads like fsx by multiple orders of
magnitude.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This allows removing extents from the extent tree especially on truncate
operations, and thus fixing reads from truncated and re-extended that
previously returned stale data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently the block layout driver tracks extents in three separate
data structures:
- the two list of pnfs_block_extent structures returned by the server
- the list of sectors that were in invalid state but have been written to
- a list of pnfs_block_short_extent structures for LAYOUTCOMMIT
All of these share the property that they are not only highly inefficient
data structures, but also that operations on them are even more inefficient
than nessecary.
In addition there are various implementation defects like:
- using an int to track sectors, causing corruption for large offsets
- incorrect normalization of page or block granularity ranges
- insufficient error handling
- incorrect synchronization as extents can be modified while they are in
use
This patch replace all three data with a single unified rbtree structure
tracking all extents, as well as their in-memory state, although we still
need to instance for read-only and read-write extent due to the arcane
client side COW feature in the block layouts spec.
To fix the problem of extent possibly being modified while in use we make
sure to return a copy of the extent for use in the write path - the
extent can only be invalidated by a layout recall or return which has
to wait until the I/O operations finished due to refcounts on the layout
segment.
The new extent tree work similar to the schemes used by block based
filesystems like XFS or ext4.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The core nfs code handles setting pages uptodate on reads, no need to mess
with the pageflags outselves. Also remove a debug function to dump page
flags.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Use the new PNFS_READ_WHOLE_PAGE flag to offload read-modify-write
handling to core nfs code, and remove a huge chunk of deadlock prone
mess from the block layout writeback path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a layout driver keeps per-inode state outside of the layout segments it
needs to be notified of any layout returns or recalls on an inode, and not
just about the freeing of layout segments. Add a method to acomplish this,
which will allow the block layout driver to handle the case of truncated
and re-expanded files properly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Like all block based filesystems, the pNFS block layout driver can't read
or write at a byte granularity and thus has to perform read-modify-write
cycles on writes smaller than this granularity.
Add a flag so that the core NFS code always reads a whole page when
starting a smaller write, so that we can do it in the place where the VFS
expects it instead of doing in very deadlock prone way in the writeback
handler.
Note that in theory we could do less than page size reads here for disks
that have a smaller sector size which are served by a server with a smaller
pnfs block size. But so far that doesn't seem like a worthwhile
optimization.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Expedite layout recall processing by forcing a layout commit when
we see busy segments. Without it the layout recall might have to wait
until the VM decided to start writeback for the file, which can introduce
long delays.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
gcc reports:
linux/fs/nfs/write.c: In function ‘nfs_page_find_head_request_locked.isra.17’:
linux/fs/nfs/write.c:121:64: warning: ‘cinfo.mds’ may be used uninitialized in this function [-Wmaybe-uninitialized]
list_for_each_entry_safe(freq, t, &cinfo.mds->list, wb_list) {
^
linux/fs/nfs/write.c:110:25: note: ‘cinfo.mds’ was declared here
struct nfs_commit_info cinfo;
Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we do non-page sized reads we can underflow the extent_length variable
and read incorrect data. Fix the extent_length calculation and change to
defensive <= checks for the extent length in the read and write path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Make sure the block queue is plugged when performing pNFS blocklayout I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tell userspace what stage of GETDEVICEINFO failed so that there is a chance
to debug it, especially with the userspace daemon clusterf***k in the block
layout driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The Linux VM subsystem can't support block sizes larger than page size
for block based filesystems very well. While this can be hacked around
to some extent for simple filesystems the read-modify-write cycles
required for pnfs block invalid extents are extremly deadlock prone
when operating on multiple pages. Reject this case early on instead
of pretending to support it (badly).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently there is no XDR buffer space allocated for the per-layout driver
layoutcommit payload, which leads to server buffer overflows in the
blocklayout driver even under simple workloads. As we can't do per-layout
sizes for XDR operations we'll have to splice a previously encoded list
of pages into the XDR stream, similar to how we handle ACL buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
After we issued a layoutreturn operations the may free the layout stateid
and will thus cause bad stateid error when the client uses it again.
We currently try to avoid this case by chosing the open stateid if not
lsegs are present for this inode. But various places can hold refererence
on lsegs and thus cause the list not to be empty shortly after a layout
return. Add an explicit flag to mark the current layout stateid invalid
and force usage of the openstateid after we did a full file layoutreturn.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently we fall through to nfs4_async_handle_error when we get
a bad stateid error back from layoutget. nfs4_async_handle_error
with a NULL state argument will never retry the operations but return
the error to higher layer, causing an avoiable fallback to MDS I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When layoutget returns an entirely new layout stateid it should not
check the generation counter as the new stateid will start with a new
counter entirely unrelated to old one.
The current behavior causes constant layoutget failures against a block
server which allocates a new stateid after an recall that removed all
outstanding layouts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Ensure the lsegs are initialized early so that we don't pass an unitialized
one back to ->free_lseg during error processing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
pNFS servers may return arbitrarily large layouts. Trim back the I/O size
to one that we can at least allocate the page array for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Following http://www.rfc-editor.org/errata_search.php?rfc=5661&eid=2751
Don't set layoutcommit for commit_through_mds case.
For FILE_SYNC writes, don't set layoutcommit.
For DATA_SYNC wirtes, set layout commit right after wirtes done.
For UNSTABLE writes, set layout commit when commit done.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Track lwb in nfs_commit_data so that we can use it to setup
layoutcommit in commit_done callback.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
can_open_cached() reads values out of the state structure, meaning that
we need the so_lock to have a correct return value. As a bonus, this
helps clear up some potentially confusing code.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
filelayout_retry_commit was recently split out from alloc_ds_commits,
but was done in such a way that the bucket pointer always starts at
index 0 no matter what the @idx argument is set to.
The intention of the @idx argument is to retry commits starting at
bucket @idx. This is called when alloc_ds_commits fails for a bucket.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull cifs/smb3 fixes from Steve French:
"This includes various cifs and smb3 bug fixes including those for bugs
found with the recently updated xfstests.
Also I am working fixes for two additional cifs problems found by
xfstests which I plan to send later (when reviewed and run additional
tests)"
* 'for-next-3.17' of git://git.samba.org/sfrench/cifs-2.6:
Clarify Kconfig help text for CIFS and SMB2/SMB3
CIFS: Fix wrong filename length for SMB2
CIFS: Fix wrong restart readdir for SMB1
CIFS: Fix directory rename error
cifs: No need to send SIGKILL to demux_thread during umount
cifs: Allow directIO read/write during cache=strict
cifs: remove unneeded check of null checking in if condition
cifs: fix a possible use of uninit variable in SMB2_sess_setup
cifs: fix memory leak when password is supplied multiple times
cifs: fix a possible null pointer deref in decode_ascii_ssetup
Trivial whitespace fix
If application throws negative value of lseek with SEEK_DATA|SEEK_HOLE,
previous f2fs went into BUG_ON in get_dnode_of_data, which was reported
by Tommi Rantala.
He could make a simple code to detect this having:
lseek(fd, -17595150933902LL, SEEK_DATA);
This patch should resolve that bug.
Reported-by: Tommi Rentala <tt.rantala@gmail.com>
[Jaegeuk Kim: relocate the condition as suggested by Chao]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In gc_node_segment, if node page gc is run concurrently with node page
writeback, and check_valid_map and get_node_page run after page locked
and before cur_valid_map is updated as below, it is possible for the
page to be written twice unnecessarily.
sync_node_pages
try_lock_page
...
check_valid_map f2fs_write_node_page
...
write_node_page
do_write_page
allocate_data_block
...
refresh_sit_entry /* update cur_valid_map */
...
...
unlock_page
get_node_page
...
set_page_dirty
...
f2fs_put_page
unlock_page
This can be solved via calling check_valid_map after get_node_page again.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We use flush cmd control to collect many flush cmds, and flush them
together. In this case, we use two list to manage the flush cmds
(collect and dispatch), and one spin lock is used to protect this.
In fact, the lock-less list(llist) is very suitable to this case,
and we use simplify this routine.
-
v2:
-use llist_for_each_entry_safe to fix possible use-after-free issue.
-remove the unused field from struct flush_cmd.
Thanks for Yu's suggestion.
-
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In commit aec71382c6 ("f2fs: refactor flush_nat_entries codes for reducing NAT
writes"), we descripte the issue as below:
"Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
journal is full, then flush the left dirty entries to disk without merge
journaled entries, so these journaled entries may be flushed to disk at next
checkpoint but lost chance to flushed last time."
Actually, we have the same problem in using SIT journal area.
In this patch, firstly we will update sit journal with dirty entries as many as
possible. Secondly if there is no space in sit journal, we will remove all
entries in journal and walk through the whole dirty entry bitmap of sit,
accounting dirty sit entries located in same SIT block to sit entry set. All
entry sets are linked to list sit_entry_set in sm_info, sorted ascending order
by count of entries in set. Later we flush entries in set which have fewest
entries into journal as many as we can, and then flush dense set with merged
entries to disk.
In this way we can use sit journal area more effectively, also we will reduce
SIT update, result in gaining in performance and saving lifetime of flash
device.
In my testing environment, it shows this patch can help to reduce SIT block
update obviously.
virtual machine + hard disk:
fsstress -p 20 -n 400 -l 5
sit page num cp count sit pages/cp
based 2006.50 1349.75 1.486
patched 1566.25 1463.25 1.070
Our latency of merging op is small when handling a great number of dirty SIT
entries in flush_sit_entries:
latency(ns) dirty sit count
36038 2151
49168 2123
37174 2232
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
sit_i in macro SIT_BLOCK_OFFSET/START_SEGNO is not used, remove it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch replaces BUG cases with f2fs_bug_on to remain fsck.f2fs information.
And it implements some void functions to initiate fsck.f2fs too.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull ext4 bugfix from Ted Ts'o.
[ Hmm. It's possible we should make kfree() aware of error pointers,
and use IS_ERR_OR_NULL rather than a NULL check. But in the meantime
this is obviously the right fix. - Linus ]
* 'for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: avoid trying to kfree an ERR_PTR pointer
Pull nfsd bugfixes from Bruce Fields:
"A couple minor nfsd bugfixes"
* 'for-3.17' of git://linux-nfs.org/~bfields/linux:
lockd: fix rpcbind crash on lockd startup failure
nfsd4: fix rd_dircount enforcement
Btrfs was inserting inodes into the hash table before we had fully
set the inode up on disk. This leaves us open to rare races that allow
two different inodes in memory for the same [root, inode] pair.
This patch fixes things by using insert_inode_locked4 to insert an I_NEW
inode and unlock_new_inode when we're ready for the rest of the kernel
to use the inode.
It also makes sure to init the operations pointers on the inode before
going into the error handling paths.
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
While we're doing a full fsync (when the inode has the flag
BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
portion of the file), we might have ordered operations that are started
before or while we're logging the inode and that fall outside the fsync
range.
Therefore when a full ranged fsync finishes don't remove every extent
map from the list of modified extent maps - as for some of them, that
fall outside our fsync range, their respective ordered operation hasn't
finished yet, meaning the corresponding file extent item wasn't inserted
into the fs/subvol tree yet and therefore we didn't log it, and we must
let the next fast fsync (one that checks only the modified list) see this
extent map and log a matching file extent item to the log btree and wait
for its ordered operation to finish (if it's still ongoing).
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in
btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them.
These kind of bugs are "One Err Bugs" where there is just one error
label that does everything. I could set the "inherit = NULL" and keep
the single out label but it ends up being more complicated that way. It
makes the code simpler to re-order the unwind so it's in the mirror
order of the allocation and introduce some new error labels.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Commit 3b29970909 "nfsd4: enforce rd_dircount" totally misunderstood
rd_dircount; it refers to total non-attribute bytes returned, not number
of directory entries returned.
Bring the code into agreement with RFC 3530 section 14.2.24.
Cc: stable@vger.kernel.org
Fixes: 3b29970909 "nfsd4: enforce rd_dircount"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull filesystem fixes from Al Viro:
"Several bugfixes (all of them -stable fodder).
Alexey's one deals with double mutex_lock() in UFS (apparently, nobody
has tried to test "ufs: sb mutex merge + mutex_destroy" on something
like file creation/removal on ufs). Mine deal with two kinds of
umount bugs, in umount propagation and in handling of automounted
submounts, both resulting in bogus transient EBUSY from umount"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs: fix deadlocks introduced by sb mutex merge
fix EBUSY on umount() from MNT_SHRINKABLE
get rid of propagate_umount() mistakenly treating slaves as busy.
Commit 0244756edc ("ufs: sb mutex merge + mutex_destroy") introduces
deadlocks in ufs_new_inode() and ufs_free_inode().
Most callers of that functions acqure the mutex by themselves and
ufs_{new,free}_inode() do that via lock_ufs(),
i.e we have an unavoidable double lock.
The patch proposes to resolve the issue by making sure that
ufs_{new,free}_inode() are not called with the mutex held.
Found by Linux Driver Verification project (linuxtesting.org).
Cc: stable@vger.kernel.org # 3.16
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix:
- a direct IO read/buffered read data corruption
- the associated fallout from the DIO data corruption fix
- collapse range bugs that are potential data corruption issues.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUCkM+AAoJEK3oKUf0dfodUBgP+gJu50/XV4TFRLPlCRxhvN61
371i3ASls1y7ivhj40NzgbDDAZHM2q8Zqwd//318dFViHhWQDlH/1ga06kscRpZX
d8cQEbFHApgUGQL5Gdq2l2hvAzYa75H0m6cq3jveyrN2rscjCSmAXwtlcEmx3AR6
TnCpxuVL5asjEGZYb0KfQACq//rASHJbhukpo1gB4ccZ0boWHOVf5SxuS4remzs9
y+rlPFNl5RD/WVdnJSvu9zu/nP6op3Ax5r7jZanoKbisKHfd7QOa+k65+Vz0Vq9G
kxgfhz+yLfkOvcktq+41e1lVBln7fCIlcO9m3b53uxWPx5cla323893UiGYsA4F/
j/gGlh1qaT6C/1M1JBWDLDx931S78XiR1Y+WbtAU1PO+GuO0IEap3+iqtS2+oNAv
OrpThLOgqTspK6MJeToCzdn2lRT2BJpcKwxIyDK8g+p9N6qCpyw3DfiKyu0wipGH
D2D3mtE6drSHNaSceFAz8CrQvPOR7Ygj92QGpGSfkohxap9h6VJR/wNp/oGnpmN0
qgcxTrHvx3kw1hXssB4gjh6fBDnOUkac0isqxdow22Qt529t9sIanzMBvz+JxHQF
zeqeFSh96lOXmB7UFBU+QyOhbDp3cJWChHrtY3Esw/+FmG6fxEy8z6pdZsiJYELr
5tka2richPD+gyXzcZwP
=tRQo
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"The fixes all address recently discovered data corruption issues.
The original Direct IO issue was discovered by Chris Mason @ Facebook
on a production workload which mixed buffered reads with direct reads
and writes IO to the same file. The fix for that exposed other issues
with page invalidation (exposed by millions of fsx operations) failing
due to dirty buffers beyond EOF.
Finally, the collapse_range code could also cause problems due to
racing writeback changing the extent map while it was being shifted
around. The commits for that problem are simple mitigation fixes that
prevent the problem from occuring. A more robust fix for 3.18 that
addresses the underlying problem is currently being worked on by
Brian.
Summary of fixes:
- a direct IO read/buffered read data corruption
- the associated fallout from the DIO data corruption fix
- collapse range bugs that are potential data corruption issues"
* tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: trim eofblocks before collapse range
xfs: xfs_file_collapse_range is delalloc challenged
xfs: don't log inode unless extent shift makes extent modifications
xfs: use ranged writeback and invalidation for direct IO
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't dirty buffers beyond EOF
This patch changes sync_filesystem() to be EXPORT_SYMBOL().
The reason this is needed is that starting with 3.15 kernel, due to
Theodore Ts'o's commit 02b9984d64 ("fs: push sync_filesystem() down to
the file system's remount_fs()"), all file systems that have dirty data
to be written out need to call sync_filesystem() from their
->remount_fs() method when remounting read-only.
As this is now a generically required function rather than an internal
only function it should be EXPORT_SYMBOL() so that all file systems can
call it.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull aio bugfixes from Ben LaHaise:
"Two small fixes"
* git://git.kvack.org/~bcrl/aio-fixes:
aio: block exit_aio() until all context requests are completed
aio: add missing smp_rmb() in read_events_ring
It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in current implemention, so fix
it in the same way as we did in io_destroy.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Currently udf_iget() (triggered by NFS) can race with udf_new_inode()
leading to two inode structures with the same inode number:
nfsd: iget_locked() creates inode
nfsd: try to read from disk, block on that.
udf_new_inode(): allocate inode with that inumber
udf_new_inode(): insert it into icache, set it up and dirty
udf_write_inode(): write inode into buffer cache
nfsd: get CPU again, look into buffer cache, see nice and sane on-disk
inode, set the in-core inode from it
Fix the problem by putting inode into icache in locked state (I_NEW set)
and unlocking it only after it's fully set up.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
boilerplate code in udf_{create,mknod,symlink} taken to new helper
symlink case converted to unique id calculated by udf_new_inode() - no
point finding a new one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently UDF doesn't initialize i_generation in any way and thus NFS
can easily get reallocated inodes from stale file handles. Luckily UDF
already has a unique object identifier associated with each inode -
i_unique. Use that for initialization of i_generation.
Signed-off-by: Jan Kara <jack@suse.cz>
NFS can easily ask for inodes that are already deleted. Currently UDF
happily returns such inodes which is a bug. Return -ESTALE if
udf_read_inode() is asked to read deleted inode.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently __udf_read_inode() wasn't returning anything and we found out
whether we succeeded reading inode by checking whether inode is bad or
not. udf_iget() returned NULL on failure and inode pointer otherwise.
Make these two functions properly propagate errors up the call stack and
use the return value in callers.
Signed-off-by: Jan Kara <jack@suse.cz>
We did not implement any bound on number of indirect ICBs we follow when
loading inode. Thus corrupted medium could cause kernel to go into an
infinite loop, possibly causing a stack overflow.
Fix the possible stack overflow by removing recursion from
__udf_read_inode() and limit number of indirect ICBs we follow to avoid
infinite loops.
Signed-off-by: Jan Kara <jack@suse.cz>
There's no good reason to separate these since udf_fill_inode() is
called only from __udf_read_inode() and both do part of the same thing.
Signed-off-by: Jan Kara <jack@suse.cz>
If we are writing back inode of unlinked directory, its link count ends
up being (u16)-1. Although the inode is deleted, udf_iget() can load the
inode when NFS uses stale file handle and get confused.
Signed-off-by: Jan Kara <jack@suse.cz>
Pull f2fs bug fixes from Jaegeuk Kim:
"This series includes patches to:
- fix recovery routines
- fix bugs related to inline_data/xattr
- fix when casting the dentry names
- handle EIO or ENOMEM correctly
- fix memory leak
- fix lock coverage"
* tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (28 commits)
f2fs: reposition unlock_new_inode to prevent accessing invalid inode
f2fs: fix wrong casting for dentry name
f2fs: simplify by using a literal
f2fs: truncate stale block for inline_data
f2fs: use macro for code readability
f2fs: introduce need_do_checkpoint for readability
f2fs: fix incorrect calculation with total/free inode num
f2fs: remove rename and use rename2
f2fs: skip if inline_data was converted already
f2fs: remove rewrite_node_page
f2fs: avoid double lock in truncate_blocks
f2fs: prevent checkpoint during roll-forward
f2fs: add WARN_ON in f2fs_bug_on
f2fs: handle EIO not to break fs consistency
f2fs: check s_dirty under cp_mutex
f2fs: unlock_page when node page is redirtied out
f2fs: introduce f2fs_cp_error for readability
f2fs: give a chance to mount again when encountering errors
f2fs: trigger release_dirty_inode in f2fs_put_super
f2fs: don't skip checkpoint if there is no dirty node pages
...
Thanks to Dan Carpenter for extending smatch to find bugs like this.
(This was found using a development version of smatch.)
Fixes: 36de928641
Reported-by: Dan Carpenter <dan.carpenter@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:
[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)
The necessary conditions that lead to such crash are:
* an incremental fsync (when the inode doesn't have the
BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
a file extent item ending at offset X;
* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
to a file truncate operation that reduces the file to a size smaller
than X;
* a ranged fsync call happens (via an msync for example), with a range that
doesn't cover the whole file and the end of this range, lets call it Y, is
smaller than X;
* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
calls btrfs_truncate_inode_items() to remove all items from the log
tree that are associated with our file;
* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
file extent item it removed is the one ending at offset X, where X > 0 and
X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
parameter set to X;
* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
ordered operation with a range covering the offset X, calling a BUG_ON() if
such ordered operation exists. This assumption is made because the disk_i_size
is only increased after the corresponding file extent item is added to the
btree (btrfs_finish_ordered_io);
* But because our fsync covers only a limited range, such an ordered extent might
exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
extent to finish when calling btrfs_wait_ordered_range();
And then by the time btrfs_ordered_update_i_size() is called, via:
btrfs_sync_file() ->
btrfs_log_dentry_safe() ->
btrfs_log_inode_parent() ->
btrfs_log_inode() ->
btrfs_truncate_inode_items() ->
btrfs_ordered_update_i_size()
We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().
So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.
Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
While writing to a file, in inode.c:cow_file_range() (and same applies to
submit_compressed_extents()), after reserving an extent for the file data,
we create a new extent map for the written range and insert it into the
extent map cache. After that, we create an ordered operation, but if it
fails (due to a transient/temporary-ENOMEM), we return without dropping
that extent map, which points to a reserved extent that is freed when we
return. A subsequent incremental fsync (when the btrfs inode doesn't have
the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
logs a file extent item based on that extent map, which points to a disk
extent that doesn't contain valid data - it was freed by us earlier, at this
point it might contain any random/garbage data.
Therefore, if we reach an error condition when cowing a file range after
we added the new extent map to the cache, drop it from the cache before
returning.
Some sequence of steps that lead to this:
$ mkfs.btrfs -f /dev/sdd
$ mount -o commit=9999 /dev/sdd /mnt
$ cd /mnt
$ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
$ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
$ sync
$ od -t x1 foo
0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
*
0020000
$ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
# Now this write + fsync fail with -ENOMEM, which was returned by
# btrfs_add_ordered_extent() in inode.c:cow_file_range().
$ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
$ xfs_io -c "fsync" foo
fsync: Cannot allocate memory
# Now do a new write + fsync, which will succeed. Our previous
# -ENOMEM was a transient/temporary error.
$ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
$ xfs_io -c "fsync" foo
# Our file content (in page cache) is now:
$ od -t x1 foo
0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
*
0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000
# Now reboot the machine, and mount the fs, so that fsync log replay
# takes place.
# The file content is now weird, in particular the first 8Kb, which
# do not match our data before nor after the sync command above.
$ od -t x1 foo
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000
# In fact these first 4Kb are a duplicate of the last 4kb block.
# The last write got an extent map/file extent item that points to
# the same disk extent that we got in the write+fsync that failed
# with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
# verify that:
$ btrfs-debug-tree /dev/sdd
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
extent data disk byte 12582912 nr 8192
extent data offset 0 nr 8192 ram 8192
item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 8192 ram 8192
item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
extent data disk byte 12582912 nr 4096
extent data offset 0 nr 4096 ram 4096
$ umount /dev/sdd
$ btrfsck /dev/sdd
Checking filesystem on /dev/sdd
UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
checking extents
extent item 12582912 has multiple extent items
ref mismatch on [12582912 4096] extent item 1, found 2
Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
backpointer mismatch on [12582912 4096]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
root 5 inode 257 errors 1000, some csum missing
found 131074 bytes used err is 1
total csum bytes: 4
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123404
file data blocks allocated: 274432
referenced 274432
Btrfs v3.14.1-96-gcc7fd5a-dirty
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This fixes an Oopsable race when starting up the callback server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fixes an Oopsable race when starting lockd.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events. After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves. This small patch fixes the
problem in testing.
Thanks to Zach for helping to look into this, and suggesting the fix.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
As the race condition on the inode cache, following scenario can appear:
[Thread a] [Thread b]
->f2fs_mkdir
->f2fs_add_link
->__f2fs_add_link
->init_inode_metadata failed here
->gc_thread_func
->f2fs_gc
->do_garbage_collect
->gc_data_segment
->f2fs_iget
->iget_locked
->wait_on_inode
->unlock_new_inode
->move_data_page
->make_bad_inode
->iput
When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode
should be set as bad to avoid being accessed by other thread. But in above
scenario, it allows f2fs to access the invalid inode before this inode was set
as bad.
This patch fix the potential problem, and this issue was found by code review.
change log from v1:
o Add condition judgment in gc_data_segment() suggested by Changman Lee.
o use iget_failed to simplify code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
xfs_collapse_file_space() currently writes back the entire file
undergoing collapse range to settle things down for the extent shift
algorithm. While this prevents changes to the extent list during the
collapse operation, the writeback itself is not enough to prevent
unnecessary collapse failures.
The current shift algorithm uses the extent index to iterate the in-core
extent list. If a post-eof delalloc extent persists after the writeback
(e.g., a prior zero range op where the end of the range aligns with eof
can separate the post-eof blocks such that they are not written back and
converted), xfs_bmap_shift_extents() becomes confused over the encoded
br_startblock value and fails the collapse.
As with the full writeback, this is a temporary fix until the algorithm
is improved to cope with a volatile extent list and avoid attempts to
shift post-eof extents.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If we have delalloc extents on a file before we run a collapse range
opertaion, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.
However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....
And when writeback runs, it can convert delalloc extents is the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.
The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.
Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining exents downward.
The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.
Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now we are not doing silly things with dirtying buffers beyond EOF
and using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.
Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
cc: stable@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads. This is different from the other filesystems who
only invalidate pages during DIO writes.
truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page. This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.
buffered reads will find an up to date page with zeros instead of
the data actually on disk.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
[dchinner: catch error and warn if it fails. Comment.]
cc: stable@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.
The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.
Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.
cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUAmbRAAoJEAAOaEEZVoIVah4P/iupUO7Ae5ODMDMog/vOp+SM
+sWnyqnEyeMlQlNDoHoef5TPQ28aKEAq1Sg7CsqlK3qZSYSSPhb4KFsGWLZe6D5A
7iWSMKabdnuQ3qBCsb2Y6ZdB8IRAJz81sIAVI8F32NDmSs325wN/coVwfV4g8mQF
QSpv78TjwBY0qNhNw06pS/FLV45IaPTDDgnTHRcOLrfHajDdGTdqrKI/L0ES1PFB
0ZUtG3qMPS2XYRyS6ZQ0TZZrl2/HMA5/fOwqKspNKxYxKS+TOf/umKwPPjHBnMHo
mfD1XnG64ECkNio9bpg2CkjUqaT8aloNPgDxuP15vEV6bZ5WBLKjGUOY2IvPa198
do8CaAdp2Ql6kE2IyD+G+IkjqcxZ9H8hNH3cBM+3TzvxYqaiZKKZAky5UTau2LvG
E5cyWhDPsVBGvAJXEPBf4vhIgzhaSuNox0+73nL4xU+L7bPTDzYIYyhS/InKO0X+
ZwAZn2u62XQmUDI8b+zrgOAHfWB0hHlcIfIsIrDxotM24TPPbJ2k2Dz0hKCZAraR
DYDYPJZg+/QyPc8bujL5Hwjh8MogdDt1vxd9B65MwQWRn791LSGbq6VSsUnsMoAa
dhG5U+a5eI8oQ2gkMEEK45o2ljcnZ3BSim6SGdmZ6YrNyEdk63xA4GjozK1YhWzG
tLkXb4/7zV/dR8VTOQR/
=Wb1t
-----END PGP SIGNATURE-----
Merge tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux
Pull file locking bugfx from Jeff Layton:
"Just a bugfix for a bug that crept in to v3.15. It's in a rather rare
error path, and I'm not aware of anyone having hit it, but it's worth
fixing for v3.17"
* tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux:
locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease
We need the parents of victims alive until namespace_unlock() gets to
dput() of the (ex-)mountpoints. However, that screws up the "is it
busy" checks in case when we have shrinkable mounts that need to be
killed. Solution: go ahead and decrement refcounts of parents right
in umount_tree(), increment them again just before dropping rwsem in
namespace_unlock() (and let the loop in the end of namespace_unlock()
finally drop those references for good, as we do now). Parents can't
get freed until we drop rwsem - at least one reference is kept until
then, both in case when parent is among the victims and when it is
not. So they'll still be around when we get to namespace_unlock().
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The check in __propagate_umount() ("has somebody explicitly mounted
something on that slave?") is done *before* taking the already doomed
victims out of the child lists.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>