* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
JFS: Free sbi memory in error path
fs/sysv: dereferencing ERR_PTR()
Fix double-free in logfs
Fix the regression created by "set S_DEAD on unlink()..." commit
I spotted the missing kfree() while removing the BKL.
[akpm@linux-foundation.org: avoid multiple returns so it doesn't happen again]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I moved the dir_put_page() inside the if condition so we don't dereference
"page", if it's an ERR_PTR().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) i_flags simply doesn't work for mount/unlink race prevention;
we may have many links to file and rm on one of those obviously
shouldn't prevent bind on top of another later on. To fix it
right way we need to mark _dentry_ as unsuitable for mounting
upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
i_mutex on the inode in question. Set it (with dont_mount(dentry))
in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
in namespace.c that used to check for S_DEAD. Setting S_DEAD
is still needed in places where we used to set it (for directories
getting killed), since we rely on it for readdir/rmdir race
prevention.
2) rename()/mount() protection has another bogosity - we unhash
the target before we'd checked that it's not a mountpoint. Fixed.
3) ancient bogosity in pivot_root() - we locked i_mutex on the
right directory, but checked S_DEAD on the different (and wrong)
one. Noticed and fixed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The /proc/fs/nfsd/versions file calls nfsd_vers() to check whether
the particular nfsd version is present/available. The problem is
that once I turn off e.g. NFSD-V4 this call returns -1 which is
true from the callers POV which is wrong.
The proposal is to report false in that case.
The bug has existed since 6658d3a7bb "[PATCH] knfsd: remove
nfsd_versbits as intermediate storage for desired versions".
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: stable@kernel.org
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
iput() can potentially attempt to allocate memory, so we should avoid
calling it in a memory shrinker. Instead, rely on the fact that iput() will
call nfs_access_zap_cache().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both iput() and put_rpccred() might allocate memory under certain
circumstances, so make sure that we don't recurse and deadlock...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We do not want to have the state recovery thread kick off and wait for a
memory reclaim, since that may deadlock when the writebacks end up
waiting for the state recovery thread to complete.
The safe thing is therefore to use GFP_NOFS in all open, close,
delegation return, lock, etc. operations that may be called by the
state recovery thread.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I'm about to change task->tk_start from a jiffies value to a ktime_t
value in order to make RPC RTT reporting more precise.
Recently (commit dc96aef9) nfs4_renew_done() started to reference
task->tk_start so that a jiffies value no longer had to be passed
from nfs4_proc_async_renew(). This allowed the calldata to point to
an nfs_client instead.
Changing task->tk_start to a ktime_t value makes it effectively
useless for renew timestamps, so we need to restore the pre-dc96aef9
logic that provided a jiffies "start" timestamp to nfs4_renew_done().
Both an nfs_client pointer and a timestamp need to be passed to
nfs4_renew_done(), so create a new nfs_renewdata structure that
contains both, resembling what is already done for delegreturn,
lock, and unlock.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up:
fs/nfs/iostat.h: In function ‘nfs_add_server_stats’:
fs/nfs/iostat.h:41: warning: comparison between signed and unsigned integer expressions
fs/nfs/iostat.h:41: warning: comparison between signed and unsigned integer expressions
fs/nfs/iostat.h:41: warning: comparison between signed and unsigned integer expressions
fs/nfs/iostat.h:41: warning: comparison between signed and unsigned integer expressions
Commit fce22848 replaced the open-coded per-cpu logic in several
functions in fs/nfs/iostat.h with a single invocation of
this_cpu_ptr(). This macro assumes its second argument is signed,
not unsigned.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: fscache_uniq takes a string, so it should be included
with the other string mount option definitions, by convention.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Seen with -Wextra:
/home/cel/linux/fs/nfs/fscache.c: In function ‘__nfs_readpages_from_fscache’:
/home/cel/linux/fs/nfs/fscache.c:479: warning: comparison between signed and unsigned integer expressions
The comparison implicitly converts "int" to "unsigned", making it
safe. But there's no need for the implicit type conversions here, and
the dfprintk() already uses a "%u" formatter for "npages." Better to
reduce confusion.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server has given us a delegation on a file, we _know_ that we can
cache the attribute information even when the user has specified 'noac'.
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Keep a global count of how many referrals that the current task has
traversed on a path lookup. Return ELOOP if the count exceeds
MAX_NESTED_LINKS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the O_EXCL open handling into _nfs4_do_open() where it belongs. Doing
so also allows us to reuse the struct fattr from the opendata.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All we really want is the ability to retrieve the root file handle. We no
longer need the ability to walk down the path, since that is now done in
nfs_follow_remote_path().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS Filehandles and struct fattr are really too large to be allocated on
the stack. This patch adds in a couple of helper functions to allocate them
dynamically instead.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
inotify: don't leak user struct on inotify release
inotify: race use after free/double free in inotify inode marks
inotify: clean up the inotify_add_watch out path
Inotify: undefined reference to `anon_inode_getfd'
Manual merge to remove duplicate "select ANON_INODES" from Kconfig file
inotify_new_group() receives a get_uid-ed user_struct and saves the
reference on group->inotify_data.user. The problem is that free_uid() is
never called on it.
Issue seem to be introduced by 63c882a0 (inotify: reimplement inotify
using fsnotify) after 2.6.30.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Eric Paris <eparis@parisplace.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
There is a race in the inotify add/rm watch code. A task can find and
remove a mark which doesn't have all of it's references. This can
result in a use after free/double free situation.
Task A Task B
------------ -----------
inotify_new_watch()
allocate a mark (refcnt == 1)
add it to the idr
inotify_rm_watch()
inotify_remove_from_idr()
fsnotify_put_mark()
refcnt hits 0, free
take reference because we are on idr
[at this point it is a use after free]
[time goes on]
refcnt may hit 0 again, double free
The fix is to take the reference BEFORE the object can be found in the
idr.
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: <stable@kernel.org>
inotify_add_watch explictly frees the unused inode mark, but it can just
use the generic code. Just do that.
Signed-off-by: Eric Paris <eparis@redhat.com>
This is a mandatory operation. Also, here (not in open) is where we
should be committing the reboot recovery information.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
nfsd4_set_callback_client must be called under the state lock to atomically
set or unset the callback client and shutting down the previous one.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Get a refcount on the client on SEQUENCE,
Release the refcount and renew the client when all respective compounds completed.
Do not expire the client by the laundromat while in use.
If the client was expired via another path, free it when the compounds
complete and the refcount reaches 0.
Note that unhash_client_locked must call list_del_init on cl_lru as
it may be called twice for the same client (once from nfs4_laundromat
and then from expire_client)
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Mark the client as expired under the client_lock so it won't be renewed
when an nfsv4.1 session is done, after it was explicitly expired
during processing of the compound.
Do not renew a client mark as expired (in particular, it is not
on the lru list anymore)
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Currently just initialize the cl_refcount to 1
and decrement in expire_client(), conditionally freeing the
client when the refcount reaches 0.
To be used later by nfsv4.1 compounds to keep the client from
timing out while in use.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
According to specification
mkdir d; ln -s d a; open("a/", O_NOFOLLOW | O_RDONLY)
should return success but currently it returns ELOOP. This is a
regression caused by path lookup cleanup patch series.
Fix the code to ignore O_NOFOLLOW in case the provided path has trailing
slashes.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: preserve seq # on requeued messages after transient transport errors
ceph: fix cap removal races
ceph: zero unused message header, footer fields
ceph: fix locking for waking session requests after reconnect
ceph: resubmit requests on pg mapping change (not just primary change)
ceph: fix open file counting on snapped inodes when mds returns no caps
ceph: unregister osd request on failure
ceph: don't use writeback_control in writepages completion
ceph: unregister bdi before kill_anon_super releases device name
cachefiles_determine_cache_security() is expected to return with a
security override in place. However, if set_create_files_as() fails, we
fail to do this. In this case, we should just reinstate the security
override that was set by the caller.
Furthermore, if set_create_files_as() fails, we should dispose of the
new credentials we were in the process of creating.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix:
fs/built-in.o: In function `sys_inotify_init1':
summary.c:(.text+0x347a4): undefined reference to `anon_inode_getfd'
found by kautobuild with arms bcmring_defconfig, which ends up with
INOTIFY_USER enabled (through the 'default y') but leaves ANON_INODES
unset. However, inotify_user.c uses anon_inode_getfd().
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Eric Paris <eparis@redhat.com>
If the tcp connection drops and we reconnect to reestablish a stateful
session (with the mds), we need to resend previously sent (and possibly
received) messages with the _same_ seq # so that they can be dropped on
the other end if needed. Only assign a new seq once after the message is
queued.
Signed-off-by: Sage Weil <sage@newdream.net>
The iterate_session_caps helper traverses the session caps list and tries
to grab an inode reference. However, the __ceph_remove_cap was clearing
the inode backpointer _before_ removing itself from the session list,
causing a null pointer dereference.
Clear cap->ci under protection of s_cap_lock to avoid the race, and to
tightly couple the list and backpointer state. Use a local flag to
indicate whether we are releasing the cap, as cap->session may be modified
by a racing thread in iterate_session_caps.
Signed-off-by: Sage Weil <sage@newdream.net>
CIFS has stubs for XFS-style quotas without an actual implementation backing
them, hidden behind a config option not visible in Kconfig. Remove these
stubs for now as the quota operations will see some major changes and this
code simply gets in the way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Separate out unhashing of the client and session.
To be used later by the laundromat.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
To be used later on to hold a reference count on the client while in use by a
nfsv4.1 compound.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
and grab the client lock once for all the client's sessions.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
In preparation to share the lock's scope to both client
and session hash tables.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.
Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.
Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.
Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.
Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.
This patch nearly undoes the effect of all these patches.
The reason for reverting these is it provides an unusable value in
field 28. For x86_64, a fork will result in the task->stack_start
value being updated to the current user top of stack and not the stack
start address. This unpredictability of the stack_start value makes
it worthless. That includes the intended use of showing how much stack
space a thread has.
Other architectures will get different values. As an example, ia64
gets 0. The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.
I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") . If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured. Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.
I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") . I left the KSTK_ESP() change in
place as that seemed worthwhile.
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We shouldn't leak any prior memory contents to other parties. And random
data, particularly in the 'version' field, can cause problems down the
line.
Signed-off-by: Sage Weil <sage@newdream.net>
When we made serverino the default, we trusted that the field sent by the
server in the "uniqueid" field was actually unique. It turns out that it
isn't reliably so.
Samba, in particular, will just put the st_ino in the uniqueid field when
unix extensions are enabled. When a share spans multiple filesystems, it's
quite possible that there will be collisions. This is a server bug, but
when the inodes in question are a directory (as is often the case) and
there is a collision with the root inode of the mount, the result is a
kernel panic on umount.
Fix this by checking explicitly for directory inodes with the same
uniqueid. If that is the case, then we can assume that using server inode
numbers will be a problem and that they should be disabled.
Fixes Samba bugzilla 7407
Signed-off-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fix an occasional EIO returned by a call to vfs_unlink():
[ 4868.465413] CacheFiles: I/O Error: Unlink failed
[ 4868.465444] FS-Cache: Cache cachefiles stopped due to I/O error
[ 4947.320011] CacheFiles: File cache on md3 unregistering
[ 4947.320041] FS-Cache: Withdrawing cache "mycache"
[ 5127.348683] FS-Cache: Cache "mycache" added (type cachefiles)
[ 5127.348716] CacheFiles: File cache on md3 registered
[ 7076.871081] CacheFiles: I/O Error: Unlink failed
[ 7076.871130] FS-Cache: Cache cachefiles stopped due to I/O error
[ 7116.780891] CacheFiles: File cache on md3 unregistering
[ 7116.780937] FS-Cache: Withdrawing cache "mycache"
[ 7296.813394] FS-Cache: Cache "mycache" added (type cachefiles)
[ 7296.813432] CacheFiles: File cache on md3 registered
What happens is this:
(1) A cached NFS file is seen to have become out of date, so NFS retires the
object and immediately acquires a new object with the same key.
(2) Retirement of the old object is done asynchronously - so the lookup/create
to generate the new object may be done first.
This can be a problem as the old object and the new object must exist at
the same point in the backing filesystem (i.e. they must have the same
pathname).
(3) The lookup for the new object sees that a backing file already exists,
checks to see whether it is valid and sees that it isn't. It then deletes
that file and creates a new one on disk.
(4) The retirement phase for the old file is then performed. It tries to
delete the dentry it has, but ext4_unlink() returns -EIO because the inode
attached to that dentry no longer matches the inode number associated with
the filename in the parent directory.
The trace below shows this quite well.
[md5sum] ==> __fscache_relinquish_cookie(ffff88002d12fb58{NFS.fh,ffff88002ce62100},1)
[md5sum] ==> __fscache_acquire_cookie({NFS.server},{NFS.fh},ffff88002ce62100)
NFS has retired the old cookie and asked for a new one.
[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_ACTIVE,24})
[kslowd] <== fscache_object_state_machine() [->OBJECT_DYING]
[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_INIT,0})
[kslowd] <== fscache_object_state_machine() [->OBJECT_LOOKING_UP]
[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_DYING,24})
[kslowd] <== fscache_object_state_machine() [->OBJECT_RECYCLING]
The old object (OBJ52) is going through the terminal states to get rid of it,
whilst the new object - (OBJ53) - is coming into being.
[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_LOOKING_UP,0})
[kslowd] ==> cachefiles_walk_to_object({ffff88003029d8b8},OBJ53,@68,)
[kslowd] lookup '@68'
[kslowd] next -> ffff88002ce41bd0 positive
[kslowd] advance
[kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
[kslowd] next -> ffff8800369faac8 positive
The new object has looked up the subdir in which the file would be in (getting
dentry ffff88002ce41bd0) and then looked up the file itself (getting dentry
ffff8800369faac8).
[kslowd] validate 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
[kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA')
[kslowd] remove ffff8800369faac8 from ffff88002ce41bd0
[kslowd] unlink stale object
[kslowd] <== cachefiles_bury_object() = 0
It then checks the file's xattrs to see if it's valid. NFS says that the
auxiliary data indicate the file is out of date (obvious to us - that's why NFS
ditched the old version and got a new one). CacheFiles then deletes the old
file (dentry ffff8800369faac8).
[kslowd] redo lookup
[kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
[kslowd] next -> ffff88002cd94288 negative
[kslowd] create -> ffff88002cd94288{ffff88002cdaf238{ino=148247}}
CacheFiles then redoes the lookup and gets a negative result in a new dentry
(ffff88002cd94288) which it then creates a file for.
[kslowd] ==> cachefiles_mark_object_active(,OBJ53)
[kslowd] <== cachefiles_mark_object_active() = 0
[kslowd] === OBTAINED_OBJECT ===
[kslowd] <== cachefiles_walk_to_object() = 0 [148247]
[kslowd] <== fscache_object_state_machine() [->OBJECT_AVAILABLE]
The new object is then marked active and the state machine moves to the
available state - at which point NFS can start filling the object.
[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_RECYCLING,20})
[kslowd] ==> fscache_release_object()
[kslowd] ==> cachefiles_drop_object({OBJ52,2})
[kslowd] ==> cachefiles_delete_object(,OBJ52{ffff8800369faac8})
The old object, meanwhile, goes on with being retired. If allocation occurs
first, cachefiles_delete_object() has to wait for dir->d_inode->i_mutex to
become available before it can continue.
[kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA')
[kslowd] remove ffff8800369faac8 from ffff88002ce41bd0
[kslowd] unlink stale object
EXT4-fs warning (device sda6): ext4_unlink: Inode number mismatch in unlink (148247!=148193)
CacheFiles: I/O Error: Unlink failed
FS-Cache: Cache cachefiles stopped due to I/O error
CacheFiles then tries to delete the file for the old object, but the dentry it
has (ffff8800369faac8) no longer points to a valid inode for that directory
entry, and so ext4_unlink() returns -EIO when de->inode does not match i_ino.
[kslowd] <== cachefiles_bury_object() = -5
[kslowd] <== cachefiles_delete_object() = -5
[kslowd] <== fscache_object_state_machine() [->OBJECT_DEAD]
[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_AVAILABLE,0})
[kslowd] <== fscache_object_state_machine() [->OBJECT_ACTIVE]
(Note that the above trace includes extra information beyond that produced by
the upstream code).
The fix is to note when an object that is being retired has had its object
deleted preemptively by a replacement object that is being created, and to
skip the second removal attempt in such a case.
Reported-by: Greg M <gregm@servu.net.au>
Reported-by: Mark Moseley <moseleymark@gmail.com>
Reported-by: Romain DEGEZ <romain.degez@smartjog.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The session->s_waiting list is protected by mdsc->mutex, not s_mutex. This
was causing (rare) s_waiting list corruption.
Fix errors paths too, while we're here. A more thorough cleanup of this
function is coming soon.
Signed-off-by: Sage Weil <sage@newdream.net>
OSD requests need to be resubmitted on any pg mapping change, not just when
the pg primary changes. Resending only when the primary changes results in
occasional 'hung' requests during osd cluster recovery or rebalancing.
Signed-off-by: Sage Weil <sage@newdream.net>
It's possible the MDS will not issue caps on a snapped inode, in which case
an open request may not __ceph_get_fmode(), botching the open file
counting. (This is actually a server bug, but the client shouldn't BUG out
in this case.)
Signed-off-by: Sage Weil <sage@newdream.net>
The osd request wasn't being unregistered when the osd returned a failure
code, even though the result was returned to the caller. This would cause
it to eventually time out, and then crash the kernel when it tried to
resend the request using a stale page vector.
Signed-off-by: Sage Weil <sage@newdream.net>
epoll should not touch flags in wait_queue_t. This patch introduces a new
function __add_wait_queue_exclusive(), for the users, who use wait queue as a
LIFO queue.
__add_wait_queue_tail_exclusive() is introduced too instead of
add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as
it is a duplicate of __remove_wait_queue(), disliked by users, and with less
users.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <containers@lists.linux-foundation.org>
LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
..otherwise memory allocation errors go undetected.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It will be used in suspend code and serves as an easy wrap around
copy_from_user. Similar to simple_read_from_buffer, it takes care
of transfers with proper lengths depending on available and count
parameters and advances ppos appropriately.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Once file or link creation gets going, it can't be interrupted by a
signal. They're not idempotent.
This blocks signals in ocfs2_mknod(), ocfs2_link(), and ocfs2_symlink()
once we start actually changing things. ocfs2_mknod() covers mknod(),
creat(), mkdir(), and open(O_CREAT).
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2 sometimes needs to block signals around dlm operations, but it
currently does it with sigprocmask(). Even worse, it's checking the
error code of sigprocmask(). The in-kernel sigprocmask() can only error
if you get the SIG_* argument wrong. We don't.
Wrap the sigprocmask() calls with ocfs2_[un]block_signals(). These
functions are void, but they will BUG() if somehow sigprocmask() returns
an error.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The comments make it clear the otherwise subtle behavior of cifs_new_fileinfo().
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
--
fs/cifs/dir.c | 18 ++++++++++++++++--
1 files changed, 16 insertions(+), 2 deletions(-)
Signed-off-by: Steve French <sfrench@us.ibm.com>
After commit 1f36f774b2 ("Switch !O_CREAT case to use of do_last()") in
2.6.34-rc1 autofs direct mounts stopped working. This is caused by
current->link_count being 0 when ->follow_link() is called from
do_filp_open().
I can't work out why this hasn't been seen before Als patch series.
This patch removes the autofs dependence on current->link_count.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
..a left over from the commit 3321b791b2.
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It's legal to send a DESTROY_SESSION outside any session (as the only
operation in a compound), in which case cstate->session will be NULL;
check for that case.
While we're at it, move these checks into a separate helper function.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix RCU issues in the NFSv4 delegation code
NFSv4: Fix the locking in nfs_inode_reclaim_delegation()
The write buffer may not have been written and may no longer be written
due to an interrupted write in the affected page.
Signed-off-by: Joern Engel <joern@logfs.org>
The get_mtd_device() function returns error pointers on failure and if we
don't handle it, it leads to a crash.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joern Engel <joern@logfs.org>
The ->writepages writeback_control is not still valid in the writepages
completion. We were touching it solely to adjust pages_skipped when there
was a writeback error (EIO, ENOSPC, EPERM due to bad osd credentials),
causing an oops in the writeback code shortly thereafter. Updating
pages_skipped on error isn't correct anyway, so let's just rip out this
(clearly broken) code to pass the wbc to the completion.
Signed-off-by: Sage Weil <sage@newdream.net>
Lockres hash size of 16KB is far too small for large filesystems (where we
have hundreds of thousands of lock resources stored in the table).
This patch increases it to 128KB.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2, we use ocfs2_extend_trans() to extend a journal handle's
blocks. But if jbd2_journal_extend() fails, it will only restart
with the the new number of blocks. This tends to be awkward since
in most cases we want additional reserved blocks. It makes our code
harder to mantain since the caller can't be sure all the original
blocks will not be accessed and dirtied again. There are 15 callers
of ocfs2_extend_trans() in fs/ocfs2, and 12 of them have to add
h_buffer_credits before they call ocfs2_extend_trans(). This makes
ocfs2_extend_trans() really extend atop the original block count.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Two tiny cleanup for allocation reservation.
1. Remove some extra codes in ocfs2_local_alloc_find_clear_bits.
2. Remove an unuseful variables in ocfs2_find_resv_lhs.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When we allocate some bits from the reservation, we always
allocate from the r_start(see ocfs2_resmap_resv_bits).
So there should be no reason to check between r_start
and start. And I don't think we will change this behaviour
later by allocating from some bits after r_start. Why not make
ocfs2_adjust_resv_from_alloc simple for now?
The only chance we have to adjust the reservation is when we haven't
reached the end. With this patch, the function is more readable.
Note:
btw, this patch also fixes an original bug in the function
which I haven't found before.
if (end < ocfs2_resv_end(resv))
rhs = end - ocfs2_resv_end(resv);
This code is of course buggy. ;)
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
OCFS2 has never really supported intr. This patch acknowledges this reality
and makes nointr the default mount option. In a later patch, we intend to
support intr.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
o2dlm join and leave messages are more than informational as they are
required for debugging locking issues. This patch changes them from
KERN_INFO to KERN_NOTICE.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch logs socket state changes that lead to socket shutdown.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Print the node number of a peer node if sending it a message failed.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The default behavior for directory reservations stays the same, but we add a
mount option so people can tweak the size of directory reservations
according to their workloads.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The default reservation size of 4 (32-bit windows) is a bit too ambitious.
Scale it back to 16 bits (resv_level=2). I have been testing various sizes
on a 4-node cluster which runs a mixed workload that is heavily threaded.
With a 256MB local alloc, I get *roughly* the following levels of average file
fragmentation:
resv_level=0 70%
resv_level=1 21%
resv_level=2 23%
resv_level=3 24%
resv_level=4 60%
resv_level=5 did not test
resv_level=6 60%
resv_level=2 seemed like a good compromise between not letting windows be
too small, but not so big that heavier workloads will immediately suffer
without tuning.
This patch also change the behavior of directory reservations - they now
track file reservations. The previous compromise of giving directory
windows only 8 bits wound up fragmenting more at some window sizes because
file allocations had smaller unused windows to poach from.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
I have observed that the current size of 8M gives us pretty poor
fragmentation on multi-threaded workloads which do lots of writes.
Generally, I can increase the size of local alloc windows and observe a
marked decrease in fragmentation, even up and beyond window sizes of 512
megabytes. This makes sense for a couple reasons - larger local alloc means
more room for reservation windows. On multi-node workloads the larger local
alloc helps as well because we don't have to do window slides as often.
Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
longer used and the comment above it was out of date.
To test fragmentation, I used a workload which launched 4 threads that did
4k writes into a series of about 140 alternating files.
With resv_level=2, and a 4k/4k file system I observed the following average
fragmentation for various localalloc= parameters:
localalloc= avg. fragmentation
8 48
32 16
64 10
120 7
On larger cluster sizes, the difference is more dramatic.
The new default size top out at 256M, which we'll only get for cluster
sizes of 32K and above.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch pulls the local alloc sizing code into localalloc.c and provides
a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
except that I correctly calculate the maximum local alloc size. The old code
in ocfs2_parse_options() calculated the max size as:
ocfs2_local_alloc_size(sb) * 8
which is correct, in bits. Unfortunately though the option passed in is in
megabytes. Ultimately, this bug made no real difference - the shrink code
would catch a too-large size and bring it down to something reasonable.
Still, it's less than efficient as-is.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Inodes are always allocated from the global bitmap now so we don't need this
any more. Also, the existing implementation bounces reservations around
needlessly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Otherwise, the need for a very large contiguous allocation tends to
wreak havoc on many inode allocation reservations on the local alloc, thus
ruining any chances for contiguousness.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Use the reservations system for unindexed dir tree allocations. We don't
bother with the indexed tree as reads from it are mostly random anyway.
Directory reservations are marked seperately, to allow the reservations code
a chance to optimize their window sizes. This patch allocates only 8 bits
for directory windows as they generally are not expected to grow as quickly
as file data. Future improvements to dir window sizing can trivially be
made.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.
Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inodes window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
jbd[2]_journal_dirty_metadata() only returns 0. It's been returning 0
since before the kernel moved to git. There is no point in checking
this error.
ocfs2_journal_dirty() has been faithfully returning the status since the
beginning. All over ocfs2, we have blocks of code checking this can't
fail status. In the past few years, we've tried to avoid adding these
checks, because they are pointless. But anyone who looks at our code
assumes they are needed.
Finally, ocfs2_journal_dirty() is made a void function. All error
checking is removed from other files. We'll BUG_ON() the status of
jbd2_journal_dirty_metadata() just in case they change it someday. They
won't.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
...rather than the secType. This allows us to get rid of the MSKerberos
securityEnum. The client just makes a decision at upcall time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
So that we can reasonably set up the secType based on both the
NegotiateProtocol response and the parsed mount options.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When CONFIG_BLOCK is not enabled:
fs/logfs/super.c:142: error: implicit declaration of function 'bdev_get_queue'
fs/logfs/super.c:142: error: invalid type argument of '->' (have 'int')
Found by Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Joern Engel <joern@logfs.org>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Avoid a gcc warning in ocfs2_wipe_inode().
ocfs2: Avoid direct write if we fall back to buffered I/O
ocfs2_dlmfs: Fix math error when reading LVB.
ocfs2: Update VFS inode's id info after reflink.
ocfs2: potential ERR_PTR dereference on error paths
ocfs2: Add directory entry later in ocfs2_symlink() and ocfs2_mknod()
ocfs2: use OCFS2_INODE_SKIP_ORPHAN_DIR in ocfs2_mknod error path
ocfs2: use OCFS2_INODE_SKIP_ORPHAN_DIR in ocfs2_symlink error path
ocfs2: add OCFS2_INODE_SKIP_ORPHAN_DIR flag and honor it in the inode wipe code
ocfs2: Reset status if we want to restart file extension.
ocfs2: Compute metaecc for superblocks during online resize.
ocfs2: Check the owner of a lockres inside the spinlock
ocfs2: one more warning fix in ocfs2_file_aio_write(), v2
ocfs2_dlmfs: User DLM_* when decoding file open flags.
Unregister and destroy the bdi in put_super, after mount is r/o, but before
put_anon_super releases the device name.
For symmetry, bdi_destroy in destroy_client (we bdi_init in create_client).
Only set s_bdi if bdi_register succeeds, since we use it to decide whether
to bdi_unregister.
Signed-off-by: Sage Weil <sage@newdream.net>
li_refcount was not re-initialized in function logfs_init_inode(), small
patch that will fix the problem
Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Signed-off-by: Joern Engel <joern@logfs.org>
gcc warns that a variable is uninitialized. It's actually handled, but
an early return fools gcc. Let's just initialize the variable to a
garbage value that will crash if the usage is ever broken.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: remove bad auth_x kmem_cache
ceph: fix lockless caps check
ceph: clear dir complete, invalidate dentry on replayed rename
ceph: fix direct io truncate offset
ceph: discard incoming messages with bad seq #
ceph: fix seq counting for skipped messages
ceph: add missing #includes
ceph: fix leaked spinlock during mds reconnect
ceph: print more useful version info on module load
ceph: fix snap realm splits
ceph: clear dir complete on d_move
It's useless, since our allocations are already a power of 2. And it was
allocated per-instance (not globally), which caused a name collision when
we tried to mount a second file system with auth_x enabled.
Signed-off-by: Sage Weil <sage@newdream.net>
If a rename operation is resent to the MDS following an MDS restart, the
client does not get a full reply (containing the resulting metadata) back.
In that case, a ceph_rename() needs to compensate by doing anything useful
that fill_inode() would have, like d_move().
It also needs to invalidate the dentry (to workaround the vfs_rename_dir()
bug) and clear the dir complete flag, just like fill_trace().
Signed-off-by: Sage Weil <sage@newdream.net>
We can get old message seq #'s after a tcp reconnect for stateful sessions
(i.e., the MDS). If we get a higher seq #, that is an error, and we
shouldn't see any bad seq #'s for stateless (mon, osd) connections.
Signed-off-by: Sage Weil <sage@newdream.net>
The snap realm split was checking i_snap_realm, not the list_head, to
determine if an inode belonged in the new realm. The check always failed,
which meant we always moved the inode, corrupting the old realm's list and
causing various crashes.
Also wait to release old realm reference to avoid possibility of use after
free.
Signed-off-by: Sage Weil <sage@newdream.net>
d_move() reorders the d_subdirs list, breaking the readdir result caching.
Unless/until d_move preserves that ordering, clear CEPH_I_COMPLETE on
rename.
Signed-off-by: Sage Weil <sage@newdream.net>
As of 32a88aa1, __sync_filesystem() will return 0 if s_bdi is not set.
And nilfs does not set s_bdi anywhere. I noticed this problem by the
warning introduced by the recent commit 5129a469 ("Catch filesystem
lacking s_bdi").
WARNING: at fs/super.c:959 vfs_kern_mount+0xc5/0x14e()
Hardware name: PowerEdge 2850
Modules linked in: nilfs2 loop tpm_tis tpm tpm_bios video shpchp pci_hotplug output dcdbas
Pid: 3773, comm: mount.nilfs2 Not tainted 2.6.34-rc6-debug #38
Call Trace:
[<c1028422>] warn_slowpath_common+0x60/0x90
[<c102845f>] warn_slowpath_null+0xd/0x10
[<c1095936>] vfs_kern_mount+0xc5/0x14e
[<c1095a03>] do_kern_mount+0x32/0xbd
[<c10a811e>] do_mount+0x671/0x6d0
[<c1073794>] ? __get_free_pages+0x1f/0x21
[<c10a684f>] ? copy_mount_options+0x2b/0xe2
[<c107b634>] ? strndup_user+0x48/0x67
[<c10a81de>] sys_mount+0x61/0x8f
[<c100280c>] sysenter_do_call+0x12/0x32
This ensures to set s_bdi for nilfs and fixes the sync silent failure.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the replay case, the
renew_client(session->se_client);
happens after we've droppped the sessionid_lock, and without holding a
reference on the session; so there's nothing preventing the session
being freed before we get here.
Thanks to Benny Halevy for catching a bug in an earlier version of this
patch.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Acked-by: Benny Halevy <bhalevy@panasas.com>
Fix a number of RCU issues in the NFSv4 delegation code.
(1) delegation->cred doesn't need to be RCU protected as it's essentially an
invariant refcounted structure.
By the time we get to nfs_free_delegation(), the delegation is being
released, so no one else should be attempting to use the saved
credentials, and they can be cleared.
However, since the list of delegations could still be under traversal at
this point by such as nfs_client_return_marked_delegations(), the cred
should be released in nfs_do_free_delegation() rather than in
nfs_free_delegation(). Simply using rcu_assign_pointer() to clear it is
insufficient as that doesn't stop the cred from being destroyed, and nor
does calling put_rpccred() after call_rcu(), given that the latter is
asynchronous.
(2) nfs_detach_delegation_locked() and nfs_inode_set_delegation() should use
rcu_derefence_protected() because they can only be called if
nfs_client::cl_lock is held, and that guards against anyone changing
nfsi->delegation under it. Furthermore, the barrier imposed by
rcu_dereference() is superfluous, given that the spin_lock() is also a
barrier.
(3) nfs_detach_delegation_locked() is now passed a pointer to the nfs_client
struct so that it can issue lockdep advice based on clp->cl_lock for (2).
(4) nfs_inode_return_delegation_noreclaim() and nfs_inode_return_delegation()
should use rcu_access_pointer() outside the spinlocked region as they
merely examine the pointer and don't follow it, thus rendering unnecessary
the need to impose a partial ordering over the one item of interest.
These result in an RCU warning like the following:
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
fs/nfs/delegation.c:332 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by mount.nfs4/2281:
#0: (&type->s_umount_key#34){+.+...}, at: [<ffffffff810b25b4>] deactivate_super+0x60/0x80
#1: (iprune_sem){+.+...}, at: [<ffffffff810c332a>] invalidate_inodes+0x39/0x13a
stack backtrace:
Pid: 2281, comm: mount.nfs4 Not tainted 2.6.34-rc1-cachefs #110
Call Trace:
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b4591>] nfs_inode_return_delegation_noreclaim+0x5b/0xa0 [nfs]
[<ffffffffa0095d63>] nfs4_clear_inode+0x11/0x1e [nfs]
[<ffffffff810c2d92>] clear_inode+0x9e/0xf8
[<ffffffff810c3028>] dispose_list+0x67/0x10e
[<ffffffff810c340d>] invalidate_inodes+0x11c/0x13a
[<ffffffff810b1dc1>] generic_shutdown_super+0x42/0xf4
[<ffffffff810b1ebe>] kill_anon_super+0x11/0x4f
[<ffffffffa009893c>] nfs4_kill_super+0x3f/0x72 [nfs]
[<ffffffff810b25bc>] deactivate_super+0x68/0x80
[<ffffffff810c6744>] mntput_no_expire+0xbb/0xf8
[<ffffffff810c681b>] release_mounts+0x9a/0xb0
[<ffffffff810c689b>] put_mnt_ns+0x6a/0x79
[<ffffffffa00983a1>] nfs_follow_remote_path+0x5a/0x146 [nfs]
[<ffffffffa0098334>] ? nfs_do_root_mount+0x82/0x95 [nfs]
[<ffffffffa00985a9>] nfs4_try_mount+0x75/0xaf [nfs]
[<ffffffffa0098874>] nfs4_get_sb+0x291/0x31a [nfs]
[<ffffffff810b2059>] vfs_kern_mount+0xb8/0x177
[<ffffffff810b2176>] do_kern_mount+0x48/0xe8
[<ffffffff810c810b>] do_mount+0x782/0x7f9
[<ffffffff810c8205>] sys_mount+0x83/0xbe
[<ffffffff81001eeb>] system_call_fastpath+0x16/0x1b
Also on:
fs/nfs/delegation.c:215 invoked rcu_dereference_check() without protection!
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b4223>] nfs_inode_set_delegation+0xfe/0x219 [nfs]
[<ffffffffa00a9c6f>] nfs4_opendata_to_nfs4_state+0x2c2/0x30d [nfs]
[<ffffffffa00aa15d>] nfs4_do_open+0x2a6/0x3a6 [nfs]
...
And:
fs/nfs/delegation.c:40 invoked rcu_dereference_check() without protection!
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b3bef>] nfs_free_delegation+0x3d/0x6e [nfs]
[<ffffffffa00b3e71>] nfs_do_return_delegation+0x26/0x30 [nfs]
[<ffffffffa00b406a>] __nfs_inode_return_delegation+0x1ef/0x1fe [nfs]
[<ffffffffa00b448a>] nfs_client_return_marked_delegations+0xc9/0x124 [nfs]
...
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we correctly rcu-dereference the delegation itself, and that we
protect against removal while we're changing the contents.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
logfs_seek_hole() may return the same offset it is passed as argument.
Found by Prasad Joshi <prasadjoshi124@gmail.com>
Signed-off-by: Joern Engel <joern@logfs.org>
when we fall back to buffered write from direct write, we call
__generic_file_aio_write() but that will end up doing direct write
even we are only prepared to do buffered write because the file
has the O_DIRECT flag set. This is a fix for
https://bugzilla.novell.com/show_bug.cgi?id=591039
revised with Joel's comments.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Commit 7fe2b3190b fixed possible
misordering of completion asts (casts) and blocking asts (basts)
for kernel locks. This patch does the same for locks taken by
user space applications.
Signed-off-by: David Teigland <teigland@redhat.com>