The xlog_grant_push_ail() currently takes the grant lock internally to sample
the tail lsn, last sync lsn and the reserve grant head. Most of the callers
already hold the grant lock but have to drop it before calling
xlog_grant_push_ail(). This is a left over from when the AIL tail pushing was
done in line and hence xlog_grant_push_ail had to drop the grant lock. AIL push
is now done in another thread and hence we can safely hold the grant lock over
the entire xlog_grant_push_ail call.
Push the grant lock outside of xlog_grant_push_ail() to simplify the locking
and synchronisation needed for tail pushing. This will reduce traffic on the
grant lock by itself, but this is only one step in preparing for the complete
removal of the grant lock.
While there, clean up the formatting of xlog_grant_push_ail() to match the
rest of the XFS code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The log grant queues are one of the few places left using sv_t
constructs for waiting. Given we are touching this code, we should
convert them to plain wait queues. While there, convert all the
other sv_t users in the log code as well.
Seeing as this removes the last users of the sv_t type, remove the
header file defining the wrapper and the fragments that still
reference it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Prepare for switching the grant heads to atomic variables by
combining the two 32 bit values that make up the grant head into a
single 64 bit variable. Provide wrapper functions to combine and
split the grant heads appropriately for calculations and use them as
necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The log grant space calculations are repeated for both write and
reserve grant heads. To make it simpler to convert the calculations
toa different algorithm, factor them so both the gratn heads use the
same calculation functions. Once this is done we can drop the
wrappers that are used in only a couple of place to update both
grant heads at once as they don't provide any particular value.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Factor repeated debug code out of grant head manipulation functions into a
separate function. This removes ifdef DEBUG spagetti from the code and makes
the code easier to follow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The grant write and reserve queues use a roll-your-own double linked
list, so convert it to a standard list_head structure and convert
all the list traversals to use list_for_each_entry(). We can also
get rid of the XLOG_TIC_IN_Q flag as we can use the list_empty()
check to tell if the ticket is in a list or not.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We now have two copies of AIL delete operations that are mostly
duplicate functionality. The single log item deletes can be
implemented via the bulk updates by turning xfs_trans_ail_delete()
into a simple wrapper. This removes all the duplicate delete
functionality and associated helpers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We now have two copies of AIL insert operations that are mostly
duplicate functionality. The single log item updates can be
implemented via the bulk updates by turning xfs_trans_ail_update()
into a simple wrapper. This removes all the duplicate insert
functionality and associated helpers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When inode buffer IO completes, usually all of the inodes are removed from the
AIL. This involves processing them one at a time and taking the AIL lock once
for every inode. When all CPUs are processing inode IO completions, this causes
excessive amount sof contention on the AIL lock.
Instead, change the way we process inode IO completion in the buffer
IO done callback. Allow the inode IO done callback to walk the list
of IO done callbacks and pull all the inodes off the buffer in one
go and then process them as a batch.
Once all the inodes for removal are collected, take the AIL lock
once and do a bulk removal operation to minimise traffic on the AIL
lock.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
To allow buffer iodone callbacks to consume multiple items off the
callback list, first we need to convert the xfs_buf_do_callbacks()
to consume items and always pull the next item from the head of the
list.
The means the item list walk is never dependent on knowing the
next item on the list and hence allows callbacks to remove items
from the list as well. This allows callbacks to do bulk operations
by scanning the list for identical callbacks, consuming them all
and then processing them in bulk, negating the need for multiple
callbacks of that type.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The xfaild often tries to rest to wait for congestion to pass of for
IO to complete, but is regularly woken in tail-pushing situations.
In severe cases, the xfsaild is getting woken tens of thousands of
times a second. Reduce the number needless wakeups by only waking
the xfsaild if the new target is larger than the old one. Further
make short sleeps uninterruptible as they occur when the xfsaild has
decided it needs to back off to allow some IO to complete and being
woken early is counter-productive.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When inserting items into the AIL from the transaction committed
callbacks, we take the AIL lock for every single item that is to be
inserted. For a CIL checkpoint commit, this can be tens of thousands
of individual inserts, yet almost all of the items will be inserted
at the same point in the AIL because they have the same index.
To reduce the overhead and contention on the AIL lock for such
operations, introduce a "bulk insert" operation which allows a list
of log items with the same LSN to be inserted in a single operation
via a list splice. To do this, we need to pre-sort the log items
being committed into a temporary list for insertion.
The complexity is that not every log item will end up with the same
LSN, and not every item is actually inserted into the AIL. Items
that don't match the commit LSN will be inserted and unpinned as per
the current one-at-a-time method (relatively rare), while items that
are not to be inserted will be unpinned and freed immediately. Items
that are to be inserted at the given commit lsn are placed in a
temporary array and inserted into the AIL in bulk each time the
array fills up.
As a result of this, we trade off AIL hold time for a significant
reduction in traffic. lock_stat output shows that the worst case
hold time is unchanged, but contention from AIL inserts drops by an
order of magnitude and the number of lock traversal decreases
significantly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_ail_delete() has a needlessly complex interface. It returns the log item
that was passed in for deletion (which the callers then assert is identical to
the one passed in), and callers of xfs_ail_delete() still need to invalidate
current traversal cursors.
Make xfs_ail_delete() return void, move the cursor invalidation inside it, and
clean up the callers just to use the log item pointer they passed in.
While cleaning up, remove the messy and unnecessary "/* ARGUSED */" comments
around all these functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
EFI/EFD interactions are protected from races by the AIL lock. They
are the only type of log items that require the the AIL lock to
serialise internal state, so they need to be separated from the AIL
lock before we can do bulk insert operations on the AIL.
To acheive this, convert the counter of the number of extents in the
EFI to an atomic so it can be safely manipulated by EFD processing
without locks. Also, convert the EFI state flag manipulations to use
atomic bit operations so no locks are needed to record state
changes. Finally, use the state bits to determine when it is safe to
free the EFI and clean up the code to do this neatly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
XFS_EFI_CANCELED has not been set in the code base since
xfs_efi_cancel() was removed back in 2006 by commit
065d312e15 ("[XFS] Remove unused
iop_abort log item operation), and even then xfs_efi_cancel() was
never called. I haven't tracked it back further than that (beyond
git history), but it indicates that the handling of EFIs in
cancelled transactions has been broken for a long time.
Basically, when we get an IOP_UNPIN(lip, 1); call from
xfs_trans_uncommit() (i.e. remove == 1), if we don't free the log
item descriptor we leak it. Fix the behviour to be correct and kill
the XFS_EFI_CANCELED flag.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
reiserfs_acl_chmod() can be called by reiserfs_set_attr() and then take
the reiserfs lock a second time. Thereafter it may call journal_begin()
that definitely requires the lock not to be nested in order to release
it before taking the journal mutex because the reiserfs lock depends on
the journal mutex already.
So, aviod nesting the lock in reiserfs_acl_chmod().
Reported-by: Pawel Zawora <pzawora@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Pawel Zawora <pzawora@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org> [2.6.32.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the attribute cache timeout for CIFS is hardcoded to 1 second. This
means that the client might have to issue a QPATHINFO/QFILEINFO call every 1
second to verify if something has changes, which seems too expensive. On the
other hand, if the timeout is hardcoded to a higher value, workloads that
expect strict cache coherency might see unexpected results.
Making attribute cache timeout as a tunable will allow us to make a tradeoff
between performance and cache metadata correctness depending on the
application/workload needs.
Add 'actimeo' tunable that can be used to tune the attribute cache timeout.
The default timeout is set to 1 second. Also, display actimeo option value in
/proc/mounts.
It appears to me that 'actimeo' and the proposed (but not yet merged)
'strictcache' option cannot coexist, so care must be taken that we reset the
other option if one of them is set.
Changes since last post:
- fix option parsing and handle possible values correcly
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: only run xfs_error_test if error injection is active
xfs: avoid moving stale inodes in the AIL
xfs: delayed alloc blocks beyond EOF are valid after writeback
xfs: push stale, pinned buffers on trylock failures
xfs: fix failed write truncation handling.
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix parsing of hostname in dfs referrals
cifs: display fsc in /proc/mounts
cifs: enable fscache iff fsc mount option is used explicitly
cifs: allow fsc mount option only if CONFIG_CIFS_FSCACHE is set
cifs: Handle extended attribute name cifs_acl to generate cifs acl blob (try #4)
cifs: Misc. cleanup in cifsacl handling [try #4]
cifs: trivial comment fix for cifs_invalidate_mapping
[CIFS] fs/cifs/Kconfig: CIFS depends on CRYPTO_HMAC
cifs: don't take extra tlink reference in initiate_cifs_search
cifs: Percolate error up to the caller during get/set acls [try #4]
cifs: fix another memleak, in cifs_root_iget
cifs: fix potential use-after-free in cifs_oplock_break_put
We need to ensure that the entries in the nfs_cache_array get cleared
when the page is removed from the page cache. To do so, we use the
freepage address_space operation.
Change nfs_readdir_clear_array to use kmap_atomic(), so that the
function can be safely called from all contexts.
Finally, modify the cache_page_release helper to call
nfs_readdir_clear_array directly, when dealing with an anonymous
page from 'uncached_readdir'.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that the buffer reclaim infrastructure can handle different reclaim
priorities for different types of buffers, reconnect the hooks in the
XFS code that has been sitting dormant since it was ported to Linux. This
should finally give use reclaim prioritisation that is on a par with the
functionality that Irix provided XFS 15 years ago.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Introduce a per-buftarg LRU for memory reclaim to operate on. This
is the last piece we need to put in place so that we can fully
control the buffer lifecycle. This allows XFS to be responsibile for
maintaining the working set of buffers under memory pressure instead
of relying on the VM reclaim not to take pages we need out from
underneath us.
The implementation introduces a b_lru_ref counter into the buffer.
This is currently set to 1 whenever the buffer is referenced and so is used to
determine if the buffer should be added to the LRU or not when freed.
Effectively it allows lazy LRU initialisation of the buffer so we do not need
to touch the LRU list and locks in xfs_buf_find().
Instead, when the buffer is being released and we drop the last
reference to it, we check the b_lru_ref count and if it is none zero
we re-add the buffer reference and add the inode to the LRU. The
b_lru_ref counter is decremented by the shrinker, and whenever the
shrinker comes across a buffer with a zero b_lru_ref counter, if
released the LRU reference on the buffer. In the absence of a lookup
race, this will result in the buffer being freed.
This counting mechanism is used instead of a reference flag so that
it is simple to re-introduce buffer-type specific reclaim reference
counts to prioritise reclaim more effectively. We still have all
those hooks in the XFS code, so this will provide the infrastructure
to re-implement that functionality.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fill in the local lock with response data if appropriate,
and don't call posix_lock_file when reading locks.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Previously the kernel client incorrectly assumed everything was a directory.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
last may be NULL, but we dereference it in the else branch without
checking. Normally it doesn't trigger because last == NULL when fpos == 2,
but it could happen on a newly opened dir if the user seeks forward.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Recent tests writing lots of small files showed the flusher thread
being CPU bound and taking a long time to do allocations on a debug
kernel. perf showed this as the prime reason:
samples pcnt function DSO
_______ _____ ___________________________ _________________
224648.00 36.8% xfs_error_test [kernel.kallsyms]
86045.00 14.1% xfs_btree_check_sblock [kernel.kallsyms]
39778.00 6.5% prandom32 [kernel.kallsyms]
37436.00 6.1% xfs_btree_increment [kernel.kallsyms]
29278.00 4.8% xfs_btree_get_rec [kernel.kallsyms]
27717.00 4.5% random32 [kernel.kallsyms]
Walking btree blocks during allocation checking them requires each
block (a cache hit, so no I/O) call xfs_error_test(), which then
does a random32() call as the first operation. IOWs, ~50% of the
CPU is being consumed just testing whether we need to inject an
error, even though error injection is not active.
Kill this overhead when error injection is not active by adding a
global counter of active error traps and only calling into
xfs_error_test when fault injection is active.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When an inode has been marked stale because the cluster is being
freed, we don't want to (re-)insert this inode into the AIL. There
is a race condition where the cluster buffer may be unpinned before
the inode is inserted into the AIL during transaction committed
processing. If the buffer is unpinned before the inode item has been
committed and inserted, then it is possible for the buffer to be
released and hence processthe stale inode callbacks before the inode
is inserted into the AIL.
In this case, we then insert a clean, stale inode into the AIL which
will never get removed by an IO completion. It will, however, get
reclaimed and that triggers an assert in xfs_inode_free()
complaining about freeing an inode still in the AIL.
This race can be avoided by not moving stale inodes forward in the AIL
during transaction commit completion processing. This closes the
race condition by ensuring we never insert clean stale inodes into
the AIL. It is safe to do this because a dirty stale inode, by
definition, must already be in the AIL.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There is an assumption in the parts of XFS that flushing a dirty
file will make all the delayed allocation blocks disappear from an
inode. That is, that after calling xfs_flush_pages() then
ip->i_delayed_blks will be zero.
This is an invalid assumption as we may have specualtive
preallocation beyond EOF and they are recorded in
ip->i_delayed_blks. A flush of the dirty pages of an inode will not
change the state of these blocks beyond EOF, so a non-zero
deeelalloc block count after a flush is valid.
The bmap code has an invalid ASSERT() that needs to be removed, and
the swapext code has a bug in that while it swaps the data forks
around, it fails to swap the i_delayed_blks counter associated with
the fork and hence can get the block accounting wrong.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
As reported by Nick Piggin, XFS is suffering from long pauses under
highly concurrent workloads when hosted on ramdisks. The problem is
that an inode buffer is stuck in the pinned state in memory and as a
result either the inode buffer or one of the inodes within the
buffer is stopping the tail of the log from being moved forward.
The system remains in this state until a periodic log force issued
by xfssyncd causes the buffer to be unpinned. The main problem is
that these are stale buffers, and are hence held locked until the
transaction/checkpoint that marked them state has been committed to
disk. When the filesystem gets into this state, only the xfssyncd
can cause the async transactions to be committed to disk and hence
unpin the inode buffer.
This problem was encountered when scaling the busy extent list, but
only the blocking lock interface was fixed to solve the problem.
Extend the same fix to the buffer trylock operations - if we fail to
lock a pinned, stale buffer, then force the log immediately so that
when the next attempt to lock it comes around, it will have been
unpinned.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Since the move to the new truncate sequence we call xfs_setattr to
truncate down excessively instanciated blocks. As shown by the testcase
in kernel.org BZ #22452 that doesn't work too well. Due to the confusion
of the internal inode size, and the VFS inode i_size it zeroes data that
it shouldn't.
But full blown truncate seems like overkill here. We only instanciate
delayed allocations in the write path, and given that we never released
the iolock we can't have converted them to real allocations yet either.
The only nasty case is pre-existing preallocation which we need to skip.
We already do this for page discard during writeback, so make the delayed
allocation block punching a generic function and call it from the failed
write path as well as xfs_aops_discard_page. The callers are
responsible for ensuring that partial blocks are not truncated away,
and that they hold the ilock.
Based on a fix originally from Christoph Hellwig. This version used
filesystem blocks as the range unit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We need to use the cookie from the previous array entry, not the
actual cookie that we are searching for (except for the case of
uncached_readdir).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note: this patch targets 2.6.37 and tries to be as simple as possible.
That is why it adds more copy-and-paste horror into fs/compat.c and
uglifies fs/exec.c, this will be cleanuped later.
compat_copy_strings() plays with bprm->vma/mm directly and thus has
two problems: it lacks the RLIMIT_STACK check and argv/envp memory
is not visible to oom killer.
Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
as do_execve() does.
Add the fatal_signal_pending/cond_resched checks into compat_count() and
compat_copy_strings(), this matches the code in fs/exec.c and certainly
makes sense.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c
execve()->copy_strings() can allocate a lot of memory, but
this is not visible to oom-killer, nobody can see the nascent
bprm->mm and take it into account.
With this patch get_arg_page() increments current's MM_ANONPAGES
counter every time we allocate the new page for argv/envp. When
do_execve() succeds or fails, we change this counter back.
Technically this is not 100% correct, we can't know if the new
page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but
I don't think this really matters and everything becomes correct
once exec changes ->mm or fails.
Reported-by: Brad Spengler <spender@grsecurity.net>
Reviewed-and-discussed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The DFS referral parsing code does a memchr() call to find the '\\'
delimiter that separates the hostname in the referral UNC from the
sharename. It then uses that value to set the length of the hostname via
pointer subtraction. Instead of subtracting the start of the hostname
however, it subtracts the start of the UNC, which causes the code to
pass in a hostname length that is 2 bytes too long.
Regression introduced in commit 1a4240f4.
Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Wang Lei <wang840925@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
When comparing filehandles in the helper nfs_same_file(), we should not be
using 'strncmp()': filehandles are not null terminated strings.
Instead, we should just use the existing helper nfs_compare_fh().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can only merge the fields into a bitfield if the locking
rules for them are the same. In this case gl_spin covers all
of the fields (write side) but a couple of them are used
with GLF_LOCK as the read side lock, which should be ok
since we know that the field in question won't be changing
at the time.
The gl_req setting has to be done earlier (in glock.c) in order
to place it under gl_spin. The gl_reply setting also has to be
brought under gl_spin in order to comply with the new rules.
This saves 4*sizeof(unsigned int) per glock.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
When you truncate the rindex file, you need to avoid calling gfs2_rindex_hold,
since you already hold it. However, if you haven't already read in the
resource groups, you need to do that.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Verify that the total length of the iovec returned in FUSE_IOCTL_RETRY
doesn't overflow iov_length().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org> [2.6.31+]
If a 32bit CUSE server is run on 64bit this results in EIO being
returned to the caller.
The reason is that FUSE_IOCTL_RETRY reply was defined to use 'struct
iovec', which is different on 32bit and 64bit archs.
Work around this by looking at the size of the reply to determine
which struct was used. This is only needed if CONFIG_COMPAT is
defined.
A more permanent fix for the interface will be to use the same struct
on both 32bit and 64bit.
Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org> [2.6.31+]
When GFS2 grew the filesystem, it was never rereading the rindex file during
the grow. This is necessary for large grows when the filesystem is almost full,
and GFS2 needs to use some of the space allocated earlier in the grow to
complete it. Now, if GFS2 fails to reserve the necessary space and the rindex
file is not uptodate, it rereads it. Also, the only difference between
gfs2_ri_update() and gfs2_ri_update_special() was that gfs2_ri_update_special()
didn't clear out the existing resource groups, since you knew that it was only
called when there were no resource groups. Attempting to clear out the
resource groups when there are none takes almost no time, and rarely happens,
so I simply removed gfs2_ri_update_special().
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are a number of duplicated #defines in glock.h
plus one which is unused. This removes the extra
definitions.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the preveious task group is dropped. Child
processes inherit this task group thereafter, and increase it's
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting it's nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow architectures to redefine this macro if needed. This is useful for
example in architectures where 64-bit ELF vmcores are not supported.
Specifying zero vmcore_elf64_check_arch() allows compiler to optimize
away unnecessary parts of parse_crash_elf64_headers().
We also rename the macro to vmcore_elf64_check_arch() to reflect that it
is used for 64-bit vmcores only.
Signed-off-by: Mika Westerberg <mika.westerberg@iki.fi>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
The DLM never returns -EAGAIN in response to dlm_lock(), and even
if it did, the test in gdlm_lock() was wrong anyway. Once that
test is removed, it is possible to greatly simplify this code
by simply using a "normal" error return code (0 for success).
We then no longer need the LM_OUT_ASYNC return code which can
be removed.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With this patch the gfs2_set_dqblk() function will be able to update the
quota usage block count (FS_DQ_BCOUNT) in addition to the already supported
FS_DQ_BHARD (limit) and FS_DQ_BSOFT (warn) fields of the dquot structure.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Functions that use printf formatting, especially
those that use %pV, should have their uses of
printf format and arguments checked by the compiler.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Using %pV reduces the number of printk calls and
eliminates any possible message interleaving from
other printk calls.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
While preparing the last patch I noticed that the gfs2_setattr_simple
code had been duplicated into two other places. This patch updates
those to call gfs2_setattr_simple rather than open coding it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The WQ_RESCUER flag should only be used internally to the
workqueue implementation.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Before we introduce per-buftarg LRU lists, split the shrinker
implementation into per-buftarg shrinker callbacks. At the moment
we wake all the xfsbufds to run the delayed write queues to free
the dirty buffers and make their pages available for reclaim.
However, with an LRU, we want to be able to free clean, unused
buffers as well, so we need to separate the xfsbufd from the
shrinker callbacks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
now that we are using RCU protection for the inode cache lookups,
the lock is only needed on the modification side. Hence it is not
necessary for the lock to be a rwlock as there are no read side
holders anymore. Convert it to a spin lock to reflect it's exclusive
nature.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
With delayed logging greatly increasing the sustained parallelism of inode
operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There is
also a lot more write lock acquistions than there are read locks (4:1 ratio)
so the read locking is not really buying us much in the way of parallelism.
To avoid the read vs write contention, change the cache to use RCU locking on
the read side. To avoid needing to RCU free every single inode, use the built
in slab RCU freeing mechanism. This requires us to be able to detect lookups of
freed inodes, so enѕure that ever freed inode has an inode number of zero and
the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
lookup path, but also add a check for a zero inode number as well.
We canthen convert all the read locking lockups to use RCU read side locking
and hence remove all read side locking.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Introduce RCU freeing of XFS inodes so that we can convert lookup
traversals to use rcu_read_lock() protection. This patch only
introduces the RCU freeing to minimise the potential conflicts with
mainline if this is merged into mainline via a VFS patchset. It
abuses the i_dentry list for the RCU callback structure because the
VFS patches make this a union so it is safe to use like this and
simplifies and merge issues.
This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU.
The later lookup patches need the same "found free inode" protection
regardless of the RCU freeing method used, so once again the RCU
freeing method can be dealt with apprpriately at merge time without
affecting any other code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
A long standing problem for streaming writeѕ through the NFS server
has been that the NFS server opens and closes file descriptors on an
inode for every write. The result of this behaviour is that the
->release() function is called on every close and that results in
XFS truncating speculative preallocation beyond the EOF. This has
an adverse effect on file layout when multiple files are being
written at the same time - they interleave their extents and can
result in severe fragmentation.
To avoid this problem, keep track of ->release calls made on a dirty
inode. For most cases, an inode is only going to be opened once for
writing and then closed again during it's lifetime in cache. Hence
if there are multiple ->release calls when the inode is dirty, there
is a good chance that the inode is being accessed by the NFS server.
Hence set a flag the first time ->release is called while there are
delalloc blocks still outstanding on the inode.
If this flag is set when ->release is next called, then do no
truncate away the speculative preallocation - leave it there so that
subsequent writes do not need to reallocate the delalloc space. This
will prevent interleaving of extents of different inodes written
concurrently to the same AG.
If we get this wrong, it is not a big deal as we truncate
speculative allocation beyond EOF anyway in xfs_inactive() when the
inode is thrown out of the cache.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option of a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.
Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases. Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragementation.
For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088
and for 16 concurrent 16GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208
Because it is hard to take back specualtive preallocation, cases
where there are large slow growing log files on a nearly full
filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size іs reduced according to this
table (based on 4k block size):
freespace max prealloc size
>5% full extent (8GB)
4-5% 2GB (8GB >> 2)
3-4% 1GB (8GB >> 3)
2-3% 512MB (8GB >> 4)
1-2% 256MB (8GB >> 5)
<1% 128MB (8GB >> 6)
This should reduce the amount of space held in speculative
preallocation for such cases.
The allocsize mount option turns off the dynamic behaviour and fixes
the prealloc size to whatever the mount option specifies. i.e. the
behaviour is unchanged.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
When listing attributes, we are doiing memory allocations under the
inode ilock using only KM_SLEEP. This allows memory allocation to
recurse back into the filesystem and do writeback, which may the
ilock we already hold on the current inode. THis will deadlock.
Hence use KM_NOFS for such allocations outside of transaction
context to ensure that reclaim recursion does not occur.
Reported-by: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The XFS iolock needs to be re-initialised to a new lock class before
it enters reclaim to prevent lockdep false positives. Unfortunately,
this is not sufficient protection as inodes in the XFS_IRECLAIMABLE
state can be recycled and not re-initialised before being reused.
We need to re-initialise the lock state when transfering out of
XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same
class as if the inode was just allocated. Hence we need a specific
lockdep class variable for the iolock so that both initialisations
use the same class.
While there, add a specific class for inodes in the reclaim state so
that it is easy to tell from lockdep reports what state the inode
was in that generated the report.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add a new xfs_alloc_find_best_extent that does a forward/backward
search in the allocation btree. That code previously was existed
two times in xfs_alloc_ag_vextent_near, once for each search
direction.
Based on an earlier patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Use a goto label to consolidate all block not found cases, and add a
tracepoint for them. Also clean up a few whitespace issues.
Based on an earlier patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Move the buffer locking into the callers as they need to do it
wether they call xfs_map_at_offset or not. Remove the b_bdev
assignment, which is already done by get_blocks. Remove the
duplicate extent type asserts in xfs_convert_page just before
calling xfs_map_at_offset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
After the last patches the code for overwrites is the same as for
delayed and unwritten extents except that it doesn't need to call
xfs_map_at_offset. Take care of that fact to simplify
xfs_vm_writepage.
The buffer loop now first checks the type of buffer and checks/sets
the ioend type, or continues to the next buffer if it's not
interesting to us. Only after that we validate the iomap and
perform the block mapping if needed, all in common code for the
cases where we have to do work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The all_bh flag is always set when entering the page clustering
machinery with a regular written extent, which means the check for
it is superflous.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_map_blocks always calls xfs_bmapi with the XFS_BMAPI_ENTIRE
entire flag, which tells it to not cap the extent at the passed in
size, but just treat the size as an minimum to map. This means
xfs_probe_cluster is entirely useless as we'll always get the whole
extent back anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
No need to lock the extent map exclusive when performing an
overwrite, we know the extent map must already have been loaded by
get_blocks. Apply the non-blocking inode semantics to all mapping
types instead of just delayed allocations. Remove the handling of
not yet allocated blocks for the IO_UNWRITTEN case - if an extent is
marked as unwritten allocated in the buffer it must already have an
extent on disk.
Add asserts to verify all the assumptions above in debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Opencode the xfs_iomap code in it's two callers. The overlap of
passed flags already was minimal and will be further reduced in the
next patch.
As a side effect the BMAPI_* flags for xfs_bmapi and the IO_* flags
for I/O end processing are merged into a single set of flags, which
should be a bit more descriptive of the operation we perform.
Also improve the tracing by giving each caller it's own type set of
tracepoints.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove passing the BMAPI_* flags to these helpers, in
xfs_iomap_write_direct the check BMAPI_DIRECT was always true, and
in the xfs_iomap_write_delay path is was never checked at all.
Remove the nmap return value as we never make use of it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Don't trylock the buffer. We are the only one ever locking it for a
regular file address space, and trylock was only copied from the
generic code which did it due to the old buffer based writeout in
jbd. Also make sure to only write out the buffer if the iomap
actually is valid, because we wouldn't have a proper mapping
otherwise. In practice we will never get an invalid mapping here as
the page lock guarantees truncate doesn't race with us, but better
be safe than sorry. Also make sure we allocate a new ioend when
crossing boundaries between mappings, just like we do for delalloc
and unwritten extents. Again this currently doesn't matter as the
I/O end handler only cares for the boundaries for unwritten extents,
but this makes the code fully correct and the same as for
delalloc/unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We'll never have BIO_EOPNOTSUPP set after calling submit_bio as this
can only happen for discards, and used to happen for barriers, none
of which is every submitted by xfs_submit_ioend_bio. Also remove
the loop around bio_alloc as it will never fail due to it's mempool
backing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently we only refuse a "read-only" mapping for writing out
unwritten and delayed buffers, and refuse any other for overwrites.
Improve the checks to require delalloc mappings for delayed buffers,
and unwritten extent mappings for unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Dispatch to a different helper for phase1 vs phase2 in
xlog_recover_commit_trans instead of doing it in all the
low-level functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Merge the call to xlog_recover_reorder_trans and the loop over the
recovery items from xlog_recover_do_trans into xlog_recover_commit_trans,
and keep the switch statement over the log item types as a separate helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
XFS used to support different types of buffer log items long time
ago. Remove the switch statements checking the log item type in
various buffer recovery helpers that were left over from those days
and the rather useless xlog_recover_do_buffer_pass2 wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We now support mounting and using filesystems with 64-bit inodes
even when not mounted with the inode64 option (which now only
controls if we allocate new inodes in that space or not). Make sure
we always use large NFS file handles when exporting a filesystem
that may contain 64-bit inodes. Note that this only affects newly
generated file handles, any outstanding 32-bit file handle is still
accepted.
[hch: the comment and commit log are mine, the rest is from a patch
snipplet from Samuel]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently, if CONFIG_CIFS_FSCACHE is set, fscache is enabled on files opened
as read-only irrespective of the 'fsc' mount option. Fix this by enabling
fscache only if 'fsc' mount option is specified explicitly.
Remove an extraneous cFYI debug message and fix a typo while at it.
Reported-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, it is possible to specify 'fsc' mount option even if
CONFIG_CIFS_FSCACHE has not been set. The option is being ignored silently
while the user fscache functionality to work. Fix this by raising error when
the CONFIG option is not set.
Reported-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Add extended attribute name system.cifs_acl
Get/generate cifs/ntfs acl blob and hand over to the invoker however
it wants to parse/process it under experimental configurable option CIFS_ACL.
Do not get CIFS/NTFS ACL for xattr for attribute system.posix_acl_access
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Change the name of function mode_to_acl to mode_to_cifs_acl.
Handle return code in functions mode_to_cifs_acl and
cifs_acl_to_fattr.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: don't use migrate page without CONFIG_MIGRATION
Btrfs: deal with DIO bios that span more than one ordered extent
Btrfs: setup blank root and fs_info for mount time
Btrfs: fix fiemap
Btrfs - fix race between btrfs_get_sb() and umount
Btrfs: update inode ctime when using links
Btrfs: make sure new inode size is ok in fallocate
Btrfs: fix typo in fallocate to make it honor actual size
Btrfs: avoid NULL pointer deref in try_release_extent_buffer
Btrfs: make btrfs_add_nondir take parent inode as an argument
Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
Btrfs: use dget_parent where we can UPDATED
Btrfs: fix more ESTALE problems with NFS
Btrfs: handle NFS lookups properly
btrfs: make 1-bit signed fileds unsigned
btrfs: Show device attr correctly for symlinks
btrfs: Set file size correctly in file clone
btrfs: Check if dest_offset is block-size aligned before cloning file
Btrfs: handle the space_cache option properly
btrfs: Fix early enospc because 'unused' calculated with wrong sign.
...
Dan Carpenter pointed out that the new sysfs_merge_group() and
sysfs_unmerge_group() routines requires their grp argument to be
non-NULL, because they dereference grp to obtain the list of
attributes. Hence it's pointless for the routines to include a test
and special-case handling for when grp is NULL. This patch (as1433)
removes the unneeded tests.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: Dan Carpenter <error27@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Only the callers check whether the invalid_mapping flag is set and not
cifs_invalidate_mapping().
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The new DIO bio splitting code has problems when the bio
spans more than one ordered extent. This will happen as the
generic DIO code merges our get_blocks calls together into
a bigger single bio.
This fixes things by walking forward in the ordered extent
code finding all the overlapping ordered extents and completing them
all at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This avoids some include-file hell, and the function isn't really
important enough to be inlined anyway.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And in particular, use it in 'pipe_fcntl()'.
The other pipe functions do not need to use the 'careful' version, since
they are only ever called for things that are already known to be pipes.
The normal read/write/ioctl functions are called through the file
operations structures, so if a file isn't a pipe, they'd never get
called. But pipe_fcntl() is special, and called directly from the
generic fcntl code, and needs to use the same careful function that the
splice code is using.
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. and change it to take the 'file' pointer instead of an inode, since
that's what all users want anyway.
The renaming is preparatory to exporting it to other users. The old
'pipe_info()' name was too generic and is already used elsewhere, so
before making the function public we need to use a more specific name.
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a problem with how we use sget, it searches through the list of supers
attached to the fs_type looking for a super with the same fs_devices as what
we're trying to mount. This depends on sb->s_fs_info being filled, but we don't
fill that in until we get to btrfs_fill_super, so we could hit supers on the
fs_type super list that have a null s_fs_info. In order to fix that we need to
go ahead and setup a blank root with a blank fs_info to hold fs_devices, that
way our test will work out right and then we can set s_fs_info in
btrfs_set_super, and then open_ctree will simply use our pre-allocated root and
fs_info when setting everything up. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two big problems currently with FIEMAP
1) We return extents for holes. This isn't supposed to happen, we just don't
return extents for holes and then userspace interprets the lack of an extent as
a hole.
2) We sometimes don't set FIEMAP_EXTENT_LAST properly. This is because we wait
to see a EXTENT_FLAG_VACANCY flag on the em, but this won't happen if say we ask
fiemap to map up to the last extent in a file, and there is nothing but holes up
to the i_size. To fix this we need to lookup the last extent in this file and
save the logical offset, so if we happen to try and map that extent we can be
sure to set FIEMAP_EXTENT_LAST.
With this patch we now pass xfstest 225, which we never have before.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When mounting a btrfs file system btrfs_test_super() may attempt to
use sb->s_fs_info, the btrfs root, of a super block that is going away
and that has had the btrfs root set to NULL in its ->put_super(). But
if the super block is going away it cannot be an existing super block
so we can return false in this case.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently we fail xfstest 236 because we're not updating the inode ctime on
link. This is a simple fix, and makes it so we pass 236 now.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have been failing xfstest 228 forever, because we don't check to make sure
the new inode size is acceptable as far as RLIMIT is concerned. Just check to
make sure it's ok to create a inode with this new size and error out if not.
With this patch we now pass 228.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a typo in __btrfs_prealloc_file_range() where we set the i_size to
actual_len/cur_offset, and then just set it to cur_offset again, and do the same
with btrfs_ordered_update_i_size(). This fixes it back to keeping i_size in a
local variable and then updating i_size properly. Tested this with
xfs_io -F -f -c "falloc 0 1" -c "pwrite 0 1" foo
stat'ing foo gives us a size of 1 instead of 4096 like it was. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Ensure we return the dirent->d_type when it is known
NFS: Correct the array bound calculation in nfs_readdir_add_to_array
NFS: Don't ignore errors from nfs_do_filldir()
NFS: Fix the error handling in "uncached_readdir()"
NFS: Fix a page leak in uncached_readdir()
NFS: Fix a page leak in nfs_do_filldir()
NFS: Assume eof if the server returns no readdir records
NFS: Buffer overflow in ->decode_dirent() should not be fatal
Pure nfs client performance using odirect.
SUNRPC: Fix an infinite loop in call_refresh/call_refreshresult
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: fix build for PROC_FS disabled
block: fix amiga and atari floppy driver compile warning
blk-throttle: Fix calculation of max number of WRITES to be dispatched
ioprio: grab rcu_read_lock in sys_ioprio_{set,get}()
xen/blkfront: cope with backend that fail empty BLKIF_OP_WRITE_BARRIER requests
xen/blkfront: Implement FUA with BLKIF_OP_WRITE_BARRIER
xen/blkfront: change blk_shadow.request to proper pointer
xen/blkfront: map REQ_FLUSH into a full barrier
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: fix typo in comment of nilfs_dat_move function
nilfs2: nilfs_iget_for_gc() returns ERR_PTR
reiserfs_unpack() locks the inode mutex with reiserfs_mutex_lock_safe()
to protect against reiserfs lock dependency. However this protection
requires to have the reiserfs lock to be locked.
This is the case if reiserfs_unpack() is called by reiserfs_ioctl but
not from reiserfs_quota_on() when it tries to unpack tails of quota
files.
Fix the ordering of the two locks in reiserfs_unpack() to fix this
issue.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Markus Gapp <markus.gapp@gmx.net>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org> [2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently one pagemap_read() call walks in PAGEMAP_WALK_SIZE bytes (== 512
pages.) But there is a corner case where walk_pmd_range() accidentally
runs over a VMA associated with a hugetlbfs file.
For example, when a process has mappings to VMAs as shown below:
# cat /proc/<pid>/maps
...
3a58f6d000-3a58f72000 rw-p 00000000 00:00 0
7fbd51853000-7fbd51855000 rw-p 00000000 00:00 0
7fbd5186c000-7fbd5186e000 rw-p 00000000 00:00 0
7fbd51a00000-7fbd51c00000 rw-s 00000000 00:12 8614 /hugepages/test
then pagemap_read() goes into walk_pmd_range() path and walks in the range
0x7fbd51853000-0x7fbd51a53000, but the hugetlbfs VMA should be handled by
walk_hugetlb_range(). Otherwise PMD for the hugepage is considered bad
and cleared, which causes undesirable results.
This patch fixes it by separating pagemap walk range into one PMD.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The attribute cache for a file was not being cleared when a file is opened
with O_TRUNC.
If the filesystem's open operation truncates the file ("atomic_o_trunc"
feature flag is set) then the kernel should invalidate the cached st_mtime
and st_ctime attributes.
Also i_size should be explicitly be set to zero as it is used sometimes
without refreshing the cache.
Signed-off-by: Ken Sumrall <ksumrall@android.com>
Cc: Anfei <anfei.zhou@gmail.com>
Cc: "Anand V. Avati" <avati@gluster.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Flush the disk cache in fsync and sync to make sure data actually is
on disk on completion of these system calls. There is a nobarrier
mount option to disable this behaviour. It's slightly misnamed now
that barrier actually are gone, but it matches the name used by all
major filesystems.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Avoid doing unessecary work in fsync. Do nothing unless the inode
was marked dirty, and only write the various metadata inodes out if
they contain any dirty state from this inode. This is archived by
adding three new dirty bits to the hfsplus-specific inode which are
set in the correct places.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Split the flags field in the hfsplus inode into an extent_state
flag that is locked by the extent_lock, and a new flags field
that uses atomic bitops. The second will grow more flags in the
next patch.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
fsync is supposed to not just work on regular files, but also on
directories. Fortunately enough hfsplus_file_fsync works just fine
for directories, so we can just wire it up.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Remove lots of code we don't need from fsync, we just need to call
->write_inode on the inode if it's dirty, for which sync_inode_metadata
is a lot more efficient than write_inode_now, and we need to write
out the various metadata inodes, which we now do explicitly instead
of by calling ->sync_fs.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
There is no reason to write out the metadata inodes or volume headers
during a non-blocking sync, as we are almost guaranteed to dirty them
again during the inode writeouts.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
hfsplus stores all metadata except for the volume headers in special
inodes. While these are marked hashed and periodically written out
by the flusher threads, we can't rely on that for sync. For the case
of a data integrity sync the VM has life-lock avoidance code that
avoids writing inodes again that are redirtied during the sync,
which is something that can happen easily for hfsplus. So make sure
we explicitly write out the metadata inodes at the beginning of
hfsplus_sync_fs.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Switch the hfsplus partition table reding for cdroms to use our bio
helpers. Again we don't rely on any caching in the buffer_heads, and
this gets rid of the last buffer_head use in hfsplus.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
The hfsplus backup volume header is located two blocks from the end of
the device. In case of device sizes that are not 4k aligned this means
we can't access it using buffer_heads when using the default 4k block
size.
Switch to using raw bios to read/write all buffer headers. We were not
relying on any caching behaviour of the buffer heads anyway. Additionally
always read in the backup volume header during mount to verify that we
can actually read it.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Remove opencoded writing of the volume header in hfsplus_fill_super
and hfsplus_put_super and offload it to hfsplus_sync_fs. In the
put_super case this means we only write the superblock once instead
of twice.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Turn a few noisy debug printks that show up during xfstests into
complied out debug print statements.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
nilfs_iget_for_gc() returns an ERR_PTR() on failure and doesn't return
NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Store the dirent->d_type in the struct nfs_cache_array_entry so that we
can use it in getdents() calls.
This fixes a regression with the new readdir code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It looks as if the array size calculation in MAX_READDIR_ARRAY does not
take the alignment of struct nfs_cache_array_entry into account.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We should ignore the errors from the filldir callback, and just interpret
them as meaning we should exit, however we should definitely pass back
ENOMEM errors.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, uncached_readdir() is broken because if fails to handle
the results from nfs_readdir_xdr_to_array() correctly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_do_filldir() must always free desc->page when it is done, otherwise
we end up leaking the page.
Also remove unused variable 'dentry'.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Overflowing the buffer in the readdir ->decode_dirent() should not lead to
a fatal error, but rather to an attempt to reread the record in question.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When an application opens a file with O_DIRECT flag, if the size of
the data that is written is equal to wsize, the client sends a
WRITE RPC with stable flag set to UNSTABLE followed by a single
COMMIT RPC rather than sending a single WRITE RPC with the stable
flag set to FILE_SYNC. This a bug.
Patch to fix this.
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Everybody who calls btrfs_add_nondir just passes in the dentry of the new file
and then dereference dentry->d_parent->d_inode, but everybody who calls
btrfs_add_nondir() are already passed the parent's inode. So instead of
dereferencing dentry->d_parent, just make btrfs_add_nondir take the dir inode as
an argument and pass that along so we don't have to worry about d_parent.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since we walk up the path logging all of the parts of the inode's path, we need
to hold i_mutex to make sure that the inode is not renamed while we're logging
everything. btrfs_log_dentry_safe does dget_parent and all of that jazz, but we
may get unexpected results if the rename changes the inode's location while
we're higher up the path logging those dentries, so do this for safety reasons.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock. This could cause problems with rename. So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When creating new inodes we don't setup inode->i_generation. So if we generate
an fh with a newly created inode we save the generation of 0, but if we flush
the inode to disk and have to read it back when getting the inode on the server
we'll have the right i_generation, so gens wont match and we get ESTALE. This
patch properly sets inode->i_generation when we create the new inode and now I'm
no longer getting ESTALE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
People kept reporting NFS issues, specifically getting ESTALE alot. I figured
out how to reproduce the problem
SERVER
mkfs.btrfs /dev/sda1
mount /dev/sda1 /mnt/btrfs-test
<add /mnt/btrfs-test to /etc/exports>
btrfs subvol create /mnt/btrfs-test/foo
service nfs start
CLIENT
mount server:/mnt/btrfs /mnt/test
cd /mnt/test/foo
ls
SERVER
echo 3 > /proc/sys/vm/drop_caches
CLIENT
ls <-- get an ESTALE here
This is because the standard way to lookup a name in nfsd is to use readdir, and
what it does is do a readdir on the parent directory looking for the inode of
the child. So in this case the parent being / and the child being foo. Well
subvols all have the same inode number, so doing a readdir of / looking for
inode 256 will return '.', which obviously doesn't match foo. So instead we
need to have our own .get_name so that we can find the right name.
Our .get_name will either lookup the inode backref or the root backref,
whichever we're looking for, and return the name we find. Running the above
reproducer with this patch results in everything acting the way its supposed to.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:
(After cloning file1 to file2 with unaligned dest_offset)
# cat /mnt/file2
cat: /mnt/file2: Input/output error
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When I added the clear_cache option I screwed up and took the break out of
the space_cache case statement, so whenever you mount with space_cache you also
get clear_cache, which does you no good if you say set space_cache in fstab so
it always gets set. This patch adds the break back in properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
bio_endio() will free dip and dip->csums, so dip and dip->csums twice will
be freed twice. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Migrate page will directly call the btrfs btree writepage function,
which isn't actually allowed.
Our writepage assumes that you have locked the extent_buffer and
flagged the block as written. Without doing these steps, we can
corrupt metadata blocks.
A later commit will remove the btree writepage function since
it is really only safely used internally by btrfs. We
use writepages for everything else.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Add EXT4_IOC_TRIM ioctl to handle batched discard
fs: Do not dispatch FITRIM through separate super_operation
ext4: ext4_fill_super shouldn't return 0 on corruption
jbd2: fix /proc/fs/jbd2/<dev> when using an external journal
ext4: missing unlock in ext4_clear_request_list()
ext4: fix setting random pages PageUptodate
Filesystem independent ioctl was rejected as not common enough to be in
core vfs ioctl. Since we still need to access to this functionality this
commit adds ext4 specific ioctl EXT4_IOC_TRIM to dispatch
ext4_trim_fs().
It takes fstrim_range structure as an argument. fstrim_range is definec in
the include/linux/fs.h and its definition is as follows.
struct fstrim_range {
__u64 start;
__u64 len;
__u64 minlen;
}
start - first Byte to trim
len - number of Bytes to trim from start
minlen - minimum extent length to trim, free extents shorter than this
number of Bytes will be ignored. This will be rounded up to fs
block size.
After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There was concern that FITRIM ioctl is not common enough to be included
in core vfs ioctl, as Christoph Hellwig pointed out there's no real point
in dispatching this out to a separate vector instead of just through
->ioctl.
So this commit removes ioctl_fstrim() from vfs ioctl and trim_fs
from super_operation structure.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix readdir EOVERFLOW on 32-bit archs
ceph: fix frag offset for non-leftmost frags
ceph: fix dangling pointer
ceph: explicitly specify page alignment in network messages
ceph: make page alignment explicit in osd interface
ceph: fix comment, remove extraneous args
ceph: fix update of ctime from MDS
ceph: fix version check on racing inode updates
ceph: fix uid/gid on resent mds requests
ceph: fix rdcache_gen usage and invalidate
ceph: re-request max_size if cap auth changes
ceph: only let auth caps update max_size
ceph: fix open for write on clustered mds
ceph: fix bad pointer dereference in ceph_fill_trace
ceph: fix small seq message skipping
Revert "ceph: update issue_seq on cap grant"
At the start of ext4_fill_super, ret is set to -EINVAL, and any failure path
out of that function returns ret. However, the generic_check_addressable
clause sets ret = 0 (if it passes), which means that a subsequent failure (e.g.
a group checksum error) returns 0 even though the mount should fail. This
causes vfs_kern_mount in turn to think that the mount succeeded, leading to an
oops.
A simple fix is to avoid using ret for the generic_check_addressable check,
which was last changed in commit 30ca22c70e.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Userland programs using the quotactl() syscall assume limit/warn/usage
block counts in 512b basic blocks which were instead being read/written
in fs blocksize in gfs2. With this patch, gfs2 correctly interacts with
the syscall using 512b blocks.
Signed-off-by: Abhi Das <adas@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If ocfs2_live_connection_list is empty, ocfs2_connection_find() will return
a pointer to the LIST_HEAD, cast as a ocfs2_live_connection. This can cause
an oops when ocfs2_control_send_down() dereferences c->oc_conn:
Call Trace:
[<ffffffffa00c2a3c>] ocfs2_control_message+0x28c/0x2b0 [ocfs2_stack_user]
[<ffffffffa00c2a95>] ocfs2_control_write+0x35/0xb0 [ocfs2_stack_user]
[<ffffffff81143a88>] vfs_write+0xb8/0x1a0
[<ffffffff8155cc13>] ? do_page_fault+0x153/0x3b0
[<ffffffff811442f1>] sys_write+0x51/0x80
[<ffffffff810121b2>] system_call_fastpath+0x16/0x1b
Fix by explicitly returning NULL if no match is found.
Signed-off-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Commit 1c66b360fe (Change some lock status member in ocfs2_lock_res
to char.) states that these fields need to be signed due to comparision
to -1, but only changed the type from unsigned char to char. However, it
is a compiler option if char is a signed or unsigned type. Change these
fields to signed char so the code will work with all compilers.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
I suddenly hit the problem during 2.6.37-rc1 regression test, which was
introduced by commit '5e98d492406818e6a94c0ba54c61f59d40cefa4a'(Track
negative entries v3), following scenario reproduces the issue easily:
Node A Node B
================ ============
$touch testfile
$ls testfile
$rm -rf testfile
$touch testfile
$ls testfile
ls: cannot access testfile: No such file or directory
This patch stops tracking the dentry which was negativated by a inode deletion,
so as to force the revaliation in next lookup, in case we'll touch the inode
again in the same node.
It didn't hurt the performance of multiple lookup for none-existed files anyway,
while regresses a bit in the first try after a file deletion.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Stanse found that o2hb_heartbeat_group_make_item leaks some memory on
fail paths. Fix the paths by adding a new label and jump there.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: ocfs2-devel@oss.oracle.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
coccinelle check scripts/coccinelle/locks/call_kern.cocci found that
in fs/ocfs2/dlm/dlmdomain.c an allocation with GFP_KERNEL is done
with locks held:
dlm_query_region_handler
spin_lock(dlm_domain_lock)
dlm_match_regions
kmalloc(GFP_KERNEL)
Change it to GFP_ATOMIC.
Signed-off-by: David Sterba <dsterba@suse.cz>
CC: Joel Becker <joel.becker@oracle.com>
CC: Mark Fasheh <mfasheh@suse.com>
CC: ocfs2-devel@oss.oracle.com
--
Exists in v2.6.37-rc1 and current linux-next.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
One of the readdir filldir_t callers was passing the raw ceph 64-bit ino
instead of the hashed 32-bit one, producing an EOVERFLOW in the filler
callback. Fix this by calling the ceph_vino_to_ino() helper to do the
conversion.
Reported-by: Jan Smets <jan.smets@alcatel-lucent.com>
Tested-by: Jan Smets <jan.smets@alcatel-lucent.com>
Signed-off-by: Sage Weil <sage@newdream.net>
In jbd2_journal_init_dev(), we need make sure the journal structure is
fully initialzied before calling jbd2_stats_proc_init().
Reviewed-by: Andreas Dilger <andreas.dilger@oracle.com>
Signed-off-by: yangsheng <sheng.yang@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the the li_request_list was empty then it returned with the lock
held. Instead of adding a "goto unlock" I just removed that special
case and let it go past the empty list_for_each_safe().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_end_bio calls put_page and kmem_cache_free before calling
SetPageUpdate(). This can result in setting the PageUptodate bit on
random pages and causes the following BUG:
BUG: Bad page state in process rm pfn:52e54
page:ffffea0001222260 count:0 mapcount:0 mapping: (null) index:0x0
arch kernel: page flags: 0x4000000000000008(uptodate)
Fix the problem by moving put_io_page() after the SetPageUpdate() call.
Thanks to Hugh Dickins for analyzing this problem.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lock_kernel is gone from the code, so the comments should be updated,
too. nfsd now uses lock_flocks instead of lock_kernel to protect
against posix file locks.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It allows users to see what consoles are currently known to the system
and with what flags.
It is based on Werner's patch, the part about traversing fds was
removed, the code was moved to kernel/printk.c, where consoles are
handled and it makes more sense to me.
Signed-off-by: Jiri Slaby <jslaby@suse.cz> [cleanups]
Signed-off-by: "Dr. Werner Fink" <werner@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Strings allocated via kmemdup() in nfs_readdir_make_qstr() are
referenced from the nfs_cache_array which is stored in a page cache
page. Kmemleak does not scan such pages and it reports several false
positives. This patch annotates the string->name pointer so that
kmemleak does not consider it a real leak.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix up the issue that array->eof_index needs to be able to be set
even if array->size == 0.
Ensure that we catch all important memory allocation error conditions
and/or kmap() failures.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit 80e60639f1.
This change requires further fixes to ensure that the open doesn't
succeed if the lookup later results in a regular file being created.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trying to mount NFS (root partition in my case) fails if CONFIG_NFS_V3
is not selected. nfs_validate_mount_data() returns EPROTONOSUPPORT,
because of this check:
#ifndef CONFIG_NFS_V3
if (args->version == 3)
goto out_v3_not_compiled;
#endif /* !CONFIG_NFS_V3 */
and args->version was always initialized to 3.
It was working in 2.6.36
Signed-off-by: Paulius Zaleckas <paulius.zaleckas@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Nick Bowler reports:
There are no unusual messages on the client... but I just logged into
the server and I see lots of messages of the following form:
nfsd: request from insecure port (192.168.8.199:35766)!
nfsd: request from insecure port (192.168.8.199:35766)!
nfsd: request from insecure port (192.168.8.199:35766)!
nfsd: request from insecure port (192.168.8.199:35766)!
nfsd: request from insecure port (192.168.8.199:35766)!
Bisected to commit 9247685088 (SUNRPC:
Properly initialize sock_xprt.srcaddr in all cases)
Apparently, removing the 'transport->srcaddr.ss_family = family' from
xs_create_sock() triggers this due to nlmclnt_lookup_host() incorrectly
initialising the srcaddr family to AF_UNSPEC.
Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This area of the code has always been a bit delicate due to the
subtleties of lock ordering. The problem is that for "normal"
alloc/dealloc, we always grab the inode locks first and the rgrp lock
later.
In order to ensure no races in looking up the unlinked, but still
allocated inodes, we need to hold the rgrp lock when we do the lookup,
which means that we can't take the inode glock.
The solution is to borrow the technique already used by NFS to solve
what is essentially the same problem (given an inode number, look up
the inode carefully, checking that it really is in the expected
state).
We cannot do that directly from the allocation code (lock ordering
again) so we give the job to the pre-existing delete workqueue and
carry on with the allocation as normal.
If we find there is no space, we do a journal flush (required anyway
if space from a deallocation is to be released) which should block
against the pending deallocations, so we should always get the space
back.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Using:
- CONFIG_LOCKUP_DETECTOR=y
- CONFIG_PREEMPT=y
- CONFIG_LOCKDEP=y
- CONFIG_PROVE_LOCKING=y
- CONFIG_PROVE_RCU=y
found a missing rcu lock during boot on a 512 MiB x86_64 ubuntu vm:
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/pid.c:419 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
1 lock held by ureadahead/1355:
#0: (tasklist_lock){.+.+..}, at: [<ffffffff8115bc09>] sys_ioprio_set+0x7f/0x29e
stack backtrace:
Pid: 1355, comm: ureadahead Not tainted 2.6.37-dbg-DEV #1
Call Trace:
[<ffffffff8109c10c>] lockdep_rcu_dereference+0xaa/0xb3
[<ffffffff81088cbf>] find_task_by_pid_ns+0x44/0x5d
[<ffffffff81088cfa>] find_task_by_vpid+0x22/0x24
[<ffffffff8115bc3e>] sys_ioprio_set+0xb4/0x29e
[<ffffffff8147cf21>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff8105c409>] sysenter_dispatch+0x7/0x2c
[<ffffffff8147cee2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
The fix is to:
a) grab rcu lock in sys_ioprio_{set,get}() and
b) avoid grabbing tasklist_lock.
Discussion in: http://marc.info/?l=linux-kernel&m=128951324702889
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Modified by Jens to remove the now redundant inner rcu lock and
unlock since they are now protected by the outer lock.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
linux-2.6.37-rc1: I compiled a kernel with CIFS which subsequently
failed with an error indicating it couldn't initialize crypto module
"hmacmd5". CONFIG_CRYPTO_HMAC=y fixed the problem.
This patch makes CIFS depend on CRYPTO_HMAC in kconfig.
Signed-off-by: Jody Bruchon<jody@nctritech.com>
CC: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 83fd9c7 changes l_level, l_requested and l_blocking of
ocfs2_lock_res from int to unsigned char. But actually it is
initially as -1(ocfs2_lock_res_init_common) which
correspoding to 255 for unsigned char. So the whole dlm lock
mechanism doesn't work now which means a disaster to ocfs2.
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.
All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.
* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().
* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.
* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.
The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
bdev read-only status can be queried using bdev_read_only() and may
change while the device is being opened. Enforce it by checking it
from blkdev_get() after open succeeds.
This makes bdev_read_only() check in open_bdev_exclusive() and
fsg_lun_open() unnecessary. Drop them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Brownell <dbrownell@users.sourceforge.net>
Cc: linux-usb@vger.kernel.org
With claim/release rolled into blkdev_get/put(), there's no reason to
keep bd_abort/finish_claim(), __bd_claim() and bd_release() as
separate functions. It only makes the code difficult to follow.
Collapse them into blkdev_get/put(). This will ease future changes
around claim/release.
Signed-off-by: Tejun Heo <tj@kernel.org>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.
Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface. This is a step for cleaning up
bd_claim/release. This patch makes dm-table slightly more complex but
it will be simplified again with further changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
In the failure path of __btrfs_open_devices(), close_bdev_exclusive()
is called with @flags which doesn't match the one used during
open_bdev_exclusive(). Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <chris.mason@oracle.com>
It's possible for initiate_cifs_search to be called on a filp that
already has private_data attached. If this happens, we'll end up
calling cifs_sb_tlink, taking an extra reference to the tlink and
attaching that to the cifsFileInfo. This leads to refcount leaks
that manifest as a "stuck" cifsd at umount time.
Fix this by only looking up the tlink for the cifsFile on the filp's
first pass through this function. When called on a filp that already
has cifsFileInfo associated with it, just use the tlink reference
that it already owns.
This patch fixes samba.org bug 7792:
https://bugzilla.samba.org/show_bug.cgi?id=7792
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Calling cond_resched() after every send can unnecessarily
degrade performance. Go back to an old method of scheduling
after 25 messages.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
So far as I can tell, there is no reason to use a single-threaded
send workqueue for dlm, since it may need to send to several sockets
concurrently. Both workqueues are set to WQ_MEM_RECLAIM to avoid
any possible deadlocks, WQ_HIGHPRI since locking traffic is highly
latency sensitive (and to avoid a priority inversion wrt GFS2's
glock_workqueue) and WQ_FREEZABLE just in case someone needs to do
that (even though with current cluster infrastructure, it doesn't
make sense as the node will most likely land up ejected from the
cluster) in the future.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
block: remove unused copy_io_context()
Documentation: remove anticipatory scheduler info
block: remove REQ_HARDBARRIER
ioprio: rcu_read_lock/unlock protect find_task_by_vpid call (V2)
ioprio: fix RCU locking around task dereference
block: ioctl: fix information leak to userland
block: read i_size with i_size_read()
cciss: fix proc warning on attempt to remove non-existant directory
bio: take care not overflow page count when mapping/copying user data
block: limit vec count in bio_kmalloc() and bio_alloc_map_data()
block: take care not to overflow when calculating total iov length
block: check for proper length of iov entries in blk_rq_map_user_iov()
cciss: remove controllers supported by hpsa
cciss: use usleep_range not msleep for small sleeps
cciss: limit commands allocated on reset_devices
cciss: Use kernel provided PCI state save and restore functions
cciss: fix board status waiting code
drbd: Removed checks for REQ_HARDBARRIER on incomming BIOs
drbd: REQ_HARDBARRIER -> REQ_FUA transition for meta data accesses
drbd: Removed the BIO_RW_BARRIER support form the receiver/epoch code
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: remove incorrect assert in xfs_vm_writepage
xfs: use hlist_add_fake
xfs: fix a few compiler warnings with CONFIG_XFS_QUOTA=n
xfs: tell lockdep about parent iolock usage in filestreams
xfs: move delayed write buffer trace
xfs: fix per-ag reference counting in inode reclaim tree walking
xfs: xfs_ioctl: fix information leak to userland
xfs: remove experimental tag from the delaylog option
WARN_ONCE is a bit strong for a deprecation warning, given that it spews a
huge backtrace.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We start at offset 2 for the leftmost frag, and 0 for subsequent frags.
When we reach the end (rightmost), we go back to 2. This fixes readdir on
fragmented (large) directories.
Signed-off-by: Sage Weil <sage@newdream.net>
In the normal regime where an application uses non-blocking I/O
writes on a socket, they will handle -EAGAIN and use poll() to
wait for send space.
They don't actually sleep on the socket I/O write.
But kernel level RPC layers that do socket I/O operations directly
and key off of -EAGAIN on the write() to "try again later" don't
use poll(), they instead have their own sleeping mechanism and
rely upon ->sk_write_space() to trigger the wakeup.
So they do effectively sleep on the write(), but this mechanism
alone does not let the socket layers know what's going on.
Therefore they must emulate what would have happened, otherwise
TCP cannot possibly see that the connection is application window
size limited.
Handle this, therefore, like SUNRPC by setting SOCK_NOSPACE and
bumping the ->sk_write_count as needed when we hit the send buffer
limits.
This should make TCP send buffer size auto-tuning and the
->sk_write_space() callback invocations actually happen.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David Teigland <teigland@redhat.com>
We already export 'ro' for the disk. This adds the same attribute
for partitions.
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Modify get/set_cifs_acl* calls to reutrn error code and percolate the
error code up to the caller.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_root_iget allocates full_path through
cifs_build_path_to_root, but fails to kfree it upon
cifs_get_inode_info* failure.
Make all failure exit paths traverse clean up
handling at the end of the function.
Signed-off-by: Oskar Schirmer <oskar@scara.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
In commit 20cb52ebd1, titled
"xfs: simplify xfs_vm_writepage" I added an assert that any !mapped and
uptodate buffers are not dirty. That asserts turns out to trigger a lot
when running fsx on filesystems with small block sizes. The reason for
that is that the assert is simply incorrect. !mapped and uptodate
just mean this buffer covers a hole, and whenever we do a set_page_dirty
we mark all blocks in the page dirty, no matter if they have data or
not. So remove the assert, and update the comment above the condition
to match reality.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
A minor oversight from f7347ce4ee,
"fasync: re-organize fasync entry insertion to allow it under a
spinlock": this cleanup-on-error was only needed to handle -ENOMEM. Now
that we're preallocating it's unneeded.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We must also free the passed-in lease in the case it wasn't used because
an existing lease was upgrade/downgraded or already existed.
Note the nfsd caller doesn't care because it's fl_change callback
returns an error in those cases.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
XFS does not need it's inodes to actuall be hashed in the VFS inode
cache, but we require the inode to be marked hashed for the
writeback code to work.
Insted of using insert_inode_hash, which requires a second
inode_lock roundtrip after the partial merge of the inode
scalability patches in 2.6.37-rc simply use the new hlist_add_fake
helper to mark it hashed without requiring a lock or touching a
global cache line.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Andi Kleen reported that gcc-4.5 gives lots of warnings for him
inside the XFS code. It turned out most of them are due to the
quota stubs beeing macros, and gcc now complaining about macros
evaluating to 0 that are not assigned to variables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The filestreams code may take the iolock on the parent inode while
holding it on a child. This is the only place in XFS where we take
both the child and parent iolock, so just telling lockdep about it
is enough. The lock flag required for that was already added as
part of the ilock lockdep annotations and unused so far.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The delayed write buffer split trace currently issues a trace for
every buffer it scans. These buffers are not necessarily queued for
delayed write. Indeed, when buffers are pinned, there can be
thousands of traces of buffers that aren't actually queued for
delayed write and the ones that are are lost in the noise. Move the
trace point to record only buffers that are split out for IO to be
issued on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The walk fails to decrement the per-ag reference count when the
non-blocking walk fails to obtain the per-ag reclaim lock, leading
to an assert failure on debug kernels when unmounting a filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
al_hreq is copied from userland. If al_hreq.buflen is not properly aligned
then xfs_attr_list will ignore the last bytes of kbuf. These bytes are
unitialized. It leads to leaking of contents of kernel stack memory.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We promised to do this for 2.6.37, and the code looks stable enough to
keep that promise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
cfile may very well be freed after the cifsFileInfo_put. Make sure we
have a valid pointer to the superblock for cifs_sb_deactive.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 4221a9918e "Add RCU check for
find_task_by_vpid()" introduced rcu_lockdep_assert to find_task_by_pid_ns=
Assertion failed in sys_ioprio_get. The patch is fixing assertion
failure in ioprio_set as well.
kernel/pid.c:419 invoked rcu_dereference_check() without protection!
stack backtrace:
Pid: 4254, comm: iotop Not tainted
Call Trace:
[<ffffffff810656f2>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffff81053c67>] find_task_by_pid_ns+0x4f/0x68
[<ffffffff81053c9d>] find_task_by_vpid+0x1d/0x1f
[<ffffffff811104e2>] sys_ioprio_get+0x50/0x2da
[<ffffffff81002182>] system_call_fastpath+0x16/0x1b
V2: rcu critical section expanded according to comment by Paul E. McKenney
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With 2.6.37-rc1, I observe sys_ioprio_set not taking the RCU lock [1]
across access to the task credentials.
Inspecting the code in fs/ioprio.c, the tasklist_lock is held for read
across the __task_cred call, which is presumably sufficient to prevent
the task credentials becoming stale.
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/pid.c:419 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 1
1 lock held by start-stop-daem/2246:
#0: (tasklist_lock){.?.?..}, at: [<ffffffff811a2dfa>]
sys_ioprio_set+0x8a/0x400
stack backtrace:
Pid: 2246, comm: start-stop-daem Not tainted 2.6.37-rc1-330cd+ #2
Call Trace:
[<ffffffff8109f5f4>] lockdep_rcu_dereference+0xa4/0xc0
[<ffffffff81085651>] find_task_by_pid_ns+0x81/0x90
[<ffffffff8108567d>] find_task_by_vpid+0x1d/0x20
[<ffffffff811a3160>] sys_ioprio_set+0x3f0/0x400
[<ffffffff816efa79>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81003482>] system_call_fastpath+0x16/0x1b
Take the RCU lock for read across acquiring the pointer to the task
credentials and dereferencing it.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Fixed up by Jens to fix missing rcu_read_unlock() on mismatches.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the iovec is being set up in a way that causes uaddr + PAGE_SIZE
to overflow, we could end up attempting to map a huge number of
pages. Check for this invalid input type.
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched. This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.
Explicitly specify the page alignment when setting up OSD IO requests.
Signed-off-by: Sage Weil <sage@newdream.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix a memleak in cifs_setattr_nounix()
cifs: make cifs_ioctl handle NULL filp->private_data correctly
Fix openpromfs compilation by adding a missing semicolon in
fs/openpromfs/inode.c openprom_mount().
Signed-off-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Add new ext4 inode tracepoints
ext4: Don't call sb_issue_discard() in ext4_free_blocks()
ext4: do not try to grab the s_umount semaphore in ext4_quota_off
ext4: fix potential race when freeing ext4_io_page structures
ext4: handle writeback of inodes which are being freed
ext4: initialize the percpu counters before replaying the journal
ext4: "ret" may be used uninitialized in ext4_lazyinit_thread()
ext4: fix lazyinit hang after removing request
Commit 13cfb7334e made cifs_ioctl use the tlink attached to the
cifsFileInfo for a filp. This ignores the case of an open directory
however, which in CIFS can have a NULL private_data until a readdir
is done on it.
This patch re-adds the NULL pointer checks that were removed in commit
50ae28f01 and moves the setting of tcon and "caps" variables lower.
Long term, a better fix would be to establish a f_op->open routine for
directories that populates that field at open time, but that requires
some other changes to how readdir calls are handled.
Reported-by: Kjell Rune Skaaraas <kjella79@yahoo.no>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 5c521830cf (ext4: Support discard requests when running in
no-journal mode) attempts to add sb_issue_discard() for data blocks
(in data=writeback mode) and in no-journal mode. Unfortunately, this
no longer works, because in commit dd3932eddf (block: remove
BLKDEV_IFL_WAIT), sb_issue_discard() only presents a synchronous
interface, and there are times when we call ext4_free_blocks() when we
are are holding a spinlock, or are otherwise in an atomic context.
For now, I've removed the call to sb_issue_discard() to prevent a
deadlock or (if spinlock debugging is enabled) failures like this:
BUG: scheduling while atomic: rc.sysinit/1376/0x00000002
Pid: 1376, comm: rc.sysinit Not tainted 2.6.36-ARCH #1
Call Trace:
[<ffffffff810397ce>] __schedule_bug+0x5e/0x70
[<ffffffff81403110>] schedule+0x950/0xa70
[<ffffffff81060bad>] ? insert_work+0x7d/0x90
[<ffffffff81060fbd>] ? queue_work_on+0x1d/0x30
[<ffffffff81061127>] ? queue_work+0x37/0x60
[<ffffffff8140377d>] schedule_timeout+0x21d/0x360
[<ffffffff812031c3>] ? generic_make_request+0x2c3/0x540
[<ffffffff81402680>] wait_for_common+0xc0/0x150
[<ffffffff81041490>] ? default_wake_function+0x0/0x10
[<ffffffff812034bc>] ? submit_bio+0x7c/0x100
[<ffffffff810680a0>] ? wake_bit_function+0x0/0x40
[<ffffffff814027b8>] wait_for_completion+0x18/0x20
[<ffffffff8120a969>] blkdev_issue_discard+0x1b9/0x210
[<ffffffff811ba03e>] ext4_free_blocks+0x68e/0xb60
[<ffffffff811b1650>] ? __ext4_handle_dirty_metadata+0x110/0x120
[<ffffffff811b098c>] ext4_ext_truncate+0x8cc/0xa70
[<ffffffff810d713e>] ? pagevec_lookup+0x1e/0x30
[<ffffffff81191618>] ext4_truncate+0x178/0x5d0
[<ffffffff810eacbb>] ? unmap_mapping_range+0xab/0x280
[<ffffffff810d8976>] vmtruncate+0x56/0x70
[<ffffffff811925cb>] ext4_setattr+0x14b/0x460
[<ffffffff811319e4>] notify_change+0x194/0x380
[<ffffffff81117f80>] do_truncate+0x60/0x90
[<ffffffff811e08fa>] ? security_inode_permission+0x1a/0x20
[<ffffffff811eaec1>] ? tomoyo_path_truncate+0x11/0x20
[<ffffffff81127539>] do_last+0x5d9/0x770
[<ffffffff811278bd>] do_filp_open+0x1ed/0x680
[<ffffffff8140644f>] ? page_fault+0x1f/0x30
[<ffffffff81132bfc>] ? alloc_fd+0xec/0x140
[<ffffffff81118db1>] do_sys_open+0x61/0x120
[<ffffffff81118e8b>] sys_open+0x1b/0x20
[<ffffffff81002e6b>] system_call_fastpath+0x16/0x1b
https://bugzilla.kernel.org/show_bug.cgi?id=22302
Reported-by: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: jiayingz@google.com
It's not needed to sync the filesystem, and it fixes a lock_dep complaint.
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Use an atomic_t and make sure we don't free the structure while we
might still be submitting I/O for that page.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename). In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed. If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.
To solve this problem, we need to keep track of the number of io
callbacks which are pending, and avoid destroying the inode until they
have all been completed. That way we don't have to bump the inode
count to keep the inode from being destroyed; an approach which
doesn't work because the count could have already been dropped down to
zero before the inode writeback has started (at which point we're not
allowed to bump the count back up to 1, since it's already started
getting freed).
Thanks to Dave Chinner for suggesting this approach, which is also
used by XFS.
kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
Call Trace:
[<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
[<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
[<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
[<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
[<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
[<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
[<ffffffff81087910>] do_writepages+0x1c/0x25
[<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
[<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
[<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
[<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
[<ffffffff810c14a3>] evict+0x22/0x92
[<ffffffff810c1a3d>] iput+0x212/0x249
[<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
[<ffffffff810bdf6b>] d_kill+0x3d/0x5d
[<ffffffff810be613>] dput+0x13a/0x147
[<ffffffff810b990d>] sys_renameat+0x1b5/0x258
[<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
[<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
[<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
[<ffffffff810b99c6>] sys_rename+0x16/0x18
[<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.
This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).
Signed-off-by: Sage Weil <sage@newdream.net>
We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have. The
exception is when an MDS sense unstable information, in which case we
always update.
The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied. Fixed and clarified the comment.
Signed-off-by: Sage Weil <sage@newdream.net>
MDS requests can be rebuilt and resent in non-process context, but were
filling in uid/gid from current_fsuid/gid. Put that information in the
request struct on request setup.
This fixes incorrect (and root) uid/gid getting set for requests that
are forwarded between MDSs, usually due to metadata migrations.
Signed-off-by: Sage Weil <sage@newdream.net>
We used to use rdcache_gen to indicate whether we "might" have cached
pages. Now we just look at the mapping to determine that. However, some
old behavior remains from that transition.
First, rdcache_gen == 0 no longer means we have no pages. That can happen
at any time (presumably when we carry FILE_CACHE). We should not reset it
to zero, and we should not check that it is zero.
That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate. If they are
equal, we should invalidate. On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen. Similarly, if we success
in doing a sync invalidate, set revoking = gen - 1. (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)
Signed-off-by: Sage Weil <sage@newdream.net>
hfsplus only actually uses the force option during remount, but it uses
the full option parser with a fake superblock to do so. This means remount
will fail if any nls option is set (which happens frequently with older
mount tools), even if it is the same.
Fix this by adding a simpler version of the parser that only parses the force
option for remount.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
If the auth cap migrates to another MDS, clear requested_max_size so that
we resend any pending max_size increase requests. This fixes potential
hangs on writes that extend a file and race with an cap migration between
MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>
Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap. Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.
Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.
Signed-off-by: Sage Weil <sage@newdream.net>
Normally when we open a file we already have a cap, and simply update the
wanted set. However, if we open a file for write, but don't have an auth
cap, that doesn't work; we need to open a new cap with the auth MDS. Only
reuse existing caps if we are opening for read or the existing cap is auth.
Signed-off-by: Sage Weil <sage@newdream.net>
We dereference *in a few lines down, but only set it on rename. It is
apparently pretty rare for this to trigger, but I have been hitting it
with a clustered MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: make cifs_set_oplock_level() take a cifsInodeInfo pointer
cifs: dereferencing first then checking
cifs: trivial comment fix: tlink_tree is now a rbtree
[CIFS] Cleanup unused variable build warning
cifs: convert tlink_tree to a rbtree
cifs: store pointer to master tlink in superblock (try #2)
cifs: trivial doc fix: note setlease implemented
CIFS: Add cifs_set_oplock_level
FS: cifs, remove unneeded NULL tests
All the callers already have a pointer to struct cifsInodeInfo. Use it.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch is based on Dan's original patch. His original description is
below:
Smatch complained about a couple checking for NULL after dereferencing
bugs. I'm not super familiar with the code so I did the conservative
thing and move the dereferences after the checks.
The dereferences in cifs_lock() and cifs_fsync() were added in
ba00ba64cf "cifs: make various routines use the cifsFileInfo->tcon
pointer". The dereference in find_writable_file() was added in
6508d904e6 "cifs: have find_readable/writable_file filter by fsuid".
The comments there say it's possible to trigger the NULL dereference
under stress.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Noticed while reviewing (late) the rbtree conversion patchset (which has been merged
already).
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We now initialize the percpu counters before replaying the journal,
but after the journal, we recalculate the global counters, to deal
with the possibility of the per-blockgroup counts getting updated by
the journal replay.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If a connection is closed just after a sequence or create_session
is sent over it, we could end up trying to register a callback that will
never get called since the xprt is already marked dead.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Radix trees are ideal when you want to track a bunch of pointers and
can't embed a tracking structure within the target of those pointers.
The tradeoff is an increase in memory, particularly if the tree is
sparse.
In CIFS, we use the tlink_tree to track tcon_link structs. A tcon_link
can never be in more than one tlink_tree, so there's no impediment to
using a rb_tree here instead of a radix tree.
Convert the new multiuser mount code to use a rb_tree instead. This
should reduce the memory required to manage the tlink_tree.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This is the second version of this patch, the only difference between
it and the first one is that this explicitly makes cifs_sb_master_tlink
a static inline.
Instead of keeping a tag on the master tlink in the tree, just keep a
pointer to the master in the superblock. That eliminates the need for
using the radix tree to look up a tagged entry.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Simplify many places when we need to set oplock level on an inode.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Newer GCC's reported the following build warning:
fs/ext4/super.c: In function 'ext4_lazyinit_thread':
fs/ext4/super.c:2702: warning: 'ret' may be used uninitialized in this function
Fix it by removing the need for the ret variable in the first place.
Signed-off-by: "Lukas Czerner" <lczerner@redhat.com>
Reported-by: "Stefan Richter" <stefanr@s5r6.in-berlin.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When the request has been removed from the list and no other request
has been issued, we will end up with next wakeup scheduled to
MAX_JIFFY_OFFSET which is bad. So check for that.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Linus noted, and complained to me, that doing while lots of "git diff"'s
of kernel sources, these spinlocks were responsible for 27% of the
spinlock cost on his two-processor system as reported by perf.
Git was doing lots of parallel stats, and this was putting a lot of
pressure on ext4_getattr(). A spinlock to protect a single
memory-to-memory copy is pointless, so remove it.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stanse found that pSMBFile in cifs_ioctl and file->f_path.dentry in
cifs_user_write are dereferenced prior their test to NULL.
The alternative is not to dereference them before the tests. The patch is
to point out the problem, you have to decide.
While at it we cache the inode in cifs_user_write to a local variable
and use all over the function.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 7d945a3aa7 ("logfs get_sb, part 3") broke the logfs build when
CONFIG_MTD is set due to a mangled logfs_get_sb_mtd() definition.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Fix up irq_node() for irq_data changes.
genirq: Add single IRQ reservation helper
genirq: Warn if enable_irq is called before irq is set up
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
semaphore: Remove mutex emulation
staging: Final semaphore cleanup
jbd2: Convert jbd2_slab_create_sem to mutex
hpfs: Convert sbi->hpfs_creation_de to mutex
Fix up trivial change/delete conflicts with deleted 'dream' drivers
(drivers/staging/dream/camera/{mt9d112.c,mt9p012_fox.c,mt9t013.c,s5k3e2fx.c})
This one was only used for a nasty hack in nfsd, which has recently
been removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The caller allocated it, the caller should free it.
The only issue so far is that we could change the flp pointer even on an
error return if the fl_change callback failed. But we can simply move
the flp assignment after the fl_change invocation, as the callers don't
care about the flp return value if the setlease call failed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NFSv4 server was initializing the dp->dl_flock pointer by the
somewhat ridiculous method of a locks_copy_lock callback.
Now that setlease uses the passed-in lock instead of doing a copy,
dl_flock no longer gets set, resulting in the lock leaking on delegation
release, and later possible hangs (among other problems).
So, initialize dl_flock and get rid of the callback.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We modified setlease to require the caller to allocate the new lease in
the case of creating a new lease, but forgot to fix up the filesystem
methods.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're depending on setlease to free the passed-in lease on failure.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Removing a lock shouldn't require any allocations; a failure due to
ENOMEM leaves the caller with a choice between retrying or giving up and
leaking an unused lease.
Next we should split the other lease calls into add and delete cases.
I wanted to start with just the bugfix.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (22 commits)
Ensure FMODE_NONOTIFY is not set by userspace
make fanotify_read() restartable across signals
fsnotify: remove alignment padding from fsnotify_mark on 64 bit builds
fs/notify/fanotify/fanotify_user.c: fix warnings
fanotify: Fix FAN_CLOSE comments
fanotify: do not recalculate the mask if the ignored mask changed
fanotify: ignore events on directories unless specifically requested
fsnotify: rename FS_IN_ISDIR to FS_ISDIR
fanotify: do not send events for irregular files
fanotify: limit number of listeners per user
fanotify: allow userspace to override max marks
fanotify: limit the number of marks in a single fanotify group
fanotify: allow userspace to override max queue depth
fsnotify: implement a default maximum queue depth
fanotify: ignore fanotify ignore marks if open writers
fanotify: allow userspace to flush all marks
fsnotify: call fsnotify_parent in perm events
fsnotify: correctly handle return codes from listeners
fanotify: use __aligned_u64 in fanotify userspace metadata
fanotify: implement fanotify listener ordering
...
In fanotify_read() return -ERESTARTSYS instead of -EINTR to
make read() restartable across signals (BSD semantic).
Signed-off-by: Eric Paris <eparis@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
Btrfs: deal with errors from updating the tree log
Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Btrfs: make SNAP_DESTROY async
Btrfs: add SNAP_CREATE_ASYNC ioctl
Btrfs: add START_SYNC, WAIT_SYNC ioctls
Btrfs: async transaction commit
Btrfs: fix deadlock in btrfs_commit_transaction
Btrfs: fix lockdep warning on clone ioctl
Btrfs: fix clone ioctl where range is adjacent to extent
Btrfs: fix delalloc checks in clone ioctl
Btrfs: drop unused variable in block_alloc_rsv
Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
Btrfs: Use ERR_CAST helpers
Btrfs: use memdup_user helpers
Btrfs: fix raid code for removing missing drives
Btrfs: Switch the extent buffer rbtree into a radix tree
Btrfs: restructure try_release_extent_buffer()
Btrfs: use the flusher threads for delalloc throttling
Btrfs: tune the chunk allocation to 5% of the FS as metadata
...
Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
useless and removed in commit 5e8067adfd: "rcu head remove init")
The btrfs merge looks like hell, because it changes fs-writeback.c, and
the crazy code has this repeated "estimate number of dirty pages"
counting that involves three different helper functions. And it's done
in two different places.
Just unify that whole calculation as a "get_nr_dirty_pages()" helper
function, and the merge result will look half-way decent.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.infradead.org/mtd-2.6: (82 commits)
mtd: fix build error in m25p80.c
mtd: Remove redundant mutex from mtd_blkdevs.c
MTD: Fix wrong check register_blkdev return value
Revert "mtd: cleanup Kconfig dependencies"
mtd: cfi_cmdset_0002: make sector erase command variable
mtd: cfi_cmdset_0002: add CFI detection for SST 38VF640x chips
mtd: cfi_util: add support for switching SST 39VF640xB chips into QRY mode
mtd: cfi_cmdset_0001: use defined value of P_ID_INTEL_PERFORMANCE instead of hardcoded one
block2mtd: dubious assignment
P4080/mtd: Fix the freescale lbc issue with 36bit mode
P4080/eLBC: Make Freescale elbc interrupt common to elbc devices
mtd: phram: use KBUILD_MODNAME
mtd: OneNAND: S5PC110: Fix double call suspend & resume function
mtd: nand: fix MTD_MODE_RAW writes
jffs2: use kmemdup
mtd: sm_ftl: cosmetic, use bool when possible
mtd: r852: remove useless pci powerup/down from suspend/resume routines
mtd: blktrans: fix a race vs kthread_stop
mtd: blktrans: kill BKL
mtd: allow to unload the mtdtrans module if its block devices aren't open
...
Fix up trivial whitespace-introduced conflict in drivers/mtd/mtdchar.c
The definition of PAGE_CACHE_MASK in <linux/pagemap.h> is needed to use
MAX_RW_COUNT, and on x86-64 that gets done indirectly through the
architecture header includes. But on MIPS and s390 that doesn't happen,
and we need to make sure that fs/compat.c includes pagemap.h explicitly.
Introduced in commit 435f49a518 ("readv/writev: do the same
MAX_RW_COUNT truncation that read/write does").
Reported-by: Sachin Sant <sachinp@in.ibm.com> (S390)
Reported-by: wu zhangjin <wuzhangjin@gmail.com> (MIPS)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
drivers/mtd/mtd_blkdevs.c
Merge Grant's device-tree bits so that we can apply the subsequent fixes.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
During unlink we remove any references to the inode from
the tree log. It can return -ENOENT and other errors,
and this changes the unlink code to deal with it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
jbd2_slab_create_sem is used as a mutex, so make it one.
[ akpm muttered: We may as well make it local to
jbd2_journal_create_slab() also. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <alpine.LFD.2.00.1010162231480.2496@localhost6.localdomain6>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
sbi->hpfs_creation_de is used as mutex so make it a mutex.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
LKML-Reference: <20100907125056.228874895@linutronix.de>
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2). We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.
If users _do_ want the deletion to be durable, they can call 'sync'.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Create a snap without waiting for it to commit to disk. The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.
We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
eCryptfs: Print mount_auth_tok_only param in ecryptfs_show_options
ecryptfs: added ecryptfs_mount_auth_tok_only mount parameter
ecryptfs: checking return code of ecryptfs_find_auth_tok_for_sig()
ecryptfs: release keys loaded in ecryptfs_keyring_auth_tok_for_sig()
eCryptfs: Clear LOOKUP_OPEN flag when creating lower file
ecryptfs: call vfs_setxattr() in ecryptfs_setxattr()
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk. This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.
The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers > 1 or should_grow at the top of the loop. Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true. However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.
Fix this by deciding how long to wait when we wait. Include a smp_mb()
before checking if the waitqueue is active to ensure the num_writers
is visible.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.
Still needs more review.
Found by gcc 4.6's new warnings
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not
read which are really bugs.
- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things
Still needs more review.
Found by gcc 4.6's new warnings.
[akpm@linux-foundation.org: fix build. Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: Cleanup and thus reduce smb session structure and fields used during authentication
NTLM auth and sign - Use appropriate server challenge
cifs: add kfree() on error path
NTLM auth and sign - minor error corrections and cleanup
NTLM auth and sign - Use kernel crypto apis to calculate hashes and smb signatures
NTLM auth and sign - Define crypto hash functions and create and send keys needed for key exchange
cifs: cifs_convert_address() returns zero on error
NTLM auth and sign - Allocate session key/client response dynamically
cifs: update comments - [s/GlobalSMBSesLock/cifs_file_list_lock/g]
cifs: eliminate cifsInodeInfo->write_behind_rc (try #6)
[CIFS] Fix checkpatch warnings and bump cifs version number
cifs: wait for writeback to complete in cifs_flush
cifs: convert cifsFileInfo->count to non-atomic counter
We used to protect against overflow, but rather than return an error, do
what read/write does, namely to limit the total size to MAX_RW_COUNT.
This is not only more consistent, but it also means that any broken
low-level read/write routine that still keeps counts in 'int' can't
break.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a new mount parameter 'ecryptfs_mount_auth_tok_only' to
force ecryptfs to use only authentication tokens which signature has
been specified at mount time with parameters 'ecryptfs_sig' and
'ecryptfs_fnek_sig'. In this way, after disabling the passthrough and
the encrypted view modes, it's possible to make available to users only
files encrypted with the specified authentication token.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
[Tyler: Clean up coding style errors found by checkpatch]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch replaces the check of the 'matching_auth_tok' pointer with
the exit status of ecryptfs_find_auth_tok_for_sig().
This avoids to use authentication tokens obtained through the function
ecryptfs_keyring_auth_tok_for_sig which are not valid.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch allows keys requested in the function
ecryptfs_keyring_auth_tok_for_sig()to be released when they are no
longer required. In particular keys are directly released in the same
function if the obtained authentication token is not valid.
Further, a new function parameter 'auth_tok_key' has been added to
ecryptfs_find_auth_tok_for_sig() in order to provide callers the key
pointer to be passed to key_put().
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
[Tyler: Initialize auth_tok_key to NULL in ecryptfs_parse_packet_set]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
eCryptfs was passing the LOOKUP_OPEN flag through to the lower file
system, even though ecryptfs_create() doesn't support the flag. A valid
filp for the lower filesystem could be returned in the nameidata if the
lower file system's create() function supported LOOKUP_OPEN, possibly
resulting in unencrypted writes to the lower file.
However, this is only a potential problem in filesystems (FUSE, NFS,
CIFS, CEPH, 9p) that eCryptfs isn't known to support today.
https://bugs.launchpad.net/ecryptfs/+bug/641703
Reported-by: Kevin Buhr
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Ecryptfs is a stackable filesystem which relies on lower filesystems the
ability of setting/getting extended attributes.
If there is a security module enabled on the system it updates the
'security' field of inodes according to the owned extended attribute set
with the function vfs_setxattr(). When this function is performed on a
ecryptfs filesystem the 'security' field is not updated for the lower
filesystem since the call security_inode_post_setxattr() is missing for
the lower inode.
Further, the call security_inode_setxattr() is missing for the lower inode,
leading to policy violations in the security module because specific
checks for this hook are not performed (i. e. filesystem
'associate' permission on SELinux is not checked for the lower filesystem).
This patch replaces the call of the setxattr() method of the lower inode
in the function ecryptfs_setxattr() with vfs_setxattr().
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Cc: stable <stable@kernel.org>
Cc: Dustin Kirkland <kirkland@canonical.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
When btrfs is mounted in degraded mode, it has some internal structures
to track the missing devices. This missing device is setup as readonly,
but the mapping code can get upset when we try to write to it.
This changes the mapping code to return -EIO instead of oops when we try
to write to the readonly device.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.
I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].
Before applying this patch:
Create files:
Total files: 50000
Total time: 0.971531
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.366761
Average time: 0.000027
After applying this patch:
Create files:
Total files: 50000
Total time: 0.927455
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.292280
Average time: 0.000026
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.
This switches us to kick the bdi flusher threads instead. All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
An earlier commit tried to keep us from allocating too many
empty metadata chunks. It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.
This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs is running low on metadata space, it needs to force delayed
allocation pages to disk. It currently does this with a suboptimal walk
of a private list of inodes with delayed allocation, and it would be
much better if we used the generic flusher threads.
writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
thread to start IO on all the dirty pages in the FS before it returns.
This adds variants of writeback_inodes_sb* that allow the caller to
control how many pages get sent down.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.
This changes things to properly kick out the EIO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.
Signed-off-by: Josef Bacik <josef@redhat.com>
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody. When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups. Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them. Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only. Tested this with xfstests and it works
nicely. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl. This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw). So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch actually loads the free space cache if it exists. The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it. Previously we did not do this since the caching
kthread would pick it up. With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group. This has been tested with all sorts of things.
Signed-off-by: Josef Bacik <josef@redhat.com>
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups. There are a bunch of changes
in inode.c in order to account for special cases. Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions. Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock. This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
a) switch ->put_device() to logfs_super *
b) actually call it on early failures in logfs_get_sb_device()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
take setting s_bdev/s_mtd/s_devops to callers of logfs_get_sb_device(),
don't bother passing them separately
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
eventual replacement for ->get_sb() - does *not* get vfsmount,
return ERR_PTR(error) or root of subtree to be mounted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
nameidata_to_filp() drops nd->path or transfers it to opened
file. In the former case it's a Bad Idea(tm) to do mnt_drop_write()
on nd->path.mnt, since we might race with umount and vfsmount in
question might be gone already.
Fix: don't drop it, then... IOW, have nameidata_to_filp() grab nd->path
in case it transfers it to file and do path_drop() in callers. After
they are through with accessing nd->path...
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Removed following fields from smb session structure
cryptkey, ntlmv2_hash, tilen, tiblob
and ntlmssp_auth structure is allocated dynamically only if the auth mech
in NTLMSSP.
response field within a session_key structure is used to initially store the
target info (either plucked from type 2 challenge packet in case of NTLMSSP
or fabricated in case of NTLMv2 without extended security) and then to store
Message Authentication Key (mak) (session key + client response).
Server challenge or cryptkey needed during a NTLMSSP authentication
is now part of ntlmssp_auth structure which gets allocated and freed
once authenticaiton process is done.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Need to have cryptkey or server challenge in smb connection
(struct TCP_Server_Info) for ntlm and ntlmv2 auth types for which
cryptkey (Encryption Key) is supplied just once in Negotiate Protocol
response during an smb connection setup for all the smb sessions over
that smb connection.
For ntlmssp, cryptkey or server challenge is provided for every
smb session in type 2 packet of ntlmssp negotiation, the cryptkey
provided during Negotiation Protocol response before smb connection
does not count.
Rename cryptKey to cryptkey and related changes.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
nfs4: The difference of 2 pointers is ptrdiff_t
nfs: testing the wrong variable
nfs: handle lock context allocation failures in nfs_create_request
Fixed Regression in NFS Direct I/O path
We need to make check if a page does not have buffes by checking
page_has_buffers(page) before calling page_buffers(page) in
ext4_writepage(). Otherwise page_buffers() could throw a BUG_ON.
Thanks also to Markus Trippelsdorf and Avinash Kurup who also reported
the problem.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Sedat Dilek <sedat.dilek@googlemail.com>
Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
fs/notify/fanotify/fanotify_user.c: In function 'fanotify_release':
fs/notify/fanotify/fanotify_user.c:375: warning: unused variable 'lre'
fs/notify/fanotify/fanotify_user.c:375: warning: unused variable 're'
this is really ugly.
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
If fanotify sets a new bit in the ignored mask it will cause the generic
fsnotify layer to recalculate the real mask. This is stupid since we
didn't change that part.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify has a very limited number of events it sends on directories. The
usefulness of these events is yet to be seen and still we send them. This
is particularly painful for mount marks where one might receive many of
these useless events. As such this patch will drop events on IS_DIR()
inodes unless they were explictly requested with FAN_ON_DIR.
This means that a mark on a directory without FAN_EVENT_ON_CHILD or
FAN_ON_DIR is meaningless and will result in no events ever (although it
will still be allowed since detecting it is hard)
Signed-off-by: Eric Paris <eparis@redhat.com>
The _IN_ in the naming is reserved for flags only used by inotify. Since I
am about to use this flag for fanotify rename it to be generic like the
rest.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify_should_send_event has a test to see if an object is a file or
directory and does not send an event otherwise. The problem is that the
test is actually checking if the object with a mark is a file or directory,
not if the object the event happened on is a file or directory. We should
check the latter.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify currently has no limit on the number of listeners a given user can
have open. This patch limits the total number of listeners per user to
128. This is the same as the inotify default limit.
Signed-off-by: Eric Paris <eparis@redhat.com>
Some fanotify groups, especially those like AV scanners, will need to place
lots of marks, particularly ignore marks. Since ignore marks do not pin
inodes in cache and are cleared if the inode is removed from core (usually
under memory pressure) we expose an interface for listeners, with
CAP_SYS_ADMIN, to override the maximum number of marks and be allowed to
set and 'unlimited' number of marks. Programs which make use of this
feature will be able to OOM a machine.
Signed-off-by: Eric Paris <eparis@redhat.com>
There is currently no limit on the number of marks a given fanotify group
can have. Since fanotify is gated on CAP_SYS_ADMIN this was not seen as
a serious DoS threat. This patch implements a default of 8192, the same as
inotify to work towards removing the CAP_SYS_ADMIN gating and eliminating
the default DoS'able status.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify has a defualt max queue depth. This patch allows processes which
explicitly request it to have an 'unlimited' queue depth. These processes
need to be very careful to make sure they cannot fall far enough behind
that they OOM the box. Thus this flag is gated on CAP_SYS_ADMIN.
Signed-off-by: Eric Paris <eparis@redhat.com>
Currently fanotify has no maximum queue depth. Since fanotify is
CAP_SYS_ADMIN only this does not pose a normal user DoS issue, but it
certianly is possible that an fanotify listener which can't keep up could
OOM the box. This patch implements a default 16k depth. This is the same
default depth used by inotify, but given fanotify's better queue merging in
many situations this queue will contain many additional useful events by
comparison.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify will clear ignore marks if a task changes the contents of an
inode. The problem is with the races around when userspace finishes
checking a file and when that result is actually attached to the inode.
This race was described as such:
Consider the following scenario with hostile processes A and B, and
victim process C:
1. Process A opens new file for writing. File check request is generated.
2. File check is performed in userspace. Check result is "file has no malware".
3. The "permit" response is delivered to kernel space.
4. File ignored mark set.
5. Process A writes dummy bytes to the file. File ignored flags are cleared.
6. Process B opens the same file for reading. File check request is generated.
7. File check is performed in userspace. Check result is "file has no malware".
8. Process A writes malware bytes to the file. There is no cached response yet.
9. The "permit" response is delivered to kernel space and is cached in fanotify.
10. File ignored mark set.
11. Now any process C will be permitted to open the malware file.
There is a race between steps 8 and 10
While fanotify makes no strong guarantees about systems with hostile
processes there is no reason we cannot harden against this race. We do
that by simply ignoring any ignore marks if the inode has open writers (aka
i_writecount > 0). (We actually do not ignore ignore marks if the
FAN_MARK_SURV_MODIFY flag is set)
Reported-by: Vasily Novikov <vasily.novikov@kaspersky.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
fsnotify perm events do not call fsnotify parent. That means you cannot
register a perm event on a directory and enforce permissions on all inodes in
that directory. This patch fixes that situation.
Signed-off-by: Eric Paris <eparis@redhat.com>
When fsnotify groups return errors they are ignored. For permissions
events these should be passed back up the stack, but for most events these
should continue to be ignored.
Signed-off-by: Eric Paris <eparis@redhat.com>
The fanotify listeners needs to be able to specify what types of operations
they are going to perform so they can be ordered appropriately between other
listeners doing other types of operations. They need this to be able to make
sure that things like hierarchichal storage managers will get access to inodes
before processes which need the data. This patch defines 3 possible uses
which groups must indicate in the fanotify_init() flags.
FAN_CLASS_PRE_CONTENT
FAN_CLASS_CONTENT
FAN_CLASS_NOTIF
Groups will receive notification in that order. The order between 2 groups in
the same class is undeterministic.
FAN_CLASS_PRE_CONTENT is intended to be used by listeners which need access to
the inode before they are certain that the inode contains it's final data. A
hierarchical storage manager should choose to use this class.
FAN_CLASS_CONTENT is intended to be used by listeners which need access to the
inode after it contains its intended contents. This would be the appropriate
level for an AV solution or document control system.
FAN_CLASS_NOTIF is intended for normal async notification about access, much the
same as inotify and dnotify. Syncronous permissions events are not permitted
at this class.
Signed-off-by: Eric Paris <eparis@redhat.com>
fanotify needs to be able to specify that some groups get events before
others. They use this idea to make sure that a hierarchical storage
manager gets access to files before programs which actually use them. This
is purely infrastructure. Everything will have a priority of 0, but the
infrastructure will exist for it to be non-zero.
Signed-off-by: Eric Paris <eparis@redhat.com>
We disabled the ability to build fanotify in commit 7c5347733d.
This reverts that commit and allows people to build fanotify.
Signed-off-by: Eric Paris <eparis@redhat.com>
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group. So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have. We
truncate and pre-allocate space everytime to make sure it's uptodate.
This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way. When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.
Signed-off-by: Josef Bacik <josef@redhat.com>
On m68k, which is 32-bit:
fs/nfs/nfs4proc.c: In function ‘nfs41_sequence_done’:
fs/nfs/nfs4proc.c:432: warning: format ‘%ld’ expects type ‘long int’, but argument 3 has type ‘int’
fs/nfs/nfs4proc.c: In function ‘nfs4_setup_sequence’:
fs/nfs/nfs4proc.c:576: warning: format ‘%ld’ expects type ‘long int’, but argument 5 has type ‘int’
On 32-bit, ptrdiff_t is int; on 64-bit, ptrdiff_t is long.
Introduced by commit dfb4f30983 ("NFSv4.1: keep
seq_res.sr_slot as pointer rather than an index")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This merges the staging-next tree to Linus's tree and resolves
some conflicts that were present due to changes in other trees that were
affected by files here.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The fourth argument should be unsigned. Also add missing include
so that the function prototype is defined in xattr_id.c
This fixes a couple of sparse warnings.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: free space correcly for files unlinked while open
hfsplus: fix double lock typo in ioctl
Commit 5dabfc78dc ("ext4: rename {exit,init}_ext4_*() to
ext4_{exit,init}_*()") causes
fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’
when CONFIG_EXT4_FS_XATTR is disabled.
It renamed init_ext4_xattr to ext4_init_xattr but forgot to update the
dummy definition in fs/ext4/xattr.h.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The intent was to test "*desc" for allocation failures, but it tests
"desc" which is always a valid pointer here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_get_lock_context can return NULL on an allocation failure.
Regression introduced by commit f11ac8db.
Reported-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A typo, introduced by commit f11ac8db, in the nfs_direct_write()
routine causes writes with O_DIRECT set to fail with a ENOMEM error.
Found-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve Dickson <steved@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
SYNOPSIS
size[4] Tfsync tag[2] fid[4] datasync[4]
size[4] Rfsync tag[2]
DESCRIPTION
The Tfsync transaction transfers ("flushes") all modified in-core data of
file identified by fid to the disk device (or other permanent storage
device) where that file resides.
If datasync flag is specified data will be fleshed but does not flush
modified metadata unless that metadata is needed in order to allow a
subsequent data retrieval to be correctly handled.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
We need to do O_LARGEFILE check even in case of 9p. Use the
generic_file_open helper
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Make sure we drop inode reference in the error path
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
A create without LOOKUP_OPEN flag set is due to mknod of regular
files. Use mknod 9P operation for the same
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Synopsis
size[4] TReadlink tag[2] fid[4]
size[4] RReadlink tag[2] target[s]
Description
Readlink is used to return the contents of the symoblic link
referred by fid. Contents of symboic link is returned as a
response.
target[s] - Contents of the symbolic link referred by fid.
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Use V9FS_MAGIC as the file system type while filling kernel statfs
strucutre instead of using host file system magic number. Also move
the definition of V9FS_MAGIC from v9fs.h to standard magic.h file.
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Synopsis
size[4] TGetlock tag[2] fid[4] getlock[n]
size[4] RGetlock tag[2] getlock[n]
Description
TGetlock is used to test for the existence of byte range posix locks on a file
identified by given fid. The reply contains getlock structure. If the lock could
be placed it returns F_UNLCK in type field of getlock structure. Otherwise it
returns the details of the conflicting locks in the getlock structure
getlock structure:
type[1] - Type of lock: F_RDLCK, F_WRLCK
start[8] - Starting offset for lock
length[8] - Number of bytes to check for the lock
If length is 0, check for lock in all bytes starting at the location
'start' through to the end of file
pid[4] - PID of the process that wants to take lock/owns the task
in case of reply
client[4] - Client id of the system that owns the process which
has the conflicting lock
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Synopsis
size[4] TLock tag[2] fid[4] flock[n]
size[4] RLock tag[2] status[1]
Description
Tlock is used to acquire/release byte range posix locks on a file
identified by given fid. The reply contains status of the lock request
flock structure:
type[1] - Type of lock: F_RDLCK, F_WRLCK, F_UNLCK
flags[4] - Flags could be either of
P9_LOCK_FLAGS_BLOCK - Blocked lock request, if there is a
conflicting lock exists, wait for that lock to be released.
P9_LOCK_FLAGS_RECLAIM - Reclaim lock request, used when client is
trying to reclaim a lock after a server restrart (due to crash)
start[8] - Starting offset for lock
length[8] - Number of bytes to lock
If length is 0, lock all bytes starting at the location 'start'
through to the end of file
pid[4] - PID of the process that wants to take lock
client_id[4] - Unique client id
status[1] - Status of the lock request, can be
P9_LOCK_SUCCESS(0), P9_LOCK_BLOCKED(1), P9_LOCK_ERROR(2) or
P9_LOCK_GRACE(3)
P9_LOCK_SUCCESS - Request was successful
P9_LOCK_BLOCKED - A conflicting lock is held by another process
P9_LOCK_ERROR - Error while processing the lock request
P9_LOCK_GRACE - Server is in grace period, it can't accept new lock
requests in this period (except locks with
P9_LOCK_FLAGS_RECLAIM flag set)
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
SYNOPSIS
size[4] Tfsync tag[2] fid[4]
size[4] Rfsync tag[2]
DESCRIPTION
The Tfsync transaction transfers ("flushes") all modified in-core data of
file identified by fid to the disk device (or other permanent storage
device) where that file resides.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
We need update the acl value on chmod
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch also update mode bits, as a normal file system.
I am not sure wether we should do that, considering that
a setxattr on the server will again update the ACL/mode value
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This patch implement fetching POSIX ACL from the server
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The ACL value is fetched as a part of inode initialization
from the server and the permission checking function use the
cached value of the ACL
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
the same calculation is done in p9_client_write
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The presence of v9fs_direct_IO() in the address space ops vector
allowes open() O_DIRECT flags which would have failed otherwise.
In the non-cached mode, we shunt off direct read and write requests before
the VFS gets them, so this method should never be called.
Direct IO is not 'yet' supported in the cached mode. Hence when
this routine is called through generic_file_aio_read(), the read/write fails
with an error.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The current implementation of 9p client mkdir function does not
set the S_ISGID mode bit for the directory being created if the
parent directory has this bit set. This patch fixes this problem
so that the newly created directory inherits the gid from parent
directory and not from the process creating this directory, when
the S_ISGID bit is set in parent directory.
Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
A patch was accepted recently for sending correct buffer size to p9stat_read.
We need a similar patch in v9fs_dir_readdir_dotl to send correct end of buffer
to p9dirent_read.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Current 9p client file write code does not check for RLIMIT_FSIZE resource.
This bug was found by running LTP test case for setrlimit. This bug is fixed
by calling generic_write_checks before sending the write request to the
server.
Without this patch: the write function is allowed to write above the
RLIMIT_FSIZE set by user.
With this patch: the write function checks for RLIMIT_SIZE and writes upto
the size limit.
Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
git_t is unsigned an can never be less than zero.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
ext4,jbd2: convert tracepoints to use major/minor numbers
ext4: optimize orphan_list handling for ext4_setattr
ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
ext4: fix compile error in ext4_fallocate()
ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static
ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()
ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static
ext4: rename {ext,idx}_pblock and inline small extent functions
ext4: make various ext4 functions be static
ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
ext4: fix kernel oops if the journal superblock has a non-zero j_errno
ext4: update writeback_index based on last page scanned
ext4: implement writeback livelock avoidance using page tagging
ext4: tidy up a void argument in inode.c
ext4: add batched_discard into ext4 feature list
ext4: Add batched discard support for ext4
fs: Add FITRIM ioctl
ext4: Use return value from sb_issue_discard()
ext4: Check return value of sb_getblk() and friends
ext4: use bio layer instead of buffer layer in mpage_da_submit_io
...
This reverts commit d91f2438d8.
The intent of issue_seq is to distinguish between mds->client messages that
(re)create the cap and those that do not, which means we should _only_ be
updating that value in the create paths. By updating it in handle_cap_grant,
we reset it to zero, which then breaks release.
The larger question is what workload/problem made me think it should be
updated here...
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits)
quota: Fix possible oops in __dquot_initialize()
ext3: Update kernel-doc comments
jbd/2: fixed typos
ext2: fixed typo.
ext3: Fix debug messages in ext3_group_extend()
jbd: Convert atomic_inc() to get_bh()
ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate()
jbd: Fix debug message in do_get_write_access()
jbd: Check return value of __getblk()
ext3: Use DIV_ROUND_UP() on group desc block counting
ext3: Return proper error code on ext3_fill_super()
ext3: Remove unnecessary casts on bh->b_data
ext3: Cleanup ext3_setup_super()
quota: Fix issuing of warnings from dquot_transfer
quota: fix dquot_disable vs dquot_transfer race v2
jbd: Convert bitops to buffer fns
ext3/jbd: Avoid WARN() messages when failing to write the superblock
jbd: Use offset_in_page() instead of manual calculation
jbd: Remove unnecessary goto statement
jbd: Use printk_ratelimited() in journal_alloc_journal_head()
...
Surprisingly chown() on ext4 is not SMP scalable operation.
Due to unconditional orphan_del(NULL, inode) in ext4_setattr()
result in significant performance overhead because of global orphan
mutex, especially in no-journal mode (where orphan_add() is noop).
It is possible to skip explicit orphan_del if possible.
Results of fchown() micro-benchmark in no-journal mode
while (1) {
iteration++;
fchown(fd, uid, gid);
fchown(fd, uid + 1, gid + 1)
}
measured: iterations per millisecond
| nr_tasks | w/o patch | with patch |
| 1 | 142 | 185 |
| 4 | 109 | 642 |
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* akpm-incoming-2: (139 commits)
epoll: make epoll_wait() use the hrtimer range feature
select: rename estimate_accuracy() to select_estimate_accuracy()
Remove duplicate includes from many files
ramoops: use the platform data structure instead of module params
kernel/resource.c: handle reinsertion of an already-inserted resource
kfifo: fix kfifo_alloc() to return a signed int value
w1: don't allow arbitrary users to remove w1 devices
alpha: remove dma64_addr_t usage
mips: remove dma64_addr_t usage
sparc: remove dma64_addr_t usage
fuse: use release_pages()
taskstats: use real microsecond granularity for CPU times
taskstats: split fill_pid function
taskstats: separate taskstats commands
delayacct: align to 8 byte boundary on 64-bit systems
delay-accounting: reimplement -c for getdelays.c to report information on a target command
namespaces Kconfig: move namespace menu location after the cgroup
namespaces Kconfig: remove the cgroup device whitelist experimental tag
namespaces Kconfig: remove pointless cgroup dependency
namespaces Kconfig: make namespace a submenu
...
When I compiled 2.6.36-rc3 kernel with EXT4FS_DEBUG definition, I got
the following compile error.
CC [M] fs/ext4/extents.o
fs/ext4/extents.c: In function 'ext4_fallocate':
fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function)
fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once
fs/ext4/extents.c:3772: error: for each function it appears in.)
make[2]: *** [fs/ext4/extents.o] Error 1
The patch fixes this problem.
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
These functions are only used within fs/ext4/mballoc.c, so move them
so they are used after they are defined, and then make them be static.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cleanup namespace leaks from fs/ext4 and the inline trivial functions
ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock() since the
code size actually shrinks when we make these functions inline,
they're so trivial.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
These functions have no need to be exported beyond file context.
No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.
(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit 84061e0 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transfered to the file
system and then ext4_commit_super() is called to write this to the
disk.
But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters the ext4 superblock structure.
The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.
Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.
Addresses-Google-Bug: #3054080
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As pointed out in a prior patch, updating the mapping's
writeback_index based on pages written isn't quite right;
what the writeback index is really supposed to reflect is
the next page which should be scanned for writeback during
periodic flush.
As in write_cache_pages(), write_cache_pages_da() does
this scanning for us as we assemble the mpd for later
writeout. If we keep track of the next page after the
current scan, we can easily update writeback_index without
worrying about pages written vs. pages skipped, etc.
Without this, an fsync will reset writeback_index to
0 (its starting index) + however many pages it wrote, which
can mess up the progress of periodic flush.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This is analogous to Jan Kara's commit,
f446daaea9
mm: implement writeback livelock avoidance using page tagging
but since we forked write_cache_pages, we need to reimplement
it there (and in ext4_da_writepages, since range_cyclic handling
was moved to there)
If you start a large buffered IO to a file, and then set
fsync after it, you'll find that fsync does not complete
until the other IO stops.
If you continue re-dirtying the file (say, putting dd
with conv=notrunc in a loop), when fsync finally completes
(after all IO is done), it reports via tracing that
it has written many more pages than the file contains;
in other words it has synced and re-synced pages in
the file multiple times.
This then leads to problems with our writeback_index
update, since it advances it by pages written, and
essentially sets writeback_index off the end of the
file...
With the following patch, we only sync as much as was
dirty at the time of the sync.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This doesn't fix anything at all, it just removes a vestige
of prior use from __mpage_da_writepage()
__mpage_da_writepage() had a *void argument leftover from
its previous life as a callback; make it reflect the actual type.
Fixing this up makes it slightly more obvious to read, and
enables proper typechecking.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Should be applied on the top of "lazy inode table initialization"
and "batched discard support" patch-sets.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).
It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks
are marked as used and then trimmed. Afterwards these blocks are marked
as free in per-group bitmap.
Since fstrim is a long operation it is good to have an ability to
interrupt it by a signal. This was added by Dmitry Monakhov.
Thanks Dimitry.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Adds an filesystem independent ioctl to allow implementation of file
system batched discard support. I takes fstrim_range structure as an
argument. fstrim_range is definec in the include/fs.h and its
definition is as follows.
struct fstrim_range {
start;
len;
minlen;
}
start - first Byte to trim
len - number of Bytes to trim from start
minlen - minimum extent length to trim, free extents shorter than this
number of Bytes will be ignored. This will be rounded up to fs
block size.
It is also possible to specify NULL as an argument. In this case the
arguments will set itself as follows:
start = 0;
len = ULLONG_MAX;
minlen = 0;
So it will trim the whole file system at one run.
After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use return value from sb_issue_discard() as return value in
ext4_issue_discard(). Since sb_issue_discard() may result in more
serious errors than just -EOPNOTSUPP it is worth to inform user of this
function about them to handle error cases properly.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fail block allocation if sb_getblk() returns NULL. In that case,
sb_find_get_block() also likely to fail so that it should skip
calling ext4_forget().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Call the block I/O layer directly instad of going through the buffer
layer. This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This massively simplifies the ext4_da_writepages() code path by
completely removing mpage_put_bnr_bhs(), which is almost 100 lines of
code iterating over a set of pages using pagevec_lookup(), and folds
that functionality into mpage_da_submit_io()'s existing
pagevec_lookup() loop.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Expand the call:
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
ext4_bh_delay_or_unwritten))
goto redirty_page
into mpage_da_submit_io().
This will allow us to merge in mpage_put_bnr_to_bhs() in the next
patch.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As a prepratory step to switching to bio_submit, inline
ext4_writepage() into mpage_da_submit() and then simplify things a
bit. This makes it clearer what mpage_da_submit needs to do.
Also, move the ClearPageChecked(page) call into
__ext4_journalled_writepage(), as a minor bit of cleanup refactoring.
This also allows us to pull i_size_read() and
ext4_should_journal_data() out of the loop, which should be a very
minor CPU savings.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The actual code in ext4_writepage() is unnecessarily convoluted.
Simplify it so it is easier to understand, but otherwise logically
equivalent.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Eventually we need to completely reorganize the ext4 writepage
callpath, but for now, we simplify things a little by calling
mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
places where we call mpage_da_map_blocks() it is followed up by a call
to mpage_da_submit_io().
We're also a wee bit better with respect to error handling, but there
are still a number of issues where it's not clear what the right thing
is to do with ext4 functions deep in the writeback codepath fails.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
cache. This slab tends to be fairly static, so it shouldn't be marked
as likely to have free pages that can be reclaimed.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use the search_dirblock() in ext4_dx_find_entry(). It makes the code
easier to read, and it takes advantage of common code. It also saves
100 bytes or so of text space.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
If the first block of htree directory is missing '.' or '..' but is
otherwise a valid directory, and we do a lookup for '.' or '..', it's
possible to dereference an uninitialized memory pointer in
ext4_htree_next_block().
We avoid this by moving the special case from ext4_dx_find_entry() to
ext4_find_entry(); this also means we can optimize ext4_find_entry()
slightly when NFS looks up "..".
Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code. The warning was harmless, but it was
useful in pointing out code that was too ugly to live. This warning was
also reported by Roman Borisov.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
Not that these take up a lot of room, but the structure is long enough
as it is, and there's no need to confuse people with these various
undocumented & unused structure members...
Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
By queuing the io end on the unwritten workqueue before adding it
to our inode's list of completed IOs, I think we run the risk
of the work getting completed, and the IO freed, before we try
to add it to the inode's i_completed_io_list.
It should be safe to add it to the inode's list of completed
IOs, and -then- queue it for completion, I think.
Thanks to Dave Chinner for pointing out the race.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off. In fact, 4 of 5 of these allocations
were only for tracing. In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.
We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.
I tested this by turning on tracing and running through
xfstests on x86_64. I did not actually do anything with
the trace output, however.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
On linux-2.6.36-rc2, if we execute the following script, we can hang
the system when the /bin/sync command is executed:
========================================================================
#!/bin/sh
echo -n "HANG UP TEST: "
/bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
/sbin/mkfs.ext4 -Fq /tmp/img
/bin/mount -o loop -t ext4 /tmp/img /mnt
/bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
/bin/sync
/bin/umount /mnt
echo "DONE"
exit 0
========================================================================
We can see the following backtrace if we get the kdump when this
hangup occurs:
======================================================================
kthread()
=> bdi_writeback_thread()
=> wb_do_writeback()
=> wb_writeback()
=> writeback_inodes_wb()
=> writeback_sb_inodes()
=> writeback_single_inode()
=> ext4_da_writepages() ---+
^ infinite |
| loop |
+-------------+
======================================================================
The reason why this hangup happens is described as follows:
1) We write the last extent block of the file whose size is the filesystem
maximum size.
2) "BH_Delay" flag is set on the buffer_head of its block.
3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
- the member, "m_len" of struct mpage_da_data is 1
mpage_put_bnr_to_bhs() which is called via ext4_da_writepages()
cannot clear "BH_Delay" flag of the buffer_head because the type of
m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow.
Therefore an infinite loop occurs because ext4_da_writepages()
cannot write the page (which corresponds to the block) since
"BH_Delay" flag isn't cleared.
----------------------------------------------------------------------
static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
struct ext4_map_blocks *map)
{
...
int blocks = map->m_len;
...
do {
// cur_logical = 4294967295
// map->m_lblk = 4294967295
// blocks = 1
// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
// (cur_logical >= map->m_lblk + blocks) => true
if (cur_logical >= map->m_lblk + blocks)
break;
----------------------------------------------------------------------
NOTE: Mounting with the nodelalloc option will avoid this codepath,
and thus, avoid this hang
Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The llseek system call should return EINVAL if passed a seek offset
which results in a write error. What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.
If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be
written (write systemcall) is different from the maximum size which can be
sought (lseek systemcall).
For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:
#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>
Table. the max file size which we can write or seek
at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag| | |
| \ | !EXT4_EXTENTS_FL | EXT4_EXTETNS_FL |
|case \| | |
+------------+-------------------------------+-------------------------------+
| #1 | write: 2194719883264 | write: -------------- |
| | seek: 2199023251456 | seek: -------------- |
+------------+-------------------------------+-------------------------------+
| #2 | write: 4402345721856 | write: 17592186044415 |
| | seek: 17592186044415 | seek: 17592186044415 |
+------------+-------------------------------+-------------------------------+
The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped
maxbytes). Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)
Therefore we create ext4 llseek function which uses 2 maxbytes.
The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.
Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
An ext4 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).
For example:
- ext4 on a software raid which is in read-only mode
- external journal on a read-write device which has changed device num
- attempt to mount with -o journal_dev=<new_number>
- hits BUG_ON(mddev->ro = 1) in md.c
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
own approach to zero out extents.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of old approach which involves journaling.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
User-space should have the opportunity to check what features doest ext4
support in each particular copy. This adds easy interface by creating new
"features" directory in sys/fs/ext4/. In that directory files
advertising feature names can be created.
Add lazy_itable_init to the feature list.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possble. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).
We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppresed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
An attempt to modify the file system during the call to
jbd2_destroy_journal() can lead to a system lockup. So add some
checking to make it much more obvious when this happens to and to
determine where the offending code is located.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We can't hold the block group spinlock because we ext4_issue_discard()
calls wait and hence can get rescheduled.
Google-Bug-Id: 3017678
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I'm uneasy with lots of stuff going on in ext4_da_writepages(),
but bumping nr_to_write from LLONG_MAX to -8 clearly isn't
making anything better, so avoid the multiplier in that case.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Today we simply break out of the inner loop when we have accumulated
max_pages; this keeps scanning forwad and doing pagevec_lookup_tag()
in the while (!done) loop, this does potentially a lot of work
with no net effect.
When we have accumulated max_pages, just clean up and return.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_group_info structures are currently allocated with kmalloc().
With a typical 4K block size, these are 136 bytes each -- meaning
they'll each consume a 256-byte slab object. On a system with many
ext4 large partitions, that's a lot of wasted kernel slab space.
(E.g., a single 1TB partition will have about 8000 block groups, using
about 2MB of slab, of which nearly 1MB is wasted.)
This patch creates an array of slab pointers created as needed --
depending on the superblock block size -- and uses these slabs to
allocate the group info objects.
Google-Bug-Id: 2980809
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This fixes a hang seen in jbd2_journal_release_jbd_inode
on a lot of Power 6 systems running with ext4. When we get
in the hung state, all I/O to the disk in question gets blocked
where we stay indefinitely. Looking at the task list, I can see
we are stuck in jbd2_journal_release_jbd_inode waiting on a
wake up. I added some debug code to detect this scenario and
dump additional data if we were stuck in jbd2_journal_release_jbd_inode
for longer than 30 minutes. When it hit, I was able to see that
i_flags was 0, suggesting we missed the wake up.
This patch changes i_flags to be an unsigned long, uses bit operators
to access it, and adds barriers around the accesses. Prior to applying
this patch, we were regularly hitting this hang on numerous systems
in our test environment. After applying the patch, the hangs no longer
occur.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It turns out we have several problems with how EOFBLOCKS_FL is
handled. First of all, there was a fencepost error where we were not
clearing the EOFBLOCKS_FL when fill in the last uninitialized block,
but rather when we allocate the next block _after_ the uninitalized
block. Secondly we were not testing to see if we needed to clear the
EOFBLOCKS_FL when writing to the file O_DIRECT or when were converting
an uninitialized block (which is the most common case).
Google-Bug-Id: 2928259
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In commit f7347ce4ee ("fasync: re-organize fasync entry insertion to
allow it under a spinlock") Arnd took an earlier patch of mine that had
the comment about the FASYNC flag above the wrong function.
When the fasync_add_entry() function was split to introduce the new
fasync_insert_entry() helper function, the code that actually cares
about the FASYNC bit moved to that new helper.
So just move the comment to the right point.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
locks: turn lock_flocks into a spinlock
fasync: re-organize fasync entry insertion to allow it under a spinlock
locks/nfsd: allocate file lock outside of spinlock
lockd: fix nlmsvc_notify_blocked locking
lockd: push lock_flocks down
This make epoll use hrtimers for the timeout value which prevents
epoll_wait() from timing out up to a millisecond early.
This mirrors the behavior of select() and poll().
Signed-off-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it a subsystem-specific identifier because we wish to amke it
non-static in the next patch ("epoll: make epoll_wait() use the hrtimer
range feature").
Cc: Shawn Bohrer <shawn.bohrer@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace iterated page_cache_release() with release_pages(), which is
faster and shorter.
Needs release_pages() to be exported to modules.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently do_execve() turns PF_KTHREAD off before search_binary_handler().
THis has a theorical risk of PF_KTHREAD getting lost. We don't have to
turn PF_KTHREAD off in the ENOEXEC case.
This patch moves this flag modification to after the finding of the
executable file.
This is only a theorical issue because kthreads do not call do_execve()
directly. But fixing would be better.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In /proc/stat, the number of per-IRQ event is shown by making a sum each
irq's events on all cpus. But we can make use of kstat_irqs().
kstat_irqs() do the same calculation, If !CONFIG_GENERIC_HARDIRQ,
it's not a big cost. (Both of the number of cpus and irqs are small.)
If a system is very big and CONFIG_GENERIC_HARDIRQ, it does
for_each_irq()
for_each_cpu()
- look up a radix tree
- read desc->irq_stat[cpu]
This seems not efficient. This patch adds kstat_irqs() for
CONFIG_GENRIC_HARDIRQ and change the calculation as
for_each_irq()
look up radix tree
for_each_cpu()
- read desc->irq_stat[cpu]
This reduces cost.
A test on (4096cpusp, 256 nodes, 4592 irqs) host (by Jack Steiner)
%time cat /proc/stat > /dev/null
Before Patch: 2.459 sec
After Patch : .561 sec
[akpm@linux-foundation.org: unexport kstat_irqs, coding-style tweaks]
[akpm@linux-foundation.org: fix unused variable 'per_irq_sum']
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/stat shows the total number of all interrupts to each cpu. But when
the number of IRQs are very large, it take very long time and 'cat
/proc/stat' takes more than 10 secs. This is because sum of all irq
events are counted when /proc/stat is read. This patch adds "sum of all
irq" counter percpu and reduce read costs.
The cost of reading /proc/stat is important because it's used by major
applications as 'top', 'ps', 'w', etc....
A test on a mechin (4096cpu, 256 nodes, 4592 irqs) shows
%time cat /proc/stat > /dev/null
Before Patch: 12.627 sec
After Patch: 2.459 sec
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The length of the BLOCK_IPOLL string is making i's value be printed too
far to the right. This patch fixes this and makes the output a bit
neater.
Currently:
CPU0
HI: 0
TIMER: 599792
NET_TX: 2
NET_RX: 6
BLOCK: 80807
BLOCK_IOPOLL: 0
TASKLET: 20012
SCHED: 0
HRTIMER: 63
RCU: 619279
With patch:
CPU0
HI: 0
TIMER: 585582
NET_TX: 2
NET_RX: 6
BLOCK: 80320
BLOCK_IOPOLL: 0
TASKLET: 19287
SCHED: 0
HRTIMER: 62
RCU: 604441
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export the number of anonymous pages in a mapping via smaps.
Even the private pages in a mapping backed by a file, would be marked as
anonymous, when they are modified. Export this information to user-space via
smaps.
Exporting this count will help gdb to make a better decision on which
areas need to be dumped in its coredump; and should be useful to others
studying the memory usage of a process.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userland ELF tools have been coping with partial-segments core files
for a few years now. Multiple distro builds are now setting this option.
It behooves everyone who ever deals with core files to have more info
dumped in there, especially as more and more people's compilers are
producing build IDs. Make it the default.
Anyone using older tools confused by these core files can configure this
option off, or just change /proc/PID/coredump_filter after boot.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We met a parameter truncated issue, consider following:
> echo "|/root/core_pattern_pipe_test %p /usr/libexec/blah-blah-blah \
%s %c %p %u %g 11 12345678901234567890123456789012345678 %t" > \
/proc/sys/kernel/core_pattern
This is okay because the strings is less than CORENAME_MAX_SIZE. "cat
/proc/sys/kernel/core_pattern" shows the whole string. but after we run
core_pattern_pipe_test in man page, we found last parameter was truncated
like below:
argc[10]=<12807486>
The root cause is core_pattern allows % specifiers, which need to be
replaced during parse time, but the replace may expand the strings to
larger than CORENAME_MAX_SIZE. So if the last parameter is % specifiers,
the replace code is using snprintf(out_ptr, out_end - out_ptr, ...), this
will write out of corename array.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent
execve() has no worth.
Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a CD has both Rock Ridge and Joliet extensions and the ISO root
directory is empty, no files are visible. Disable Rock Ridge extensions
in this case and use Joliet root directory instead.
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Guenter Roeck <guenter.roeck@ericsson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We leak 256 bytes here on this error path.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When quotaon(8) races with __dquot_initialize() or dqget() fails because
of EIO, ENOSPC, or similar error, we could possibly dereference NULL pointer
in inode->i_dquot[cnt]. Add proper checking.
Reported-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Update missing/broken argument descriptions and fix formatting.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Convert atomic_inc(&bh->b_count) to get_bh(bh) for consistency.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fail journal creation if __getblk() returns NULL. unlikely() is
added because it is called in a loop and we've been OK without
the check until now.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
bh->b_data is already a pointer to char so casts to 'char *' should
be meaningless. Remove them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix mount-count check to emit warning only if s_max_mnt_count
is greater than 0 according to man tune2fs(8). Also removes
unnecessary casts.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
__dquot_transfer accidentally called flush_warnings for a wrong set of
dquots which could result in quota warnings being issued with a wrong
identification. Also when operation fails because of EDQUOT, there's no
need check for issuing information message about user getting below limits
(no transfer has actually happened).
Signed-off-by: Jan Kara <jack@suse.cz>
I've got following lockup:
dquot_disable dquot_transfer
->dqget()
sb_has_quota_active
dqopt->flags &= ~dquot_state_flag(f, cnt) atomic_inc(dq->dq_count)
->drop_dquot_ref(sb, cnt);
down_write(dqptr_sem)
inode->i_dquot[cnt] = NULL ->__dquot_transfer
invalidate_dquots(sb, cnt); down_write(&dqptr_sem)
->wait for dq_wait_unused inode->i_dquot = new_dquot
/* wait forever */ ^^^^New quota user^^^^^^
We cannot allow new references to dquots from inodes after drop_dquot_ref()
has removed them. We have to recheck quota state under dqptr_sem and before
assignment, as we do it in dquot_initialize().
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Convert set/clear_bit(BH_JWrite, ...) to set/clear_buffer_jwrite()
for consistency.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This fixes a WARN backtrace in mark_buffer_dirty() that occurs during unmount
when the underlying block device is removed. This bug has been seen on System
Z when removing all paths from a multipath-backed ext3 mount; on System P when
injecting enough PCI EEH errors to make the SCSI controller go offline; and
similar warnings have been seen (and patched) with ext2/ext4.
The super block update from a previous operation has marked the buffer as in
error, and the flag has to be cleared before doing the update. Similar changes
have been made to ext4 by commit 914258bf2c.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Remove goto statement which jumps to very next line. Also remove
target label because it is no longer used anywhere.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Move call to jbd_debug() into #ifdef CONFIG_JBD_DEBUG block because
'dropped' is declared there. The code could be compiled without this
change anyway, simply because jbd_debug() expands to nothing if
!CONFIG_JBD_DEBUG but IMHO it doesn't look good in general.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
@handle doesn't exist in ext2. Remove it.
Also, fit comment header into kernel-doc format.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Nothing depends on lock_flocks using the BKL
any more, so we can do the switch over to
a private spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
You currently cannot use "fasync_helper()" in an atomic environment to
insert a new fasync entry, because it will need to allocate the new
"struct fasync_struct".
Yet fcntl_setlease() wants to call this under lock_flocks(), which is in
the process of being converted from the BKL to a spinlock.
In order to fix this, this abstracts out the actual fasync list
insertion and the fasync allocations into functions of their own, and
teaches fs/locks.c to pre-allocate the fasync_struct entry. That way
the actual list insertion can happen while holding the required
spinlock.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bfields@redhat.com: rebase on top of my changes to Arnd's patch]
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
As suggested by Christoph Hellwig, this moves allocation
of new file locks out of generic_setlease into the
callers, nfs4_open_delegation and fcntl_setlease in order
to allow GFP_KERNEL allocations when lock_flocks has
become a spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
nlmsvc_notify_blocked walks the nlm_blocked list,
which requires nlm_blocked_lock.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
lockd should use lock_flocks() instead of lock_kernel()
to lock against posix locks accessing the i_flock list.
This is a prerequisite to turning lock_flocks into a
spinlock.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
hfsplus_delete_inode only truncates away all block allocations if
i_nlink is zero. Make sure we properly drop the unlink count even
when doing the rename hack for open but unlinked files.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Minor cleanup - Fix spelling mistake, make meaningful (goto) label
In function setup_ntlmv2_rsp(), do not return 0 and leak memory,
let the tiblob get freed.
For function find_domain_name(), pass already available nls table pointer
instead of loading and unloading the table again in this function.
For ntlmv2, the case sensitive password length is the length of the
response, so subtract session key length (16 bytes) from the .len.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
* akpm-incoming-1: (176 commits)
scripts/checkpatch.pl: add check for declaration of pci_device_id
scripts/checkpatch.pl: add warnings for static char that could be static const char
checkpatch: version 0.31
checkpatch: statement/block context analyser should look at sanitised lines
checkpatch: handle EXPORT_SYMBOL for DEVICE_ATTR and similar
checkpatch: clean up structure definition macro handline
checkpatch: update copyright dates
checkpatch: Add additional attribute #defines
checkpatch: check for incorrect permissions
checkpatch: ensure kconfig help checks only apply when we are adding help
checkpatch: simplify and consolidate "missing space after" checks
checkpatch: add check for space after struct, union, and enum
checkpatch: returning errno typically should be negative
checkpatch: handle casts better fixing false categorisation of : as binary
checkpatch: ensure we do not collapse bracketed sections into constants
checkpatch: suggest cleanpatch and cleanfile when appropriate
checkpatch: types may sit on a line on their own
checkpatch: fix regressions in "fix handling of leading spaces"
div64_u64(): improve precision on 32bit platforms
lib/parser: cleanup match_number()
...
PF_FLUSHER is only ever set, not tested, remove it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :
<quote>
We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:
atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;
The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().
n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;
In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.
</quote>
Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.
get_max_files() is changed to return an unsigned long. get_nr_files() is
changed to return a long.
unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.
Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968
After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648
Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The lock number in /proc/locks (first field) is implemented by a counter
(private field of struct seq_file) which is incremented at each call of
locks_show() and reset to 1 in locks_start() whatever the offset is. It
should be reset according to the actual position in the list. Because of
this, the numbering erratically restarts at 1 several times when reading a
long /proc/locks file.
Moreover, locks_show() can be called twice to print a single line thus
skipping a number. The counter should be incremented in locks_next().
And last, pos is a loff_t, which can be bigger than a pointer, so we don't
use the pointer as an integer anymore, and allocate a loff_t instead.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the EXPORTFS kconfig symbol out of the NETWORK_FILESYSTEMS block
since it provides a library function that can be (and is) used by other
(non-network) filesystems.
This also eliminates a kconfig dependency warning:
warning: (XFS_FS && BLOCK || NFSD && NETWORK_FILESYSTEMS && INET && FILE_LOCKING && BKL) selects EXPORTFS which has unmet direct dependencies (NETWORK_FILESYSTEMS)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bh->b_private is initialized within init_buffer(), thus this assignment is
redundant.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up truncation (ssize_t->int). This only matters with >2G
reads/writes, which the kernel doesn't permit.
Signed-off-by: Edward Shishkin <edward@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7909b1c640 ("fuse: don't use atomic kmap") removed KM_USER0 usage
from fuse/dev.c. Switch KM_USER1 uses to KM_USER0 for clarity. Also
replace open coded clear_highpage().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After all that's what they are intended for.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some code cleanups for hostfs.
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I had to go back to a 2.6.20 tree to work out why we're adding a
number-of-inodes into a number-of-pages count. Restore the lost comment.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
This is not a user expected behavior. And it's inconsistent with
calc_period_shift(), which uses the plain vm_dirty_ratio value.
Let's remove the internal bound.
At the same time, fix balance_dirty_pages() to work with the
dirty_thresh=0 case. This allows applications to proceed when
dirty+writeback pages are all cleaned.
And ">" fits with the name "exceeded" better than ">=" does. Neil thinks
it is an aesthetic improvement as well as a functional one :)
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Proposed-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.
# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_written /proc/vmstat
nr_written 3618
These entries allow user apps to understand writeback behaviour over time
and learn how it is impacting their performance. Currently there is no
way to inspect dirty and writeback speed over time. It's not possible for
nr_dirty/nr_writeback.
These entries are necessary to give visibility into writeback behaviour.
We have /proc/diskstats which lets us understand the io in the block
layer. We have blktrace for more in depth understanding. We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing. There is no way to know how active it is over the whole system, if
it's falling behind or to quantify it's efforts. With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data. Comparing the rates of change between the two allow developers to
see when writeback is not able to keep up with incoming traffic and the
rate of dirty memory being sent to the IO back end. This allows folks to
understand their io workloads and track kernel issues. Non kernel
engineers at Google often use these counters to solve puzzling performance
problems.
Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written
Patch #5 add writeback thresholds to /proc/vmstat
Currently these values are in debugfs. But they should be promoted to
/proc since they are useful for developers who are writing databases
and file servers and are not debugging the kernel.
The output is as below:
# grep threshold /proc/vmstat
nr_pages_dirty_threshold 409111
nr_pages_dirty_background_threshold 818223
This patch:
This allows code outside of the mm core to safely manipulate page
writeback state and not worry about the other accounting. Not using these
routines means that some code will lose track of the accounting and we get
bugs.
Modify nilfs2 to use interface.
Signed-off-by: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jiro SEKIBA <jir@unicus.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes more dead code that was somehow missed by commit 0d99519efe
(writeback: remove unused nonblocking and congestion checks). There are
no behavior change except for the removal of two entries from one of the
ext4 tracing interface.
The nonblocking checks in ->writepages are no longer used because the
flusher now prefer to block on get_request_wait() than to skip inodes on
IO congestion. The latter will lead to more seeky IO.
The nonblocking checks in ->writepage are no longer used because it's
redundant with the WB_SYNC_NONE check.
We no long set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
that would skip some dirty inodes on congestion and page out others, which
is unfair in terms of LRU age.
Inspired by Christoph Hellwig. Thanks!
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The locking order in oom_adjust_write() and oom_score_adj_write() for
task->alloc_lock and task->sighand->siglock is reversed, and lockdep
notices that irqs could encounter an ABBA scenario.
This fixes the locking order so that we always take task_lock(task) prior
to lock_task_sighand(task).
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's better to use proper error handling in oom_adjust_write() and
oom_score_adj_write() instead of duplicating the locking order on various
exit paths.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's pointless to kill a task if another thread sharing its mm cannot be
killed to allow future memory freeing. A subsequent patch will prevent
kills in such cases, but first it's necessary to have a way to flag a task
that shares memory with an OOM_DISABLE task that doesn't incur an
additional tasklist scan, which would make select_bad_process() an O(n^2)
function.
This patch adds an atomic counter to struct mm_struct that follows how
many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
They cannot be killed by the kernel, so their memory cannot be freed in
oom conditions.
This only requires task_lock() on the task that we're operating on, it
does not require mm->mmap_sem since task_lock() pins the mm and the
operation is atomic.
[rientjes@google.com: changelog and sys_unshare() code]
[rientjes@google.com: protect oom_disable_count with task_lock in fork]
[rientjes@google.com: use old_mm for oom_disable_count in exec]
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We use vmcore in our production kernel for a long time, it is pretty
stable now. So I don't think we need to mark it as experimental any more.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
365b1818 ("add f_flags to struct statfs(64)") resized f_spare within
struct statfs which caused a UML crash. There is no need to copy f_spare.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kernel crypto sync hash apis insetead of cifs crypto functions.
The calls typically corrospond one to one except that insead of
key init, setkey is used.
Use crypto apis to generate smb signagtures also.
Use hmac-md5 to genereate ntlmv2 hash, ntlmv2 response, and HMAC (CR1 of
ntlmv2 auth blob.
User crypto apis to genereate signature and to verify signature.
md5 hash is used to calculate signature.
Use secondary key to calculate signature in case of ntlmssp.
For ntlmv2 within ntlmssp, during signature calculation, only 16 bytes key
(a nonce) stored within session key is used. during smb signature calculation.
For ntlm and ntlmv2 without extended security, 16 bytes key
as well as entire response (24 bytes in case of ntlm and variable length
in case of ntlmv2) is used for smb signature calculation.
For kerberos, there is no distinction between key and response.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* ima-memory-use-fixes:
IMA: fix the ToMToU logic
IMA: explicit IMA i_flag to remove global lock on inode_delete
IMA: drop refcnt from ima_iint_cache since it isn't needed
IMA: only allocate iint when needed
IMA: move read counter into struct inode
IMA: use i_writecount rather than a private counter
IMA: use inode->i_lock to protect read and write counters
IMA: convert internal flags from long to char
IMA: use unsigned int instead of long for counters
IMA: drop the inode opencount since it isn't needed for operation
IMA: use rbtree instead of radix tree for inode information cache
IMA currently allocated an inode integrity structure for every inode in
core. This stucture is about 120 bytes long. Most files however
(especially on a system which doesn't make use of IMA) will never need
any of this space. The problem is that if IMA is enabled we need to
know information about the number of readers and the number of writers
for every inode on the box. At the moment we collect that information
in the per inode iint structure and waste the rest of the space. This
patch moves those counters into the struct inode so we can eventually
stop allocating an IMA integrity structure except when absolutely
needed.
This patch does the minimum needed to move the location of the data.
Further cleanups, especially the location of counter updates, may still
be possible.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark dependency on crypto modules in Kconfig.
Defining per structures sdesc and cifs_secmech which are used to store
crypto hash functions and contexts. They are stored per smb connection
and used for all auth mechs to genereate hash values and signatures.
Allocate crypto hashing functions, security descriptiors, and respective
contexts when a smb/tcp connection is established.
Release them when a tcp/smb connection is taken down.
md5 and hmac-md5 are two crypto hashing functions that are used
throught the life of an smb/tcp connection by various functions that
calcualte signagure and ntlmv2 hash, HMAC etc.
structure ntlmssp_auth is defined as per smb connection.
ntlmssp_auth holds ciphertext which is genereated by rc4/arc4 encryption of
secondary key, a nonce using ntlmv2 session key and sent in the session key
field of the type 3 message sent by the client during ntlmssp
negotiation/exchange
A key is exchanged with the server if client indicates so in flags in
type 1 messsage and server agrees in flag in type 2 message of ntlmssp
negotiation. If both client and agree, a key sent by client in
type 3 message of ntlmssp negotiation in the session key field.
The key is a ciphertext generated off of secondary key, a nonce, using
ntlmv2 hash via rc4/arc4.
Signing works for ntlmssp in this patch. The sequence number within
the server structure needs to be zero until session is established
i.e. till type 3 packet of ntlmssp exchange of a to be very first
smb session on that smb connection is sent.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The cifs_convert_address() returns zero on error but this caller is
testing for negative returns.
Btw. "i" is unsigned here, so it's never negative.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Start calculating auth response within a session. Move/Add pertinet
data structures like session key, server challenge and ntlmv2_hash in
a session structure. We should do the calculations within a session
before copying session key and response over to server data
structures because a session setup can fail.
Only after a very first smb session succeeds, it copy/make its
session key, session key of smb connection. This key stays with
the smb connection throughout its life.
sequence_number within server is set to 0x2.
The authentication Message Authentication Key (mak) which consists
of session key followed by client response within structure session_key
is now dynamic. Every authentication type allocates the key + response
sized memory within its session structure and later either assigns or
frees it once the client response is sent and if session's session key
becomes connetion's session key.
ntlm/ntlmi authentication functions are rearranged. A function
named setup_ntlm_resp(), similar to setup_ntlmv2_resp(), replaces
function cifs_calculate_session_key().
size of CIFS_SESS_KEY_SIZE is changed to 16, to reflect the byte size
of the key it holds.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Stephen Rothwell reports:
> /home/test/linux-2.6/fs/nfs/nfsroot.c: In function 'nfs_root_debug':
> /home/test/linux-2.6/fs/nfs/nfsroot.c:110:2: error: 'nfs_debug'
> undeclared (first use in this function)
> /home/test/linux-2.6/fs/nfs/nfsroot.c:110:2: note: each undeclared
> identifier is reported only once for each function it appears in
> make[3]: *** [fs/nfs/nfsroot.o] Error 1
> make[2]: *** [fs/nfs] Error 2
> make[1]: *** [fs] Error 2
> make: *** [sub-make] Error 2
Which is caused by commit 306a075362
(NFS: Allow NFSROOT debugging messages to be enabled dynamically)
Fix is to disable this code when RPC_DEBUG is disabled.
Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux: (99 commits)
svcrpc: svc_tcp_sendto XPT_DEAD check is redundant
svcrpc: no need for XPT_DEAD check in svc_xprt_enqueue
svcrpc: assume svc_delete_xprt() called only once
svcrpc: never clear XPT_BUSY on dead xprt
nfsd4: fix connection allocation in sequence()
nfsd4: only require krb5 principal for NFSv4.0 callbacks
nfsd4: move minorversion to client
nfsd4: delay session removal till free_client
nfsd4: separate callback change and callback probe
nfsd4: callback program number is per-session
nfsd4: track backchannel connections
nfsd4: confirm only on succesful create_session
nfsd4: make backchannel sequence number per-session
nfsd4: use client pointer to backchannel session
nfsd4: move callback setup into session init code
nfsd4: don't cache seq_misordered replies
SUNRPC: Properly initialize sock_xprt.srcaddr in all cases
SUNRPC: Use conventional switch statement when reclassifying sockets
sunrpc/xprtrdma: clean up workqueue usage
sunrpc: Turn list_for_each-s into the ..._entry-s
...
Fix up trivial conflicts (two different deprecation notices added in
separate branches) in Documentation/feature-removal-schedule.txt
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space. It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems. But if it does get back ENOSPC it
handles it properly. The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight. So instead just remove
the warning. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation. So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction. This fixes a panic I could reproduce
at will. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
net/sunrpc: Use static const char arrays
nfs4: fix channel attribute sanity-checks
NFSv4.1: Use more sensible names for 'initialize_mountpoint'
NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure
NFSv4.1: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
NFS: client needs to maintain list of inodes with active layouts
NFS: create and destroy inode's layout cache
NFSv4.1: pnfs: filelayout: introduce minimal file layout driver
NFSv4.1: pnfs: full mount/umount infrastructure
NFS: set layout driver
NFS: ask for layouttypes during v4 fsinfo call
NFS: change stateid to be a union
NFSv4.1: pnfsd, pnfs: protocol level pnfs constants
SUNRPC: define xdr_decode_opaque_fixed
NFSD: remove duplicate NFS4_STATEID_SIZE
The sanity checks here are incorrect; in the worst case they allow
values that crash the client.
They're also over-reliant on the preprocessor.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull removal of fsnotify marks into generic_shutdown_super().
Split umount-time work into a new function - evict_inodes().
Make sure that invalidate_inodes() will be able to cope with
I_FREEING once we change locking in iput().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Skip I_FREEING inodes just like I_WILL_FREE and I_NEW when walking the
writeback lists. Currenly this can't happen, but once we move from
inode_lock to more fine grained locking we can have an inode that's
still on the writeback lists but has I_FREEING set, and we absolutely
need to skip it here, just like we do for all other inode list walks.
Based on a patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Despite the comment above it we can not safely drop the lock here.
invalidate_list is called from many other places that just umount.
Also switch to proper list macros now that we never drop the lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
bdev inodes can remain dirty even after their last close. Hence the
BDI associated with the bdev->inode gets modified duringthe last
close to point to the default BDI. However, the bdev inode still
needs to be moved to the dirty lists of the new BDI, otherwise it
will corrupt the writeback list is was left on.
Add a new function bdev_inode_switch_bdi() to move all the bdi state
from the old bdi to the new one safely. This is only a temporary
measure until the bdev inode<->bdi lifecycle problems are sorted
out.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We must not call invalidate_inode_buffers in invalidate_list unless the
inode can be reclaimed. If we remove the buffer association of a busy
inode fsync won't find the buffers anymore. As invalidate_inode_buffers
is called from various others sources than umount this actually does
matter in practice.
While at it change the loop to a more natural form and remove the
WARN_ON for I_NEW, wich we already tested a few lines above.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use dget_parent instead of opencoding it. This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.
It means we do grab a reference to the parent now if need to be watched,
but not with the specified mask. If this turns out to be a problem
we'll have to revisit it, but for now let's keep as much as possible
dcache internals inside dcache.[ch].
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use dget_parent instead of opencoding it. This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.
Note that the d_time assignment in smb_renew_times moves out of d_lock,
but it's a single atomic 32-bit value, and that's what other sites
setting it do already.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use dget_parent instead of opencoding it. This simplifies the code, but
more importanly prepares for the more complicated locking for a parent
dget in the dcache scale patch series.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d_validate does a purely read lookup in the dentry hash, so use RCU read side
locking instead of dcache_lock. Split out from a larget patch by
Nick Piggin <npiggin@suse.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Always do a list_del_init on the LRU to make sure the list_empty invariant for
not beeing on the LRU always holds true, and fold dentry_lru_del_init into
dentry_lru_del. Replace the dentry_lru_add_tail primitive with a
dentry_lru_move_tail operations that simpler when the dentry already is one
the list, which is always is. Move the list_empty into dentry_lru_add to
fit the scheme of the other lru helpers, and simplify locking once we
move to a separate LRU lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently __shrink_dcache_sb has an extremly awkward calling convention
because it tries to please very different callers. Split out the
main loop into a shrink_dentry_list helper, which gets called directly
from shrink_dcache_sb for the cases where all dentries need to be pruned,
or from __shrink_dcache_sb for pruning only a certain number of dentries.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
dentry referenced bit is only set when installing the dentry back
onto the LRU. However with lazy LRU, the dentry can already be on
the LRU list at dput time, thus missing out on setting the referenced
bit. Fix this.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The nr_dentry stat is a globally touched cacheline and atomic operation
twice over the lifetime of a dentry. It is used for the benfit of userspace
only. Turn it into a per-cpu counter and always decrement it in d_free instead
of doing various batching operations to reduce lock hold times in the callers.
Based on an earlier patch from Nick Piggin <npiggin@suse.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove d_callback and always call __d_free with a RCU head.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All callers take dcache_lock just around the call to __d_path, so
take the lock into it in preparation of getting rid of dcache_lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.
Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations. This reduces contention on
the shared last_ino, and give same spreading ino numbers than before
(i.e. same wraparound after 2^32 allocations).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split up inode_add_to_list/__inode_add_to_list. Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.
The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that iunique is not abusing find_inode anymore we can move the i_ref
increment back to where it belongs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
Introduce a new iunique_lock to protect the iunique counters once inode_lock
is removed.
Based on a patch originally from Nick Piggin.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before replacing the inode hash locking with a more scalable
mechanism, factor the removal of the inode from the hashes rather
than open coding it in several places.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic. We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.
This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.
Based on a patch originally from Eric Dumazet.
[AV: cleaned up a bit, fixed build breakage on weird configs
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If clone_mnt() happens while mnt_make_readonly() is running, the
cloned mount might have MNT_WRITE_HOLD flag set, which results in
mnt_want_write() spinning forever on this mount.
Needs CAP_SYS_ADMIN to trigger deliberately and unlikely to happen
accidentally. But if it does happen it can hang the machine.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make node look as if it was on hlist, with hlist_del()
working correctly. Usable without any locking...
Convert a couple of places where we want to do that to
inode->i_hash.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>