Commit Graph

17426 Commits

Author SHA1 Message Date
Al Viro
a2c36b450e pull more into do_last()
Handling of LAST_DOT/LAST_ROOT/LAST_DOTDOT/terminating slash
can be pulled in as well

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 09:01:24 -05:00
Al Viro
c99658fe97 bail out with ELOOP earlier in do_link loop
If we'd passed through 32 trailing symlinks already, there's
no sense following the 33rd - we'll bail out anyway.  Better
bugger off earlier.

It *does* change behaviour, after a fashion - if the 33rd happens
to be a procfs-style symlink, original code *would* allow it.
This one will not.  Cry me a river if that hurts you.  Please, do.
And post a video of that, while you are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 09:01:22 -05:00
Al Viro
a1e28038df pull the common predecessors into do_last()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 09:01:20 -05:00
Al Viro
c41c140562 postpone __putname() until after do_last()
Since do_last() doesn't mangle nd->last_name, we can safely postpone
__putname() done in handling of trailing symlinks until after the
call of do_last()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 09:01:18 -05:00
Al Viro
27bff34300 unroll do_last: loop in do_filp_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 09:01:16 -05:00
Al Viro
3343eb8209 Shift releasing nd->root from do_last() to its caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 09:01:15 -05:00
Al Viro
fb1cc555d5 gut do_filp_open() a bit more (do_last separation)
Brute-force separation of stuff reachable from do_last: with
the exception of do_link:; just take all that crap to a helper
function as-is and have it tell the caller if it has to go
to do_link.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 09:01:13 -05:00
Al Viro
648fa8611d beginning to untangle do_filp_open()
That's going to be a long and painful series.  The first step:
take the stuff reachable from 'ok' label in do_filp_open() into
a new helper (finish_open()).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 09:01:11 -05:00
Venkatesh Pallipadi
64e290ec69 ext4: fix up rb_root initializations to use RB_ROOT
ext4 uses rb_node = NULL; to zero rb_root at few places.  Using
RB_ROOT as the initializer is more portable in case the underlying
implementation of rbtrees changes in the future.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Paris <eparis@redhat.com>
2010-03-04 22:25:21 -05:00
Christoph Hellwig
efd8f0e6f6 quota: stop using QUOTA_OK / NO_QUOTA
Just use 0 / -EDQUOT directly - that's what it translates to anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:31 +01:00
Christoph Hellwig
871a293155 dquot: cleanup dquot initialize routine
Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig
907f4554e2 dquot: move dquot initialization responsibility into the filesystem
Currently various places in the VFS call vfs_dq_init directly.  This means
we tie the quota code into the VFS.  Get rid of that and make the
filesystem responsible for the initialization.   For most metadata operations
this is a straight forward move into the methods, but for truncate and
open it's a bit more complicated.

For truncate we currently only call vfs_dq_init for the sys_truncate case
because open already takes care of it for ftruncate and open(O_TRUNC) - the
new code causes an additional vfs_dq_init for those which is harmless.

For open the initialization is moved from do_filp_open into the open method,
which means it happens slightly earlier now, and only for regular files.
The latter is fine because we don't need to initialize it for operations
on special files, and we already do it as part of the namespace operations
for directories.

Add a dquot_file_open helper that filesystems that support generic quotas
can use to fill in ->open.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig
9f75475802 dquot: cleanup dquot drop routine
Get rid of the drop dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_drop helper to __dquot_drop
and vfs_dq_drop to dquot_drop to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig
257ba15ced dquot: move dquot drop responsibility into the filesystem
Currently clear_inode calls vfs_dq_drop directly.  This means
we tie the quota code into the VFS.  Get rid of that and make the
filesystem responsible for the drop inside the ->clear_inode
superblock operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:29 +01:00
Christoph Hellwig
b43fa8284d dquot: cleanup dquot transfer routine
Get rid of the transfer dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_transfer helper to __dquot_transfer
and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
and make the new dquot_transfer return a normal negative errno value
which all callers expect.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:29 +01:00
Christoph Hellwig
759bfee658 dquot: move dquot transfer responsibility into the filesystem
Currently notify_change calls vfs_dq_transfer directly.  This means
we tie the quota code into the VFS.  Get rid of that and make the
filesystem responsible for the transfer.  Most filesystems already
do this, only ufs and udf need the code added, and for jfs it needs to
be enabled unconditionally instead of only when ACLs are enabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Christoph Hellwig
63936ddaa1 dquot: cleanup inode allocation / freeing routines
Get rid of the alloc_inode and free_inode dquot operations - they are
always called from the filesystem and if a filesystem really needs
their own (which none currently does) it can just call into it's
own routine directly.

Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
call the lowlevel dquot_alloc_inode / dqout_free_inode routines
directly, which now lose the number argument which is always 1.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Christoph Hellwig
5dd4056db8 dquot: cleanup space allocation / freeing routines
Get rid of the alloc_space, free_space, reserve_space, claim_space and
release_rsv dquot operations - they are always called from the filesystem
and if a filesystem really needs their own (which none currently does)
it can just call into it's own routine directly.

Move shared logic into the common __dquot_alloc_space,
dquot_claim_space_nodirty and __dquot_free_space low-level methods,
and rationalize the wrappers around it to move as much as possible
code into the common block for CONFIG_QUOTA vs not.  Also rename
all these helpers to be named dquot_* instead of vfs_dq_*.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Dmitry Monakhov
49792c806d ext3: add writepage sanity checks
- There is theoretical possibility to perform writepage on
   RO superblock. Add explicit check for what case.
- Page must being locked before writepage.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:27 +01:00
Jan Kara
7eb4969e04 ext3: Truncate allocated blocks if direct IO write fails to update i_size
We have to truncate blocks allocated to file during direct IO when we
fail to update i_size properly.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:27 +01:00
Jan Kara
ab94c39b6f quota: Properly invalidate caches even for filesystems with blocksize < pagesize
Sometimes invalidate_bdev() can fail to invalidate a part of block
device cache because of dirty data. If the filesystem has blocksize
smaller than page size, this can happen even for pages containing
quota files and thus kernel would operate on stale data. Fix the
issue by syncing the filesystem before invalidating the cache.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:27 +01:00
Dmitry Monakhov
8ddd69d6df quota: generalize quota transfer interface
Current quota transfer interface support only uid/gid.
This patch extend interface in order to support various quotas types
The goal is accomplished without changes in most frequently used
vfs_dq_transfer() func.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:26 +01:00
Dmitry Monakhov
ad1e6e8da9 quota: sb_quota state flags cleanup
- remove hardcoded USRQUOTA/GRPQUOTA flags
- convert int to bool for appropriate functions

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:26 +01:00
Jan Kara
8696391896 jbd: Delay discarding buffers in journal_unmap_buffer
Delay discarding buffers in journal_unmap_buffer until
we know that "add to orphan" operation has definitely been
committed, otherwise the log space of committing transation
may be freed and reused before truncate get committed, updates
may get lost if crash happens.

This patch is a backport of JBD2 fix by dingdinghua <dingdinghua@nrchpc.ac.cn>.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:26 +01:00
Dmitry Monakhov
e5472147e1 ext3: quota_write cross block boundary behaviour
We always assume what dquot update result in changes in one data block
But ext3_quota_write() function may handle cross block boundary writes
In fact if this ever happen it will result in incorrect journal credits
reservation. And later bug_on triggering. As soon this never happen the
boundary cross loop is NOOP. In order to make things straight
let's remove this loop and assert cross boundary condition.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:26 +01:00
Christoph Hellwig
ac0e773718 quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
We already do these checks in the generic code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:25 +01:00
Christoph Hellwig
5582c76f90 quota: split out compat_sys_quotactl support from quota.c
Instead of adding ifdefs just split it into a new file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:25 +01:00
Christoph Hellwig
799a9d4402 quota: split out netlink notification support from quota.c
Instead of adding ifdefs just split it into a new file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:25 +01:00
Christoph Hellwig
a56fca23f6 quota: remove invalid optimization from quota_sync_all
Checking the "VFS" quota enabled and dirty bits from generic code means
this code will never get called for other implementations, e.g. XFS and
GFS2.  Grabbing the reference on the superblock really isn't much overhead
for a global Q_SYNC call, so just drop this optimization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:24 +01:00
Christoph Hellwig
5fb324ad24 quota: move code from sync_quota_sb into vfs_quota_sync
Currenly sync_quota_sb does a lot of sync and truncate action that only
applies to "VFS" style quotas and is actively harmful for the sync
performance in XFS.  Move it into vfs_quota_sync and add a wait parameter
to ->quota_sync to tell if we need it or not.

My audit of the GFS2 code says it's also not needed given the way GFS2
implements quotas, but I'd be happy if this can get a detailed review.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:24 +01:00
Christoph Hellwig
8c4e4acd66 quota: clean up Q_XQUOTASYNC
Currently Q_XQUOTASYNC calls into the quota_sync method, but XFS does something
entirely different in it than the rest of the filesystems.  xfs_quota which
calls Q_XQUOTASYNC expects an asynchronous data writeout to flush delayed
allocations, while the "VFS" quota support wants to flush changes to the quota
file.

So make Q_XQUOTASYNC call into the writeback code directly and make the
quota_sync method optional as XFS doesn't need in the sense expected by the
rest of the quota code.

GFS2 was using limited XFS-style quota and has a quota_sync method fitting
neither the style used by vfs_quota_sync nor xfs_fs_quota_sync.  I left it
in for now as per discussion with Steve it expects to be called from the
sync path this way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:24 +01:00
Christoph Hellwig
c988afb5fa quota: simplify permission checking
Stop having complicated different routines for checking permissions for
XQM vs "VFS" quotas.  Instead do the checks for having sb->s_qcop and
a valid type directly in do_quotactl, and munge the *quotactl_valid functions
into a check_quotactl_permission helper that only checks for permissions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:22 +01:00
Christoph Hellwig
6ae09575b3 quota: special case Q_SYNC without device name
The Q_SYNC command can be called without the path to a device, in which case
it iterates over all superblocks.  Special case this variant directly in
sys_quotactl so that the other code always gets a superblock and doesn't
need to deal with this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:22 +01:00
Christoph Hellwig
f450d4fee4 quota: clean up checks for supported quota methods
Move the checks for sb->s_qcop->foo next to the actual calls for them, same
for sb_has_quota_active checks where applicable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:21 +01:00
Christoph Hellwig
c411e5f66a quota: split do_quotactl
Split out a helper for each non-trivial command from do_quotactl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:21 +01:00
Jan Kara
0a5a9c7255 quota: Fix warning when a delayed write happens before quota is enabled
If a delayed-allocation write happens before quota is enabled, the
kernel spits out a warning:
WARNING: at fs/quota/dquot.c:988 dquot_claim_space+0x77/0x112()

because the fact that user has some delayed allocation is not recorded
in quota structure.

Make dquot_initialize() update amount of reserved space for user if it sees
inode has some space reserved. Also make sure that reserved quota space does
not go negative and we warn about the filesystem bug just once.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:21 +01:00
Dmitry Monakhov
c469070aea quota: manage reserved space when quota is not active [v2]
Since we implemented generic reserved space management interface,
then it is possible to account reserved space even when quota
is not active (similar to i_blocks/i_bytes).

Without this patch following testcase result in massive comlain from
WARN_ON in dquot_claim_space()

TEST_CASE:
mount /dev/sdb /mnt -oquota
dd if=/dev/zero of=/mnt/test bs=1M count=1
quotaon /mnt
# fs_reserved_spave == 1Mb
# quota_reserved_space == 0, because quota was disabled
dd if=/dev/zero of=/mnt/test seek=1 bs=1M count=1
# fs_reserved_spave == 2Mb
# quota_reserved_space == 1Mb
sync  # ->dquot_claim_space() -> WARN_ON

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:21 +01:00
Dmitry Monakhov
e1f5c67a19 ext3: trivial quota cleanup
The patch is aimed to reorganize and simplify quota code a bit.
Quota code is itself complex enouth, but we can make it more readable
in some places:
- Move quota option parsing to separate functions.
- Simplify old-quota and journaled-quota mix check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:20 +01:00
Dmitry Monakhov
e3c9643597 ext3: mount flags manipulation cleanup
Replace intermediate EXT3_MOUNT_XXX flags manipulation to
corresponding macro.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:20 +01:00
Jan Kara
9df93939b7 ext3: Use bitops to read/modify EXT3_I(inode)->i_state
At several places we modify EXT3_I(inode)->i_state without holding i_mutex
(ext3_release_file, ext3_bmap, ext3_journalled_writepage, ext3_do_update_inode,
...). These modifications are racy and we can lose updates to i_state. So
convert handling of i_state to use bitops which are atomic.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:20 +01:00
Jan Kara
26245c949c quota: Cleanup S_NOQUOTA handling
Cleanup handling of S_NOQUOTA inode flag and document it a bit. The flag
does not have to be set under dqptr_sem. Only functions modifying inode's
dquot pointers have to check the flag under dqptr_sem before going forward
with the modification. This way we are sure that we cannot add new dquot
pointers to the inode which is just becoming a quota file.

The good thing about this cleanup is that there are no more places in quota
code which enforce i_mutex vs. dqptr_sem lock ordering (in particular that
dqptr_sem -> i_mutex of quota file). This should silence some (false) lockdep
warnings with ext4 + quota and generally make life of some filesystems easier.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:19 +01:00
Joern Engel
c6d3830140 [LogFS] Only write journal if dirty
This prevents unnecessary journal writes.  More importantly it prevents
an oops due to a journal write on failed mount.
2010-03-04 21:36:19 +01:00
Joern Engel
9421502b4f [LogFS] Fix bdev erases
Erases for block devices were always just emulated by writing 0xff.
Some time back the write was removed and only the page cache was
changed to 0xff.  Superficialy a good idea with two problems:
1. Touching the page cache isn't necessary either.
2. However, writing out 0xff _is_ necessary for the journal.  As the
   journal is scanned linearly, an old non-overwritten commit entry
   can be used on next mount and cause havoc.

This should fix both aspects.
2010-03-04 21:30:58 +01:00
Yehuda Sadeh
422d2cb8f9 ceph: reset osd after relevant messages timed out
This simplifies the process of timing out messages. We
keep lru of current messages that are in flight. If a
timeout has passed, we reset the osd connection, so that
messages will be retransmitted.  This is a failsafe in case
we hit some sort of problem sending out message to the OSD.
Normally, we'll get notification via an updated osdmap if
there are problems.

If a request is older than the keepalive timeout, send a
keepalive to ensure we detect any breaks in the TCP connection.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-04 11:26:35 -08:00
J. Bruce Fields
4ea41e2de5 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs into for-2.6.34-incoming
Resolve merge conflict in fs/xfs/linux-2.6/xfs_export.c.
2010-03-04 12:04:51 -05:00
Linus Torvalds
64ba992675 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: groups support
  exofs: Prepare for groups
  exofs: Error recovery if object is missing from storage
  exofs: convert io_state to use pages array instead of bio at input
  exofs: RAID0 support
  exofs: Define on-disk per-inode optional layout attribute
  exofs: unindent exofs_sbi_read
  exofs: Move layout related members to a layout structure
  exofs: Recover in the case of read-passed-end-of-file
  exofs: Micro-optimize exofs_i_info
  exofs: debug print even less
2010-03-04 08:26:08 -08:00
Linus Torvalds
0f2cc4ecd8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  init: Open /dev/console from rootfs
  mqueue: fix typo "failues" -> "failures"
  mqueue: only set error codes if they are really necessary
  mqueue: simplify do_open() error handling
  mqueue: apply mathematics distributivity on mq_bytes calculation
  mqueue: remove unneeded info->messages initialization
  mqueue: fix mq_open() file descriptor leak on user-space processes
  fix race in d_splice_alias()
  set S_DEAD on unlink() and non-directory rename() victims
  vfs: add NOFOLLOW flag to umount(2)
  get rid of ->mnt_parent in tomoyo/realpath
  hppfs can use existing proc_mnt, no need for do_kern_mount() in there
  Mirror MS_KERNMOUNT in ->mnt_flags
  get rid of useless vfsmount_lock use in put_mnt_ns()
  Take vfsmount_lock to fs/internal.h
  get rid of insanity with namespace roots in tomoyo
  take check for new events in namespace (guts of mounts_poll()) to namespace.c
  Don't mess with generic_permission() under ->d_lock in hpfs
  sanitize const/signedness for udf
  nilfs: sanitize const/signedness in dealing with ->d_name.name
  ...

Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04 08:15:33 -08:00
Akira Fujita
c437b27335 ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl
a) Fix sparse warning in ext4_ioctl()
b) Remove unneeded variable in mext_leaf_block()
c) Fix spelling typo in mext_check_arguments()

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:39:24 -05:00
Akira Fujita
7247c0caa2 ext4: Fix the NULL reference in double_down_write_data_sem()
If EXT4_IOC_MOVE_EXT ioctl is called with NULL donor_fd, fget() in
ext4_ioctl() gets inappropriate file structure for donor; so we need
to do this check earlier, before calling double_down_write_data_sem().

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:34:58 -05:00
Akira Fujita
5fd5249aa3 ext4: Fix insertion point of extent in mext_insert_across_blocks()
If the leaf node has 2 extent space or fewer and EXT4_IOC_MOVE_EXT
ioctl is called with the file offset where after the 2nd extent
covers, mext_insert_across_blocks() always tries to insert extent into
the first extent.  As a result, the file gets corrupted because of
wrong extent order.  The patch fixes this problem.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:31:06 -05:00
Akinobu Mita
731eb1a03a ext4: consolidate in_range() definitions
There are duplicate macro definitions of in_range() in mballoc.h and
balloc.c.  This consolidates these two definitions into ext4.h, and
changes extents.c to use in_range() as well.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:55:01 -05:00
Akinobu Mita
bda00de7e8 ext4: cleanup to use ext4_grp_offs_to_block()
More cleanup to convert open-coded calculations of the first block
number of a free extent to use ext4_grp_offs_to_block() instead.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:53:25 -05:00
Akinobu Mita
5661bd6861 ext4: cleanup to use ext4_group_first_block_no()
This is a cleanup and simplification patch which takes some open-coded
calculations to calculate the first block number of a group and
converts them to use the (already defined) ext4_group_first_block_no()
function.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:53:39 -05:00
Al Viro
9643f5d94a Merge branch 'for-fsnotify' into for-linus 2010-03-03 17:12:40 -05:00
Jan Kara
9b1d0998d2 ext4: Release page references acquired in ext4_da_block_invalidatepages
We forget to release page references we acquire in
ext4_da_block_invalidatepages.  Luckily, this function gets called only if we
are not able to allocate blocks for delay-allocated data so that function
should better never be called.

Also cleanup handling of index variable.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-03 16:19:32 -05:00
J. Bruce Fields
8d75da8afd nfsd4: fix minor memory leak
There's no need to allocate this cred more than once.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-03-03 16:13:29 -05:00
Al Viro
4919c5e45a fix race in d_splice_alias()
rehashing the negative placeholder opens a race with d_lookup();
we unhash it almost immediately (by d_move()), but the race
window is there.  Since d_move() doesn't rely on target being
hashed, we don't need that d_rehash() at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:13:08 -05:00
Al Viro
bec1052e5b set S_DEAD on unlink() and non-directory rename() victims
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:12:08 -05:00
Miklos Szeredi
db1f05bb85 vfs: add NOFOLLOW flag to umount(2)
Add a new UMOUNT_NOFOLLOW flag to umount(2).  This is needed to prevent
symlink attacks in unprivileged unmounts (fuse, samba, ncpfs).

Additionally, return -EINVAL if an unknown flag is used (and specify
an explicitly unused flag: UMOUNT_UNUSED).  This makes it possible for
the caller to determine if a flag is supported or not.

CC: Eugene Teo <eugene@redhat.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:08:00 -05:00
Al Viro
0ceeca5a08 hppfs can use existing proc_mnt, no need for do_kern_mount() in there
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:08:00 -05:00
Al Viro
8089352a13 Mirror MS_KERNMOUNT in ->mnt_flags
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:08:00 -05:00
Al Viro
d498b25a4f get rid of useless vfsmount_lock use in put_mnt_ns()
It hadn't been needed since we'd sanitized the logics in
mark_mounts_for_expiry() (which, in turn, used to be a
rudiment of bad old times when namespace_sem was per-ns).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:59 -05:00
Al Viro
47cd813f29 Take vfsmount_lock to fs/internal.h
no more users left outside of fs/*.c (and very few outside of
fs/namespace.c, actually)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:59 -05:00
Al Viro
9f5596af44 take check for new events in namespace (guts of mounts_poll()) to namespace.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:59 -05:00
Al Viro
e21e7095a7 Don't mess with generic_permission() under ->d_lock in hpfs
Just use dentry_unhash() there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:58 -05:00
Al Viro
391e8bbd38 sanitize const/signedness for udf
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:58 -05:00
Al Viro
072f98b463 nilfs: sanitize const/signedness in dealing with ->d_name.name
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:58 -05:00
Al Viro
0319003d0d nilfs really shouldn't slap struct dentry on stack...
... especially when it only needs (and initializes) .d_name of it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:58 -05:00
Al Viro
89031bc797 sanitize const/signedness of ufs a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
7e7742ee00 sanitize signedness/const for pointers to char in hpfs a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
1f707137b5 new helper: iterate_mounts()
apply function to vfsmounts in set returned by collect_mounts(),
stop if it returns non-zero.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
462d60577a fix NFS4 handling of mountpoint stat
RFC says we need to follow the chain of mounts if there's more
than one stacked on that point.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
3088dd7080 Clean follow_dotdot() up a bit
No need to open-code follow_up() in it and locking can be lighter.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:56 -05:00
Al Viro
f694869709 a couple of mntget+dget -> path_get in nfs4proc
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:56 -05:00
Al Viro
6eae7974d0 Switch alloc_nfs_open_context() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:56 -05:00
Al Viro
2096f759ab New helper: path_is_under(path1, path2)
Analog of is_subdir for vfsmount,dentry pairs, moved from audit_tree.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:55 -05:00
Valerie Aurora
495d6c9c65 VFS: Clean up shared mount flag propagation
The handling of mount flags in set_mnt_shared() got a little tangled
up during previous cleanups, with the following problems:

* MNT_PNODE_MASK is defined as a literal constant when it should be a
bitwise xor of other MNT_* flags
* set_mnt_shared() clears and then sets MNT_SHARED (part of MNT_PNODE_MASK)
* MNT_PNODE_MASK could use a comment in mount.h
* MNT_PNODE_MASK is a terrible name, change to MNT_SHARED_MASK

This patch fixes these problems.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:55 -05:00
Al Viro
5b7e934d88 Use kill_litter_super() in autofs4 ->kill_sb()
... and get rid of open-coding its guts (i.e. RIP autofs4_force_release())

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:54 -05:00
Al Viro
3899167dbd Get rid of mnt_mountpoint abuses in ext4
path to mnt/mnt->mnt_root is no worse than that to
mnt->mnt_parent/mnt->mnt_mountpoint *and* needs no
pinning the sucker down (mnt is not going away and
mnt->mnt_root won't change)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:54 -05:00
Al Viro
f598f9f125 Sanitize autofs_dev_ioctl_ismountpoint()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:53 -05:00
Al Viro
796a6b521d Kill CL_PROPAGATION, sanitize fs/pnode.c:get_source()
First of all, get_source() never results in CL_PROPAGATION
alone.  We either get CL_MAKE_SHARED (for the continuation
of peer group) or CL_SLAVE (slave that is not shared) or both
(beginning of peer group among slaves).  Massage the code to
make that explicit, kill CL_PROPAGATION test in clone_mnt()
(nothing sets CL_MAKE_SHARED without CL_PROPAGATION and in
clone_mnt() we are checking CL_PROPAGATION after we'd found
that there's no CL_SLAVE, so the check for CL_MAKE_SHARED
would do just as well).

Fix comments, while we are at it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:22 -05:00
Al Viro
c177c2ac8c Switch gfs2 to nd_set_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:22 -05:00
Al Viro
8737c9305b Switch may_open() and break_lease() to passing O_...
... instead of mixing FMODE_ and O_

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:21 -05:00
Nick Piggin
d208bbdda9 fs: improve remount,ro vs buffercache coherency
Invalidate sb->s_bdev on remount,ro.

Fixes a problem reported by Jorge Boncompte who is seeing corruption
trying to snapshot a minix filesystem image.  Some filesystems modify
their metadata via a path other than the bdev buffer cache (eg.  they may
use a private linear mapping for their metadata, or implement directories
in pagecache, etc).  Also, file data modifications usually go to the bdev
via their own mappings.

These updates are not coherent with buffercache IO (eg.  via /dev/bdev)
and never have been.  However there could be a reasonable expectation that
after a mount -oremount,ro operation then the buffercache should
subsequently be coherent with previous filesystem modifications.

So invalidate the bdev mappings on a remount,ro operation to provide a
coherency point.

The problem was exposed when we switched the old rd to brd because old rd
didn't really function like a normal block device and updates to rd via
mappings other than the buffercache would still end up going into its
buffercache.  But the same problem has always affected other "normal"
block devices, including loop.

[akpm@linux-foundation.org: repair comment layout]
Reported-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Tested-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:20 -05:00
H Hartley Sweeten
ec4f860597 fs/dcache.c: CodingStyle cleanup
Cleanup EXPORT* macros according to Documantation/CodingStyle.

Move EXPORT* macros to the line immediately after the closing
function brace.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:19 -05:00
Helight.Xu
587d4a17d8 some clean up in fs/proc
EXPORT_SYMBOL(proc_symlink);
EXPORT_SYMBOL(proc_mkdir);
EXPORT_SYMBOL(create_proc_entry);
EXPORT_SYMBOL(proc_create_data);
EXPORT_SYMBOL(remove_proc_entry);

Those EXPORT_SYMBOL shouldn't be in fs/proc/root.c,
should be in fs/proc/generic.c.

Signed-off-by: Helight.Xu <helight.xu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:18 -05:00
Boaz Harrosh
193cf4b991 libfs: Unexport and kill simple_prepare_write
Remove the EXPORT_UNUSED_SYMBOL of simple_prepare_write

Collapse simple_prepare_write into it's only caller, though
making it simpler and clearer to understand.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:17 -05:00
Boaz Harrosh
ad2a722f19 libfs: Open code simple_commit_write into only user
* simple_commit_write was only called by simple_write_end.
  Open coding it makes it tiny bit less heavy on the arithmetic and
  much more readable.

* While at it use zero_user() for clearing a partial page.
* While at it add a docbook comment for simple_write_end.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:16 -05:00
Al Viro
4b1ae27a96 Revert "autofs4: always use lookup for lookup"
This reverts commit 213614d583.

Alas, ->d_revalidate() can't rely on ->lookup() finishing what
it's started; if d_alloc() in do_lookup() fails, we are not going
to call ->lookup() at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 12:58:31 -05:00
Linus Torvalds
feaf77d51a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: add reader's lock for cno in nilfs_ioctl_sync
  nilfs2: delete unnecessary condition in load_segment_summary
  nilfs2: move iterator to write log into segment buffer
  nilfs2: get rid of s_dirt flag use
  nilfs2: get rid of nilfs_segctor_req struct
  nilfs2: delete unnecessary condition in nilfs_dat_translate
  nilfs2: fix potential hang in nilfs_error on errors=remount-ro
  nilfs2: use mnt_want_write in ioctls where write access is needed
  nilfs2: issue discard request after cleaning segments
2010-03-03 08:53:49 -08:00
Linus Torvalds
eca281aad0 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (36 commits)
  Ocfs2: Move ocfs2 ioctl definitions from ocfs2_fs.h to newly added ocfs2_ioctl.h
  ocfs2: send SIGXFSZ if new filesize exceeds limit -v2
  ocfs2/userdlm: Add tracing in userdlm
  ocfs2: Use a separate masklog for AST and BASTs
  dlm: allow dlm do recovery during shutdown
  ocfs2: Only bug out in direct io write for reflinked extent.
  ocfs2: fix warning in ocfs2_file_aio_write()
  ocfs2_dlmfs: Enable the use of user cluster stacks.
  ocfs2_dlmfs: Use the stackglue.
  ocfs2_dlmfs: Don't honor truncate.  The size of a dlmfs file is LVB_LEN
  ocfs2: Pass the locking protocol into ocfs2_cluster_connect().
  ocfs2: Remove the ast pointers from ocfs2_stack_plugins
  ocfs2: Hang the locking proto on the cluster conn and use it in asts.
  ocfs2: Attach the connection to the lksb
  ocfs2: Pass lksbs back from stackglue ast/bast functions.
  ocfs2_dlmfs: Move to its own directory
  ocfs2_dlmfs: Use poll() to signify BASTs.
  ocfs2_dlmfs: Add capabilities parameter.
  ocfs2: Handle errors while setting external xattr values.
  ocfs2: Set inline xattr entries with ocfs2_xa_set()
  ...
2010-03-03 08:53:17 -08:00
Linus Torvalds
60f8a8d4c6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix large stack use
  fuse: cleanup in fuse_notify_inval_...()
2010-03-03 08:08:21 -08:00
Linus Torvalds
0a135ba14d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: add __percpu sparse annotations to what's left
  percpu: add __percpu sparse annotations to fs
  percpu: add __percpu sparse annotations to core kernel subsystems
  local_t: Remove leftover local.h
  this_cpu: Remove pageset_notifier
  this_cpu: Page allocator conversion
  percpu, x86: Generic inc / dec percpu instructions
  local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
  module: Use this_cpu_xx to dynamically allocate counters
  local_t: Remove cpu_local_xx macros
  percpu: refactor the code in pcpu_[de]populate_chunk()
  percpu: remove compile warnings caused by __verify_pcpu_ptr()
  percpu: make accessors check for percpu pointer in sparse
  percpu: add __percpu for sparse.
  percpu: make access macros universal
  percpu: remove per_cpu__ prefix.
2010-03-03 07:34:18 -08:00
Linus Torvalds
4850f524b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
  GFS2: print glock numbers in hex
  GFS2: ordered writes are backwards
  GFS2: Remove old, unused linked list code from quota
  GFS2: Remove loopy umount code
  GFS2: Metadata address space clean up
2010-03-03 07:33:50 -08:00
Linus Torvalds
4846546f7e Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] pSesInfo->sesSem is used as mutex. Rename it to session_mutex and
  [CIFS] Use unsigned ea length for clarity
  cifs: set server_eof in cifs_fattr_to_inode
  [CIFS] Minor cleanup to EA patch
  cifs: merge CIFSSMBQueryEA with CIFSSMBQAllEAs
  cifs: verify lengths of QueryAllEAs reply
  cifs: increase maximum buffer size in CIFSSMBQAllEAs
  cifs: rename name_len to list_len in CIFSSMBQAllEAs
  cifs: clean up indentation in CIFSSMBQAllEAs
  cifs: add parens around smb_var in BCC macros
2010-03-03 07:32:10 -08:00
Linus Torvalds
832d30ca72 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (38 commits)
  SELinux: Make selinux_kernel_create_files_as() shouldn't just always return 0
  TOMOYO: Protect find_task_by_vpid() with RCU.
  Security: add static to security_ops and default_security_ops variable
  selinux: libsepol: remove dead code in check_avtab_hierarchy_callback()
  TOMOYO: Remove __func__ from tomoyo_is_correct_path/domain
  security: fix a couple of sparse warnings
  TOMOYO: Remove unneeded parameter.
  TOMOYO: Use shorter names.
  TOMOYO: Use enum for index numbers.
  TOMOYO: Add garbage collector.
  TOMOYO: Add refcounter on domain structure.
  TOMOYO: Merge headers.
  TOMOYO: Add refcounter on string data.
  TOMOYO: Reduce lines by using common path for addition and deletion.
  selinux: fix memory leak in sel_make_bools
  TOMOYO: Extract bitfield
  syslog: clean up needless comment
  syslog: use defined constants instead of raw numbers
  syslog: distinguish between /proc/kmsg and syscalls
  selinux: allow MLS->non-MLS and vice versa upon policy reload
  ...
2010-03-02 14:47:24 -08:00
Tristan Ye
9df5778ece Ocfs2: Move ocfs2 ioctl definitions from ocfs2_fs.h to newly added ocfs2_ioctl.h
Currently we were adding ioctl cmds/structures for ocfs2 into ocfs2_fs.h
which was used for define ocfs2 on-disk layout. That sounds a little bit
confusing, and it may be quickly polluted espcially when growing the
ocfs2_info_request ioctls afterwards(it will grow i bet).

As a result, such OCFS2 IOCs do need to be placed somewhere other than
ocfs2_fs.h, a separated ocfs2_ioctl.h will be added to store such ioctl
structures and definitions which could also be used from userspace to
invoke ioctls call.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-02 14:10:20 -08:00
Andy Adamson
180b62a3d8 nfs41 fix NFS4ERR_CLID_INUSE for exchange id
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 13:45:33 -05:00
Linus Torvalds
6c0ad5dfd3 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Revert "blkdev: fix merge_bvec_fn return value checks"
2010-03-02 10:33:36 -08:00
Jens Axboe
9599945bac Revert "blkdev: fix merge_bvec_fn return value checks"
This reverts commit 9f7cdbc33f.

It's causing oopses om dm setups, so revert it until we investigate.

Reported-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-03-02 19:17:34 +01:00
Trond Myklebust
ebed9203b6 NFS: Fix an allocation-under-spinlock bug
sunrpc_cache_update() will always call detail->update() from inside the
detail->hash_lock, so it cannot allocate memory.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-03-02 13:06:22 -05:00
Trond Myklebust
0f79fd6f5c NFSv4.1: Various fixes to the sequence flag error handling
Ensure that we change the EXCHANGE_ID verifier (i.e. clp->cl_boot_time)
when we want to reset all state. This is mainly needed when the server
tells us that it is revoking our open or lock stateids.

Handle revoking of recallable state by expiring the delegations.

Handle callback path issues by expiring the delegations and then resetting
the session.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 13:06:21 -05:00
Alexandros Batsakis
0851de0617 nfs4: renewd renew operations should take/put a client reference
renewd sends RENEW requests to the NFS server in order to renew state.
As the request is asynchronous, renewd should take a reference to the
nfs_client to prevent concurrent umounts from freeing the client

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 13:00:03 -05:00
Alexandros Batsakis
7135840fc7 nfs41: renewd sequence operations should take/put client reference
renewd sends SEQUENCE requests to the NFS server in order to renew state.
As the request is asynchronous, renewd should take a reference to the
nfs_client to prevent concurrent umounts from freeing the session/client

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 12:54:30 -05:00
Alexandros Batsakis
dc96aef96a nfs: prevent backlogging of renewd requests
If the renewd send queue gets backlogged (e.g., if the server goes down),
we will keep filling the queue with periodic RENEW/SEQUENCE requests.

This patch schedules a new renewd request if and only if the previous one
returns (either success or failure)

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
[Trond.Myklebust@netapp.com: moved nfs4_schedule_state_renewal() into
separate nfs4_renew_release() and nfs41_sequence_release() callbacks
to ensure correct behaviour on call setup failure]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 12:44:07 -05:00
Alexandros Batsakis
888ef2e3f8 nfs: kill renewd before clearing client minor version
renewd should be synchronously killed before we destroy the session in
nfs4_clear_minor_version

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
[Trond.Myklebust@netapp.com: clean up to remove 'unused function
warning when !CONFIG_NFS_V4]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 12:16:12 -05:00
Linus Torvalds
6d6b89bd2e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1341 commits)
  virtio_net: remove forgotten assignment
  be2net: fix tx completion polling
  sis190: fix cable detect via link status poll
  net: fix protocol sk_buff field
  bridge: Fix build error when IGMP_SNOOPING is not enabled
  bnx2x: Tx barriers and locks
  scm: Only support SCM_RIGHTS on unix domain sockets.
  vhost-net: restart tx poll on sk_sndbuf full
  vhost: fix get_user_pages_fast error handling
  vhost: initialize log eventfd context pointer
  vhost: logging thinko fix
  wireless: convert to use netdev_for_each_mc_addr
  ethtool: do not set some flags, if others failed
  ipoib: returned back addrlen check for mc addresses
  netlink: Adding inode field to /proc/net/netlink
  axnet_cs: add new id
  bridge: Make IGMP snooping depend upon BRIDGE.
  bridge: Add multicast count/interval sysfs entries
  bridge: Add hash elasticity/max sysfs entries
  bridge: Add multicast_snooping sysfs toggle
  ...

Trivial conflicts in Documentation/feature-removal-schedule.txt
2010-03-02 07:55:08 -08:00
Dmitry Monakhov
67eeb5685d ext4: Fix ext4_quota_write cross block boundary behaviour
We always assume what dquot update result in changes in one data block
But ext4_quota_write() function may handle cross block boundary writes
In fact if this ever happen it will result in incorrect journal
credits reservation, and later a BUG_ON.  As soon this never happen
the boundary cross loop is NOOP.  In order to make things straight
let's remove this loop and assert cross boundary condition.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 08:08:51 -05:00
Frank Mayhar
273df556b6 ext4: Convert BUG_ON checks to use ext4_error() instead
Convert a bunch of BUG_ONs to emit a ext4_error() message and return
EIO.  This is a first pass and most notably does _not_ cover
mballoc.c, which is a morass of void functions.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 11:46:09 -05:00
Jiaying Zhang
b7adc1f363 ext4: Use direct_IO_no_locking in ext4 dio read
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 13:26:36 -05:00
Jiaying Zhang
744692dc05 ext4: use ext4_get_block_write in buffer write
Allocate uninitialized extent before ext4 buffer write and
convert the extent to initialized after io completes.
The purpose is to make sure an extent can only be marked
initialized after it has been written with new data so
we can safely drop the i_mutex lock in ext4 DIO read without
exposing stale data. This helps to improve multi-thread DIO
read performance on high-speed disks.

Skip the nobh and data=journal mount cases to make things simple for now.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 16:14:02 -05:00
Jiaying Zhang
c7064ef13b ext4: mechanical rename some of the direct I/O get_block's identifiers
This commit renames some of the direct I/O's block allocation flags,
variables, and functions introduced in Mingming's "Direct IO for holes
and fallocate" patches so that they can be used by ext4's buffered
write path as well.  Also changed the related function comments
accordingly to cover both direct write and buffered write cases.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 13:28:44 -05:00
Toshiyuki Okajima
b8b8afe236 ext4: make "offset" consistent in ext4_check_dir_entry()
The callers of ext4_check_dir_entry() usually pass in the "file
offset" (ext4_readdir, htree_dirblock_to_tree, search_dirblock,
ext4_dx_find_entry, empty_dir), but a few callers (add_dirent_to_buf,
ext4_delete_entry) only pass in the buffer offset.

To accomodate those last two (which would be hard to fix otherwise),
this patch changes ext4_check_dir_entry() to print the physical block
number and the relative offset as well as the passed-in offset.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 00:21:35 -05:00
Dmitry Monakhov
6e3617e579 ext4: Handle non empty on-disk orphan link
In case of truncate errors we explicitly remove inode from in-core
orphan list via orphan_del(NULL, inode) without modifying the on-disk list.

But later on, the same inode may be inserted in the orphan list again
which will result the on-disk linked list getting corrupted.  If inode
i_dtime contains valid value, then skip on-disk list modification.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:29:39 -05:00
Dmitry Monakhov
da1dafca84 ext4: explicitly remove inode from orphan list after failed direct io
Otherwise non-empty orphan list will be triggered on umount.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:15:02 -05:00
Dmitry Monakhov
f39490bcd1 ext4: fix error handling in migrate
Set i_nlink to zero for temporary inode from very beginning.
otherwise we may fail to start new journal handle and this
inode will be unreferenced but with i_nlink == 1
Since we hold inode reference it can not be pruned.

Also add missed journal_start retval check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:14:36 -05:00
Dmitry Monakhov
437ca0fda3 ext4: deprecate obsoleted mount options
Declare following list of mount options as deprecated:
 - bsddf, miniddf
 - grpid, bsdgroups, nogrpid, sysvgroups

Declare following list of default mount options as deprecated:
 - bsdgroups

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 22:29:21 -05:00
Christoph Hellwig
f1f724e4b5 xfs: fix locking for inode cache radix tree tag updates
The radix-tree code requires it's users to serialize tag updates
against other updates to the tree.  While XFS protects tag updates
against each other it does not serialize them against updates of the
tree contents, which can lead to tag corruption.  Fix the inode
cache to always take pag_ici_lock in exclusive mode when updating
radix tree tags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Patrick Schreurs <patrick@news-service.com>
Tested-by: Patrick Schreurs <patrick@news-service.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 19:14:36 -06:00
Tao Ma
cc483f102c ext4: Fix fencepost error in chosing choosing group vs file preallocation.
The ext4 multiblock allocator decides whether to use group or file
preallocation based on the file size.  When the file size reaches
s_mb_stream_request (default is 16 blocks), it changes to use a
file-specific preallocation. This is cool, but it has a tiny problem.

See a simple script:
mkfs.ext4 -b 1024 /dev/sda8 1000000
mount -t ext4 -o nodelalloc /dev/sda8 /mnt/ext4
for((i=0;i<5;i++))
do
cat /mnt/4096>>/mnt/ext4/a	#4096 is a file with 4096 characters.
cat /mnt/4096>>/mnt/ext4/b
done
debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1

And you get
BLOCKS:
(0-14):8705-8719, (15):2356, (16-19):8465-8468

So there are 3 extents, a bit strange for the lonely 15th logical
block.  As we write to the 16 blocks, we choose file preallocation in
ext4_mb_group_or_file, but in ext4_mb_normalize_request, we meet with
the 16*1024 range, so no preallocation will be carried. file b then
reserves the space after '2356', so when when write 16, we start from
another part.

This patch just change the check in ext4_mb_group_or_file, so
that for the lonely 15 we will still use group preallocation.
After the patch, we will get:
debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1
BLOCKS:
(0-15):8705-8720, (16-19):8465-8468

Looks more sane. Thanks.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 19:06:35 -05:00
Sage Weil
e9964c1023 ceph: fix flush_dirty_caps race with caps migration
The flush_dirty_caps() used to loop over the first entry of the cap_dirty
dirty list on the assumption that after calling ceph_check_caps() it would
be removed from the list.  This isn't true for caps that are being
migrated between MDSs, where we've received the EXPORT but not the IMPORT.

Instead, do a safe list iteration, and pin the next inode on the list via
the CEPH_I_NOFLUSH flag.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:28:02 -08:00
Sage Weil
7af8f1e4aa ceph: include migrating caps in issued set
We should include caps that are mid-migration (we've received the EXPORT,
but not the IMPORT) in the issued caps set.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:28:01 -08:00
Sage Weil
e53a8fd773 ceph: fix osdmap decoding when pools include (removed) snaps
Add missing pointer dereference (p is a void **).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:28:00 -08:00
Sage Weil
195d3ce2cc ceph: return EBADF if waiting for caps on closed file
Verify the file is actually open for the given caps when we are
waiting for caps.  This ensures we will wake up and return EBADF
if another thread closes the file out from under us.

Note that EBADF is also the correct return code from write(2)
when called on a file handle opened for reading (although the
vfs should catch that).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:28:00 -08:00
Sage Weil
6f863e712d ceph: set osd request message front length correctly
We didn't set the front length correctly.  When messages used
the message pool we ended up with the conservative max (4 KB), and
the rest of the time the slightly less conservative estimate.  Even
though the OSD ignores the extra data, set it to the right value to avoid
sending extra data over the network.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:26:41 -08:00
Sage Weil
3ca02ef96e ceph: reset front len on return to msgpool; BUG on mismatched front iov
Reset msg front len when a message is returned to the pool: the caller
may have changed it.

BUG if we try to send a message with a hdr.front_len that doesn't match
the front iov.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:25:00 -08:00
Sage Weil
70edb55bdf ceph: fix snaptrace decoding on cap migration between mds
This was simply broken.  Apparently at some point we thought about putting
the snaptrace in the middle section, but didn't.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:20:05 -08:00
Sage Weil
c16e786927 ceph: use single osd op reply msg
Use a single ceph_msg for the osd reply, even when we are getting multiple
replies.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:20:02 -08:00
Sage Weil
1679f876a6 ceph: reset bits on connection close
Clear LOSSYTX bit, so that if/when we reconnect, said reconnect
will retry on failure.

Clear _PENDING bits too, to avoid polluting subsequent
connection state.

Drop unused REGISTERED bit.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-01 15:19:51 -08:00
Christoph Hellwig
a14a5ab58f xfs: remove xfs_ipin/xfs_iunpin
Inodes are only pinned/unpinned via the inode item methods, and lots of
code relies on that fact.  So remove the separate xfs_ipin/xfs_iunpin
helpers and merge them into their only callers.  This also fixes up
various duplicate and/or incorrect comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:56 -06:00
Christoph Hellwig
60ec678371 xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait
Remove the inode item pointer and ili_last_lsn checks in
__xfs_iunpin_wait as any pinned inode is guaranteed to have them
valid.  After this the xfs_iunpin_nowait case is nothing more than a
xfs_log_force_lsn, as we know that the caller has already checked
the pincount.

Make xfs_iunpin_nowait the new low-level routine just doing the log
force and rewrite xfs_iunpin_wait around it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:50 -06:00
Christoph Hellwig
d7658d487f xfs: kill xfs_lrw.h
Move the two declarations to better fitting headers now that
xfs_lrw.c is gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:44 -06:00
Christoph Hellwig
d7e84f4137 xfs: factor common xfs_trans_bjoin code
Most of xfs_trans_bjoin is duplicated in xfs_trans_get_buf,
xfs_trans_getsb and xfs_trans_read_buf.  Add a new _xfs_trans_bjoin
which can be called by all four functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:37 -06:00
Christoph Hellwig
35a8a72f06 xfs: stop passing opaque handles to xfs_log.c routines
Currenly we pass opaque xfs_log_ticket_t handles instead of
struct xlog_ticket pointers, and void pointers instead of
struct xlog_in_core pointers to various log manager functions.
Instead pass properly typed pointers after adding forward
declarations for them to xfs_log.h, and adjust the touched
function prototypes to the standard XFS style while at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:32 -06:00
Christoph Hellwig
c467c049e7 xfs: split xfs_bmap_btalloc
Split out the nullfb case into a separate function to reduce the stack
footprint and make the code more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:25 -06:00
Christoph Hellwig
f7008d0aeb xfs: fix xfs_fsblock_t tracing
Using a static buffer in xfs_fmtfsblock means we can corrupt traces if
multiple CPUs hit this code path at the same.  Just remove xfs_fmtfsblock
for now and print the block number purely numerical.  If we want the
NULLFSBLOCK and NULLSTARTBLOCK formatting back the best way would be
a decoding plugin in the trace-cmd userspace command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:17 -06:00
Christoph Hellwig
024910cbac xfs: fix inode pincount check in fsync
We need to hold the ilock to check the inode pincount safely.  While
we're at it also remove the check for ip->i_itemp->ili_last_lsn, a
pinned inode always has it set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:10 -06:00
Dave Chinner
77d7a0c2ee xfs: Non-blocking inode locking in IO completion
The introduction of barriers to loop devices has created a new IO
order completion dependency that XFS does not handle. The loop
device implements barriers using fsync and so turns a log IO in the
XFS filesystem on the loop device into a data IO in the backing
filesystem. That is, the completion of log IOs in the loop
filesystem are now dependent on completion of data IO in the backing
filesystem.

This can cause deadlocks when a flush daemon issues a log force with
an inode locked because the IO completion of IO on the inode is
blocked by the inode lock. This in turn prevents further data IO
completion from occuring on all XFS filesystems on that CPU (due to
the shared nature of the completion queues). This then prevents the
log IO from completing because the log is waiting for data IO
completion as well.

The fix for this new completion order dependency issue is to make
the IO completion inode locking non-blocking. If the inode lock
can't be grabbed, simply requeue the IO completion back to the work
queue so that it can be processed later. This prevents the
completion queue from being blocked and allows data IO completion on
other inodes to proceed, hence avoiding completion order dependent
deadlocks.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:52 -06:00
Christoph Hellwig
66d834ea60 xfs: implement optimized fdatasync
Allow us to track the difference between timestamp and size updates
by using mark_inode_dirty from the I/O completion code, and checking
the VFS inode flags in xfs_file_fsync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:45 -06:00
Christoph Hellwig
fd3200bef7 xfs: remove wrapper for the fsync file operation
Currently the fsync file operation is divided into a low-level
routine doing all the work and one that implements the Linux file
operation and does minimal argument wrapping.  This is a leftover
from the days of the vnode operations layer and can be removed to
simplify the code a bit, as well as preparing for the implementation
of an optimized fdatasync which needs to look at the Linux inode
state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:38 -06:00
Christoph Hellwig
00258e36b2 xfs: remove wrappers for read/write file operations
Currently the aio_read, aio_write, splice_read and splice_write file
operations are divided into a low-level routine doing all the work
and one that implements the Linux file operations and does minimal
argument wrapping.  This is a leftover from the days of the vnode
operations layer and can be removed to simplify the code a lot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:29 -06:00
Christoph Hellwig
dda35b8f84 xfs: merge xfs_lrw.c into xfs_file.c
Currently the code to implement the file operations is split over
two small files.  Merge the content of xfs_lrw.c into xfs_file.c to
have it in one place.  Note that I haven't done various cleanups
that are possible after this yet, they will follow in the next
patch.  Also the function xfs_dev_is_read_only which was in
xfs_lrw.c before really doesn't fit in here at all and was moved to
xfs_mount.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:18 -06:00
Christoph Hellwig
b262e5dfd9 xfs: fix dquota trace format
The be32_to_cpu in the TP_printk output breaks automatic parsing of
the trace format by the trace-cmd tools, so we have to move it into
the TP_assign block.  While we're at it also fix the format for the
quota limits to more regular and easier parseable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:11 -06:00
Eric Sandeen
a9cc799eca xfs: increase readdir buffer size
While doing some testing of readdir perf a while back,
I noticed that the buffer size we're using internally is
smaller than what glibc gives us by default.  Upping this
size helped a bit, and seems safe.

glibc's __alloc_dir() does:

  const size_t default_allocation = (4 * BUFSIZ < sizeof (struct dirent64)
                                     ? sizeof (struct dirent64) : 4 * BUFSIZ);
  const size_t small_allocation = (BUFSIZ < sizeof (struct dirent64)
                                   ? sizeof (struct dirent64) : BUFSIZ);
  size_t allocation = default_allocation;
#ifdef _STATBUF_ST_BLKSIZE
  if (statp != NULL && default_allocation < statp->st_blksize)
    allocation = statp->st_blksize;
#endif

and

#define _G_BUFSIZ 8192
#define _IO_BUFSIZ _G_BUFSIZ
# define BUFSIZ _IO_BUFSIZ

so the default buffer is 4 * 8192 = 32768
(except in the unlikely case of blocks > 32k....)

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:33:41 -06:00
Linus Torvalds
b1bf936840 Merge branch 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits)
  block: don't access jiffies when initialising io_context
  cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds
  block: fix for "Consolidate phys_segment and hw_segment limits"
  cfq-iosched: quantum check tweak
  blktrace: perform cleanup after setup error
  blkdev: fix merge_bvec_fn return value checks
  cfq-iosched: requests "in flight" vs "in driver" clarification
  cciss: Fix problem with scatter gather elements in the scsi half of the driver
  cciss: eliminate unnecessary pointer use in cciss scsi code
  cciss: do not use void pointer for scsi hba data
  cciss: factor out scatter gather chain block mapping code
  cciss: fix scatter gather chain block dma direction kludge
  cciss: simplify scatter gather code
  cciss: factor out scatter gather chain block allocation and freeing
  cciss: detect bad alignment of scsi commands at build time
  cciss: clarify command list padding calculation
  cfq-iosched: rethink seeky detection for SSDs
  cfq-iosched: rework seeky detection
  block: remove padding from io_context on 64bit builds
  block: Consolidate phys_segment and hw_segment limits
  ...
2010-03-01 09:00:29 -08:00
Bob Peterson
4818972efb GFS2: print glock numbers in hex
This patch changes glock numbers from printing in decimal to hex.
Since DLM prints corresponding resource IDs in hex, it makes debugging
easier.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01 14:09:04 +00:00
Dave Chinner
e5884636da GFS2: ordered writes are backwards
When we queue data buffers for ordered write, the buffers are added
to the head of the ordered write list. When the log needs to push
these buffers to disk, it also walks the list from the head. The
result is that the the ordered buffers are submitted to disk in
reverse order.

For large writes, this means that whenever the log flushes large
streams of reverse sequential order buffers are pushed down into the
block layers. The elevators don't handle this particularly well, so
IO rates tend to be significantly lower than if the IO was issued in
ascending block order.

Queue new ordered buffers to the tail of the ordered buffer list to
ensure that IO is dispatched in the order it was submitted. This
should significantly improve large sequential write speeds. On a
disk capable of 85MB/s, speeds increase from 50MB/s to 65MB/s for
noop and from 38MB/s to 50MB/s for cfq.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01 14:08:26 +00:00
Steven Whitehouse
c1184f8ab7 GFS2: Remove loopy umount code
As a consequence of the previous patch, we can now remove the
loop which used to be required due to the circular dependency
between the inodes and glocks. Instead we can just invalidate
the inodes, and then clear up any glocks which are left.

Also we no longer need the rwsem since there is no longer any
danger of the inode invalidation calling back into the glock
code (and from there back into the inode code).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01 14:07:53 +00:00
Steven Whitehouse
009d851837 GFS2: Metadata address space clean up
Since the start of GFS2, an "extra" inode has been used to store
the metadata belonging to each inode. The only reason for using
this inode was to have an extra address space, the other fields
were unused. This means that the memory usage was rather inefficient.

The reason for keeping each inode's metadata in a separate address
space is that when glocks are requested on remote nodes, we need to
be able to efficiently locate the data and metadata which relating
to that glock (inode) in order to sync or sync and invalidate it
(depending on the remotely requested lock mode).

This patch adds a new type of glock, which has in addition to
its normal fields, has an address space. This applies to all
inode and rgrp glocks (but to no other glock types which remain
as before). As a result, we no longer need to have the second
inode.

This results in three major improvements:
 1. A saving of approx 25% of memory used in caching inodes
 2. A removal of the circular dependency between inodes and glocks
 3. No confusion between "normal" and "metadata" inodes in super.c

Although the first of these is the more immediately apparent, the
second is just as important as it now enables a number of clean
ups at umount time. Those will be the subject of future patches.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01 14:07:37 +00:00
David S. Miller
47871889c6 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/firmware/iscsi_ibft.c
2010-02-28 19:23:06 -08:00
James Morris
b4ccebdd37 Merge branch 'next' into for-linus 2010-03-01 09:36:31 +11:00
Dmitry Monakhov
9f7cdbc33f blkdev: fix merge_bvec_fn return value checks
merge_bvec_fn() returns bvec->bv_len on success. So we have to check
against this value. But in case of fs_optimization merge we compare
with wrong value. This patch must be included in
 b428cd6da7e6559aca69aa2e3a526037d3f20403
But accidentally i've forgot to add this in the initial patch.
To make things straight let's replace all such checks.
In fact this makes code easy to understand.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-28 19:47:18 +01:00
Linus Torvalds
642c4c75a7 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits)
  rcu: Fix accelerated GPs for last non-dynticked CPU
  rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot
  rcu: Fix accelerated grace periods for last non-dynticked CPU
  rcu: Export rcu_scheduler_active
  rcu: Make rcu_read_lock_sched_held() take boot time into account
  rcu: Make lockdep_rcu_dereference() message less alarmist
  sched, cgroups: Fix module export
  rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information
  rcu: Fix rcutorture mod_timer argument to delay one jiffy
  rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection
  rcu: Convert to raw_spinlocks
  rcu: Stop overflowing signed integers
  rcu: Use canonical URL for Mathieu's dissertation
  rcu: Accelerate grace period if last non-dynticked CPU
  rcu: Fix citation of Mathieu's dissertation
  rcu: Documentation update for CONFIG_PROVE_RCU
  security: Apply lockdep-based checking to rcu_dereference() uses
  idr: Apply lockdep-based diagnostics to rcu_dereference() uses
  radix-tree: Disable RCU lockdep checking in radix tree
  vfs: Abstract rcu_dereference_check for files-fdtable use
  ...
2010-02-28 10:13:16 -08:00
Boaz Harrosh
50a76fd3c3 exofs: groups support
* _calc_stripe_info() changes to accommodate for grouping
  calculations. Returns additional information

* old _prepare_pages() becomes _prepare_one_group()
  which stores pages belonging to one device group.

* New _prepare_for_striping iterates on all groups calling
  _prepare_one_group().

* Enable mounting of groups data_maps (group_width != 0)

[QUESTION]
what is faster A or B;
A.	x += stride;
	x = x % width + first_x;

B	x += stride
	if (x < last_x)
		x = first_x;

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:55:53 -08:00
Boaz Harrosh
b367e78bd1 exofs: Prepare for groups
* Rename _offset_dev_unit_off() to _calc_stripe_info()
  and recieve a struct for the output params

* In _prepare_for_striping we only need to call
  _calc_stripe_info() once. The other componets
  are easy to calculate from that. This code
  was inspired by what's done in truncate.

* Some code shifts that make sense now but will make
  more sense when group support is added.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:44:44 -08:00
Boaz Harrosh
96391e2bae exofs: Error recovery if object is missing from storage
If an object is referenced by a directory but does not
exist on a target, it is a very serious corruption that
means:
1. Either a power failure with very slim chance of it
  happening. Because the directory update is always submitted
  much after object creation, but if a directory is written
  to one device and the object creation to another it might
  theoretically happen.
2. It only ever happened to me while developing with BUGs
  causing file corruption. Crashes could also cause it but
  they are more like case 1.

In any way the object does not exist, so data is surely lost.
If there is a mix-up in the obj-id or data-map, then lost objects
can be salvaged by off-line fsck. The only recoverable information
is the directory name. By letting it appear as a regular empty file,
with date==0 (1970 Jan 1st) ownership to root, we enable recovery
of the only useful information. And also enable deletion or over-write.
I can see how this can hurt.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:44:43 -08:00
Boaz Harrosh
86093aaff5 exofs: convert io_state to use pages array instead of bio at input
* inode.c operations are full-pages based, and not actually
  true scatter-gather
* Lets us use more pages at once upto 512 (from 249) in 64 bit
* Brings us much much closer to be able to use exofs's io_state engine
  from objlayout driver. (Once I decide where to put the common code)

After RAID0 patch the outer (input) bio was never used as a bio, but
was simply a page carrier into the raid engine. Even in the simple
mirror/single-dev arrangement pages info was copied into a second bio.
It is now easer to just pass a pages array into the io_state and prepare
bio(s) once.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:44:42 -08:00
Boaz Harrosh
5d952b8391 exofs: RAID0 support
We now support striping over mirror devices. Including variable sized
stripe_unit.

Some limits:
* stripe_unit must be a multiple of PAGE_SIZE
* stripe_unit * stripe_count is maximum upto 32-bit (4Gb)

Tested RAID0 over mirrors, RAID0 only, mirrors only. All check.

Design notes:
* I'm not using a vectored raid-engine mechanism yet. Following the
  pnfs-objects-layout data-map structure, "Mirror" is just a private
  case of "group_width" == 1, and RAID0 is a private case of
  "Mirrors" == 1. The performance lose of the general case over the
  particular special case optimization is totally negligible, also
  considering the extra code size.

* In general I added a prepare_stripes() stage that divides the
  to-be-io pages to the participating devices, the previous
  exofs_ios_write/read, now becomes _write/read_mirrors and a new
  write/read upper layer loops on all devices calling
  _write/read_mirrors. Effectively the prepare_stripes stage is the all
  secret.
  Also truncate need fixing to accommodate for striping.

* In a RAID0 arrangement, in a regular usage scenario, if all inode
  layouts will start at the same device, the small files fill up the
  first device and the later devices stay empty, the farther the device
  the emptier it is.

  To fix that, each inode will start at a different stripe_unit,
  according to it's obj_id modulus number-of-stripe-units. And
  will then span all stripe-units in the same incrementing order
  wrapping back to the beginning of the device table. We call it
  a stripe-units moving window.

  Special consideration was taken to keep all devices in a mirror
  arrangement identical. So a broken osd-device could just be cloned
  from one of the mirrors and no FS scrubbing is needed. (We do that
  by rotating stripe-unit at a time and not a single device at a time.)

TODO:
 We no longer verify object_length == inode->i_size in exofs_iget.
 (since i_size is stripped on multiple objects now).
 I should introduce a multiple-device attribute reading, and use
 it in exofs_iget.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:43:08 -08:00
Boaz Harrosh
d9c740d225 exofs: Define on-disk per-inode optional layout attribute
* Layouts describe the way a file is spread on multiple devices.
  The layout information is stored in the objects attribute introduced
  in this patch.

* There can be multiple generating function for the layout.
  Currently defined:
    - No attribute present - use below moving-window on global
      device table, all devices.
      (This is the only one currently used in exofs)
    - an obj_id generated moving window - the obj_id is a randomizing
      factor in the otherwise global map layout.
    - An explicit layout stored, including a data_map and a device
      index list.
    - More might be defined in future ...

* There are two attributes defined of the same structure:
  A-data-files-layout - This layout is used by data-files. If present
                        at a directory, all files of that directory will
                        be created with this layout.
  A-meta-data-layout - This layout is used by a directory and other
                       meta-data information. Also inherited at creation
                       of subdirectories.

* At creation time inodes are created with the layout specified above.
  A usermode utility may change the creation layout on a give directory
  or file. Which in the case of directories, will also apply to newly
  created files/subdirectories, children of that directory.
  In the simple unaltered case of a newly created exofs, no layout
  attributes are present, and all layouts adhere to the layout specified
  at the device-table.

* In case of a future file system loaded in an old exofs-driver.
  At iget(), the generating_function is inspected and if not supported
  will return an IO error to the application and the inode will not
  be loaded. So not to damage any data.
  Note: After this patch we do not yet support any type of layout
        only the RAID0 patch that enables striping at the super-block
        level will add support for RAID0 layouts above. This way we
        are past and future compatible and fully bisectable.

* Access to the device table is done by an accessor since
  it will change according to above information.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:28 -08:00
Boaz Harrosh
46f4d973f6 exofs: unindent exofs_sbi_read
The original idea was that a mirror read can be sub-divided
to multiple devices. But this has very little gain and only
at very large IOes so it's not going to be implemented soon.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:27 -08:00
Boaz Harrosh
45d3abcb1a exofs: Move layout related members to a layout structure
* Abstract away those members in exofs_sb_info that are related/needed
  by a layout into a new exofs_layout structure. Embed it in exofs_sb_info.

* At exofs_io_state receive/keep a pointer to an exofs_layout. No need for
  an exofs_sb_info pointer, all we need is at exofs_layout.

* Change any usage of above exofs_sb_info members to their new name.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:27 -08:00
Boaz Harrosh
22ddc55638 exofs: Recover in the case of read-passed-end-of-file
In check_io, implement the case of reading passed end of
file, by clearing the pages and recover with no error. In
a raid arrangement this can become a legitimate situation
in case of holes in the file.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:26 -08:00
Boaz Harrosh
518f167a37 exofs: Micro-optimize exofs_i_info
optimize the exofs_i_info struct usage by moving the embedded
vfs_inode to be first. A compiler might optimize away an "add"
operation with constant zero. (Which it cannot with other constants)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:25 -08:00
Boaz Harrosh
34ce4e7c23 exofs: debug print even less
* Last debug trimming left in some stupid print, remove them.
  Fixup some other prints
* Shift printing from inode.c to ios.c
* Add couple of prints when memory allocation fails.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:25 -08:00
Wengang Wang
5051f76883 ocfs2: send SIGXFSZ if new filesize exceeds limit -v2
This patch makes ocfs2 send SIGXFSZ if new file size exceeds the rlimit.
Processes may get SIGXFSZ on one node (in the cluster) while others will
not on another if file size limits are different on the two nodes.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-27 20:08:51 -08:00
Sunil Mushran
6fcef3f04a ocfs2/userdlm: Add tracing in userdlm
Make use of the newly added BASTS masklog to trace ASTs and BASTs in userdlm.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-27 19:57:07 -08:00
Sunil Mushran
9b915181af ocfs2: Use a separate masklog for AST and BASTs
This patch adds a new masklog and uses it allow tracing ASTs and BASTs
in the dlmglue layer. This has been found to be very useful in debugging
cluster locking issues.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-27 19:57:06 -08:00
Christian Kujau
4912002fff Remove EXPERIMENTAL from NFS_FSCACHE
There's currently an open Ubuntu bug[0], with the intent to compile NFS_FSCACHE
(and possibly AFS_FSCACHE, 9P_FSCACHE) into the standard Ubuntu kernel.
However, since *_FSCACHE still depends on EXPERIMENTAL, this won't happen.

As Arjan van de Ven pointed out[1], the EXPERIMENTAL flag doesn't mean that
much any more, I propose the following patch to fs/nfs/Kconfig.  I'd do the
same for fs/9p/Kconfig and fs/afs/Kconfig, but as I did not test 9p or AFS, I
feel it would not be appropriate for me to remove the flag.

[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/440522/comments/5
[1] http://lkml.org/lkml/2010/1/23/145

Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-26 17:22:35 -08:00
Linus Torvalds
4cbd55188f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: use bastmode in debugfs output
  dlm: Send lockspace name with uevents
  dlm: send reply before bast
  dlm: fix ordering of bast and cast
2010-02-26 17:19:30 -08:00
Linus Torvalds
b305956abc Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (52 commits)
  fs/xfs: Correct NULL test
  xfs: optimize log flushing in xfs_fsync
  xfs: only clear the suid bit once in xfs_write
  xfs: kill xfs_bawrite
  xfs: log changed inodes instead of writing them synchronously
  xfs: remove invalid barrier optimization from xfs_fsync
  xfs: kill the unused XFS_QMOPT_* flush flags V2
  xfs: Use delay write promotion for dquot flushing
  xfs: Sort delayed write buffers before dispatch
  xfs: Don't issue buffer IO direct from AIL push V2
  xfs: Use delayed write for inodes rather than async V2
  xfs: Make inode reclaim states explicit
  xfs: more reserved blocks fixups
  xfs: turn off sign warnings
  xfs: don't hold onto reserved blocks on remount,ro
  xfs: quota limit statvfs available blocks
  xfs: replace KM_LARGE with explicit vmalloc use
  xfs: cleanup up xfs_log_force calling conventions
  xfs: kill XLOG_VEC_SET_TYPE
  xfs: remove duplicate buffer flags
  ...
2010-02-26 17:18:52 -08:00
Linus Torvalds
f24407d2bd Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt:
  xfs: fix xfs to work with Virtually Indexed architectures
  sh: add mm API for DMA to vmalloc/vmap areas
  arm: add mm API for DMA to vmalloc/vmap areas
  parisc: add mm API for DMA to vmalloc/vmap areas
  mm: add coherence API for DMA to vmalloc/vmap areas
2010-02-26 17:05:10 -08:00
Srinivas Eeda
bc9838c4d4 dlm: allow dlm do recovery during shutdown
If a node down event happens while dlm shutdown in progress, dlm recovery
should be done before dlm is shutdown.  We can't migrate unrecovered locks,
obviously.  But dlm_reco_thread only does recovery if the dlm_state is
in DLM_CTXT_JOINED.

dlm_reco_thread should do recovery if dlm_state is in DLM_CTXT_JOINED or
DLM_CTXT_IN_SHUTDOWN.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:19 -08:00
Tao Ma
cbaee472f2 ocfs2: Only bug out in direct io write for reflinked extent.
In ocfs2_direct_IO_get_blocks, we only need to bug out
in case of we are going to write a recounted extent rec.

What a silly bug introduced by me!

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
2010-02-26 15:41:19 -08:00
Coly Li
66b116c9d8 ocfs2: fix warning in ocfs2_file_aio_write()
This patch fixes a compiling warning in ocfs2_file_aio_write().

Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:18 -08:00
Joel Becker
cbe0e331fd ocfs2_dlmfs: Enable the use of user cluster stacks.
Unlike ocfs2, dlmfs has no permanent storage.  It can't store off a
cluster stack it is supposed to be using.  So it can't specify the stack
name in ocfs2_cluster_connect().

Instead, we create ocfs2_cluster_connect_agnostic(), which simply uses
the stack that is currently enabled.  This is find for dlmfs, which will
rely on the stack initialization.

We add the "stackglue" capability to dlmfs's capability list.  This lets
userspace know dlmfs can be used with all cluster stacks.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:18 -08:00
Joel Becker
0016eedc41 ocfs2_dlmfs: Use the stackglue.
Rather than directly using o2dlm, dlmfs can now use the stackglue.  This
allows it to use userspace cluster stacks and fs/dlm.  This commit
forces o2cb for now.  A latter commit will bump the protocol version and
allow non-o2cb stacks.

This is one big sed, really.  LKM_xxMODE becomes DLM_LOCK_xx.  LKM_flag
becomes DLM_LKF_flag.

We also learn to check that the LVB is valid before reading it.  Any DLM
can lose the contents of the LVB during a complicated recovery.  userdlm
should be checking this.  Now it does.  dlmfs will return 0 from read(2)
if the LVB was invalid.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:18 -08:00
Joel Becker
e8fce482f3 ocfs2_dlmfs: Don't honor truncate. The size of a dlmfs file is LVB_LEN
We want folks using dlmfs to be able to use the LVB in places other than
just write(2)/read(2).  By ignoring truncate requests, we allow 'echo
"contents" > /dlm/space/lockname' to work.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:18 -08:00
Joel Becker
553b5eb91a ocfs2: Pass the locking protocol into ocfs2_cluster_connect().
Inside the stackglue, the locking protocol structure is hanging off of
the ocfs2_cluster_connection.  This takes it one further; the locking
protocol is passed into ocfs2_cluster_connect().  Now different cluster
connections can have different locking protocols with distinct asts.
Note that all locking protocols have to keep their maximum protocol
version in lock-step.

With the protocol structure set in ocfs2_cluster_connect(), there is no
need for the stackglue to have a static pointer to a specific protocol
structure.  We can change initialization to only pass in the maximum
protocol version.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:17 -08:00
Joel Becker
e603cfb074 ocfs2: Remove the ast pointers from ocfs2_stack_plugins
With the full ocfs2_locking_protocol hanging off of the
ocfs2_cluster_connection, ast wrappers can get the ast/bast pointers
there.  They don't need to get them from their plugin structure.

The user plugin still needs the maximum locking protocol version,
though.  This changes the plugin structure so that it only holds the max
version, not the entire ocfs2_locking_protocol pointer.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:16 -08:00
Joel Becker
110946c8fb ocfs2: Hang the locking proto on the cluster conn and use it in asts.
With the ocfs2_cluster_connection hanging off of the ocfs2_dlm_lksb, we
have access to it in the ast and bast wrapper functions.  Attach the
ocfs2_locking_protocol to the conn.

Now, instead of refering to a static variable for ast/bast pointers, the
wrappers can look at the connection.  This means different connections
can have different ast/bast pointers, and it reduces the need for the
static pointer.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:16 -08:00
Joel Becker
c0e4133851 ocfs2: Attach the connection to the lksb
We're going to want it in the ast functions, so we convert union
ocfs2_dlm_lksb to struct ocfs2_dlm_lksb and let it carry the connection.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Joel Becker
a796d2862a ocfs2: Pass lksbs back from stackglue ast/bast functions.
The stackglue ast and bast functions tried to maintain the fiction that
their arguments were void pointers.  In reality, stack_user.c had to
know that the argument was an ocfs2_lock_res in order to get the status
off of the lksb.  That's ugly.

This changes stackglue to always pass the lksb as the argument to ast
and bast functions.  The caller can always use container_of() to get the
ocfs2_lock_res or user_dlm_lock_res.  The net effect to the caller is
zero.  They still get back the lockres in their ast.  stackglue gets
cleaner, and now can use the lksb itself.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Joel Becker
34a9dd7e29 ocfs2_dlmfs: Move to its own directory
We're going to remove the tie between ocfs2_dlmfs and o2dlm.
ocfs2_dlmfs doesn't belong in the fs/ocfs2/dlm directory anymore.  Here
we move it to fs/ocfs2/dlmfs.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Joel Becker
65b6f34034 ocfs2_dlmfs: Use poll() to signify BASTs.
o2dlm's userspace filesystem is an easy way to use the DLM from
userspace.  It is intentionally simple. For example, it does not allow
for asynchronous behavior or lock conversion.  This is intentional to
keep the interface simple.

Because there is no asynchronous notification, there is no way for a
process holding a lock to know another node needs the lock.  This is the
number one complaint of ocfs2_dlmfs users.  Turns out, we can solve this
very easily.  We add poll() support to ocfs2_dlmfs.  When a BAST is
received, the lock's file descriptor will receive POLLIN.

This is trivial to implement.  Userdlm already has an appropriate
waitqueue, and the lock knows when it is blocked.

We add the "bast" capability to tell userspace this is available.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Joel Becker
14a437c2b6 ocfs2_dlmfs: Add capabilities parameter.
Over time, dlmfs has added some features that were not part of the
initial ABI.  Unfortunately, some of these features are not detectable
via standard usage.  For example, Linux's default poll always returns
POLLIN, so there is no way for a caller of poll(2) to know when dlmfs
added poll support.  Instead, we provide this list of new capabilities.

Capabilities is a read-only attribute.  We do it as a module parameter
so we can discover it whether dlmfs is built in, loaded, or even not
loaded (via modinfo).

The ABI features are local to this machine's dlmfs mount.  This is
distinct from the locking protocol, which is concerned with inter-node
interaction.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
399ff3a748 ocfs2: Handle errors while setting external xattr values.
ocfs2 can store extended attribute values as large as a single file.  It
does this using a standard ocfs2 btree for the large value.  However,
the previous code did not handle all error cases cleanly.

There are multiple problems to have.

1) We have trouble allocating space for a new xattr.  This leaves us
   with an empty xattr.
2) We overwrote an existing local xattr with a value root, and now we
   have an error allocating the storage.  This leaves us an empty xattr.
   where there used to be a value.  The value is lost.
3) We have trouble truncating a reused value.  This leaves us with the
   original entry pointing to the truncated original value.  The value
   is lost.
4) We have trouble extending the storage on a reused value.  This leaves
   us with the original value safely in place, but with more storage
   allocated when needed.

This doesn't consider storing local xattrs (values that don't require a
btree).  Those only fail when the journal fails.

Case (1) is easy.  We just remove the xattr we added.  We leak the
storage because we can't safely remove it, but otherwise everything is
happy.  We'll print a warning about the leak.

Case (4) is easy.  We still have the original value in place.  We can
just leave the extra storage attached to this xattr.  We return the
error, but the old value is untouched.  We print a warning about the
storage.

Case (2) and (3) are hard because we've lost the original values.  In
the old code, we ended up with values that could be partially read.
That's not good.  Instead, we just wipe the xattr entry and leak the
storage.  It stinks that the original value is lost, but now there isn't
a partial value to be read.  We'll print a big fat warning.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
139ffacebf ocfs2: Set inline xattr entries with ocfs2_xa_set()
ocfs2_xattr_ibody_set() is the only remaining user of
ocfs2_xattr_set_entry().  ocfs2_xattr_set_entry() actually does two
things: it calls ocfs2_xa_set(), and it initializes the inline xattrs.
Initializing the inline space really belongs in its own call.

We lift the initialization to ocfs2_xattr_ibody_init(), called from
ocfs2_xattr_ibody_set() only when necessary.  Now
ocfs2_xattr_ibody_set() can call ocfs2_xa_set() directly.
ocfs2_xattr_set_entry() goes away.

Another nice fact is that ocfs2_init_dinode_xa_loc() can trust
i_xattr_inline_size.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
d3981544d7 ocfs2: Set xattr block entries with ocfs2_xa_set()
ocfs2_xattr_block_set() calls into ocfs2_xattr_set_entry() with just the
HAS_XATTR flag.  Most of the machinery of ocfs2_xattr_set_entry() is
skipped.  All that really happens other than the call to ocfs2_xa_set()
is making sure the HAS_XATTR flag is set on the inode.

But HAS_XATTR should be set when we also set di->i_xattr_loc.  And
that's done in ocfs2_create_xattr_block().  So let's move it there, and
then ocfs2_xattr_block_set() can just call ocfs2_xa_set().

While we're there, ocfs2_create_xattr_block() can take the set_ctxt for
a smaller argument list.  It also learns to set HAS_XATTR_FL, because it
knows for sure.  ocfs2_create_empty_xatttr_block() in the reflink path
fakes a set_ctxt to call ocfs2_create_xattr_block().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
c5d95df5f7 ocfs2: Let ocfs2_xa_prepare_entry() do space checks.
ocfs2_xattr_set_in_bucket() doesn't need to do its own hacky space
checking.  Let's let ocfs2_xa_prepare_entry() (via ocfs2_xa_set()) do
the more accurate work.  Whenever it doesn't have space,
ocfs2_xattr_set_in_bucket() can try to get more space.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:12 -08:00
Joel Becker
bca5e9bd1e ocfs2: Gell into ocfs2_xa_set()
ocfs2_xa_set() wraps the ocfs2_xa_prepare_entry()/ocfs2_xa_store_value()
logic.  Both callers can now use the same routine.  ocfs2_xa_remove()
moves directly into ocfs2_xa_set().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
73857ee0b5 ocfs2: Allocation in ocfs2_xa_prepare_entry(), values in ocfs2_xa_store_value()
ocfs2_xa_prepare_entry() gets all the logic to add, remove, or modify
external value trees.  Now, when it exits, the entry is ready to receive
a value of any size.

ocfs2_xa_remove() is added to handle the complete removal of an entry.
It truncates the external value tree before calling
ocfs2_xa_remove_entry().

ocfs2_xa_store_inline_value() becomes ocfs2_xa_store_value().  It can
store any value.

ocfs2_xattr_set_entry() loses all the allocation logic and just uses
these functions.  ocfs2_xattr_set_value_outside() disappears.

ocfs2_xattr_set_in_bucket() uses these functions and makes
ocfs2_xattr_set_entry_in_bucket() obsolete.  That goes away, as does
ocfs2_xattr_bucket_set_value_outside() and
ocfs2_xattr_bucket_value_truncate().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
cf2bc80940 ocfs2: Teach ocfs2_xa_loc how to do its own journal work
We're going to want to make sure our buffers get accessed and dirtied
correctly.  So have the xa_loc do the work.  This includes storing the
inode on ocfs2_xa_loc.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
3fc12afa0c ocfs2: Provide ocfs2_xa_fill_value_buf() for external value processing
We use the ocfs2_xattr_value_buf structure to manage external values.
It lets the value tree code do its work regardless of the containing
storage.  ocfs2_xa_fill_value_buf() initializes a value buf from an
ocfs2_xa_loc entry.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
9dc474005d ocfs2: Handle value tree roots in ocfs2_xa_set_inline_value()
Previously the xattr code would send in a fake value, containing a tree
root, to the function that installed name+value pairs.  Instead, we pass
the real value to ocfs2_xa_set_inline_value(), and it notices that the
value cannot fit.  Thus, it installs a tree root.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:10 -08:00
Joel Becker
69a3e539d0 ocfs2: Set the xattr name+value pair in one place
We create two new functions on ocfs2_xa_loc, ocfs2_xa_prepare_entry()
and ocfs2_xa_store_inline_value().

ocfs2_xa_prepare_entry() makes sure that the xl_entry field of
ocfs2_xa_loc is ready to receive an xattr.  The entry will point to an
appropriately sized name+value region in storage.  If an existing entry
can be reused, it will be.  If no entry already exists, it will be
allocated.  If there isn't space to allocate it, -ENOSPC will be
returned.

ocfs2_xa_store_inline_value() stores the data that goes into the 'value'
part of the name+value pair.  For values that don't fit directly, this
stores the value tree root.

A number of operations are added to ocfs2_xa_loc_operations to support
these functions.  This reflects the disparate behaviors of xattr blocks
and buckets.

With these functions, the overlapping ocfs2_xattr_set_entry_local() and
ocfs2_xattr_set_entry_normal() can be replaced with a single call
scheme.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:10 -08:00
Joel Becker
199799a360 ocfs2: Wrap calculation of name+value pair size.
An ocfs2 xattr entry stores the text name and value as a pair in the
storage area.  Obviously names and values can be variable-sized.  If a
value is too large for the entry storage, a tree root is stored instead.
The name+value pair is also padded.

Because of this, there are a million places in the code that do:

	if (needs_external_tree(value_size)
		namevalue_size = pad(name_size) + tree_root_size;
	else
		namevalue_size = pad(name_size) + pad(value_size);

Let's create some convenience functions to make the code more readable.
There are three forms.  The first takes the raw sizes.  The second takes
an ocfs2_xattr_info structure.  The third takes an existing
ocfs2_xattr_entry.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:10 -08:00
Joel Becker
18853b95d1 ocfs2: Add a name_len field to ocfs2_xattr_info.
Rather than calculating strlen all over the place, let's store the
name length directly on ocfs2_xattr_info.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:09 -08:00
Joel Becker
6b240ff63c ocfs2: Prefix the member fields of struct ocfs2_xattr_info.
struct ocfs2_xattr_info is a useful structure describing an xattr
you'd like to set.  Let's put prefixes on the member fields so it's
easier to read and use.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:09 -08:00
Joel Becker
bde1e5400a ocfs2: Remove xattrs via ocfs2_xa_loc
Add ocfs2_xa_remove_entry(), which will remove an xattr entry from its
storage via the ocfs2_xa_loc descriptor.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:09 -08:00
Joel Becker
11179f2c92 ocfs2: Introduce ocfs2_xa_loc
The ocfs2 extended attribute (xattr) code is very flexible.  It can
store xattrs in the inode itself, in an external block, or in a tree of
data structures.  This allows the number of xattrs to be bounded by the
filesystem size.

However, the code that manages each possible storage location is
different.  Maintaining the ocfs2 xattr code requires changing each hunk
separately.

This patch is the start of a series introducing the ocfs2_xa_loc
structure.  This structure wraps the on-disk details of an xattr
entry.  The goal is that the generic xattr routines can use
ocfs2_xa_loc without knowing the underlying storage location.

This first pass merely implements the basic structure, initializing it,
and wiping the name+value pair of the entry.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:08 -08:00
Sunil Mushran
8545e03d82 ocfs2: Add current->comm in trace output
Add current->comm to the standard mlog() output to help with debugging.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:08 -08:00
Wengang Wang
96a1cc731a ocfs2: Clean up the checks for CoW and direct I/O.
When ocfs2 has to do CoW for refcounted extents, we disable direct I/O
and go through the buffered I/O path.  This makes the combined check
easier to read.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:07 -08:00
Tiger Yang
b89c54282d ocfs2: add extent block stealing for ocfs2 v5
This patch add extent block (metadata) stealing mechanism for
extent allocation. This mechanism is same as the inode stealing.
if no room in slot specific extent_alloc, we will try to
allocate extent block from the next slot.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:07 -08:00
Alex Elder
398007f863 Merge branch 'linux-2.6.33' 2010-02-26 14:34:02 -06:00
David Teigland
b6fa8796b2 dlm: use bastmode in debugfs output
The bast mode that appears in the debugfs output should be
useful on both master and process nodes.  lkb_highbast is
currently printed, and is only useful on the master node.
lkb_bastmode is only useful on the process node.  This
patch sets lkb_bastmode on the master node as well, and
uses that value in the debugfs print.

Signed-off-by: David Teigland <teigland@redhat.com>
2010-02-26 12:15:54 -06:00
Steven Whitehouse
b4a5d4bc37 dlm: Send lockspace name with uevents
Although it is possible to get this information from the path,
its much easier to provide the lockspace as a seperate env
variable.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2010-02-26 12:14:25 -06:00
Sage Weil
080af17e9c ceph: remove bogus mds forward warning
The must_resend flag is always true, not false.  In any case, we can
just ignore it anyway.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-26 10:02:03 -08:00
David Teigland
cf6620acc0 dlm: send reply before bast
When the lock master processes a successful operation (request,
convert, cancel, or unlock), it will process the effects of the
change before sending the reply for the operation.  The "effects"
of the operation are:

- blocking callbacks (basts) for any newly granted locks
- waiting or converting locks that can now be granted

The cast is queued on the local node when the reply from the lock
master is received.  This means that a lock holder can receive a
bast for a lock mode that is doesn't yet know has been granted.

Signed-off-by: David Teigland <teigland@redhat.com>
2010-02-26 11:57:37 -06:00
Sage Weil
c99eb1c726 ceph: remove fragile __map_osds optimization
We used to try to avoid freeing and then reallocating the osd
struct.  This is a bit fragile due to potential interactions with
other references (beyond o_requests), and may be the cause of
this crash:

[120633.442358] BUG: unable to handle kernel NULL pointer dereference at (null)
[120633.443292] IP: [<ffffffff812549b6>] rb_erase+0x11d/0x277
[120633.443292] PGD f7ff3067 PUD f7f53067 PMD 0
[120633.443292] Oops: 0000 [#1] PREEMPT SMP
[120633.443292] last sysfs file: /sys/kernel/uevent_seqnum
[120633.443292] CPU 1
[120633.443292] Modules linked in: ceph fan ac battery psmouse ehci_hcd ide_pci_generic ohci_hcd thermal processor button
[120633.443292] Pid: 3023, comm: ceph-msgr/1 Not tainted 2.6.32-rc2 #12 H8SSL
[120633.443292] RIP: 0010:[<ffffffff812549b6>]  [<ffffffff812549b6>] rb_erase+0x11d/0x277
[120633.443292] RSP: 0018:ffff8800f7b13a50  EFLAGS: 00010246
[120633.443292] RAX: ffff880022907819 RBX: ffff880022907818 RCX: 0000000000000000
[120633.443292] RDX: ffff8800f7b13a80 RSI: ffff8800f587eb48 RDI: 0000000000000000
[120633.443292] RBP: ffff8800f7b13a60 R08: 0000000000000000 R09: 0000000000000004
[120633.443292] R10: 0000000000000000 R11: ffff8800c4441000 R12: ffff8800f587eb48
[120633.443292] R13: ffff8800f58eaa00 R14: ffff8800f413c000 R15: 0000000000000001
[120633.443292] FS:  00007fbef6e226e0(0000) GS:ffff880009200000(0000) knlGS:0000000000000000
[120633.443292] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[120633.443292] CR2: 0000000000000000 CR3: 00000000f7c53000 CR4: 00000000000006e0
[120633.443292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[120633.443292] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[120633.443292] Process ceph-msgr/1 (pid: 3023, threadinfo ffff8800f7b12000, task ffff8800f5858b40)
[120633.443292] Stack:
[120633.443292]  ffff8800f413c000 ffff8800f587e9c0 ffff8800f7b13a80 ffffffffa0098a86
[120633.443292] <0> 00000000000006f1 0000000000000000 ffff8800f7b13af0 ffffffffa009959b
[120633.443292] <0> ffff8800f413c000 ffff880022a68400 ffff880022a68400 ffff8800f587e9c0
[120633.443292] Call Trace:
[120633.443292]  [<ffffffffa0098a86>] __remove_osd+0x4d/0xbc [ceph]
[120633.443292]  [<ffffffffa009959b>] __map_osds+0x199/0x4fa [ceph]
[120633.443292]  [<ffffffffa00999f4>] ? __send_request+0xf8/0x186 [ceph]
[120633.443292]  [<ffffffffa0099beb>] kick_requests+0x169/0x3cb [ceph]
[120633.443292]  [<ffffffffa009a8c1>] ceph_osdc_handle_map+0x370/0x522 [ceph]

Since we're probably screwed anyway if a small kmalloc is
failing, don't bother with trying to be clever here.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-26 09:37:33 -08:00
Martin K. Petersen
8a78362c4e block: Consolidate phys_segment and hw_segment limits
Except for SCSI no device drivers distinguish between physical and
hardware segment limits.  Consolidate the two into a single segment
limit.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-26 13:58:08 +01:00
Linus Torvalds
6ebdc661b6 Merge branch 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (41 commits)
  of: remove undefined request_OF_resource & release_OF_resource
  of/sparc: Remove sparc-local declaration of allnodes and devtree_lock
  of: move definition of of_chosen into common code.
  of: remove unused extern reference to devtree_lock
  of: put default string compare and #a/s-cell values into common header
  of/flattree: Don't assume HAVE_LMB
  of: protect linux/of.h with CONFIG_OF
  proc_devtree: fix THIS_MODULE without module.h
  of: Remove old and misplaced function declarations
  of/flattree: Make the kernel accept ePAPR style phandle information
  of/flattree: endian-convert members of boot_param_header
  of: assume big-endian properties, adding conversions where necessary
  of: use __be32 for cell value accessors
  of/flattree: use OF_ROOT_NODE_{SIZE,ADDR}_CELLS DEFAULT for fdt parsing
  of/flattree: use callback to setup initrd from /chosen
  proc_devtree: include linux/of.h
  of: make set_node_proc_entry private to proc_devtree.c
  of: include linux/proc_fs.h
  of/flattree: merge early_init_dt_scan_memory() common code
  of: add 'of_' prefix to machine_is_compatible()
  ...
2010-02-25 15:38:37 -08:00
Sage Weil
e80a52d14f ceph: fix connection fault STANDBY check
Move any out_sent messages to out_queue _before_ checking if
out_queue is empty and going to STANDBY, or else we may drop
something that was never acked.

And clean up the code a bit (less goto).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-25 12:40:45 -08:00
Sage Weil
161fd65ac9 ceph: invalidate_authorizer without con->mutex held
This fixes lock ABBA inversion, as the ->invalidate_authorizer()
op may need to take a lock (or even call back into the
messenger).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-25 12:38:57 -08:00
Paul E. McKenney
7dc5215798 vfs: Apply lockdep-based checking to rcu_dereference() uses
Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
and fcheck_files().

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:48 +01:00
Jens Axboe
7f03292ee1 Merge branch 'master' into for-2.6.34
Conflicts:
	include/linux/blkdev.h

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-25 08:48:05 +01:00
Steve French
d7b619cf56 [CIFS] pSesInfo->sesSem is used as mutex. Rename it to session_mutex and
convert it to a real mutex.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-25 05:36:46 +00:00
Chuck Lever
58255a4e3c NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN
The server's callback client should stop trying to connect to the
client's callback server as soon as it gets ECONNREFUSED.

The NFS server's callback client does not call rpc_ping(), but appears
to have it's own "ping" procedure, so it wasn't covered by commit
caabea8a.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-02-24 17:50:28 -08:00
Steve French
122ca0076e [CIFS] Use unsigned ea length for clarity
Jeff correctly noted that using unsigned ea length is more intuitive.
CC: Jeff Lyaton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-24 21:56:48 +00:00
David Teigland
7fe2b3190b dlm: fix ordering of bast and cast
When both blocking and completion callbacks are queued for lock,
the dlm would always deliver the completion callback (cast) first.
In some cases the blocking callback (bast) is queued before the
cast, though, and should be delivered first.  This patch keeps
track of the order in which they were queued and delivers them
in that order.

This patch also keeps track of the granted mode in the last cast
and eliminates the following bast if the bast mode is compatible
with the preceding cast mode.  This happens when a remotely mastered
lock is demoted, e.g. EX->NL, in which case the local node queues
a cast immediately after sending the demote message.  In this way
a cast can be queued for a mode, e.g. NL, that makes an in-transit
bast extraneous.

Signed-off-by: David Teigland <teigland@redhat.com>
2010-02-24 11:46:53 -06:00
dingdinghua
23e2af3518 jbd2: clean up an assertion in jbd2_journal_commit_transaction()
commit_transaction has the same value as journal->j_running_transaction,
so we can simplify the assert statement.

Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 12:11:20 -05:00
Dmitry Monakhov
56c50f11f4 ext4: trivial quota cleanup
The patch is aimed to reorganize and simplify quota code a bit.
Quota code is itself complex enough, but we can make it more readable
in some places:
- Move quota option parsing to separate functions.
- Simplify old-quota and journaled-quota mix check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:28:41 -05:00
Dmitry Monakhov
482a74258f ext4: mount flags manipulation cleanup
Replace intermediate EXT4_MOUNT_XXX flags manipulation to
corresponding macro.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 11:35:32 -05:00
Jiaying Zhang
c8d46e41bc ext4: Add flag to files with blocks intentionally past EOF
fallocate() may potentially instantiate blocks past EOF, depending
on the flags used when it is called.

e2fsck currently has a test for blocks past i_size, and it
sometimes trips up - noticeably on xfstests 013 which runs fsstress.

This patch from Jiayang does fix it up - it (along with
e2fsprogs updates and other patches recently from Aneesh) has
survived many fsstress runs in a row.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 09:52:53 -05:00
Jiri Kosina
7c821a179f Remove fs/ntfs/ChangeLog
Remove fs/ntfs/ChangeLog. No need for such files since we have git.

Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-24 13:55:31 +01:00
Jeff Layton
835a36ca4a cifs: set server_eof in cifs_fattr_to_inode
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 23:28:43 +00:00
Yehuda Sadeh
88d892a37f ceph: don't clobber write return value when using O_SYNC
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-23 14:26:36 -08:00
Sage Weil
a1ea787c7b ceph: fix client_request_forward decoding
The tid is in the message header, not body.  Broken since 6df058c0.

No need to look at next mds session; just mark the request and be done.
(The old error path was broken too, but now it's gone.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-23 14:26:36 -08:00
Sage Weil
2600d2dd50 ceph: drop messages on unregistered mds sessions; cleanup
Verify the mds session is currently registered before handling
incoming messages.  Clean up message handlers to pull mds out
of session->s_mds instead of less trustworthy src field.

Clean up con_{get,put} debug output.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-23 14:26:35 -08:00
Sage Weil
a6369741c4 ceph: fix comments, locking in destroy_inode
The destroy_inode path needs no inode locks since there are no
inode references.  Update __ceph_remove_cap comment to reflect
that it is called without cap->session->s_mutex in this case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-23 14:26:35 -08:00
Alexander Beregalov
4ce1e9adab ceph: move dereference after NULL test
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-23 14:26:34 -08:00
Sage Weil
5b3a4db3e4 ceph: fix up unexpected message handling
Fix skipping of unexpected message types from osd, mon.

Clean up pr_info and debug output.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-23 14:26:29 -08:00
Steve French
96c03bccc7 [CIFS] Minor cleanup to EA patch
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:51:43 +00:00
Jeff Layton
31c0519f7a cifs: merge CIFSSMBQueryEA with CIFSSMBQAllEAs
Add an "ea_name" parameter to CIFSSMBQAllEAs. When it's set make it
behave like CIFSSMBQueryEA does now. The current callers of
CIFSSMBQueryEA are converted to use CIFSSMBQAllEAs, and the old
CIFSSMBQueryEA function is removed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:47:32 +00:00
Jeff Layton
0cd126b504 cifs: verify lengths of QueryAllEAs reply
Make sure the lengths in a QUERY_ALL_EAS reply don't make the parser walk
off the end of the SMB.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:47:11 +00:00
Jeff Layton
e529614ad0 cifs: increase maximum buffer size in CIFSSMBQAllEAs
It's 4000 now, but there's no reason to limit it to that. We should be
able to handle a response up to CIFSMaxBufSize.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:46:49 +00:00
Jeff Layton
6e462b9f2c cifs: rename name_len to list_len in CIFSSMBQAllEAs
...for clarity and so we can reuse the name for the real name_len.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:46:27 +00:00
Jeff Layton
f0d3868b78 cifs: clean up indentation in CIFSSMBQAllEAs
Add a label that we can goto on error, and reduce some of the
if/then/else indentation in this function.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:45:52 +00:00
Jeff Layton
370b41911c cifs: add parens around smb_var in BCC macros
...to remove ambiguity about how these values are interpreted when
passing in more complex values as arguments.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:45:21 +00:00
Michael Neuling
a17e18790a fs/exec.c: fix initial stack reservation
803bf5ec25 ("fs/exec.c: restrict initial
stack space expansion to rlimit") attempts to limit the initial stack to
20*PAGE_SIZE.  Unfortunately, in attempting ensure the stack is not
reduced in size, we ended up not changing the stack at all.

This size reduction check is not necessary as the expand_stack call does
this already.

This caused a regression in UML resulting in most guest processes being
killed.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jouni Malinen <j@w1.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-22 19:50:34 -08:00
stephen hemminger
1cc523271e seq_file: add RCU versions of new hlist/list iterators (v3)
Many usages of seq_file use RCU protected lists, so non RCU
iterators will not work safely.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-22 15:45:54 -08:00
Jens Axboe
f11cbd74c5 Merge branch 'master' into for-2.6.34 2010-02-22 13:48:51 +01:00
Ben Myers
978ebd97d1 xfs_export_operations.commit_metadata
This is the commit_metadata export operation for XFS.

- Takes one inode to be committed.

- Forces the log up to the lsn of the inode.

- Doesn't force the log if the inode doesn't have a pincount.

Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
[bfields@citi.umich.edu: trivial whitespace fix]
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-02-20 13:14:50 -08:00
Ben Myers
f501912a35 commit_metadata export operation replacing nfsd_sync_dir
- Add commit_metadata export_operation to allow the underlying filesystem to
decide how to commit an inode most efficiently.

- Usage of nfsd_sync_dir and write_inode_now has been replaced with the
commit_metadata function that takes a svc_fh.

- The commit_metadata function calls the commit_metadata export_op if it's
there, or else falls back to sync_inode instead of fsync and write_inode_now
because only metadata need be synced here.

- nfsd4_sync_rec_dir now uses vfs_fsync so that commit_metadata can be static

Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-02-20 13:13:44 -08:00
David Howells
8f9941aecc CacheFiles: Fix a race in cachefiles_delete_object() vs rename
cachefiles_delete_object() can race with rename.  It gets the parent directory
of the object it's asked to delete, then locks it - but rename may have changed
the object's parent between the get and the completion of the lock.

However, if such a circumstance is detected, we abandon our attempt to delete
the object - since it's no longer in the index key path, it won't be seen
again by lookups of that key.  The assumption is that cachefilesd may have
culled it by renaming it to the graveyard for later destruction.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-20 10:06:35 -05:00
Jiro SEKIBA
0d561f12b4 nilfs2: add reader's lock for cno in nilfs_ioctl_sync
This adds reader's lock for the_nilfs->cno in nilfs_ioctl_sync,
for the_nilfs->cno should be proctected by segctor_sem when reading.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-20 21:18:19 +09:00
Chuck Ebbert
aeaa5ccd64 vfs: don't call ima_file_check() unconditionally in nfsd_open()
commit 1e41568d73 ("Take ima_path_check()
in nfsd past dentry_open() in nfsd_open()") moved this code back to its
original location but missed the "else".

Signed-off-by: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-20 00:47:31 -05:00
Yehuda Sadeh
bcd2cbd10c ceph: cleanup redundant code in handle_cap_grant
There is no state in local vars that requires us to loop after temporarily
dropping i_lock.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-19 14:41:10 -08:00
Yehuda Sadeh
c9af9fb68e ceph: don't truncate dirty pages in invalidate work thread
Instead of truncating the whole range of pages, we skip those
pages that are dirty or in the middle of writeback. Those pages
will be cleared later when the writeback completes.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-19 14:40:51 -08:00
Yehuda Sadeh
e63dc5c780 ceph: remove page upon writeback completion if lost cache cap
This page should have been removed earlier when the cache cap was
revoked, but a writeback was in flight, so it was skipped. We truncate
it here just as the writeback finishes, while it's still locked.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-19 14:34:18 -08:00
Sage Weil
5ecad6fd7b ceph: fix check for invalidate_mapping_pages success
We need to know whether there was any page left behind, and not the
return value (the total number of pages invalidated).  Look at the mapping
to see if we were successful or not.

Move it all into a helper to simplify the two callers.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-19 14:33:18 -08:00
Al Viro
7fee4868be Switch proc/self to nd_set_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-19 10:25:41 -05:00
Al Viro
ac278a9c50 fix LOOKUP_FOLLOW on automount "symlinks"
Make sure that automount "symlinks" are followed regardless of LOOKUP_FOLLOW;
it should have no effect on them.

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-19 03:56:42 -05:00
Al Viro
c44dcc56d2 switch inotify_user to anon_inode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-19 03:35:12 -05:00
Jiro SEKIBA
03f29365e8 nilfs2: delete unnecessary condition in load_segment_summary
This is a trivial patch to remove unnecessary condition.

load_segment_summary() checks crc of segment_summary OR crc of whole
log data blocks based on boolean argument full_check.  However,
callers of the function pass only 1 as full_check, which means only
whole log data blocks checking code is running all the time.

This patch deletes the condition and full_check argument and also
deletes enum 'NILFS_SEG_FAIL_CHECKSUM_SEGSUM' and corresponding case
clause, for it is nolonger used anymore.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-18 20:09:03 +09:00
Sage Weil
2c27c9a57c ceph: fix typo in ceph_queue_writeback debug output
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17 15:45:51 -08:00
Sage Weil
a17d6473cc ceph: v0.19 release
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17 13:56:07 -08:00
Sage Weil
4fc51be8fa ceph: use rbtree for pg pools; decode new osdmap format
Since we can now create and destroy pg pools, the pool ids will be sparse,
and an array no longer makes sense for looking up by pool id.  Use an
rbtree instead.

The OSDMap encoding also no longer has a max pool count (previously used to
allocate the array).  There is a new pool_max, that is the largest pool id
we've ever used, although we don't actually need it in the client.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17 10:02:49 -08:00
Sage Weil
9794b146fa ceph: fix memory leak when destroying osdmap with pg_temp mappings
Also move _lookup_pg_mapping into a helper.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17 10:02:48 -08:00
Sage Weil
7c1332b8cb ceph: fix iterate_caps removal race
We need to be able to iterate over all caps on a session with a
possibly slow callback on each cap.  To allow this, we used to
prevent cap reordering while we were iterating.  However, we were
not safe from races with removal: removing the 'next' cap would
make the next pointer from list_for_each_entry_safe be invalid,
and cause a lock up or similar badness.

Instead, we keep an iterator pointer in the session pointing to
the current cap.  As before, we avoid reordering.  For removal,
if the cap isn't the current cap we are iterating over, we are
fine.  If it is, we clear cap->ci (to mark the cap as pending
removal) but leave it in the session list.  In iterate_caps, we
can safely finish removal and get the next cap pointer.

While we're at it, clean up put_cap to not take a cap reservation
context, as it was never used.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17 10:02:47 -08:00
Sage Weil
85ccce43a3 ceph: clean up readdir caps reservation
Use a global counter for the minimum number of allocated caps instead of
hard coding a check against readdir_max.  This takes into account multiple
client instances, and avoids examining the superblock mount options when a
cap is dropped.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17 10:02:43 -08:00
David S. Miller
2bb4646fce Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-02-16 22:09:29 -08:00
Sage Weil
5ce6e9dbe6 ceph: fix authentication races, auth_none oops
Call __validate_auth() under monc->mutex, and use helper for
initial hello so that the pending_auth flag is set.  This fixes
possible races in which we have an authentication request (hello
or otherwise) pending and send another one.  In particular, with
auth_none, we _never_ want to call ceph_build_auth() from
__validate_auth(), since the ->build_request() method is NULL.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 22:01:11 -08:00
Sage Weil
85ff03f6bf ceph: use rbtree for mon statfs requests
An rbtree is lighter weight, particularly given we will generally have
very few in-flight statfs requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 22:01:10 -08:00
Sage Weil
a105f00cf1 ceph: use rbtree for snap_realms
Switch from radix tree to rbtree for snap realms.  This is much more
appropriate given that realm keys are few and far between.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 22:01:09 -08:00
Sage Weil
44ca18f268 ceph: use rbtree for mds requests
The rbtree is a more appropriate data structure than a radix_tree.  It
avoids extra memory usage and simplifies the code.

It also fixes a bug where the debugfs 'mdsc' file wasn't including the
most recent mds request.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 22:01:08 -08:00
Sage Weil
91e45ce389 ceph: cancel delayed work when closing connection
This ensures that if/when we reopen the connection, we can requeue work on
the connection immediately, without waiting for an old timer to expire.
Queue new delayed work inside con->mutex to avoid any race.

This fixes problems with clients failing to reconnect to the MDS due to
the client_reconnect message arriving too late (due to waiting for an old
delayed work timeout to expire).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 22:01:07 -08:00
Sage Weil
e2663ab60d ceph: allow connection to be reopened by fault callback
Fix the messenger to allow a ceph_con_open() during the fault callback.
Previously the work wasn't getting queued on the connection because the
fault path avoids requeued work (normally spurious).  Loop on reopening by
checking for the OPENING state bit.

This fixes OSD reconnects when a TCP connection drops.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 22:01:03 -08:00
Tejun Heo
003cb608a2 percpu: add __percpu sparse annotations to fs
Add __percpu sparse annotations to fs.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-02-17 11:17:38 +09:00
Eric W. Biederman
7c0ff870d1 sysfs: sysfs_sd_setattr set iattrs unconditionally
There is currently a bug in sysfs_sd_setattr inherited from
sysfs_setattr in 2.6.32 where the first time we set the attributes
on a sysfs file we allocate backing store but do not set the
backing store attributes.  Resulting in overly restrictive
permissions on sysfs files.

The fix is to simply modify the code so that it always executes
when we update the sysfs attributes, as we did in 2.6.31 and earlier.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Jean Delvare <khali@linux-fr.org>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-16 15:42:42 -08:00
Curt Wohlgemuth
73b50c1c92 ext4: Fix BUG_ON at fs/buffer.c:652 in no journal mode
Calls to ext4_handle_dirty_metadata should only pass in an inode
pointer for inode-specific metadata, and not for shared metadata
blocks such as inode table blocks, block group descriptors, the
superblock, etc.

The BUG_ON can get tripped when updating a special device (such as a
block device) that is opened (so that i_mapping is set in
fs/block_dev.c) and the file system is mounted in no journal mode.

Addresses-Google-Bug: #2404870

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-16 15:06:29 -05:00
Linus Torvalds
0813e22d4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: btrfs_mark_extent_written uses the wrong slot
2010-02-15 19:56:21 -08:00
Chuck Lever
65d269538a NFS: Too many GETATTR and ACCESS calls after direct I/O
The cached read and write paths initialize fattr->time_start in their
setup procedures.  The value of fattr->time_start is propagated to
read_cache_jiffies by nfs_update_inode().  Subsequent calls to
nfs_attribute_timeout() will then use a good time stamp when
computing the attribute cache timeout, and squelch unneeded GETATTR
calls.

Since the direct I/O paths erroneously leave the inode's
fattr->time_start field set to zero, read_cache_jiffies for that inode
is set to zero after any direct read or write operation.  This
triggers an otw GETATTR or ACCESS call to update the file's attribute
and access caches properly, even when the NFS READ or WRITE replies
have usable post-op attributes.

Make sure the direct read and write setup code performs the same fattr
initialization as the cached I/O paths to prevent unnecessary GETATTR
calls.

This was likely introduced by commit 0e574af1 in 2.6.15, which appears
to add new nfs_fattr_init() call sites in the cached read and write
paths, but not in the equivalent places in fs/nfs/direct.c.  A
subsequent commit in the same series, 33801147, introduces the
fattr->time_start field.

Interestingly, the direct write reschedule path already has a call to
nfs_fattr_init() in the right place.

Reported-by: Quentin Barnes <qbarnes@yahoo-inc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-15 19:53:43 -08:00
Linus Torvalds
0aa2ca9ae1 Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  reiserfs: Fix softlockup while waiting on an inode
2010-02-15 19:51:45 -08:00
dingdinghua
ba869023ea jbd2: delay discarding buffers in journal_unmap_buffer
Delay discarding buffers in journal_unmap_buffer until
we know that "add to orphan" operation has definitely been
committed, otherwise the log space of committing transation
may be freed and reused before truncate get committed, updates
may get lost if crash happens.

Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 16:35:42 -05:00
Leonard Michlmayr
aca92ff6f5 ext4: correctly calculate number of blocks for fiemap
ext4_fiemap() rounds the length of the requested range down to
blocksize, which is is not the true number of blocks that cover the
requested region.  This problem is especially impressive if the user
requests only the first byte of a file: not a single extent will be
reported.

We fix this by calculating the last block of the region and then
subtract to find the number of blocks in the extents.

Signed-off-by: Leonard Michlmayr <leonard.michlmayr@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 17:07:28 -05:00
Sage Weil
153a008bf7 ceph: reset osd connections after fault
A single osd connection fault (e.g. tcp disconnect) wasn't
reopening the connection, which causes all current and future
requests for that osd to hang.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-15 12:11:51 -08:00
Roel Kluin
9aaab0589b ext4: add missing error checking to ext4_expand_extra_isize_ea()
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
2010-02-15 14:26:16 -05:00
Eric Sandeen
12062dddda ext4: move __func__ into a macro for ext4_warning, ext4_error
Just a pet peeve of mine; we had a mishash of calls with either __func__
or "function_name" and the latter tends to get out of sync.

I think it's easier to just hide the __func__ in a macro, and it'll
be consistent from then on.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 14:19:27 -05:00
Frederic Weisbecker
175359f89d reiserfs: Fix softlockup while waiting on an inode
When we wait for an inode through reiserfs_iget(), we hold
the reiserfs lock. And waiting for an inode may imply waiting
for its writeback. But the inode writeback path may also require
the reiserfs lock, which leads to a deadlock.

We just need to release the reiserfs lock from reiserfs_iget()
to fix this.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Cc: Chris Mason <chris.mason@oracle.com>
2010-02-14 19:07:56 +01:00
Jeremy Kerr
7c540d9e3d proc_devtree: fix THIS_MODULE without module.h
Commit e22f628395 introduced a build
breakage for ARM devtree work: the THIS_MODULE macro was added, but we
don't have module.h

This change adds the necessary #include to get THIS_MODULE defined.
While we could just replace it with NULL (PROC_FS is a bool, not a
tristate), using THIS_MODULE will prevent unexpected breakage if we
ever do compile this as a module.

Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Michal Simek <monstr@monstr.eu>
2010-02-14 07:13:41 -07:00
Sage Weil
6c5d1a49e5 ceph: fix msgr to keep sent messages until acked
The test was backwards from commit b3d1dbbd: keep the message if the
connection _isn't_ lossy.  This allows the client to continue when the
TCP connection drops for some reason (network glitch) but both ends
survive.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-13 20:29:31 -08:00
Julia Lawall
d67b1b0325 fs/xfs: Correct NULL test
Test the value that was just allocated rather than the previously tested one.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r@
expression *x;
expression e;
identifier l;
@@

if (x == NULL || ...) {
    ... when forall
    return ...; }
... when != goto l;
    when != x = e
    when != &x
*x == NULL
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-02-13 13:22:53 -06:00
Ryusuke Konishi
d1c6b72a72 nilfs2: move iterator to write log into segment buffer
This moves iterator to submit write requests for a series of logs into
segbuf.c, and hides nilfs_segbuf_write() and nilfs_segbuf_wait() in
the file.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:03 +09:00
Ryusuke Konishi
e605f0a724 nilfs2: get rid of s_dirt flag use
This replaces s_dirt flag use in nilfs with a new flag added on the
nilfs object.  The s_dirt flag was used to indicate if
sop->write_super() should be called, however the current version of
nilfs does not use the callback.  Thus, it can be replaced with the
own flag.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jiro SEKIBA <jir@unicus.jp>
2010-02-13 12:26:03 +09:00
Ryusuke Konishi
dcd7618695 nilfs2: get rid of nilfs_segctor_req struct
This will clean up nilfs_segctor_req struct and the obscure request
argument passed among private methods of segment constructor.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:03 +09:00
Jiro SEKIBA
086d1764b2 nilfs2: delete unnecessary condition in nilfs_dat_translate
This is a trivial patch to delete unnecessary condition in nilfs_dat_translate.

nilfs_dat_translate() will asign translated address to *blocknrp if blocknrp
is not NULL.  However the condition is unneeded, because all callers of
nilfs_dat_translate() pass blocknrp properly.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:03 +09:00
Ryusuke Konishi
fe5f171bb2 nilfs2: fix potential hang in nilfs_error on errors=remount-ro
nilfs_error() calls nilfs_detach_segment_constructor() if
errors=remount-ro option is specified, and this may lead to a hang due
to recursive locking of, for instance, nilfs->ns_segctor_sem and
others.

In this case, detaching segment constructor is not necessary because
read-only flag is set to the filesystem and further writes are
blocked.

This fixes the potential hang issue by removing the
nilfs_detach_segment_constructor() call from nilfs_error.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:03 +09:00
Ryusuke Konishi
7512487e6d nilfs2: use mnt_want_write in ioctls where write access is needed
A few nilfs2 ioctls need to ask for and then later release write
access to the mount in order to avoid potential write to read-only
mounts.

This adds the missing mnt_want_write and mnt_drop_write in
nilfs_ioctl_change_cpmode, nilfs_ioctl_delete_checkpoint, and
nilfs_ioctl_clean_segments.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:02 +09:00
Jiro SEKIBA
e902ec9906 nilfs2: issue discard request after cleaning segments
This adds a function to send discard requests for given array of
segment numbers, and calls the function when garbage collection
succeeded.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:02 +09:00
Shaohua Li
3f6fae9559 Btrfs: btrfs_mark_extent_written uses the wrong slot
My test do: fallocate a big file and do write. The file is 512M, but
after file write is done btrfs-debug-tree shows:
item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53
                extent data disk byte 1103101952 nr 536870912
                extent data offset 0 nr 399634432 ram 536870912
                extent compression 0
Looks like a regression introducted by
6c7d54ac87, where we set wrong slot.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-12 16:47:19 -05:00
Christoph Hellwig
180040b89e xfs: optimize log flushing in xfs_fsync
If we have a pinned inode it must have a log item attached to it.
Usually that log item will have ili_last_lsn already set, in which
case we only need to flush the log up to that LSN instead of doing a
full log force.  This gives speedups of about 5% in some fsync heavy
workloads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-02-12 13:45:14 -06:00
Christoph Hellwig
87185517de xfs: only clear the suid bit once in xfs_write
file_remove_suid already calls into ->setattr to clear the suid and
sgid bits if needed, no need to start a second transaction to do it
ourselves.

Note that xfs_write_clear_setuid issues a sync transaction while the
path through ->setattr doesn't, but that is consistant with the
other filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-02-12 13:43:57 -06:00
Steven Whitehouse
07ccb7bf2c GFS2: Fix bmap allocation corner-case bug
This patch solves a corner case during allocation which occurs if both
metadata (indirect) and data blocks are required but there is an
obstacle in the filesystem (e.g. a resource group header or another
allocated block) such that when the allocation is requested only
enough blocks for the metadata are returned.

By changing the exit condition of this loop, we ensure that a
minimum of one data block will always be returned.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-12 10:16:14 +00:00
Abhijith Das
0e5a9fb042 GFS2: Fix error code
We need this one-liner to signal the mount helper of the 'insufficient journals' condition.

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-12 10:15:51 +00:00
Linus Torvalds
efa82bab8e Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix the mapping of the NFSERR_SERVERFAULT error
  NFS: Remove a redundant check for PageFsCache in nfs_migrate_page()
  NFS: Fix a bug in nfs_fscache_release_page()
2010-02-11 14:06:28 -08:00
Linus Torvalds
fd48d6c888 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
  [SCSI] qla2xxx: Obtain proper host structure during response-queue processing.
  [SCSI] compat_ioct: fix bsg SG_IO
  [SCSI] qla2xxx: make msix interrupt handler safe for irq
  [SCSI] zfcp: Report FC BSG errors in correct field
  [SCSI] mptfusion : mptscsih_abort return value should be SUCCESS instead of value 0.
2010-02-11 14:05:55 -08:00
Michael Neuling
803bf5ec25 fs/exec.c: restrict initial stack space expansion to rlimit
When reserving stack space for a new process, make sure we're not
attempting to expand the stack by more than rlimit allows.

This fixes a bug caused by b6a2fea393 ("mm:
variable length argument support") and unmasked by
fc63cf2370 ("exec: setup_arg_pages() fails
to return errors").

This bug means that when limiting the stack to less the 20*PAGE_SIZE (eg.
80K on 4K pages or 'ulimit -s 79') all processes will be killed before
they start.  This is particularly bad with 64K pages, where a ulimit below
1280K will kill every process.

To test, do:

  'ulimit -s 15; ls'

before and after the patch is applied.  Before it's applied, 'ls' should
be killed.  After the patch is applied, 'ls' should no longer be killed.

A stack limit of 15KB since it's small enough to trigger 20*PAGE_SIZE.
Also 15KB not a multiple of PAGE_SIZE, which is a trickier case to handle
correctly with this code.

4K pages should be fine to test with.

[kosaki.motohiro@jp.fujitsu.com: cleanup]
[akpm@linux-foundation.org: cleanup cleanup]
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-11 13:59:43 -08:00
Andreas Schwab
4cfbafd33f compat_ioctl: add compat handler for TIOCGSID ioctl
This is used by tcgetsid(3).

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-11 13:59:42 -08:00
Sage Weil
8031049147 ceph: remove bogus invalidate_mapping_pages
We were invalidating mapping pages when dropping FILE_CACHE in
__send_cap().  But ceph_check_caps attempts to invalidate already, and
also checks for success, so we should never get to this point.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-11 11:48:55 -08:00
Sage Weil
0840d8af3e ceph: invalidate pages even if truncate is pending
There is no reason not to invalidate pages when a truncate is pending.
Both throw out page cache pages.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-11 11:48:54 -08:00
Sage Weil
3c6f6b79a6 ceph: cleanup async writeback, truncation, invalidate helpers
Grab inode ref in helper.  Make work functions static, with consistent
naming.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-11 11:48:54 -08:00