Commit Graph

10116 Commits

Author SHA1 Message Date
Oleg Nesterov
246bb0b1de kill PF_BORROWED_MM in favour of PF_KTHREAD
Kill PF_BORROWED_MM.  Change use_mm/unuse_mm to not play with ->flags, and
do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.

No functional changes yet.  But this allows us to do further
fixes/cleanups.

oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
kthreads, this is wrong because of use_mm().  The problem with
PF_BORROWED_MM is that we need task_lock() to avoid races.  With this
patch we can check PF_KTHREAD directly, or use a simple lockless helper:

	/* The result must not be dereferenced !!! */
	struct mm_struct *__get_task_mm(struct task_struct *tsk)
	{
		if (tsk->flags & PF_KTHREAD)
			return NULL;
		return tsk->mm;
	}

Note also ecard_task().  It runs with ->mm != NULL, but it's the kernel
thread without PF_BORROWED_MM.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:39 -07:00
Oleg Nesterov
7b34e4283c introduce PF_KTHREAD flag
Introduce the new PF_KTHREAD flag to mark the kernel threads.  It is set
by INIT_TASK() and copied to the forked childs (we could set it in
kthreadd() along with PF_NOFREEZE instead).

daemonize() was changed as well.  In that case testing of PF_KTHREAD is
racy, but daemonize() is hopeless anyway.

This flag is cleared in do_execve(), before search_binary_handler().
Probably not the best place, we can do this in exec_mmap() or in
start_thread(), or clear it along with PF_FORKNOEXEC.  But I think this
doesn't matter in practice, and if do_execve() fails kthread should die
soon.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:39 -07:00
Oleg Nesterov
e4901f92a8 coredump: zap_threads: comments && use while_each_thread()
No changes in fs/exec.o

The for_each_process() loop in zap_threads() is very subtle, it is not
clear why we don't race with fork/exit/exec.  Add the fat comment.

Also, change the code to use while_each_thread().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:38 -07:00
Jan Kara
657d3bfa98 quota: implement sending information via netlink about user below quota
Sometimes it may be useful for userspace to know (e.g.  for some hosting
guys) that some user stopped exceeding his hardlimit or softlimit in
quotas.  Implement sending of such events to userspace via quota netlink
protocol so that they don't have to poll for such events.  Based on idea
and initial implementation by Vladislav Bogdanov.

Cc: Vladislav Bogdanov <slava@nsys.by>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:35 -07:00
Jan Kara
74abb9890d quota: move function-macros from quota.h to quotaops.h
Move declarations of some macros, which should be in fact functions to
quotaops.h.  This way they can be later converted to inline functions
because we can now use declarations from quota.h.  Also add necessary
includes of quotaops.h to a few files.

[akpm@linux-foundation.org: fix JFS build]
[akpm@linux-foundation.org: fix UFS build]
[vegard.nossum@gmail.com: fix QUOTA=n build]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Arjen Pool <arjenpool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:35 -07:00
Jan Kara
02a55ca871 quota: cleanup loop in sync_dquots()
Make loop in sync_dquots() checking whether there's something to write
more readable, remove useless variable and macro info_any_dirty() which
is used only in this place.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:35 -07:00
Jan Kara
b85f4b87a5 quota: rename quota functions from upper case, make bigger ones non-inline
Cleanup quotaops.h: Rename functions from uppercase to lowercase (and
define backward compatibility macros), move larger functions to dquot.c
and make them non-inline.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:35 -07:00
Jan Kara
b48d380541 quota: fix possible infinite loop in quota code
When quota structure is going to be dropped and it is dirty, quota code tries
to write it.  If the write fails for some reason (e.  g.  transaction cannot
be started because the journal is aborted), we try writing again and again and
again...  Fix the problem by clearing the dirty bit even if the write failed.

(akpm: for 2.6.27, 2.6.26.x and 2.6.25.x)

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: dingdinghua <dingdinghua85@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
Joe Peterson
b271e067c8 fatfs: add UTC timestamp option
Provide a new mount option ("tz=UTC") for DOS (vfat/msdos) filesystems,
allowing timestamps to be in coordinated universal time (UTC) rather than
local time in applications where doing this is advantageous.

In particular, portable devices that use fat/vfat (such as digital
cameras) can benefit from using UTC in their internal clocks, thus
avoiding daylight saving time errors and general time ambiguity issues.
The user of the device does not have to worry about changing the time when
moving from place or when daylight saving changes.

The new mount option, when set, disables the counter-adjustment that Linux
currently makes to FAT timestamp info in anticipation of the normal
userspace time zone correction.  When used in this new mode, all daylight
saving time and time zone handling is done in userspace as is normal for
many other filesystems (like ext3).  The default mode, which remains
unchanged, is still appropriate when mounting volumes written in Windows
(because of its use of local time).

I originally based this patch on one submitted last year by Paul Collins,
but I updated it to work with current source and changed variable/option
naming.  Ogawa Hirofumi (who maintains these filesystems) and I discussed
this patch at length on lkml, and he suggested using the option name in
the attached version of the patch.  Barry Bouwsma pointed out a good
addition to the patch as well.

Signed-off-by: Joe Peterson <joe@skyrush.com>
Signed-off-by: Paul Collins <paul@ondioline.org>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Barry Bouwsma <free_beer_for_all@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
Adrian Bunk
e8938a62a8 remove unused #include <linux/dirent.h>'s
Remove some unused #include <linux/dirent.h>'s.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
Rene Scharfe
7557bc66be msdos fs: remove unsettable atari option
It has been impossible to set the option 'atari' of the MSDOS filesystem
for several years.  Since nobody seems to have missed it, let's remove its
remains.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
OGAWA Hirofumi
dcd8c53f13 fat: small optimization to __fat_readdir()
This removes unnecessary parsing for directory entries.

If short_only, we don't need to parse longname.  And if !both and it found
the longname, we don't need shortname.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
OGAWA Hirofumi
98a1516004 fat: use same logic in fat_search_long() and __fat_readdir()
This uses uses stack for shortname, and uses __getname() for longname in
fat_search_long() and __fat_readdir().  By this, it removes unneeded
__getname() for shortname.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
OGAWA Hirofumi
d688611674 fat: cleanup fs/fat/dir.c
This is no logic changes, just cleans fs/fat/dir.c up.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
Adrian Bunk
531f710f8e fat/dir.c: switch to struct __fat_dirent
struct __fat_dirent is what was formerly the kernel struct dirent (that
was different from the userspace struct dirent).

Converting all fat users to struct __fat_dirent will allow us to get rid
of the conflicting struct dirent definition.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
OGAWA Hirofumi
8d44d9741f fat: fix parse_options()
Current parse_options() exits too early.  We need to run the code of
bottom in this function even if users doesn't specify options.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:34 -07:00
Shen Feng
3264d4ded4 reiserfs: remove double definitions of xattr macros
remove the definitions of macros:
XATTR_SECURITY_PREFIX
XATTR_TRUSTED_PREFIX
XATTR_USER_PREFIX
since they are defined in linux/xattr.h

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Jeff Mahoney
90415deac7 reiserfs: convert j_commit_lock to mutex
j_commit_lock is a semaphore but uses it as if it were a mutex.  This patch
converts it to a mutex.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Jeff Mahoney
afe7025907 reiserfs: convert j_flush_sem to mutex
j_flush_sem is a semaphore but uses it as if it were a mutex.  This patch
converts it to a mutex.

[akpm@linux-foundation.org: fix mutex_trylock retval treatment]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Jeff Mahoney
f68215c464 reiserfs: convert j_lock to mutex
j_lock is a semaphore but uses it as if it were a mutex.  This patch converts
it to a mutex.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Jan Kara
00b441970a reiserfs: correct mount option parsing to detect when quota options can be changed
We should not allow user to change quota mount options when quota is just
suspended.  It would make mount options and internal quota state inconsistent.

Also we should not allow user to change quota format when quota is turned on.
On the other hand we can just silently ignore when some option is set to the
value it already has (some mount versions do this on remount).  Finally, we
should not discard current quota options if parsing of mount options fails.

Cc: <reiserfs-devel@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Jan Kara
4506567b24 reiserfs: fix typos in messages and comments (journalled -> journaled)
Cc: <reiserfs-devel@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Jan Kara
5d4f7fddf8 reiserfs: fix synchronization of quota files in journal=data mode
In journal=data mode, it is not enough to do write_inode_now() as done in
vfs_quota_on() to write all data to their final location (which is needed for
quota_read to work correctly).  Calling journal_end_sync() before calling
vfs_quota_on() does it's job because transactions are committed to the journal
and data marked as dirty in memory so write_inode_now() writes them to their
final locations.

Cc: <reiserfs-devel@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Matthias Kaehlcke
895c23f8c3 hfsplus: convert the extents_lock in a mutex
Apple Extended HFS file system: The semaphore extents lock is used as a
mutex.  Convert it to the mutex API.

Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Matthias Kaehlcke
39f8d472f2 hfs: convert extents_lock in a mutex
Apple Macintosh file system: The semaphore extens_lock is used as a mutex.
Convert it to the mutex API

Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Matthias Kaehlcke
3084b72de7 hfs: convert bitmap_lock in a mutex
Apple Macintosh file system: The semaphore bitmap_lock is used as a mutex.
Convert it to the mutex API

Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Adrian Bunk
de0ca06a99 coda: remove CODA_FS_OLD_API
While fixing CONFIG_ leakages to the userspace kernel headers I ran into
CODA_FS_OLD_API.

After five years, are there still people using the old API left?
Especially considering that you have to choose at compile time which API
to support in the kernel (and distributions tend to offer the new API for
some time).

Jan: "The old API can definitely go.  Around the time the new
      interface went in there were some non-Coda userspace file system
      implementations that took a while longer to convert to the new API,
      but by now they all switched to the new interface or in some cases
      to a FUSE-based solution."

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Adam Greenblatt
c0a1633b62 isofs: fix minor filesystem corruption
Some iso9660 images contain files with rockridge data that is either
incorrect or incompletely parsed.  Prior to commit
f2966632a1 ("[PATCH] rock: handle directory
overflows") (included with kernel 2.6.13) the kernel ignored the rockridge
data for these files, while still allowing the files to be accessed under
their non-rockridge names.  That commit inadvertently changed things so
that files with invalid rockridge data could not be accessed at all.  (I
ran across the problem when comparing some old CDs with hard disk copies I
had made long ago under kernel 2.4: a few of the files on the hard disk
copies were no longer visible on the CDs.)

This change reverts to the pre-2.6.13 behavior.

Signed-off-by: Adam Greenblatt <adam.greenblatt@gmail.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Duane Griffin
275c0a8f12 ext3: validate directory entry data before use
ext3_dx_find_entry uses ext3_next_entry without verifying that the entry
is valid.  If its rec_len == 0 this causes an infinite loop.  Refactor the
loop to check the validity of entries before checking whether they match
and moving onto the next one.

There are other uses of ext3_next_entry in this file which also look
problematic.  They should be reviewed and fixed if/when we have a
test-case that triggers them.

This patch fixes the first case (image hdb.25.softlockup.gz) reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:33 -07:00
Hidehiro Kawai
cbe5f466f6 jbd: don't abort if flushing file data failed
In ordered mode, the current jbd aborts the journal if a file data buffer
has an error.  But this behavior is unintended, and we found that it has
been adopted accidentally.

This patch undoes it and just calls printk() instead of aborting the
journal.  Additionally, set AS_EIO into the address_space object of the
failed buffer which is submitted by journal_do_submit_data() so that
fsync() can get -EIO.

Missing error checkings are also added to inform errors on file data
buffers to the user.  The following buffers are targeted.

  (a) the buffer which has already been written out by pdflush
  (b) the buffer which has been unlocked before scanned in the
      t_locked_list loop

[akpm@linux-foundation.org: improve grammar in a printk]
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Li Zefan
8ef2720397 ext3: kill 2 useless magic numbers
dx_root_limit() will never return 20, and I can't figure out what 20
stands for.  This function has never changed since htree directory
indexing was merged.

Similar for dx_node_limit() and the magic 22.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Toshiyuki Okajima
fc80c44277 jbd: positively dispose the unmapped data buffers in journal_commit_transaction()
After ext3-ordered files are truncated, there is a possibility that the
pages which cannot be estimated still remain.  Remaining pages can be
released when the system has really few memory.  So, it is not memory
leakage.  But the resource management software etc.  may not work
correctly.

It is possible that journal_unmap_buffer() cannot release the buffers, and
the pages to which they belong because they are attached to a commiting
transaction and journal_unmap_buffer() cannot release them.  To release
such the buffers and the pages later, journal_unmap_buffer() leaves it to
journal_commit_transaction().  (journal_unmap_buffer() puts the mark
'BH_Freed' to the buffers so that journal_commit_transaction() can
identify whether they can be released or not.)

In the journalled mode and the writeback mode, jbd does with only metadata
buffers.  But in the ordered mode, jbd does with metadata buffers and also
data buffers.

Actually, journal_commit_transaction() releases only the metadata buffers
of which release is demanded by journal_unmap_buffer(), and also releases
the pages to which they belong if possible.

As a result, the data buffers of which release is demanded by
journal_unmap_buffer() remain after a transaction commits.  And also the
pages to which they belong remain.

Such the remained pages don't have mapping any longer.  Due to this fact,
there is a possibility that the pages which cannot be estimated remain.

The metadata buffers marked 'BH_Freed' and the pages to which
they belong can be released at 'JBD: commit phase 7'.

Therefore, by applying the same code into 'JBD: commit phase 2' (where the
data buffers are done with), journal_commit_transaction() can also release
the data buffers marked 'BH_Freed' and the pages to which they belong.

As a result, all the buffers marked 'BH_Freed' can be released, and also
all the pages to which these buffers belong can be released at
journal_commit_transaction().  So, the page which cannot be estimated is
lost.

<<Excerpt of code at 'JBD: commit phase 7'>>
 >         spin_lock(&journal->j_list_lock);
 >         while (commit_transaction->t_forget) {
 >                 transaction_t *cp_transaction;
 >                 struct buffer_head *bh;
 >
 >                 jh = commit_transaction->t_forget;
 >...
 >                 if (buffer_freed(bh)) {
 >                 ^^^^^^^^^^^^^^^^^^^^^^^^
 >                         clear_buffer_freed(bh);
 >                        ^^^^^^^^^^^^^^^^^^^^^^^^
 >                         clear_buffer_jbddirty(bh);
 >                 }
 >
 >                 if (buffer_jbddirty(bh)) {
 >                         JBUFFER_TRACE(jh, "add to new checkpointing trans");
 >                         __journal_insert_checkpoint(jh, commit_transaction);
 >                         JBUFFER_TRACE(jh, "refile for checkpoint writeback");
 >                         __journal_refile_buffer(jh);
 >                         jbd_unlock_bh_state(bh);
 >                 } else {
 >                         J_ASSERT_BH(bh, !buffer_dirty(bh));
 > ...
 >                         JBUFFER_TRACE(jh, "refile or unfile freed buffer");
 >                         __journal_refile_buffer(jh);
 >                         if (!jh->b_transaction) {
 >                                 jbd_unlock_bh_state(bh);
 >                                  /* needs a brelse */
 >                                 journal_remove_journal_head(bh);
 >                                 release_buffer_page(bh);
 >                                 ^^^^^^^^^^^^^^^^^^^^^^^^
 >                         } else
 >                 }
****************************************************************
* Apply the code of "^^^^^^" lines into 'JBD: commit phase 2' *
****************************************************************

At journal_commit_transaction() code, there is one extra message in the
series of jbd debug messages.  ("JBD: commit phase 2") This patch fixes
it, too.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Adrian Bunk
a10320e8f7 jbd: unexport journal_update_superblock
Remove the unused EXPORT_SYMBOL(journal_update_superblock).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Duane Griffin
3ccc3167b0 ext3: handle deleting corrupted indirect blocks
While freeing indirect blocks we attach a journal head to the parent
buffer head, free the blocks, then journal the parent.  If the indirect
block list is corrupted and points to the parent the journal head will be
detached when the block is cleared, causing an OOPS.

Check for that explicitly and handle it gracefully.

This patch fixes the third case (image hdb.20000057.nullderef.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Immediately above the change, in the ext3_free_data function, we call
ext3_clear_blocks to clear the indirect blocks in this parent block.  If
one of those blocks happens to actually be the parent block it will clear
b_private / BH_JBD.

I did the check at the end rather than earlier as it seemed more elegant.
I don't think there should be much practical difference, although it is
possible the FS may not be quite so badly corrupted if we did it the other
way (and didn't clear the block at all).  To be honest, I'm not convinced
there aren't other similar failure modes lurking in this code, although I
couldn't find any with a quick review.

[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Hidehiro Kawai
95450f5a7e ext3: don't read inode block if the buffer has a write error
A transient I/O error can corrupt inode data.  Here is the scenario:

(1) update inode_A at the block_B
(2) pdflush writes out new inode_A to the filesystem, but it results
    in write I/O error, at this point, BH_Uptodate flag of the buffer
    for block_B is cleared and BH_Write_EIO is set
(3) create new inode_C which located at block_B, and
    __ext3_get_inode_loc() tries to read on-disk block_B because the
    buffer is not uptodate
(4) if it can read on-disk block_B successfully, inode_A is
    overwritten by old data

This patch makes __ext3_get_inode_loc() not read the inode block if the
buffer has BH_Write_EIO flag.  In this case, the buffer should have the
latest information, so setting the uptodate flag to the buffer (this
avoids WARN_ON_ONCE() in mark_buffer_dirty().)

According to this change, we would need to test BH_Write_EIO flag for the
error checking.  Currently nobody checks write I/O errors on metadata
buffers, but it will be done in other patches I'm working on.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Duane Griffin
ae76dd9a6b ext3: handle corrupted orphan list at mount
If the orphan node list includes valid, untruncatable nodes with nlink > 0
the ext3_orphan_cleanup loop which attempts to delete them will not do so,
causing it to loop forever. Fix by checking for such nodes in the
ext3_orphan_get function.

This patch fixes the second case (image hdb.20000009.softlockup.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: printk warning fix]
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Shen Feng
ef1afd3951 ext3: remove double definitions of xattr macros
remove the definitions of macros:
XATTR_TRUSTED_PREFIX
XATTR_USER_PREFIX
since they are defined in linux/xattr.h

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Mingming Cao
3f31fddfa2 jbd: fix race between free buffer and commit transaction
journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data
buffer to flush to disk.  If the caller of journal_try_to_free_buffers()
request tries hard to release the buffers, it will treat the failure as
error and return back to the caller.  We have seen the directo IO failed
due to this race.  Some of the caller of releasepage() also expecting the
buffer to be dropped when passed with GFP_KERNEL mask to the
releasepage()->journal_try_to_free_buffers().

With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the
current committing transaction, then try to free those buffers again.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Shen Feng
9ebfbe9f92 ext3: improve some code in rb tree part of dir.c
- remove unnecessary code in free_rb_tree_fname
 - rename free_rb_tree_fname to ext3_htree_create_dir_info
   since it and ext3_htree_free_dir_info are a pair
 - replace kmalloc with kzalloc in ext3_htree_free_dir_info

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Duane Griffin
1984bb763c jbd: tidy up revoke cache initialisation and destruction
Make revocation cache destruction safe to call if initialisation fails
partially or entirely.  This allows it to be used to cleanup in the case
of initialisation failure, simplifying that code slightly.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Duane Griffin
f4d79ca2fa jbd: eliminate duplicated code in revocation table init/destroy functions
The revocation table initialisation/destruction code is repeated for each
of the two revocation tables stored in the journal.  Refactoring the
duplicated code into functions is tidier, simplifies the logic in
initialisation in particular, and slightly reduces the code size.

There should not be any functional change.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Duane Griffin
3850f7a521 jbd: replace potentially false assertion with if block
If an error occurs during jbd cache initialisation it is possible for the
journal_head_cache to be NULL when journal_destroy_journal_head_cache is
called.  Replace the J_ASSERT with an if block to handle the situation
correctly.

Note that even with this fix things will break badly if jbd is statically
compiled in and cache initialisation fails.

Signed-off-by: Duane Griffin <duaneg@dghda.com
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Jan Kara
d06bf1d252 ext3: correct mount option parsing to detect when quota options can be changed
We should not allow user to change quota mount options when quota is just
suspended.  I would make mount options and internal quota state inconsistent.
Also we should not allow user to change quota format when quota is turned on.
On the other hand we can just silently ignore when some option is set to the
value it already has (mount does this on remount).

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:32 -07:00
Jan Kara
99aeaf639f ext3: fix typos in messages and comments (journalled -> journaled)
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:31 -07:00
Jan Kara
9cfe7b9010 ext3: fix synchronization of quota files in journal=data mode
In journal=data mode, it is not enough to do write_inode_now as done in
vfs_quota_on() to write all data to their final location (which is needed for
quota_read to work correctly).  Calling journal_flush() does its job.

Reported-by: Nick <gentuu@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:31 -07:00
Shen Feng
f905f06fca ext2: remove double definitions of xattr macros
remove the definitions of macros:
XATTR_TRUSTED_PREFIX
XATTR_USER_PREFIX
since they are defined in linux/xattr.h

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:31 -07:00
Adrian Bunk
fb523f3227 minix: remove !NO_TRUNCATE code
This patch removes the !NO_TRUNCATE code that anyway required a manual
editing of the code for being used.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:30 -07:00
Hugh Dickins
ba92a43dba exec: remove some includes
fs/exec.c used to need mman.h pagemap.h swap.h and rmap.h when it did
mm-ish stuff in install_arg_page(); but no need for them after 2.6.22.

[akpm@linux-foundation.org: unbreak arm]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:28 -07:00
Harvey Harrison
b7bbf8fa6b fs: ldm.[ch] use get_unaligned_* helpers
Replace the private BE16/BE32/BE64 macros with direct calls to
get_unaligned_be16/32/64.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:26 -07:00
David Woodhouse
ff877ea80e Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubi-2.6 2008-07-25 10:40:14 -04:00
Nathan Lynch
483fad1c3f ELF loader support for auxvec base platform string
Some IBM POWER-based platforms have the ability to run in a
mode which mostly appears to the OS as a different processor from the
actual hardware.  For example, a Power6 system may appear to be a
Power5+, which makes the AT_PLATFORM value "power5+".  This means that
programs are restricted to the ISA supported by Power5+;
Power6-specific instructions are treated as illegal.

However, some applications (virtual machines, optimized libraries) can
benefit from knowledge of the underlying CPU model.  A new aux vector
entry, AT_BASE_PLATFORM, will denote the actual hardware.  For
example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
will be "power5+" and AT_BASE_PLATFORM will be "power6".  The idea is
that AT_PLATFORM indicates the instruction set supported, while
AT_BASE_PLATFORM indicates the underlying microarchitecture.

If the architecture has defined ELF_BASE_PLATFORM, copy that value to
the user stack in the same manner as ELF_PLATFORM.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2008-07-25 15:44:39 +10:00
Adrian Bunk
fb2e405fc1 fix fs/nfs/nfsroot.c compilation
This fixes the following compile error caused by commit
f9247273cb ("UFS: add const to parser
token table"):

    CC      fs/nfs/nfsroot.o
  /home/bunk/linux/kernel-2.6/git/linux-2.6/fs/nfs/nfsroot.c:130: error: tokens causes a section type conflict
  make[3]: *** [fs/nfs/nfsroot.o] Error 1

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 17:32:41 -07:00
Chris Wright
e2d2867ff8 When verifying the decoded header before decoding the object identifier
(expecting a SPNEGO pseudo-mechanism oid), the test to verify it is a
primitive encoding is compared against the asn1 class.  Primitive is not a
class.  This brings check in line with similar check for krb/ntlmssp oid.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-24 20:43:34 +00:00
Steven Whitehouse
f9247273cb UFS: add const to parser token table
This patch adds a "const" to the parser token table. I've done an
allmodconfig build to see if this produces any warnings/failures and the
patch includes a fix for the only warning that was produced.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Alexander Viro <aviro@redhat.com>
Acked-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 11:50:15 -07:00
Ian Kent
aa55ddf340 autofs4: remove unused ioctls
The ioctls AUTOFS_IOC_TOGGLEREGHOST and AUTOFS_IOC_ASKREGHOST were added
several years ago but what they were intended for has never been
implemented (as far as I'm aware noone uses them) so remove them.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:33 -07:00
Ian Kent
06a3598552 autofs4: reorganize expire pending wait function calls
This patch re-orgnirzes the checking for and waiting on active expires and
elininates redundant checks.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
ec6e8c7d3f autofs4: fix direct mount pending expire race - correction
Appologies, somehow I seem to have sent an out dated version of this
patch. Here is an additional patch that brings the patch up to date.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
6e60a9ab5f autofs4: fix direct mount pending expire race
For direct and offset type mounts that are covered by another mount we
cannot check the AUTOFS_INF_EXPIRING flag during a path walk which leads
to lookups walking into an expiring mount while it is being expired.

For example, for the direct multi-mount map entry with a couple of
offsets:

/race/mm1  /      <server1>:/<path1>
           /om1   <server2>:/<path2>
           /om2   <server1>:/<path3>

an autofs trigger mount is mounted on /race/mm1 and when accessed it is
over mounted and trigger mounts made for /race/mm1/om1 and /race/mm1/om2.
So it isn't possible for path walks to see the expiring flag at all and
they happily walk into the file system while it is expiring.

When expiring these mounts follow_down() must stop at the autofs mount and
all processes must block in the ->follow_link() method (except the daemon)
until the expire is complete.  This is done by decrementing the d_mounted
field of the autofs trigger mount root dentry until the expire is
completed.  In ->follow_link() all processes wait on the expire and the
mount following is completed for the daemon until the expire is complete.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
97e7449a7a autofs4: fix indirect mount pending expire race
The selection of a dentry for expiration and the setting of the
AUTOFS_INF_EXPIRING flag isn't done atomically which can lead to lookups
walking into an expiring mount.

What happens is that an expire is initiated by the daemon and a dentry is
selected for expire but, since there is no lock held between the selection
and setting of the expiring flag, a process may find the flag clear and
continue walking into the mount tree at the same time the daemon attempts
the expire it.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
26e81b3142 autofs4: fix pending checks
There are two cases for which a dentry that has a pending mount request
does not wait for completion.  One is via autofs4_revalidate() and the
other via autofs4_follow_link().

In revalidate, after the mount point directory is created, but before the
mount is done, the check in try_to_fill_dentry() can can fail to send the
dentry to the wait queue since the dentry is positive and the lookup flags
may contain only LOOKUP_FOLLOW.  Although we don't trigger a mount for the
LOOKUP_FOLLOW flag, if ther's one pending we might as well wait and use
the mounted dentry for the lookup.

In autofs4_follow_link() the dentry is not checked to see if it is pending
so it may fail to call try_to_fill_dentry() and not wait for mount
completion.

A dentry that is pending must always be sent to the wait queue.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
ff9cd499d6 autofs4: cleanup redundant readir code
The mount triggering functionality of readdir and related functions is no
longer used (and is quite broken as well).  The unused portions have been
removed.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
c72305b547 autofs4: indirect dentry must almost always be positive
We have been seeing mount requests comming to the automount daemon for
keys of the form "<map key>/<non key directory>" which are lookups for
invalid map keys.  But we can check for this in the kernel module and
return a fail immediately, without having to send a request to the daemon.

It is possible to recognise these requests are invalid based on whether
the request dentry is negative and its relation to the autofs file system
root.

For example, given the indirect multi-mount map entry:

idm1  \
    /mm1  <server>:/<path1>
    /mm2  <server>:/<path2>

For a request to mount idm1, IS_ROOT((idm1)->d_parent) will be always be
true and the dentry may be negative.  But directories idm1/mm1 and
idm1/mm2 will always be created as part of the mount request for idm1.  So
any mount request within idm1 itself must have a positive dentry otherwise
the map key is invalid.

In version 4 these multi-mount entries are all mounted and umounted as a
single request and in version 5 the directories idm1/mm1 and idm1/mm2 are
created and an autofs fs mounted on them to act as a mount trigger so the
above is also true.

This also holds true for the autofs version 4 pseudo direct mount feature.
 When this feature is used without the "--ghost" option automount(8) will
create internal submounts as we go down the map key paths which are
essentially normal indirect mounts for which the above holds.  If the
"--ghost" option is given the directories for map keys are created at
daemon startup so valid map entries correspond to postive dentries in the
autofs fs.

autofs version 5 direct mount maps are similar except that the IS_ROOT
check is not needed.  This has been addressed in a previous patch tittled
"autofs4 - detect invalid direct mount requests".

For example, given the direct multi-mount map entry:

/test/dm1  \
    /mm1  <server>:/<path1>
    /mm2  <server>:/<path2>

An autofs fs is mounted on /test/dm1 as a trigger mount and when a mount
is triggered for /test/dm1, the multi-mount offset directories
/test/dm1/mm1 and /test/dm1/mm2 are created and an autofs fs is mounted on
them to act as mount triggers.  So valid direct mount requests must always
have a positive dentry if they correspond to a valid map entry.

Signed-off-by: Ian Kent <raven@themaw.net>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
eb3b176796 autofs4: detect invalid direct mount requests
autofs v5 direct and offset mounts within an autofs filesystem are
triggered by existing autofs triger mounts so the mount point dentry must
be positive.  If the mount point dentry is negative then the trigger
doesn't exist so we can return fail immediately.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
296f7bf78b autofs4: fix waitq memory leak
If an autofs mount becomes catatonic before autofs4_wait_release() is
called the wait queue counter will not be decremented down to zero and the
entry will never be freed.  There are also races decrementing the wait
counter in the wait release function.  To deal with this the counter needs
to be updated while holding the wait queue mutex and waiters need to be
woken up unconditionally when the wait is removed from the queue to ensure
we eventually free the wait.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
e64be33cca autofs4: check kernel communication pipe is valid for write
It is possible for an autofs mount to become catatonic (and for the daemon
communication pipe to become NULL) after a wait has been initiallized but
before the request has been sent to the daemon.  We need to check for this
before sending the request packet.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
f4c7da0261 autofs4: add missing kfree
It see that the patch tittled "autofs4 - fix pending mount race" is
missing a change that I had recently made.

It's missing a kfree for the case mutex_lock_interruptible() fails
to aquire the wait queue mutex.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
a1362fe92f autofs4: fix pending mount race
Close a race between a pending mount that is about to finish and a new
lookup for the same directory.

Process P1 triggers a mount of directory foo.  It sets
DCACHE_AUTOFS_PENDING in the ->lookup routine, creates a waitq entry for
'foo', and calls out to the daemon to perform the mount.  The autofs
daemon will then create the directory 'foo', using a new dentry that will
be hashed in the dcache.

Before the mount completes, another process, P2, tries to walk into the
'foo' directory.  The vfs path walking code finds an entry for 'foo' and
calls the revalidate method.  Revalidate finds that the entry is not
PENDING (because PENDING was never set on the dentry created by the
mkdir), but it does find the directory is empty.  Revalidate calls
try_to_fill_dentry, which sets the PENDING flag and then calls into the
autofs4 wait code to trigger or wait for a mount of 'foo'.  The wait code
finds the entry for 'foo' and goes to sleep waiting for the completion of
the mount.

Yet another process, P3, tries to walk into the 'foo' directory.  This
process again finds a dentry in the dcache for 'foo', and calls into the
autofs revalidate code.

The revalidate code finds that the PENDING flag is set, and so calls
try_to_fill_dentry.

a) try_to_fill_dentry sets the PENDING flag redundantly for this
   dentry, then calls into the autofs4 wait code.

b) the autofs4 wait code takes the waitq mutex and searches for an
   entry for 'foo'

Between a and b, P1 is woken up because the mount completed.  P1 takes the
wait queue mutex, clears the PENDING flag from the dentry, and removes the
waitqueue entry for 'foo' from the list.

When it releases the waitq mutex, P3 (eventually) acquires it.  At this
time, it looks for an existing waitq for 'foo', finds none, and so creates
a new one and calls out to the daemon to mount the 'foo' directory.

Now, the reason that three processes are required to trigger this race is
that, because the PENDING flag is not set on the dentry created by mkdir,
the window for the race would be way to slim for it to ever occur.
Basically, between the testing of d_mountpoint(dentry) and the taking of
the waitq mutex, the mount would have to complete and the daemon would
have to be woken up, and that in turn would have to wake up P1.  This is
simply impossible.  Add the third process, though, and it becomes slightly
more likely.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
5a11d4d0ee autofs4: fix waitq locking
The autofs4_catatonic_mode() function accesses the wait queue without any
locking but can be called at any time.  This could lead to a possible
double free of the name field of the wait and a double fput of the daemon
communication pipe or an fput of a NULL file pointer.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Jeff Moyer
70b52a0a50 autofs4: use struct qstr in waitq.c
The autofs_wait_queue already contains all of the fields of the
struct qstr, so change it into a qstr.

This patch, from Jeff Moyer, has been modified a liitle by myself.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:32 -07:00
Ian Kent
6d5cb926fa autofs4: use lookup intent flags to trigger mounts
When an open(2) call is made on an autofs mount point directory that
already exists and the O_DIRECTORY flag is not used the needed mount
callback to the daemon is not done. This leads to the path walk
continuing resulting in a callback to the daemon with an incorrect
key. open(2) is called without O_DIRECTORY by the "find" utility but
this should be handled properly anyway.

This happens because autofs needs to use the lookup flags to decide
when to callback to the daemon to perform a mount to prevent mount
storms. For example, an autofs indirect mount map that has the "browse"
option will have the mount point directories are pre-created and the
stat(2) call made by a color ls against each directory will cause all
these directories to be mounted. It is unfortunate we need to resort
to this but mount maps can be quite large. Additionally, if a user
manually umounts an autofs indirect mount the directory isn't removed
which also leads to this situation.

To resolve this autofs needs to use the lookup intent flags to enable
it to make this decision. This patch adds this check and triggers a
call back if any of the lookup intent flags are set as all these calls
warrant a mount attempt be requested.

I know that external VFS code which uses the lookup flags is something
that the VFS would like to eliminate but I have no choice as I can't
see any other way to do this. A VFS dentry or inode operation callback
which returns the lookup "type" (requires a definition) would be
sufficient. But this change is needed now and I'm not aware of the form
that coming VFS changes will take so I'm not willing to propose anything
along these lines.

If anyone can provide an alternate method I would be happy to use it.

[akpm@linux-foundation.org: fix build for concurrent VFS changes]
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Ian Kent
c432c2586a autofs4: don't release directory mutex if called in oz_mode
Since we now delay hashing of dentrys until the ->mkdir() call, droping
and re-taking the directory mutex within the ->lookup() function when we
are being called by user space is not needed.  This can lead to a race
when other processes are attempting to access the same directory during
mount point directory creation.

In this case we need to hang onto the mutex to ensure we don't get user
processes trying to create a mount request for a newly created dentry
after the mount point entry has already been created.  This ensures that
when we need to check a dentry passed to autofs4_wait(), if it is hashed,
it is always the mount point dentry and not a new dentry created by
another lookup during directory creation.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Ian Kent
ef581a7428 autofs4: fix symlink name allocation
The length of the symlink name has been moved but it needs to be set
before allocating space for it in the dentry info struct.  This corrects a
mistake in a recent patch.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Ian Kent
2576737873 autofs4: use look aside list for lookups
A while ago a patch to resolve a deadlock during directory creation was
merged.  This delayed the hashing of lookup dentrys until the ->mkdir()
(or ->symlink()) operation completed to ensure we always went through
->lookup() instead of also having processes go through ->revalidate() so
our VFS locking remained consistent.

Now we are seeing a couple of side affects of that change in situations
with heavy mount activity.

Two cases have been identified:

1) When a mount request is triggered, due to the delayed hashing, the
   directory created by user space for the mount point doesn't have the
   DCACHE_AUTOFS_PENDING flag set.  In the case of an autofs multi-mount
   where a tree of mount point directories are created this can lead to
   the path walk continuing rather than the dentry being sent to the wait
   queue to wait for request completion.  This is because, if the pending
   flag isn't set, the criteria for deciding this is a mount in progress
   fails to hold, namely that the dentry is not a mount point and has no
   subdirectories.

2) A mount request dentry is initially created negative and unhashed.
   It remains this way until the ->mkdir() callback completes.  Since it
   is unhashed a fresh dentry is used when the user space mount request
   creates the mount point directory.  This leaves the original dentry
   negative and unhashed.  But revalidate has no way to tell the VFS that
   the dentry has changed, other than to force another ->lookup() by
   returning false, which is at best wastefull and at worst not possible.
   This results in an -ENOENT return from the original path walk when in
   fact the mount succeeded.

To resolve this we need to ensure that the same dentry is used in all
calls to ->lookup() during the course of a mount request.  This patch
achieves that by adding the initial dentry to a look aside list and
removes it at ->mkdir() or ->symlink() completion (or when the dentry is
released), since these are the only create operations autofs4 supports.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Ian Kent
caf7da3d5d autofs4: revert - redo lookup in ttfd
This patch series enables the use of a single dentry for lookups prior to
the dentry being hashed and so we no longer need to redo the lookup.  This
patch reverts the patch of commit
033790449b.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Ian Kent
5f6f4f28b6 autofs4: don't make expiring dentry negative
Correct the error of making a positive dentry negative after it has been
instantiated.

The code that makes this error attempts to re-use the dentry from a
concurrent expire and mount to resolve a race and the dentry used for the
lookup must be negative for mounts to trigger in the required cases.  The
fact is that the dentry doesn't need to be re-used because all that is
needed is to preserve the flag that indicates an expire is still
incomplete at the time of the mount request.

This change uses the the dentry to check the flag and wait for the expire
to complete then discards it instead of attempting to re-use it.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Michael Halcrow
391b52f98c eCryptfs: Make all persistent file opens delayed
There is no good reason to immediately open the lower file, and that can
cause problems with files that the user does not intend to immediately
open, such as device nodes.

This patch removes the persistent file open from the interpose step and
pushes that to the locations where eCryptfs really does need the lower
persistent file, such as just before reading or writing the metadata
stored in the lower file header.

Two functions are jumping to out_dput when they should just be jumping to
out on error paths.  This patch also fixes these.

Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Michael Halcrow
72b55fffd6 eCryptfs: do not try to open device files on mknod
When creating device nodes, eCryptfs needs to delay actually opening the lower
persistent file until an application tries to open.  Device handles may not be
backed by anything when they first come into existence.

[Valdis.Kletnieks@vt.edu: build fix]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: <Valdis.Kletnieks@vt.edu}
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Harvey Harrison
0a688ad713 ecryptfs: inode.c mmap.c use unaligned byteorder helpers
Fixe sparse warnings:
fs/ecryptfs/inode.c:368:15: warning: cast to restricted __be64
fs/ecryptfs/mmap.c:385:12: warning: incorrect type in assignment (different base types)
fs/ecryptfs/mmap.c:385:12:    expected unsigned long long [unsigned] [assigned] [usertype] file_size
fs/ecryptfs/mmap.c:385:12:    got restricted __be64 [usertype] <noident>
fs/ecryptfs/mmap.c:428:12: warning: incorrect type in assignment (different base types)
fs/ecryptfs/mmap.c:428:12:    expected unsigned long long [unsigned] [assigned] [usertype] file_size
fs/ecryptfs/mmap.c:428:12:    got restricted __be64 [usertype] <noident>

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Harvey Harrison
29335c6a41 ecryptfs: crypto.c use unaligned byteorder helpers
Fixes the following sparse warnings:
fs/ecryptfs/crypto.c:1036:8: warning: cast to restricted __be32
fs/ecryptfs/crypto.c:1038:8: warning: cast to restricted __be32
fs/ecryptfs/crypto.c:1077:10: warning: cast to restricted __be32
fs/ecryptfs/crypto.c:1103:6: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1105:6: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1124:8: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1241:21: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1244:30: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1414:23: warning: cast to restricted __be32
fs/ecryptfs/crypto.c:1417:32: warning: cast to restricted __be16

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Miklos Szeredi
8f2368095e ecryptfs: string copy cleanup
Clean up overcomplicated string copy, which also gets rid of this
bogus warning:

fs/ecryptfs/main.c: In function 'ecryptfs_parse_options':
include/asm/arch/string_32.h:75: warning: array subscript is above array bounds

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Eric Sandeen
982363c97f ecryptfs: propagate key errors up at mount time
Mounting with invalid key signatures should probably fail, if they were
specifically requested but not available.

Also fix case checks in process_request_key_err() for the right sign of
the errnos, as spotted by Jan Tluka.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Tluka <jtluka@redhat.com>
Acked-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Tyler Hicks
6c4c17b073 ecryptfs: discard ecryptfsd registration messages in miscdev
The userspace eCryptfs daemon sends HELO and QUIT messages to the kernel
for per-user daemon (un)registration.  These messages are required when
netlink is used as the transport, but (un)registration is handled by
opening and closing the device file when miscdev is the transport.  These
messages should be discarded in the miscdev transport so that a daemon
isn't registered twice.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:31 -07:00
Michael Halcrow
746f1e558b eCryptfs: Privileged kthread for lower file opens
eCryptfs would really like to have read-write access to all files in the
lower filesystem.  Right now, the persistent lower file may be opened
read-only if the attempt to open it read-write fails.  One way to keep
from having to do that is to have a privileged kthread that can open the
lower persistent file on behalf of the user opening the eCryptfs file;
this patch implements this functionality.

This patch will properly allow a less-privileged user to open the eCryptfs
file, followed by a more-privileged user opening the eCryptfs file, with
the first user only being able to read and the second user being able to
both read and write.  eCryptfs currently does this wrong; it will wind up
calling vfs_write() on a file that was opened read-only.  This is fixed in
this patch.

Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:30 -07:00
Ulrich Drepper
9fe5ad9c8c flag parameters add-on: remove epoll_create size param
Remove the size parameter from the new epoll_create syscall and renames the
syscall itself.  The updated test program follows.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_epoll_create2
# ifdef __x86_64__
#  define __NR_epoll_create2 291
# elif defined __i386__
#  define __NR_epoll_create2 329
# else
#  error "need __NR_epoll_create2"
# endif
#endif

#define EPOLL_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd = syscall (__NR_epoll_create2, 0);
  if (fd == -1)
    {
      puts ("epoll_create2(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("epoll_create2(0) set close-on-exec flag");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_epoll_create2, EPOLL_CLOEXEC);
  if (fd == -1)
    {
      puts ("epoll_create2(EPOLL_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper
e38b36f325 flag parameters: check magic constants
This patch adds test that ensure the boundary conditions for the various
constants introduced in the previous patches is met.  No code is generated.

[akpm@linux-foundation.org: fix alpha]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper
510df2dd48 flag parameters: NONBLOCK in inotify_init
This patch adds non-blocking support for inotify_init1.  The
additional changes needed are minimal.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_inotify_init1
# ifdef __x86_64__
#  define __NR_inotify_init1 294
# elif defined __i386__
#  define __NR_inotify_init1 332
# else
#  error "need __NR_inotify_init1"
# endif
#endif

#define IN_NONBLOCK O_NONBLOCK

int
main (void)
{
  int fd = syscall (__NR_inotify_init1, 0);
  if (fd == -1)
    {
      puts ("inotify_init1(0) failed");
      return 1;
    }
  int fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (fl & O_NONBLOCK)
    {
      puts ("inotify_init1(0) set non-blocking mode");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_inotify_init1, IN_NONBLOCK);
  if (fd == -1)
    {
      puts ("inotify_init1(IN_NONBLOCK) failed");
      return 1;
    }
  fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((fl & O_NONBLOCK) == 0)
    {
      puts ("inotify_init1(IN_NONBLOCK) set non-blocking mode");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper
be61a86d72 flag parameters: NONBLOCK in pipe
This patch adds O_NONBLOCK support to pipe2.  It is minimally more involved
than the patches for eventfd et.al but still trivial.  The interfaces of the
create_write_pipe and create_read_pipe helper functions were changed and the
one other caller as well.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_pipe2
# ifdef __x86_64__
#  define __NR_pipe2 293
# elif defined __i386__
#  define __NR_pipe2 331
# else
#  error "need __NR_pipe2"
# endif
#endif

int
main (void)
{
  int fds[2];
  if (syscall (__NR_pipe2, fds, 0) == -1)
    {
      puts ("pipe2(0) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      int fl = fcntl (fds[i], F_GETFL);
      if (fl == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (fl & O_NONBLOCK)
        {
          printf ("pipe2(0) set non-blocking mode for fds[%d]\n", i);
          return 1;
        }
      close (fds[i]);
    }

  if (syscall (__NR_pipe2, fds, O_NONBLOCK) == -1)
    {
      puts ("pipe2(O_NONBLOCK) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      int fl = fcntl (fds[i], F_GETFL);
      if (fl == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((fl & O_NONBLOCK) == 0)
        {
          printf ("pipe2(O_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i);
          return 1;
        }
      close (fds[i]);
    }

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper
6b1ef0e60d flag parameters: NONBLOCK in timerfd_create
This patch adds support for the TFD_NONBLOCK flag to timerfd_create.  The
additional changes needed are minimal.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_timerfd_create
# ifdef __x86_64__
#  define __NR_timerfd_create 283
# elif defined __i386__
#  define __NR_timerfd_create 322
# else
#  error "need __NR_timerfd_create"
# endif
#endif

#define TFD_NONBLOCK O_NONBLOCK

int
main (void)
{
  int fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, 0);
  if (fd == -1)
    {
      puts ("timerfd_create(0) failed");
      return 1;
    }
  int fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (fl & O_NONBLOCK)
    {
      puts ("timerfd_create(0) set non-blocking mode");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, TFD_NONBLOCK);
  if (fd == -1)
    {
      puts ("timerfd_create(TFD_NONBLOCK) failed");
      return 1;
    }
  fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((fl & O_NONBLOCK) == 0)
    {
      puts ("timerfd_create(TFD_NONBLOCK) set non-blocking mode");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper
e7d476dfdf flag parameters: NONBLOCK in eventfd
This patch adds support for the EFD_NONBLOCK flag to eventfd2.  The
additional changes needed are minimal.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_eventfd2
# ifdef __x86_64__
#  define __NR_eventfd2 290
# elif defined __i386__
#  define __NR_eventfd2 328
# else
#  error "need __NR_eventfd2"
# endif
#endif

#define EFD_NONBLOCK O_NONBLOCK

int
main (void)
{
  int fd = syscall (__NR_eventfd2, 1, 0);
  if (fd == -1)
    {
      puts ("eventfd2(0) failed");
      return 1;
    }
  int fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (fl & O_NONBLOCK)
    {
      puts ("eventfd2(0) sets non-blocking mode");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_eventfd2, 1, EFD_NONBLOCK);
  if (fd == -1)
    {
      puts ("eventfd2(EFD_NONBLOCK) failed");
      return 1;
    }
  fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((fl & O_NONBLOCK) == 0)
    {
      puts ("eventfd2(EFD_NONBLOCK) does not set non-blocking mode");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper
5fb5e04926 flag parameters: NONBLOCK in signalfd
This patch adds support for the SFD_NONBLOCK flag to signalfd4.  The
additional changes needed are minimal.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_signalfd4
# ifdef __x86_64__
#  define __NR_signalfd4 289
# elif defined __i386__
#  define __NR_signalfd4 327
# else
#  error "need __NR_signalfd4"
# endif
#endif

#define SFD_NONBLOCK O_NONBLOCK

int
main (void)
{
  sigset_t ss;
  sigemptyset (&ss);
  sigaddset (&ss, SIGUSR1);
  int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0);
  if (fd == -1)
    {
      puts ("signalfd4(0) failed");
      return 1;
    }
  int fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (fl & O_NONBLOCK)
    {
      puts ("signalfd4(0) set non-blocking mode");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_NONBLOCK);
  if (fd == -1)
    {
      puts ("signalfd4(SFD_NONBLOCK) failed");
      return 1;
    }
  fl = fcntl (fd, F_GETFL);
  if (fl == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((fl & O_NONBLOCK) == 0)
    {
      puts ("signalfd4(SFD_NONBLOCK) does not set non-blocking mode");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:29 -07:00
Ulrich Drepper
99829b8329 flag parameters: NONBLOCK in anon_inode_getfd
Building on the previous change to anon_inode_getfd, this patch introduces
support for handling of O_NONBLOCK in addition to the already supported
O_CLOEXEC.  Following patches will take advantage of this support.  As can be
seen, the additional support for supporting this functionality is minimal.

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:28 -07:00
Ulrich Drepper
4006553b06 flag parameters: inotify_init
This patch introduces the new syscall inotify_init1 (note: the 1 stands for
the one parameter the syscall takes, as opposed to no parameter before).  The
values accepted for this parameter are function-specific and defined in the
inotify.h header.  Here the values must match the O_* flags, though.  In this
patch CLOEXEC support is introduced.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_inotify_init1
# ifdef __x86_64__
#  define __NR_inotify_init1 294
# elif defined __i386__
#  define __NR_inotify_init1 332
# else
#  error "need __NR_inotify_init1"
# endif
#endif

#define IN_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd;
  fd = syscall (__NR_inotify_init1, 0);
  if (fd == -1)
    {
      puts ("inotify_init1(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("inotify_init1(0) set close-on-exit");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_inotify_init1, IN_CLOEXEC);
  if (fd == -1)
    {
      puts ("inotify_init1(IN_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:28 -07:00
Ulrich Drepper
ed8cae8ba0 flag parameters: pipe
This patch introduces the new syscall pipe2 which is like pipe but it also
takes an additional parameter which takes a flag value.  This patch implements
the handling of O_CLOEXEC for the flag.  I did not add support for the new
syscall for the architectures which have a special sys_pipe implementation.  I
think the maintainers of those archs have the chance to go with the unified
implementation but that's up to them.

The implementation introduces do_pipe_flags.  I did that instead of changing
all callers of do_pipe because some of the callers are written in assembler.
I would probably screw up changing the assembly code.  To avoid breaking code
do_pipe is now a small wrapper around do_pipe_flags.  Once all callers are
changed over to do_pipe_flags the old do_pipe function can be removed.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_pipe2
# ifdef __x86_64__
#  define __NR_pipe2 293
# elif defined __i386__
#  define __NR_pipe2 331
# else
#  error "need __NR_pipe2"
# endif
#endif

int
main (void)
{
  int fd[2];
  if (syscall (__NR_pipe2, fd, 0) != 0)
    {
      puts ("pipe2(0) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      int coe = fcntl (fd[i], F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          printf ("pipe2(0) set close-on-exit for fd[%d]\n", i);
          return 1;
        }
    }
  close (fd[0]);
  close (fd[1]);

  if (syscall (__NR_pipe2, fd, O_CLOEXEC) != 0)
    {
      puts ("pipe2(O_CLOEXEC) failed");
      return 1;
    }
  for (int i = 0; i < 2; ++i)
    {
      int coe = fcntl (fd[i], F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          printf ("pipe2(O_CLOEXEC) does not set close-on-exit for fd[%d]\n", i);
          return 1;
        }
    }
  close (fd[0]);
  close (fd[1]);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:28 -07:00
Ulrich Drepper
336dd1f70f flag parameters: dup2
This patch adds the new dup3 syscall.  It extends the old dup2 syscall by one
parameter which is meant to hold a flag value.  Support for the O_CLOEXEC flag
is added in this patch.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_dup3
# ifdef __x86_64__
#  define __NR_dup3 292
# elif defined __i386__
#  define __NR_dup3 330
# else
#  error "need __NR_dup3"
# endif
#endif

int
main (void)
{
  int fd = syscall (__NR_dup3, 1, 4, 0);
  if (fd == -1)
    {
      puts ("dup3(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("dup3(0) set close-on-exec flag");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_dup3, 1, 4, O_CLOEXEC);
  if (fd == -1)
    {
      puts ("dup3(O_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("dup3(O_CLOEXEC) set close-on-exec flag");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:28 -07:00
Ulrich Drepper
a0998b50c3 flag parameters: epoll_create
This patch adds the new epoll_create2 syscall.  It extends the old epoll_create
syscall by one parameter which is meant to hold a flag value.  In this
patch the only flag support is EPOLL_CLOEXEC which causes the close-on-exec
flag for the returned file descriptor to be set.

A new name EPOLL_CLOEXEC is introduced which in this implementation must
have the same value as O_CLOEXEC.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_epoll_create2
# ifdef __x86_64__
#  define __NR_epoll_create2 291
# elif defined __i386__
#  define __NR_epoll_create2 329
# else
#  error "need __NR_epoll_create2"
# endif
#endif

#define EPOLL_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd = syscall (__NR_epoll_create2, 1, 0);
  if (fd == -1)
    {
      puts ("epoll_create2(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("epoll_create2(0) set close-on-exec flag");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC);
  if (fd == -1)
    {
      puts ("epoll_create2(EPOLL_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:28 -07:00
Ulrich Drepper
11fcb6c146 flag parameters: timerfd_create
The timerfd_create syscall already has a flags parameter.  It just is
unused so far.  This patch changes this by introducing the TFD_CLOEXEC
flag to set the close-on-exec flag for the returned file descriptor.

A new name TFD_CLOEXEC is introduced which in this implementation must
have the same value as O_CLOEXEC.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_timerfd_create
# ifdef __x86_64__
#  define __NR_timerfd_create 283
# elif defined __i386__
#  define __NR_timerfd_create 322
# else
#  error "need __NR_timerfd_create"
# endif
#endif

#define TFD_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, 0);
  if (fd == -1)
    {
      puts ("timerfd_create(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("timerfd_create(0) set close-on-exec flag");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, TFD_CLOEXEC);
  if (fd == -1)
    {
      puts ("timerfd_create(TFD_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("timerfd_create(TFD_CLOEXEC) set close-on-exec flag");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:27 -07:00
Ulrich Drepper
b087498eb5 flag parameters: eventfd
This patch adds the new eventfd2 syscall.  It extends the old eventfd
syscall by one parameter which is meant to hold a flag value.  In this
patch the only flag support is EFD_CLOEXEC which causes the close-on-exec
flag for the returned file descriptor to be set.

A new name EFD_CLOEXEC is introduced which in this implementation must
have the same value as O_CLOEXEC.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_eventfd2
# ifdef __x86_64__
#  define __NR_eventfd2 290
# elif defined __i386__
#  define __NR_eventfd2 328
# else
#  error "need __NR_eventfd2"
# endif
#endif

#define EFD_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd = syscall (__NR_eventfd2, 1, 0);
  if (fd == -1)
    {
      puts ("eventfd2(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("eventfd2(0) sets close-on-exec flag");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC);
  if (fd == -1)
    {
      puts ("eventfd2(EFD_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:27 -07:00
Ulrich Drepper
9deb27baed flag parameters: signalfd
This patch adds the new signalfd4 syscall.  It extends the old signalfd
syscall by one parameter which is meant to hold a flag value.  In this
patch the only flag support is SFD_CLOEXEC which causes the close-on-exec
flag for the returned file descriptor to be set.

A new name SFD_CLOEXEC is introduced which in this implementation must
have the same value as O_CLOEXEC.

The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_signalfd4
# ifdef __x86_64__
#  define __NR_signalfd4 289
# elif defined __i386__
#  define __NR_signalfd4 327
# else
#  error "need __NR_signalfd4"
# endif
#endif

#define SFD_CLOEXEC O_CLOEXEC

int
main (void)
{
  sigset_t ss;
  sigemptyset (&ss);
  sigaddset (&ss, SIGUSR1);
  int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0);
  if (fd == -1)
    {
      puts ("signalfd4(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("signalfd4(0) set close-on-exec flag");
      return 1;
    }
  close (fd);

  fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_CLOEXEC);
  if (fd == -1)
    {
      puts ("signalfd4(SFD_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("signalfd4(SFD_CLOEXEC) does not set close-on-exec flag");
      return 1;
    }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:27 -07:00
Ulrich Drepper
7d9dbca342 flag parameters: anon_inode_getfd extension
This patch just extends the anon_inode_getfd interface to take an additional
parameter with a flag value.  The flag value is passed on to
get_unused_fd_flags in anticipation for a use with the O_CLOEXEC flag.

No actual semantic changes here, the changed callers all pass 0 for now.

[akpm@linux-foundation.org: KVM fix]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:27 -07:00
Akinobu Mita
6e2c10a12a binfmt_misc: use simple_read_from_buffer()
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:27 -07:00
Jon Tollefson
f4a67cceee fs: check for statfs overflow
Adds a check for an overflow in the filesystem size so if someone is
checking with statfs() on a 16G blocksize hugetlbfs in a 32bit binary that
it will report back EOVERFLOW instead of a size of 0.

Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:19 -07:00
Andi Kleen
a137e1cc6d hugetlbfs: per mount huge page sizes
Add the ability to configure the hugetlb hstate used on a per mount basis.

- Add a new pagesize= option to the hugetlbfs mount that allows setting
  the page size
- This option causes the mount code to find the hstate corresponding to the
  specified size, and sets up a pointer to the hstate in the mount's
  superblock.
- Change the hstate accessors to use this information rather than the
  global_hstate they were using (requires a slight change in mm/memory.c
  so we don't NULL deref in the error-unmap path -- see comments).

[np: take hstate out of hugetlbfs inode and vma->vm_private_data]

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:17 -07:00
Andi Kleen
a551643895 hugetlb: modular state for hugetlb page size
The goal of this patchset is to support multiple hugetlb page sizes.  This
is achieved by introducing a new struct hstate structure, which
encapsulates the important hugetlb state and constants (eg.  huge page
size, number of huge pages currently allocated, etc).

The hstate structure is then passed around the code which requires these
fields, they will do the right thing regardless of the exact hstate they
are operating on.

This patch adds the hstate structure, with a single global instance of it
(default_hstate), and does the basic work of converting hugetlb to use the
hstate.

Future patches will add more hstate structures to allow for different
hugetlbfs mounts to have different page sizes.

[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:17 -07:00
Eric Dumazet
a47a126ad5 vmallocinfo: add NUMA information
Christoph recently added /proc/vmallocinfo file to get information about
vmalloc allocations.

This patch adds NUMA specific information, giving number of pages
allocated on each memory node.

This should help to check that vmalloc() is able to respect NUMA policies.

Example of output on a four nodes machine (one cpu per node)

1) network hash tables are evenly spreaded on four nodes (OK) (Same
   point for inodes and dentries hash tables)

2) iptables tables (x_tables) are correctly allocated on each cpu node
   (OK).

3) sys_swapon() allocates its memory from one node only.

4) each loaded module is using memory on one node.

Sysadmins could tune their setup to change points 3) and 4) if necessary.

grep "pages="  /proc/vmallocinfo
0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
0xffffc2000031a000-0xffffc2000031d000   12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1
0xffffc2000031f000-0xffffc2000032b000   49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
0xffffc2000033e000-0xffffc20000341000   12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2
0xffffc20000341000-0xffffc20000344000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2
0xffffc20000344000-0xffffc20000347000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2
0xffffc20000347000-0xffffc2000034a000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2
0xffffc2000034a000-0xffffc2000034d000   12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2
0xffffc20004381000-0xffffc20004402000  528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32
0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256
0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
0xffffc20004904000-0xffffc20004bec000 3047424 sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743
0xffffffffa0000000-0xffffffffa000f000   61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14
0xffffffffa000f000-0xffffffffa0014000   20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4
0xffffffffa0014000-0xffffffffa0017000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2
0xffffffffa0017000-0xffffffffa0022000   45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10
0xffffffffa0022000-0xffffffffa0028000   24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5
0xffffffffa0028000-0xffffffffa0050000  163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39
0xffffffffa0050000-0xffffffffa0052000    8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1
0xffffffffa0052000-0xffffffffa0056000   16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3
0xffffffffa0056000-0xffffffffa0081000  176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42
0xffffffffa0081000-0xffffffffa00ae000  184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44
0xffffffffa00ae000-0xffffffffa00b1000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
0xffffffffa00b1000-0xffffffffa00b9000   32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7
0xffffffffa00b9000-0xffffffffa00c4000   45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10
0xffffffffa00c6000-0xffffffffa00e0000  106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25
0xffffffffa00e0000-0xffffffffa00f1000   69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16
0xffffffffa00f1000-0xffffffffa00f4000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
0xffffffffa00f4000-0xffffffffa00f7000   12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:17 -07:00
Pavel Machek
cce7708158 SYNC_FILE_RANGE_WRITE may and will block. Document that.
[akpm@linux-foundation.org: fix comment text]
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:17 -07:00
Mel Gorman
04f2cbe356 hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
After patch 2 in this series, a process that successfully calls mmap() for
a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
process calls fork().  At that point, the next write fault from the parent
could fail due to COW if the child still has a reference.

We only reserve pages for the parent but a copy must be made to avoid
leaking data from the parent to the child after fork().  Reserves could be
taken for both parent and child at fork time to guarantee faults but if
the mapping is large it is highly likely we will not have sufficient pages
for the reservation, and it is common to fork only to exec() immediatly
after.  A failure here would be very undesirable.

Note that the current behaviour of mainline with MAP_PRIVATE pages is
pretty bad.  The following situation is allowed to occur today.

1. Process calls mmap(MAP_PRIVATE)
2. Process calls mlock() to fault all pages and makes sure it succeeds
3. Process forks()
4. Process writes to MAP_PRIVATE mapping while child still exists
5. If the COW fails at this point, the process gets SIGKILLed even though it
   had taken care to ensure the pages existed

This patch improves the situation by guaranteeing the reliability of the
process that successfully calls mmap().  When the parent performs COW, it
will try to satisfy the allocation without using reserves.  If that fails
the parent will steal the page leaving any children without a page.
Faults from the child after that point will result in failure.  If the
child COW happens first, an attempt will be made to allocate the page
without reserves and the child will get SIGKILLed on failure.

To summarise the new behaviour:

1. If the original mapper performs COW on a private mapping with multiple
   references, it will attempt to allocate a hugepage from the pool or
   the buddy allocator without using the existing reserves. On fail, VMAs
   mapping the same area are traversed and the page being COW'd is unmapped
   where found. It will then steal the original page as the last mapper in
   the normal way.

2. The VMAs the pages were unmapped from are flagged to note that pages
   with data no longer exist. Future no-page faults on those VMAs will
   terminate the process as otherwise it would appear that data was corrupted.
   A warning is printed to the console that this situation occured.

2. If the child performs COW first, it will attempt to satisfy the COW
   from the pool if there are enough pages or via the buddy allocator if
   overcommit is allowed and the buddy allocator can satisfy the request. If
   it fails, the child will be killed.

If the pool is large enough, existing applications will not notice that
the reserves were a factor.  Existing applications depending on the
no-reserves been set are unlikely to exist as for much of the history of
hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
point or failing the mmap().

[npiggin@suse.de: fix CONFIG_HUGETLB=n build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:16 -07:00
Mel Gorman
a1e78772d7 hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork()
This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
a similar manner to the reservations taken for MAP_SHARED mappings.  The
reserve count is accounted both globally and on a per-VMA basis for
private mappings.  This guarantees that a process that successfully calls
mmap() will successfully fault all pages in the future unless fork() is
called.

The characteristics of private mappings of hugetlbfs files behaviour after
this patch are;

1. The process calling mmap() is guaranteed to succeed all future faults until
   it forks().
2. On fork(), the parent may die due to SIGKILL on writes to the private
   mapping if enough pages are not available for the COW. For reasonably
   reliable behaviour in the face of a small huge page pool, children of
   hugepage-aware processes should not reference the mappings; such as
   might occur when fork()ing to exec().
3. On fork(), the child VMAs inherit no reserves. Reads on pages already
   faulted by the parent will succeed. Successful writes will depend on enough
   huge pages being free in the pool.
4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
   and at fault time otherwise.

Before this patch, all reads or writes in the child potentially needs page
allocations that can later lead to the death of the parent.  This applies
to reads and writes of uninstantiated pages as well as COW.  After the
patch it is only a write to an instantiated page that causes problems.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:16 -07:00
Kentaro Makita
da3bbdd463 fix soft lock up at NFS mount via per-SB LRU-list of unused dentries
[Summary]

 Split LRU-list of unused dentries to one per superblock to avoid soft
 lock up during NFS mounts and remounting of any filesystem.

 Previously I posted here:
 http://lkml.org/lkml/2008/3/5/590

[Descriptions]

- background

  dentry_unused is a list of dentries which are not referenced.
  dentry_unused grows up when references on directories or files are
  released.  This list can be very long if there is huge free memory.

- the problem

  When shrink_dcache_sb() is called, it scans all dentry_unused linearly
  under spin_lock(), and if dentry->d_sb is differnt from given
  superblock, scan next dentry.  This scan costs very much if there are
  many entries, and very ineffective if there are many superblocks.

  IOW, When we need to shrink unused dentries on one dentry, but scans
  unused dentries on all superblocks in the system.  For example, we scan
  500 dentries to unmount a filesystem, but scans 1,000,000 or more unused
  dentries on other superblocks.

  In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink
  unused dentries on NFS, but scans 100,000,000 unused dentries on
  superblocks in the system such as local ext3 filesystems.  I hear NFS
  mounting took 1 min on some system in use.

* : NFS uses virtual filesystem in rpc layer, so NFS is affected by
  this problem.

  100,000,000 is possible number on large systems.

  Per-superblock LRU of unused dentried can reduce the cost in
  reasonable manner.

- How to fix

  I found this problem is solved by David Chinner's "Per-superblock
  unused dentry LRU lists V3"(1), so I rebase it and add some fix to
  reclaim with fairness, which is in Andrew Morton's comments(2).

  1) http://lkml.org/lkml/2006/5/25/318
  2) http://lkml.org/lkml/2006/5/25/320

  Split LRU-list of unused dentries to each superblocks.  Then, NFS
  mounting will check dentries under a superblock instead of all.  But
  this spliting will break LRU of dentry-unused.  So, I've attempted to
  make reclaim unused dentrins with fairness by calculate number of
  dentries to scan on this sb based on following way

  number of dentries to scan on this sb =
  count * (number of dentries on this sb / number of dentries in the machine)

- ToDo
 - I have to measuring performance number and do stress tests.

 - When unmount occurs during prune_dcache(), scanning on same
  superblock, It is unable to reach next superblock because it is gone
  away.  We restart scannig superblock from first one, it causes
  unfairness of reclaim unused dentries on first superblock.  But I think
  this happens very rarely.

- Test Results

  Result on 6GB boxes with excessive unused dentries.

Without patch:

$ cat /proc/sys/fs/dentry-state
10181835        10180203        45      0       0       0
# mount -t nfs 10.124.60.70:/work/kernel-src nfs
real    0m1.830s
user    0m0.001s
sys     0m1.653s

 With this patch:
$ cat /proc/sys/fs/dentry-state
10236610        10234751        45      0       0       0
# mount -t nfs 10.124.60.70:/work/kernel-src nfs
real    0m0.106s
user    0m0.002s
sys     0m0.032s

[akpm@linux-foundation.org: fix comments]
Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Chinner <dgc@sgi.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:15 -07:00
Jan Beulich
42b7772812 mm: remove double indirection on tlb parameter to free_pgd_range() & Co
The double indirection here is not needed anywhere and hence (at least)
confusing.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:15 -07:00
Adrian Bunk
c748e1340e mm/vmstat.c: proper externs
This patch adds proper extern declarations for five variables in
include/linux/vmstat.h

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:14 -07:00
Li Zefan
ec05e868ac ext4: improve ext4_fill_flex_info() a bit
- use kzalloc() instead of kmalloc() + memset()
- improve a printk info

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-24 12:49:59 -04:00
Shirish Pargaonkar
ef571cadd5 [CIFS] Fix warnings from checkpatch
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-24 15:56:05 +00:00
Shirish Pargaonkar
b1910ad622 [CIFS] Fix improper endian conversion of ACL subauth field
In mode_to_acl when converting a Unix mode to a Windows ACL
the subauth fields of the SID in the ACL were translated
incorrectly on bigendian architectures

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-24 14:53:20 +00:00
Shirish Pargaonkar
76c510ad2e [CIFS] Fix possible double free if search immediately after search rewind fails
Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-24 14:48:33 +00:00
Steve French
99b1f5b2f6 [CIFS] remove checkpatch warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-24 02:37:45 +00:00
Alexey Dobriyan
f984c7b982 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-24 02:34:24 +00:00
Harvey Harrison
5ca33c6ac3 cifs: assorted endian annotations
fs/cifs/cifssmb.c:3917:13: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:3917:13:    expected bool [unsigned] [usertype] is_unicode
fs/cifs/cifssmb.c:3917:13:    got restricted __le16

The comment explains why __force is used here.
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-24 01:14:41 +00:00
Jeff Layton
8efdbde647 [CIFS] break ATTR_SIZE changes out into their own function
Move the code that handles ATTR_SIZE changes to its own function. This
makes for a smaller function and reduces the level of indentation.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-23 21:28:12 +00:00
Jeff Layton
09e50d55a9 lockdep: annotate cifs in-kernel sockets
Put CIFS sockets in their own class to avoid some lockdep warnings. CIFS
sockets are not exposed to user-space, and so are not subject to the
same deadlock scenarios.

A similar change was made a couple of years ago for RPC sockets in commit
ed07536ed6.

This patch should prevent lockdep false-positives like this one:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.18-98.el5.jtltest.38.bz456320.1debug #1
-------------------------------------------------------
test5/2483 is trying to acquire lock:
 (sk_lock-AF_INET){--..}, at: [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f

but task is already holding lock:
 (&inode->i_alloc_sem){--..}, at: [<ffffffff8002e454>] notify_change+0xf5/0x2e0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&inode->i_alloc_sem){--..}:
       [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf
       [<ffffffff800a8a72>] lock_acquire+0x55/0x70
       [<ffffffff8002e454>] notify_change+0xf5/0x2e0
       [<ffffffff800a4e36>] down_write+0x3c/0x68
       [<ffffffff8002e454>] notify_change+0xf5/0x2e0
       [<ffffffff800e358d>] do_truncate+0x50/0x6b
       [<ffffffff8005197c>] get_write_access+0x40/0x46
       [<ffffffff80012cf1>] may_open+0x1d3/0x22e
       [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd
       [<ffffffff800289c6>] do_filp_open+0x1c/0x38
       [<ffffffff800683ef>] _spin_unlock+0x17/0x20
       [<ffffffff800167a7>] get_unused_fd+0xf9/0x107
       [<ffffffff8001a704>] do_sys_open+0x44/0xbe
       [<ffffffff80060116>] system_call+0x7e/0x83
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #2 (&sysfs_inode_imutex_key){--..}:
       [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf
       [<ffffffff8010f6df>] create_dir+0x26/0x1d7
       [<ffffffff800a8a72>] lock_acquire+0x55/0x70
       [<ffffffff8010f6df>] create_dir+0x26/0x1d7
       [<ffffffff800671c0>] mutex_lock_nested+0x104/0x29c
       [<ffffffff800a819d>] __lock_acquire+0x9ca/0xadf
       [<ffffffff8010f6df>] create_dir+0x26/0x1d7
       [<ffffffff8010fc67>] sysfs_create_dir+0x58/0x76
       [<ffffffff8015144c>] kobject_add+0xdb/0x198
       [<ffffffff801be765>] class_device_add+0xb2/0x465
       [<ffffffff8005a6ff>] kobject_get+0x12/0x17
       [<ffffffff80225265>] register_netdevice+0x270/0x33e
       [<ffffffff8022538c>] register_netdev+0x59/0x67
       [<ffffffff80464d40>] net_olddevs_init+0xb/0xac
       [<ffffffff80448a79>] init+0x1f9/0x2fc
       [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27
       [<ffffffff80067f86>] trace_hardirqs_on_thunk+0x35/0x37
       [<ffffffff80061079>] child_rip+0xa/0x11
       [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27
       [<ffffffff800606a8>] restore_args+0x0/0x30
       [<ffffffff80179a59>] acpi_ds_init_one_object+0x0/0x80
       [<ffffffff80448880>] init+0x0/0x2fc
       [<ffffffff8006106f>] child_rip+0x0/0x11
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #1 (rtnl_mutex){--..}:
       [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf
       [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7
       [<ffffffff800a8a72>] lock_acquire+0x55/0x70
       [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7
       [<ffffffff800671c0>] mutex_lock_nested+0x104/0x29c
       [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7
       [<ffffffff802451b0>] do_ip_setsockopt+0x6d1/0x9bf
       [<ffffffff800a575e>] lock_release_holdtime+0x27/0x48
       [<ffffffff800a575e>] lock_release_holdtime+0x27/0x48
       [<ffffffff8006a85e>] do_page_fault+0x503/0x835
       [<ffffffff8012cbf6>] socket_has_perm+0x5b/0x68
       [<ffffffff80245556>] ip_setsockopt+0x22/0x78
       [<ffffffff8021c973>] sys_setsockopt+0x91/0xb7
       [<ffffffff800602a6>] tracesys+0xd5/0xdf
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (sk_lock-AF_INET){--..}:
       [<ffffffff800a5037>] print_stack_trace+0x59/0x68
       [<ffffffff800a8092>] __lock_acquire+0x8bf/0xadf
       [<ffffffff800a8a72>] lock_acquire+0x55/0x70
       [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
       [<ffffffff80035466>] lock_sock+0xd4/0xe4
       [<ffffffff80096e91>] _local_bh_enable+0xcb/0xe0
       [<ffffffff800606a8>] restore_args+0x0/0x30
       [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
       [<ffffffff80057540>] sock_sendmsg+0xf3/0x110
       [<ffffffff800a2bb6>] autoremove_wake_function+0x0/0x2e
       [<ffffffff800a10e4>] kernel_text_address+0x1a/0x26
       [<ffffffff8006f4e2>] dump_trace+0x211/0x23a
       [<ffffffff800a6d3d>] find_usage_backwards+0x5f/0x88
       [<ffffffff8840221a>] MD5Final+0xaf/0xc2 [cifs]
       [<ffffffff884032ec>] cifs_calculate_signature+0x55/0x69 [cifs]
       [<ffffffff8021d891>] kernel_sendmsg+0x35/0x47
       [<ffffffff883ff38e>] smb_send+0xa3/0x151 [cifs]
       [<ffffffff883ff5de>] SendReceive+0x1a2/0x448 [cifs]
       [<ffffffff800a812f>] __lock_acquire+0x95c/0xadf
       [<ffffffff883e758a>] CIFSSMBSetEOF+0x20d/0x25b [cifs]
       [<ffffffff883fa430>] cifs_set_file_size+0x110/0x3b7 [cifs]
       [<ffffffff883faa89>] cifs_setattr+0x3b2/0x6f6 [cifs]
       [<ffffffff8002e454>] notify_change+0xf5/0x2e0
       [<ffffffff8002e4a4>] notify_change+0x145/0x2e0
       [<ffffffff800e358d>] do_truncate+0x50/0x6b
       [<ffffffff8005197c>] get_write_access+0x40/0x46
       [<ffffffff80012cf1>] may_open+0x1d3/0x22e
       [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd
       [<ffffffff800289c6>] do_filp_open+0x1c/0x38
       [<ffffffff800683ef>] _spin_unlock+0x17/0x20
       [<ffffffff800167a7>] get_unused_fd+0xf9/0x107
       [<ffffffff8001a704>] do_sys_open+0x44/0xbe
       [<ffffffff800602a6>] tracesys+0xd5/0xdf
       [<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

2 locks held by test5/2483:
 #0:  (&inode->i_mutex){--..}, at: [<ffffffff800e3582>] do_truncate+0x45/0x6b
 #1:  (&inode->i_alloc_sem){--..}, at: [<ffffffff8002e454>] notify_change+0xf5/0x2e0

stack backtrace:

Call Trace:
 [<ffffffff800a6a7b>] print_circular_bug_tail+0x65/0x6e
 [<ffffffff800a5037>] print_stack_trace+0x59/0x68
 [<ffffffff800a8092>] __lock_acquire+0x8bf/0xadf
 [<ffffffff800a8a72>] lock_acquire+0x55/0x70
 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
 [<ffffffff80035466>] lock_sock+0xd4/0xe4
 [<ffffffff80096e91>] _local_bh_enable+0xcb/0xe0
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f
 [<ffffffff80057540>] sock_sendmsg+0xf3/0x110
 [<ffffffff800a2bb6>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a10e4>] kernel_text_address+0x1a/0x26
 [<ffffffff8006f4e2>] dump_trace+0x211/0x23a
 [<ffffffff800a6d3d>] find_usage_backwards+0x5f/0x88
 [<ffffffff8840221a>] :cifs:MD5Final+0xaf/0xc2
 [<ffffffff884032ec>] :cifs:cifs_calculate_signature+0x55/0x69
 [<ffffffff8021d891>] kernel_sendmsg+0x35/0x47
 [<ffffffff883ff38e>] :cifs:smb_send+0xa3/0x151
 [<ffffffff883ff5de>] :cifs:SendReceive+0x1a2/0x448
 [<ffffffff800a812f>] __lock_acquire+0x95c/0xadf
 [<ffffffff883e758a>] :cifs:CIFSSMBSetEOF+0x20d/0x25b
 [<ffffffff883fa430>] :cifs:cifs_set_file_size+0x110/0x3b7
 [<ffffffff883faa89>] :cifs:cifs_setattr+0x3b2/0x6f6
 [<ffffffff8002e454>] notify_change+0xf5/0x2e0
 [<ffffffff8002e4a4>] notify_change+0x145/0x2e0
 [<ffffffff800e358d>] do_truncate+0x50/0x6b
 [<ffffffff8005197c>] get_write_access+0x40/0x46
 [<ffffffff80012cf1>] may_open+0x1d3/0x22e
 [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd
 [<ffffffff800289c6>] do_filp_open+0x1c/0x38
 [<ffffffff800683ef>] _spin_unlock+0x17/0x20
 [<ffffffff800167a7>] get_unused_fd+0xf9/0x107
 [<ffffffff8001a704>] do_sys_open+0x44/0xbe
 [<ffffffff800602a6>] tracesys+0xd5/0xdf

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-23 18:25:38 +00:00
Harvey Harrison
317602f3e0 lockd: trivial sparse endian annotations
fs/lockd/svcproc.c:115:11: warning: incorrect type in initializer (different base types)
fs/lockd/svcproc.c:115:11:    expected int [signed] rc
fs/lockd/svcproc.c:115:11:    got restricted __be32 [usertype] <noident>
... and so on...

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-23 07:38:04 -04:00
Linus Torvalds
c010b2f76c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits)
  ipw2200: Call netif_*_queue() interfaces properly.
  netxen: Needs to include linux/vmalloc.h
  [netdrvr] atl1d: fix !CONFIG_PM build
  r6040: rework init_one error handling
  r6040: bump release number to 0.18
  r6040: handle RX fifo full and no descriptor interrupts
  r6040: change the default waiting time
  r6040: use definitions for magic values in descriptor status
  r6040: completely rework the RX path
  r6040: call napi_disable when puting down the interface and set lp->dev accordingly.
  mv643xx_eth: fix NETPOLL build
  r6040: rework the RX buffers allocation routine
  r6040: fix scheduling while atomic in r6040_tx_timeout
  r6040: fix null pointer access and tx timeouts
  r6040: prefix all functions with r6040
  rndis_host: support WM6 devices as modems
  at91_ether: use netstats in net_device structure
  sfc: Create one RX queue and interrupt per CPU package by default
  sfc: Use a separate workqueue for resets
  sfc: I2C adapter initialisation fixes
  ...
2008-07-22 19:09:51 -07:00
Adrian Bunk
8086cd451f netns: make get_proc_net() static
get_proc_net() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22 14:19:19 -07:00
Linus Torvalds
53baaaa968 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (79 commits)
  arm: bus_id -> dev_name() and dev_set_name() conversions
  sparc64: fix up bus_id changes in sparc core code
  3c59x: handle pci_name() being const
  MTD: handle pci_name() being const
  HP iLO driver
  sysdev: Convert the x86 mce tolerant sysdev attribute to generic attribute
  sysdev: Add utility functions for simple int/ulong variable sysdev attributes
  sysdev: Pass the attribute to the low level sysdev show/store function
  driver core: Suppress sysfs warnings for device_rename().
  kobject: Transmit return value of call_usermodehelper() to caller
  sysfs-rules.txt: reword API stability statement
  debugfs: Implement debugfs_remove_recursive()
  HOWTO: change email addresses of James in HOWTO
  always enable FW_LOADER unless EMBEDDED=y
  uio-howto.tmpl: use unique output names
  uio-howto.tmpl: use standard copyright/legal markings
  sysfs: don't call notify_change
  sysdev: fix debugging statements in registration code.
  kobject: should use kobject_put() in kset-example
  kobject: reorder kobject to save space on 64 bit builds
  ...
2008-07-22 13:13:47 -07:00
Alexey Dobriyan
ee1e6ab605 proc: fix /proc/*/pagemap some more
struct pagemap_walk was placed on stack, some hooks are initialized, the
rest (->pgd_entry, ->pud_entry, ->pte_entry) are valid but junk.

Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-22 09:59:41 -07:00
John Reiser
6519108746 execve filename: document and export via auxiliary vector
The Linux kernel puts the filename argument of execve() into the new
address space.  Many developers are surprised to learn this.  Those who
know and could use it, object "But it's not documented."

Those who want to use it dislike the expression
  (char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
because it requires locating the last original environment variable,
and assumes that the filename follows the characters.

This patch documents the insertion of the filename, and makes it easier
to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>.

In many cases readlink("/proc/self/exe",) gives the same answer.  But if
all the original pages get unmapped, then the kernel erases the symlink
for /proc/self/exe.  This can happen when a program decompressor does a
good job of cleaning up after uncompressing directly to memory, so that
the address space of the target program looks the same as if compression
had never happened.  One example is http://upx.sourceforge.net .

One notable use of the underlying concept (what path containED the
executable) is glibc expanding $ORIGIN in DT_RUNPATH.  In practice for
the near term, it may be a good idea for user-mode code to use both
/proc/self/exe and AT_EXECFN as fall-back methods for each other.
/proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it
won't be present on non-new systems.  The auxvec or {AT_EXECFN}.d_val
also can get overwritten, although in nearly all cases this would be the
result of a bug.

The runtime cost is one NEW_AUX_ENT using two words of stack space.  The
underlying value is maintained already as bprm->exec; setup_arg_pages()
in fs/exec.c slides it for stack_shift, etc.

Signed-off-by: John Reiser <jreiser@BitWagon.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-22 09:59:40 -07:00
Jan Beulich
04e1e0ccca [CIFS] Fix compiler warning on 64-bit
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-07-22 13:04:18 +00:00
Cornelia Huck
36ce6dad6e driver core: Suppress sysfs warnings for device_rename().
driver core: Suppress sysfs warnings for device_rename().

Renaming network devices to an already existing name is not
something we want sysfs to print a scary warning for, since the
callers can deal with this correctly. So let's introduce
sysfs_create_link_nowarn() which gets rid of the common warning.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-21 21:55:01 -07:00
Haavard Skinnemoen
9505e63756 debugfs: Implement debugfs_remove_recursive()
debugfs_remove_recursive() will remove a dentry and all its children.
Drivers can use this to zap their whole debugfs tree so that they don't
need to keep track of every single debugfs dentry they created.

It may fail to remove the whole tree in certain cases:

sh-3.2# rmmod atmel-mci < /sys/kernel/debug/mmc0/ios/clock
mmc0: card b368 removed
atmel_mci atmel_mci.0: Lost dma0chan1, falling back to PIO
sh-3.2# ls /sys/kernel/debug/mmc0/
ios

But I'm not sure if that case can be handled in any sane manner.

Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>
Cc: Pierre Ossman <drzeus-list@drzeus.cx>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-21 21:54:59 -07:00
Miklos Szeredi
93265d13ea sysfs: don't call notify_change
sysfs_chmod_file() calls notify_change() to change the permission bits
on a sysfs file.  Replace with explicit call to sysfs_setattr() and
fsnotify_change().

This is equivalent, except that security_inode_setattr() is not
called.  This function is called by drivers, so the security checks do
not make any sense.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-21 21:54:57 -07:00
Kay Sievers
aab0de2451 driver core: remove KOBJ_NAME_LEN define
Kobjects do not have a limit in name size since a while, so stop
pretending that they do.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-21 21:54:52 -07:00
Greg Kroah-Hartman
6143b59970 device create: coda: convert device_create to device_create_drvdata
device_create() is race-prone, so use the race-free
device_create_drvdata() instead as device_create() is going away.

Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-21 21:54:41 -07:00
Linus Torvalds
14b395e35d Merge branch 'for-2.6.27' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits)
  nfsd: nfs4xdr.c do-while is not a compound statement
  nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c
  lockd: Pass "struct sockaddr *" to new failover-by-IP function
  lockd: get host reference in nlmsvc_create_block() instead of callers
  lockd: minor svclock.c style fixes
  lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock
  lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock
  lockd: nlm_release_host() checks for NULL, caller needn't
  file lock: reorder struct file_lock to save space on 64 bit builds
  nfsd: take file and mnt write in nfs4_upgrade_open
  nfsd: document open share bit tracking
  nfsd: tabulate nfs4 xdr encoding functions
  nfsd: dprint operation names
  svcrdma: Change WR context get/put to use the kmem cache
  svcrdma: Create a kmem cache for the WR contexts
  svcrdma: Add flush_scheduled_work to module exit function
  svcrdma: Limit ORD based on client's advertised IRD
  svcrdma: Remove unused wait q from svcrdma_xprt structure
  svcrdma: Remove unneeded spin locks from __svc_rdma_free
  svcrdma: Add dma map count and WARN_ON
  ...
2008-07-20 21:21:46 -07:00
Linus Torvalds
db6d8c7a40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits)
  iucv: Fix bad merging.
  net_sched: Add size table for qdiscs
  net_sched: Add accessor function for packet length for qdiscs
  net_sched: Add qdisc_enqueue wrapper
  highmem: Export totalhigh_pages.
  ipv6 mcast: Omit redundant address family checks in ip6_mc_source().
  net: Use standard structures for generic socket address structures.
  ipv6 netns: Make several "global" sysctl variables namespace aware.
  netns: Use net_eq() to compare net-namespaces for optimization.
  ipv6: remove unused macros from net/ipv6.h
  ipv6: remove unused parameter from ip6_ra_control
  tcp: fix kernel panic with listening_get_next
  tcp: Remove redundant checks when setting eff_sacks
  tcp: options clean up
  tcp: Fix MD5 signatures for non-linear skbs
  sctp: Update sctp global memory limit allocations.
  sctp: remove unnecessary byteshifting, calculate directly in big-endian
  sctp: Allow only 1 listening socket with SO_REUSEADDR
  sctp: Do not leak memory on multiple listen() calls
  sctp: Support ipv6only AF_INET6 sockets.
  ...
2008-07-20 17:43:29 -07:00
Linus Torvalds
f7df406dce Merge branch 'configfs-fixup-ptr-error' of git://oss.oracle.com/git/jlbec/linux-2.6
* 'configfs-fixup-ptr-error' of git://oss.oracle.com/git/jlbec/linux-2.6:
  configfs: Allow ->make_item() and ->make_group() to return detailed errors.
  Revert "configfs: Allow ->make_item() and ->make_group() to return detailed errors."
2008-07-20 17:17:52 -07:00
Alan Cox
a352def21a tty: Ldisc revamp
Move the line disciplines towards a conventional ->ops arrangement.  For
the moment the actual 'tty_ldisc' struct in the tty is kept as part of
the tty struct but this can then be changed if it turns out that when it
all settles down we want to refcount ldiscs separately to the tty.

Pull the ldisc code out of /proc and put it with our ldisc code.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-20 17:12:34 -07:00
David S. Miller
407d819cf0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6 2008-07-19 00:30:39 -07:00
Harvey Harrison
5108b27651 nfsd: nfs4xdr.c do-while is not a compound statement
The WRITEMEM macro produces sparse warnings of the form:
fs/nfsd/nfs4xdr.c:2668:2: warning: do-while statement is not a compound statement

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-18 15:18:35 -04:00
J. Bruce Fields
ad1060c89c nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c
Thanks to problem report and original patch from Harvey Harrison.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Benny Halevy <bhalevy@panasas.com>
2008-07-18 15:04:58 -04:00
Pavel Emelyanov
b6fcbdb4f2 proc: consolidate per-net single-release callers
They are symmetrical to single_open ones :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:07:44 -07:00
Pavel Emelyanov
de05c557b2 proc: consolidate per-net single_open callers
There are already 7 of them - time to kill some duplicate code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18 04:07:21 -07:00
David S. Miller
49997d7515 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	Documentation/powerpc/booting-without-of.txt
	drivers/atm/Makefile
	drivers/net/fs_enet/fs_enet-main.c
	drivers/pci/pci-acpi.c
	net/8021q/vlan.c
	net/iucv/iucv.c
2008-07-18 02:39:39 -07:00
Joel Becker
a6795e9ebb configfs: Allow ->make_item() and ->make_group() to return detailed errors.
The configfs operations ->make_item() and ->make_group() currently
return a new item/group.  A return of NULL signifies an error.  Because
of this, -ENOMEM is the only return code bubbled up the stack.

Multiple folks have requested the ability to return specific error codes
when these operations fail.  This patch adds that ability by changing the
->make_item/group() ops to return ERR_PTR() values.  These errors are
bubbled up appropriately.  NULL returns are changed to -ENOMEM for
compatibility.

Also updated are the in-kernel users of configfs.

This is a rework of reverted commit 11c3b79218.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-17 15:21:29 -07:00
Joel Becker
f89ab8619e Revert "configfs: Allow ->make_item() and ->make_group() to return detailed errors."
This reverts commit 11c3b79218.  The code
will move to PTR_ERR().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-17 14:53:48 -07:00
Aneesh Kumar K.V
12219aea6b ext4: Cleanup the block reservation code path
The truncate patch should not use the i_allocated_meta_blocks
value. So add seperate functions to be used in the truncate
and alloc path. We also need to release the meta-data block
that we reserved for the blocks that we are truncating.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-17 16:12:08 -04:00
Theodore Ts'o
34071da71a ext4: don't assume extents can't cross block groups when truncating
With the FLEX_BG layout, there is no reason why extents can't cross
block groups, so make the truncate code reserve enough credits so we
don't BUG if we come across such an extent.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-01 21:59:19 -04:00
Theodore Ts'o
bc965ab3f2 ext4: Fix lack of credits BUG() when deleting a badly fragmented inode
The extents codepath for ext4_truncate() requests journal transaction
credits in very small chunks, requesting only what is needed.  This
means there may not be enough credits left on the transaction handle
after ext4_truncate() returns and then when ext4_delete_inode() tries
finish up its work, it may not have enough transaction credits,
causing a BUG() oops in the jbd2 core.

Also, reserve an extra 2 blocks when starting an ext4_delete_inode()
since we need to update the inode bitmap, as well as update the
orphaned inode linked list.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-02 21:10:38 -04:00
Theodore Ts'o
0123c93998 ext4: Fix ext4_ext_journal_restart()
The ext4_ext_journal_restart() is a convenience function which checks
to see if the requested number of credits is present, and if so it
closes the current transaction and attaches the current handle to the
new transaction.  Unfortunately, it wasn't proprely checking the
return value from ext4_journal_extend(), so it was starting a new
transaction when one was not necessary, and returning an error when
all that was necessary was to restart the handle with a new
transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-01 20:57:54 -04:00
Eric Sandeen
d5a0d4f732 ext4: fix ext4_da_write_begin error path
ext4_da_write_begin needs to call journal_stop before returning,
if the page allocation fails.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-02 18:51:06 -04:00
Hidehiro Kawai
e9e34f4e8f jbd2: don't abort if flushing file data failed
In ordered mode, the current jbd2 aborts the journal if a file data buffer
has an error.  But this behavior is unintended, and we found that it has
been adopted accidentally.

This patch undoes it and just calls printk() instead of aborting the
journal.  Unlike a similar patch for ext3/jbd, file data buffers are
written via generic_writepages().  But we also need to set AS_EIO
into their mappings because wait_on_page_writeback_range() clears
AS_EIO before a user process sees it.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-31 22:26:04 -04:00
Hidehiro Kawai
9c83a923c6 ext4: don't read inode block if the buffer has a write error
A transient I/O error can corrupt inode data.  Here is the scenario:

(1) update inode_A at the block_B
(2) pdflush writes out new inode_A to the filesystem, but it results
    in write I/O error, at this point, BH_Uptodate flag of the buffer
    for block_B is cleared and BH_Write_EIO is set
(3) create new inode_C which located at block_B, and
    __ext4_get_inode_loc() tries to read on-disk block_B because the
    buffer is not uptodate
(4) if it can read on-disk block_B successfully, inode_A is
    overwritten by old data

This patch makes __ext4_get_inode_loc() not read the inode block if the
buffer has BH_Write_EIO flag.  In this case, the buffer should have the
latest information, so setting the uptodate flag to the buffer (this
avoids WARN_ON_ONCE() in mark_buffer_dirty().)

According to this change, we would need to test BH_Write_EIO flag for the
error checking.  Currently nobody checks write I/O errors on metadata
buffers, but it will be done in other patches I'm working on.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-26 16:39:26 -04:00
Aneesh Kumar K.V
6be2ded1d7 ext4: Don't allow lg prealloc list to be grow large.
Currently, the locality group prealloc list is freed only when there
is a block allocation failure. This can result in large number of
entries in the preallocation list making ext4_mb_use_preallocated()
expensive.

To fix this, we convert the locality group prealloc list to a hash
list. The hash index is the order of number of blocks in the prealloc
space with a max order of 9. When adding prealloc space to the list we
make sure total entries for each order does not exceed 8. If it is
more than 8 we discard few entries and make sure the we have only <= 5
entries.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-23 14:14:05 -04:00
Aneesh Kumar K.V
1320cbcf77 ext4: Convert the usage of NR_CPUS to nr_cpu_ids.
NR_CPUS can be really large. We should be using nr_cpu_ids instead.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-23 14:09:26 -04:00
Aneesh Kumar K.V
ce89f46cb8 ext4: Improve error handling in mballoc
Don't call BUG_ON on file system failures. Instead use ext4_error and
also handle the continue case properly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-23 14:09:29 -04:00
Eric Sandeen
b5f10eed81 ext4: lock block groups when initializing
I noticed when filling a 1T filesystem with 4 threads using the
fs_mark benchmark:

fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0

that I occasionally got checksum mismatch errors:

EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935

etc.  I'd reliably get 4-5 of them during the run.

It appears that the problem is likely a race to init the bg's
when the uninit_bg feature is enabled.

With the patch below, which adds sb_bgl_locking around initialization,
I was able to complete several runs with no errors or warnings.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-08-02 21:21:08 -04:00
Eric Sandeen
e29d1cde63 ext4: sync up block and inode bitmap reading functions
ext4_read_block_bitmap and read_inode_bitmap do essentially
the same thing, and yet they are structured quite differently.
I came across this difference while looking at doing bg locking
during bg initialization.

This patch:

* removes unnecessary casts in the error messages
* renames read_inode_bitmap to ext4_read_inode_bitmap
* and more substantially, restructures the inode bitmap
  reading function to be more like the block bitmap counterpart.

The change to the inode bitmap reader simplifies the locking
to be applied in the next patch.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-08-02 21:21:02 -04:00
Theodore Ts'o
8a266467b8 ext4: Allow read/only mounts with corrupted block group checksums
If the block group checksums are corrupted, still allow the mount to
succeed, so e2fsck can have a chance to try to fix things up.  Add
code in the remount r/w path to make sure the block group checksums
are valid before allowing the filesystem to be remounted read/write.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-26 14:34:21 -04:00
Aneesh Kumar K.V
d03856bd5e ext4: Fix data corruption when writing to prealloc area
Inserting an extent can cause a new entry in the already existing index
block. That doesn't increase the depth of the instead. Instead it adds a
new leaf block. Now with the new leaf block the path information
corresponding to the logical block should be fetched from the new block.
The old path will be pointing to the old leaf block.

We need to recalucate the path information on extent insert
even if depth doesn't change. Without this change, the extent merge
after converting an unwritten extent to initialized extent takes the wrong
extent and cause data corruption.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-02 18:51:32 -04:00
Linus Torvalds
5b664cb235 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  [PATCH] ocfs2: fix oops in mmap_truncate testing
  configfs: call drop_link() to cleanup after create_link() failure
  configfs: Allow ->make_item() and ->make_group() to return detailed errors.
  configfs: Fix failing mkdir() making racing rmdir() fail
  configfs: Fix deadlock with racing rmdir() and rename()
  configfs: Make configfs_new_dirent() return error code instead of NULL
  configfs: Protect configfs_dirent s_links list mutations
  configfs: Introduce configfs_dirent_lock
  ocfs2: Don't snprintf() without a format.
  ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs
  ocfs2/net: Silence build warnings on sparc64
  ocfs2: Handle error during journal load
  ocfs2: Silence an error message in ocfs2_file_aio_read()
  ocfs2: use simple_read_from_buffer()
  ocfs2: fix printk format warnings with OCFS2_FS_STATS=n
  [PATCH 2/2] ocfs2: Instrument fs cluster locks
  [PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option
2008-07-17 10:55:51 -07:00
Coly Li
c0420ad2ca [PATCH] ocfs2: fix oops in mmap_truncate testing
This patch fixes a mmap_truncate bug which was found by ocfs2 test suite.

In an ocfs2 cluster more than 1 node, run program mmap_truncate, which races
mmap writes and truncates from multiple processes. While the test is
running, a stat from another node forces writeout, causing an oops in
ocfs2_get_block() because it sees a buffer to write which isn't allocated.

This patch fixed the bug by clear dirty and uptodate bits in buffer, leave
the buffer unmapped and return.

Fix is suggested by Mark Fasheh, and I code up the patch.

Signed-off-by: Coly Li <coyli@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-16 16:13:04 -07:00
Linus Torvalds
9c1be0c471 Merge branch 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6
* 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6:
  UBIFS: include to compilation
  UBIFS: add new flash file system
  UBIFS: add brief documentation
  MAINTAINERS: add UBIFS section
  do_mounts: allow UBI root device name
  VFS: export sync_sb_inodes
  VFS: move inode_lock into sync_sb_inodes
2008-07-16 15:02:57 -07:00
Linus Torvalds
8df1b049bc Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (82 commits)
  NFSv4: Remove BKL from the nfsv4 state recovery
  SUNRPC: Remove the BKL from the callback functions
  NFS: Remove BKL from the readdir code
  NFS: Remove BKL from the symlink code
  NFS: Remove BKL from the sillydelete operations
  NFS: Remove the BKL from the rename, rmdir and unlink operations
  NFS: Remove BKL from NFS lookup code
  NFS: Remove the BKL from nfs_link()
  NFS: Remove the BKL from the inode creation operations
  NFS: Remove BKL usage from open()
  NFS: Remove BKL usage from the write path
  NFS: Remove the BKL from the permission checking code
  NFS: Remove attribute update related BKL references
  NFS: Remove BKL requirement from attribute updates
  NFS: Protect inode->i_nlink updates using inode->i_lock
  nfs: set correct fl_len in nlmclnt_test()
  SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon
  SUNRPC: Refactor rpcb_register to make rpcbindv4 support easier
  SUNRPC: None of rpcb_create's callers wants a privileged source port
  SUNRPC: Introduce a specific rpcb_create for contacting localhost
  ...
2008-07-16 14:49:49 -07:00
Randy Dunlap
3c3622dcb6 Fix compile issues in fs/compat_ioctl.c when CONFIG_BLOCK is disabled
Fix fs/compat_ioctl.c to handle CONFIG_BLOCK=n, CONFIG_SCSI=n to avoid
build errors:

In file included from include/scsi/scsi.h:12,
                 from fs/compat_ioctl.c:71:
include/scsi/scsi_cmnd.h:27:25: warning: "BLK_MAX_CDB" is not defined
include/scsi/scsi_cmnd.h:28:3: error: #error MAX_COMMAND_SIZE can not be bigger than BLK_MAX_CDB
In file included from include/scsi/scsi.h:12,
                 from fs/compat_ioctl.c:71:
include/scsi/scsi_cmnd.h: In function 'scsi_bidi_cmnd':
include/scsi/scsi_cmnd.h:182: error: implicit declaration of function 'blk_bidi_rq'
include/scsi/scsi_cmnd.h:183: error: dereferencing pointer to incomplete type
include/scsi/scsi_cmnd.h: In function 'scsi_in':
include/scsi/scsi_cmnd.h:189: error: dereferencing pointer to incomplete type

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-16 11:46:16 -07:00
Trond Myklebust
cadc723cc1 Merge branch 'bkl-removal' into next 2008-07-15 18:34:58 -04:00
Trond Myklebust
e89e896d31 Merge branch 'devel' into next
Conflicts:

	fs/nfs/file.c

Fix up the conflict with Jon Corbet's bkl-removal tree
2008-07-15 18:34:16 -04:00
Trond Myklebust
f839c4c199 NFSv4: Remove BKL from the nfsv4 state recovery
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:57 -04:00
Trond Myklebust
a86dc496b7 SUNRPC: Remove the BKL from the callback functions
Push it into those callback functions that actually need it.

Note that all the NFS operations use their own locking, so don't need the
BKL. Ditto for the rpcbind client.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:57 -04:00
Trond Myklebust
c3cc8c019c NFS: Remove BKL from the readdir code
Page accesses are serialised using the page locks, whereas all attribute
updates are serialised using the inode->i_lock.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:56 -04:00
Trond Myklebust
76566991f9 NFS: Remove BKL from the symlink code
Page cache accesses are serialised using page locks, whereas attribute
updates are serialised using inode->i_lock.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:56 -04:00
Trond Myklebust
52e2e8d37e NFS: Remove BKL from the sillydelete operations
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:55 -04:00
Trond Myklebust
bd9bb454b7 NFS: Remove the BKL from the rename, rmdir and unlink operations
Attribute updates are safe, and dentry operations are protected using VFS
level locks. Defer removing the BKL from sillyrename until a separate
patch.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:55 -04:00
Trond Myklebust
fc0f684c21 NFS: Remove BKL from NFS lookup code
All dentry-related operations are already BKL-safe, since they are
protected by the VFS locking. No extra locks should be needed in the NFS
code.

In the case of nfs_revalidate_inode(), we're only doing an attribute
update (protected by the inode->i_lock).
In the case of nfs_lookup(), we're instantiating a new dentry, so there
should be no contention possible until after we call d_materialise_unique.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:54 -04:00
Trond Myklebust
fc81af535e NFS: Remove the BKL from nfs_link()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:54 -04:00
Trond Myklebust
f1e2eda235 NFS: Remove the BKL from the inode creation operations
nfs_instantiate() does not require the BKL, neither do the attribute
updates or the RPC code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:53 -04:00
Trond Myklebust
bba67e0e3f NFS: Remove BKL usage from open()
All the NFSv4 stateful operations are already protected by other locks (in
particular by the rpc_sequence locks.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:53 -04:00
Trond Myklebust
b6a2e569e2 NFS: Remove BKL usage from the write path
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:52 -04:00
Trond Myklebust
4d80f2ecd5 NFS: Remove the BKL from the permission checking code
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:52 -04:00
Trond Myklebust
fa6dc9dc59 NFS: Remove attribute update related BKL references
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:51 -04:00
Trond Myklebust
a3d01454bc NFS: Remove BKL requirement from attribute updates
The main problem is dealing with inode->i_size: we need to set the
inode->i_lock on all attribute updates, and so vmtruncate won't cut it.
Make an NFS-private version of vmtruncate that has the necessary locking
semantics.

The result should be that the following inode attribute updates are
protected by inode->i_lock
	nfsi->cache_validity
	nfsi->read_cache_jiffies
	nfsi->attrtimeo
	nfsi->attrtimeo_timestamp
	nfsi->change_attr
	nfsi->last_updated
	nfsi->cache_change_attribute
	nfsi->access_cache
	nfsi->access_cache_entry_lru
	nfsi->access_cache_inode_lru
	nfsi->acl_access
	nfsi->acl_default
	nfsi->nfs_page_tree
	nfsi->ncommit
	nfsi->npages
	nfsi->open_files
	nfsi->silly_list
	nfsi->acl
	nfsi->open_states
	inode->i_size
	inode->i_atime
	inode->i_mtime
	inode->i_ctime
	inode->i_nlink
	inode->i_uid
	inode->i_gid

The following is protected by dir->i_mutex
	nfsi->cookieverf

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:51 -04:00
Trond Myklebust
1b83d70703 NFS: Protect inode->i_nlink updates using inode->i_lock
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:10:50 -04:00
Felix Blyakher
d67d1c7bf9 nfs: set correct fl_len in nlmclnt_test()
fcntl(F_GETLK) on an nfs client incorrectly returns
the values for the conflicting lock. fl_len value is
always 1.
If the conflicting lock is (0, 4095) the F_GETLK
request for (1024, 10) returns (0, 1), which doesn't
even cover the requested range, and is quite confusing.
The fix is trivial, set fl_end from the fl_end value
recieved from the nfs server.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-15 18:08:59 -04:00
Chuck Lever
367c8c7bd9 lockd: Pass "struct sockaddr *" to new failover-by-IP function
Pass a more generic socket address type to nlmsvc_unlock_all_by_ip() to
allow for future support of IPv6.  Also provide additional sanity
checking in failover_unlock_ip() when constructing the server's IP
address.

As an added bonus, provide clean kerneldoc comments on related NLM
interfaces which were recently added.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-15 16:11:29 -04:00
Ingo Molnar
1a781a777b Merge branch 'generic-ipi' into generic-ipi-for-linus
Conflicts:

	arch/powerpc/Kconfig
	arch/s390/kernel/time.c
	arch/x86/kernel/apic_32.c
	arch/x86/kernel/cpu/perfctr-watchdog.c
	arch/x86/kernel/i8259_64.c
	arch/x86/kernel/ldt.c
	arch/x86/kernel/nmi_64.c
	arch/x86/kernel/smpboot.c
	arch/x86/xen/smp.c
	include/asm-x86/hw_irq_32.h
	include/asm-x86/hw_irq_64.h
	include/asm-x86/mach-default/irq_vectors.h
	include/asm-x86/mach-voyager/irq_vectors.h
	include/asm-x86/smp.h
	kernel/Makefile

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-15 21:55:59 +02:00
J. Bruce Fields
560de0e659 lockd: get host reference in nlmsvc_create_block() instead of callers
It may not be obvious (till you look at the definition of
nlm_alloc_call()) that a function like nlmsvc_create_block() should
consume a reference on success or failure, so I find it clearer if it
takes the reference it needs itself.

And both callers already do this immediately before the call anyway.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-15 15:40:25 -04:00
J. Bruce Fields
6d7bbbbacc lockd: minor svclock.c style fixes
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-15 15:28:43 -04:00
Jeff Layton
6cde4de807 lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock
nlmsvc_lock calls nlmsvc_lookup_host to find a nlm_host struct. The
callers of this function, however, call nlmsvc_retrieve_args or
nlm4svc_retrieve_args, which also return a nlm_host struct.

Change nlmsvc_lock to take a host arg instead of calling
nlmsvc_lookup_host itself and change the callers to pass a pointer to
the nlm_host they've already found.

Since nlmsvc_testlock() now just uses the caller's reference, we no
longer need to get or release it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-15 14:53:33 -04:00
Jeff Layton
8f920d5e29 lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock
nlmsvc_testlock calls nlmsvc_lookup_host to find a nlm_host struct. The
callers of this functions, however, call nlmsvc_retrieve_args or
nlm4svc_retrieve_args, which also return a nlm_host struct.

Change nlmsvc_testlock to take a host arg instead of calling
nlmsvc_lookup_host itself and change the callers to pass a pointer to
the nlm_host they've already found.

We take a reference to host in the place where nlmsvc_testlock()
previous did a new lookup, so the reference counting is unchanged from
before.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-15 14:26:52 -04:00
Linus Torvalds
e4e0fadcd9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
  jfs: remove DIRENTSIZ
  JFS: diAlloc() should return -EIO rather than EIO
  JFS: skip bad iput() call in error path
  JFS: switch to seq_files
  JFS: 0 is not valid errno value so return NULL from jfs_lookup
2008-07-15 11:03:19 -07:00
Linus Torvalds
38c46578ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
  [GFS2] Fix GFS2's use of do_div() in its quota calculations
  [GFS2] Remove unused declaration
  [GFS2] Remove support for unused and pointless flag
  [GFS2] Replace rgrp "recent list" with mru list
  [GFS2] Allow local DF locks when holding a cached EX glock
  [GFS2] Fix delayed demote race
  [GFS2] don't call permission()
  [GFS2] Fix module building
  [GFS2] Glock documentation
  [GFS2] Remove all_list from lock_dlm
  [GFS2] Remove obsolete conversion deadlock avoidance code
  [GFS2] Remove remote lock dropping code
  [GFS2] kernel panic mounting volume
  [GFS2] Revise readpage locking
  [GFS2] Fix ordering of args for list_add
  [GFS2] trivial sparse lock annotations
  [GFS2] No lock_nolock
  [GFS2] Fix ordering bug in lock_dlm
  [GFS2] Clean up the glock core
2008-07-15 10:38:46 -07:00
Jeff Layton
b0e92aae15 lockd: nlm_release_host() checks for NULL, caller needn't
No need to check for a NULL argument twice.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-15 12:35:20 -04:00
Linus Torvalds
8d2567a620 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
  ext4: Documention update for new ordered mode and delayed allocation
  ext4: do not set extents feature from the kernel
  ext4: Don't allow nonextenst mount option for large filesystem
  ext4: Enable delalloc by default.
  ext4: delayed allocation i_blocks fix for stat
  ext4: fix delalloc i_disksize early update issue
  ext4: Handle page without buffers in ext4_*_writepage()
  ext4: Add ordered mode support for delalloc
  ext4: Invert lock ordering of page_lock and transaction start in delalloc
  mm: Add range_cont mode for writeback
  ext4: delayed allocation ENOSPC handling
  percpu_counter: new function percpu_counter_sum_and_set
  ext4: Add delayed allocation support in data=writeback mode
  vfs: add hooks for ext4's delayed allocation support
  jbd2: Remove data=ordered mode support using jbd buffer heads
  ext4: Use new framework for data=ordered mode in JBD2
  jbd2: Implement data=ordered mode handling via inodes
  vfs: export filemap_fdatawrite_range()
  ext4: Fix lock inversion in ext4_ext_truncate()
  ext4: Invert the locking order of page_lock and transaction start
  ...
2008-07-15 08:36:38 -07:00
Artem Bityutskiy
0d7eff873c UBIFS: include to compilation
Add UBIFS to Makefile and Kbuild.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
2008-07-15 17:35:24 +03:00
Artem Bityutskiy
1e51764a3c UBIFS: add new flash file system
This is a new flash file system. See
http://www.linux-mtd.infradead.org/doc/ubifs.html

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
2008-07-15 17:35:15 +03:00
Linus Torvalds
d1794f2c5b Merge branch 'bkl-removal' of git://git.lwn.net/linux-2.6
* 'bkl-removal' of git://git.lwn.net/linux-2.6: (146 commits)
  IB/umad: BKL is not needed for ib_umad_open()
  IB/uverbs: BKL is not needed for ib_uverbs_open()
  bf561-coreb: BKL unneeded for open()
  Call fasync() functions without the BKL
  snd/PCM: fasync BKL pushdown
  ipmi: fasync BKL pushdown
  ecryptfs: fasync BKL pushdown
  Bluetooth VHCI: fasync BKL pushdown
  tty_io: fasync BKL pushdown
  tun: fasync BKL pushdown
  i2o: fasync BKL pushdown
  mpt: fasync BKL pushdown
  Remove BKL from remote_llseek v2
  Make FAT users happier by not deadlocking
  x86-mce: BKL pushdown
  vmwatchdog: BKL pushdown
  vmcp: BKL pushdown
  via-pmu: BKL pushdown
  uml-random: BKL pushdown
  uml-mmapper: BKL pushdown
  ...
2008-07-14 14:48:31 -07:00
Jonathan Corbet
2fceef397f Merge commit 'v2.6.26' into bkl-removal 2008-07-14 15:29:34 -06:00
Louis Rilling
e752065175 configfs: call drop_link() to cleanup after create_link() failure
When allow_link() succeeds but create_link() fails, the subsystem is not
informed of the failure.

This patch fixes this by calling drop_link() on create_link() failures.

Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:16 -07:00
Joel Becker
11c3b79218 configfs: Allow ->make_item() and ->make_group() to return detailed errors.
The configfs operations ->make_item() and ->make_group() currently
return a new item/group.  A return of NULL signifies an error.  Because
of this, -ENOMEM is the only return code bubbled up the stack.

Multiple folks have requested the ability to return specific error codes
when these operations fail.  This patch adds that ability by changing the
->make_item/group() ops to return an int.

Also updated are the in-kernel users of configfs.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:16 -07:00
Louis Rilling
6d8344baee configfs: Fix failing mkdir() making racing rmdir() fail
When fixing the rename() vs rmdir() deadlock, we stopped locking default groups'
inodes in configfs_detach_prep(), letting racing mkdir() in default groups
proceed concurrently. This enables races like below happen, which leads to a
failing mkdir() making rmdir() fail, despite the group to remove having no
user-created directory under it in the end.

	process A: 			process B:
	/* PWD=A/B */
	mkdir("C")
	  make_item("C")
	  attach_group("C")
					rmdir("A")
					  detach_prep("A")
					    detach_prep("B")
					      error because of "C"
					  return -ENOTEMPTY
	    attach_group("C/D")
	      error (eg -ENOMEM)
	  return -ENOMEM

This patch prevents such scenarii by making rmdir() wait as long as
detach_prep() fails because a racing mkdir() is in the middle of attach_group().
To achieve this, mkdir() sets a flag CONFIGFS_USET_IN_MKDIR in parent's
configfs_dirent before calling attach_group(), and clears the flag once
attach_group() is done. detach_prep() fails with -EAGAIN whenever the flag is
hit and returns the guilty inode's mutex so that rmdir() can wait on it.

Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:16 -07:00
Louis Rilling
b3e76af874 configfs: Fix deadlock with racing rmdir() and rename()
This patch fixes the deadlock between racing sys_rename() and configfs_rmdir().

The idea is to avoid locking i_mutexes of default groups in
configfs_detach_prep(), and rely instead on the new configfs_dirent_lock to
protect against configfs_dirent's linkage mutations. To ensure that an mkdir()
racing with rmdir() will not create new items in a to-be-removed default group,
we make configfs_new_dirent() check for the CONFIGFS_USET_DROPPING flag right
before linking the new dirent, and return error if the flag is set. This makes
racing mkdir()/symlink()/dir_open() fail in places where errors could already
happen, resp. in (attach_item()|attach_group())/create_link()/new_dirent().

configfs_depend() remains safe since it locks all the path from configfs root,
and is thus mutually exclusive with rmdir().

An advantage of this is that now detach_groups() unconditionnaly takes the
default groups i_mutex, which makes it more consistent with populate_groups().

Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:16 -07:00
Louis Rilling
107ed40bd0 configfs: Make configfs_new_dirent() return error code instead of NULL
This patch makes configfs_new_dirent return negative error code instead of NULL,
which will be useful in the next patch to differentiate ENOMEM from ENOENT.

Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:16 -07:00
Louis Rilling
5301a77da2 configfs: Protect configfs_dirent s_links list mutations
Symlinks to a config_item are listed under its configfs_dirent s_links, but the
list mutations are not protected by any common lock.

This patch uses the configfs_dirent_lock spinlock to add the necessary
protection.

Note: we should also protect the list_empty() test in configfs_detach_prep() but
1/ the lock should not be released immediately because nothing would prevent the
list from being filled after a successful list_empty() test, making the problem
tricky,
2/ this will be solved by the rmdir() vs rename() deadlock bugfix.

Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:16 -07:00
Louis Rilling
6f61076406 configfs: Introduce configfs_dirent_lock
This patch introduces configfs_dirent_lock spinlock to protect configfs_dirent
traversals against linkage mutations (add/del/move). This will allow
configfs_detach_prep() to avoid locking i_mutexes.

Locking rules for configfs_dirent linkage mutations are the same plus the
requirement of taking configfs_dirent_lock. For configfs_dirent walking, one can
either take appropriate i_mutex as before, or take configfs_dirent_lock.

The spinlock could actually be a mutex, but the critical sections are either
O(1) or should not be too long (default groups walking in last patch).

ChangeLog:
  - Clarify the comment on configfs_dirent_lock usage
  - Move sd->s_element init before linking the new dirent
  - In lseek(), do not release configfs_dirent_lock before the dirent is
    relinked.

Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:15 -07:00
Joel Becker
fe9f387740 ocfs2: Don't snprintf() without a format.
Some system files are per-slot.  Their names include the slot number.
ocfs2_sprintf_system_inode_name() uses the system inode definitions to
fill in the slot number with snprintf().

For global system files, there is no node number, and the name was
printed as a format with no arguments.  -Wformat-nonliteral and
-Wformat-security don't like this.  Instead, use a static "%s" format
and the name as the argument.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:15 -07:00
Joel Becker
e407e39783 ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs
A couple places use OCFS2_DEBUG_FS where they really mean
CONFIG_OCFS2_DEBUG_FS.

Reported-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-07-14 13:57:15 -07:00
Sunil Mushran
461c6a30ec ocfs2/net: Silence build warnings on sparc64
suseconds_t is type long on most arches except sparc64 where it is type int.
This patch silences the following warnings that are generated when building
on it.

netdebug.c: In function 'nst_seq_show':
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 13 has type 'suseconds_t'
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 15 has type 'suseconds_t'
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 17 has type 'suseconds_t'
netdebug.c: In function 'sc_seq_show':
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 19 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 21 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 23 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 25 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 27 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 29 has type 'suseconds_t'

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-14 13:57:15 -07:00
Wengang Wang
01af482037 ocfs2: Handle error during journal load
This patch ensures the mount fails if the fs is unable to load the journal.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-14 13:57:15 -07:00
Sunil Mushran
56753bd3b9 ocfs2: Silence an error message in ocfs2_file_aio_read()
This patch silences an EINVAL error message in ocfs2_file_aio_read()
that is always due to a user error.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-14 13:57:15 -07:00
Akinobu Mita
7600c72b75 ocfs2: use simple_read_from_buffer()
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-14 13:57:15 -07:00
Randy Dunlap
dd25e55ea1 ocfs2: fix printk format warnings with OCFS2_FS_STATS=n
Fix printk format warnings when OCFS2_FS_STATS=n:

linux-next-20080528/fs/ocfs2/dlmglue.c: In function 'ocfs2_dlm_seq_show':
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 4 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 7 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 8 has type 'int'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-14 13:57:14 -07:00
Sunil Mushran
8ddb7b004d [PATCH 2/2] ocfs2: Instrument fs cluster locks
This patch adds code to track the number of times the fs takes
various cluster locks as well as the times associated with it.
The information is made available to users via debugfs.

This patch was originally written by Jan Kara <jack@suse.cz>.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-14 13:57:14 -07:00
Sunil Mushran
ce7231e92d [PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option
This patch adds config option CONFIG_OCFS2_FS_STATS to allow building
the fs with instrumentation enabled. An upcoming patch will provide
support to instrument cluster locking, which is a crucial overhead in
a cluster file system. This config option allows users to avoid the cpu
and memory overhead that is involved in gathering such statistics.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-14 13:57:14 -07:00
Linus Torvalds
a3da5bf84a Merge branch 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (821 commits)
  x86: make 64bit hpet_set_mapping to use ioremap too, v2
  x86: get x86_phys_bits early
  x86: max_low_pfn_mapped fix #4
  x86: change _node_to_cpumask_ptr to return const ptr
  x86: I/O APIC: remove an IRQ2-mask hack
  x86: fix numaq_tsc_disable calling
  x86, e820: remove end_user_pfn
  x86: max_low_pfn_mapped fix, #3
  x86: max_low_pfn_mapped fix, #2
  x86: max_low_pfn_mapped fix, #1
  x86_64: fix delayed signals
  x86: remove conflicting nx6325 and nx6125 quirks
  x86: Recover timer_ack lost in the merge of the NMI watchdog
  x86: I/O APIC: Never configure IRQ2
  x86: L-APIC: Always fully configure IRQ0
  x86: L-APIC: Set IRQ0 as edge-triggered
  x86: merge dwarf2 headers
  x86: use AS_CFI instead of UNWIND_INFO
  x86: use ignore macro instead of hash comment
  x86: use matching CFI_ENDPROC
  ...
2008-07-14 13:43:24 -07:00
Linus Torvalds
847106ff62 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (25 commits)
  security: remove register_security hook
  security: remove dummy module fix
  security: remove dummy module
  security: remove unused sb_get_mnt_opts hook
  LSM/SELinux: show LSM mount options in /proc/mounts
  SELinux: allow fstype unknown to policy to use xattrs if present
  security: fix return of void-valued expressions
  SELinux: use do_each_thread as a proper do/while block
  SELinux: remove unused and shadowed addrlen variable
  SELinux: more user friendly unknown handling printk
  selinux: change handling of invalid classes (Was: Re: 2.6.26-rc5-mm1 selinux whine)
  SELinux: drop load_mutex in security_load_policy
  SELinux: fix off by 1 reference of class_to_string in context_struct_compute_av
  SELinux: open code sidtab lock
  SELinux: open code load_mutex
  SELinux: open code policy_rwlock
  selinux: fix endianness bug in network node address handling
  selinux: simplify ioctl checking
  SELinux: enable processes with mac_admin to get the raw inode contexts
  Security: split proc ptrace checking into read vs. attach
  ...
2008-07-14 13:36:55 -07:00
Linus Torvalds
dddec01eb8 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits)
  splice: fix generic_file_splice_read() race with page invalidation
  ramfs: enable splice write
  drivers/block/pktcdvd.c: avoid useless memset
  cdrom: revert commit 22a9189 (cdrom: use kmalloced buffers instead of buffers on stack)
  scsi: sr avoids useless buffer allocation
  block: blk_rq_map_kern uses the bounce buffers for stack buffers
  block: add blk_queue_update_dma_pad
  DAC960: push down BKL
  pktcdvd: push BKL down into driver
  paride: push ioctl down into driver
  block: use get_unaligned_* helpers
  block: extend queue_flag bitops
  block: request_module(): use format string
  Add bvec_merge_data to handle stacked devices and ->merge_bvec()
  block: integrity flags can't use bit ops on unsigned short
  cmdfilter: extend default read filter
  sg: fix odd style (extra parenthesis) introduced by cmd filter patch
  block: add bounce support to blk_rq_map_user_iov
  cfq-iosched: get rid of enable_idle being unused warning
  allow userspace to modify scsi command filter on per device basis
  ...
2008-07-14 13:15:14 -07:00
Benny Halevy
18c60c0a3b dlm: fix uninitialized variable for search_rsb_list callers
gcc 4.3.0 correctly emits the following warning.
search_rsb_list does not *r_ret if no dlm_rsb is found
and _search_rsb may pass the uninitialized value upstream
on the error path when both calls to search_rsb_list
return non-zero error.

The fix sets *r_ret to NULL on search_rsb_list's not-found path.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2008-07-14 13:56:59 -05:00
Masatake YAMATO
311f6fc77c dlm: release socket on error
It seems that `sock' allocated by sock_create_kern in
tcp_connect_to_sock() of dlm/fs/lowcomms.c is not released if
dlm_nodeid_to_addr an error.

Acked-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2008-07-14 13:56:59 -05:00
David Teigland
329fc4c372 dlm: fix basts for granted CW waiting PR/CW
The fix in commit 3650925893 was addressing
the case of a granted PR lock with waiting PR and CW locks.  It's a
special case that requires forcing a CW bast.  However, that forced CW
bast was incorrectly applying to a second condition where the granted
lock was CW.  So, the holder of a CW lock could receive an extraneous CW
bast instead of a PR bast.  This fix narrows the original special case to
what was intended.

Signed-off-by: David Teigland <teigland@redhat.com>
2008-07-14 13:56:59 -05:00
Masatake YAMATO
254ae43ab8 dlm: check for null in device_write
If `device_write' method is called via "dlm-control",
file->private_data is NULL. (See ctl_device_open() in
user.c. ) Through proc->flags is read.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2008-07-14 13:56:59 -05:00
Marcel Holtmann
40be492fe4 [Bluetooth] Export details about authentication requirements
With the Simple Pairing support, the authentication requirements are
an explicit setting during the bonding process. Track and enforce the
requirements and allow higher layers like L2CAP and RFCOMM to increase
them if needed.

This patch introduces a new IOCTL that allows to query the current
authentication requirements. It is also possible to detect Simple
Pairing support in the kernel this way.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2008-07-14 20:13:50 +02:00
Artem Bityutskiy
4ee6afd344 VFS: export sync_sb_inodes
This patch exports the 'sync_sb_inodes()' which is needed for
UBIFS because it has to force write-back from time to time.
Namely, the UBIFS budgeting subsystem forces write-back when
its pessimistic callculations show that there is no free
space on the media.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-07-14 19:10:52 +03:00
Hans Reiser
ae8547b0a9 VFS: move inode_lock into sync_sb_inodes
This patch makes 'sync_sb_inodes()' lock 'inode_lock', rather
than expect that the caller will do this.

This change was previously done by Hans Reiser <reiser@namesys.com>
and sat in the -mm tree.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-07-14 19:10:52 +03:00
Ingo Molnar
d59fdcf2ac Merge commit 'v2.6.26' into x86/core 2008-07-14 11:37:46 +02:00
Eric Paris
2069f45784 LSM/SELinux: show LSM mount options in /proc/mounts
This patch causes SELinux mount options to show up in /proc/mounts.  As
with other code in the area seq_put errors are ignored.  Other LSM's
will not have their mount options displayed until they fill in their own
security_sb_show_options() function.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: James Morris <jmorris@namei.org>
2008-07-14 15:02:05 +10:00
Stephen Smalley
006ebb40d3 Security: split proc ptrace checking into read vs. attach
Enable security modules to distinguish reading of process state via
proc from full ptrace access by renaming ptrace_may_attach to
ptrace_may_access and adding a mode argument indicating whether only
read access or full attach access is requested.  This allows security
modules to permit access to reading process state without granting
full ptrace access.  The base DAC/capability checking remains unchanged.

Read access to /proc/pid/mem continues to apply a full ptrace attach
check since check_mem_permission() already requires the current task
to already be ptracing the target.  The other ptrace checks within
proc for elements like environ, maps, and fds are changed to pass the
read mode instead of attach.

In the SELinux case, we model such reading of process state as a
reading of a proc file labeled with the target process' label.  This
enables SELinux policy to permit such reading of process state without
permitting control or manipulation of the target process, as there are
a number of cases where programs probe for such information via proc
but do not need to be able to control the target (e.g. procps,
lsof, PolicyKit, ConsoleKit).  At present we have to choose between
allowing full ptrace in policy (more permissive than required/desired)
or breaking functionality (or in some cases just silencing the denials
via dontaudit rules but this can hide genuine attacks).

This version of the patch incorporates comments from Casey Schaufler
(change/replace existing ptrace_may_attach interface, pass access
mode), and Chris Wright (provide greater consistency in the checking).

Note that like their predecessors __ptrace_may_attach and
ptrace_may_attach, the __ptrace_may_access and ptrace_may_access
interfaces use different return value conventions from each other (0
or -errno vs. 1 or 0).  I retained this difference to avoid any
changes to the caller logic but made the difference clearer by
changing the latter interface to return a bool rather than an int and
by adding a comment about it to ptrace.h for any future callers.

Signed-off-by:  Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-07-14 15:01:47 +10:00
Jeff Layton
536abdb080 cifs: fix wksidarr declaration to be big-endian friendly
The current definition of wksidarr works fine on little endian arches
(since cpu_to_le32 is a no-op there), but on big-endian arches, it fails
to compile with this error:

error: braced-group within expression allowed only inside a function

The problem is that this static declaration has cpu_to_le32 embedded
within it, and that expands into a function macro.  We need to use
__constant_cpu_to_le32() instead.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-12 14:33:42 -07:00
Jeff Layton
e911d0cc87 cifs: fix inode leak in cifs_get_inode_info_unix
Try this:

    mount a share with unix extensions
    create a file on it
    umount the share

You'll get the following message in the ring buffer:

VFS: Busy inodes after unmount of cifs. Self-destruct in 5 seconds.  Have a
nice day...

...the problem is that cifs_get_inode_info_unix is creating and hashing
a new inode even when it's going to return error anyway. The first
lookup when creating a file returns an error so we end up leaking this
inode before we do the actual create. This appears to be a regression
caused by commit 0e4bbde94f.

The following patch seems to fix it for me, and fixes a minor
formatting nit as well.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-12 14:33:42 -07:00
Ingo Molnar
ae94b8075a Merge branch 'linus' into x86/core
Conflicts:

	arch/x86/mm/ioremap.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-12 07:29:02 +02:00
Eric Sandeen
e4079a11f5 ext4: do not set extents feature from the kernel
We've talked for a while about getting rid of any feature-
setting from the kernel; this gets rid of the code which would
set the INCOMPAT_EXTENTS flag on the first file write when mounted
as ext4[dev].

With this patch, if the extents feature is not already set on disk,
then mounting as ext4 will fall back to noextents with a warning,
and if -o extents is explicitly requested, the mount will fail,
also with warning.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
c07651b556 ext4: Don't allow nonextenst mount option for large filesystem
The block mapped inode format can address only blocks within 2**32. This
causes a number of issues, the biggest of which is that the block
allocator needs to be taught that certain inodes can not utilize block
numbers > 2**32.  So until this is fixed, it is simplest to fail
mounting of file systems with more than 2**32 blocks if the -o noextents
option is given.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
dd919b9822 ext4: Enable delalloc by default.
Enable delalloc by default to ensure it gets sufficient testing and
because it makes the filesystem much more efficient.  Add a nodealalloc
option to disable delayed allocation, and update ext4_show_options to
show delayed allocation off if it is disabled.

If the data=journal mount option is used, disable delayed allocation
since the delalloc code doesn't support data=journal yet.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-07-11 19:27:31 -04:00
Mingming Cao
3e3398a08d ext4: delayed allocation i_blocks fix for stat
Right now i_blocks is not getting updated until the blocks are actually
allocaed on disk.  This means with delayed allocation, right after files
are copied, "ls -sF" shoes the file as taking 0 blocks on disk.  "du"
also shows the files taking zero space, which is highly confusing to the
user.

Since delayed allocation already keeps track of per-inode total
number of blocks that are subject to delayed allocation, this patch fix
this by using that to adjust the value returned by stat(2). When real
block allocation is done, the i_blocks will get updated. Since the
reserved blocks for delayed allocation will be decreased, this will be
keep value returned by stat(2) consistent.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao
632eaeab1f ext4: fix delalloc i_disksize early update issue
Ext4_da_write_end() used walk_page_buffers() with a callback function of
ext4_bh_unmapped_or_delay() to check if it extended the file size
without allocating any blocks (since in this case i_disksize needs to be
updated).  However, this is didn't work proprely because the buffer head
has not been marked dirty yet --- this is done later in
block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to
always return false.

In addition, walk_page_buffers() checks all of the buffer heads covering
the page, and the only buffer_head that should be checked is the one
covering the end of the write.  Otherwise, given a 1k blocksize
filesystem and a 4k page size, the buffer head covering the first 1k
stripe of the file could be unmapped (because it was a sparse file), and
the second or third buffer_head covering that page could be mapped, and
using walk_page_buffers() would fail in this case since it would stop at
the first unmapped buffer_head and return true.

The core problem is that walk_page_buffers() was intended to do work in
a callback function, and a non-zero return value indicated a failure,
which termined the walk of the buffer heads covering the page.  It was
not intended to be used with a boolean function, such as
ext4_bh_unmapped_or_delay().

Add addtional fix from Aneesh to protect i_disksize update rave with truncate.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
f0e6c98593 ext4: Handle page without buffers in ext4_*_writepage()
It can happen that buffers are removed from the page before it gets
marked dirty and then is passed to writepage().  In writepage() we just
initialize the buffers and check whether they are mapped and non
delay. If they are mapped and non delay we write the page. Otherwise we
mark them dirty.  With this change we don't do block allocation at all
in ext4_*_write_page.

writepage() can get called under many condition and with a locking order
of journal_start -> lock_page, we should not try to allocate blocks in
writepage() which get called after taking page lock.  writepage() can
get called via shrink_page_list even with a journal handle which was
created for doing inode update.  For example when doing
ext4_da_write_begin we create a journal handle with credit 1 expecting a
i_disksize update for the inode. But ext4_da_write_begin can cause
shrink_page_list via _grab_page_cache. So having a valid handle via
ext4_journal_current_handle is not a guarantee that we can use the
handle for block allocation in writepage, since we shouldn't be using
credits that had been reserved for other updates.  That it could result
in we running out of credits when we update inodes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
cd1aac3292 ext4: Add ordered mode support for delalloc
This provides a new ordered mode implementation which gets rid of using
buffer heads to enforce the ordering between metadata change with the
related data chage.  Instead, in the new ordering mode, it keeps track
of all of the inodes touched by each transaction on a list, and when
that transaction is committed, it flushes all of the dirty pages for
those inodes.  In addition, the new ordered mode reverses the lock
ordering of the page lock and transaction lock, which provides easier
support for delayed allocation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao
61628a3f3a ext4: Invert lock ordering of page_lock and transaction start in delalloc
With the reverse locking, we need to start a transation before taking
the page lock, so in ext4_da_writepages() we need to break the write-out
into chunks, and restart the journal for each chunck to ensure the
write-out fits in a single transaction.

Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
which fixes delalloc sync hang with journal lock inversion, and address
the performance regression issue.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao
d2a1763791 ext4: delayed allocation ENOSPC handling
This patch does block reservation for delayed
allocation, to avoid ENOSPC later at page flush time.

Blocks(data and metadata) are reserved at da_write_begin()
time, the freeblocks counter is updated by then, and the number of
reserved blocks is store in per inode counter.
        
At the writepage time, the unused reserved meta blocks are returned
back. At unlink/truncate time, reserved blocks are properly released.

Updated fix from  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix the oldallocator block reservation accounting with delalloc, added
lock to guard the counters and also fix the reservation for meta blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-14 17:52:37 -04:00
Mingming Cao
e8ced39d5e percpu_counter: new function percpu_counter_sum_and_set
Delayed allocation need to check free blocks at every write time.
percpu_counter_read_positive() is not quit accurate. delayed
allocation need a more accurate accounting, but using
percpu_counter_sum_positive() is frequently is quite expensive.

This patch added a new function to update center counter when sum
per-cpu counter, to increase the accurate rate for next
percpu_counter_read() and require less calling expensive
percpu_counter_sum().

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Alex Tomas
64769240bd ext4: Add delayed allocation support in data=writeback mode
Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and
release the page from page cache if the delalloc write_begin failed, and
properly handle preallocated blocks.  Also added a fix to clear
buffer_delay in block_write_full_page() after allocating a delayed
buffer.

Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to update i_disksize properly and to add bmap support for delayed
allocation.

Updated with a fix from Valerie Clement <valerie.clement@bull.net> to
avoid filesystem corruption when the filesystem is mounted with the
delalloc option and blocksize < pagesize.

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by:  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-07-11 19:27:31 -04:00
Alex Tomas
29a814d2ee vfs: add hooks for ext4's delayed allocation support
Export mpage_bio_submit() and __mpage_writepage() for the benefit of
ext4's delayed allocation support.   Also change __block_write_full_page
so that if buffers that have the BH_Delay flag set it will call
get_block() to get the physical block allocated, just as in the
!BH_Mapped case.

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
87c89c232c jbd2: Remove data=ordered mode support using jbd buffer heads
Signed-off-by: Jan Kara <jack@suse.cz>
2008-07-11 19:27:31 -04:00
Jan Kara
678aaf4814 ext4: Use new framework for data=ordered mode in JBD2
This patch makes ext4 use inode-based implementation of data=ordered mode
in JBD2. It allows us to unify some data=ordered and data=writeback paths
(especially writepage since we don't have to start a transaction anymore)
and remove some buffer walking.

Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix file system hang due to corrupt jinode values.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
c851ed5401 jbd2: Implement data=ordered mode handling via inodes
This patch adds necessary framework into JBD2 to be able to track inodes
with each transaction and write-out their dirty data during transaction
commit time.

This new ordered mode brings all sorts of advantages such as possibility 
to get rid of journal heads and buffer heads for data buffers in ordered 
mode, better ordering of writes on transaction commit, simplification of
 some JBD code, no more anonymous pages when truncate of data being 
committed happens.  Also with this new ordered mode, delayed allocation 
on ordered mode is much simpler.

Signed-off-by: Jan Kara <jack@suse.cz>
2008-07-11 19:27:31 -04:00
Jan Kara
9ddfc3dc75 ext4: Fix lock inversion in ext4_ext_truncate()
We cannot call ext4_orphan_add() from under i_data_sem because that
causes a lock ordering violation between i_data_sem and and the
superblock lock.

Updated with Aneesh's locking order fix

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com> 
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
cf108bca46 ext4: Invert the locking order of page_lock and transaction start
This changes are needed to support data=ordered mode handling via
inodes.  This enables us to get rid of the journal heads and buffer
heads for data buffers in the ordered mode.  With the changes, during
tranasaction commit we writeout the inode pages using the
writepages()/writepage(). That implies we take page lock during
transaction commit. This can cause a deadlock with the locking order
page_lock -> jbd2_journal_start, since the jbd2_journal_start can wait
for the journal_commit to happen and the journal_commit now needs to
take the page lock. To avoid this dead lock reverse the locking order.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
c7d206b337 vfs: Move mark_inode_dirty() from under page lock in generic_write_end()
There's no need to call mark_inode_dirty() under page lock in
generic_write_end(). It unnecessarily makes hold time of page lock longer
and more importantly it forces locking order of page lock and transaction
start for journaling filesystems.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
2e9ee85035 ext4: Use page_mkwrite vma_operations to get mmap write notification.
We would like to get notified when we are doing a write on mmap section.
This is needed with respect to preallocated area. We split the preallocated
area into initialzed extent and uninitialzed extent in the call back. This
let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and
that would result in data loss. The changes are also needed to handle ENOSPC
when writing to an mmap section of files with holes.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Frederic Bohe
5f21b0e642 ext4: fix online resize with mballoc
Update group infos when updating a group's descriptor.
Add group infos when adding a group's descriptor.
Refresh cache pages used by mb_alloc when changes occur.
This will probably need modifications when META_BG resizing will be allowed.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-07-11 19:27:31 -04:00
Eric Sandeen
953e622b60 ext4: use atomic functions to set bh_state
Use the BUFFER_FNS functions (set_buffer_foo) to set buffer
head state atomically instead of nonatomic __set_bit().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
47b4a50beb ext4: Set journal pointer to NULL when journal is released
Set sbi->s_journal to NULL after we call journal_destroy(). This
will be later needed because after journal_destroy() is called,
ext4_clear_inode() can still be called for some inodes (e.g. root
inode) and we'll need to detect there that journal doesn't exists
anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao
0703143107 ext4: mballoc avoid use root reserved blocks for non root allocation
mballoc allocation missed check for blocks reserved for root users. Add
ext4_has_free_blocks() check before allocation. Also modified
ext4_has_free_blocks() to support multiple block allocation request.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Eric Sandeen
d755fb3842 ext4: call blkdev_issue_flush on fsync
To ensure that bits are truly on-disk after an fsync,
we should call blkdev_issue_flush if barriers are supported.

Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
654b4908bc ext4: cleanup block allocator
Move the code related to block allocation to a single function and add helper
funtions to differient allocation for data and meta data blocks

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
7061eba75c ext4: Use inode preallocation with -o noextents
When mballoc is enabled, block allocation for old block-based
files are allocated using mballoc allocator instead of old
block-based allocator. The old ext3 block reservation is turned
off when mballoc is turned on.

However, the in-core preallocation is not enabled for block-based/
non-extent based file block allocation. This result in performance
regression, as now we don't have "reservation" ore in-core preallocation
to prevent interleaved fragmentation in multiple writes workload.

This patch fix this by enable per inode in-core preallocation
for non extent files when mballoc is used.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Akinobu Mita
6afd670713 ext4: fix ext4_init_block_bitmap() for metablock block group
When meta_bg feature is enabled and s_first_meta_bg != 0, 
ext4_init_block_bitmap() miscalculates the number of block used by
the group descriptor table (0 or 1 for metablock block group)

This patch fixes this by using ext4_bg_num_gdb()

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Andreas Dilger <adilger@sun.com>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
7477827f66 ext4: Fix sparse warning
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Li Zefan
d9c769b769 ext4: cleanup never-used magic numbers from htree code
dx_root_limit() will had some dead code which forced it to always return
20, and dx_node_limit to always return 22 for debugging purposes.
Remove it.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
9102e4fa80 ext4: Fix ext4_ext_journal_restart() to reflect errors up to the caller
Fix ext4_ext_journal_restart() so it returns any errors reported by 
ext4_journal_extend() and ext4_journal_restart().

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
1973adcba5 ext4: Make ext4_ext_find_extent fills ext_path completely
When pos=0 or depth, the fields of ext4_ext_path is are not
completely filled.  This patch also removes some unnecessary code.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
787e0981fa ext4: return error when calling ext4_ext_split failed
ext4_ext_create_new_leaf must return error when its
calling to ext4_ext_split failed.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
a379cd1d6b ext4: Update i_disksize properly when allocating from fallocate area.
When allocating unitialized space at the end of file which had been
preallocated with the FALLOC_FL_KEEP_SIZE option, the file size is not
updated at that time.  But the later we are not updating the file size
when writing to that preallocated space.

These changes are for code correctness.  This patch allows us to update
the i_disksize at the write_end() callback of filesystem properly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
363d4251d4 ext4: remove quota allocation when ext4_mb_new_blocks fails
Quota allocation is not removed when ext4_mb_new_blocks calls
kmem_cache_alloc failed.  Also make sure the allocation context is freed
on the error path.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Li Zefan
f9a8ac99fd ext4: remove redundant code in ext4_fill_super()
The previous sb_min_blocksize() has already set the block size.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao
530576bbf3 jbd2: fix race between jbd2_journal_try_to_free_buffers() and jbd2 commit transaction
journal_try_to_free_buffers() could race with jbd commit transaction
when the later is holding the buffer reference while waiting for the
data buffer to flush to disk. If the caller of
journal_try_to_free_buffers() request tries hard to release the buffers,
it will treat the failure as error and return back to the caller. We
have seen the directo IO failed due to this race.  Some of the caller of
releasepage() also expecting the buffer to be dropped when passed with
GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers().

With this patch, if the caller is passing the GFP_KERNEL to indicating
this call could wait, in case of try_to_free_buffers() failed, let's
waiting for journal_commit_transaction() to finish commit the current
committing transaction , then try to free those buffers again with
journal locked.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-13 21:06:39 -04:00
Jose R. Santos
772cb7c83b ext4: New inode allocation for FLEX_BG meta-data groups.
This patch mostly controls the way inode are allocated in order to
make ialloc aware of flex_bg block group grouping.  It achieves this
by bypassing the Orlov allocator when block group meta-data are packed
toghether through mke2fs.  Since the impact on the block allocator is
minimal, this patch should have little or no effect on other block
allocation algorithms. By controlling the inode allocation, it can
basically control where the initial search for new block begins and
thus indirectly manipulate the block allocator.

This allocator favors data and meta-data locality so the disk will
gradually be filled from block group zero upward.  This helps improve
performance by reducing seek time.  Since the group of inode tables
within one flex_bg are treated as one giant inode table, uninitialized
block groups would not need to partially initialize as many inode
table as with Orlov which would help fsck time as the filesystem usage
goes up.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Theodore Ts'o
736603ab29 jbd2: Add commit time into the commit block
Carlo Wood has demonstrated that it's possible to recover deleted
files from the journal.  Something that will make this easier is if we
can put the time of the commit into commit block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Stoyan Gaydarov
4db9c54a53 ext4: replace __FUNCTION__ occurrences
__FUNCTION__ is gcc-specific, use __func__ instead

Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-13 21:03:29 -04:00
Shen Feng
7e5a8cdd84 ext4: fix error processing in mb_free_blocks
The error processing of the return value of mb_free_blocks is meanless
because it only returns 0.  This fix includes

- make mb_free_blocks return void

- remove the error processing part in callers

- unlock group before calling ext4_error in mb_free_blocks

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-13 21:03:31 -04:00
Shen Feng
cfbe7e4f5e ext4: error proc entry creation when the fs/ext4 is not correctly created
When the directory fs/ext4 is not correctly created under proc, the entry
under this directory should not be created.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-13 21:03:31 -04:00
Li Zefan
f795e14073 ext4: fix build failure if DX_DEBUG is enabled
ext4_next_entry() is used by the debugging function dx_show_leaf(), so
it must be defined before that function.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Theodore Ts'o
7ad72ca60b ext4: Remove unused variable from ext4_show_options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Theodore Ts'o
574ca174c9 ext4: Rename read_block_bitmap() to ext4_read_block_bitmap()
Since this a non-static function, make it be ext4 specific to avoid
conflicts with potentially other filesystems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
3537576a70 ext4: remove double definitions of xattr macros
remove the definitions of macros XATTR_TRUSTED_PREFIX and XATTR_USER_PREFIX
since they are defined in linux/xattr.h

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
74767c5a2d ext4: miscellaneous error checks and coding cleanups for mballoc
ext4_mb_seq_history_open(): check if sbi->s_mb_history is NULL

ext4_mb_history_init(): replace kmalloc and memset with kzalloc

ext4_mb_init_backend(): remove memset since kzalloc is used

ext4_mb_init(): the return value of ext4_mb_init_backend is int,
	but i is unsigned, replace it with a new int variable.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
fdf6c7a768 ext4: add error processing when calling ext4_mb_init_cache in mballoc
Add error processing for ext4_mb_load_buddy when it calls
ext4_mb_init_cache.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by:   Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Mingming Cao
31b481dc7c ext4: Fix ext4_mb_init_cache return error
ext4_mb_init_cache() incorrectly always return EIO on success. This
causes the caller of ext4_mb_init_cache() fail when it checks the return
value.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
69baee062a ext4: improve some code in rb tree part of dir.c
* remove unnecessary code in free_rb_tree_fname

* rename free_rb_tree_fname to ext4_htree_create_dir_info
  since it and ext4_htree_free_dir_info are a pair

* replace kmalloc with kzalloc in ext4_htree_free_dir_info

All these make the code more readable and simple.
PS: this patch is also suitable for ext3.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Alexey Dobriyan
91d9982779 ext4: switch to seq_files
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-07-11 19:27:31 -04:00
Julia Lawall
07d45f1267 ext4: Use BUG_ON() instead of BUG()
if (...) BUG(); should be replaced with BUG_ON(...) when the test has no
side-effects to allow a definition of BUG_ON that drops the code completely.

The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@ disable unlikely @ expression E,f; @@

(
  if (<... f(...) ...>) { BUG(); }
|
- if (unlikely(E)) { BUG(); }
+ BUG_ON(E);
)

@@ expression E,f; @@

(
  if (<... f(...) ...>) { BUG(); }
|
- if (E) { BUG(); }
+ BUG_ON(E);
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
ed8f9c751f ext4: start searching for the right extent from the goal group.
With mballoc we search for the best extent using different
criteria. We should always use the goal group when we are
starting with a new criteria.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Shen Feng
8a35694e11 ext4: fix comments to say "ext4"
Change second/third to fourth.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
e7dfb2463e ext4: Fix mb_find_next_bit not to return larger than max
Some architectures implement ext4_find_next_bit and
ext4_find_next_zero_bit in such a way that they return
greater than max for some input values. Make sure
mb_find_next_bit and mb_find_next_zero_bit return the
right values.

On 2.6.25 we have include/asm-x86/bitops_32.h
static inline unsigned find_first_bit(const unsigned long *addr, unsigned size)
{
	unsigned x = 0;

	while (x < size) {
		unsigned long val = *addr++;
		if (val)
			return __ffs(val) + x;
		x += (sizeof(*addr)<<3);
	}
	return x;
}

This can return value greater than size.

Reported and fixed here for lustre

https://bugzilla.lustre.org/show_bug.cgi?id=15932
https://bugzilla.lustre.org/attachment.cgi?id=17205

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Duane Griffin
f3b35f063e ext4: validate directory entry data before use
ext4_dx_find_entry uses ext4_next_entry without verifying that the entry is
valid. If its rec_len == 0 this causes an infinite loop. Refactor the loop
to check the validity of entries before checking whether they match and
moving onto the next one.

There are other uses of ext4_next_entry in this file which also look
problematic. They should be reviewed and fixed if/when we have a test-case
that triggers them.

This patch fixes the first case (image hdb.25.softlockup.gz) reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Duane Griffin
71dc8fbcf5 ext4: handle deleting corrupted indirect blocks
While freeing indirect blocks we attach a journal head to the parent buffer
head, free the blocks, then journal the parent. If the indirect block list
is corrupted and points to the parent the journal head will be detached
when the block is cleared, causing an OOPS.

Check for that explicitly and handle it gracefully.

This patch fixes the third case (image hdb.20000057.nullderef.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Duane Griffin
91ef4caf80 ext4: handle corrupted orphan list at mount
If the orphan node list includes valid, untruncatable nodes with nlink > 0
the ext4_orphan_cleanup loop which attempts to delete them will not do so,
causing it to loop forever. Fix by checking for such nodes in the
ext4_orphan_get function.

This patch fixes the second case (image hdb.20000009.softlockup.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Dave Chinner
49641f1acf Fix reference counting race on log buffers
When we release the iclog, we do an atomic_dec_and_lock to determine if
we are the last reference and need to trigger update of log headers and
writeout.  However, in xlog_state_get_iclog_space() we also need to
check if we have the last reference count there.  If we do, we release
the log buffer, otherwise we decrement the reference count.

But the compare and decrement in xlog_state_get_iclog_space() is not
atomic, so both places can see a reference count of 2 and neither will
release the iclog.  That leads to a filesystem hang.

Close the race by replacing the atomic_read() and atomic_dec() pair with
atomic_add_unless() to ensure that they are executed atomically.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Tim Shimmin <tes@sgi.com>
Tested-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-11 11:37:18 -07:00
Stoyan Gaydarov
0533400b78 [JFFS2] Use .unlocked_ioctl
This changes the .ioctl to the .unlocked_ioctl version.

Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-07-11 18:33:42 +01:00
David Howells
4abaca17e7 [GFS2] Fix GFS2's use of do_div() in its quota calculations
Fix GFS2's need_sync()'s use of do_div() on an s64 by using div_s64() instead.

This does assume that gt_quota_scale_den can be cast to an s32.

This was introduced by patch b3b94faa5f.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-11 14:35:01 +01:00
Hugh Dickins
96a8e13ed4 exec: fix stack excutability without PT_GNU_STACK
Kernel Bugzilla #11063 points out that on some architectures (e.g. x86_32)
exec'ing an ELF without a PT_GNU_STACK program header should default to an
executable stack; but this got broken by the unlimited argv feature because
stack vma is now created before the right personality has been established:
so breaking old binaries using nested function trampolines.

Therefore re-evaluate VM_STACK_FLAGS in setup_arg_pages, where stack
vm_flags used to be set, before the mprotect_fixup.  Checking through
our existing VM_flags, none would have changed since insert_vm_struct:
so this seems safer than finding a way through the personality labyrinth.

Reported-by: pageexec@freemail.hu
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-10 13:25:43 -07:00
Mark Fasheh
e988cf1cfe ocfs2: Fix flags in ocfs2_file_lock
The stack-glue merge changed the way we use flags in dlmglue in that we now
use the fs/dlm equivalents. Unfortunately, a merge error left the new flock
code only partially updated. This took a while to show up though, because
the lock level constants are actually identical between o2dlm and fs/dlm.
The *_CONVERT and *_NOQUEUE flags have different values though, which is
eventually causing a crash in flags_to_o2dlm().

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-10 09:25:39 -07:00
Li Xiaodong
a93a6ce242 [GFS2] Remove unused declaration
The implementation of gfs2_inode_attr_in is removed.
So remove its declaration.

Signed-off-by: Li Xiaodong <lixd@cn.fujitsu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-10 16:22:23 +01:00
Steven Whitehouse
c9f6a6bbc2 [GFS2] Remove support for unused and pointless flag
The ability to mark files for direct i/o access when opened
normally is both unused and pointless, so this patch removes
support for that feature.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-10 16:09:29 +01:00
Steven Whitehouse
9cabcdbd46 [GFS2] Replace rgrp "recent list" with mru list
This patch removes the "recent list" which is used during allocation
and replaces it with the (already existing) mru list used during
deletion. The "recent list" was not a true mru list leading to a number
of inefficiencies including a "next" function which made scanning the
list an order N^2 operation wrt to the number of list elements.

This should increase allocation performance with large numbers of rgrps.
Its also a useful preparation and cleanup before some further changes
which are planned in this area.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-10 15:54:12 +01:00
Chuck Lever
f45663ce5f NFS: Allow either strict or sloppy mount option parsing
The kernel's NFS client mount option parser currently doesn't allow
unrecognized or incorrect mount options.  This prevents misspellings or
incorrectly specified mount options from possibly causing silent data
corruption.

However, NFS mount options are not standardized, so different operating
systems can use differently spelled mount options to support similar
features, or can support mount options which no other operating system
supports.

"Sloppy" mount option parsing, which allows the parser to ignore any
option it doesn't recognize, is needed to support automounters that often
use maps that are shared between heterogenous operating systems.

The legacy mount command ignores the validity of the values of mount
options entirely, except for the "sec=" and "proto=" options.  If an
incorrect value is specified, the out-of-range value is passed to the
kernel; if a value is specified that contains non-numeric characters,
it appears as though the legacy mount command sets that option to zero
(probably incorrect behavior in general).

In any case, this sets a precedent which we will partially follow for
the kernel mount option parser:

	+ if "sloppy" is not set, the parser will be strict about both
	  unrecognized options (same as legacy) and invalid option
	  values (stricter than legacy)

	+ if "sloppy" is set, the parser will ignore unrecognized
	  options and invalid option values (same as legacy)

An "invalid" option value in this case means that either the type
(integer, short, or string) or sign (for integer values) of the specified
value is incorrect.

This patch does two things: it changes the NFS client's mount option
parsing loop so that it parses the whole string instead of failing at
the first unrecognized option or invalid option value.  An unrecognized
option or an invalid option value cause the option to be skipped.

Then, the patch adds a "sloppy" mount option that allows the parsing
to succeed anyway if there were any problems during parsing.  When
parsing a set of options is complete, if there are errors and "sloppy"
was specified, return success anyway.  Otherwise, only return success
if there are no errors.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:44 -04:00
Chuck Lever
6738b2512b NFS4: Set security flavor default for NFSv4 mounts like other defaults
Set the default security flavor when we set the other mount option
default values for NFSv4.  This cleans up the NFSv4 mount option parsing
path to look like the NFSv2/v3 one.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:43 -04:00
Chuck Lever
dd07c94750 NFS: Set security flavor default for NFSv2/3 mounts like other defaults
Set the default security flavor when we set the other mount option default
values.  After this change, only the legacy user-space mount path needs to
set the NFS_MOUNT_SECFLAVOUR flag.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:42 -04:00
Chuck Lever
01060c896e NFS: Refactor logic for parsing NFS security flavor mount options
Clean up: Refactor the NFS mount option parsing function to extract the
security flavor parsing logic into a separate function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:41 -04:00
Chuck Lever
0e0cab744b NFS: use documenting macro constants for initializing ac{reg, dir}{min, max}
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:40 -04:00
Chuck Lever
ed596a8adb NFS: Move the nfs_set_port() call out of nfs_parse_mount_options()
The remount path does not need to set the port in the server address.
Since it's not really a part of option parsing, move the nfs_set_port()
call to nfs_parse_mount_options()'s callers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:39 -04:00
Trond Myklebust
259875efed NFS: set transport defaults after mount option parsing is finished
Move the UDP/TCP default timeo/retrans settings for text mounts to
nfs_init_timeout_values(), which was were they were always being
initialised (and sanity checked) for binary mounts.
Document the default timeout values using appropriate #defines.

Ensure that we initialise and sanity check the transport protocols that
may have been specified by the user.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:38 -04:00
Chuck Lever
40fef8a649 SUNRPC: Use only rpcbind v2 for AF_INET requests
Some server vendors support the higher versions of rpcbind only for
AF_INET6.  The kernel doesn't need to use v3 or v4 for AF_INET anyway,
so change the kernel's rpcbind client to query AF_INET servers over
rpcbind v2 only.

This has a few interesting benefits:

1. If the rpcbind request is going over TCP, and the server doesn't
   support rpcbind versions 3 or 4, the client reduces by two the number
   of ephemeral ports left in TIME_WAIT for each rpcbind request.  This
   will help during NFS mount storms.

2. The rpcbind interaction with servers that don't support rpcbind
   versions 3 or 4 will use less network traffic.  Also helpful
   during mount storms.

3. We can eliminate the kernel build option that controls whether the
   kernel's rpcbind client uses rpcbind version 3 and 4 for AF_INET
   servers.  Less complicated kernel configuration...

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:37 -04:00
Jeff Layton
5afc597c5f nfs4: fix potential race with rapid nfs_callback_up/down cycle
If the nfsv4 callback thread is rapidly brought up and down, it's
possible that nfs_callback_svc might never get a chance to run. If
this happens, the cleanup at thread exit might never occur, throwing
the refcounting off and nfs_callback_info in an incorrect state.

Move the clean functions into nfs_callback_down. Also change the
nfs_callback_info struct to track the svc_rqst rather than svc_serv
since we need to know that to call svc_exit_thread.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:32 -04:00
Jeff Layton
ee84dfc454 nfs4: remove BKL from nfs_callback_up and nfs_callback_down
The nfs_callback_mutex is sufficient protection.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:31 -04:00
Benny Halevy
77e03677ac nfs: initialize timeout variable in nfs4_proc_setclientid_confirm
gcc (4.3.0) rightfully warns about this:
/usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/fs/nfs/nfs4proc.c: In function nfs4_proc_setclientid_confirm:
/usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/fs/nfs/nfs4proc.c:2936: warning: timeout may be used uninitialized in this function

nfs4_delay that's passed a pointer to 'timeout' is looking at its value
and sets it up to some value in the range: NFS4_POLL_RETRY_MIN..NFS4_POLL_RETRY_MAX
	if (*timeout <= 0)
		*timeout = NFS4_POLL_RETRY_MIN;
	if (*timeout > NFS4_POLL_RETRY_MAX)
		*timeout = NFS4_POLL_RETRY_MAX;

Therefore it will end up set to some sane, though rather indeterministic, value.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:30 -04:00
Chuck Lever
d8e7748ab8 NFS: handle interface identifiers in incoming IPv6 addresses
Add support in the kernel NFS client's address parser for interface
identifiers.

IPv6 link-local addresses require an additional "interface identifier",
which is a network device name or an integer that indexes the array of
local network interfaces.  They are suffixed to the address with a '%'.
For example:

	fe80::215:c5ff:fe3b:e1b2%2

indicates an interface index of 2.  Or

	fe80::215:c5ff:fe3b:e1b2%eth0

indicates that requests should be routed through the eth0 device.
Without the interface ID, link-local addresses are not usable for NFS.

Both the kernel NFS client mount option parser and the mount.nfs command
can take either form.  The mount.nfs command always passes the address
through getnameinfo(3), which usually re-writes interface indices as
device names.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:29 -04:00
Chuck Lever
ce3b7e1906 NFS: Add string length argument to nfs_parse_server_address
To make nfs_parse_server_address() more generally useful, allow it to
accept input strings that are not terminated with '\0'.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:28 -04:00
Chuck Lever
d1aa082573 NFS: Support raw IPv6 address hostnames during NFS mount operation
Traditionally the mount command has looked for a ":" to separate the
server's hostname from the export path in the mounted on device name,
like this:

	mount server:/export /mounted/on/dir

The server's hostname is "server" and the export path is "/export".

You can also substitute a specific IPv4 network address for the server
hostname, like this:

	mount 192.168.0.55:/export /mounted/on/dir

Raw IPv6 addresses present a problem, however, because they look
something like this:

	fe80::200:5aff:fe00:30b

Note the use of colons.

To get around the presence of colons, copy the Solaris convention used for
mounting IPv6 servers by address: wrap a raw IPv6 address with square
brackets.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:27 -04:00
Chuck Lever
dc04589827 NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3
To support passing a raw IPv6 address as a server hostname, we need to
expand the logic that handles splitting the passed-in device name into
a server hostname and export path

Start by pulling device name parsing out of the mount option validation
functions and into separate helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:26 -04:00
Trond Myklebust
cd10072562 NFS: Fix a dependency on CONFIG_NFS_V4 in nfs_remount
Fix the 'nfs4_fs_type' undeclared error in nfs_remount when compiling sans
NFSv4...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jeff Layton <jlayton@redhat.com>
2008-07-09 12:09:25 -04:00
Trond Myklebust
e468bae97d NFS: Allow redirtying of a completed unstable write.
Currently, if an unstable write completes, we cannot redirty the page in
order to reflect a new change in the page data until after we've sent a
COMMIT request.

This patch allows a page rewrite to proceed without the unnecessary COMMIT
step, putting it immediately back onto the dirty page list, undoing the
VM unstable write accounting, and removing the NFS_PAGE_TAG_COMMIT tag from
the NFS radix tree.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:24 -04:00
Trond Myklebust
e7d39069e3 NFS: Clean up nfs_update_request()
Simplify the loop in nfs_update_request by moving into a separate function
the code that attempts to update an existing cached NFS write.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:23 -04:00
Chuck Lever
396cee977f NFS: missing newline in NFS mount debugging message
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:22 -04:00
Chuck Lever
d33e4dfeab NFS: Treat "intr" and "nointr" options as deprecated
Clean up:  the "intr" and "nointr" mount options were recently retired.
Document this in the NFS mount option parser.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:21 -04:00
Chuck Lever
ecbb3845dd NFS: Allow any value for the "retry" option
The kernel NFS mount option parser should ignore the retry= mount option
since it is meaningful only in user space.  Today it expects a number
rather than arbitrary text, so it ignores the option if the value is
numeric, but chokes if there are other characters in the value.

Change it to allow any text (except ",") as its value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:20 -04:00
Trond Myklebust
f41f741838 NFS: Ensure we zap only the access and acl caches when setting new acls
...and ensure that we obey the NFS_INO_INVALID_ACL flag when retrieving the
acls.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:19 -04:00
Trond Myklebust
2e96d28672 NFS: Fix a warning in nfs4_async_handle_error
We're not modifying the nfs_server when we call nfs_inc_server_stats and
friends, so allow the compiler to pass 'const' pointers too.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:18 -04:00
Chuck Lever
34e8f92831 NFS: Move fs/nfs/iostat.h to include/linux
The fs/nfs/iostat.h header has definitions that were designed to be exposed
to user space.  Move these definitions under include/linux so user space can
use the definitions in applications that read /proc/self/mountstats.

Also address a handful of coding style issues called out by checkpatch.pl in
fs/nfs/iostat.h.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:17 -04:00
Trond Myklebust
46cb650c22 NFS: Remove the redundant file_open entry from struct nfs_rpc_ops
All instances are set to nfs_open(), so we should just remove the redundant
indirection. Ditto for the file_release op

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:16 -04:00
Trond Myklebust
659bfcd6dd NFS: Fix the ftruncate() credential problem
ftruncate() access checking is supposed to be performed at open() time,
just like reads and writes.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:14 -04:00
Olga Kornievskaia
b6b6152c46 rpc: bring back cl_chatty
The cl_chatty flag alows us to control whether a given rpc client leaves

	"server X not responding, timed out"

messages in the syslog.  Such messages make sense for ordinary nfs
clients (where an unresponsive server means applications on the
mountpoint are probably hanging), but not for the callback client (which
can fail more commonly, with the only result just of disabling some
optimizations).

Previously cl_chatty was removed, do to lack of users; reinstate it, and
use it for the nfsd's callback client.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:10 -04:00
Jeff Layton
48b605f83c NFS: implement option checking when remounting NFS filesystems (resend)
When remounting an NFS or NFS4 filesystem, the new NFS options are not
respected, yet the remount will still return success. This patch adds
a remount_fs sb op for NFS that checks any new nfs mount options against
the existing ones and fails the mount if any have changed.

This is only implemented for string-based mount options since doing
this with binary options isn't really feasible.

This is essentially the same as the original patch I sent out, but
adds a check to see if the addr= option has changed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:09 -04:00
Adrian Bunk
c2d946e55e fs/nfs/nfsroot.c: remove CVS keyword
This patch removes a CVS keyword that wasn't updated for a long time
from a comment.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:08 -04:00
Chuck Lever
48186c7d57 NFS: Fix trace debugging nits in write.c
Clean up: fix a few dprintk messages that still need to show the RPC task ID
correctly, and be sure we use the preferred %lld or %llu instead of %Ld or
%Lu.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:05 -04:00
Chuck Lever
6da24bc9cf NFS: Use NFSDBG_FILE for all fops
Clean up: some fops use NFSDBG_FILE, some use NFSDBG_VFS.  Let's use
NFSDBG_FILE for all fops, and consistently report file names instead
of inode numbers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:04 -04:00
Chuck Lever
b7eaefaa87 NFS: Add debugging facility for NFS aops
Recent work in fs/nfs/file.c neglected to add appropriate trace debugging
for the NFS client's address space operations.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:03 -04:00
Chuck Lever
cc0dd2d105 NFS: Make nfs_open methods consistent
Clean up: Report the same debugging info and count function calls the
same for files and directories in nfs_opendir() and nfs_file_open().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:02 -04:00
Chuck Lever
b84e06c58f NFS: Make nfs_llseek methods consistent
Clean up: Report the same debugging info in nfs_llseek_dir() and
nfs_llseek_file().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:01 -04:00
Chuck Lever
549177863b NFS: Make nfs_fsync methods consistent
Clean up: Report the same debugging info, count function calls the same,
and use similar function naming in nfs_fsync_dir() and nfs_fsync().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:09:00 -04:00
Chuck Lever
6fb1bc1030 NFS: Update help text for CONFIG_NFS_FS
Clean up: refresh the help text for Kconfig items related to the NFS
client.  Remove obsolete URLs, and make the language consistent among
the options.

Also move the ROOT_NFS config option next to the options related to the
NFS client.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:52 -04:00
Trond Myklebust
b5418383ef NFS: do_setlk(): don't flush caches when we have a delegation
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:50 -04:00
Trond Myklebust
7e5f614660 NFS: Revert commit 44dd151d
Revert commit 44dd151d "NFS: Don't mark a written page as uptodate until it
is on disk". While it is true that the write may fail, that is always the
case. There is no reason why we should treat data on pages that are not
already marked as PG_uptodate as being special. The only thing we gain is a
noticeable slowdown when re-reading these pages.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:46 -04:00
Trond Myklebust
efc91ed019 NFS: Optimise append writes with holes
If a file is being extended, and we're creating a hole, we might as well
declare the entire page to be up to date.

This patch significantly improves the write performance for sparse files
in the case where lseek(SEEK_END) is used to append several non-contiguous
writes at intervals of < PAGE_SIZE.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:45 -04:00
Trond Myklebust
2116271a34 NFS: Add correct bounds checking to NFSv2 locks
NFSv2 file locking currently fails the Connectathon tests, because the
calls to the VFS locking code do not return an EINVAL error if the
struct file_lock overflows the 32-bit boundaries.

The problem is due to the fact that we occasionally call helpers from
fs/locks.c in order to avoid RPC calls to the server when we know that a
local process holds the lock. These helpers are, of course, always
64-bit enabled, so EINVAL is not returned in cases when it would if
the call had gone to the NLM code.

For consistency, we therefore add support for a bounds-checking helper.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:40 -04:00
Trond Myklebust
f3d47a3a6a NFS: Fix a preemption count leak in nfs_update_request
The commit 2785259631 (nfs: use GFP_NOFS
preloads for radix-tree insertion) appears to have introduced a bug:
We only want to call radix_tree_preload() once after creating a request.
Calling it every time we loop after we created the request, will cause
preemption count leaks.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Nick Piggin <npiggin@suse.de>
2008-07-09 12:08:39 -04:00
Trond Myklebust
0b4aae7aad NFS: Reduce the stack usage in NFSv3 create operations
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:38 -04:00
Trond Myklebust
57dc9a5747 NFS: Reduce the stack usage in NFSv4 create operations
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-09 12:08:37 -04:00
Ingo Molnar
d028203c04 Merge branch 'x86/core' into x86/unify-pci 2008-07-09 11:39:02 +02:00
Linus Torvalds
b72e9ebe7e Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  [PATCH] ocfs2/dlm: Fixes oops in dlm_new_lockres()
2008-07-08 21:48:26 -07:00
Linus Torvalds
f57e91682d Merge branch 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  SUNRPC: Fix an rpcbind breakage for the case of IPv6 lookups
  SUNRPC: Fix a double-free in rpcbind
  NFS: Fix readdir cache invalidation
2008-07-08 12:40:57 -07:00
Jeff Mahoney
eb35c218d8 reiserfs: discard prealloc in reiserfs_delete_inode
With the removal of struct file from the xattr code,
reiserfs_file_release() isn't used anymore, so the prealloc isn't
discarded.  This causes hangs later down the line.

This patch adds it to reiserfs_delete_inode.  In most cases it will be a
no-op due to it already having been called, but will avoid hangs with
xattrs.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-08 12:39:31 -07:00
Trond Myklebust
2aac05a919 NFS: Fix readdir cache invalidation
invalidate_inode_pages2_range() takes page offset arguments, not byte
ranges.

Another thought is that individual pages might perhaps get evicted by VM
pressure, in which case we might perhaps want to re-read not only the
evicted page, but all subsequent pages too (in case the server returns
more/less data per page so that the alignment of the next entry
changes). We should therefore remove the condition that we only do this on
page->index==0.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-07-08 15:22:40 -04:00
Bernhard Walle
383bc5cecc x86, crashdump, /proc/vmcore: remove CONFIG_EXPERIMENTAL from kdump
I would suggest to remove the "experimental" status from Kdump.
Kdump is now in the kernel since a long time and used by Enterprise
distributions. I don't think that "experimental" is true any more.

Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: vgoyal@redhat.com
Cc: kexec@lists.infradead.org
Cc: Bernhard Walle <bwalle@suse.de>
Cc: akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 13:10:41 +02:00
Ingo Molnar
6924d1ab8b Merge branches 'x86/numa-fixes', 'x86/apic', 'x86/apm', 'x86/bitops', 'x86/build', 'x86/cleanups', 'x86/cpa', 'x86/cpu', 'x86/defconfig', 'x86/gart', 'x86/i8259', 'x86/intel', 'x86/irqstats', 'x86/kconfig', 'x86/ldt', 'x86/mce', 'x86/memtest', 'x86/pat', 'x86/ptemask', 'x86/resumetrace', 'x86/threadinfo', 'x86/timers', 'x86/vdso' and 'x86/xen' into x86/devel 2008-07-08 09:16:56 +02:00
Andi Kleen
ce0c0e50f9 x86, generic: CPA add statistics about state of direct mapping v4
Add information about the mapping state of the direct mapping to
/proc/meminfo. I chose /proc/meminfo because that is where all the other
memory statistics are too and it is a generally useful metric even
outside debugging situations. A lot of split kernel pages means the
kernel will run slower.

This way we can see how many large pages are really used for it and how
many are split.

Useful for general insight into the kernel.

v2: Add hotplug locking to 64bit to plug a very obscure theoretical race.
    32bit doesn't need it because it doesn't support hotadd for lowmem.
    Fix some typos
v3: Rename dpages_cnt
    Add CONFIG ifdef for count update as requested by tglx
    Expand description
v4: Fix stupid bugs added in v3
    Move update_page_count to pageattr.c

Signed-off-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-08 08:11:45 +02:00
Benny Halevy
e518f0560a nfsd: take file and mnt write in nfs4_upgrade_open
testing with newpynfs revealed this warning:
Jul  3 07:32:50 buml kernel: writeable file with no mnt_want_write()
Jul  3 07:32:50 buml kernel: ------------[ cut here ]------------
Jul  3 07:32:50 buml kernel: WARNING: at /usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/include/linux/fs.h:855 drop_file_write_access+0x6b/0x7e()
Jul  3 07:32:50 buml kernel: Modules linked in: nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc
Jul  3 07:32:50 buml kernel: Call Trace:
Jul  3 07:32:50 buml kernel: 6eaadc88:  [<6002f471>] warn_on_slowpath+0x54/0x8e
Jul  3 07:32:50 buml kernel: 6eaadcc8:  [<601b790d>] printk+0xa0/0x793
Jul  3 07:32:50 buml kernel: 6eaadd38:  [<601b6205>] __mutex_lock_slowpath+0x1db/0x1ea
Jul  3 07:32:50 buml kernel: 6eaadd68:  [<7107d4d5>] nfs4_preprocess_seqid_op+0x2a6/0x31c [nfsd]
Jul  3 07:32:50 buml kernel: 6eaadda8:  [<60078dc9>] drop_file_write_access+0x6b/0x7e
Jul  3 07:32:50 buml kernel: 6eaaddc8:  [<710804e4>] nfsd4_open_downgrade+0x114/0x1de [nfsd]
Jul  3 07:32:50 buml kernel: 6eaade08:  [<71076215>] nfsd4_proc_compound+0x1ba/0x2dc [nfsd]
Jul  3 07:32:50 buml kernel: 6eaade48:  [<71068221>] nfsd_dispatch+0xe5/0x1c2 [nfsd]
Jul  3 07:32:50 buml kernel: 6eaade88:  [<71312f81>] svc_process+0x3fd/0x714 [sunrpc]
Jul  3 07:32:50 buml kernel: 6eaadea8:  [<60039a81>] kernel_sigprocmask+0xf3/0x100
Jul  3 07:32:50 buml kernel: 6eaadee8:  [<7106874b>] nfsd+0x182/0x29b [nfsd]
Jul  3 07:32:50 buml kernel: 6eaadf48:  [<60021cc9>] run_kernel_thread+0x41/0x4a
Jul  3 07:32:50 buml kernel: 6eaadf58:  [<710685c9>] nfsd+0x0/0x29b [nfsd]
Jul  3 07:32:50 buml kernel: 6eaadf98:  [<60021cb0>] run_kernel_thread+0x28/0x4a
Jul  3 07:32:50 buml kernel: 6eaadfc8:  [<60013829>] new_thread_handler+0x72/0x9c
Jul  3 07:32:50 buml kernel:
Jul  3 07:32:50 buml kernel: ---[ end trace 2426dd7cb2fba3bf ]---

Bruce Fields suggested this (Thanks!):
maybe we need to be doing a mnt_want_write on open_upgrade and mnt_put_write on downgrade?

This patch adds a call to mnt_want_write and file_take_write (which is
doing the actual work).

The counter-calls mnt_drop_write a file_release_write are now being properly
called by drop_file_write_access in the exact path printed by the warning
above.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-07 15:23:34 -04:00
J. Bruce Fields
4f83aa302f nfsd: document open share bit tracking
It's not immediately obvious from the code why we're doing this.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Benny Halevy <bhalevy@panasas.com>
2008-07-07 15:04:50 -04:00
Sunil Mushran
18c6ac383f [PATCH] ocfs2/dlm: Fixes oops in dlm_new_lockres()
Patch fixes a race that can result in an oops while adding a
lockres to the dlm lockres tracking list.

Bug introduced by mainline commit 29576f8bb5.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-07 11:24:29 -07:00
Steven Whitehouse
209806aba9 [GFS2] Allow local DF locks when holding a cached EX glock
We already allow local SH locks while we hold a cached EX glock, so here
we allow DF locks as well. This works only because we rely on the VFS's
invalidation for locally cached data, and because if we hold an EX lock,
then we know that no other node can be caching data relating to this
file.

It dramatically speeds up initial writes to O_DIRECT files since we fall
back to buffered I/O for this and would otherwise bounce between DF and
EX modes on each and every write call. The lessons to be learned from
that are to ensure that (for the time being anyway) O_DIRECT files are
preallocated and that they are written to using reasonably large I/O
sizes. Even so this change fixes that corner case nicely

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-07 10:07:28 +01:00
Steven Whitehouse
265d529cef [GFS2] Fix delayed demote race
There is a race in the delayed demote code where it does the wrong thing
if a demotion to UN has occurred for other reasons before the delay has
expired. This patch adds an assert to catch that condition as well as
fixing the root cause by adding an additional check for the UN state.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
2008-07-07 10:02:36 +01:00
David S. Miller
ea2aca084b Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	Documentation/feature-removal-schedule.txt
	drivers/net/wan/hdlc_fr.c
	drivers/net/wireless/iwlwifi/iwl-4965.c
	drivers/net/wireless/iwlwifi/iwl3945-base.c
2008-07-05 23:08:07 -07:00
Andrew Morton
5d7e0d2bd9 Fix pagemap_read() use of struct mm_walk
Fix some issues in pagemap_read noted by Alexey:

- initialize pagemap_walk.mm to "mm" , so the code starts working as
  advertised

- initialize ->private to "&pm" so it wouldn't immediately oops in
  pagemap_pte_hole()

- unstatic struct pagemap_walk, so two threads won't fsckup each other
  (including those started by root, including flipping ->mm when you don't
  have permissions)

- pagemap_read() contains two calls to ptrace_may_attach(), second one
  looks unneeded.

- avoid possible kmalloc(0) and integer wraparound.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Personally, I'd just remove the functionality entirely  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-05 13:13:44 -07:00
Andrew Morton
20cbc97261 Fix clear_refs_write() use of struct mm_walk
Don't use a static entry, so as to prevent races during concurrent use
of this function.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-05 13:07:56 -07:00
Benny Halevy
695e12f8d2 nfsd: tabulate nfs4 xdr encoding functions
In preparation for minorversion 1

All encoders now return an nfserr status (typically their
nfserr argument).  Unsupported ops go through nfsd4_encode_operation
too, so use nfsd4_encode_noop to encode nothing for their reply body.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-04 16:21:30 -04:00
Andrew G. Morgan
086f7316f0 security: filesystem capabilities: fix fragile setuid fixup code
This commit includes a bugfix for the fragile setuid fixup code in the
case that filesystem capabilities are supported (in access()).  The effect
of this fix is gated on filesystem capability support because changing
securebits is only supported when filesystem capabilities support is
configured.)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:08 -07:00
Akinobu Mita
6d1029b563 add kernel-doc for simple_read_from_buffer and memory_read_from_buffer
Add kernel-doc comments describing simple_read_from_buffer and
memory_read_from_buffer.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:07 -07:00
Jess Guerrero
337e2ab5d1 ntfs: update help text
The url in the help text for ntfs should be updated.

Acked-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:07 -07:00
Michael Halcrow
c4a2d7fbec ecryptfs: remove unnecessary mux from ecryptfs_init_ecryptfs_miscdev()
The misc_mtx should provide all the protection required to keep the daemon
hash table sane during miscdev registration.  Since this mutex is causing
gratuitous lockdep warnings, this patch removes it.

Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:05 -07:00
Jan Kara
10dd08dc04 reiserfs: add missing unlock to an error path in reiserfs_quota_write()
When write in reiserfs_quota_write() fails, we have to properly release
i_mutex. One error path has been missing the unlock...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:05 -07:00
Jan Kara
4d04e4fbf8 ext4: add missing unlock to an error path in ext4_quota_write()
When write in ext4_quota_write() fails, we have to properly release
i_mutex.  One error path has been missing the unlock...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:05 -07:00
Jan Kara
f5c8f7dae7 ext3: add missing unlock to error path in ext3_quota_write()
When write in ext3_quota_write() fails, we have to properly release
i_mutex.  One error path has been missing the unlock...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:05 -07:00
Miklos Szeredi
32502b8413 splice: fix generic_file_splice_read() race with page invalidation
If a page was invalidated during splicing from file to a pipe, then
generic_file_splice_read() could return a short or zero count.

This manifested itself in rare I/O errors seen on nfs exported fuse
filesystems.  This is because nfsd uses splice_direct_to_actor() to read
files, and fuse uses invalidate_inode_pages2() to invalidate stale data on
open.

Fix by redoing the page find/create if it was found to be truncated
(invalidated).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-04 09:52:14 +02:00
Octavian Purdila
8b3d3567f7 ramfs: enable splice write
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-04 09:52:14 +02:00
J. Bruce Fields
e86322f611 Merge branch 'for-bfields' of git://linux-nfs.org/~tomtucker/xprt-switch-2.6 into for-2.6.27 2008-07-03 16:24:06 -04:00
Eric Van Hensbergen
2e4bef41a0 9p: fix O_APPEND in legacy mode
The legacy protocol's open operation doesn't handle an append operation
(it is expected that the client take care of it).  We were incorrectly
passing the extended protocol's flag through even in legacy mode.  This
was reported in bugzilla report #10689.  This patch fixes the problem
by disallowing extended protocol open modes from being passed in legacy
mode and implemented append functionality on the client side by adding
a seek after the open.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-07-03 09:59:03 -05:00
Alasdair G Kergon
cc371e66e3 Add bvec_merge_data to handle stacked devices and ->merge_bvec()
When devices are stacked, one device's merge_bvec_fn may need to perform
the mapping and then call one or more functions for its underlying devices.

The following bio fields are used:
  bio->bi_sector
  bio->bi_bdev
  bio->bi_size
  bio->bi_rw  using bio_data_dir()

This patch creates a new struct bvec_merge_data holding a copy of those
fields to avoid having to change them directly in the struct bio when
going down the stack only to have to change them back again on the way
back up.  (And then when the bio gets mapped for real, the whole
exercise gets repeated, but that's a problem for another day...)

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-03 13:21:15 +02:00
Jens Axboe
b984679efe block: integrity checkpatch cleanups
> 80 char lines and that sort of thing.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-03 13:21:13 +02:00
Martin K. Petersen
7ba1ba12ee block: Block layer data integrity support
Some block devices support verifying the integrity of requests by way
of checksums or other protection information that is submitted along
with the I/O.

This patch implements support for generating and verifying integrity
metadata, as well as correctly merging, splitting and cloning bios and
requests that have this extra information attached.

See Documentation/block/data-integrity.txt for more information.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-03 13:21:13 +02:00
Martin K. Petersen
51d654e1d8 block: Globalize bio_set and bio_vec_slab
Move struct bio_set and biovec_slab definitions to bio.h so they can
be used outside of bio.c.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-03 13:21:13 +02:00
Miklos Szeredi
f58ba88910 [GFS2] don't call permission()
GFS2 calls permission() to verify permissions after locks on the files
have been taken.

For this it's sufficient to call gfs2_permission() instead.  This
results in the following changes:

  - IS_RDONLY() check is not performed
  - IS_IMMUTABLE() check is not performed
  - devcgroup_inode_permission() is not called
  - security_inode_permission() is not called

IS_RDONLY() should be unnecessary anyway, as the per-mount read-only
flag should provide protection against read-only remounts during
operations.  do_gfs2_set_flags() has been fixed to perform
mnt_want_write()/mnt_drop_write() to protect against remounting
read-only.

IS_IMMUTABLE has been added to gfs2_permission()

Repeating the security checks seems to be pointless, as they don't
normally change, and if they do, it's independent of the filesystem
state.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-07-03 10:22:01 +01:00
Benny Halevy
b001a1b6aa nfsd: dprint operation names
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-02 19:03:19 -04:00
Jonathan Corbet
a238b790d5 Call fasync() functions without the BKL
lock_kernel() calls have been pushed down into code which needs it, so
there is no need to take the BKL at this level anymore.

This work inspired and aided by Andi Kleen's unlocked_fasync() patches.

Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-07-02 15:06:28 -06:00
Jonathan Corbet
dda6445e21 ecryptfs: fasync BKL pushdown
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-07-02 15:06:28 -06:00
Andi Kleen
9465efc9e9 Remove BKL from remote_llseek v2
- Replace remote_llseek with generic_file_llseek_unlocked (to force compilation
failures in all users)
- Change all users to either use generic_file_llseek_unlocked directly or
take the BKL around. I changed the file systems who don't use the BKL
for anything (CIFS, GFS) to call it directly. NCPFS and SMBFS and NFS
take the BKL, but explicitely in their own source now.

I moved them all over in a single patch to avoid unbisectable sections.

Open problem: 32bit kernels can corrupt fpos because its modification
is not atomic, but they can do that anyways because there's other paths who
modify it without BKL.

Do we need a special lock for the pos/f_version = 0 checks?

Trond says the NFS BKL is likely not needed, but keep it for now
until his full audit.

v2: Use generic_file_llseek_unlocked instead of remote_llseek_unlocked
    and factor duplicated code (suggested by hch)

Cc: Trond.Myklebust@netapp.com
Cc: swhiteho@redhat.com
Cc: sfrench@samba.org
Cc: vandrove@vc.cvut.cz

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-07-02 15:06:27 -06:00
Jonathan Corbet
9c20616c38 Make FAT users happier by not deadlocking
The FAT BKL removal patch can cause deadlocks.  It turns out that the new
lock_super() calls are unneeded, remove them (as directed by Linus).

Reported-by: "Tony Luck" <tony.luck@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-07-02 15:06:27 -06:00
Arnd Bergmann
b7fdf9fdd6 ocfs2-stack_user: BKL pushdown
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2008-07-02 15:06:23 -06:00
Benny Halevy
f2feb96bc3 nfsd: nfs4 minorversion decoder vectors
Have separate vectors of operation decoders for each minorversion.
Obsolete ops in newer minorversions have default implementation returning
nfserr_opnotsupp.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-02 15:58:21 -04:00
Benny Halevy
3c375c6f3a nfsd: unsupported nfs4 ops should fail with nfserr_opnotsupp
nfserr_opnotsupp should be returned for unsupported nfs4 ops
rather than nfserr_op_illegal.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-02 15:58:21 -04:00
Benny Halevy
347e0ad9c9 nfsd: tabulate nfs4 xdr decoding functions
In preparation for minorversion 1

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-02 15:58:20 -04:00
Benny Halevy
30cff1ffff nfsd: return nfserr_minor_vers_mismatch when compound minorversion != 0
Check minorversion once before decoding any operation and reject with
nfserr_minor_vers_mismatch if != 0 (this still happens in nfsd4_proc_compound).
In this case return a zero length resultdata array as required by RFC3530.

minorversion 1 processing will have its own vector of decoders.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-02 15:58:20 -04:00
Miklos Szeredi
07cad1d2a4 nfsd: clean up mnt_want_write calls
Multiple mnt_want_write() calls in the switch statement looks really
ugly.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-07-01 15:22:03 -04:00
Jens Axboe
18ce3751cc Properly notify block layer of sync writes
fsync_buffers_list() and sync_dirty_buffer() both issue async writes and
then immediately wait on them. Conceptually, that makes them sync writes
and we should treat them as such so that the IO schedulers can handle
them appropriately.

This patch fixes a write starvation issue that Lin Ming reported, where
xx is stuck for more than 2 minutes because of a large number of
synchronous IO in the system:

INFO: task kjournald:20558 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
kjournald     D ffff810010820978  6712 20558      2
ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
Call Trace:
[<ffffffff803ba6f2>] kobject_get+0x12/0x17
[<ffffffff80247537>] getnstimeofday+0x2f/0x83
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d195>] io_schedule+0x5d/0x9f
[<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
[<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff80243909>] wake_bit_function+0x0/0x23
[<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
[<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
[<ffffffff8023a676>] lock_timer_base+0x26/0x4b
[<ffffffff8030300a>] kjournald+0xc1/0x1fb
[<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
[<ffffffff80302f49>] kjournald+0x0/0x1fb
[<ffffffff802437bb>] kthread+0x47/0x74
[<ffffffff8022de51>] schedule_tail+0x28/0x5d
[<ffffffff8020cac8>] child_rip+0xa/0x12
[<ffffffff80243774>] kthread+0x0/0x74
[<ffffffff8020cabe>] child_rip+0x0/0x12

Lin Ming confirms that this patch fixes the issue. I've run tests with
it for the past week and no ill effects have been observed, so I'm
proposing it for inclusion into 2.6.26.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-07-01 09:07:34 +02:00
Jeff Layton
100766f834 nfsd: treat all shutdown signals as equivalent
knfsd currently uses 2 signal masks when processing requests. A "loose"
mask (SHUTDOWN_SIGS) that it uses when receiving network requests, and
then a more "strict" mask (ALLOWED_SIGS, which is just SIGKILL) that it
allows when doing the actual operation on the local storage.

This is apparently unnecessarily complicated. The underlying filesystem
should be able to sanely handle a signal in the middle of an operation.
This patch removes the signal mask handling from knfsd altogether. When
knfsd is started as a kthread, all signals are ignored. It then allows
all of the signals in SHUTDOWN_SIGS. There's no need to set the mask
as well.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-30 15:27:47 -04:00
Neil Brown
496d6c32d4 nfsd: fix spurious EACCESS in reconnect_path()
Thanks to Frank Van Maarseveen for the original problem report: "A
privileged process on an NFS client which drops privileges after using
them to change the current working directory, will experience incorrect
EACCES after an NFS server reboot. This problem can also occur after
memory pressure on the server, particularly when the client side is
quiet for some time."

This occurs because the filehandle points to a directory whose parents
are no longer in the dentry cache, and we're attempting to reconnect the
directory to its parents without adequate permissions to perform lookups
in the parent directories.

We can therefore fix the problem by acquiring the necessary capabilities
before attempting the reconnection.  We do this only in the
no_subtree_check case, since the documented behavior of the
subtree_check export option requires the server to check that the user
has lookup permissions on all parents.

The subtree_check case still has a problem, since reconnect_path()
unnecessarily requires both read and lookup permissions on all parent
directories.  However, a fix in that case would be more delicate, and
use of subtree_check is already discouraged for other reasons.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Frank van Maarseveen <frankvm@frankvm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-30 15:24:11 -04:00
Linus Torvalds
747606464b Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  udf: Fix regression in UDF anchor block detection
2008-06-29 12:19:02 -07:00
Linus Torvalds
4f46accee4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  [patch 2/3] vfs: dcache cleanups
  [patch 1/3] vfs: dcache sparse fixes
  [patch 3/3] vfs: make d_path() consistent across mount operations
  [patch 4/4] flock: remove unused fields from file_lock_operations
  [patch 3/4] vfs: fix ERR_PTR abuse in generic_readlink
  [patch 2/4] fs: make struct file arg to d_path const
  [patch 1/4] vfs: path_{get,put}() cleanups
  [patch for 2.6.26 4/4] vfs: utimensat(): fix write access check for futimens()
  [patch for 2.6.26 3/4] vfs: utimensat(): fix error checking for {UTIME_NOW,UTIME_OMIT} case
  [patch for 2.6.26 1/4] vfs: utimensat(): ignore tv_sec if tv_nsec == UTIME_OMIT or UTIME_NOW
  [patch for 2.6.26 2/4] vfs: utimensat(): be consistent with utime() for immutable and append-only files
  [PATCH] fix cgroup-inflicted breakage in block_dev.c
2008-06-29 12:14:37 -07:00
Steven Whitehouse
f17172e001 [GFS2] Fix module building
Two lines missed from the previous patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:40:57 +01:00
Steven Whitehouse
31fcba00fe [GFS2] Remove all_list from lock_dlm
I discovered that we had a list onto which every lock_dlm
lock was being put. Its only function was to discover whether
we'd got any locks left after umount. Since there was already
a counter for that purpose as well, I removed the list. The
saving is sizeof(struct list_head) per glock - well worth
having.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:50 +01:00
Steven Whitehouse
b2cad26cfc [GFS2] Remove obsolete conversion deadlock avoidance code
This is only used by GFS1 so can be removed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:47 +01:00
Steven Whitehouse
1bdad60633 [GFS2] Remove remote lock dropping code
There are several reasons why this is undesirable:

 1. It never happens during normal operation anyway
 2. If it does happen it causes performance to be very, very poor
 3. It isn't likely to solve the original problem (memory shortage
    on remote DLM node) it was supposed to solve
 4. It uses a bunch of arbitrary constants which are unlikely to be
    correct for any particular situation and for which the tuning seems
    to be a black art.
 5. In an N node cluster, only 1/N of the dropped locked will actually
    contribute to solving the problem on average.

So all in all we are better off without it. This also makes merging
the lock_dlm module into GFS2 a bit easier.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:44 +01:00
Bob Peterson
9171f5a991 [GFS2] kernel panic mounting volume
This patch fixes Red Hat bugzilla bug 450156.

This started with a not-too-improbable mount failure because the
locking protocol was never set back to its proper "lock_dlm" after the
system was rebooted in the middle of a gfs2_fsck.  That left a
(purposely) invalid locking protocol in the superblock, which caused an
error when the file system was mounted the next time.

When there's an error mounting, vfs calls DQUOT_OFF, which calls
vfs_quota_off which calls gfs2_sync_fs.  Next, gfs2_sync_fs calls
gfs2_log_flush passing s_fs_info.  But due to the error, s_fs_info
had been previously set to NULL, and so we have the kernel oops.

My solution in this patch is to test for the NULL value before passing
it.  I tested this patch and it fixes the problem.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:41 +01:00
Steven Whitehouse
01b7c7ae88 [GFS2] Revise readpage locking
The previous attempt to fix the locking in readpage failed due
to the use of a "try lock" which resulted in occasional high
cpu usage during testing (due to repeated tries) and also it
did not resolve all the ordering problems wrt the transaction
lock (although it did solve all the inode lock ordering problems).

This patch avoids the problem by unlocking the page and getting the
locks in the correct order. This means that we have to retest the
page to ensure that it hasn't changed when we relock the page.

This now passes the tests which were previously failing.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:37 +01:00
Steven Whitehouse
8027473722 [GFS2] Fix ordering of args for list_add
The patch to remove lock_nolock managed to get the arguments
of this list_add backwards. This fixes it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:34 +01:00
Harvey Harrison
2d81afb879 [GFS2] trivial sparse lock annotations
Annotate the &sdp->sd_log_lock.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-27 09:39:31 +01:00
Steven Whitehouse
048bca2237 [GFS2] No lock_nolock
This patch merges the lock_nolock module into GFS2 itself. As well as removing
some of the overhead of the module, it also means that its now impossible to
build GFS2 without a lock module (which would be a pointless thing to do
anyway).

We also plan to merge lock_dlm into GFS2 in the future, but that is a more
tricky task, and will therefore be a separate patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Teigland <teigland@redhat.com>
2008-06-27 09:39:28 +01:00
Steven Whitehouse
f3c9d38a26 [GFS2] Fix ordering bug in lock_dlm
This looks like a lot of change, but in fact its not. Mostly its
things moving from one file to another. The change is just that
instead of queuing lock completions and callbacks from the DLM
we now pass them directly to GFS2.

This gives us a net loss of two list heads per glock (a fair
saving in memory) plus a reduction in the latency of delivering
the messages to GFS2, plus we now have one thread fewer as well.
There was a bug where callbacks and completions could be delivered
in the wrong order due to this unnecessary queuing which is fixed
by this patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
2008-06-27 09:39:25 +01:00
Steven Whitehouse
6802e3400f [GFS2] Clean up the glock core
This patch implements a number of cleanups to the core of the
GFS2 glock code. As a result a lot of code is removed. It looks
like a really big change, but actually a large part of this patch
is either removing or moving existing code.

There are some new bits too though, such as the new run_queue()
function which is considerably streamlined. Highlights of this
patch include:

 o Fixes a cluster coherency bug during SH -> EX lock conversions
 o Removes the "glmutex" code in favour of a single bit lock
 o Removes the ->go_xmote_bh() for inodes since it was duplicating
   ->go_lock()
 o We now only use the ->lm_lock() function for both locks and
   unlocks (i.e. unlock is a lock with target mode LM_ST_UNLOCKED)
 o The fast path is considerably shortly, giving performance gains
   especially with lock_nolock
 o The glock_workqueue is now used for all the callbacks from the DLM
   which allows us to simplify the lock_dlm module (see following patch)
 o The way is now open to make further changes such as eliminating the two
   threads (gfs2_glockd and gfs2_scand) in favour of a more efficient
   scheme.

This patch has undergone extensive testing with various test suites
so it should be pretty stable by now.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
2008-06-27 09:39:22 +01:00
Jens Axboe
15c8b6c1aa on_each_cpu(): kill unused 'retry' parameter
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-06-26 11:24:38 +02:00
Benjamin Marzinski
5af4e7a0be [GFS2] fix gfs2 block allocation (cleaned up)
This patch fixes bz 450641.

This patch changes the computation for zero_metapath_length(), which it
renames to metapath_branch_start(). When you are extending the metadata
tree, The indirect blocks that point to the new data block must either
diverge from the existing tree either at the inode, or at the first
indirect block. They can diverge at the first indirect block because the
inode has room for 483 pointers while the indirect blocks have room for
509 pointers, so when the tree is grown, there is some free space in the
first indirect block. What metapath_branch_start() now computes is the
height where the first indirect block for the new data block is located.
It can either be 1 (if the indirect block diverges from the inode) or 2
(if it diverges from the first indirect block).

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-24 19:02:28 +01:00
Bob Peterson
17c15da00c [GFS2] BUG: unable to handle kernel paging request at ffff81002690e000
This patch fixes bugzilla bug bz448866: gfs2: BUG: unable to
handle kernel paging request at ffff81002690e000.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-06-24 14:17:45 +01:00
Jan Kara
19fd426a18 Merge branch 'master' into for_mm 2008-06-24 11:43:00 +02:00
Tomas Janousek
e8183c2452 udf: Fix regression in UDF anchor block detection
In some cases it could happen that some block passed test in
udf_check_anchor_block() even though udf_read_tagged() refused to read it later
(e.g. because checksum was not correct).  This patch makes
udf_check_anchor_block() use udf_read_tagged() so that the checking is
stricter.

This fixes the regression (certain disks unmountable) caused by commit
423cf6dc04.

Signed-off-by: Tomas Janousek <tomi@nomi.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
2008-06-24 11:38:03 +02:00
Trond Myklebust
03fa9e84e5 NFS: nfs_updatepage(): don't mark page as dirty if an error occurred
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-06-23 17:09:07 -04:00
Trond Myklebust
b7e2445737 NFS: Fix filehandle size comparisons in the mount code
Fix a sign issue in xdr_decode_fhstatus3()
Fix incorrect comparison in nfs_validate_mount_data()

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-06-23 17:09:06 -04:00
Trond Myklebust
33852a1f2b NFS: Reduce the NFS mount code stack usage.
This appears to fix the Oops reported in
  http://bugzilla.kernel.org/show_bug.cgi?id=10826

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-06-23 17:09:05 -04:00
Miklos Szeredi
cdd16d0265 [patch 2/3] vfs: dcache cleanups
Comment from Al Viro: add prepend_name() wrapper.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 13:07:00 -04:00
Miklos Szeredi
31f3e0b3a1 [patch 1/3] vfs: dcache sparse fixes
Fix the following sparse warnings:

fs/dcache.c:2183:19: warning: symbol 'filp_cachep' was not declared. Should it be static?
fs/dcache.c:115:3: warning: context imbalance in 'dentry_iput' - unexpected unlock
fs/dcache.c:188:2: warning: context imbalance in 'dput' - different lock contexts for basic block
fs/dcache.c:400:2: warning: context imbalance in 'prune_one_dentry' - different lock contexts for basic block
fs/dcache.c:431:22: warning: context imbalance in 'prune_dcache' - different lock contexts for basic block
fs/dcache.c:563:2: warning: context imbalance in 'shrink_dcache_sb' - different lock contexts for basic block
fs/dcache.c:1385:6: warning: context imbalance in 'd_delete' - wrong count at exit
fs/dcache.c:1636:2: warning: context imbalance in '__d_unalias' - unexpected unlock
fs/dcache.c:1735:2: warning: context imbalance in 'd_materialise_unique' - different lock contexts for basic block

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 13:06:36 -04:00
Andreas Gruenbacher
be285c712b [patch 3/3] vfs: make d_path() consistent across mount operations
The path that __d_path() computes can become slightly inconsistent when it
races with mount operations: it grabs the vfsmount_lock when traversing mount
points but immediately drops it again, only to re-grab it when it reaches the
next mount point.  The result is that the filename computed is not always
consisent, and the file may never have had that name. (This is unlikely, but
still possible.)

Fix this by grabbing the vfsmount_lock for the whole duration of
__d_path().

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 13:06:13 -04:00
Miklos Szeredi
8837abcab3 nfsd: rename MAY_ flags
Rename nfsd_permission() specific MAY_* flags to NFSD_MAY_* to make it
clear, that these are not used outside nfsd, and to avoid name and
number space conflicts with the VFS.

[comment from hch: rename MAY_READ, MAY_WRITE and MAY_EXEC as well]

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:50 -04:00
NeilBrown
599eb3046a knfsd: nfsd: Handle ERESTARTSYS from syscalls.
OCFS2 can return -ERESTARTSYS from write requests (and possibly
elsewhere) if there is a signal pending.

If nfsd is shutdown (by sending a signal to each thread) while there
is still an IO load from the client, each thread could handle one last
request with a signal pending.  This can result in -ERESTARTSYS
which is not understood by nfserrno() and so is reflected back to
the client as nfserr_io aka -EIO.  This is wrong.

Instead, interpret ERESTARTSYS to mean "try again later" by returning
nfserr_jukebox.  The client will resend and - if the server is
restarted - the write will (hopefully) be successful and everyone will
be happy.

 The symptom that I narrowed down to this was:
    copy a large file via NFS to an OCFS2 filesystem, and restart
    the nfs server during the copy.
    The 'cp' might get an -EIO, and the file will be corrupted -
    presumably holes in the middle where writes appeared to fail.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:50 -04:00
Neil Brown
c7d106c90e nfsd: fix race in nfsd_nrthreads()
We need the nfsd_mutex before accessing nfsd_serv->sv_nrthreads or we
can't even guarantee nfsd_serv will still be there.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:50 -04:00
Jeff Layton
abd1ec4efd lockd: close potential race with rapid lockd_up/lockd_down cycle
If lockd_down is called very rapidly after lockd_up returns, then
there is a slim chance that lockd() will never be called. kthread()
will return before calling the function, so we'll end up never
actually calling the cleanup functions for the thread.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:50 -04:00
Jeff Layton
a75c5d01e4 sunrpc: remove sv_kill_signal field from svc_serv struct
Since we no longer make any distinction between shutdown signals with
nfsd, then it becomes easier to just standardize on a particular signal
to use to bring it down (SIGINT, in this case).

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:49 -04:00
Jeff Layton
9867d76ca1 knfsd: convert knfsd to kthread API
This patch is rather large, but I couldn't figure out a way to break it
up that would remain bisectable. It does several things:

- change svc_thread_fn typedef to better match what kthread_create expects
- change svc_pool_map_set_cpumask to be more kthread friendly. Make it
  take a task arg and and get rid of the "oldmask"
- have svc_set_num_threads call kthread_create directly
- eliminate __svc_create_thread

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:49 -04:00
Jeff Layton
e096bbc648 knfsd: remove special handling for SIGHUP
The special handling for SIGHUP in knfsd is a holdover from much
earlier versions of Linux where reloading the export table was
more expensive. That facility is not really needed anymore and
to my knowledge, is seldom-used.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:49 -04:00
Jeff Layton
3dd98a3bcc knfsd: clean up nfsd filesystem interfaces
Several of the nfsd filesystem interfaces allow changes to parameters
that don't have any effect on a running nfsd service. They are only ever
checked when nfsd is started. This patch fixes it so that changes to
those procfiles return -EBUSY if nfsd is already running to make it
clear that changes on the fly don't work.

The patch should also close some relatively harmless races between
changing the info in those interfaces and starting nfsd, since these
variables are being moved under the protection of the nfsd_mutex.

Finally, the nfsv4recoverydir file always returns -EINVAL if read. This
patch fixes it to return the recoverydir path as expected.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:49 -04:00
Neil Brown
bedbdd8bad knfsd: Replace lock_kernel with a mutex for nfsd thread startup/shutdown locking.
This removes the BKL from the RPC service creation codepath. The BKL
really isn't adequate for this job since some of this info needs
protection across sleeps.

Also, add some comments to try and clarify how the locking should work
and to make it clear that the BKL isn't necessary as long as there is
adequate locking between tasks when touching the svc_serv fields.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:49 -04:00
Benny Halevy
13b1867cac nfsd: make nfs4xdr WRITEMEM safe against zero count
WRITEMEM zeroes the last word in the destination buffer
for padding purposes, but this must not be done if
no bytes are to be copied, as it would result
in zeroing of the word right before the array.

The current implementation works since it's always called
with non zero nbytes or it follows an encoding of the
string (or opaque) length which, if equal to zero,
can be overwritten with zero.

Nevertheless, it seems safer to check for this case.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:48 -04:00
J. Bruce Fields
3b12cd9862 nfsd: add dprintk of compound return
We already print each operation of the compound when debugging is turned
on; printing the result could also help with remote debugging.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-06-23 13:02:48 -04:00
Denis V. Lunev
f9f48ec72b [patch 4/4] flock: remove unused fields from file_lock_operations
fl_insert and fl_remove are not used right now in the kernel. Remove them.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 11:52:30 -04:00
Marcin Slusarz
694a1764d6 [patch 3/4] vfs: fix ERR_PTR abuse in generic_readlink
generic_readlink calls ERR_PTR for negative and positive values
(vfs_readlink returns length of "link"), but it should not
(not an errno) and does not need to.

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 11:52:30 -04:00
Jan Engelhardt
20d4fdc1a7 [patch 2/4] fs: make struct file arg to d_path const
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 11:52:30 -04:00
Jan Blunck
c8e7f449b2 [patch 1/4] vfs: path_{get,put}() cleanups
Here are some more places where path_{get,put}() can be used instead of
dput()/mntput() pair.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 11:52:29 -04:00
Michael Kerrisk
c70f844174 [patch for 2.6.26 4/4] vfs: utimensat(): fix write access check for futimens()
The POSIX.1 draft spec for futimens()/utimensat() says:

        Only a process with the effective user ID equal to the
        user ID of the file, *or with write access to the file*,
        or with appropriate privileges may use futimens() or
        utimensat() with a null pointer as the times argument
        or with both tv_nsec fields set to the special value
        UTIME_NOW.

The important piece here is "with write access to the file", and
this matters for futimens(), which deals with an argument that
is a file descriptor referring to the file whose timestamps are
being updated,  The standard is saying that the "writability"
check is based on the file permissions, not the access mode with
which the file is opened.  (This behavior is consistent with the
semantics of FreeBSD's futimes().)  However, Linux is currently
doing the latter -- futimens(fd, times) is a library
function implemented as

       utimensat(fd, NULL, times, 0)

and within the utimensat() implementation we have the code:

                f = fget(dfd);  // dfd is 'fd'
                ...
                if (f) {
                        if (!(f->f_mode & FMODE_WRITE))
                                goto mnt_drop_write_and_out;

The check should instead be based on the file permissions.

Thanks to Miklos for pointing out how to do this check.
Miklos also pointed out a simplification that could be
made to my first version of this patch, since the checks
for the pathname and file descriptor cases can now be
conflated.

Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 08:43:52 -04:00
Michael Kerrisk
4cca92264e [patch for 2.6.26 3/4] vfs: utimensat(): fix error checking for {UTIME_NOW,UTIME_OMIT} case
The POSIX.1 draft spec for utimensat() says:

    Only a process with the effective user ID equal to the
    user ID of the file or with appropriate privileges may use
    futimens() or utimensat() with a non-null times argument
    that does not have both tv_nsec fields set to UTIME_NOW
    and does not have both tv_nsec fields set to UTIME_OMIT.

If this condition is violated, then the error EPERM should result.
However, the current implementation does not generate EPERM if
one tv_nsec field is UTIME_NOW while the other is UTIME_OMIT.
It should give this error for that case.

This patch:

a) Repairs that problem.
b) Removes the now unneeded nsec_special() helper function.
c) Adds some comments to explain the checks that are being
   performed.

Thanks to Miklos, who provided comments on the previous iteration
of this patch.  As a result, this version is a little simpler and
and its logic is better structured.

Miklos suggested an alternative idea, migrating the
is_owner_or_cap() checks into fs/attr.c:inode_change_ok() via
the use of an ATTR_OWNER_CHECK flag.  Maybe we could do that
later, but for now I've gone with this version, which is
IMO simpler, and can be more easily read as being correct.

Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 08:43:04 -04:00
Michael Kerrisk
94c70b9ba7 [patch for 2.6.26 1/4] vfs: utimensat(): ignore tv_sec if tv_nsec == UTIME_OMIT or UTIME_NOW
The POSIX.1 draft spec for utimensat() says that if a times[n].tv_nsec
field is UTIME_OMIT or UTIME_NOW, then the value in the corresponding
tv_sec field is ignored.  See the last sentence of this para, from
the spec:

    If the tv_nsec field of a timespec structure has
    the special value UTIME_NOW, the file's relevant
    timestamp shall be set to the greatest value
    supported by the file system that is not greater than
    the current time. If the tv_nsec field has the
    special value UTIME_OMIT, the file's relevant
    timestamp shall not be changed. In either case,
    the tv_sec field shall be ignored.

However the current Linux implementation requires the tv_sec value to be
zero (or the EINVAL error results). This requirement should be removed.

Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 08:43:04 -04:00
Michael Kerrisk
12fd0d3088 [patch for 2.6.26 2/4] vfs: utimensat(): be consistent with utime() for immutable and append-only files
This patch fixes utimensat() to make its behavior consistent
with that of utime()/utimes() when dealing with files marked
immutable and append-only.

The current utimensat() implementation also returns EPERM if
'times' is non-NULL and the tv_nsec fields are both UTIME_NOW.
For consistency, the

(times != NULL && times[0].tv_nsec == UTIME_NOW &&
                  times[1].tv_nsec == UTIME_NOW)

case should be treated like the traditional utimes() case where
'times' is NULL.  That is, the call should succeed for a file
marked append-only and should give the error EACCES if the file
is marked as immutable.

The simple way to do this is to set 'times' to NULL
if (times[0].tv_nsec == UTIME_NOW && times[1].tv_nsec == UTIME_NOW).

This is also the natural approach, since POSIX.1 semantics consider the
times == {{x, UTIME_NOW}, {y, UTIME_NOW}}
to be exactly equivalent to the case for
times == NULL.

(Thanks to Miklos for pointing this out.)

Patch 3 in this series relies on the simplification provided
by this patch.

Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 08:43:03 -04:00
Al Viro
fe6e9c1f25 [PATCH] fix cgroup-inflicted breakage in block_dev.c
devcgroup_inode_permission() expects MAY_FOO, not FMODE_FOO; kindly
keep your misdesign consistent if you positively have to inflict it
on the kernel.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-06-23 08:30:55 -04:00
Linus Torvalds
55d8538498 Fix performance regression on lmbench select benchmark
Christian Borntraeger reported that reinstating cond_resched() with
CONFIG_PREEMPT caused a performance regression on lmbench:

	For example select file 500:
	23 microseconds
	32 microseconds

and that's really because we totally unnecessarily do the cond_resched()
in the innermost loop of select(), which is just silly.

This moves it out from the innermost loop (which only ever loops ove the
bits in a single "unsigned long" anyway), which makes the performance
regression go away.

Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-22 12:23:15 -07:00
Linus Torvalds
62a8efe632 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  Ext4: Fix online resize block group descriptor corruption
2008-06-21 16:43:56 -07:00
Arnd Bergmann
514bcc66d4 dlm-user: BKL pushdown
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2008-06-20 14:05:56 -06:00
Linus Torvalds
8f5934278d Replace BKL with superblock lock in fat/msdos/vfat
This replaces the use of the BKL in the FAT family of filesystems with the
existing superblock lock instead.

The code already appears to do mostly proper locking with its own private
spinlocks (and mutexes), but while the BKL could possibly have been
dropped entirely, converting it to use the superblock lock (which is just
a regular mutex) is the conservative thing to do.

As a per-filesystem mutex, it not only won't have any of the possible
latency issues related to the BKL, but the lock is obviously private to
the particular filesystem instance and will thus not cause problems for
entirely unrelated users like the BKL can.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-06-20 14:05:54 -06:00
Jonathan Corbet
9514dff918 Remove the lock_kernel() call from chrdev_open()
All in-kernel char device open() functions now either have their own
lock_kernel() calls or clearly do not need one.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-06-20 14:05:53 -06:00
Jonathan Corbet
a30427d92d Add a comment in chrdev_open()
I stared at this code for a while and almost deleted it before
understanding crept into my slow brain.  Hopefully this makes life easier
for the next person to happen on it.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-06-20 14:05:53 -06:00
Frederic Bohe
2856922c15 Ext4: Fix online resize block group descriptor corruption
This is the patch for the group descriptor table corruption during
online resize pointed out by Theodore Tso.  The problem was caused by
the fact that the ext4 group descriptor can be either 32 or 64 bytes
long.  Only the 64 bytes structure was taken into account.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-20 11:48:48 -04:00
Linus Torvalds
e899536470 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  udf: restore UDFFS_DEBUG to being undefined by default
2008-06-18 11:55:03 -07:00
Miklos Szeredi
f948d56435 fuse: fix thinko in max I/O size calucation
Use max not min to enforce a lower limit on the max I/O size.

This bug was introduced by "fuse: fix max i/o size calculation" (commit
e5d9a0df07).

Thanks to Brian Wang for noticing.

Reported-by: Brian Wang <ywang221@hotmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Szabolcs Szakacsits <szaka@ntfs-3g.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-17 18:08:10 -07:00
David S. Miller
169a3ec492 wext: Remove compat handling from fs/compat_ioctl.c
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-16 18:33:00 -07:00
David S. Miller
87de87d5e4 wext: Dispatch and handle compat ioctls entirely in net/wireless/wext.c
Next we can kill the hacks in fs/compat_ioctl.c and also
dispatch compat ioctls down into the driver and 80211 protocol
helper layers in order to handle iw_point objects embedded in
stream replies which need to be translated.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-16 18:32:46 -07:00
Linus Torvalds
27eaf66b05 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: Remove ->hangup() from stack glue operations.
  ocfs2: Move the call of ocfs2_hb_ctl into the stack glue.
  ocfs2: Move the hb_ctl_path sysctl into the stack glue.
2008-06-16 13:17:33 -07:00
Joel Becker
2c39450b39 ocfs2: Remove ->hangup() from stack glue operations.
The ->hangup() call was only used to execute ocfs2_hb_ctl.  Now that
the generic stack glue code handles this, the underlying stack drivers
don't need to know about it.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-06-16 10:46:52 -07:00
Joel Becker
9f9a99f4ec ocfs2: Move the call of ocfs2_hb_ctl into the stack glue.
Take o2hb_stop() out of the o2cb code and make it part of the generic
stack glue as ocfs2_leave_group().  This also allows us to remove the
ocfs2_get_hb_ctl_path() function - everything to do with hb_ctl is now
part of stackglue.c.  o2cb no longer needs a ->hangup() function.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-06-16 10:46:51 -07:00
Joel Becker
3878f110f7 ocfs2: Move the hb_ctl_path sysctl into the stack glue.
ocfs2 needs to call out to the hb_ctl program at unmount for all cluster
stacks.  The first step is to move the hb_ctl_path sysctl out of the
o2cb code and into the generic stack glue.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-06-16 10:46:50 -07:00
David Woodhouse
a9e0f5293d Remove last traces of a.out support from ELF loader.
In commit d20894a237 ("Remove a.out
interpreter support in ELF loader"), Andi removed support for a.out
interpreters from the ELF loader, which was only ever needed for the
transition from a.out to ELF.

This removes the last traces of that support, in particular the
inclusion of <linux/a.out.h>.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-16 10:20:57 -07:00
David Woodhouse
702773b16e Include <asm/a.out.h> in fs/exec.c only for Alpha.
We only need it for the /sbin/loader hack for OSF/1 executables, and we
don't want to include it otherwise.

While we're at it, remove the redundant '&& CONFIG_ARCH_SUPPORTS_AOUT'
in the ifdef around that code. It's already dependent on __alpha__, and
CONFIG_ARCH_SUPPORTS_AOUT is hard-coded to 'y' there.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-16 10:20:57 -07:00
Paul Collins
e4f3ec0634 udf: restore UDFFS_DEBUG to being undefined by default
Commit 706047a797, "udf: Fix compilation
warnings when UDF debug is on" inadvertently (I assume) enabled
debugging messages by default for UDF.  This patch disables them again.

Signed-off-by: Paul Collins <paul@ondioline.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2008-06-16 14:24:36 +02:00
Ingo Molnar
c54f9da1c8 Merge branch 'linus' into x86/irqstats 2008-06-16 11:27:53 +02:00
Dave Hansen
bcf8039ed4 pagemap: fix large pages in pagemap
We were walking right into huge page areas in the pagemap walker, and
calling the pmds pmd_bad() and clearing them.

That leaked huge pages.  Bad.

This patch at least works around that for now.  It ignores huge pages in
the pagemap walker for the time being, and won't leak those pages.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12 18:05:41 -07:00
Dave Hansen
2165009bdf pagemap: pass mm into pagewalkers
We need this at least for huge page detection for now, because powerpc
needs the vm_area_struct to be able to determine whether a virtual address
is referring to a huge page (its pmd_huge() doesn't work).

It might also come in handy for some of the other users.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12 18:05:41 -07:00
OGAWA Hirofumi
2d518f84e5 fat: relax the permission check of fat_setattr()
New chmod() allows only acceptable permission, and if not acceptable, it
returns -EPERM.  Old one allows even if it can't store permission to on
disk inode.  But it seems too strict for users.

E.g.  https://bugzilla.redhat.com/show_bug.cgi?id=449080: With new one,
rsync couldn't create the temporary file.

So, this patch allows like old one, but now it doesn't change the
permission if it can't store, and it returns 0.

Also, this patch fixes missing check.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12 18:05:39 -07:00
Linus Torvalds
2a212f6996 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] cifs: fix oops on mount when CONFIG_CIFS_DFS_UPCALL is enabled
  [CIFS] Fix hang in mount when negprot causes server to kill tcp session
  disable most mode changes on non-unix/non-cifsacl mounts
  [CIFS] Correct incorrect obscure open flag
  [CIFS] warn if both dynperm and cifsacl mount options specified
  silently ignore ownership changes unless unix extensions are enabled or we're faking uid changes
  [CIFS] remove trailing whitespace
  when creating new inodes, use file_mode/dir_mode exclusively on mount without unix extensions
  on non-posix shares, clear write bits in mode when ATTR_READONLY is set
  [CIFS] remove unused variables
2008-06-11 09:45:51 -07:00
Steve French
79ee9a8b2d [CIFS] cifs: fix oops on mount when CONFIG_CIFS_DFS_UPCALL is enabled
simple "mount -t cifs //xxx /mnt" oopsed on strlen of options
http://kerneloops.org/guilty.php?guilty=cifs_get_sb&version=2.6.25-release&start=16711 \
68&end=1703935&class=oops

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-06-10 21:37:02 +00:00
Steve French
dbdbb87636 [CIFS] Fix hang in mount when negprot causes server to kill tcp session
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-06-10 21:21:56 +00:00
Adrian Bunk
ec1aef3366 jfs: remove DIRENTSIZ
After fat gets fixed the unused DIRENTSIZ macro was the last user of
struct dirent we should get rid of since the kernel and userspace
versions differed.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
2008-06-10 15:12:58 -05:00
Linus Torvalds
5f0e62c3e1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: enable barriers by default
  jbd2: Fix barrier fallback code to re-lock the buffer head
  ext4: Display the journal_async_commit mount option in /proc/mounts
  jbd2: If a journal checksum error is detected, propagate the error to ext4
  jbd2: Fix memory leak when verifying checksums in the journal
  ext4: fix online resize bug
  ext4: Fix uninit block group initialization with FLEX_BG
  ext4: Fix use of uninitialized data with debug enabled.
2008-06-06 15:30:53 -07:00
Oleg Nesterov
aab2545fdd uml: activate_mm: remove the dead PF_BORROWED_MM check
use_mm() was changed to use switch_mm() instead of activate_mm(), since
then nobody calls (and nobody should call) activate_mm() with
PF_BORROWED_MM bit set.

As Jeff Dike pointed out, we can also remove the "old != new" check, it is
always true.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:36:22 -07:00
Linus Torvalds
156a9ea43a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6:
  capabilities: remain source compatible with 32-bit raw legacy capability support.
  LSM: remove stale web site from MAINTAINERS
2008-06-06 11:31:55 -07:00
Thomas Tuttle
4710d1ac4c pagemap: return EINVAL, not EIO, for unaligned reads of kpagecount or kpageflags
If the user tries to read from a position that is not a multiple of 8, or
read a number of bytes that is not a multiple of 8, they have passed an
invalid argument to read, for the purpose of reading these files.  It's
not an IO error because we didn't encounter any trouble finding the data
they asked for.

Signed-off-by: Thomas Tuttle <ttuttle@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:13 -07:00
Thomas Tuttle
bbcdac0c20 pagemap: return map count, not reference count, in /proc/kpagecount
Since pagemap is all about examining pages mapped into processes' memory
spaces, it makes sense for kpagecount to return the map counts, not the
reference counts.

Signed-off-by: Thomas Tuttle <ttuttle@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:13 -07:00
Vegard Nossum
aed5417593 proc: calculate the correct /proc/<pid> link count
This patch:

  commit e9720acd72
  Author: Pavel Emelyanov <xemul@openvz.org>
  Date:   Fri Mar 7 11:08:40 2008 -0800

    [NET]: Make /proc/net a symlink on /proc/self/net (v3)

introduced a /proc/self/net directory without bumping the corresponding
link count for /proc/self.

This patch replaces the static link count initializations with a call that
counts the number of directory entries in the given pid_entry table
whenever it is instantiated, and thus relieves the burden of manually
keeping the two in sync.

[akpm@linux-foundation.org: cleanup]
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:13 -07:00
Josef Bacik
9bb91784de ext3: fix online resize bug
There is a bug when we are trying to verify that the reserve inode's
double indirect blocks point back to the primary gdt blocks.  The fix is
obvious, we need to mod the gdb count by the addr's per block.  You can
verify this with the following test case

dd if=/dev/zero of=disk1 seek=1024 count=1 bs=100M
losetup /dev/loop1 disk1
pvcreate /dev/loop1
vgcreate loopvg1 /dev/loop1
lvcreate -l 100%VG loopvg1 -n looplv1
mkfs.ext3 -J size=64 -b 1024 /dev/loopvg1/looplv1
mount /dev/loopvg1/looplv1 /mnt/loop
dd if=/dev/zero of=disk2 seek=1024 count=1 bs=50M
losetup /dev/loop2 disk2
pvcreate /dev/loop2
vgextend loopvg1 /dev/loop2
lvextend -l 100%VG /dev/loopvg1/looplv1
resize2fs /dev/loopvg1/looplv1

without this patch the resize2fs fails, with it the resize2fs succeeds.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:13 -07:00
Pekka Enberg
d100d148aa nommu: fix ksize() abuse
The nommu binfmt code uses ksize() for pointers returned from do_mmap()
which is wrong.  This converts the call-sites to use the nommu specific
kobjsize() function which works as expected.

Cc: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:13 -07:00
Thomas Tuttle
aae8679b0e pagemap: fix bug in add_to_pagemap, require aligned-length reads of /proc/pid/pagemap
Fix a bug in add_to_pagemap.  Previously, since pm->out was a char *,
put_user was only copying 1 byte of every PFN, resulting in the top 7
bytes of each PFN not being copied.  By requiring that reads be a multiple
of 8 bytes, I can make pm->out and pm->end u64*s instead of char*s, which
makes put_user work properly, and also simplifies the logic in
add_to_pagemap a bit.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Thomas Tuttle <ttuttle@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:11 -07:00
Pavel Emelyanov
7db9cfd380 devscgroup: check for device permissions at mount time
Currently even if a task sits in an all-denied cgroup it can still mount
any block device in any mode it wants.

Put a proper check in do_open for block device to prevent this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:11 -07:00
Akinobu Mita
93b071139a introduce memory_read_from_buffer()
This patch introduces memory_read_from_buffer().

The only difference between memory_read_from_buffer() and
simple_read_from_buffer() is which address space the function copies to.

simple_read_from_buffer copies to user space memory.
memory_read_from_buffer copies to normal memory.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Doug Warzecha <Douglas_Warzecha@dell.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Cc: Abhay Salunke <Abhay_Salunke@dell.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Markus Rechberger <markus.rechberger@amd.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Len Brown <lenb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Cc: Michael Holzheu <holzheu@de.ibm.com>
Cc: Brian King <brking@us.ibm.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Andrew Vasquez <linux-driver@qlogic.com>
Cc: Seokmann Ju <seokmann.ju@qlogic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:11 -07:00
David Woodhouse
44d1b980c7 Fix various old email addresses for dwmw2
Although if people have questions about ARCnet, perhaps it's _better_
for them to be mailing dwmw2@cam.ac.uk about it...

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:10 -07:00
Michael Halcrow
d3e49afbb6 eCryptfs: remove unnecessary page decrypt call
The page decrypt calls in ecryptfs_write() are both pointless and buggy.
Pointless because ecryptfs_get_locked_page() has already brought the page
up to date, and buggy because prior mmap writes will just be blown away by
the decrypt call.

This patch also removes the declaration of a now-nonexistent function
ecryptfs_write_zeros().

Thanks to Eric Sandeen and David Kleikamp for helping to track this
down.

Eric said:

   fsx w/ mmap dies quickly ( < 100 ops) without this, and survives
   nicely (to millions of ops+) with it in place.

Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:09 -07:00
Adrian Bunk
b8c141e8fd frv: don't offer BINFMT_FLAT
Fix the following compile error:

  CC      fs/binfmt_flat.o
In file included from
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:36:
/home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/flat.h:14:22: error: asm/flat.h: No such file or directory
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c: In function 'create_flat_tables':
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:124: error: implicit declaration of function 'flat_stack_align'
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:125: error: implicit declaration of function 'flat_argvp_envp_on_stack'
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c: In function 'calc_reloc':
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:347: error: implicit declaration of function 'flat_reloc_valid'
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c: In function 'load_flat_file':
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:479: error: implicit declaration of function 'flat_old_ram_flag'
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:755: error: implicit declaration of function 'flat_set_persistent'
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:757: error: implicit declaration of function 'flat_get_relocate_addr'
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:765: error: implicit declaration of function 'flat_get_addr_from_rp'
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/binfmt_flat.c:781: error: implicit declaration of function 'flat_put_addr_at_rp'

Reported-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Tested-by: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06 11:29:08 -07:00
Chris Wright
ddb2c43594 asn1: additional sanity checking during BER decoding
- Don't trust a length which is greater than the working buffer.
  An invalid length could cause overflow when calculating buffer size
  for decoding oid.

- An oid length of zero is invalid and allows for an off-by-one error when
  decoding oid because the first subid actually encodes first 2 subids.

- A primitive encoding may not have an indefinite length.

Thanks to Wei Wang from McAfee for report.

Cc: Steven French <sfrench@us.ibm.com>
Cc: stable@kernel.org
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-05 14:24:54 -07:00
Al Viro
1d92cfd54a cifs endianness fixes
__le16 fields used as host-endian.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Steve French <smfrench@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-04 08:06:01 -07:00
Andrew G. Morgan
ca05a99a54 capabilities: remain source compatible with 32-bit raw legacy capability support.
Source code out there hard-codes a notion of what the
_LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
raw capability system calls capget() and capset().  Its unfortunate, but
true.

Since the confusing header file has been in a released kernel, there is
software that is erroneously using 64-bit capabilities with the semantics
of 32-bit compatibilities.  These recently compiled programs may suffer
corruption of their memory when sys_getcap() overwrites more memory than
they are coded to expect, and the raising of added capabilities when using
sys_capset().

As such, this patch does a number of things to clean up the situation
for all. It

  1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
     legacy value.

  2. adopts a new #define strategy for the kernel's internal
     implementation of the preferred magic.

  3. deprecates v2 capability magic in favor of a new (v3) magic
     number. The functionality of v3 is entirely equivalent to v2,
     the only difference being that the v2 magic causes the kernel
     to log a "deprecated" warning so the admin can find applications
     that may be using v2 inappropriately.

[User space code continues to be encouraged to use the libcap API which
protects the application from details like this.  libcap-2.10 is the first
to support v3 capabilities.]

Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
Thanks to Bojan Smojver for the report.

[akpm@linux-foundation.org: s/depreciate/deprecate/g]
[akpm@linux-foundation.org: be robust about put_user size]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Bojan Smojver <bojan@rexursive.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-05-31 16:36:16 -07:00
Sunil Mushran
0f475b2abe [PATCH 3/3] ocfs2/net: Silence build warnings
This patch silences the build warnings concerning o2net_init_nst()
and friends when building without CONFIG_DEBUG_FS enabled.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-05-30 15:15:12 -07:00
Sunil Mushran
959040c37a [PATCH 2/3] ocfs2/dlm: Silence build warnings
This patch silences the build warnings concerning dlm_debug_init()
and friends when building without CONFIG_DEBUG_FS enabled.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-05-30 15:15:10 -07:00
Sunil Mushran
271d772d02 [PATCH 1/3] ocfs2/net: Silence build warnings
This patch silences the build warnings concerning o2net_debugfs_init()
and friends when building without CONFIG_DEBUG_FS enabled.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-05-30 15:15:04 -07:00
Joel Becker
a12630b186 ocfs2: Rename 'user_stack' plugin structure to 'ocfs2_user_plugin'
The static structure describing the userspace cluster plugin for ocfs2
was named 'user_stack', which is a real pain when people are grep(1)ing
the tree for the program stack object 'user_stack'.  Change the name to
something distinct and namespaced.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-05-30 15:14:08 -07:00
Li Zefan
3c65e8743b JFS: diAlloc() should return -EIO rather than EIO
The comment above the function says one of its return value is -EIO,
and also the caller of diAlloc() checks for -EIO:

struct inode *ialloc(struct inode *parent, umode_t mode)
{
	...
	rc = diAlloc(parent, S_ISDIR(mode), inode);
	if (rc) {
		jfs_warn("ialloc: diAlloc returned %d!", rc);
		if (rc == -EIO)
			make_bad_inode(inode);
	...

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
2008-05-28 08:58:56 -05:00
Jens Axboe
ca39d651d1 splice: handle try_to_release_page() failure
splice currently assumes that try_to_release_page() always suceeds,
but it can return failure. If it does, we cannot steal the page.

Acked-by: Mingming Cao <cmm@us.ibm.com
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Tom Zanussi
a82c53a0e3 splice: fix sendfile() issue with relay
Splice isn't always incrementing the ppos correctly, which broke
relay splice.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Oleg Nesterov
cbaffba12c posix timers: discard SI_TIMER signals on exec
Based on Roland's patch. This approach was suggested by Austin Clements
from the very beginning, and then by Linus.

As Austin pointed out, the execing task can be killed by SI_TIMER signal
because exec flushes the signal handlers, but doesn't discard the pending
signals generated by posix timers. Perhaps not a bug, but people find this
surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26 10:37:07 -07:00
Eric Sandeen
571640cad3 ext4: enable barriers by default
I can't think of any valid reason for ext4 to not use barriers when
they are available;  I believe this is necessary for filesystem
integrity in the face of a volatile write cache on storage.

An administrator who trusts that the cache is sufficiently battery-
backed (and power supplies are sufficiently redundant, etc...)
can always turn it back off again.

SuSE has carried such a patch for ext3 for quite some time now.

Also document the mount option while we're at it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-26 12:29:46 -04:00
Theodore Ts'o
034772b068 jbd2: Fix barrier fallback code to re-lock the buffer head
If the device doesn't support write barriers, the write is retried
without ordered mode.  But the buffer head needs to be re-locked or
submit_bh will fail with on BUG(!buffer_locked(bh)).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-03 22:31:11 -04:00
Theodore Ts'o
cd0b6a39a1 ext4: Display the journal_async_commit mount option in /proc/mounts
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-26 10:28:28 -04:00
Theodore Ts'o
624080eded jbd2: If a journal checksum error is detected, propagate the error to ext4
If a journal checksum error is detected, the ext4 filesystem will call
ext4_error(), and the mount will either continue, become a read-only
mount, or cause a kernel panic based on the superblock flags
indicating the user's preference of what to do in case of filesystem
corruption being detected.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-06 17:50:40 -04:00
Theodore Ts'o
8ea76900be jbd2: Fix memory leak when verifying checksums in the journal
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-26 10:28:09 -04:00
Josef Bacik
944600930a ext4: fix online resize bug
There is a bug when we are trying to verify that the reserve inode's
double indirect blocks point back to the primary gdt blocks.  The fix is
obvious, we need to mod the gdb count by the addr's per block.  This was
verified using the same testcase as with the ext3 equivalent of this
patch.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-06 18:05:52 -04:00
Jose R. Santos
0bf7e8379c ext4: Fix uninit block group initialization with FLEX_BG
With FLEX_BG block bitmaps, inode bitmaps and inode tables _MAY_ be
allocated outside the group.  So, when initializing an uninitialized
block bitmap, we need to check the location of this blocks before
setting the corresponding bits in the block bitmap of the newly
initialized group.  Also return the right number of free blocks when
counting the available free blocks in uninit group.

Tested-by: Aneesh Kumar K.V <aneesh.kumar@inux.vnet.ibm.com>
Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-03 14:07:29 -04:00
Aneesh Kumar K.V
03cddb80ed ext4: Fix use of uninitialized data with debug enabled.
Fix use of uninitialized data with debug enabled.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-05 20:59:29 -04:00
Jan Beulich
a2eddfa959 x86: make /proc/stat account for all interrupts
LAPIC interrupts, which don't go through the generic interrupt handling
code, aren't accounted for in /proc/stat. Hence this patch adds a
mechanism architectures can use to accordingly adjust the statistics.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-25 07:11:49 +02:00
Jeff Layton
5132861a7a disable most mode changes on non-unix/non-cifsacl mounts
CIFS currently allows you to change the mode of an inode on a share that
doesn't have unix extensions enabled, and isn't using cifsacl. The inode
in this case *only* has its mode changed in memory on the client. This
is problematic since it can change any time the inode is purged from the
cache.

This patch makes cifs_setattr silently ignore most mode changes when
unix extensions and cifsacl support are not enabled, and when the share
is not mounted with the "dynperm" option. The exceptions are:

When a mode change would remove all write access to an inode we turn on
the ATTR_READONLY bit on the server and remove all write bits from the
inode's mode in memory.

When a mode change would add a write bit to an inode that previously had
them all turned off, it turns off the ATTR_READONLY bit on the server,
and resets the mode back to what it would normally be (generally, the
file_mode or dir_mode of the share).

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-25 00:33:58 +00:00
Denis V. Lunev
c4185a0e01 proc: proc_get_inode() should get module only once
Any file under /proc/net opened more than once leaked the refcounter
on the module it belongs to.

The problem is that module_get is called for each file opening while
module_put is called only when /proc inode is destroyed. So, lets put
module counter if we are dealing with already initialised inode.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10737

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: David Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Robert Olsson <robert.olsson@its.uu.se>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: Roland Kletzing <devzero@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:11 -07:00
Alan Cox
80119ef5c8 mm: fix atomic_t overflow in vm
The atomic_t type is 32bit but a 64bit system can have more than 2^32
pages of virtual address space available.  Without this we overflow on
ludicrously large mappings

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:09 -07:00
Marcin Slusarz
80bfc25f42 ntfs: le*_add_cpu conversion
replace all:
little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) +
					expression_in_cpu_byteorder);
with:
	leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder);
generated with semantic patch

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:08 -07:00
Cyrill Gorcunov
71fd5179e8 ecryptfs: fix missed mutex_unlock
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:07 -07:00
Miklos Szeredi
03fb0bce01 fuse: fix bdi naming conflict
Fuse allocates a separate bdi for each filesystem, and registers them
in sysfs with "MAJOR:MINOR" of sb->s_dev (st_dev).  This works fine for
anon devices normally used by fuse, but can conflict with an already
registered BDI for "fuseblk" filesystems, where sb->s_dev represents a
real block device.  In particularl this happens if a non-partitioned
device is being mounted.

Fix by registering with a different name for "fuseblk" filesystems.

Thanks to Ioan Ionita for the bug report.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Ioan Ionita <opslynx@gmail.com>
Tested-by: Ioan Ionita <opslynx@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:07 -07:00
Steve French
b7206153f6 [CIFS] Correct incorrect obscure open flag
Also add defines for pipe subcommand codes

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-23 20:35:07 +00:00
Steve French
27adb44c4f [CIFS] warn if both dynperm and cifsacl mount options specified
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-23 19:43:29 +00:00
Jeff Layton
4ca691a892 silently ignore ownership changes unless unix extensions are enabled or we're faking uid changes
CIFS currently allows you to change the ownership of a file, but unless
unix extensions are enabled this change is not passed off to the server.

Have CIFS silently ignore ownership changes that can't be persistently
stored on the server unless the "setuids" option is explicitly
specified.

We could return an error here (-EOPNOTSUPP or something), but this is
how most disk-based windows filesystems on behave on Linux (e.g.  VFAT,
NTFS, etc). With cifsacl support and proper Windows to Unix idmapping
support, we may be able to do this more properly in the future.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-23 18:25:17 +00:00
Steve French
4e94a105ed [CIFS] remove trailing whitespace
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-23 18:22:46 +00:00
Jeff Layton
b0fd30d3e7 when creating new inodes, use file_mode/dir_mode exclusively on mount without unix extensions
When CIFS creates a new inode on a mount without unix extensions, it
temporarily assigns the mode that was passed to it in the create/mkdir
call. Eventually, when the inode is revalidated, it changes to have the
file_mode or dir_mode for the mount. This is confusing to users who
expect that the mode shouldn't change this way. It's also problematic
since only the mode is treated this way, not the uid or gid. Suppose you
have a CIFS mount that's mounted with:

uid=0,gid=0,file_mode=0666,dir_mode=0777

...if an unprivileged user comes along and does this on the mount:

mkdir -m 0700 foo
touch foo/bar

...there is a period of time where the touch will fail, since the dir
will initially be owned by root and have mode 0700. If the user waits
long enough, then "foo" will be revalidated and will get the correct
dir_mode permissions.

This patch changes cifs_mkdir and cifs_create to not overwrite the
mode found by the initial cifs_get_inode_info call after the inode is
created on the server. Legacy behavior can be reenabled with the
new "dynperm" mount option.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-23 18:17:16 +00:00
Jeff Layton
4468eb3fd1 on non-posix shares, clear write bits in mode when ATTR_READONLY is set
When mounting a share with posix extensions disabled,
cifs_get_inode_info turns off all the write bits in the mode for regular
files if ATTR_READONLY is set. Directories and other inode types,
however, can also have ATTR_READONLY set, but the mode gives no
indication of this.

This patch makes this apply to other inode types besides regular files.
It also cleans up how modes are set in cifs_get_inode_info for both the
"normal" and "dynperm" cases.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-23 18:17:09 +00:00
Steve French
aaa9bbe039 [CIFS] remove unused variables
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-23 17:38:32 +00:00
Linus Torvalds
6483d152ac Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
  [XFS] Fix memory corruption with small buffer reads
  [XFS] Fix inode list allocation size in writeback.
  [XFS] Don't allow memory reclaim to wait on the filesystem in inode
  [XFS] Fix fsync() b0rkage.
  [XFS] Include linux/random.h in all builds, not just debug builds.
2008-05-23 08:13:39 -07:00
Christoph Hellwig
6ab455eeaf [XFS] Fix memory corruption with small buffer reads
When we have multiple buffers in a single page for a blocksize == pagesize
filesystem we might overwrite the page contents if two callers hit it
shortly after each other. To prevent that we need to keep the page locked
until I/O is completed and the page marked uptodate.

Thanks to Eric Sandeen for triaging this bug and finding a reproducible
testcase and Dave Chinner for additional advice.

This should fix kernel.org bz #10421.

Tested-by: Eric Sandeen <sandeen@sandeen.net>

SGI-PV: 981813
SGI-Modid: xfs-linux-melb:xfs-kern:31173a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-05-23 18:12:49 +10:00
David Chinner
c8f5f12e46 [XFS] Fix inode list allocation size in writeback.
We only need to allocate space for the number of inodes in the cluster
when writing back inodes, not every byte in the inode cluster. This
reduces the amount of memory needing to be allocated to 256 bytes instead
of 64k.

SGI-PV: 981949
SGI-Modid: xfs-linux-melb:xfs-kern:31182a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-05-23 15:26:15 +10:00
David Chinner
49383b0e98 [XFS] Don't allow memory reclaim to wait on the filesystem in inode
writeback

If we allow memory reclaim to wait on the pages under writeback in inode
cluster writeback we could deadlock because we are currently holding the
ILOCK on the initial writeback inode which is needed in data I/O
completion to change the file size or do unwritten extent conversion
before the pages are taken out of writeback state.

SGI-PV: 981091
SGI-Modid: xfs-linux-melb:xfs-kern:31015a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-05-23 15:26:03 +10:00
David Chinner
978b723712 [XFS] Fix fsync() b0rkage.
xfs_fsync() fails to wait for data I/O completion before checking if the
inode is dirty or clean to decide whether to log the inode or not. This
misses inode size updates when the data flushed by the fsync() is
extending the file.

Hence, like fdatasync(), we need to wait for I/o completion first, then
check the inode for cleanliness. Doing so makes the behaviour of
xfs_fsync() identical for fsync and fdatasync and we *always* use
synchronous semantics if the inode is dirty. Therefore also kill the
differences and remove the unused flags from the xfs_fsync function and
callers.

SGI-PV: 981296
SGI-Modid: xfs-linux-melb:xfs-kern:31033a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-05-23 15:25:25 +10:00
Dave Jones
0a891adccc [CIFS] Fix reversed memset arguments
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-22 14:20:21 +00:00
Igor Mammedov
e4058245ac Adds username in the upcall key for unattended mounts with keytab
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-22 14:09:37 +00:00
Steve French
0d817bc0d6 [CIFS] Remove redundant NULL check
Noticed by Coverity checker.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-22 02:02:03 +00:00
Al Viro
9d8df6aa9b ocfs2 endianness fixes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-21 16:55:59 -07:00
Al Viro
79bc12a0a0 ecryptfs fixes
memcpy() from userland pointer is a Bad Thing(tm)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-21 16:55:59 -07:00
Al Viro
13c48c4902 fix hppfs Makefile breakage
Fallout from commit 46d7b522eb ("uml: move
hppfs_kern.c to hppfs.c")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-21 16:55:58 -07:00
Dave Kleikamp
6536d2891b JFS: skip bad iput() call in error path
If jfs_iget() fails, we can't call iput() on the returned error.
Thanks to Eric Sesterhenn's fuzzer testing for reporting the problem.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
2008-05-21 10:45:16 -05:00
Linus Torvalds
5cf11daf9a Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (21 commits)
  [CIFS] Remove debug statement
  Fix possible access to undefined memory region.
  [CIFS] Enable DFS support for Windows query path info
  [CIFS] Enable DFS support for Unix query path info
  [CIFS] add missing seq_printf to cifs_show_options for hard mount option
  [CIFS] add more complete mount options to cifs_show_options
  [CIFS] Add missing defines for DFS
  CIFSGetDFSRefer cleanup + dfs_referral_level_3 fixed to conform REFERRAL_V3 the MS-DFSC spec.
  Fixed DFS code to work with new 'build_path_from_dentry', that returns full path if share in the dfs, now.
  [CIFS] enable parsing for transport encryption mount parm
  [CIFS] Finishup DFS code
  [CIFS] BKL-removal: convert CIFS over to unlocked_ioctl
  [CIFS] suppress duplicate warning
  [CIFS] Fix paths when share is in DFS to include proper prefix
  add function to convert access flags to legacy open mode
  clarify return value of cifs_convert_flags()
  [CIFS] don't explicitly do a FindClose on rewind when directory search has ended
  [CIFS] cleanup old checkpatch warnings
  [CIFS] CIFSSMBPosixLock should return -EINVAL on error
  fix memory leak in CIFSFindNext
  ...
2008-05-20 21:12:14 -07:00
Steve French
397d71ddfd [CIFS] Remove debug statement
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-21 03:49:46 +00:00
Igor Mammedov
5651ced3ab Fix possible access to undefined memory region.
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-21 03:49:00 +00:00
Linus Torvalds
d40ace0c7b Merge branch 'for-2.6.26' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.26' of git://linux-nfs.org/~bfields/linux: (25 commits)
  svcrdma: Verify read-list fits within RPCSVC_MAXPAGES
  svcrdma: Change svc_rdma_send_error return type to void
  svcrdma: Copy transport address and arm CQ before calling rdma_accept
  svcrdma: Set rqstp transport address in rdma_read_complete function
  svcrdma: Use ib verbs version of dma_unmap
  svcrdma: Cleanup queued, but unprocessed I/O in svc_rdma_free
  svcrdma: Move the QP and cm_id destruction to svc_rdma_free
  svcrdma: Add reference for each SQ/RQ WR
  svcrdma: Move destroy to kernel thread
  svcrdma: Shrink scope of spinlock on RQ CQ
  svcrdma: Use standard Linux lists for context cache
  svcrdma: Simplify RDMA_READ deferral buffer management
  svcrdma: Remove unused READ_DONE context flags bit
  svcrdma: Return error from rdma_read_xdr so caller knows to free context
  svcrdma: Fix error handling during listening endpoint creation
  svcrdma: Free context on post_recv error in send_reply
  svcrdma: Free context on ib_post_recv error
  svcrdma: Add put of connection ESTABLISHED reference in rdma_cma_handler
  svcrdma: Fix return value in svc_rdma_send
  svcrdma: Fix race with dto_tasklet in svc_rdma_send
  ...
2008-05-20 19:30:54 -07:00
Linus Torvalds
e616c63033 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
  pktgen: make sure that pktgen_thread_worker has been executed
  [VLAN]: Propagate selected feature bits to VLAN devices
  drivers/atm/: remove CVS keywords
  vlan: Correctly handle device notifications for layered VLAN devices
  net: Fix call to ->change_rx_flags(dev, IFF_MULTICAST) in dev_change_flags()
  net_sched: cls_api: fix return value for non-existant classifiers
  ipsec: Use the correct ip_local_out function
  ipv6 addrconf: Allow infinite prefix lifetime.
  ipv6 route: Fix lifetime in netlink.
  ipv6 addrconf: Fix route lifetime setting in corner case.
  ndisc: Add missing strategies for per-device retrans timer/reachable time settings.
  ipv6: Move <linux/in6.h> from header-y to unifdef-y.
  l2tp: avoid skb truesize bug if headroom is increased
  wireless: Create 'device' symlink in sysfs
  wireless, airo: waitbusy() won't delay
  libertas: fix command timeout after firmware failure
  mac80211: Add RTNL version of ieee80211_iterate_active_interfaces
  mac80211 : Association with 11n hidden ssid ap.
  hostap: fix "registers" registration in procfs
  isdn/capi: Return proper errnos on module init.
  ...
2008-05-20 17:23:03 -07:00
Steve French
b9a3260f25 [CIFS] Enable DFS support for Windows query path info
Final piece for handling DFS in query_path_info, constructing a
fake inode for the junction directory which the submount will cover.

This handles the non-Unix (Windows etc.) code path.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-20 21:52:32 +00:00
Steve French
0e4bbde94f [CIFS] Enable DFS support for Unix query path info
Final piece for handling DFS in unix_query_path_info, constructing a
fake inode for the junction directory which the submount will cover.

Acked-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-20 19:50:46 +00:00
Linus Torvalds
551395ae66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  [GFS2] Prefer strlcpy() over snprintf()
  [GFS2] Fix cast from unsigned int to s64
  [GFS2] filesystem consistency error from do_strip
2008-05-20 08:15:18 -07:00
Linus Torvalds
16ae527bfa Merge branch 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] list_for_each_rcu must die: audit
  [patch 1/1] audit_send_reply(): fix error-path memory leak
  [PATCH] open sessionid permissions
2008-05-19 16:38:10 -07:00
Linus Torvalds
e23a5f6687 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  [PATCH] return to old errno choice in mkdir() et.al.
  [Patch] fs/binfmt_elf.c: fix wrong return values
  [PATCH] get rid of leak in compat_execve()
  [Patch] fs/binfmt_elf.c: fix a wrong free
  [PATCH] avoid multiplication overflows and signedness issues for max_fds
  [PATCH] dup_fd() part 4 - race fix
  [PATCH] dup_fd() - part 3
  [PATCH] dup_fd() part 2
  [PATCH] dup_fd() fixes, part 1
  [PATCH] take init_files to fs/file.c
2008-05-19 16:37:45 -07:00
Steve French
89562b777c [CIFS] add missing seq_printf to cifs_show_options for hard mount option
Also Kari Hurtta noticed a missing check in the same function which is now fixed.

CC: Kari Hurtta <hurtta+gmane@siilo.fmi.fi>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-19 22:26:42 +00:00
David Teigland
817d10bad5 dlm: fix plock dev_write return value
The return value on writes to the plock device should be
the number of bytes written.  It was returning 0 instead
when an nfs lock callback was involved.

Reported-by: Nathan Straz <nstraz@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2008-05-19 15:37:27 -05:00
Marcin Slusarz
0035a4b149 dlm: tcp_connect_to_sock should check for -EINVAL, not EINVAL
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: David Teigland <teigland@redhat.com>
2008-05-19 15:37:27 -05:00
Leonardo Potenza
88ad23195e dlm: section mismatch warning fix
Removed the section mismatch message:
WARNING: fs/dlm/dlm.o(.init.text+0x132): Section mismatch in reference from the function init_module() to the function .exit.text:dlm_netlink_exit()

Since dlm_netlink_exit() is called in the init_dlm() error handling,
the __exit annotation has been removed.

Signed-off-by: Leonardo Potenza <lpotenza@inwind.it>
Signed-off-by: David Teigland <teigland@redhat.com>
2008-05-19 15:37:27 -05:00
Matthias Kaehlcke
7a936ce71e dlm: convert connections_lock in a mutex
The semaphore connections_lock is used as a mutex.  Convert it to the mutex
API.

Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2008-05-19 15:37:27 -05:00
J. Bruce Fields
88dd0be387 nfsd: reorder printk in do_probe_callback to avoid use-after-free
We're currently dereferencing the client after we drop our reference
count to it.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-05-18 19:13:07 -04:00
J. Bruce Fields
b55e0ba19c nfsd: remove unnecessary atomic ops
These bit operations don't need to be atomic.  They're all done under a
single big mutex anyway.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-05-18 19:12:54 -04:00
Linus Torvalds
161fb0cf5c Merge branch 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  SUNRPC: AUTH_SYS "machine creds" shouldn't use negative valued uid/gid
  nfs: make nfs4_drop_state_owner() static
  nfs: path_{get,put}() cleanups
  nfs: replace remaining __FUNCTION__ occurrences
  nfs/lsm: make NFSv4 set LSM mount options
  NFSv4: Check the return value of decode_compound_hdr_arg()
  nfs: fix race in nfs_dirty_request
  NFS: Ensure that 'noac' and/or 'actimeo=0' turn off attribute caching
2008-05-18 15:32:44 -07:00
Steve Grubb
6ee650467d [PATCH] open sessionid permissions
The current permissions on sessionid are a little too restrictive.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17 03:27:27 -04:00
Steve French
2b280fab12 [CIFS] add more complete mount options to cifs_show_options
adds various options to cifs_show_options
(displayed when you cat /proc/mounts with a cifs mount).  I limited
the new ones to values that are associated with the mount with the
exception of "seal" (which is a per tree connection property, but I
thought was important enough to show through).

Eventually cifs's parse_mount_options also needs to
be rewritten to use the match_token API but that would be a big enough
change that I would prefer that changing parse_mount_options wait
until next release.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-17 03:12:45 +00:00
Al Viro
e9baf6e598 [PATCH] return to old errno choice in mkdir() et.al.
In case when both EEXIST and EROFS would apply we used to
return the former in mkdir(2) and friends.  Lest anyone suspects
us of being consistent, in the same situation knfsd gave clients
nfs_erofs...

	ro-bind series had switched the syscall side of things to
returning -EROFS and immediately broke an application - namely,
mkdir -p.  Patch restores the original behaviour...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:23:18 -04:00
WANG Cong
23c4971e3d [Patch] fs/binfmt_elf.c: fix wrong return values
create_elf_tables() returns 0 on success. But when strnlen_user() "fails",
it returns 0 directly. So this is wrong.

Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:23:11 -04:00
Al Viro
08a6fac1c6 [PATCH] get rid of leak in compat_execve()
Even though copy_compat_strings() doesn't cache the pages,
copy_strings_kernel() and stuff indirectly called by e.g.
->load_binary() is doing that, so we need to drop the
cache contents in the end.

[found by WANG Cong <wangcong@zeuux.org>]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:23:05 -04:00
WANG Cong
5f719558ed [Patch] fs/binfmt_elf.c: fix a wrong free
In kmalloc failing path, we shouldn't free pointers in 'info',
because the struct 'info' is uninitilized when kmalloc is called.

And when kmalloc returns NULL, it's needless to kfree it.

Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

--
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:58 -04:00
Al Viro
eceea0b3df [PATCH] avoid multiplication overflows and signedness issues for max_fds
Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and
we don't want size calculation for ->fd[] to overflow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:52 -04:00
Al Viro
adbecb128c [PATCH] dup_fd() part 4 - race fix
Parent _can_ be a clone task, contrary to the comment.  Moreover,
more files could be opened while we allocate a copy, in which case
we end up copying only part into new descriptor table.  Since what
we get _is_ affected by all changes in the old range, we can get
rather weird effects - e.g.
	dup2(0, 1024); close(0);
in parallel with fork() resulting in child that sees the effect of
close(), but not that of dup2() done just before that close().

What we need is to recalculate the open_count after having reacquired
->file_lock and if external fdtable we'd just allocated is too small for
it, free the sucker and redo allocation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:46 -04:00
Al Viro
afbec7fff4 [PATCH] dup_fd() - part 3
merge alloc_files() into dup_fd(), leave setting newf->fdt until the end

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:39 -04:00
Al Viro
9dec3c4d30 [PATCH] dup_fd() part 2
use alloc_fdtable() instead of expand_files(), get rid of pointless
grabbing newf->file_lock, kill magic in copy_fdtable() that used to
be there only to skip copying when called from dup_fd().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:33 -04:00
Al Viro
02afc6267f [PATCH] dup_fd() fixes, part 1
Move the sucker to fs/file.c in preparation to the rest

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:26 -04:00
Al Viro
f52111b154 [PATCH] take init_files to fs/file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:20 -04:00
Harvey Harrison
9a6ab769bd byteorder: don't directly include linux/byteorder/generic.h
Use asm/byteorder.h instead.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-16 12:01:45 -07:00
Steve French
a1fe78f16e [CIFS] Add missing defines for DFS
Also has minor cleanup of previous patch

CC: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-16 18:48:38 +00:00
Igor Mammedov
fec4585fd7 CIFSGetDFSRefer cleanup + dfs_referral_level_3 fixed to conform REFERRAL_V3 the MS-DFSC spec.
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-16 18:41:43 +00:00
Adrian Bunk
1d2e88e73e nfs: make nfs4_drop_state_owner() static
nfs4_drop_state_owner() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-05-16 09:43:31 -07:00
Jan Blunck
31f31db1a1 nfs: path_{get,put}() cleanups
Here are some more places where path_{get,put}() can be used instead of
dput()/mntput() pair.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-05-16 09:43:30 -07:00
Harvey Harrison
3110ff8048 nfs: replace remaining __FUNCTION__ occurrences
__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-05-16 09:43:29 -07:00
Eric Paris
46c8ac7425 nfs/lsm: make NFSv4 set LSM mount options
NFSv3 get_sb operations call into the LSM layer to set security options passed
from userspace.  NFSv4 hooks were not originally added since it was reasonably
late in the merge window and NFSv3 was the only thing that had regressed (v4
has never supported any LSM options)

This patch makes NFSv4 call into the LSM to set security options rather than
just blindly dropping them with no notice to the user as happens today.  This
patch was tested in a simple NFSv4 environment with the context= option and
appeared to work as expected.

Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-05-16 09:43:27 -07:00
Trond Myklebust
3a6258e1fb NFSv4: Check the return value of decode_compound_hdr_arg()
If decode_compound_hdr_arg() returns a resource error, then we cannot
proceed to process the callback. Return a 'GARBAGE_ARGS' rpc-level error to
the caller instead.
If, however, the minor version field is incorrect, then we need to
propagate the resulting NFS4ERR_MINOR_VERS_MISMATCH error back as the
compound status field (setting the nops field to 0).

Finally, if encode_compound_hdr_res() returns an error, we need to return
an RPC_SYSTEM_ERR to the caller.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-05-16 09:43:26 -07:00
Fred Isaman
38def50fab nfs: fix race in nfs_dirty_request
When called from nfs_flush_incompatible, the req is not locked, so
req->wb_page might be set to NULL before it is used by PageWriteback.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-05-16 09:43:23 -07:00
Trond Myklebust
b0b539739f NFS: Ensure that 'noac' and/or 'actimeo=0' turn off attribute caching
Both the 'noac' and 'actimeo=0' mount options should ensure that attributes
are not cached, however a bug in nfs_attribute_timeout() means that
currently, the attributes may in fact get cached for up to one jiffy. This
has been seen to cause corruption in some applications.

The reason for the bug is that the time_in_range() test returns 'true' as
long as the current time lies between nfsi->read_cache_jiffies and
nfsi->read_cache_jiffies + nfsi->attrtimeo. In other words, if jiffies
equals nfsi->read_cache_jiffies, then we still cache the attribute data.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-05-16 09:43:21 -07:00
Igor Mammedov
de2db8d790 Fixed DFS code to work with new 'build_path_from_dentry', that returns full path if share in the dfs, now.
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-16 15:23:28 +00:00
Mingming Cao
02c471cb17 jbd2: update transaction t_state to T_COMMIT fix
Updating the current transaction's t_state is protected by j_state_lock.  We
need to do the same when updating the t_state to T_COMMIT.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-05-15 14:46:17 -04:00
Aneesh Kumar K.V
519deca049 ext4: Retry block allocation if new blocks are allocated from system zone.
If the block allocator gets blocks out of system zone ext4 calls
ext4_error. But if the file system is mounted with errors=continue
retry block allocation. We need to mark the system zone blocks as
in use to make sure retry don't pick them again

System zone is the block range mapping block bitmap, inode bitmap and inode
table.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-15 14:43:20 -04:00
Steve French
95b1cb90b7 [CIFS] enable parsing for transport encryption mount parm
Samba now supports transport encryption on particular exports
(mounted tree ids can be encrypted for servers which support the
unix extensions).  This adds parsing support to cifs mount
option parsing for this.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-15 16:44:38 +00:00
Steve French
c2cf07d591 [CIFS] Finishup DFS code
Fixup GetDFSRefer to prepare for cleanup of SMB response processing
Fix build warning in link.c

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-15 06:20:02 +00:00
Steve French
f9ddcca4cf [CIFS] BKL-removal: convert CIFS over to unlocked_ioctl
cifs_ioctl doesn't seem to need the BKL for anything, so convert it over
to use unlocked_ioctl.

Signed-off-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-15 05:51:55 +00:00
Steve French
c32916374b [CIFS] suppress duplicate warning
fs/cifs/dir.c: In function 'cifs_ci_compare':
fs/cifs/dir.c:582: warning: passing argument 1 of 'memcpy' discards
qualifiers from pointer target type

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-15 05:41:54 +00:00
Stephen Hemminger
0599ad53fe sysfs: remove error messages for -EEXIST case
It is possible that the entry in sysfs already exists, one case of this is
when a network device is renamed to bonding_masters. Anyway, in this case
the proper error path is for device_rename to return an error code, not to
generate bogus backtrace and errors.

Also, to avoid possible races, the create link should be done before the
remove link. This makes a device rename atomic operation like other renames.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-14 22:34:16 -07:00
Linus Torvalds
8f40f672e6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of ssh://master.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: fix error path during early mount
  9p: make cryptic unknown error from server less scary
  9p: fix flags length in net
  9p: Correct fidpool creation failure in p9_client_create
  9p: use struct mutex instead of struct semaphore
  9p: propagate parse_option changes to client and transports
  fs/9p/v9fs.c (v9fs_parse_options): Handle kstrdup and match_strdup failure.
  9p: Documentation updates
  add match_strlcpy() us it to make v9fs make uname and remotename parsing more robust
2008-05-14 19:30:51 -07:00
Tiger Yang
7e01c8e542 ext3/4: fix uninitialized bs in ext3/4_xattr_set_handle()
This fix the uninitialized bs when we try to replace a xattr entry in
ibody with the new value which require more than free space.

This situation only happens we format ext3/4 with inode size more than 128 and
we have put xattr entries both in ibody and block.  The consequences about
this bug is we will lost the xattr block which pointed by i_file_acl with all
xattr entires in it.  We will alloc a new xattr block and put that large value
entry in it.  The old xattr block will become orphan block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00
Mingming Cao
772279c5f1 jbd: need to hold j_state_lock to updates to transaction t_state to T_COMMIT
Updating the current transaction's t_state is protected by j_state_lock.  We
need to do the same when updating the t_state to T_COMMIT.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00
Steve French
646dd53987 [CIFS] Fix paths when share is in DFS to include proper prefix
Some versions of Samba (3.2-pre e.g.) are stricter about checking to make sure that
paths in DFS name spaces are sent in the form \\server\share\dir\subdir ...
instead of \dir\subdir

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-15 01:50:56 +00:00
Eric Van Hensbergen
887b3ece65 9p: fix error path during early mount
There was some cleanup issues during early mount which would trigger
a kernel bug for certain types of failure.  This patch reorganizes the
cleanup to get rid of the bad behavior.

This also merges the 9pnet and 9pnet_fd modules for the purpose of
configuration and initialization.  Keeping the fd transport separate
from the core 9pnet code seemed like a good idea at the time, but in
practice has caused more harm and confusion than good.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-05-14 19:23:27 -05:00
Jim Meyering
ab31267dfe fs/9p/v9fs.c (v9fs_parse_options): Handle kstrdup and match_strdup failure. Now that this function can fail, return an int, diagnose other option-parsing failures, and adjust the sole caller: (v9fs_session_init): Handle kstrdup failure. Propagate any new v9fs_parse_options failure "up".
Signed-off-by: Jim Meyering <meyering@redhat.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-05-14 19:23:25 -05:00
Eric Van Hensbergen
ee443996a3 9p: Documentation updates
The kernel-doc comments of much of the 9p system have been in disarray since
reorganization.  This patch fixes those problems, adds additional documentation
and a template book which collects the 9p information.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-05-14 19:23:25 -05:00
Markus Armbruster
b32a09db4f add match_strlcpy() us it to make v9fs make uname and remotename parsing more robust
match_strcpy() is a somewhat creepy function: the caller needs to make sure
that the destination buffer is big enough, and when he screws up or
forgets, match_strcpy() happily overruns the buffer.

There's exactly one customer: v9fs_parse_options().  I believe it currently
can't overflow its buffer, but that's not exactly obvious.

The source string is a substing of the mount options.  The kernel silently
truncates those to PAGE_SIZE bytes, including the terminating zero.  See
compat_sys_mount() and do_mount().

The destination buffer is obtained from __getname(), which allocates from
name_cachep, which is initialized by vfs_caches_init() for size PATH_MAX.

We're safe as long as PATH_MAX <= PAGE_SIZE.  PATH_MAX is 4096.  As far as
I know, the smallest PAGE_SIZE is also 4096.

Here's a patch that makes the code a bit more obviously correct.  It
doesn't depend on PATH_MAX <= PAGE_SIZE.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Jim Meyering <meyering@redhat.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-05-14 19:23:25 -05:00
Jeff Layton
35fc37d517 add function to convert access flags to legacy open mode
SMBLegacyOpen always opens a file as r/w. This could be problematic
for files with ATTR_READONLY set. Have it interpret the access_mode
into a sane open mode.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-14 18:45:30 +00:00
Jeff Layton
e10f7b551d clarify return value of cifs_convert_flags()
cifs_convert_flags returns 0x20197 in the default case. It's not
immediately evident where that number comes from, so change it
to be an or'ed set of flags. The compiler will boil it down anyway.

(Thanks to Guenter Kukkukk for clarifying the flags).

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-14 18:44:35 +00:00
Valerie Clement
1930479c4b ext4: mballoc fix mb_normalize_request algorithm for 1KB block size filesystems
In case of inode preallocation, the number of blocks to allocate depends
on the file size and it is calculated in ext4_mb_normalize_request().
Each group in the filesystem is then checked to find one that can be
used for allocation; this is done in ext4_mb_good_group().

When a file bigger than 4MB is created, the requested number of blocks
to preallocate, calculated by ext4_mb_normalize_request is 4096.
However for a filesystem with 1KB block size, the maximum size of the
block buddies used by the multiblock allocator is 2048, so none of
groups in the filesystem satisfies the search criteria in
ext4_mb_good_group(). Scanning all the filesystem groups impacts
performance.

This was demonstrated by using a freshly created, 70GB, 1k block
filesystem, with caches dropped write before the test via
/proc/sys/vm/drop_caches, and with the filesystem mounted with
nodelalloc and nodealloc,nomballoc.  The time to write an 8 megabyte
file using "dd if=/dev/zero of=/mnt/test/fo bs=8k count=1k conv=fsync"
took 35.5091 seconds (236kB/s) with nodellaloc, and 0.233754 seconds
(35.9 MB/s) with the nodelloc,nomballoc options.  With a 1TB partition,
it took several minutes to write 8MB!

This patch modifies the algorithm in ext4_mb_normalize_group_request to
calculate the number of blocks to allocate by taking into account the
maximum size of free blocks chunks handled by the multiblock allocator.

It has also been tested for filesystems with 2KB and 4KB block sizes to
ensure that those cases don't regress.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:31:14 -04:00
Jan Kara
2c8be6b222 ext4: fix typos in messages and comments (journalled -> journaled)
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 21:27:55 -04:00
Jan Kara
0623543b33 ext4: fix synchronization of quota files in journal=data mode
In journal=data mode, it is not enough to do write_inode_now as done in
vfs_quota_on() to write all data to their final location (which is
needed for quota_read to work correctly).  Calling journal_flush() does
its job.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Jan Kara
cd59e7b978 ext4: Fix mount messages when quota disabled
When quota is disabled, we should not print 'journaled quota not
supported' when user tried to mount non-journaled quota. Also fix typo
in the message.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Jan Kara
dfc5d03f12 ext4: correct mount option parsing to detect when quota options can be changed
We should not allow user to change quota mount options when quota is
just suspended.  It would make mount options and internal quota state
inconsistent.  Also we should not allow user to change quota format when
quota is turned on.  On the other hand we can just silently ignore when
some option is set to the value it already has (mount does this on
remount).

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Steve French
77c57ec896 [CIFS] don't explicitly do a FindClose on rewind when directory search has ended
Do the following series of operations on a CIFS share:

    opendir(dir)
    readdir(dir)
    unlink(file in dir)
    rewinddir(dir)
    readdir(dir)

If the readdir read all entries in the directory this will make CIFS throw an error like this:

     CIFS VFS: Send error in FindClose = -9

CIFS requests "Close at end of search" of the server by setting this bit when issuing FindFirst or FindNext.  Therefore when all search entries are returned, the server may return "end of search" and close the search implicitly when this bit is set by the client on the request.  We check for this when a readdir is explicitly closed - but when the client notices that a directory has changed after the last operation, we attempt to close the directory before reopening by reissuing a second FindFirst. But, the directory may already been implicitly closed (due to end of search) because the first readdir finished. So we only want to issue a FindClose call in this case when we don't expect it to already be closed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-13 21:39:32 +00:00
Cyrill Gorcunov
43f14d856f eCryptFS: fix imbalanced mutex locking
Fix imbalanced calls for mutex lock/unlock on ecryptfs_daemon_hash_mux
Revealed by Ingo Molnar: http://lkml.org/lkml/2008/5/7/260

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:26 -07:00
Jean Delvare
f36f21ecca Fix misuses of bdevname()
bdevname() fills the buffer that it is given as a parameter, so calling
strcpy() or snprintf() on the returned value is redundant (and probably not
guaranteed to work - I don't think strcpy and snprintf support overlapping
buffers.)

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:26 -07:00
Miklos Szeredi
78bb6cb9a8 fuse: add flag to turn on big writes
Prior to 2.6.26 fuse only supported single page write requests.  In theory all
fuse filesystem should be able support bigger than 4k writes, as there's
nothing in the API to prevent it.  Unfortunately there's a known case in
NTFS-3G where big writes cause filesystem corruption.  There could also be
other filesystems, where the lack of testing with big write requests would
result in bugs.

To prevent such problems on a kernel upgrade, disable big writes by default,
but let filesystems set a flag to turn it on.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Szabolcs Szakacsits <szaka@ntfs-3g.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:26 -07:00
KOSAKI Motohiro
4cd1a8fc3d memcg: fix possible panic when CONFIG_MM_OWNER=y
When mm destruction happens, we should pass mm_update_next_owner() the old mm.
 But unfortunately new mm is passed in exec_mmap().

Thus, kernel panic is possible when a multi-threaded process uses exec().

Also, the owner member comment description is wrong.  mm->owner does not
necessarily point to the thread group leader.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: "Paul Menage" <menage@google.com>
Cc: "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:25 -07:00
Eric Sesterhenn
706322496b Fix hfsplus oops on image without extents
Fix an oops with a corrupted hfs+ image.

See http://bugzilla.kernel.org/show_bug.cgi?id=10548 for details.

Problem is that we call hfs_btree_open() from hfsplus_fill_super() to set
HFSPLUS_SB(sb).[ext_tree|cat_tree] Both trees are still NULL at this moment.
If hfs_btree_open() fails for any reason it calls iput() on the page, which
gets to hfsplus_releasepage() which tries to access HFSPLUS_SB(sb).* which is
still NULL and oopses while dereferencing it.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:24 -07:00
Serge E. Hallyn
289f8e27ed capabilities: add bounding set to /proc/self/status
There is currently no way to query the bounding set of another task.  As there
appears to be no security reason not to, and as Michael Kerrisk points out the
following valid reasons to do so exist:

* consistency (I can see all of the other per-thread/process sets in
  /proc/.../status)

* debugging -- I could imagine that it would make the job of debugging an
  application that uses capabilities a little simpler.

this patch adds the bounding set to /proc/self/status right after the
effective set.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:24 -07:00
Jan Kara
9377abd026 quota: don't call sync_fs() from vfs_quota_off() when there's no quota turn off
Sometimes, vfs_quota_off() is called on a partially set up super block (for
example when fill_super() fails for some reason).  In such cases we cannot
call ->sync_fs() because it can Oops because of not properly filled in super
block.  So in case we find there's not quota to turn off, we just skip
everything and return which fixes the above problem.

[akpm@linux-foundation.org: fxi tpyo]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:23 -07:00
Christoph Hellwig
bb45d64224 ufs: remove unneeded ufs_put_inode prototype
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:23 -07:00
Miklos Szeredi
8dc4e37362 ecryptfs: clean up (un)lock_parent
dget(dentry->d_parent) --> dget_parent(dentry)

unlock_parent() is racy and unnecessary.  Replace single caller with
unlock_dir().

There are several other suspect uses of ->d_parent in ecryptfs...

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:23 -07:00
Jeff Dike
46d7b522eb uml: move hppfs_kern.c to hppfs.c
There's no reason for the _kern in hppfs_kern.c, so move it to hppfs.c.

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:21 -07:00
Jeff Dike
a0612b1f0b uml: hppfs fixes
hppfs tidying and fixes noticed during hch's get_inode work -
      style fixes
      a copy_to_user got its return value checked
      hppfs_write no longer fiddles file->f_pos because it gets and
returns pos in its arguments
      hppfs_delete_inode dputs the underlyng procfs dentry stored in
its private data and mntputs the vfsmnt stashed in s_fs_info
      hppfs_put_super no longer needs to mntput the s_fs_info, so it
no longer needs to exist
      hppfs_readlink and hppfs_follow_link were doing a bunch of stuff
with a struct file which they didn't use
      there is now a ->permission which calls generic_permission
      get_inode was always returning 0 for some reason - it now
returns an inode if nothing bad happened

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-13 08:02:21 -07:00
Alexey Dobriyan
b2e03ca748 JFS: switch to seq_files
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
2008-05-13 08:22:10 -05:00
Steve French
582d21e5e3 [CIFS] cleanup old checkpatch warnings
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-13 04:54:12 +00:00
Marcin Slusarz
ed5f037005 [CIFS] CIFSSMBPosixLock should return -EINVAL on error
all other codepaths in this function return negative values on errors

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-13 04:01:01 +00:00
Jeff Layton
6353450a2d fix memory leak in CIFSFindNext
When CIFSFindNext gets back an -EBADF from a call, it sets the return
code of the function to 0 and eventually exits. Doing this makes the
cleanup at the end of the function skip freeing the SMB buffer, so
we need to make sure we free the buffer explicitly when doing this.

If we don't you end up with errors like this when unplugging the cifs
kernel module:

slab error in kmem_cache_destroy(): cache `cifs_request': Can't free all objects
 [<c046bdbf>] kmem_cache_destroy+0x61/0xf3
 [<e0f03045>] cifs_destroy_request_bufs+0x14/0x28 [cifs]
 [<e0f2016e>] exit_cifs+0x1e/0x80 [cifs]
 [<c043aeae>] sys_delete_module+0x192/0x1b8
 [<c04451fd>] audit_syscall_entry+0x14b/0x17d
 [<c0405413>] syscall_call+0x7/0xb
 =======================

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2008-05-13 03:06:13 +00:00
Jeff Layton
d0a9c078db [CIFS] CIFS currently allows for permissions to be changed on files, even
when unix extensions and cifsacl support are disabled. These
permissions changes are "ephemeral" however. They are lost whenever
a share is mounted and unmounted, or when memory pressure forces
the inode out of the cache.

Because of this, we'd like to introduce a behavior change to make
CIFS behave more like local DOS/Windows filesystems. When unix
extensions and cifsacl support aren't enabled, then don't silently
ignore changes to permission bits that can't be reflected on the
server.

Still, there may be people relying on the current behavior for
certain applications. This patch adds a new "dynperm" (and a
corresponding "nodynperm") mount option that will be intended
to make the client fall back to legacy behavior when setting
these modes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-12 22:23:49 +00:00
Marcin Slusarz
88f85a55c0 JFS: 0 is not valid errno value so return NULL from jfs_lookup
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: jfs-discussion@lists.sourceforge.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2008-05-12 16:42:43 -05:00
Linus Torvalds
542dafadd8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] don't allow demultiplex thread to exit until kthread_stop is called
  [CIFS] when not using unix extensions, check for and set ATTR_READONLY on create and mkdir
  [CIFS]  add local struct inode pointer to cifs_setattr
  [CIFS] cifs_find_tcp_session cleanup
2008-05-12 13:29:15 -07:00
Jean Delvare
00377d8e38 [GFS2] Prefer strlcpy() over snprintf()
strlcpy is faster than snprintf when you don't use the returned value.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-05-12 08:57:11 +01:00
Andrew Price
ad99f77778 [GFS2] Fix cast from unsigned int to s64
This fixes bz 444829 where allocating a new block caused gfs2 file systems to
report 0 bytes used in df. It was caused by a broken cast from an unsigned int
in gfs2_block_alloc() to a negative s64 in gfs2_statfs_change(). This patch
casts the unsigned int to an s64 before the unary minus is applied.

Signed-off-by: Andrew Price <andy@andrewprice.me.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-05-12 08:54:56 +01:00
Bob Peterson
091806edd4 [GFS2] filesystem consistency error from do_strip
This patch fixes a GFS2 filesystem consistency error reported from
function do_strip.  The problem was caused by a timing window
that allowed two vfs inodes to be created in memory that point
to the same file.  The problem is fixed by making the vfs's
iget_test, iget_set mechanism check and set a new bit in the
in-core gfs2_inode structure while the vfs inode spin_lock is held.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2008-05-12 08:54:53 +01:00
Linus Torvalds
c3921ab715 Add new 'cond_resched_bkl()' helper function
It acts exactly like a regular 'cond_resched()', but will not get
optimized away when CONFIG_PREEMPT is set.

Normal kernel code is already preemptable in the presense of
CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
02b67cc3ba "sched: do not do
cond_resched() when CONFIG_PREEMPT").

But when wanting to conditionally reschedule while holding a lock, you
need to use "cond_sched_lock(lock)", and the new function is the BKL
equivalent of that.

Also make fs/locks.c use it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-11 16:04:48 -07:00
Steve French
e691b9d1a0 [CIFS] don't allow demultiplex thread to exit until kthread_stop is called
cifs_demultiplex_thread can exit under several conditions:

1) if it's signaled
2) if there's a problem with session setup
3) if kthread_stop is called on it

The first two are problems. If kthread_stop is called on the thread,
there is no guarantee that it will still be up. We need to have the
thread stay up until kthread_stop is called on it.

One option would be to not even try to tear things down until after
kthread_stop is called. However, in the case where there is a problem
setting up the session, there's no real reason to try continuing the
loop.

This patch allows the thread to clean up and prepare for exit under all
three conditions, but it has the thread go to sleep until kthread_stop
is called. This allows us to simplify the shutdown code somewhat since
we can be reasonably sure that the thread won't exit after being
signaled but before kthread_stop is called.

It also removes the places where the thread itself set the tsk variable
since it appeared that it could have a potential race where the thread
might never be shut down.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-11 17:45:44 +00:00
Jeff Layton
67750fb9e0 [CIFS] when not using unix extensions, check for and set ATTR_READONLY on create and mkdir
When creating a directory on a CIFS share without POSIX extensions,
and the given mode has no write bits set, set the ATTR_READONLY bit.

When creating a file, set ATTR_READONLY if the create mode has no write
bits set and we're not using unix extensions.

There are some comments about this being problematic due to the VFS
splitting creates into 2 parts. I'm not sure what that's actually
talking about, but I'm assuming that it has something to do with how
mknod is implemented. In the simple case where we have no unix
extensions and we're just creating a regular file, there's no reason
we can't set ATTR_READONLY.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-11 17:45:43 +00:00
Jeff Layton
02eadeffda [CIFS] add local struct inode pointer to cifs_setattr
Clean up cifs_setattr a bit by adding a local inode pointer, and
changing all of the direntry->d_inode references to it. This also adds a
bit of micro-optimization. d_inode shouldn't change over the life of
this function, so we only need to dereference it once.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-11 17:45:43 +00:00
Cyrill Gorcunov
1b20d67218 [CIFS] cifs_find_tcp_session cleanup
This patch cleans up cifs_find_tcp_session so it become
less indented. Also the error of skipping IPv6 matched
addresses fixed.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-05-11 17:45:43 +00:00