Commit Graph

312065 Commits

Author SHA1 Message Date
Jan Kara
4fcf1c6205 mm: Make default vm_ops provide ->page_mkwrite handler
Make default vm_ops provide ->page_mkwrite handler. Currently it only updates
file's modification times and gets locked page but later it will also handle
filesystem freezing.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:48 +04:00
Jan Kara
41c4d25f78 mm: Update file times from fault path only if .page_mkwrite is not set
Filesystems wanting to properly support freezing need to have control
when file_update_time() is called. After pushing file_update_time()
to all relevant .page_mkwrite implementations we can just stop calling
file_update_time() when filesystem implements .page_mkwrite.

Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:48 +04:00
Jan Kara
14ae417c6f sysfs: Push file_update_time() into bin_page_mkwrite()
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:47 +04:00
Jan Kara
a63e9b2e76 gfs2: Push file_update_time() into gfs2_page_mkwrite()
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:46 +04:00
Jan Kara
120c2bcad8 9p: Push file_update_time() into v9fs_vm_page_mkwrite()
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:46 +04:00
Jan Kara
3ca9c3bd8a ceph: Push file_update_time() into ceph_page_mkwrite()
CC: Sage Weil <sage@newdream.net>
CC: ceph-devel@vger.kernel.org
Acked-by: Sage Weil <sage@newdream.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:45 +04:00
Jan Kara
5e8830dc85 fs: Push file_update_time() into __block_page_mkwrite()
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:44 +04:00
Jan Kara
183fef91cd fb_defio: Push file_update_time() into fb_deferred_io_mkwrite()
CC: Jaya Kumar <jayalk@intworks.biz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:44 +04:00
Al Viro
64894cf843 simplify lookup_open()/atomic_open() - do the temporary mnt_want_write() early
The write ref to vfsmount taken in lookup_open()/atomic_open() is going to
be dropped; we take the one to stay in dentry_open().  Just grab the temporary
in caller if it looks like we are going to need it (create/truncate/writable open)
and pass (by value) "has it succeeded" flag.  Instead of doing mnt_want_write()
inside, check that flag and treat "false" as "mnt_want_write() has just failed".
mnt_want_write() is cheap and the things get considerably simpler and more robust
that way - we get it and drop it in the same function, to start with, rather
than passing a "has something in the guts of really scary functions taken it"
back to caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 00:53:35 +04:00
Al Viro
f8310c5920 fix O_EXCL handling for devices
O_EXCL without O_CREAT has different semantics; it's "fail if already opened",
not "fail if already exists".  commit 71574865 broke that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-30 11:50:30 +04:00
Al Viro
bf8848918d lockd: handle lockowner allocation failure in nlmclnt_proc()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 23:17:39 +04:00
Al Viro
446945ab9a lockd: shift grabbing a reference to nlm_host into nlm_alloc_call()
It's used both for client and server hosts; we can't do nlmclnt_release_host()
on failure exits, since the host might need nlmsvc_release_host(), with BUG_ON()
for calling the wrong one.  Makes life simpler for callers, actually...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 23:09:57 +04:00
Kees Cook
a51d9eaa41 fs: add link restriction audit reporting
Adds audit messages for unexpected link restriction violations so that
system owners will have some sort of potentially actionable information
about misbehaving processes.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:43:08 +04:00
Kees Cook
800179c9b8 fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.

Symlinks:

A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp

The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.

Some pointers to the history of earlier discussion that I could find:

 1996 Aug, Zygo Blaxell
  http://marc.info/?l=bugtraq&m=87602167419830&w=2
 1996 Oct, Andrew Tridgell
  http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
 1997 Dec, Albert D Cahalan
  http://lkml.org/lkml/1997/12/16/4
 2005 Feb, Lorenzo Hernández García-Hierro
  http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
 2010 May, Kees Cook
  https://lkml.org/lkml/2010/5/30/144

Past objections and rebuttals could be summarized as:

 - Violates POSIX.
   - POSIX didn't consider this situation and it's not useful to follow
     a broken specification at the cost of security.
 - Might break unknown applications that use this feature.
   - Applications that break because of the change are easy to spot and
     fix. Applications that are vulnerable to symlink ToCToU by not having
     the change aren't. Additionally, no applications have yet been found
     that rely on this behavior.
 - Applications should just use mkstemp() or O_CREATE|O_EXCL.
   - True, but applications are not perfect, and new software is written
     all the time that makes these mistakes; blocking this flaw at the
     kernel is a single solution to the entire class of vulnerability.
 - This should live in the core VFS.
   - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
 - This should live in an LSM.
   - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)

Hardlinks:

On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.

The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.

Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].

This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279

This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:37:58 +04:00
Jeff Layton
3134f37e93 vfs: don't let do_last pass negative dentry to audit_inode
I can reliably reproduce the following panic by simply setting an audit
rule on a recent 3.5.0+ kernel:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
 IP: [<ffffffff810d1250>] audit_copy_inode+0x10/0x90
 PGD 7acd9067 PUD 7b8fb067 PMD 0
 Oops: 0000 [#86] SMP
 Modules linked in: nfs nfs_acl auth_rpcgss fscache lockd sunrpc tpm_bios btrfs zlib_deflate libcrc32c kvm_amd kvm joydev virtio_net pcspkr i2c_piix4 floppy virtio_balloon microcode virtio_blk cirrus drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
 CPU 0
 Pid: 1286, comm: abrt-dump-oops Tainted: G      D      3.5.0+ #1 Bochs Bochs
 RIP: 0010:[<ffffffff810d1250>]  [<ffffffff810d1250>] audit_copy_inode+0x10/0x90
 RSP: 0018:ffff88007aebfc38  EFLAGS: 00010282
 RAX: 0000000000000000 RBX: ffff88003692d860 RCX: 00000000000038c4
 RDX: 0000000000000000 RSI: ffff88006baf5d80 RDI: ffff88003692d860
 RBP: ffff88007aebfc68 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
 R13: ffff880036d30f00 R14: ffff88006baf5d80 R15: ffff88003692d800
 FS:  00007f7562634740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000040 CR3: 000000003643d000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process abrt-dump-oops (pid: 1286, threadinfo ffff88007aebe000, task ffff880079614530)
 Stack:
  ffff88007aebfdf8 ffff88007aebff28 ffff88007aebfc98 ffffffff81211358
  ffff88003692d860 0000000000000000 ffff88007aebfcc8 ffffffff810d4968
  ffff88007aebfcc8 ffff8800000038c4 0000000000000000 0000000000000000
 Call Trace:
  [<ffffffff81211358>] ? ext4_lookup+0xe8/0x160
  [<ffffffff810d4968>] __audit_inode+0x118/0x2d0
  [<ffffffff811955a9>] do_last+0x999/0xe80
  [<ffffffff81191fe8>] ? inode_permission+0x18/0x50
  [<ffffffff81171efa>] ? kmem_cache_alloc_trace+0x11a/0x130
  [<ffffffff81195b4a>] path_openat+0xba/0x420
  [<ffffffff81196111>] do_filp_open+0x41/0xa0
  [<ffffffff811a24bd>] ? alloc_fd+0x4d/0x120
  [<ffffffff811855cd>] do_sys_open+0xed/0x1c0
  [<ffffffff810d40cc>] ? __audit_syscall_entry+0xcc/0x300
  [<ffffffff811856c1>] sys_open+0x21/0x30
  [<ffffffff81611ca9>] system_call_fastpath+0x16/0x1b
  RSP <ffff88007aebfc38>
 CR2: 0000000000000040

The problem is that do_last is passing a negative dentry to audit_inode.
The comments on lookup_open note that it can pass back a negative dentry
if O_CREAT is not set.

This patch fixes the oops, but I'm not clear on whether there's a better
approach.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:27:03 +04:00
Al Viro
0b5306b329 brcm80211: pointless current->files passed to filp_close()
... only needed if it's been in descriptor table

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:22 +04:00
Al Viro
586093064d sound_firmware: don't pass crap to filp_close()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:22 +04:00
Al Viro
20818a0caa gadgetfs: clean up
sigh...
* opened files have non-NULL dentries and non-NULL inodes
* close_filp() needs current->files only if the file had been
in descriptor table.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:21 +04:00
Al Viro
09fada5b5f slightly reduce lossage in gdm72xx
* filp_close() needs non-NULL second argument only if it'd been in descriptor
table
* opened files have non-NULL dentries, TYVM
* ... and those dentries are positive - it's kinda hard to open a file that
doesn't exist.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:20 +04:00
Al Viro
32aecdd365 slightly reduce idiocy in drivers/staging/bcm/Misc.c
a) vfs_llseek() does *not* access userland pointers of any kind
b) neither does filp_close(), for that matter
c) ... nor filp_open()
d) vfs_read() does, but we do have a wrapper for that (kernel_read()),
so there's no need to reinvent it.
e) passing current->files to filp_close() on something that never
had been in descriptor table is pointless.

ISAGN: voodoo dolls to be used on voodoo programmers...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:20 +04:00
Al Viro
e4fad8e5d2 consolidate pipe file creation
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:19 +04:00
Al Viro
b5bcdda327 take grabbing f->f_path to do_dentry_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:18 +04:00
Al Viro
5c33b183a3 uninline file_free_rcu()
What inline?  Its only use is passing its address to call_rcu(), for fuck sake!

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:17 +04:00
Al Viro
0b1d90119a ecryptfs_lookup_interpose(): allocate dentry_info first
less work on failure that way

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:17 +04:00
Al Viro
bc65a1215e sanitize ecryptfs_lookup()
* ->lookup() never gets hit with . or ..
* dentry it gets is unhashed, so unless we had gone and hashed it ourselves, there's
no need to d_drop() the sucker.
* wrong name printed in one of the printks (NULL, in fact)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:16 +04:00
Al Viro
faf0201029 clean unix_bind() up a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:15 +04:00
Al Viro
a8104a9fcd pull mnt_want_write()/mnt_drop_write() into kern_path_create()/done_path_create() resp.
One side effect - attempt to create a cross-device link on a read-only fs fails
with EROFS instead of EXDEV now.  Makes more sense, POSIX allows, etc.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:15 +04:00
Al Viro
8e4bfca1d1 mknod: take sanity checks on mode into the very beginning
Note that applying umask can't affect their results.  While
that affects errno in cases like
	mknod("/no_such_directory/a", 030000)
yielding -EINVAL (due to impossible mode_t) instead of
-ENOENT (due to inexistent directory), IMO that makes a lot
more sense, POSIX allows to return either and any software
that relies on getting -ENOENT instead of -EINVAL in that
case deserves everything it gets.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:14 +04:00
Al Viro
921a1650de new helper: done_path_create()
releases what needs to be released after {kern,user}_path_create()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:13 +04:00
Al Viro
25b2692a8a pull unlock+dput() out into do_spu_create()
... and cleaning spufs_create() a bit, while we are at it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:13 +04:00
Al Viro
1ba44cc970 spufs: pull unlock-and-dput() up into spufs_create()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:12 +04:00
Al Viro
66ec7b2cd0 spufs_create_context(): simplify failure exits
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:11 +04:00
Al Viro
67cba9fd64 move spu_forget() into spufs_rmdir()
now that __fput() is *not* done in any callchain containing mmput(),
we can do that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:11 +04:00
Al Viro
8cae6f7158 ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:55 +04:00
Al Viro
11e62a8fab btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:43 +04:00
Al Viro
765927b2d5 switch dentry_open() to struct path, make it grab references itself
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:29 +04:00
Al Viro
bf349a4470 spufs: shift dget/mntget towards dentry_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:17 +04:00
Al Viro
3b6456d2c3 zoran: don't bother with struct file * in zoran_map
all we need it for is file->private_data, which is assign-once, already
assigned by that point and, incidentally, its value is already in use
by zoran ->mmap() anyway.  So just store that pointer instead...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:10 +04:00
Al Viro
3b8b487114 ecryptfs: don't reinvent the wheels, please - use struct completion
... and keep the sodding requests on stack - they are small enough.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:02 +04:00
Al Viro
8fc37ec54c don't expose I_NEW inodes via dentry->d_inode
d_instantiate(dentry, inode);
	unlock_new_inode(inode);

is a bad idea; do it the other way round...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:58 +04:00
Al Viro
32a7991b6a tidy up namei.c a bit
locking/unlocking for rcu walk taken to a couple of inline helpers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:55 +04:00
Al Viro
3c0a616368 unobfuscate follow_up() a bit
really convoluted test in there has grown up during struct mount
introduction; what it checks is that we'd reached the root of
mount tree.
2012-07-23 00:00:45 +04:00
Eric Sandeen
de9b942202 ext3: pass custom EOF to generic_file_llseek_size()
Use the new custom EOF argument to generic_file_llseek_size so
that SEEK_END will go to the max hash value for htree dirs
in ext3 rather than to i_size_read()

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:30 +04:00
Eric Sandeen
ec7268ce21 ext4: use core vfs llseek code for dir seeks
Use the new functionality in generic_file_llseek_size() to
accept a custom EOF position, and un-cut-and-paste all the
vfs llseek code from ext4.

Also fix up comments on ext4_llseek() to reflect reality.

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:28 +04:00
Eric Sandeen
e8b96eb503 vfs: allow custom EOF in generic_file_llseek code
For ext3/4 htree directories, using the vfs llseek function with
SEEK_END goes to i_size like for any other file, but in reality
we want the maximum possible hash value.  Recent changes
in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
but replicating this core code seems like a bad idea, especially
since the copy has already diverged from the vfs.

This patch updates generic_file_llseek_size to accept
both a custom maximum offset, and a custom EOF position.  With this
in place, ext4_dir_llseek can pass in the appropriate maximum hash
position for both maxsize and eof, and get what it wants.

As far as I know, this does not fix any bugs - nfs in the kernel
doesn't use SEEK_END, and I don't know of any user who does.  But
some ext4 folks seem keen on doing the right thing here, and I can't
really argue.

(Patch also fixes up some comments slightly)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:15 +04:00
Jan Kara
4ea425b63a vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
wakeup_flusher_threads(0) will queue work doing complete writeback for each
flusher thread. Thus there is not much point in submitting another work doing
full inode WB_SYNC_NONE writeback by writeback_inodes_sb().

After this change it does not make sense to call nonblocking ->sync_fs and
block device flush before calling sync_inodes_sb() because
wakeup_flusher_threads() is completely asynchronous and thus these functions
would be called in parallel with inode writeback running which will effectively
void any work they do. So we move sync_inodes_sb() call before these two
functions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:59:01 +04:00
Jan Kara
d0e91b13eb vfs: Remove unnecessary flushing of block devices
It is not necessary to write block devices twice. The reason why we first did
flush and then proper sync is that
  for_each_bdev() {
    write_bdev()
    wait_for_completion()
  }
is much slower than
  for_each_bdev()
    write_bdev()
  for_each_bdev()
    wait_for_completion()
when there is bigger amount of data. But as is seen in the above, there's no real
need to scan pages and submit them twice. We just need to separate the submission
and waiting part. This patch does that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:53 +04:00
Jan Kara
a8c7176b6d vfs: Make sys_sync writeout also block device inodes
In case block device does not have filesystem mounted on it, sys_sync will just
ignore it and doesn't writeout its dirty pages. This is because writeback code
avoids writing inodes from superblock without backing device and
blockdev_superblock is such a superblock.  Since it's unexpected that sync
doesn't writeout dirty data for block devices be nice to users and change the
behavior to do so. So now we iterate over all block devices on blockdev_super
instead of iterating over all superblocks when syncing block devices.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:49 +04:00
Jan Kara
5c0d6b60a0 vfs: Create function for iterating over block devices
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:45 +04:00
Jan Kara
b3de653105 vfs: Reorder operations during sys_sync
Change the order of operations during sync from

for_each_sb {
        writeback_inodes_sb();
        sync_fs(nowait);
        __sync_blockdev(nowait);
}
for_each_sb {
        sync_inodes_sb();
        sync_fs(wait);
        __sync_blockdev(wait);
}

to

for_each_sb
        writeback_inodes_sb();
for_each_sb
        sync_fs(nowait);
for_each_sb
        __sync_blockdev(nowait);
for_each_sb
        sync_inodes_sb();
for_each_sb
        sync_fs(wait);
for_each_sb
        __sync_blockdev(wait);

This is a preparation for the following patches in this series.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:41 +04:00