Commit Graph

22810 Commits

Author SHA1 Message Date
Tristan Ye
99e4c75041 Ocfs2/move_extents: helper to validate and adjust moving goal.
First best-effort attempt to validate and adjust the goal (physical address in
block), while it can't guarantee later operation can succeed all the time since
global_bitmap may change a bit over time.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:10 +08:00
Tristan Ye
1c06b91261 Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.
This function tries locate the right alloc group, where a given physical block
resides, it returns the caller a buffer_head of victim group descriptor, and also
the offset of block in this group, by passing the block number.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:10 +08:00
Tristan Ye
202ee5facb Ocfs2/move_extents: defrag a range of extent.
It's a relatively complete function to accomplish defragmentation for entire
or partial extent, one journal handle was kept during the operation, it was
logically doing one more thing than ocfs2_move_extent() acutally, yes, it's
claiming the new clusters itself;-)

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye
8f603e567a Ocfs2/move_extents: move a range of extent.
The moving range of __ocfs2_move_extent() was within one extent always, it
consists following parts:

1. Duplicates the clusters in pages to new_blkoffset, where extent to be moved.

2. Split the original extent with new extent, coalecse the nearby extents if possible.

3. Append old clusters to truncate log, or decrease_refcount if the extent was refcounted.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye
de474ee8bb Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.
ocfs2_lock_allocators_move_extents() was like the common ocfs2_lock_allocators(),
to lock metadata and data alloctors during extents moving, reserve appropriate
metadata blocks and data clusters, also performa a best- effort to calculate the
credits for journal transaction in one run of movement.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye
028ba5df63 Ocfs2/move_extents: Add basic framework and source files for extent moving.
Adding new files move_extents.[c|h] and fill it with nothing but
only a context structure.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye
220ebc4334 Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
Patch also manages to add a manipulative struture for this ioctl.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye
3e19a25e05 Ocfs2/refcounttree: Publicize couple of funcs from refcounttree.c
The original goal of commonizing these funcs is to benefit defraging/extent_moving
codes in the future,  based on the fact that reflink and defragmentation having
the same Copy-On-Wrtie mechanism.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye
d24a10b9f8 Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.
This new code is a bit more complicated than former ones, the goal is to
show user all statistics required to take a deep insight into filesystem
on how the disk is being fragmentaed.

The goal is achieved by scaning global bitmap from (cluster)group to group
to figure out following factors in the filesystem:

        - How many free chunks in a fixed size as user requested.
        - How many real free chunks in all size.
        - Min/Max/Avg size(in) clusters of free chunks.
        - How do free chunks distribute(in size) in terms of a histogram,
          just like following:
          ---------------------------------------------------------
          Extent Size Range :  Free extents  Free Clusters  Percent
             32K...   64K-  :             1             1    0.00%
              1M...    2M-  :             9           288    0.03%
              8M...   16M-  :             2           831    0.09%
             32M...   64M-  :             1          2047    0.23%
            128M...  256M-  :             1          8191    0.92%
            256M...  512M-  :             2         21706    2.43%
            512M... 1024M-  :            27        858623   96.29%
          ---------------------------------------------------------

Userspace ioctl() call eventually gets the above info returned by passing
a 'struct ocfs2_info_freefrag' with the chunk_size being specified first.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:18:07 +08:00
Tristan Ye
3e5db17d4d Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.
The new code is dedicated to calculate free inodes number of all inode_allocs,
then return the info to userpace in terms of an array.

Specially, flag 'OCFS2_INFO_FL_NON_COHERENT', manipulated by '--cluster-coherent'
from userspace, is now going to be involved. setting the flag on means no cluster
coherency considered, usually, userspace tools choose none-coherency strategy by
default for the sake of performace.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:18:02 +08:00
Tristan Ye
8aa1fa360d Ocfs2: Using inline funcs to set/clear *FILLED* flags in info handler.
It just removes some macros for the sake of typechecking gains.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:17:18 +08:00
Aditya Kali
ae81230686 ext4: reserve inodes and feature code for 'quota' feature
I am working on patch to add quota as a built-in feature for ext4
filesystem. The implementation is based on the design given at
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4.
This patch reserves the inode numbers 3 and 4 for quota purposes and
also reserves EXT4_FEATURE_RO_COMPAT_QUOTA feature code.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 19:00:39 -04:00
Johann Lombardi
c5e06d101a ext4: add support for multiple mount protection
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated is displayed.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:31:25 -04:00
Eric W. Biederman
62ca24baf1 ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
Spotted-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-05-24 15:30:33 -07:00
Kazuya Mio
d02a9391f7 ext4: ensure f_bfree returned by ext4_statfs() is non-negative
I found the issue that the number of free blocks went negative.
# stat -f /mnt/mp1/
  File: "/mnt/mp1/"
    ID: e175ccb83a872efe Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 258022     Free: -15        Available: -13122
Inodes: Total: 65536      Free: 63029

f_bfree in struct statfs will go negative when the filesystem has
few free blocks. Because the number of dirty blocks is bigger than
the number of free blocks in the following two cases.

CASE 1:
ext4_da_writepages
  mpage_da_map_and_submit
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
        <--- interrupt statfs systemcall --->
        ext4_da_update_reserve_space
            percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                            used + ei->i_allocated_meta_blocks);

CASE 2:
ext4_write_begin
  __block_write_begin
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
            <--- interrupt statfs systemcall --->
            percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);

To avoid the issue, this patch ensures that f_bfree is non-negative.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
2011-05-24 18:30:07 -04:00
Lukas Czerner
28739eea9c ext4: protect bb_first_free in ext4_trim_all_free() with group lock
We should protect reading bd_info->bb_first_free with the group lock
because otherwise we might miss some free blocks. This is not a big deal
at all, but the change to do right thing is really simple, so lets do
that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:28:07 -04:00
Lukas Czerner
7894408666 ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
Currently we are loading buddy ext4_mb_load_buddy() for every block
group we are going through in ext4_trim_fs() in many cases just to find
out that there is not enough space to be bothered with. As Amir Goldstein
suggested we can use bb_free information directly from ext4_group_info.

This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and rather
get the ext4_group_info via ext4_get_group_info() and use the bb_free
information directly from that. This avoids unnecessary call to load
buddy in the case the group does not have enough free space to trim.
Loading buddy is now moved to ext4_trim_all_free().

Tested by me with xfstests 251.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:16:27 -04:00
Linus Torvalds
dc522adbee Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  jbd: Fix comment to match the code in journal_start()
  jbd/jbd2: remove obsolete summarise_journal_usage.
  jbd: Fix forever sleeping process in do_get_write_access()
  ext2: fix error msg when mounting fs with too-large blocksize
  jbd: fix fsync() tid wraparound bug
  ext3: Fix fs corruption when make_indexed_dir() fails
  ext3: Fix lock inversion in ext3_symlink()
2011-05-24 15:11:46 -07:00
Linus Torvalds
df3256f9ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: make plock operation killable
  dlm: remove shared message stub for recovery
  dlm: delayed reply message warning
  dlm: Remove superfluous call to recalc_sigpending()
2011-05-24 15:04:00 -07:00
Eryu Guan
c867516de5 jbd2: Fix comment to match the code in jbd2__journal_start()
jbd2__journal_start() returns an ERR_PTR() value rather than NULL on
failure.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 17:09:58 -04:00
John W. Linville
31ec97d9ce Merge ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem 2011-05-24 16:47:54 -04:00
Linus Torvalds
b0ca118dba Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits)
  TOMOYO: Fix wrong domainname validation.
  SELINUX: add /sys/fs/selinux mount point to put selinuxfs
  CRED: Fix load_flat_shared_library() to initialise bprm correctly
  SELinux: introduce path_has_perm
  flex_array: allow 0 length elements
  flex_arrays: allow zero length flex arrays
  flex_array: flex_array_prealloc takes a number of elements, not an end
  SELinux: pass last path component in may_create
  SELinux: put name based create rules in a hashtable
  SELinux: generic hashtab entry counter
  SELinux: calculate and print hashtab stats with a generic function
  SELinux: skip filename trans rules if ttype does not match parent dir
  SELinux: rename filename_compute_type argument to *type instead of *con
  SELinux: fix comment to state filename_compute_type takes an objname not a qstr
  SMACK: smack_file_lock can use the struct path
  LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH
  LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE
  SELINUX: Make selinux cache VFS RCU walks safe
  SECURITY: Move exec_permission RCU checks into security modules
  SELinux: security_read_policy should take a size_t not ssize_t
  ...
2011-05-24 13:38:19 -07:00
Sage Weil
db3540522e ceph: fix cap flush race reentrancy
In e9964c10 we change cap flushing to do a delicate dance because some
inodes on the cap_dirty list could be in a migrating state (got EXPORT but
not IMPORT) in which we couldn't actually flush and move from
dirty->flushing, breaking the while (!empty) { process first } loop
structure.  It worked for a single sync thread, but was not reentrant and
triggered infinite loops when multiple syncers came along.

Instead, move inodes with dirty to a separate cap_dirty_migrating list
when in the limbo export-but-no-import state, allowing us to go back to
the simple loop structure (which was reentrant).  This is cleaner and more
robust.

Audited the cap_dirty users and this looks fine:
list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
have dirty caps (which list we're on is irrelevant) and list_del_init()
calls still do the right thing.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:12 -07:00
Sage Weil
45e3d3eeb6 ceph: avoid inode lookup on nfs fh reconnect
If we get the inode from the MDS, we have a reference in req; don't do a
fresh lookup.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:06 -07:00
Sage Weil
3c454cf216 ceph: use LOOKUPINO to make unconnected nfs fh more reliable
If we are unable to locate an inode by ino, ask the MDS using the new
LOOKUPINO command.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:05 -07:00
Linus Torvalds
eb08d8ff47 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (52 commits)
  UBIFS: switch to dynamic printks
  UBIFS: fix kernel-doc comments
  UBIFS: fix extremely rare mount failure
  UBIFS: simplify LEB recovery function further
  UBIFS: always cleanup the recovered LEB
  UBIFS: clean up LEB recovery function
  UBIFS: fix-up free space on mount if flag is set
  UBIFS: add the fixup function
  UBIFS: add a superblock flag for free space fix-up
  UBIFS: share the next_log_lnum helper
  UBIFS: expect corruption only in last journal head LEBs
  UBIFS: synchronize write-buffer before switching to the next bud
  UBIFS: remove BUG statement
  UBIFS: change bud replay function conventions
  UBIFS: substitute the replay tree with a replay list
  UBIFS: simplify replay
  UBIFS: store free and dirty space in the bud replay entry
  UBIFS: remove unnecessary stack variable
  UBIFS: double check that buds are replied in order
  UBIFS: make 2 functions static
  ...
2011-05-24 11:51:07 -07:00
Christoph Hellwig
55a7bc5a30 xfs: do not discard alloc btree blocks
Blocks for the allocation btree are allocated from and released to
the AGFL, and thus frequently reused.  Even worse we do not have an
easy way to avoid using an AGFL block when it is discarded due to
the simple FILO list of free blocks, and thus can frequently stall
on blocks that are currently undergoing a discard.

Add a flag to the busy extent tracking structure to skip the discard
for allocation btree blocks.  In normal operation these blocks are
reused frequently enough that there is no need to discard them
anyway, but if they spill over to the allocation btree as part of a
balance we "leak" blocks that we would otherwise discard.  We could
fix this by adding another flag and keeping these block in the
rbtree even after they aren't busy any more so that we could discard
them when they migrate out of the AGFL.  Given that this would cause
significant overhead I don't think it's worthwile for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-24 11:17:22 -05:00
Christoph Hellwig
e84661aa84 xfs: add online discard support
Now that we have reliably tracking of deleted extents in a
transaction we can easily implement "online" discard support
which calls blkdev_issue_discard once a transaction commits.

The actual discard is a two stage operation as we first have
to mark the busy extent as not available for reuse before we
can start the actual discard.  Note that we don't bother
supporting discard for the non-delaylog mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-24 11:17:13 -05:00
Jan Kara
93628ffb9b ext4: fix waiting and sending of a barrier in ext4_sync_file()
jbd2_log_start_commit() returns 1 only when we really start a
transaction.  But we also need to wait for a transaction when the
commit is already running.  Fix this problem by waiting for
transaction commit unconditionally (which is just a quick check if the
transaction is already committed).

Also we have to be more careful with sending of a barrier because when
transaction is being committed in parallel to ext4_sync_file()
running, we cannot be sure that the barrier the journalling code sends
happens after we wrote all the data for fsync (note that not every
data writeout needs to trigger metadata changes thus commit of some
metadata changes can be running while other data is still written
out). So use jbd2_will_send_data_barrier() helper to detect the common
cases when we can be sure barrier will be issued by the commit code
and issue the barrier ourselves in the remaining cases.

Reported-by: Edward Goggin <egoggin@vmware.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 12:00:54 -04:00
Jan Kara
bbd2be3691 jbd2: Add function jbd2_trans_will_send_data_barrier()
Provide a function which returns whether a transaction with given tid
will send a flush to the filesystem device.  The function will be used
by ext4 to detect whether fsync needs to send a separate flush or not.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:59:18 -04:00
Jan Kara
81be12c817 jbd2: fix sending of data flush on journal commit
In data=ordered mode, it's theoretically possible (however rare) that
an inode is filed to transaction's t_inode_list and a flusher thread
writes all the data and inode is reclaimed before the transaction
starts to commit.  In such a case, we could erroneously omit sending a
flush to file system device when it is different from the journal
device (because data can still be in disk cache only).

Fix the problem by setting a flag in a transaction when some inode is added
to it and then send disk flush in the commit code when the flag is set.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:52:40 -04:00
Yongqiang Yang
b221349fa8 ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly
To get delayed-extent information, ext4_ext_fiemap_cb() looks up
pagecache, it thus collects information starting from a page's
head block.

If blocksize < pagesize, the beginning blocks of a page may lies
before the request range. So ext4_ext_fiemap_cb() should proceed
ignoring them, because they has been handled before. If no mapped
buffer in the range is found in the 1st page, we need to look up
the 2nd page, otherwise delayed-extents after a hole will be ignored.

Without this patch, xfstests 225 will hung on ext4 with 1K block.

Reported-by: Amir Goldstein <amir73il@users.sourceforge.net>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:36:58 -04:00
James Morris
434d42cfd0 Merge branch 'next' into for-linus 2011-05-24 22:55:24 +10:00
Robin Dong
98ba073c60 ocfs2: change incorrect 'extern' keyword to 'static' in dlmfs
Change function param_set_dlmfs_capabilities from 'extern' to 'static' since
function param_get_dlmfs_capabilities is also 'static'.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:59:40 -07:00
Sunil Mushran
9f62e96084 ocfs2/dlm: dlm_is_lockres_migrateable() returns boolean
Patch cleans up the gunk added by commit 388c4bcb4e.
dlm_is_lockres_migrateable() now returns 1 if lockresource is deemed
migrateable and 0 if not.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:39 -07:00
Tao Ma
10fca35ff1 ocfs2: Add trace event for trim.
Add the corresponding trace event for trim.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:20 -07:00
Tao Ma
55e67872b6 ocfs2: Add FITRIM ioctl.
Add the corresponding ioctl function for FITRIM.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:19 -07:00
Tao Ma
e80de36d8d ocfs2: Add ocfs2_trim_fs for SSD trim support.
Add ocfs2_trim_fs to support trimming freed clusters in the
volume. A range will be given and all the freed clusters greater
than minlen will be discarded to the block layer.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:18 -07:00
Amerigo Wang
69a60c4d17 ocfs2: remove the /sys/o2cb symlink
It is obsoleted since Dec 2005.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:14 -07:00
Tiger Yang
e2b0c215c2 ocfs2: clean up mount option about atime in ocfs2.txt
As ocfs2 supports relatime and strictatime, we need update the
relative document. Atime_quantum need work with strictatime, so only
show it in procfs when mount with strictatime.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:12 -07:00
Linus Torvalds
343800e7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
  fat: Fix statfs->f_namelen
  fat: Replace all printk with fat_msg()
  fat: Add fat_msg() function for preformated FAT messages
  fat: Convert fat_fs_error to use %pV
  fat: Fix possible null deref in fat_cache_add()
  fat: use new setup() for ->dir_ops too
2011-05-23 21:11:38 -07:00
Eryu Guan
c2b67735e5 jbd: Fix comment to match the code in journal_start()
journal_start returns an ERR_PTR() value rather than NULL on failure.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-05-24 00:27:53 +02:00
Linus Torvalds
a77febbef1 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: obey minleft values during extent allocation correctly
  xfs: reset buffer pointers before freeing them
  xfs: avoid getting stuck during async inode flushes
  xfs: fix xfs_itruncate_start tracing
  xfs: fix duplicate workqueue initialisation
  xfs: kill off xfs_printk()
  xfs: fix race condition in AIL push trigger
  xfs: make AIL target updates and compares 32bit safe.
  xfs: always push the AIL to the target
  xfs: exit AIL push work correctly when AIL is empty
  xfs: ensure reclaim cursor is reset correctly at end of AG
  xfs: add an x86 compat handler for XFS_IOC_ZERO_RANGE
  xfs: fix compiler warning in xfs_trace.h
  xfs: cleanup duplicate initializations
  xfs: reduce the number of pagb_lock roundtrips in xfs_alloc_clear_busy
  xfs: exact busy extent tracking
  xfs: do not immediately reuse busy extent ranges
  xfs: optimize AGFL refills
2011-05-23 15:19:16 -07:00
Linus Torvalds
99dff58562 Merge branch 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (48 commits)
  serial: 8250_pci: add support for Cronyx Omega PCI multiserial board.
  tty/serial: Fix break handling for PORT_TEGRA
  tty/serial: Add explicit PORT_TEGRA type
  n_tracerouter and n_tracesink ldisc additions.
  Intel PTI implementaiton of MIPI 1149.7.
  Kernel documentation for the PTI feature.
  export kernel call get_task_comm().
  tty: Remove to support serial for S5P6442
  pch_phub: Support new device ML7223
  8250_pci: Add support for the Digi/IBM PCIe 2-port Adapter
  ASoC: Update cx20442 for TTY API change
  pch_uart: Support new device ML7223 IOH
  parport: Use request_muxed_region for IT87 probe and lock
  tty/serial: add support for Xilinx PS UART
  n_gsm: Use print_hex_dump_bytes
  drivers/tty/moxa.c: Put correct tty value
  TTY: tty_io, annotate locking functions
  TTY: serial_core, remove superfluous set_task_state
  TTY: serial_core, remove invalid test
  Char: moxa, fix locking in moxa_write
  ...

Fix up trivial conflicts in drivers/bluetooth/hci_ldisc.c and
drivers/tty/serial/Makefile.

I did the hci_ldisc thing as an evil merge, cleaning things up.
2011-05-23 12:23:20 -07:00
Theodore Ts'o
072bd7ea74 ext4: use truncate_setsize() unconditionally
In commit c8d46e41 (ext4: Add flag to files with blocks intentionally
past EOF), if the EOFBLOCKS_FL flag is set, we call ext4_truncate()
before calling vmtruncate().  This caused any allocated but unwritten
blocks created by calling fallocate() with the FALLOC_FL_KEEP_SIZE
flag to be dropped.  This was done to make to make sure that
EOFBLOCKS_FL would not be cleared while still leaving blocks past
i_size allocated.  This was not necessary, since ext4_truncate()
guarantees that blocks past i_size will be dropped, even in the case
where truncate() has increased i_size before calling ext4_truncate().

So fix this by removing the EOFBLOCKS_FL special case treatment in
ext4_setattr().  In addition, use truncate_setsize() followed by a
call to ext4_truncate() instead of using vmtruncate().  This is more
efficient since it skips the call to inode_newsize_ok(), which has
been checked already by inode_change_ok().  This is also in a win in
the case where EOFBLOCKS_FL is set since it avoids calling
ext4_truncate() twice.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-23 15:13:02 -04:00
Linus Torvalds
30cb6d5f2e Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimers: Reorder clock bases
  hrtimers: Avoid touching inactive timer bases
  hrtimers: Make struct hrtimer_cpu_base layout less stupid
  timerfd: Manage cancelable timers in timerfd
  clockevents: Move C3 stop test outside lock
  alarmtimer: Drop device refcount after rtc_open()
  alarmtimer: Check return value of class_find_device()
  timerfd: Allow timers to be cancelled when clock was set
  hrtimers: Prepare for cancel on clock was set timers
2011-05-23 11:30:28 -07:00
Namhyung Kim
825cdcb1a5 splice: add wakeup_pipe_readers()
Add and use wakeup_pipe_readers() to consolidate duplicated codes.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 19:58:53 +02:00
Linus Torvalds
57d19e80f4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  b43: fix comment typo reqest -> request
  Haavard Skinnemoen has left Atmel
  cris: typo in mach-fs Makefile
  Kconfig: fix copy/paste-ism for dell-wmi-aio driver
  doc: timers-howto: fix a typo ("unsgined")
  perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
  md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
  treewide: fix a few typos in comments
  regulator: change debug statement be consistent with the style of the rest
  Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
  audit: acquire creds selectively to reduce atomic op overhead
  rtlwifi: don't touch with treewide double semicolon removal
  treewide: cleanup continuations and remove logging message whitespace
  ath9k_hw: don't touch with treewide double semicolon removal
  include/linux/leds-regulator.h: fix syntax in example code
  tty: fix typo in descripton of tty_termios_encode_baud_rate
  xtensa: remove obsolete BKL kernel option from defconfig
  m68k: fix comment typo 'occcured'
  arch:Kconfig.locks Remove unused config option.
  treewide: remove extra semicolons
  ...
2011-05-23 09:12:26 -07:00
Tejun Heo
ff2a9941ca block: move bd_set_size() above rescan_partitions() in __blkdev_get()
02e352287a (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check.  As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.

  sda: detected capacity change from 0 to 146815737856
  sdb: detected capacity change from 0 to 146815737856

This patch restores the original order and remove the warning
messages.

stable: Please apply together with 02e352287a (block: rescan
        partitions on invalidated devices on -ENOMEDIA too).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tony Luck <tony.luck@gmail.com>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-23 08:50:48 -07:00
David Teigland
901025d2f3 dlm: make plock operation killable
Allow processes blocked on plock requests to be interrupted
when they are killed.  This leaves the problem of cleaning
up the lock state in userspace.  This has three parts:

1. Add a flag to unlock operations sent to userspace
indicating the file is being closed.  Userspace will
then look for and clear any waiting plock operations that
were abandoned by an interrupted process.

2. Queue an unlock-close operation (like in 1) to clean up
userspace from an interrupted plock request.  This is needed
because the vfs will not send a cleanup-unlock if it sees no
locks on the file, which it won't if the interrupted operation
was the only one.

3. Do not use replies from userspace for unlock-close operations
because they are unnecessary (they are just cleaning up for the
process which did not make an unlock call).  This also simplifies
the new unlock-close generated from point 2.

Signed-off-by: David Teigland <teigland@redhat.com>
2011-05-23 10:47:06 -05:00
Linus Torvalds
4d9dec4db2 Merge branch 'exec_rm_compat' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc
* 'exec_rm_compat' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc:
  exec: document acct_arg_size()
  exec: unify do_execve/compat_do_execve code
  exec: introduce struct user_arg_ptr
  exec: introduce get_user_arg_ptr() helper
2011-05-23 08:28:34 -07:00
Linus Torvalds
34b064569e Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Wait properly when flushing the ail list
  GFS2: Wipe directory hash table metadata when deallocating a directory
2011-05-23 08:24:09 -07:00
Thomas Gleixner
9ec2690758 timerfd: Manage cancelable timers in timerfd
Peter is concerned about the extra scan of CLOCK_REALTIME_COS in the
timer interrupt. Yes, I did not think about it, because the solution
was so elegant. I didn't like the extra list in timerfd when it was
proposed some time ago, but with a rcu based list the list walk it's
less horrible than the original global lock, which was held over the
list iteration.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2011-05-23 13:59:53 +02:00
Tejun Heo
7e69723fef block: move bd_set_size() above rescan_partitions() in __blkdev_get()
02e352287a (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check.  As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.

  sda: detected capacity change from 0 to 146815737856
  sdb: detected capacity change from 0 to 146815737856

This patch restores the original order and remove the warning
messages.

stable: Please apply together with 02e352287a (block: rescan
        partitions on invalidated devices on -ENOMEDIA too).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tony Luck <tony.luck@gmail.com>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org

Stable note: 2.6.39 only.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 13:26:07 +02:00
Linus Torvalds
caebc160ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: use mark_buffer_dirty to mark btnode or meta data dirty
  nilfs2: always set back pointer to host inode in mapping->host
  nilfs2: get rid of NILFS_I_NILFS
  nilfs2: use list_first_entry
  nilfs2: use empty_aops for gc-inodes
  nilfs2: implement resize ioctl
  nilfs2: add truncation routine of segment usage file
  nilfs2: add routine to move secondary super block
  nilfs2: add ioctl which limits range of segment to be allocated
  nilfs2: zero fill unused portion of super root block
  nilfs2: super root size should change depending on inode size
  nilfs2: get rid of private page allocator
  nilfs2: merge list_del()/list_add_tail() to list_move_tail()
2011-05-22 22:43:01 -07:00
Artem Bityutskiy
56e46742e8 UBIFS: switch to dynamic printks
Switch to debugging using dynamic printk (pr_debug()). There is no good reason
to carry custom debugging prints if there is so cool and powerful generic
dynamic printk infrastructure, see Documentation/dynamic-debug-howto.txt. With
dynamic printks we can switch on/of individual prints, per-file, per-function
and per format messages. This means that instead of doing old-fashioned

echo 1 > /sys/module/ubifs/parameters/debug_msgs

to enable general messages, we can do:

echo 'format "UBIFS DBG gen" +ptlf' > control

to enable general messages and additionally ask the dynamic printk
infrastructure to print process ID, line number and function name. So there is
no reason to keep UBIFS-specific crud if there is more powerful generic thing.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-05-23 08:22:20 +03:00
Tao Ma
28e35e42fb jbd2: Fix the wrong calculation of t_max_wait in update_t_max_wait
t_max_wait is added in commit 8e85fb3f to indicate how long we
were waiting for new transaction to start. In commit 6d0bf005,
it is moved to another function named update_t_max_wait to
avoid a build warning. But the wrong thing is that the original
'ts' is initialized in the start of function start_this_handle
and we can calculate t_max_wait in the right way. while with
this change, ts is initialized within the function and t_max_wait
can never be calculated right.

This patch moves the initialization of ts to the original beginning
of start_this_handle and pass it to function update_t_max_wait so
that it can be calculated right and the build warning is avoided also.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-22 21:45:26 -04:00
Eric Gouriou
f6d2f6b327 ext4: fix unbalanced up_write() in ext4_ext_truncate() error path
ext4_ext_truncate() should not invoke up_write(&EXT4_I(inode)->i_data_sem)
when ext4_orphan_add() returns an error, as it hasn't performed a
down_write() yet. This trivial patch fixes this by moving the up_write()
invocation above the out_stop label.

Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 21:33:00 -04:00
Vivek Haldar
77f4135f2a ext4: count hits/misses of extent cache and expose in sysfs
The number of hits and misses for each filesystem is exposed in
/sys/fs/ext4/<dev>/extent_cache_{hits, misses}.

Tested: fsstress, manual checks.
Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 21:24:16 -04:00
Yongqiang Yang
93917411be ext4: make ext4_split_extent() handle error correctly
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-22 20:49:12 -04:00
Theodore Ts'o
373cd5c53d ext4: don't show mount options in /proc/mounts if there is no journal
After creating an ext4 file system without a journal:

  # mke2fs -t ext4 -O ^has_journal /dev/sda
  # mount -t ext4 /dev/sda /test

the /proc/mounts will show:
"/dev/sda /test ext4 rw,relatime,user_xattr,acl,barrier=1,data=writeback 0 0"
which can fool users into thinking that the fs is using writeback mode.

So don't set the writeback option when the journal has not been
enabled; we don't depend on the writeback option being set, since
ext4_should_writeback_data() in ext4_jbd2.h tests to see if the
journal is not present before returning true.

Reported-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 16:12:35 -04:00
Heiko Carstens
9ce6e0be06 fs: add missing prefetch.h include
Fixes this build error on s390 and probably other archs as well:

  fs/inode.c: In function 'new_inode':
  fs/inode.c:894:2: error: implicit declaration of function 'spin_lock_prefetch'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
[ Happens on architectures that don't define their own prefetch
  functions in <asm/processor.h>, and instead rely on the default
  ones in <linux/prefetch.h>   - Linus]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-22 11:26:02 -07:00
Steven Whitehouse
26b06a6958 GFS2: Wait properly when flushing the ail list
The ail flush code has always relied upon log flushing to prevent
it from spinning needlessly. This fixes it to wait on the last
I/O request submitted (we don't need to wait for all of it)
instead of either spinning with io_schedule or sleeping.

As a result cpu usage of gfs2_logd is much reduced with certain
workloads.

Reported-by: Abhijith Das <adas@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-05-21 19:21:07 +01:00
Steven Whitehouse
6d3117b412 GFS2: Wipe directory hash table metadata when deallocating a directory
The deallocation code for directories in GFS2 is largely divided into
two parts. The first part deallocates any directory leaf blocks and
marks the directory as being a regular file when that is complete. The
second stage was identical to deallocating regular files.

Regular files have their data blocks in a different
address space to directories, and thus what would have been normal data
blocks in a regular file (the hash table in a GFS2 directory) were
deallocated correctly. However, a reference to these blocks was left in the
journal (assuming of course that some previous activity had resulted in
those blocks being in the journal or ail list).

This patch uses the i_depth as a test of whether the inode is an
exhash directory (we cannot test the inode type as that has already
been changed to a regular file at this stage in deallocation)

The original issue was reported by Chris Hertel as an issue he encountered
running bonnie++

Reported-by: Christopher R. Hertel <crh@samba.org>
Cc: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-05-21 14:05:58 +01:00
Erez Zadok
1a4022f88d VFS: move BUG_ON test for symlink nd->depth after current->link_count test
This solves a serious VFS-level bug in nested_symlink (which was
rewritten from do_follow_link), and follows the order of depth tests
that existed before.

The bug triggers a BUG_ON in fs/namei.c:1381, when running racer with
symlink and rename ops.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-21 00:12:16 -07:00
Timo Warns
cae13fe4cc Fix for buffer overflow in ldm_frag_add not sufficient
As Ben Hutchings discovered [1], the patch for CVE-2011-1017 (buffer
overflow in ldm_frag_add) is not sufficient.  The original patch in
commit c340b1d640 ("fs/partitions/ldm.c: fix oops caused by corrupted
partition table") does not consider that, for subsequent fragments,
previously allocated memory is used.

[1] http://lkml.org/lkml/2011/5/6/407

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Timo Warns <warns@pre-sense.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20 16:40:36 -07:00
Linus Torvalds
8e7bfcbab3 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] define "_sdata" symbol
  pstore: Fix Kconfig dependencies for apei->pstore
  pstore: fix potential logic issue in pstore read interface
  pstore: fix pstore filesystem mount/remount issue
  pstore: fix one type of return value in pstore
  [IA64] fix build warning in arch/ia64/oprofile/backtrace.c
2011-05-20 13:39:00 -07:00
Linus Torvalds
91444f47b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (32 commits)
  [CIFS] Fix to problem with getattr caused by invalidate simplification patch
  [CIFS] Remove sparse warning
  [CIFS] Update cifs to version 1.72
  cifs: Change key name to cifs.idmap, misc. clean-up
  cifs: Unconditionally copy mount options to superblock info
  cifs: Use kstrndup for cifs_sb->mountdata
  cifs: Simplify handling of submount options in cifs_mount.
  cifs: cifs_parse_mount_options: do not tokenize mount options in-place
  cifs: Add support for mounting Windows 2008 DFS shares
  cifs: Extract DFS referral expansion logic to separate function
  cifs: turn BCC into a static inlined function
  cifs: keep BCC in little-endian format
  cifs: fix some unused variable warnings in id_rb_search
  CIFS: Simplify invalidate part (try #5)
  CIFS: directio read/write cleanups
  consistently use smb_buf_length as be32 for cifs (try 3)
  cifs: Invoke id mapping functions (try #17 repost)
  cifs: Add idmap key and related data structures and functions (try #17 repost)
  CIFS: Add launder_page operation (try #3)
  Introduce smb2 mounts as vers=2
  ...
2011-05-20 13:37:49 -07:00
Linus Torvalds
3ed4c0583d Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (41 commits)
  signal: trivial, fix the "timespec declared inside parameter list" warning
  job control: reorganize wait_task_stopped()
  ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping()
  signal: sys_sigprocmask() needs retarget_shared_pending()
  signal: cleanup sys_sigprocmask()
  signal: rename signandsets() to sigandnsets()
  signal: do_sigtimedwait() needs retarget_shared_pending()
  signal: introduce do_sigtimedwait() to factor out compat/native code
  signal: sys_rt_sigtimedwait: simplify the timeout logic
  signal: cleanup sys_rt_sigprocmask()
  x86: signal: sys_rt_sigreturn() should use set_current_blocked()
  x86: signal: handle_signal() should use set_current_blocked()
  signal: sigprocmask() should do retarget_shared_pending()
  signal: sigprocmask: narrow the scope of ->siglock
  signal: retarget_shared_pending: optimize while_each_thread() loop
  signal: retarget_shared_pending: consider shared/unblocked signals only
  signal: introduce retarget_shared_pending()
  ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/
  signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED
  signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending()
  ...
2011-05-20 13:33:21 -07:00
Linus Torvalds
6c1b8d94bc Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (32 commits)
  GFS2: Move all locking inside the inode creation function
  GFS2: Clean up symlink creation
  GFS2: Clean up mkdir
  GFS2: Use UUID field in generic superblock
  GFS2: Rename ops_inode.c to inode.c
  GFS2: Inode.c is empty now, remove it
  GFS2: Move final part of inode.c into super.c
  GFS2: Move most of the remaining inode.c into ops_inode.c
  GFS2: Move gfs2_refresh_inode() and friends into glops.c
  GFS2: Remove gfs2_dinode_print() function
  GFS2: When adding a new dir entry, inc link count if it is a subdir
  GFS2: Make gfs2_dir_del update link count when required
  GFS2: Don't use gfs2_change_nlink in link syscall
  GFS2: Don't use a try lock when promoting to a higher mode
  GFS2: Double check link count under glock
  GFS2: Improve bug trap code in ->releasepage()
  GFS2: Fix ail list traversal
  GFS2: make sure fallocate bytes is a multiple of blksize
  GFS2: Add an AIL writeback tracepoint
  GFS2: Make writeback more responsive to system conditions
  ...
2011-05-20 13:28:45 -07:00
Linus Torvalds
268bb0ce3e sanitize <linux/prefetch.h> usage
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.

So this fixes things up a bit, using

   grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
   grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.

There are more of them around (mostly network drivers), but this gets
many core ones.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20 12:50:29 -07:00
Jens Axboe
698567f3fa Merge commit 'v2.6.39' into for-2.6.40/core
Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.

Also fixes up conflicts in the below files.

Conflicts:
	drivers/block/paride/pcd.c
	drivers/cdrom/viocd.c
	drivers/ide/ide-cd.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:33:15 +02:00
Thomas Gleixner
250f972d85 Merge branch 'timers/urgent' into timers/core
Reason: Get upstream fixes and kfree_rcu which is necessary for a
follow up patch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-20 20:08:05 +02:00
Lukas Czerner
1bb933fb1f ext4: fix possible use-after-free in ext4_remove_li_request()
We need to take reference to the s_li_request after we take a mutex,
because it might be freed since then, hence result in accessing old
already freed memory. Also we should protect the whole
ext4_remove_li_request() because ext4_li_info might be in the process of
being freed in ext4_lazyinit_thread().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:29 -04:00
Lukas Czerner
51ce651156 ext4: fix the mount option "init_itable=n" to work as expected for n=0
For some reason, when we set the mount option "init_itable=0" it
behaves as we would set init_itable=20 which is not right at all.
Basically when we set it to zero we are saying to lazyinit thread not
to wait between zeroing the inode table (except of cond_resched()) so
this commit fixes that and removes the unnecessary condition.  The 'n'
should be also properly used on remount.

When the n is not set at all, it means that the default miltiplier
EXT4_DEF_LI_WAIT_MULT is set instead.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:16 -04:00
Lukas Czerner
e1290b3e62 ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread()
For some reason we have been waiting for lazyinit thread to start in the
ext4_run_lazyinit_thread() but it is not needed since it was jus
unnecessary complexity, so get rid of it. We can also remove li_task and
li_wait_task since it is not used anymore.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:51 -04:00
Lukas Czerner
4ed5c033c1 ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread
In order to make lazyinit eat approx. 10% of io bandwidth at max, we
are sleeping between zeroing each single inode table. For that purpose
we are using timer which wakes up thread when it expires. It is set
via add_timer() and this may cause troubles in the case that thread
has been woken up earlier and in next iteration we call add_timer() on
still running timer hence hitting BUG_ON in add_timer(). We could fix
that by using mod_timer() instead however we can use
schedule_timeout_interruptible() for waiting and hence simplifying
things a lot.

This commit exchange the old "waiting mechanism" with simple
schedule_timeout_interruptible(), setting the time to sleep. Hence we
do not longer need li_wait_daemon waiting queue and others, so get rid
of it.

Addresses-Red-Hat-Bugzilla: #699708

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:04 -04:00
Tony Luck
3935bb949f Pull pstore into release branch 2011-05-20 10:34:50 -07:00
Steve French
156ecb2d8b [CIFS] Fix to problem with getattr caused by invalidate simplification patch
Fix to earlier "Simplify invalidate part (try #6)" patch
That patch caused problems with connectathon test 5.

Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-20 17:00:01 +00:00
Artem Bityutskiy
bdc1a1b610 UBIFS: fix kernel-doc comments
This is a minor fix for UBIFS kernel-doc comments - we forgot the "@" symbol
for several 'struct ubifs_debug_info'.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-05-20 08:30:13 +03:00
Linus Torvalds
39ab05c8e0 Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (44 commits)
  debugfs: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning
  sysfs: remove "last sysfs file:" line from the oops messages
  drivers/base/memory.c: fix warning due to "memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION"
  memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION
  SYSFS: Fix erroneous comments for sysfs_update_group().
  driver core: remove the driver-model structures from the documentation
  driver core: Add the device driver-model structures to kerneldoc
  Translated Documentation/email-clients.txt
  RAW driver: Remove call to kobject_put().
  reboot: disable usermodehelper to prevent fs access
  efivars: prevent oops on unload when efi is not enabled
  Allow setting of number of raw devices as a module parameter
  Introduce CONFIG_GOOGLE_FIRMWARE
  driver: Google Memory Console
  driver: Google EFI SMI
  x86: Better comments for get_bios_ebda()
  x86: get_bios_ebda_length()
  misc: fix ti-st build issues
  params.c: Use new strtobool function to process boolean inputs
  debugfs: move to new strtobool
  ...

Fix up trivial conflicts in fs/debugfs/file.c due to the same patch
being applied twice, and an unrelated cleanup nearby.
2011-05-19 18:24:11 -07:00
Sage Weil
9d6fcb081a ceph: check return value for start_request in writepages
Since we pass the nofail arg, we should never get an error; BUG if we do.
(And fix the function to not return an error if __map_request fails.)

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil
6b4a3b517a ceph: remove useless check
rc is only ever 0 or negative in this method.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil
da39822c65 ceph: fix broken comparison in readdir loop
Both off and fi->offset are unsigned, so the difference is always >= 0.
Compare them directly instead of the sign of the difference.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:04 -07:00
Sage Weil
3540303f87 ceph: fix rare potential cap leak
If we grab new_cap, retake the lock, and find we already have a cap now
for the given mds, release new_cap.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:03 -07:00
Sage Weil
ae59808301 ceph: use snprintf for dirstat content
We allocate a buffer for rstats if the dirstat option is enabled.  Use
snprintf.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:02 -07:00
Sage Weil
1b36698577 libceph: remove unused variable
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:24:17 -07:00
Sage Weil
3b66378034 ceph: take reference on mds request r_unsafe_dir
We put ourselves on an inode list for the parent directory of metadata
operations so that an fsync on the directory will wait for metadata updates
to commit to disk.  We weren't holding a reference to that directory,
however, and under certain workloads (fsstress in this case) the directory
can go away.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:20:07 -07:00
Dave Chinner
bf59170a66 xfs: obey minleft values during extent allocation correctly
When allocating an extent that is long enough to consume the
remaining free space in an AG, we need to ensure that the allocation
leaves enough space in the AG for any subsequent bmap btree blocks
that are needed to track the new extent. These have to be allocated
in the same AG as we only reserve enough blocks in an allocation
transaction for modification of the freespace trees in a single AG.

xfs_alloc_fix_minleft() has been considering blocks on the AGFL as
free blocks available for extent and bmbt block allocation, which is
not correct - blocks on the AGFL are there exclusively for the use
of the free space btrees. As a result, when minleft is less than the
number of blocks on the AGFL, xfs_alloc_fix_minleft() does not trim
the given extent to leave minleft blocks available for bmbt
allocation, and hence we can fail allocation during bmbt record
insertion.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:48 -05:00
Dave Chinner
44396476a0 xfs: reset buffer pointers before freeing them
When we free a vmapped buffer, we need to ensure the vmap address
and length we free is the same as when it was allocated. In various
places in the log code we change the memory the buffer is pointing
to before issuing IO, but we never reset the buffer to point back to
it's original memory (or no memory, if that is the case for the
buffer).

As a result, when we free the buffer it points to memory that is
owned by something else and attempts to unmap and free it. Because
the range does not match any known mapped range, it can trigger
BUG_ON() traps in the vmap code, and potentially corrupt the vmap
area tracking.

Fix this by always resetting these buffers to their original state
before freeing them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:45 -05:00
Dave Chinner
ee58abdfcc xfs: avoid getting stuck during async inode flushes
When the underlying inode buffer is locked and xfs_sync_inode_attr()
is doing a non-blocking flush, xfs_iflush() can return EAGAIN.  When
this happens, clear the error rather than returning it to
xfs_inode_ag_walk(), as returning EAGAIN will result in the AG walk
delaying for a short while and trying again. This can result in
background walks getting stuck on the one AG until inode buffer is
unlocked by some other means.

This behaviour was noticed when analysing event traces followed by
code inspection and verification of the fix via further traces.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:42 -05:00
Dave Chinner
e57375153d xfs: fix xfs_itruncate_start tracing
Variables are ordered incorrectly in trace call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:36 -05:00
Dave Chinner
1beb65ad45 xfs: fix duplicate workqueue initialisation
The workqueue initialisation function is called twice when
initialising the XFS subsystem. Remove the second initialisation
call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:24 -05:00
Joe Perches
e69522a8cc xfs: kill off xfs_printk()
xfs_alert_tag() can be defined using xfs_alert(), and thereby avoid
using xfs_printk() altogether.  This is the only remaining use of
xfs_printk(), so changing it this way means xfs_printk() can simply
be eliminated.can simply be eliminated.can simply be eliminated.can
simply be eliminated.can simply be eliminated.can simply be
eliminated.can simply be eliminated.can simply be eliminated.can
simply be eliminated.

Also add format checking to the non-debug inline function xfs_debug.
Miscellaneous function prototype argument alignment.

(Updated to delete the definition of xfs_printk(), which is
no longer used or needed.)

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 11:38:09 -05:00
Steve French
ceec1e0fae [CIFS] Remove sparse warning
Move extern for cifsConvertToUCS to different header to prevent following warning:

CHECK   fs/cifs/cifs_unicode.c
fs/cifs/cifs_unicode.c:267:1: warning: symbol 'cifsConvertToUCS' was not declared. Should it be static?

Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:56 +00:00
Steve French
4e64fb33de [CIFS] Update cifs to version 1.72
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:56 +00:00
Shirish Pargaonkar
c4aca0c09f cifs: Change key name to cifs.idmap, misc. clean-up
Change idmap key name from cifs.cifs_idmap to cifs.idmap.
Removed unused structure wksidarr and function match_sid().
Handle errors correctly in function init_cifs().

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:56 +00:00
Sean Finney
f14bcf71d1 cifs: Unconditionally copy mount options to superblock info
Previously mount options were copied and updated in the cifs_sb_info
struct only when CONFIG_CIFS_DFS_UPCALL was enabled.  Making this
information generally available allows us to remove a number of ifdefs,
extra function params, and temporary variables.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:55 +00:00
Sean Finney
5167f11ec9 cifs: Use kstrndup for cifs_sb->mountdata
A relatively minor nit, but also clarified the "consensus" from the
preceding comments that it is in fact better to try for the kstrdup
early and cleanup while cleaning up is still a simple thing to do.

Reviewed-By: Steve French <smfrench@gmail.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:55 +00:00
Sean Finney
046462abca cifs: Simplify handling of submount options in cifs_mount.
With CONFIG_DFS_UPCALL enabled, maintain the submount options in
cifs_sb->mountdata, simplifying the code just a bit as well as making
corner-case allocation problems less likely.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:55 +00:00
Sean Finney
b946845a9d cifs: cifs_parse_mount_options: do not tokenize mount options in-place
To keep strings passed to cifs_parse_mount_options re-usable (which is
needed to clean up the DFS referral handling), tokenize a copy of the
mount options instead.  If values are needed from this tokenized string,
they too must be duplicated (previously, some options were copied and
others duplicated).

Since we are not on the critical path and any cleanup is relatively easy,
the extra memory usage shouldn't be a problem (and it is a bit simpler
than trying to implement something smarter).

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:54 +00:00
Sean Finney
c1508ca236 cifs: Add support for mounting Windows 2008 DFS shares
Windows 2008 CIFS servers do not always return PATH_NOT_COVERED when
attempting to access a DFS share.  Therefore, when checking for remote
shares, unconditionally ask for a DFS referral for the UNC (w/out prepath)
before continuing with previous behavior of attempting to access the UNC +
prepath and checking for PATH_NOT_COVERED.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=31092

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:54 +00:00
Sean Finney
dd61394586 cifs: Extract DFS referral expansion logic to separate function
The logic behind the expansion of DFS referrals is now extracted from
cifs_mount into a new static function, expand_dfs_referral.  This will
reduce duplicate code in upcoming commits.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:54 +00:00
Jeff Layton
460458ce8e cifs: turn BCC into a static inlined function
It's a bad idea to have macro functions that reference variables more
than once, as the arguments could have side effects. Turn BCC() into
a static inlined function instead.

While we're at it, make it return a void * to discourage anyone from
dereferencing it as-is.

Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:53 +00:00
Jeff Layton
820a803ffa cifs: keep BCC in little-endian format
This is the same patch as originally posted, just with some merge
conflicts fixed up...

Currently, the ByteCount is usually converted to host-endian on receive.
This is confusing however, as we need to keep two sets of routines for
accessing it, and keep track of when to use each routine. Munging
received packets like this also limits when the signature can be
calulated.

Simplify the code by keeping the received ByteCount in little-endian
format. This allows us to eliminate a set of routines for accessing it
and we can now drop the *_le suffixes from the accessor functions since
that's now implied.

While we're at it, switch all of the places that read the ByteCount
directly to use the get_bcc inline which should also clean up some
unaligned accesses.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:53 +00:00
Jeff Layton
0e6e37a7a8 cifs: fix some unused variable warnings in id_rb_search
fs/cifs/cifsacl.c: In function ‘id_rb_search’:
fs/cifs/cifsacl.c:215:19: warning: variable ‘linkto’ set but not used
[-Wunused-but-set-variable]
fs/cifs/cifsacl.c:214:18: warning: variable ‘parent’ set but not used
[-Wunused-but-set-variable]

Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:52 +00:00
Pavel Shilovsky
6feb9891da CIFS: Simplify invalidate part (try #5)
Simplify many places when we call cifs_revalidate/invalidate to make
it do what it exactly needs.

Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:52 +00:00
Pavel Shilovsky
0b81c1c405 CIFS: directio read/write cleanups
Recently introduced strictcache mode brought a new code that can be
efficiently used by directio part. That's let us add vectored operations
and break unnecessary cifs_user_read and cifs_user_write.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:51 +00:00
Steve French
be8e3b0044 consistently use smb_buf_length as be32 for cifs (try 3)
There is one big endian field in the cifs protocol, the RFC1001
       length, which cifs code (unlike in the smb2 code) had been handling as
       u32 until the last possible moment, when it was converted to be32 (its
       native form) before sending on the wire.   To remove the last sparse
       endian warning, and to make this consistent with the smb2
       implementation  (which always treats the fields in their
       native size and endianness), convert all uses of smb_buf_length to
       be32.

       This version incorporates Christoph's comment about
       using be32_add_cpu, and fixes a typo in the second
       version of the patch.

Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:51 +00:00
Shirish Pargaonkar
9409ae58e0 cifs: Invoke id mapping functions (try #17 repost)
rb tree search and insertion routines.

A SID which needs to be mapped, is looked up in one of the rb trees
depending on whether SID is either owner or group SID.
If found in the tree, a (mapped) id from that node is assigned to
uid or gid as appropriate.  If unmapped, an upcall is attempted to
map the SID to an id.  If upcall is successful, node is marked as
mapped.  If upcall fails, node stays marked as unmapped and a mapping
is attempted again only after an arbitrary time period has passed.

To map a SID, which can be either a Owner SID or a Group SID, key
description starts with the string "os" or "gs" followed by SID converted
to a string. Without "os" or "gs", cifs.upcall does not know whether
SID needs to be mapped to either an uid or a gid.

Nodes in rb tree have fields to prevent multiple upcalls for
a SID.  Searching, adding, and removing nodes is done within global locks.
Whenever a node is either found or inserted in a tree, a reference
is taken on that node.
Shrinker routine prunes a node if it has expired but does not prune
an expired node if its refcount is not zero (i.e. sid/id of that node
is_being/will_be accessed).
Thus a node, if its SID needs to be mapped by making an upcall,
can safely stay and its fields accessed without shrinker pruning it.
A reference (refcount) is put on the node without holding the spinlock
but a reference is get on the node by holding the spinlock.

Every time an existing mapped node is accessed or mapping is attempted,
its timestamp is updated to prevent it from getting erased or a
to prevent multiple unnecessary repeat mapping retries respectively.

For now, cifs.upcall is only used to map a SID to an id (uid or gid) but
it would be used to obtain an SID for an id.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:51 +00:00
Shirish Pargaonkar
4d79dba0e0 cifs: Add idmap key and related data structures and functions (try #17 repost)
Define (global) data structures to store ids, uids and gids, to which a
SID maps.  There are two separate trees, one for SID/uid and another one
for SID/gid.

A new type of key, cifs_idmap_key_type, is used.

Keys are instantiated and searched using credential of the root by
overriding and restoring the credentials of the caller requesting the key.

Id mapping functions are invoked under config option of cifs acl.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:51 +00:00
Pavel Shilovsky
9ad1506b42 CIFS: Add launder_page operation (try #3)
Add this let us drop filemap_write_and_wait from cifs_invalidate_mapping
and simplify the code to properly process invalidate logic.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:50 +00:00
Steve French
1cb06d0b50 Introduce smb2 mounts as vers=2
As with Linux nfs client, which uses "nfsvers=" or "vers=" to
indicate which protocol to use for mount, specifying

"vers=smb2" or "vers=2"

will force an smb2 mount. When vers is not specified cifs is used

ie "vers=cifs" or "vers=1"

We can eventually autonegotiate down from smb2 to cifs
when smb2 is stable enough to make it the default, but this
is for the future.  At that time we could also implement a
"maxprotocol" mount option as smbclient and Samba have today,
but that would be premature until smb2 is stable.

Intially the smb2 Kconfig option will depend on "BROKEN"
until the merge is complete, and then be "EXPERIMENTAL"
When it is no longer experimental we can consider changing
the default protocol to attempt first.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:50 +00:00
Pavel Shilovsky
257fb1f15d CIFS: Use invalidate_inode_pages2 instead of invalidate_remote_inode (try #4)
Use invalidate_inode_pages2 that don't leave pages even if shrink_page_list()
has a temp ref on them. It prevents a data coherency problem when
cifs_invalidate_mapping didn't invalidate pages but the client thinks that a data
from the cache is uptodate according to an oplock level (exclusive or II).

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:50 +00:00
Jeff Layton
fd5707e1b4 cifs: fix comment in validate_t2
The comment about checking the bcc is in the wrong place. Also make it
match kernel coding style.

Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:50 +00:00
Jeff Layton
4358b5678b VFS: trivial: fix comment on s_maxbytes value warning check
I originally intended to remove this warning in 2.6.34, but it's not in
a high performance codepath and might help us to catch bugs later. Let's
keep it, but fix the comment to allay confusion about its removal.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:49 +00:00
Steve French
b73b9a4ba7 [CIFS] Allow to set extended attribute cifs_acl (try #2)
Allow setting cifs_acl on the server.
Pass on to the server the ACL blob generated by an application.
cifs is just a pass-through, it does not monitor or inspect the contents
of the blob, server decides whether to enforce/apply the ACL blob composed
by an application.
If setting of ACL is succeessful, mark the inode for revalidation.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:49 +00:00
Steve French
43988d7685 [CIFS] Use ecb des kernel crypto APIs instead of
local cifs functions (repost)

Using kernel crypto APIs for DES encryption during LM and NT hash generation
instead of local functions within cifs.
Source file smbdes.c is deleted sans four functions, one of which
uses ecb des functionality provided by kernel crypto APIs.

Remove function SMBOWFencrypt.

Add return codes to various functions such as calc_lanman_hash,
SMBencrypt, and SMBNTencrypt.  Includes fix noticed by Dan Carpenter.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Dan Carpenter <error27@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:49 +00:00
Shirish Pargaonkar
257208736a cifs: cleanup: Rename and remove config flags
Remove config flag CIFS_EXPERIMENTAL.
Do export operations under new config flag CIFS_NFSD_EXPORT

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:49 +00:00
Steve French
b34cb85cc2 Introduce SMB2 Kconfig option
SMB2 is the followon to the CIFS (and SMB) protocols
and the default for Windows since Windows Vista, and also
now implemented by various non-Windows servers.  SMB2
is more secure, has various performance advantages, including
larger i/o sizes, flow control, better caching model and more.
SMB2 also resolves some scalability limits in the cifs
protocol and adds many new features while being much
simpler (only a few dozen commands instead of hundreds)
and since the protocol is clearer it is
also more consistently implemented across servers
and thus easier to optimize.

After much discussion with Jeff Layton, Jeremy Allison
and others at Connectathon, we decided to move the smb2
code from a distinct .ko and fstype into distinct
C files that optionally build in cifs.ko.  As a result
the Kconfig gets simpler.

To avoid destabilizing cifs, the smb2 code is going
to be moved into its own experimental CONFIG_CIFS_SMB2 ifdef
as it is merged and rereviewed.  The changes to stable
cifs (builds with the smb2 ifdef off) are expected to be
fairly small.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:48 +00:00
Steve French
34c87901e1 Shrink stack space usage in cifs_construct_tcon
We were reserving MAX_USERNAME (now 256) on stack for
something which only needs to fit about 24 bytes ie
string krb50x +  printf version of uid

Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:48 +00:00
Justin P. Mattock
fd62cb7e74 fs:cifs:connect.c remove one to many l's in the word.
The patch below removes an extra "l" in the word.

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:48 +00:00
Steve French
c52a95545c Don't compile in unused reparse point symlink code
Recent Windows versions now create symlinks more frequently
and they do use this "reparse point" symlink mechanism.  We can of course
do symlinks nicely to Samba and other servers which support the
CIFS Unix Extensions and we can also do SFU symlinks and "client only"
"MF" symlinks optionally, but for recent Windows we currently can not
handle the common "reparse point" symlinks fully, removing the caller
for this. We will need to extend and reenable this "reparse point" worker
code in cifs and fix cifs_symlink to call this.  In the interim this code
has been moved to its own config option so it is not compiled in by default
until cifs_symlink fixed up (and tested) to use this.

CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:48 +00:00
Steve French
0eff0e2677 Remove unused CIFSSMBNotify worker function
The CIFSSMBNotify worker is unused, pending changes to allow it to be called
via inotify, so move it into its own experimental config option so it does
not get built in, until the necessary VFS support is fixed.  It used to
be used in dnotify, but according to Jeff, inotify needs minor changes
before we can reenable this.

CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:47 +00:00
Shirish Pargaonkar
9b6763e0aa cifs: Remove unused inode number while fetching root inode
ino is unused in function cifs_root_iget().

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:47 +00:00
James Morris
12a5a2621b Merge branch 'master' into next
Conflicts:
	include/linux/capability.h

Manually resolve merge conflict w/ thanks to Stephen Rothwell.

Signed-off-by: James Morris <jmorris@namei.org>
2011-05-19 18:51:57 +10:00
Jonathan Cameron
a037439637 debugfs: move to new strtobool
No functional changes requires that we eat errors from strtobool.
If people want to not do this, then it should be fixed at a later date.

V2: Simplification suggested by Rusty Russell removes the need for
additional variable ret.

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-05-19 16:55:28 +09:30
Linus Torvalds
3f80fbff5f Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  configfs: Fix race between configfs_readdir() and configfs_d_iput()
  configfs: Don't try to d_delete() negative dentries.
  ocfs2/dlm: Target node death during resource migration leads to thread spin
  ocfs2: Skip mount recovery for hard-ro mounts
  ocfs2/cluster: Heartbeat mismatch message improved
  ocfs2/cluster: Increase the live threshold for global heartbeat
  ocfs2/dlm: Use negotiated o2dlm protocol version
  ocfs2: skip existing hole when removing the last extent_rec in punching-hole codes.
  ocfs2: Initialize data_ac (might be used uninitialized)
2011-05-18 16:50:28 -07:00
Darrick J. Wong
0e499890c1 ext4: wait for writeback to complete while making pages writable
In order to stabilize pages during disk writes, ext4_page_mkwrite must
wait for writeback operations to complete before making a page
writable.  Furthermore, the function must return locked pages, and
recheck the writeback status if the page lock is ever dropped.  The
"someone could wander in" part of this patch was suggested by Chris
Mason.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-18 13:55:20 -04:00
Darrick J. Wong
7cb1a5351d ext4: clean up some wait_on_page_writeback calls
wait_on_page_writeback already checks the writeback bit, so callers of it
needn't do that test.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-18 13:53:20 -04:00
Tao Ma
ed3ce80a52 ext4: don't warn about mnt_count if it has been disabled
Currently, if we mkfs a new ext4 volume with s_max_mnt_count set to
zero, and mount it for the first time, we will get the warning:

	maximal mount count reached, running e2fsck is recommended

It is really misleading. So change the check so that it won't warn in
that case.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-18 13:29:57 -04:00
Linus Torvalds
a2b9c1f620 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: don't delay blk_run_queue_async
  scsi: remove performance regression due to async queue run
  blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup
  block: rescan partitions on invalidated devices on -ENOMEDIA too
  cdrom: always check_disk_change() on open
  block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers
2011-05-18 06:49:02 -07:00
Joel Becker
24307aa1e7 configfs: Fix race between configfs_readdir() and configfs_d_iput()
configfs_readdir() will use the existing inode numbers of inodes in the
dcache, but it makes them up for attribute files that aren't currently
instantiated.  There is a race where a closing attribute file can be
tearing down at the same time as configfs_readdir() is trying to get its
inode number.

We want to get the inode number of open attribute files, because they
should match while instantiated.  We can't lock down the transition
where dentry->d_inode is set to NULL, so we just check for NULL there.
We can, however, ensure that an inode we find isn't iput() in
configfs_d_iput() until after we've accessed it.

Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-18 04:08:16 -07:00
Joel Becker
df7f99670a configfs: Don't try to d_delete() negative dentries.
When configfs is faking mkdir() on its subsystem or default group
objects, it starts by adding a negative dentry.  It then tries to
instantiate the group.  If that should fail, it must clean up after
itself.

I was using d_delete() here, but configfs_attach_group() promises to
return an empty dentry on error.  d_delete() explodes with the entry
dentry.  Let's try d_drop() instead.  The unhashing is what we want for
our dentry.

Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-18 03:30:58 -07:00
Jeff Layton
11379b5e33 cifs: fix cifsConvertToUCS() for the mapchars case
As Metze pointed out, commit 84cdf74e broke mapchars option:

    Commit "cifs: fix unaligned accesses in cifsConvertToUCS"
    (84cdf74e80) does multiple steps
    in just one commit (moving the function and changing it without
    testing).

    put_unaligned_le16(temp, &target[j]); is never called for any
    codepoint the goes via the 'default' switch statement. As a result
    we put just zero (or maybe uninitialized) bytes into the target
    buffer.

His proposed patch looks correct, but doesn't apply to the current head
of the tree. This patch should also fix it.

Cc: <stable@kernel.org> # .38.x: 581ade4: cifs: clean up various nits in unicode routines (try #2)
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17 20:54:04 +00:00
Jeff Layton
221d1d7972 cifs: add fallback in is_path_accessible for old servers
The is_path_accessible check uses a QPathInfo call, which isn't
supported by ancient win9x era servers. Fall back to an older
SMBQueryInfo call if it fails with the magic error codes.

Cc: stable@kernel.org
Reported-and-Tested-by: Sandro Bonazzola <sandro.bonazzola@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17 18:51:14 +00:00
Tao Ma
9199e66528 jbd/jbd2: remove obsolete summarise_journal_usage.
summarise_journal_usage seems to be obsolete for a long time,
so remove it.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-05-17 13:47:42 +02:00
Jan Kara
2842bb20ee jbd: Fix forever sleeping process in do_get_write_access()
In do_get_write_access() we wait on BH_Unshadow bit for buffer to get
from shadow state. The waking code in journal_commit_transaction() has
a bug because it does not issue a memory barrier after the buffer is moved
from the shadow state and before wake_up_bit() is called. Thus a waitqueue
check can happen before the buffer is actually moved from the shadow state
and waiting process may never be woken. Fix the problem by issuing proper
barrier.

CC: stable@kernel.org
Reported-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-05-17 13:47:42 +02:00
Robin Dong
4e299c1d91 ext2: fix error msg when mounting fs with too-large blocksize
When ext2 mounts a filesystem, it attempts to set the block device
blocksize with a call to sb_set_blocksize, which can fail for
several reasons.  The current failure message in ext2 prints:

  EXT2-fs (loop1): error: blocksize is too small

which is not correct in all cases.  This can be demonstrated
by creating a filesystem with

  # mkfs.ext2 -b 8192

on a 4k page system, and attempting to mount it.

Change the error message to a more generic:

  EXT2-fs (loop1): bad blocksize 8192

to match the error message in ext3.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Reviewed-by: Coly Li <bosong.ly@taobao.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-05-17 13:47:42 +02:00
Ted Ts'o
d9b01934d5 jbd: fix fsync() tid wraparound bug
If an application program does not make any changes to the indirect
blocks or extent tree, i_datasync_tid will not get updated.  If there
are enough commits (i.e., 2**31) such that tid_geq()'s calculations
wrap, and there isn't a currently active transaction at the time of
the fdatasync() call, this can end up triggering a BUG_ON in
fs/jbd/commit.c:

	J_ASSERT(journal->j_running_transaction != NULL);

It's pretty rare that this can happen, since it requires the use of
fdatasync() plus *very* frequent and excessive use of fsync().  But
with the right workload, it can.

We fix this by replacing the use of tid_geq() with an equality test,
since there's only one valid transaction id that is valid for us to
start: namely, the currently running transaction (if it exists).

CC: stable@kernel.org
Reported-by: Martin_Zielinski@McAfee.com
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-05-17 13:47:41 +02:00
Jan Kara
86c4f6d855 ext3: Fix fs corruption when make_indexed_dir() fails
When make_indexed_dir() fails (e.g. because of ENOSPC) after it has allocated
block for index tree root, we did not properly mark all changed buffers dirty.
This lead to only some of these buffers being written out and thus effectively
corrupting the directory.

Fix the issue by marking all changed data dirty even in the error failure case.

CC: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2011-05-17 13:47:15 +02:00
Alexey Dobriyan
011159a0a7 airo: correct proc entry creation interfaces
* use proc_mkdir_mode() instead of create_proc_entry(S_IFDIR|...),
  export proc_mkdir_mode() for that, oh well.
* don't supply S_IFREG to proc_create_data(), it's unnecessary

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-05-16 14:25:28 -04:00
Chen Gong
06cf91b4b4 pstore: fix pstore filesystem mount/remount issue
Currently after mount/remount operation on pstore filesystem,
the content on pstore will be lost. It is because current ERST
implementation doesn't support multi-user usage, which moves
internal pointer to the end after accessing it. Adding
multi-user support for pstore usage.

Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2011-05-16 11:05:00 -07:00
Chen Gong
8d38d74b64 pstore: fix one type of return value in pstore
the return type of function _read_ in pstore is size_t,
but in the callback function of _read_, the logic doesn't
consider it too much, which means if negative value (assuming
error here) is returned, it will be converted to positive because
of type casting. ssize_t is enough for this function.

Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2011-05-16 11:04:51 -07:00
Allison Henderson
9b940f8e8c ext4: ext4_ext_convert_to_initialized bug found in extended FSX testing
This patch addresses bugs found while testing punch hole 
with the fsx test.  The patch corrects the number of blocks
that are zeroed out while splitting an extent, and also corrects
the return value to return the number of blocks split out, instead
of the number of blocks zeroed out.

This patch has been tested in addition to the following patches: 
[Ext4 punch hole v7]
[XFS Tests Punch Hole 1/1 v2] Add Punch Hole Testing to FSX

The test ran successfully for 24 hours.

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-16 10:11:09 -04:00
Amir Goldstein
0b26859027 ext4: fix oops in ext4_quota_off()
If quota is not enabled when ext4_quota_off() is called, we must not
dereference quota file inode since it is NULL.  Check properly for
this.

This fixes a bug in commit 21f976975c (ext4: remove unnecessary
[cm]time update of quota file), which was merged for 2.6.39-rc3.

Reported-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-16 09:59:13 -04:00
Artem Bityutskiy
bbf2b37a98 UBIFS: fix extremely rare mount failure
This patch fixes an extremely rare mount failure after a power cut, when mount
fails with ENOSPC error because UBIFS could not find the GC LEB.

In short, the reason for this failure is that after recovery the GC head LEB
contains less free space than it had contained just before the power cut
happened. As a result, if the FS is full, 'ubifs_rcvry_gc_commit()' is unable
to find a dirty LEB to GC and a free LEB, so mount fails.

This patch contains a huge comment with more detailed explanation, please refer
that comment.

Since this is really really rare and unlikely situation, I do not send this
patch to the stable tree, also because it requires a lot of preparation
patches which I did before. So sending this to -stable would be too risky.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-05-16 15:48:48 +03:00
Artem Bityutskiy
43e0707386 UBIFS: simplify LEB recovery function further
Further simplify 'ubifs_recover_leb()' by noticing that we have to call
'clean_buf()' in any case, and it is fine to call it if the offset is
aligned to 'c->min_io_size'. Thus, we do not have to call it separately
from every "if" - just call it once at the end.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-05-16 15:48:48 +03:00
Artem Bityutskiy
7c47bfd0db UBIFS: always cleanup the recovered LEB
Now when we call 'ubifs_recover_leb()' only for LEBs which are potentially
corrupted (i.e., only for last buds, not for all of them), we can cleanup every
LEB, not only those where we find corruption. The reason - unstable bits. Even
though the LEB may look good now, it might contain unstable bits which may hit
us a bit later.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-05-16 15:48:48 +03:00
Artem Bityutskiy
6179920695 UBIFS: clean up LEB recovery function
This patch cleans up 'ubifs_recover_leb()' function and makes it more readable.
Move things which are done only once out of the loop and kill unneeded 'switch'
statement.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-05-16 15:48:48 +03:00