Commit Graph

34606 Commits

Author SHA1 Message Date
Huajun Li
e18c65b2ac f2fs: key functions to handle inline data
Functions to implement inline data read/write, and move inline data to
normal data block when file size exceeds inline data limitation.

Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:40:09 +09:00
Gu Zheng
0d47c1adc2 f2fs: convert max_orphans to a field of f2fs_sb_info
Previously, we need to calculate the max orphan num when we try to acquire an
orphan inode, but it's a stable value since the super block was inited. So
converting it to a field of f2fs_sb_info and use it directly when needed seems
a better choose.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:37:52 +09:00
Jaegeuk Kim
944fcfc184 f2fs: check the blocksize before calling generic_direct_IO path
The f2fs supports 4KB block size. If user requests dwrite with under 4KB data,
it allocates a new 4KB data block.
However, f2fs doesn't add zero data into the untouched data area inside the
newly allocated data block.

This incurs an error during the xfstest #263 test as follow.

263 12s ... [failed, exit status 1] - output mismatch (see 263.out.bad)
	--- 263.out	2013-03-09 03:37:15.043967603 +0900
	+++ 263.out.bad	2013-12-27 04:20:39.230203114 +0900
	@@ -1,3 +1,976 @@
	QA output created by 263
	fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
	-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
	+fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
	+truncating to largest ever: 0x12a00
	+truncating to largest ever: 0x75400
	+fallocating to largest ever: 0x79cbf
	...
	(Run 'diff -u 263.out 263.out.bad' to see the entire diff)
	Ran: 263
	Failures: 263
	Failed 1 of 1 tests

It turns out that, when the test tries to write 2KB data with dio, the new dio
path allocates 4KB data block without filling zero data inside the remained 2KB
area. Finally, the output file contains a garbage data for that region.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:33:07 +09:00
Jaegeuk Kim
1ec79083b2 f2fs: should put the dnode when NEW_ADDR is detected
When get_dnode_of_data() in get_data_block() returns a successful dnode, we
should put the dnode.
But, previously, if its data block address is equal to NEW_ADDR, we didn't do
that, resulting in a deadlock condition.
So, this patch splits original error conditions with this case, and then calls
f2fs_put_dnode before finishing the function.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:33:06 +09:00
Jaegeuk Kim
58bfaf44df f2fs: introduce F2FS_INODE macro to get f2fs_inode
This patch introduces F2FS_INODE that returns struct f2fs_inode * from the inode
page.
By using this macro, we can remove unnecessary casting codes like below.

   struct f2fs_inode *ri = &F2FS_NODE(inode_page)->i;
-> struct f2fs_inode *ri = F2FS_INODE(inode_page);

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 20:32:48 +09:00
Chao Yu
d96b143151 f2fs: check filename length in recover_dentry
In current flow, we will get Null return value of f2fs_find_entry in
recover_dentry when name.len is bigger than F2FS_NAME_LEN, and then we
still add this inode into its dir entry.
To avoid this situation, we must check filename length before we use it.

Another point is that we could remove the code of checking filename length
In f2fs_find_entry, because f2fs_lookup will be called previously to ensure of
validity of filename length.

V2:
 o add WARN_ON() as Jaegeuk Kim suggested.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-26 12:50:09 +09:00
Greg Kroah-Hartman
5bd2010fbe Merge 3.13-rc5 into staging-next
We want these fixes here to handle some merge issues.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-24 09:43:21 -08:00
Linus Torvalds
dc0a6b4fee Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 fix from Jan Kara:
 "One simple fix of oops in ext2 which was recently hit by Christoph"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
2013-12-23 17:24:38 -08:00
Jan Kara
4ea7772f82 udf: Fix lockdep warning from udf_symlink()
Lockdep is complaining about UDF:
=============================================
[ INFO: possible recursive locking detected ]
3.12.0+ #16 Not tainted
---------------------------------------------
ln/7386 is trying to acquire lock:
 (&ei->i_data_sem){+.+...}, at: [<ffffffff8142f06d>] udf_get_block+0x8d/0x130

but task is already holding lock:
 (&ei->i_data_sem){+.+...}, at: [<ffffffff81431a8d>] udf_symlink+0x8d/0x690

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&ei->i_data_sem);
  lock(&ei->i_data_sem);

 *** DEADLOCK ***

This is because we hold i_data_sem of the symlink inode while calling
udf_add_entry() for the directory. I don't think this can ever lead to
deadlocks since we never hold i_data_sem for two inodes in any other
place.

The fix is simple - move unlock of i_data_sem for symlink inode up. We
don't need it for anything when linking symlink inode to directory.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-12-23 22:02:16 +01:00
Chao Yu
deead09009 f2fs: avoid to set wrong pino of inode when rename dir
When we rename a dir to new name which is not exist previous,
we will set pino of parent inode with ino of child inode in f2fs_set_link.
It destroy consistency of pino, it should be fixed.

Thanks for previous work of Shu Tan.

Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:42:51 +09:00
Chao Yu
4f4124d0b9 f2fs: update several comments
Update several comments:
1. use f2fs_{un}lock_op install of mutex_{un}lock_op.
2. update comment of get_data_block().
3. update description of node offset.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:26:03 +09:00
Gu Zheng
7e8f23081a f2fs: remove the rw_flag domain from f2fs_io_info
When using the f2fs_io_info in the low level, we still need to merge the
rw and rw_flag, so use the rw to hold all the io flags directly,
and remove the rw_flag field.

ps.It is based on the previous patch:
f2fs: move all the bio initialization into __bio_alloc

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Gu Zheng
940a6d34b3 f2fs: move all the bio initialization into __bio_alloc
Move all the bio initialization into __bio_alloc, and some minor cleanups are
also added.

v3:
  Use 'bool' rather than 'int' as Kim suggested.

v2:
  Use 'is_read' rather than 'rw' as Yu Chao suggested.
  Remove the needless initialization of bio->bi_private.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Jaegeuk Kim
5459aa9770 f2fs: write dirty meta pages collectively
This patch enhances writing dirty meta pages collectively in background.
During the file data writes, it'd better avoid to write small dirty meta pages
frequently.
So let's give a chance to collect a number of dirty meta pages for a while.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Jaegeuk Kim
bfad7c2d40 f2fs: introduce a new direct_IO write path
Previously, f2fs doesn't support direct IOs with high performance, which throws
every write requests via the buffered write path, resulting in highly
performance degradation due to memory opeations like copy_from_user.

This patch introduces a new direct IO path in which every write requests are
processed by generic blockdev_direct_IO() with enhanced get_block function.

The get_data_block() in f2fs handles:
1. if original data blocks are allocates, then give them to blockdev.
2. otherwise,
  a. preallocate requested block addresses
  b. do not use extent cache for better performance
  c. give the block addresses to blockdev

This policy induces that:
- new allocated data are sequentially written to the disk
- updated data are randomly written to the disk.
- f2fs gives consistency on its file meta, not file data.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Jaegeuk Kim
216fbd6443 f2fs: introduce sysfs entry to control in-place-update policy
This patch introduces new sysfs entries for users to control the policy of
in-place-updates, namely IPU, in f2fs.

Sometimes f2fs suffers from performance degradation due to its out-of-place
update policy that produces many additional node block writes.
If the storage performance is very dependant on the amount of data writes
instead of IO patterns, we'd better drop this out-of-place update policy.

This patch suggests 5 polcies and their triggering conditions as follows.

[sysfs entry name = ipu_policy]

0: F2FS_IPU_FORCE       all the time,
1: F2FS_IPU_SSR         if SSR mode is activated,
2: F2FS_IPU_UTIL        if FS utilization is over threashold,
3: F2FS_IPU_SSR_UTIL    if SSR mode is activated and FS utilization is over
                        threashold,
4: F2FS_IPU_DISABLE    disable IPU. (=default option)

[sysfs entry name = min_ipu_util]

This parameter controls the threshold to trigger in-place-updates.
The number indicates percentage of the filesystem utilization, and used by
F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies.

For more details, see need_inplace_update() in segment.h.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Changman Lee
5dcd8a7150 f2fs: missing kmem_cache_destroy for discard_entry
insmod f2fs.ko is failed after insmod and rmmod firstly.

$ sudo insmod fs/f2fs/f2fs.ko
insmod: error inserting 'fs/f2fs/f2fs.ko': -1 Cannot allocate memory

-- dmesg --
kmem_cache_sanity_check (free_nid): Cache name already exists.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:07 +09:00
Jaegeuk Kim
76130ccabc f2fs: fix the location of tracepoint
We need to get a trace before submit_bio, since its bi_sector is remapped during
the submit_bio.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Jaegeuk Kim
458e6197c3 f2fs: refactor bio->rw handling
This patch introduces f2fs_io_info to mitigate the complex parameter list.

struct f2fs_io_info {
	enum page_type type;		/* contains DATA/NODE/META/META_FLUSH */
	int rw;				/* contains R/RS/W/WS */
	int rw_flag;			/* contains REQ_META/REQ_PRIO */
}

1. f2fs_write_data_pages
 - DATA
 - WRITE_SYNC is set when wbc->WB_SYNC_ALL.

2. sync_node_pages
 - NODE
 - WRITE_SYNC all the time

3. sync_meta_pages
 - META
 - WRITE_SYNC all the time
 - REQ_META | REQ_PRIO all the time

 ** f2fs_submit_merged_bio() handles META_FLUSH.

4. ra_nat_pages, ra_sit_pages, ra_sum_pages
 - META
 - READ_SYNC

Cc: Fan Li <fanofcode.li@samsung.com>
Cc: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Fan Li
63a0b7cb33 f2fs: merge pages with the same sync_mode flag
Previously f2fs submits most of write requests using WRITE_SYNC, but f2fs_write_data_pages
submits last write requests by sync_mode flags callers pass.

This causes a performance problem since continuous pages with different sync flags
can't be merged in cfq IO scheduler(thanks yu chao for pointing it out), and synchronous
requests often take more time.

This patch makes the following modifies to DATA writebacks:

1. every page will be written back using the sync mode caller pass.
2. only pages with the same sync mode can be merged in one bio request.

These changes are restricted to DATA pages.Other types of writebacks are modified
To remain synchronous.

In my test with tiotest, f2fs sequence write performance is improved by about 7%-10% ,
and this patch has no obvious impact on other performance tests.

Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Jaegeuk Kim
6bacf52fb5 f2fs: add unlikely() macro for compiler more aggressively
This patch adds unlikely() macro into the most of codes.
The basic rule is to add that when:
- checking unusual errors,
- checking page mappings,
- and the other unlikely conditions.

Change log from v1:
 - Don't add unlikely for the NULL test and error test: advised by Andi Kleen.

Cc: Chao Yu <chao2.yu@samsung.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Chao Yu
cfb271d485 f2fs: add unlikely() macro for compiler optimization
As we know, some of our branch condition will rarely be true. So we could add
'unlikely' to let compiler optimize these code, by this way we could drop
unneeded 'jump' assemble code to improve performance.

change log:
 o add *unlikely* as many as possible across the whole source files at once
   suggested by Jaegeuk Kim.

Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Chao Yu
b9987a277f f2fs: avoid unneeded page release for correct _count of page
In find_fsync_dnodes() and recover_data(), our flow is like this:

->f2fs_submit_page_bio()
	-> f2fs_put_page()
		-> page_cache_release()	---- page->_count declined to zero.
->__free_pages()
	-> put_page_testzero() ---- page->_count will be declined again.

We will get a segment fault in put_page_testzero when CONFIG_DEBUG_VM
is on, or return MM with a bad page with wrong _count num.

So let's just release this page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:06 +09:00
Chao Yu
a0acdfe05a f2fs: use inner macro GFP_F2FS_ZERO for simplification
Use inner macro GFP_F2FS_ZERO to instead of GFP_NOFS | __GFP_ZERO for
simplification of code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Younger Liu
40e1ebe97d f2fs: replace the debugfs_root with f2fs_debugfs_root
This minor change for the naming conventions of debugfs_root
to avoid any possible conflicts to the other filesystem.

Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
[Jaegeuk Kim: change the patch name]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Younger Liu
c524723ebf f2fs: remove debufs dir if debugfs_create_file() failed
When debugfs_create_file() failed in f2fs_create_root_stats(),
debugfs_root should be remove.

Signed-off-by: Younger Liu <liuyiyang@hisense.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Chao Yu
9af0ff1c52 f2fs: readahead contiguous pages for restore_node_summary
If cp has no CP_UMOUNT_FLAG, we will read all pages in whole node segment
one by one, it makes low performance. So let's merge contiguous pages and
readahead for better performance.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: adjust the new bio operations]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Jaegeuk Kim
93dfe2ac51 f2fs: refactor bio-related operations
This patch integrates redundant bio operations on read and write IOs.

1. Move bio-related codes to the top of data.c.
2. Replace f2fs_submit_bio with f2fs_submit_merged_bio, which handles read
   bios additionally.
3. Introduce __submit_merged_bio to submit the merged bio.
4. Change f2fs_readpage to f2fs_submit_page_bio.
5. Introduce f2fs_submit_page_mbio to integrate previous submit_read_page and
   submit_write_page.

Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com >
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Jaegeuk Kim
187b5b8b3d f2fs: remove the own bi_private allocation
Previously f2fs allocates its own bi_private data structure all the time even
though we don't use it. But, can we remove this bi_private allocation?

This patch removes such the additional bi_private allocation.

1. Retrieve f2fs_sb_info from its page->mapping->host->i_sb.
 - This removes the usecases of bi_private in end_io.

2. Use bi_private only when we really need it.
 - The bi_private is used only when the checkpoint procedure is conducted.
 - When conducting the checkpoint, f2fs submits a META_FLUSH bio to wait its bio
completion.
 - Since we have no dependancies to remove bi_private now, let's just use
 bi_private pointer as the completion pointer.

Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Chao Yu
8f99a946f3 f2fs: convert recover_orphan_inodes to void
The recover_orphan_inodes() returns no error all the time, so we don't need to
check its errors.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:05 +09:00
Chao Yu
1069bbf7b9 f2fs: check return value of f2fs_readpage in find_data_page
We should return error if we do not get an updated page in find_date_page
when f2fs_readpage failed.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Chao Yu
01d2d1aa06 f2fs: use true and false for boolean variable
The inode_page_locked should be a boolean variable.

struct dnode_of_data {
	struct inode *inode;            /* vfs inode pointer */
	struct page *inode_page;        /* its inode page, NULL is possible */
	struct page *node_page;         /* cached direct node page */
	nid_t nid;                      /* node id of the direct node block */
	unsigned int ofs_in_node;       /* data offset in the node page */
==>	bool inode_page_locked;         /* inode page is locked or not */
	block_t data_blkaddr;           /* block address of the node block */
};

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Chao Yu
aac44046a2 f2fs: correct type of wait in struct bio_private
The void *wait in bio_private is used for waiting completion of checkpoint bio.
So we don't need to use its type as void, but declare it as completion type.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Chao Yu
6947eea957 f2fs: avoid to calculate incorrect max orphan number
Because we will write node summaries when do_checkpoint with umount flag,
our number of max orphan blocks should minus NR_CURSEG_NODE_TYPE additional.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Chao Yu
a66c7b2fcf f2fs: remove unneeded code in punch_hole
Because FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE
in fallocate, so we could remove the useless 'keep size' branch code which
will never be excuted in punch_hole.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
[Jaegeuk Kim: remove an unnecessary parameter togather]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Jaegeuk Kim
031fa8cc9b f2fs: remove unnecessary condition checks
This patch removes the unnecessary condition checks on:

fs/f2fs/gc.c:667 do_garbage_collect() warn: 'sum_page' isn't an ERR_PTR
fs/f2fs/f2fs.h:795 f2fs_put_page() warn: 'page' isn't an ERR_PTR

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Jaegeuk Kim
f9a4e6df52 f2fs: bug fix on bit overflow from 32bits to 64bits
This patch fixes some bit overflows by the shift operations.

Dan Carpenter reported potential bugs on bit overflows as follows.

fs/f2fs/segment.c:910 submit_write_page()
	warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/checkpoint.c:429 get_valid_checkpoint()
	warn: should '1 << ()' be a 64 bit type?
fs/f2fs/data.c:408 f2fs_readpage()
	warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/data.c:457 submit_read_page()
	warn: should 'blk_addr << ((sbi)->log_blocksize - 9)' be a 64 bit type?
fs/f2fs/data.c:525 get_data_block_ro()
	warn: should 'i << blkbits' be a 64 bit type?

Bug-Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Gu Zheng
3679556794 f2fs: fix a potential out of range issue
Fix a potential out of range issue introduced by commit:
22fb72225a
f2fs: simplify write_orphan_inodes for better readable

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:04 +09:00
Jaegeuk Kim
0e80220ac5 f2fs: remove unnecessary return value
Let's remove the unnecessary return value.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Huajun Li
8274de77b7 f2fs: add a new mount option: inline_data
Add a mount option: inline_data. If the mount option is set,
data of New created small files can be stored in their inode.

Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Huajun Li
1001b3479c f2fs: add flags and helpers to support inline data
Add new inode flags F2FS_INLINE_DATA and FI_INLINE_DATA to indicate
whether the inode has inline data.

Inline data makes use of inode block's data indices region to save small
file. Currently there are 923 data indices in an inode block. Since
inline xattr has made use of the last 50 indices to save its data, there
are 873 indices left which can be used for inline data. When
FI_INLINE_DATA is set, the layout of inode block's indices region is
like below:

+-----------------+
|                 | Reserved. reserve_new_block() will make use of
| i_addr[0]       | i_addr[0] when we need to reserve a new data block
|                 | to convert inline data into regular one's.
|-----------------|
|                 | Used by inline data. A file whose size is less than
| i_addr[1~872]   | 3488 bytes(~3.4k) and doesn't reserve extra
|                 | blocks by fallocate() can be saved here.
|-----------------|
|                 |
| i_addr[873~922] | Reserved for inline xattr
|                 |
+-----------------+

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Changman Lee
03232305ff f2fs: send REQ_META or REQ_PRIO when reading meta area
Let's send REQ_META or REQ_PRIO when reading meta area such as NAT/SIT
etc.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Jaegeuk Kim
a709f4a2f2 f2fs: add detailed information of bio types in the tracepoints
This patch inserts information of bio types in more detail.
So, we can now see REQ_META and REQ_PRIO too.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Huajun Li
b600965c43 f2fs: add a new function: f2fs_reserve_block()
Add the function f2fs_reserve_block() to easily reserve new blocks, and
use it to clean up more codes.

Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Jaegeuk Kim
0daaad97dc f2fs: avoid lock debugging overhead
If CONFIG_F2FS_CHECK_FS is unset, we don't need to add any debugging overhead.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:03 +09:00
Chao Yu
74de593af7 f2fs: read contiguous sit entry pages by merging for mount performance
Previously we read sit entries page one by one, this method lost the chance
of reading contiguous page together. So we read pages as contiguous as
possible for better mount performance.

change log:
 o merge judgements/use 'Continue' or 'Break' instead of 'Goto' as Gu Zheng
   suggested.
 o add mark_page_accessed() before release page to delay VM reclaiming.
 o remove '*order' for simplification of function as Jaegeuk Kim suggested.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix a bug on the block address calculation]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Chao Yu
d4d288bc72 f2fs: adds a tracepoint for f2fs_submit_read_bio
This patch adds a tracepoint for f2fs_submit_read_bio.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: integrate tracepoints of f2fs_submit_read(_write)_bio]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Chao Yu
87b8872d5b f2fs: adds a tracepoint for submit_read_page
This patch adds a tracepoint for submit_read_page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: integrate tracepoints of f2fs_submit_read(_write)_page]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Changman Lee
61ae45c880 f2fs: simplify IS_DATASEG and IS_NODESEG macro
It is not efficient comparing each segment type to find node or data.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: remove unnecessary white spaces]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Jaegeuk Kim
7107e0a9b1 f2fs: merge read IOs at ra_nat_pages()
Change log from v1:
  o add mark_page_accessed() not to reclaim the nat pages.

This patch changes the policy of submitting read bios at ra_nat_pages.

Previously, f2fs submits small read bios with block plugging.
But, with this patch, f2fs itself merges read bios first and then submits a
large bio, which can reduce the bio handling overheads.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Chao Yu
924b720b58 f2fs: add a new function to support for merging contiguous read
For better read performance, we add a new function to support for merging
contiguous read as the one for write.

v1-->v2:
 o add declarations here as Gu Zheng suggested.
 o use new structure f2fs_bio_info introduced by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Acked-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
2013-12-23 10:18:02 +09:00
Gu Zheng
ce3b7d80ed f2fs: move the list_head initialization into the lock protection region
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:02 +09:00
Gu Zheng
502c6e0bcd f2fs: simplify write_orphan_inodes for better readable
Simplify write_orphan_inodes for better readable. Because we hold the
orphan_inode_mutex, so it's safe to use list_for_each_entry instead of
list_for_each_safe.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Gu Zheng
ef86d70994 f2fs: convert inc/dec_valid_node_count to inc/dec one count
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Gu Zheng
da19b0dc50 f2fs: convert dev_valid_block_count to void
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Gu Zheng
58e674d6ab f2fs: convert remove_inode_page to void
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Jaegeuk Kim
1ff7bd3bb5 f2fs: introduce a bio array for per-page write bios
The f2fs has three bio types, NODE, DATA, and META, and manages some data
structures per each bio types.

The codes are a little bit messy, thus, this patch introduces a bio array
which groups individual data structures as follows.

struct f2fs_bio_info {
	struct bio *bio;		/* bios to merge */
	sector_t last_block_in_bio;	/* last block number */
	struct mutex io_mutex;		/* mutex for bio */
};

struct f2fs_sb_info {
	...
	struct f2fs_bio_info write_io[NR_PAGE_TYPE];	/* for write bios */
	...
};

The code changes from this new data structure are trivial.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Jaegeuk Kim
c11abd1a80 f2fs: disable the extent cache ops on high fragmented files
The f2fs manages an extent cache to search a number of consecutive data blocks
very quickly.

However it conducts unnecessary cache operations if the file is highly
fragmented with no valid extent cache.

In such the case, we don't need to handle the extent cache, but just can disable
the cache facility.

Nevertheless, this patch gives one more chance to enable the extent cache.

For example,
1. create a file
2. write data sequentially which produces a large valid extent cache
3. update some data, resulting in a fragmented extent
4. if the fragmented extent is too small, then drop extent cache
5. close the file

6. open the file again
7. give another chance to make a new extent cache
8. write data sequentially again which creates another big extent cache.
...

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Jaegeuk Kim
971767caf6 f2fs: use sbi->write_mutex for write bios
This patch removes an unnecessary semaphore (i.e., sbi->bio_sem).
There is no reason to use the semaphore when f2fs submits read and write IOs.
Instead, let's use a write mutex and cover the sbi->bio[] by the lock.

Change log from v1:
 o split write_mutex suggested by Chao Yu

Chao described,
"All DATA/NODE/META bio buffers in superblock is protected by
'sbi->write_mutex', but each bio buffer area is independent, So we
should split write_mutex to three for DATA/NODE/META."

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:01 +09:00
Jaegeuk Kim
7d5e510944 f2fs: clean up the do_submit_bio flow
This patch introduces PAGE_TYPE_OF_BIO() and cleans up do_submit_bio() with it.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Chao Yu
75c3c8bc88 f2fs: use f2fs_put_page to release page for uniform style
We should use f2fs_put_page to release page for uniform style of f2fs code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim
1661d07c2d f2fs: add a tracepoint for f2fs_issue_discard
This patch adds a tracepoint for f2fs_issue_discard.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim
3720887910 f2fs: introduce f2fs_issue_discard() to clean up
Change log from v1:
 o fix 32bit drops reported by Dan Carpenter

This patch adds f2fs_issue_discard() to clean up blkdev_issue_discard() flows.

Dan carpenter reported:
"block_t is a 32 bit type and sector_t is a 64 bit type.  The upper 32
bits of the sector_t are not used because the shift will wrap."

Bug-Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim
7ac8c3b051 f2fs: add a sysfs entry to control max_discards
If frequent small discards are issued to the device, the performance would
be degraded significantly.
So, this patch adds a sysfs entry to control the number of discards to be
issued during a checkpoint procedure.

By default, f2fs does not issue any small discards, which means max_discards
is zero.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim
b29555505d f2fs: add key functions for small discards
This patch adds key functions to activate the small discard feature.

Note that this procedure is conducted during the checkpoint only.

In flush_sit_entries(), when a new dirty sit entry is flushed, f2fs calls
add_discard_addrs() which searches candidates to be discarded.
The candidates should be marked *invalidated* and also previous checkpoint
recognizes it as *valid*.

At the end of a checkpoint procedure, f2fs throws discards based on the
discard entry list.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Jaegeuk Kim
7fd9e544fb f2fs: add a slab cache entry for small discards
This patch adds a slab cache entry for small discards.

Each entry consists of:

struct discard_entry {
	struct list_head list;	/* list head */
	block_t blkaddr;	/* block address to be discarded */
	int len;		/* # of consecutive blocks of the discard */
};

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:18:00 +09:00
Changman Lee
e81c93cf8c f2fs: improve searching speed of __next_free_blkoff
To find a zero bit using the result of OR operation between ckpt_valid_map
and cur_valid_map is more fast than find a zero bit in each bitmap.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: adjust changed function name]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:17:59 +09:00
Changman Lee
9a7f143ab5 f2fs: introduce __find_rev_next(_zero)_bit
When f2fs_set_bit is used, in a byte MSB and LSB is reversed,
in that case we can use __find_rev_next_bit or __find_rev_next_zero_bit.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: change the function names]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-12-23 10:17:59 +09:00
Linus Torvalds
a8472b4bb1 Merge git://git.kvack.org/~bcrl/aio-next
Pull AIO leak fixes from Ben LaHaise:
 "I've put these two patches plus Linus's change through a round of
  tests, and it passes millions of iterations of the aio numa
  migratepage test, as well as a number of repetitions of a few simple
  read and write tests.

  The first patch fixes the memory leak Kent introduced, while the
  second patch makes aio_migratepage() much more paranoid and robust"

* git://git.kvack.org/~bcrl/aio-next:
  aio/migratepages: make aio migrate pages sane
  aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
2013-12-22 11:03:49 -08:00
Linus Torvalds
3dc9acb676 aio: clean up and fix aio_setup_ring page mapping
Since commit 36bc08cc01 ("fs/aio: Add support to aio ring pages
migration") the aio ring setup code has used a special per-ring backing
inode for the page allocations, rather than just using random anonymous
pages.

However, rather than remembering the pages as it allocated them, it
would allocate the pages, insert them into the file mapping (dirty, so
that they couldn't be free'd), and then forget about them.  And then to
look them up again, it would mmap the mapping, and then use
"get_user_pages()" to get back an array of the pages we just created.

Now, not only is that incredibly inefficient, it also leaked all the
pages if the mmap failed (which could happen due to excessive number of
mappings, for example).

So clean it all up, making it much more straightforward.  Also remove
some left-overs of the previous (broken) mm_populate() usage that was
removed in commit d6c355c7da ("aio: fix race in ring buffer page
lookup introduced by page migration support") but left the pointless and
now misleading MAP_POPULATE flag around.

Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-22 11:03:08 -08:00
Benjamin LaHaise
8e321fefb0 aio/migratepages: make aio migrate pages sane
The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.

While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-12-21 17:56:08 -05:00
Benjamin LaHaise
1881686f84 aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
e34ecee2ae reworked the percpu reference
counting to correct a bug trinity found.  Unfortunately, the change lead
to kioctxes being leaked because there was no final reference count to
put.  Add that reference count back in to fix things.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
2013-12-21 15:57:09 -05:00
Linus Torvalds
a6ddeee32d xfs: bugfixes for 3.13-rc5
- fix memory leak in xfs_dir2_node_removename
 - fix quota assertion in xfs_setattr_size
 - fix quota assertions in xfs_qm_vop_create_dqattach
 - fix for hang when disabling group and project quotas before
   disabling user quotas
 - fix Dave Chinner's email address in MAINTAINERS
 - fix for file allocation alignment
 - fix for assertion in xfs_buf_stale by removing xfsbdstrat
 - fix for alignment with swalloc mount option
 - fix for "retry forever" semantics on IO errors
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSs0PZAAoJENaLyazVq6ZOELgP/Rcx5JdjfCdvZZ7HFfzabLU6
 TOpyEpc0TJso8C92+UNZJUZWNdToEn/v1VRh6dQ+cCz3RxQfOeOKVKXU1XkCBRQO
 JxW7Pucb+SRoVf+uv6qZCCJUO1oY6JByZ8+9GuBGWK5Ul2ByxTPI50Et0Qy4wM3z
 cDvQVyjtA5+63ToUS0sR8yBSKK+8c9SkjVkdLqa+AoFJHYC+meNrZ0J1PRV2ILWu
 bFJtKFe/tO4jj/UJ1uj6ZjvVQ0jm9JH1ZE4m3tbjPcDCTHyxHu5vSBVSlPO4WbAb
 Tfaj4eB7rQy05yno2/mAjn2koaqTSg1cP5V14TMP1GzBQUpwQDAWsNGkorXPfRIn
 Xsrznxk33fTCTqVSkSnVsXKZhizzPydyVCcvf00YJssYh9IEjVdWVpxedLFVJDmO
 jatsMaEAe7Z8avtah6u5vDGTQCEPQjhHPEqhW/EUfCNG1uK6DjyMG4dDsCMufJ7N
 Ze646oXD6zd45hSPQxMV1r8ZvlQoubUgctOBNqs/nDhOblRQ7MRqkRHhPRvvzsBG
 ffVB145l5v1cud0IcpIbfWPtosnPAvoqYS+qglkXkmXmU7rk0APePDYP7XLh4+qy
 8ROkJQ0rsgmC2cyC/fmwtwWQCMCRUrI9YB2X1zRiBS6TwwATP2uIomtT7GwAfK4+
 AmCwxwy6XPMhUd3xn3Vx
 =32uU
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:
 "This contains fixes for some asserts
   related to project quotas, a memory leak, a hang when disabling group or
   project quotas before disabling user quotas, Dave's email address, several
   fixes for the alignment of file allocation to stripe unit/width geometry, a
   fix for an assertion with xfs_zero_remaining_bytes, and the behavior of
   metadata writeback in the face of IO errors.

   Details:
   - fix memory leak in xfs_dir2_node_removename
   - fix quota assertion in xfs_setattr_size
   - fix quota assertions in xfs_qm_vop_create_dqattach
   - fix for hang when disabling group and project quotas before
     disabling user quotas
   - fix Dave Chinner's email address in MAINTAINERS
   - fix for file allocation alignment
   - fix for assertion in xfs_buf_stale by removing xfsbdstrat
   - fix for alignment with swalloc mount option
   - fix for "retry forever" semantics on IO errors"

* tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs:
  xfs: abort metadata writeback on permanent errors
  xfs: swalloc doesn't align allocations properly
  xfs: remove xfsbdstrat error
  xfs: align initial file allocations correctly
  MAINTAINERS: fix incorrect mail address of XFS maintainer
  xfs: fix infinite loop by detaching the group/project hints from user dquot
  xfs: fix assertion failure at xfs_setattr_nonsize
  xfs: fix false assertion at xfs_qm_vop_create_dqattach
  xfs: fix memory leak in xfs_dir2_node_removename
2013-12-20 15:48:45 -08:00
Luck, Tony
df36ac1bc2 pstore: Don't allow high traffic options on fragile devices
Some pstore backing devices use on board flash as persistent
storage. These have limited numbers of write cycles so it
is a poor idea to use them from high frequency operations.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-20 13:12:01 -08:00
Theodore Ts'o
f5a44db5d2 ext4: add explicit casts when masking cluster sizes
The missing casts can cause the high 64-bits of the physical blocks to
be lost.  Set up new macros which allows us to make sure the right
thing happen, even if at some point we end up supporting larger
logical block numbers.

Thanks to the Emese Revfy and the PaX security team for reporting this
issue.

Reported-by: PaX Team <pageexec@freemail.hu>
Reported-by: Emese Revfy <re.emese@gmail.com>                                 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-20 09:29:35 -05:00
Steven Whitehouse
582d2f7aed GFS2: Wait for async DIO in glock state changes
We need to wait for any outstanding DIO to complete in a couple
of situations. Firstly, in case we are changing out of deferred
mode (in inode_go_sync) where GLF_DIRTY will not be set. That
call could be prefixed with a test for gl_state == LM_ST_DEFERRED
but it doesn't seem worth it bearing in mind that the test for
outstanding DIO is very quick anyway, in the usual case that there
is none.

The second case is in inode_go_lock which will catch the cases
where we have a cached EX lock, but where we grant deferred locks
against it so that there is no glock state transistion. We only
need to wait if the state is not deferred, since DIO is valid
anyway in that state.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-20 10:42:08 +00:00
Steven Whitehouse
dfd11184d8 GFS2: Fix incorrect invalidation for DIO/buffered I/O
In patch 209806aba9 we allowed
local deferred locks to be granted against a cached exclusive
lock. That opened up a corner case which this patch now
fixes.

The solution to the problem is to check whether we have cached
pages each time we do direct I/O and if so to unmap, flush
and invalidate those pages. Since the glock state machine
normally does that for us, mostly the code will be a no-op.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-20 10:41:21 +00:00
Masanari Iida
77d84ff87e treewide: Fix typos in printk
Correct spelling typo in various part of kernel

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-12-19 15:10:49 +01:00
Linus Torvalds
2516b61537 Driver core fix for 3.13-rc5
Here's a single sysfs fix for 3.13-rc5 that resolves a lockdep issue in
 sysfs that has been reported.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlKx3T4ACgkQMUfUDdst+ykALQCfchxid3U9xydNKQMx51PwLiKX
 c08AnA1ZhDtQ6rfUi+49YJxhWLgUyYHD
 =glqF
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fix from Greg KH:
 "Here's a single sysfs fix for 3.13-rc5 that resolves a lockdep issue
  in sysfs that has been reported"

* tag 'driver-core-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  sysfs: give different locking key to regular and bin files
2013-12-18 14:33:23 -08:00
Christoph Hellwig
eef334e577 xfs: assert that we hold the ilock for extent map access
Make sure that xfs_bmapi_read has the ilock held in some way, and that
xfs_bmapi_write, xfs_bmapi_delay, xfs_bunmapi and xfs_iread_extents are
called with the ilock held exclusively.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 16:08:46 -06:00
Christoph Hellwig
568d994e9f xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int
We might not have read in the extent list at this point, so make sure we
take the ilock exclusively if we have to do so.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 16:08:04 -06:00
Christoph Hellwig
683cb94159 xfs: use xfs_ilock_attr_map_shared in xfs_attr_get
We might not have read in the extent list at this point, so make sure we
take the ilock exclusively if we have to do so.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 16:07:09 -06:00
Christoph Hellwig
da51d32d45 xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate
We might not have read in the extent list at this point, so make sure we
take the ilock exclusively if we have to do so.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 16:06:38 -06:00
Christoph Hellwig
f4df8adc83 xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp
We might not have read in the extent list at this point, so make sure we
take the ilock exclusively if we have to do so.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 16:04:18 -06:00
Christoph Hellwig
4f317369d4 xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 15:53:38 -06:00
Ben Myers
40194ecc6d xfs: reinstate the ilock in xfs_readdir
Although it was removed in commit 051e7cd44a, ilock needs to be taken in
xfs_readdir because we might have to read the extent list in from disk.  This
keeps other threads from reading from or writing to the extent list while it is
being read in and is still in a transitional state.

This has been associated with "Access to block zero" messages on directories
with large numbers of extents resulting from excessive filesytem fragmentation,
as well as extent list corruption.  Unfortunately no test case at this point.

Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2013-12-18 15:52:36 -06:00
Christoph Hellwig
efa70be165 xfs: add xfs_ilock_attr_map_shared
Equivalent to xfs_ilock_data_map_shared, except for the attribute fork.

Make xfs_getbmap use it if called for the attribute fork instead of
xfs_ilock_data_map_shared.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 15:48:44 -06:00
Christoph Hellwig
309ecac8e7 xfs: rename xfs_ilock_map_shared
Make it clear that we're only locking against the extent map on the data
fork.  Also clean the function up a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 15:39:30 -06:00
Christoph Hellwig
01f4f32775 xfs: remove xfs_iunlock_map_shared
We can just use xfs_iunlock without any loss of clarity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 15:38:57 -06:00
Christoph Hellwig
30ba7ad543 xfs: no need to lock the inode in xfs_find_handle
Both the inode number and the generation do not change on a live inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-18 15:34:28 -06:00
Ben Myers
324bb26144 Merge branch 'xfs-for-linus-v3.13-rc5' into for-next 2013-12-18 10:36:58 -06:00
Jan Kara
34cf865d54 ext4: fix deadlock when writing in ENOSPC conditions
Akira-san has been reporting rare deadlocks of his machine when running
xfstests test 269 on ext4 filesystem. The problem turned out to be in
ext4_da_reserve_metadata() and ext4_da_reserve_space() which called
ext4_should_retry_alloc() while holding i_data_sem. Since
ext4_should_retry_alloc() can force a transaction commit, this is a
lock ordering violation and leads to deadlocks.

Fix the problem by just removing the retry loops. These functions should
just report ENOSPC to the caller (e.g. ext4_da_write_begin()) and that
function must take care of retrying after dropping all necessary locks.

Reported-and-tested-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-18 00:44:44 -05:00
Linus Torvalds
a5905a9205 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull two Ceph fixes from Sage Weil:
 "One of these is fixing a regression from the d_flags file type patch
  that went into -rc1 that broke instantiation of inodes and dentries
  (we were doing dentries first).  The other is just an off-by-one
  corner case"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: Avoid data inconsistency due to d-cache aliasing in readpage()
  ceph: initialize inode before instantiating dentry
2013-12-17 11:46:51 -08:00
Tejun Heo
80b9bbefc3 kernfs: add kernfs_dir_ops
Add support for mkdir(2), rmdir(2) and rename(2) syscalls.  This is
implemented through optional kernfs_dir_ops callback table which can
be specified on kernfs_create_root().  An implemented callback is
invoked when the matching syscall is invoked.

As kernfs keep dcache syncs with internal representation and
revalidates dentries on each access, the implementation of these
methods is extremely simple.  Each just discovers the relevant
kernfs_node(s) and invokes the requested callback which is allowed to
do any kernfs operations and the end result doesn't necessarily have
to match the expected semantics of the syscall.

This will be used to convert cgroup to use kernfs instead of its own
filesystem implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17 08:59:15 -08:00
Tejun Heo
19bbb92620 kernfs: allow negative dentries
kernfs doesn't allow negative dentries - kernfs_iop_lookup() returns
ERR_PTR(-ENOENT) instead of NULL which short-circuits negative dentry
creation and kernfs's d_delete() callback, kernfs_dop_delete(),
returns 1 for all removed nodes.  This in turn allows
kernfs_dop_revalidate() to assume that there's no negative dentry for
kernfs.

This worked fine for sysfs but kernfs is scheduled to grow mkdir(2)
support which depend on negative dentries.  This patch updates so that
kernfs allows negative dentries.  The required changes are almost
trivial - kernfs_iop_lookup() now returns NULL instead of
ERR_PTR(-ENOENT) when the target kernfs_node doesn't exist,
kernfs_dop_delete() is removed and kernfs_dop_revalidate() is updated
to check whether the target dentry is negative and request fresh
lookup if so.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17 08:59:15 -08:00
Tejun Heo
47a52e91f4 kernfs: update kernfs_rename_ns() to consider KERNFS_STATIC_NAME
kernfs_rename_ns() currently assumes that the target sysfs_dirent has
a copied name.  This has been okay because sysfs supports rename only
for directories which always have copied names; however, there's
nothing in kernfs interface which calls for such restriction and
currently invoking kernfs_rename_ns() on a regular file leads to oops
because it ends up trying to kfree() a static name.

This patch updates kernfs_rename_ns() so that it skips kfree() of the
old name if it's static.  This allows it to be used for all node
types.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17 08:59:15 -08:00
Tejun Heo
2063d608f5 kernfs: mark static names with KERNFS_STATIC_NAME
Because sysfs used struct attribute which are supposed to stay
constant, sysfs didn't copy names when creating regular files.  The
specified string for name was supposed to stay constant.  Such
distinction isn't inherent for kernfs.  kernfs_create_file[_ns]()
should be able to take the same @name as kernfs_create_dir[_ns]()

As there can be huge number of sysfs attributes, we still want to be
able to use static names for sysfs attributes.  This patch renames
kernfs_create_file_ns_key() to __kernfs_create_file() and adds
@name_is_static parameter so that the caller can explicitly indicate
that @name can be used without copying.  kernfs is updated to use
KERNFS_STATIC_NAME to distinguish static and copied names.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17 08:59:15 -08:00
Tejun Heo
d0ae3d4347 kernfs: add REMOVED check to create and rename paths
kernfs currently assumes that the caller doesn't try to create a new
node under a removed parent, rename a removed node, or move a node
under a removed node.  While this works fine for sysfs, it'd be nice
to have protection against such cases especially given that kernfs is
planned to add support for mkdir, rmdir and rename requsts from
userland which may make race conditions more likely.

This patch updates create and rename paths to check REMOVED and fail
the operation with -ENOENT if performed on or towards removed nodes.
Note that remove path already has such check.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17 08:59:15 -08:00
Tejun Heo
bb8b9d095c kernfs: add @mode to kernfs_create_dir[_ns]()
sysfs assumed 0755 for all newly created directories and kernfs
inherited it.  This assumption is unnecessarily restrictive and
inconsistent with kernfs_create_file[_ns]().  This patch adds @mode
parameter to kernfs_create_dir[_ns]() and update uses in sysfs
accordingly.  Among others, this will be useful for implementations of
the planned ->mkdir() method.

This patch doesn't introduce any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17 08:59:15 -08:00
Dave Chinner
ac8809f9ab xfs: abort metadata writeback on permanent errors
If we are doing aysnc writeback of metadata, we can get write errors
but have nobody to report them to. At the moment, we simply attempt
to reissue the write from io completion in the hope that it's a
transient error.

When it's not a transient error, the buffer is stuck forever in
this loop, and we cannot break out of it. Eventually, unmount will
hang because the AIL cannot be emptied and everything goes downhill
from them.

To solve this problem, only retry the write IO once before aborting
it. We don't throw the buffer away because some transient errors can
last minutes (e.g.  FC path failover) or even hours (thin
provisioned devices that have run out of backing space) before they
go away. Hence we really want to keep trying until we can't try any
more.

Because the buffer was not cleaned, however, it does not get removed
from the AIL and hence the next pass across the AIL will start IO on
it again. As such, we still get the "retry forever" semantics that
we currently have, but we allow other access to the buffer in the
mean time. Meanwhile the filesystem can continue to modify the
buffer and relog it, so the IO errors won't hang the log or the
filesystem.

Now when we are pushing the AIL, we can see all these "permanent IO
error" buffers and we can issue a warning about failures before we
retry the IO. We can also catch these buffers when unmounting an
issue a corruption warning, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17 09:40:23 -06:00
Dave Chinner
33177f0536 xfs: swalloc doesn't align allocations properly
When swalloc is specified as a mount option, allocations are
supposed to be aligned to the stripe width rather than the stripe
unit of the underlying filesystem. However, it does not do this.

What the implementation does is round up the allocation size to a
stripe width, hence ensuring that all allocations span a full stripe
width. It does not, however, ensure that that allocation is aligned
to a stripe width, and hence the allocations can span multiple
underlying stripes and so still see RMW cycles for things like
direct IO on MD RAID.

So, if the swalloc mount option is set, change the allocation
alignment in xfs_bmap_btalloc() to use the stripe width rather than
the stripe unit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17 09:30:12 -06:00
Christoph Hellwig
83a0adc3f9 xfs: remove xfsbdstrat error
The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that
handles the case of a shut down filesystem.  Most of the users have private,
uncached buffers that can just be freed in this case, but the complex error
handling in xfs_bioerror_relse messes up the case when it's called without
a locked buffer.

Remove xfsbdstrat and opencode the error handling in the callers.  All but
one can simply return an error and don't need to deal with buffer state,
and the one caller that cares about the buffer state could do with a major
cleanup as well, but we'll defer that to later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17 09:28:43 -06:00
Dave Chinner
6e708bcf65 xfs: align initial file allocations correctly
The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.

Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is cause write performance to drop in
alignment sensitive configurations.

Fix it by considering allocation into empty files as requiring
aligned allocation again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit f9b395a8ef)
2013-12-17 09:17:25 -06:00
Jie Liu
718cc6f88c xfs: fix infinite loop by detaching the group/project hints from user dquot
xfs_quota(8) will hang up if trying to turn group/project quota off
before the user quota is off, this could be 100% reproduced by:
  # mount -ouquota,gquota /dev/sda7 /xfs
  # mkdir /xfs/test
  # xfs_quota -xc 'off -g' /xfs <-- hangs up
  # echo w > /proc/sysrq-trigger
  # dmesg

  SysRq : Show Blocked State
  task                        PC stack   pid father
  xfs_quota       D 0000000000000000     0 27574   2551 0x00000000
  [snip]
  Call Trace:
  [<ffffffff81aaa21d>] schedule+0xad/0xc0
  [<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0
  [<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0
  [<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0
  [<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs]
  [<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30
  [<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs]
  [<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs]
  [<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs]
  [<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs]
  [<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs]
  [<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs]
  [<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs]
  [<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs]
  [<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs]
  [<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0
  [<ffffffff812fba4b>] ? iput+0x5b/0x90
  [<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0
  [<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b

It's fine if we turn user quota off at first, then turn off other
kind of quotas if they are enabled since the group/project dquot
refcount is decreased to zero once the user quota if off. Otherwise,
those dquots refcount is non-zero due to the user dquot might refer
to them as hint(s).  Hence, above operation cause an infinite loop
at xfs_qm_dquot_walk() while trying to purge dquot cache.

This problem has been around since Linux 3.4, it was introduced by:
  [ b84a3a9675 xfs: remove the per-filesystem list of dquots ]

Originally we will release the group dquot pointers because the user
dquots maybe carrying around as a hint via xfs_qm_detach_gdquots().
However, with above change, there is no such work to be done before
purging group/project dquot cache.

In order to solve this problem, this patch introduces a special routine
xfs_qm_dqpurge_hints(), and it would release the group/project dquot
pointers the user dquots maybe carrying around as a hint, and then it
will proceed to purge the user dquot cache if requested.

Cc: stable@vger.kernel.org
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit df8052e7da)
2013-12-17 09:16:40 -06:00
Jie Liu
5c22727895 xfs: fix assertion failure at xfs_setattr_nonsize
For CRC enabled v5 super block, change a file's ownership can simply
trigger an ASSERT failure at xfs_setattr_nonsize() if both group and
project quota are enabled, i.e,

[  305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621
[  305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable]
[  305.383939] Call Trace:
[  305.385536]  [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs]
[  305.387142]  [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs]
[  305.388727]  [<ffffffff811ca388>] notify_change+0x1a8/0x350
[  305.390298]  [<ffffffff811ac39d>] chown_common+0xfd/0x110
[  305.391868]  [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110
[  305.393440]  [<ffffffff811ad760>] SyS_lchown+0x20/0x30
[  305.394995]  [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[  305.399870] RIP  [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs]

This fix adjust the assertion to check if the super block support both
quota inodes or not.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 5a01dd54f4)
2013-12-17 09:16:08 -06:00
Jie Liu
30d161c9aa xfs: fix false assertion at xfs_qm_vop_create_dqattach
After the previous fix, there still has another ASSERT failure if turning
off any type of quota while fsstress is running at the same time.

Backtrace in this case:

[   50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118
[   50.867924] ------------[ cut here ]------------
... <snip>
[   50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable]
[   50.867999] invalid opcode: 0000 [#1] SMP
[   50.869407] Call Trace:
[   50.869446]  [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs]
[   50.869512]  [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs]
[   50.869564]  [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs]
[   50.869615]  [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs]
[   50.869655]  [<ffffffff811becd5>] vfs_mkdir+0x95/0x130
[   50.869689]  [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0
[   50.869723]  [<ffffffff811bf689>] SyS_mkdir+0x19/0x20
[   50.869757]  [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[   50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip>
[   50.870003] RIP  [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs]
[   50.870050]  RSP <ffff88002941fd60>
[   50.879251] ---[ end trace c93a2b342341c65b ]---

We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(),
however the assertion itself is not right IMHO.  While performing quota off, we
firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking
any special locks, see xfs_qm_scall_quotaoff().  Hence there is no guarantee
that the desired quota is still active.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 37eb9706eb)
2013-12-17 09:15:45 -06:00
Mark Tinguely
3a8c92086d xfs: fix memory leak in xfs_dir2_node_removename
Fix the leak of kernel memory in xfs_dir2_node_removename()
when xfs_dir2_leafn_remove() returns an error code.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit ef701600fd)
2013-12-17 09:15:12 -06:00
Dongmao Zhang
ece35848c1 dlm: set zero linger time on sctp socket
The recovery time for a failed node was taking a long
time because the failed node could not perform the full
shutdown process.  Removing the linger time speeds this
up.  The dlm does not care what happens to messages to
or from the failed node.

Signed-off-by: Dongmao Zhang <dmzhang@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-12-16 09:52:34 -06:00
Bob Peterson
502be2a32f GFS2: Fix slab memory leak in gfs2_bufdata
This patch fixes a slab memory leak that sometimes can occur
for files with a very short lifespan. The problem occurs when
a dinode is deleted before it has gotten to the journal properly.
In the leak scenario, the bd object is pinned for journal
committment (queued to the metadata buffers queue: sd_log_le_buf)
but is subsequently unpinned and dequeued before it finds its way
to the ail or the revoke queue. In this rare circumstance, the bd
object needs to be freed from slab memory, or it is forgotten.
We have to be very careful how we do it, though, because
multiple processes can call gfs2_remove_from_journal. In order to
avoid double-frees, only the process that does the unpinning is
allowed to free the bd.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-13 21:42:40 +00:00
Bob Peterson
9290a9a7c0 GFS2: Fix use-after-free race when calling gfs2_remove_from_ail
Function gfs2_remove_from_ail drops the reference on the bh via
brelse. This patch fixes a race condition whereby bh is deferenced
after the brelse when setting bd->bd_blkno = bh->b_blocknr;
Under certain rare circumstances, bh might be gone or reused,
and bd->bd_blkno is set to whatever that memory happens to be,
which is often 0. Later, in gfs2_trans_add_unrevoke, that bd fails
the test "bd->bd_blkno >= blkno" which causes it to never be freed.
The end result is that the bd is never freed from the bufdata cache,
which results in this error:
slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-13 21:42:23 +00:00
Steven Whitehouse
dfe5b9ad83 GFS2: don't hold s_umount over blkdev_put
This is a GFS2 version of Tejun's patch:
4f331f01b9
vfs: don't hold s_umount over close_bdev_exclusive() call

In this case its blkdev_put itself that is the issue and this
patch uses the same solution of dropping and retaking s_umount.

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-13 21:42:03 +00:00
Jan Kara
f9b0e058cb writeback: Fix data corruption on NFS
Commit 4f8ad655db "writeback: Refactor writeback_single_inode()" added
a condition to skip clean inode. However this is wrong in WB_SYNC_ALL
mode because there we also want to wait for outstanding writeback on
possibly clean inode. This was causing occasional data corruption issues
on NFS because it uses sync_inode() to make sure all outstanding writes
are flushed to the server before truncating the inode and with
sync_inode() returning prematurely file was sometimes extended back
by an outstanding write after it was truncated.

So modify the test to also check for pages under writeback in
WB_SYNC_ALL mode.

CC: stable@vger.kernel.org # >= 3.5
Fixes: 4f8ad655db
Reported-and-tested-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-12-14 04:21:26 +08:00
Ben Myers
46f23adf78 Merge branch 'xfs-factor-icluster-macros' into for-next 2013-12-13 14:15:33 -06:00
Li Wang
56f91aad69 ceph: Avoid data inconsistency due to d-cache aliasing in readpage()
If the length of data to be read in readpage() is exactly
PAGE_CACHE_SIZE, the original code does not flush d-cache
for data consistency after finishing reading. This patches fixes
this.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-13 09:11:38 -08:00
Yan, Zheng
86b58d1313 ceph: initialize inode before instantiating dentry
commit b18825a7c8 (Put a small type field into struct dentry::d_flags)
put a type field into struct dentry::d_flags. __d_instantiate() set the
field by checking inode->i_mode. So we should initialize inode before
instantiating dentry when handling mds reply.

Fixes: http://tracker.ceph.com/issues/6930
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-13 09:11:36 -08:00
Jie Liu
f9e5abcfc5 xfs: use xfs_icluster_size_fsb in xfs_imap
Use xfs_icluster_size_fsb() in xfs_imap().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 15:51:49 +11:00
Jie Liu
982e939e4d xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster
Use xfs_icluster_size_fsb() in xfs_ifree_cluster(), rename variable
ninodes to inodes_per_cluster, the latter is more meaningful.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 15:51:49 +11:00
Jie Liu
6e0c7b8c3e xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init
Use xfs_icluster_size_fsb() in xfs_ialloc_inode_init(), rename variable
ninodes to inodes_per_cluster, the latter is more meaningful.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 15:51:49 +11:00
Jie Liu
a2ba07b2d2 xfs: use xfs_icluster_size_fsb in xfs_bulkstat
Use xfs_icluster_size_fsb() in xfs_bulkstat(), make the related
variables more meaningful and remove an unused variable nimask
from it.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 15:51:48 +11:00
Jie Liu
904957b750 xfs: introduce a common helper xfs_icluster_size_fsb
Introduce a common routine xfs_icluster_size_fsb() to calculate
and return the number of file system blocks per inode cluster.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 15:51:48 +11:00
Jie Liu
126cd105d4 xfs: get rid of XFS_IALLOC_BLOCKS macros
Get rid of XFS_IALLOC_BLOCKS() marcos, use mp->m_ialloc_blks directly.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 15:51:48 +11:00
Jie Liu
0f49efd805 xfs: get rid of XFS_INODE_CLUSTER_SIZE macros
Get rid of XFS_INODE_CLUSTER_SIZE() macros, use mp->m_inode_cluster_size
directly.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 15:51:48 +11:00
Jie Liu
717834383c xfs: get rid of XFS_IALLOC_INODES macros
Get rid of XFS_IALLOC_INODES() marcos, use mp->m_ialloc_inos directly.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 15:51:46 +11:00
Linus Torvalds
8d2763770c Merge branch 'akpm' (fixes from Andrew)
Merge patches from Andrew Morton:
  "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: memcg: do not allow task about to OOM kill to bypass the limit
  mm: memcg: fix race condition between memcg teardown and swapin
  thp: move preallocated PTE page table on move_huge_pmd()
  mfd/rtc: s5m: fix register updating by adding regmap for RTC
  rtc: s5m: enable IRQ wake during suspend
  rtc: s5m: limit endless loop waiting for register update
  rtc: s5m: fix unsuccesful IRQ request during probe
  drivers/rtc/rtc-s5m.c: fix info->rtc assignment
  include/linux/kernel.h: make might_fault() a nop for !MMU
  drivers/rtc/rtc-at91rm9200.c: correct alarm over day/month wrap
  procfs: also fix proc_reg_get_unmapped_area() for !MMU case
  mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
  include/linux/hugetlb.h: make isolate_huge_page() an inline
2013-12-12 18:22:10 -08:00
Jan Beulich
ae5758a1a7 procfs: also fix proc_reg_get_unmapped_area() for !MMU case
Commit fad1a86e25 ("procfs: call default get_unmapped_area on
MMU-present architectures"), as its title says, took care of only the
MMU case, leaving the !MMU side still in the regressed state (returning
-EIO in all cases where pde->proc_fops->get_unmapped_area is NULL).

From the fad1a86e25 changelog:

 "Commit c4fe244857 ("sparc: fix PCI device proc file mmap(2)") added
  proc_reg_get_unmapped_area in proc_reg_file_ops and
  proc_reg_file_ops_no_compat, by which now mmap always returns EIO if
  get_unmapped_area method is not defined for the target procfs file, which
  causes regression of mmap on /proc/vmcore.

  To address this issue, like get_unmapped_area(), call default
  current->mm->get_unmapped_area on MMU-present architectures if
  pde->proc_fops->get_unmapped_area, i.e.  the one in actual file operation
  in the procfs file, is not defined"

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>	[3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 18:19:26 -08:00
Christoph Hellwig
ffda4e83aa xfs: remove the quotaoff log format from the quotaoff log item
This one doesn't save a whole lot of memory, but still makes the
code simpler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:34:08 +11:00
Christoph Hellwig
ce8e962939 xfs: remove the dquot log format from the dquot log item
No need to keep the dquot log format around all the time, we can
easily generate it at iop_format time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:34:07 +11:00
Christoph Hellwig
2f251293b0 xfs: remove the inode log format from the inode log item
No need to keep the inode log format around all the time, we can
easily generate it at iop_format time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:34:05 +11:00
Christoph Hellwig
da7765031d xfs: format logged extents directly into the CIL
With the new iop_format scheme there is no need to have a temporary buffer
to format logged extents into, we can do so directly into the CIL.  This
also allows to remove the shortcut for big endian systems that probably
hasn't gotten a lot of test coverage for a long time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:34:04 +11:00
Christoph Hellwig
bde7cff67c xfs: format log items write directly into the linear CIL buffer
Instead of setting up pointers to memory locations in iop_format which then
get copied into the CIL linear buffer after return move the copy into
the individual inode items.  This avoids the need to always have a memory
block in the exact same layout that gets written into the log around, and
allow the log items to be much more flexible in their in-memory layouts.

The only caveat is that we need to properly align the data for each
iovec so that don't have structures misaligned in subsequent iovecs.

Note that all log item format routines now need to be careful to modify
the copy of the item that was placed into the CIL after calls to
xlog_copy_iovec instead of the in-memory copy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:34:02 +11:00
Christoph Hellwig
1234351cba xfs: introduce xlog_copy_iovec
Add a helper to abstract out filling the log iovecs in the log item
format handlers.  This will allow us to change the way we do the log
item formatting more easily.

The copy in the name is a bit confusing for now as it just assigns a
pointer and lets the CIL code perform the copy, but that will change
soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:00:43 +11:00
Christoph Hellwig
3de559fbd0 xfs: refactor xfs_inode_item_format
Split out a function to handle the data and attr fork, as well as a
helper for the really old v1 inodes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:00:43 +11:00
Christoph Hellwig
ce9641d6c9 xfs: refactor xfs_inode_item_size
Split out two helpers to size the data and attribute to make the
function more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:00:43 +11:00
Christoph Hellwig
7aeb722241 xfs: refactor xfs_buf_item_format_segment
Add two helpers to make the code more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:00:43 +11:00
Christoph Hellwig
9597df6b26 xfs: remove duplicate code in xlog_cil_insert_format_items
Share code that was previously duplicated in two branches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2013-12-13 11:00:42 +11:00
Linus Torvalds
e09f67f147 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is a small collection of fixes.  It was rebased this morning, but
  I was just fixing signed-off-by tags with the wrong email"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix access_ok() check in btrfs_ioctl_send()
  Btrfs: make sure we cleanup all reloc roots if error happens
  Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation
  Btrfs: fix an oops when doing balance relocation
  Btrfs: don't miss skinny extent items on delayed ref head contention
  btrfs: call mnt_drop_write after interrupted subvol deletion
  Btrfs: don't clear the default compression type
2013-12-12 15:25:10 -08:00
Linus Torvalds
c9111b4df4 Merge branch 'for-3.13' of git://linux-nfs.org/~bfields/linux
Pull nfsd reply cache bugfix from Bruce Fields:
 "One bugfix for nfsd crashes"

* 'for-3.13' of git://linux-nfs.org/~bfields/linux:
  nfsd: when reusing an existing repcache entry, unhash it first
2013-12-12 15:24:32 -08:00
Will Deacon
a5c21dcefa dcache: allow word-at-a-time name hashing with big-endian CPUs
When explicitly hashing the end of a string with the word-at-a-time
interface, we have to be careful which end of the word we pick up.

On big-endian CPUs, the upper-bits will contain the data we're after, so
ensure we generate our masks accordingly (and avoid hashing whatever
random junk may have been sitting after the string).

This patch adds a new dcache helper, bytemask_from_count, which creates
a mask appropriate for the CPU endianness.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 10:39:01 -08:00
Linus Torvalds
48a2f0b272 xfs: bugfixes for 3.13-rc4
- fix for buffer overrun in agfl with growfs on v4 superblock
 - return EINVAL if requested discard length is less than a block
 - fix possible memory corruption in xfs_attrlist_by_handle()
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSp2+vAAoJENaLyazVq6ZOvm4QAJs3iPQwrqiBV+9KiHKDrHy3
 6/G0XeW1gmWZR7DI2CUtx0Dgjv++y3zaWSR+AaoHtOVDYb2W5rq1dG19bxmBq55I
 ZpV2kXsJuz29KEuHIvZqdoSHL+rdKwPejMezABFtUFINjpjZisjlk/FL7czfuhGS
 d6OGl8/q4tX06lgZXED/ebHB5Zqboha8DLaL5HrvT0RINSoZwgu0Z0gSz5mh50UM
 eKOQJVwZ3DLoO1rhSE8tGjEqJN3GQb4hM6cusrDKQ3fdTHVRQwDvN39z6h/OD0jr
 8jxpbbtxzGiPQPnvtf2s26zrWdEZmDZmBOoh0A9acVxB8ZHkaPq462LJDGLWNK5u
 SqXCTcWebT7xHoBXb0PhzjPMLpJDD2F90xLFC6PNYd/S2y0hWG+qSPcNxNjRQlN7
 MjxL0xe0l0zIsl6WV/zZgWmxkxc7EiRuDPnO3kxkIlxyaH1cXLkkhWxw+3yYFRqm
 HjUfpYRTRWr4JodYom/0vRZc5gqXgY+snIFex5a7d5Ukm2m22QeAbqdByuMtxEp+
 dLN70Qn/57CAzYYScjd4mP1imFwhL/19Uzaiv/J13cW833jJ1n0EHm89fxKOzDHt
 CYHVObph719AmEcoaMFNolyI4EozsxuIzq06v7gwgGSVbOmEqGY5g76hLoyKoxxc
 uv5Yo+hoUQHyC6a7oD7r
 =SbPd
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:

 - fix for buffer overrun in agfl with growfs on v4 superblock

 - return EINVAL if requested discard length is less than a block

 - fix possible memory corruption in xfs_attrlist_by_handle()

* tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfs:
  xfs: growfs overruns AGFL buffer on V4 filesystems
  xfs: don't perform discard if the given range length is less than block size
  xfs: underflow bug in xfs_attrlist_by_handle()
2013-12-12 10:14:13 -08:00
Dan Carpenter
700ff4f095 Btrfs: fix access_ok() check in btrfs_ioctl_send()
The closing parenthesis is in the wrong place.  We want to check
"sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of
"sizeof(*arg->clone_sources * arg->clone_sources_count)".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
2013-12-12 07:13:02 -08:00
Wang Shilong
467bb1d27c Btrfs: make sure we cleanup all reloc roots if error happens
I hit an oops when merging reloc roots fails, the reason is that
new reloc roots may be added and we should make sure we cleanup
all reloc roots.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:12:51 -08:00
Wang Shilong
6646374863 Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation
Quota tree and UUID Tree is only cowed, they can not be snapshoted.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:12:36 -08:00
Wang Shilong
c974c4642f Btrfs: fix an oops when doing balance relocation
I hit an oops when inserting reloc root into @reloc_root_tree(it can be
easily triggered when forcing cow for relocation root)

[  866.494539]  [<ffffffffa0499579>] btrfs_init_reloc_root+0x79/0xb0 [btrfs]
[  866.495321]  [<ffffffffa044c240>] record_root_in_trans+0xb0/0x110 [btrfs]
[  866.496109]  [<ffffffffa044d758>] btrfs_record_root_in_trans+0x48/0x80 [btrfs]
[  866.496908]  [<ffffffffa0494da8>] select_reloc_root+0xa8/0x210 [btrfs]
[  866.497703]  [<ffffffffa0495c8a>] do_relocation+0x16a/0x540 [btrfs]

This is because reloc root inserted into @reloc_root_tree is not within one
transaction,reloc root may be cowed and root block bytenr will be reused then
oops happens.We should update reloc root in @reloc_root_tree when cow reloc
root node, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:12:20 -08:00
Filipe David Borba Manana
639eefc8af Btrfs: don't miss skinny extent items on delayed ref head contention
Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup
of skinny extent items. This can happen when the execution flow is the
following:

* We do an extent tree lookup and fail to find a skinny extent item;

* As a result, we attempt to see if a non-skinny extent item exists,
  either by looking at previous item in the leaf or by doing another
  full extent tree search;

* We have a transaction and then we check for a matching delayed ref
  head in the transaction's delayed refs rbtree;

* We find such delayed ref head and then we try to lock it with a
  call to mutex_trylock();

* The lock was contended so we jump to the label "again", which repeats
  the extent tree search but for a non-skinny extent item, because we set
  previously metadata variable to 0 and the search key to look for a
  non-skinny extent-item;

* After the jump (and after releasing the transaction's delayed refs
  lock), a skinny extent item might have been added to the extent tree
  but we will miss it because metadata is set to 0 and the search key
  is set for a non-skinny extent-item.

The fix here is to not reset metadata to 0 and to jump to the initial search
key setup if the delayed ref head is contended, instead of jumping directly
to the extent tree search label ("again").

This issue was found while investigating the issue reported at Bugzilla 64961.

David Sterba suspected this function was missing extent items, and that
this could be caused by the last change to this function, which was made
in the following patch:

    [PATCH] Btrfs: optimize btrfs_lookup_extent_info()
    (commit 74be951087)

But in fact this issue already existed before, because after failing to find
a skinny extent item, the code set the search key for a non-skinny extent
item, and on contention of a matching delayed ref head it would not search
the extent tree for a skinny extent item anymore.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:11:58 -08:00
David Sterba
e43f998e47 btrfs: call mnt_drop_write after interrupted subvol deletion
If btrfs_ioctl_snap_destroy blocks on the mutex and the process is
killed, mnt_write count is unbalanced and leads to unmountable
filesystem.

CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:11:38 -08:00
Miao Xie
a7e252af5a Btrfs: don't clear the default compression type
We met a oops caused by the wrong compression type:
[  556.512356] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98
[SNIP]
[  556.512490]  [<ffffffff811dbb44>] ? list_del+0xd/0x2b
[  556.512539]  [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs]
[  556.512546]  [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10
[  556.512576]  [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs]
[  556.512601]  [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs]
[  556.512627]  [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs]
[  556.512655]  [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs]
[  556.512661]  [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8
[  556.512689]  [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs]
[  556.512695]  [<ffffffff8104fa4e>] kthread+0x8d/0x95
[  556.512699]  [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d
[  556.512704]  [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
[  556.512709]  [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0
[  556.512713]  [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o nodatacow <dev> <mnt>
 # touch <mnt>/<file>
 # chattr =c <mnt>/<file>
 # dd if=/dev/zero of=<mnt>/<file> bs=1M count=10

It is because we cleared the default compression type when setting the
nodatacow. In fact, we needn't do it because we have used COMPRESS flag to
indicate if we need compressed the file data or not, needn't use the
variant -- compress_type -- in btrfs_info to do the same thing, and just
use it to hold the default compression type. Or we would get a wrong compress
type for a file whose own compress flag is set but the compress flag of its
filesystem is not set.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:11:19 -08:00
Tejun Heo
c637b8acbe kernfs: s/sysfs/kernfs/ in internal functions and whatever is left
kernfs has just been separated out from sysfs and we're already in
full conflict mode.  Nothing can make the situation any worse.  Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_*()/kernfs_*()/ in all internal functions
* s/sysfs/kernfs/ in internal strings, comments and whatever is remaining
* Uniformly rename various vfs operations so that they're consistently
  named and distinguishable.

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11 17:39:20 -08:00
Tejun Heo
a797bfc305 kernfs: s/sysfs/kernfs/ in global variables
kernfs has just been separated out from sysfs and we're already in
full conflict mode.  Nothing can make the situation any worse.  Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_mutex/kernfs_mutex/
* s/sysfs_dentry_ops/kernfs_dops/
* s/sysfs_dir_operations/kernfs_dir_fops/
* s/sysfs_dir_inode_operations/kernfs_dir_iops/
* s/kernfs_file_operations/kernfs_file_fops/ - renamed for consistency
* s/sysfs_symlink_inode_operations/kernfs_symlink_iops/
* s/sysfs_aops/kernfs_aops/
* s/sysfs_backing_dev_info/kernfs_bdi/
* s/sysfs_inode_operations/kernfs_iops/
* s/sysfs_dir_cachep/kernfs_node_cache/
* s/sysfs_ops/kernfs_sops/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11 17:39:20 -08:00
Tejun Heo
df23fc39bc kernfs: s/sysfs/kernfs/ in constants
kernfs has just been separated out from sysfs and we're already in
full conflict mode.  Nothing can make the situation any worse.  Let's
take the chance to name things properly.

This patch performs the following renames.

* s/SYSFS_DIR/KERNFS_DIR/
* s/SYSFS_KOBJ_ATTR/KERNFS_FILE/
* s/SYSFS_KOBJ_LINK/KERNFS_LINK/
* s/SYSFS_{TYPE_FLAGS}/KERNFS_{TYPE_FLAGS}/
* s/SYSFS_FLAG_{FLAG}/KERNFS_{FLAG}/
* s/sysfs_type()/kernfs_type()/
* s/SD_DEACTIVATED_BIAS/KN_DEACTIVATED_BIAS/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11 17:39:20 -08:00
Tejun Heo
c525aaddc3 kernfs: s/sysfs/kernfs/ in various data structures
kernfs has just been separated out from sysfs and we're already in
full conflict mode.  Nothing can make the situation any worse.  Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_open_dirent/kernfs_open_node/
* s/sysfs_open_file/kernfs_open_file/
* s/sysfs_inode_attrs/kernfs_iattrs/
* s/sysfs_addrm_cxt/kernfs_addrm_cxt/
* s/sysfs_super_info/kernfs_super_info/
* s/sysfs_info()/kernfs_info()/
* s/sysfs_open_dirent_lock/kernfs_open_node_lock/
* s/sysfs_open_file_mutex/kernfs_open_file_mutex/
* s/sysfs_of()/kernfs_of()/

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11 17:39:20 -08:00
Tejun Heo
adc5e8b58f kernfs: drop s_ prefix from kernfs_node members
kernfs has just been separated out from sysfs and we're already in
full conflict mode.  Nothing can make the situation any worse.  Let's
take the chance to name things properly.

s_ prefix for kernfs members is used inconsistently and a misnomer
now.  It's not like kernfs_node is used widely across the kernel
making the ability to grep for the members particularly useful.  Let's
just drop the prefix.

This patch is strictly rename only and doesn't introduce any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11 15:43:48 -08:00
Tejun Heo
324a56e16e kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordingly
kernfs has just been separated out from sysfs and we're already in
full conflict mode.  Nothing can make the situation any worse.  Let's
take the chance to name things properly.

This patch performs the following renames.

* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above

Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them.  All are sysfs_dirent renames and trivial.  While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.

This patch is strictly rename only and doesn't introduce any
functional difference.

- mic / gpio renames were missing.  Spotted by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11 15:28:36 -08:00
Dave Chinner
f9b395a8ef xfs: align initial file allocations correctly
The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.

Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is cause write performance to drop in
alignment sensitive configurations.

Fix it by considering allocation into empty files as requiring
aligned allocation again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-11 15:23:04 -06:00
Ben Myers
8e825e3a02 xfs: fix calculation of freed inode cluster blocks
rec.ir_startino is an agino rather than an ino.  Use the correct macro
when dealing with it in xfs_difree.

Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2013-12-11 15:22:43 -06:00
Dave Chinner
b3f03bac81 xfs: xfs_dir2_block_to_sf temp buffer allocation fails
If we are using a large directory block size, and memory becomes
fragmented, we can get memory allocation failures trying to
kmem_alloc(64k) for a temporary buffer. However, there is not need
for a directory buffer sized allocation, as the end result ends up
in the inode literal area. This is, at most, slightly less than 2k
of space, and hence we don't need an allocation larger than that
fora temporary buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-11 14:59:20 -06:00
Tejun Heo
a7560a0132 sysfs: fix use-after-free in sysfs_kill_sb()
While restructuring the [u]mount path, 4b93dc9b1c ("sysfs, kernfs:
prepare mount path for kernfs") incorrectly updated sysfs_kill_sb() so
that it first kills super_block and then tries to dereference its
namespace tag to drop it.  Fix it by caching namespace tag before
killing the superblock and then drop the cached namespace tag.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/g/20131205031051.GC5135@yliu-dev.sh.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-10 22:40:12 -08:00
Tejun Heo
9b2db6e189 sysfs: bail early from kernfs_file_mmap() to avoid spurious lockdep warning
This is v3.14 fix for the same issue that a8b1474442 ("sysfs: give
different locking key to regular and bin files") addresses for v3.13.
Due to the extensive kernfs reorganization in v3.14 branch, the same
fix couldn't be ported as-is.  The v3.13 fix was ignored while merging
it into v3.14 branch.

027a485d12 ("sysfs: use a separate locking class for open files
depending on mmap") assigned different lockdep key to
sysfs_open_file->mutex depending on whether the file implements mmap
or not in an attempt to avoid spurious lockdep warning caused by
merging of regular and bin file paths.

While this restored some of the original behavior of using different
locks (at least lockdep is concerned) for the different clases of
files.  The restoration wasn't full because now the lockdep key
assignment depends on whether the file has mmap or not instead of
whether it's a regular file or not.

This means that bin files which don't implement mmap will get assigned
the same lockdep class as regular files.  This is problematic because
file_operations for bin files still implements the mmap file operation
and checking whether the sysfs file actually implements mmap happens
in the file operation after grabbing @sysfs_open_file->mutex.  We
still end up adding locking dependency from mmap locking to
sysfs_open_file->mutex to the regular file mutex which triggers
spurious circular locking warning.

For v3.13, a8b1474442 ("sysfs: give different locking key to regular
and bin files") fixed it by giving sysfs_open_file->mutex different
lockdep keys depending on whether the file is regular or bin instead
of whether mmap exists or not; however, due to the way sysfs is now
layered behind kernfs, this approach is no longer viable.  kernfs can
tell whether a sysfs node has mmap implemented or not but can't tell
whether a bin file from a regular one.

This patch updates kernfs such that kernfs_file_mmap() checks
SYSFS_FLAG_HAS_MMAP and bail before grabbing sysfs_open_file->mutex so
that it doesn't add spurious locking dependency from mmap to
sysfs_open_file->mutex and changes sysfs so that it specifies
kernfs_ops->mmap iff the sysfs file implements mmap.  Combined, this
ensures that sysfs_open_file->mutex is grabbed under mmap path iff the
sysfs file actually implements mmap.  As sysfs_open_file->mutex is
already given a different lockdep key if mmap is implemented, this
removes the spurious locking dependency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/g/20131203184324.GA11320@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-10 21:33:31 -08:00
Jeff Layton
781c2a5a5f nfsd: when reusing an existing repcache entry, unhash it first
The DRC code will attempt to reuse an existing, expired cache entry in
preference to allocating a new one. It'll then search the cache, and if
it gets a hit it'll then free the cache entry that it was going to
reuse.

The cache code doesn't unhash the entry that it's going to reuse
however, so it's possible for it end up designating an entry for reuse
and then subsequently freeing the same entry after it finds it.  This
leads it to a later use-after-free situation and usually some list
corruption warnings or an oops.

Fix this by simply unhashing the entry that we intend to reuse. That
will mean that it's not findable via a search and should prevent this
situation from occurring.

Cc: stable@vger.kernel.org # v3.10+
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: g. artim <gartim@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-12-10 20:34:44 -05:00
Dave Chinner
f94c44573e xfs: growfs overruns AGFL buffer on V4 filesystems
This loop in xfs_growfs_data_private() is incorrect for V4
superblocks filesystems:

		for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
			agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);

For V4 filesystems, we don't have a agfl header structure, and so
XFS_AGFL_SIZE() returns an entire sector's worth of entries, which
we then index from an offset into the sector. Hence: buffer overrun.

This problem was introduced in 3.10 by commit 77c95bba ("xfs: add
CRC checks to the AGFL") which changed the AGFL structure but failed
to update the growfs code to handle the different structures.

Fix it by using the correct offset into the buffer for both V4 and
V5 filesystems.

Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit b7d961b35b)
2013-12-10 10:04:27 -06:00
Jie Liu
2f42d612e7 xfs: don't perform discard if the given range length is less than block size
For discard operation, we should return EINVAL if the given range length
is less than a block size, otherwise it will go through the file system
to discard data blocks as the end range might be evaluated to -1, e.g,
# fstrim -v -o 0 -l 100 /xfs7
/xfs7: 9811378176 bytes were trimmed

This issue can be triggered via xfstests/generic/288.

Also, it seems to get the request queue pointer via bdev_get_queue()
instead of the hard code pointer dereference is not a bad thing.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit f9fd013561)
2013-12-10 10:00:33 -06:00
Dan Carpenter
31978b5cc6 xfs: underflow bug in xfs_attrlist_by_handle()
If we allocate less than sizeof(struct attrlist) then we end up
corrupting memory or doing a ZERO_PTR_SIZE dereference.

This can only be triggered with CAP_SYS_ADMIN.

Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 071c529eb6)
2013-12-10 09:59:37 -06:00
Tejun Heo
13ccb93f41 Merge branch 'driver-core-linus' into driver-core-next
a8b1474442 ("sysfs: give different locking key to regular and bin
files") in driver-core-linus modifies sysfs_open_file() so that it
gives out different locking classes to sysfs_open_files depending on
whether the file is bin or not.  Due to the massive kernfs
reorganization in driver-core-next, this naturally causes merge
conflict in fs/sysfs/file.c.

Due to the way things are split between kernfs and sysfs in
driver-core-next, the same fix can't easily be applied to
driver-core-next.  This merge simply ignores the offending commit.  A
following patch will implement a separate fix for the issue.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-12-10 08:44:37 -05:00
Jie Liu
df8052e7da xfs: fix infinite loop by detaching the group/project hints from user dquot
xfs_quota(8) will hang up if trying to turn group/project quota off
before the user quota is off, this could be 100% reproduced by:
  # mount -ouquota,gquota /dev/sda7 /xfs
  # mkdir /xfs/test
  # xfs_quota -xc 'off -g' /xfs <-- hangs up
  # echo w > /proc/sysrq-trigger
  # dmesg

  SysRq : Show Blocked State
  task                        PC stack   pid father
  xfs_quota       D 0000000000000000     0 27574   2551 0x00000000
  [snip]
  Call Trace:
  [<ffffffff81aaa21d>] schedule+0xad/0xc0
  [<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0
  [<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0
  [<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0
  [<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs]
  [<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30
  [<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs]
  [<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs]
  [<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs]
  [<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs]
  [<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs]
  [<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs]
  [<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs]
  [<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs]
  [<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs]
  [<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0
  [<ffffffff812fba4b>] ? iput+0x5b/0x90
  [<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0
  [<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b

It's fine if we turn user quota off at first, then turn off other
kind of quotas if they are enabled since the group/project dquot
refcount is decreased to zero once the user quota if off. Otherwise,
those dquots refcount is non-zero due to the user dquot might refer
to them as hint(s).  Hence, above operation cause an infinite loop
at xfs_qm_dquot_walk() while trying to purge dquot cache.

This problem has been around since Linux 3.4, it was introduced by:
  [ b84a3a9675 xfs: remove the per-filesystem list of dquots ]

Originally we will release the group dquot pointers because the user
dquots maybe carrying around as a hint via xfs_qm_detach_gdquots().
However, with above change, there is no such work to be done before
purging group/project dquot cache.

In order to solve this problem, this patch introduces a special routine
xfs_qm_dqpurge_hints(), and it would release the group/project dquot
pointers the user dquots maybe carrying around as a hint, and then it
will proceed to purge the user dquot cache if requested.

Cc: stable@vger.kernel.org
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-09 12:14:44 -06:00
Jie Liu
5a01dd54f4 xfs: fix assertion failure at xfs_setattr_nonsize
For CRC enabled v5 super block, change a file's ownership can simply
trigger an ASSERT failure at xfs_setattr_nonsize() if both group and
project quota are enabled, i.e,

[  305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621
[  305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable]
[  305.383939] Call Trace:
[  305.385536]  [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs]
[  305.387142]  [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs]
[  305.388727]  [<ffffffff811ca388>] notify_change+0x1a8/0x350
[  305.390298]  [<ffffffff811ac39d>] chown_common+0xfd/0x110
[  305.391868]  [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110
[  305.393440]  [<ffffffff811ad760>] SyS_lchown+0x20/0x30
[  305.394995]  [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[  305.399870] RIP  [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs]

This fix adjust the assertion to check if the super block support both
quota inodes or not.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-09 12:10:30 -06:00
Wei Yongjun
21d71662f8 sysfs, kernfs: remove duplicated include from file.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08 18:16:19 -08:00
Dmitry Monakhov
a67c848a8b jbd2: rename obsoleted msg JBD->JBD2
Rename performed via: perl -pi -e 's/JBD:/JBD2:/g' fs/jbd2/*.c

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-12-08 21:14:59 -05:00
Jan Kara
75685071cd jbd2: revise KERN_EMERG error messages
Some of KERN_EMERG printk messages do not really deserve this log
level and the one in log_wait_commit() is even rather useless (the
journal has been previously aborted and *that* is where we should have
been complaining). So make some messages just KERN_ERR and remove the
useless message.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-12-08 21:13:59 -05:00
Theodore Ts'o
f6c07cad08 jbd2: don't BUG but return ENOSPC if a handle runs out of space
If a handle runs out of space, we currently stop the kernel with a BUG
in jbd2_journal_dirty_metadata().  This makes it hard to figure out
what might be going on.  So return an error of ENOSPC, so we can let
the file system layer figure out what is going on, to make it more
likely we can get useful debugging information).  This should make it
easier to debug problems such as the one which was reported by:

    https://bugzilla.kernel.org/show_bug.cgi?id=44731

The only two callers of this function are ext4_handle_dirty_metadata()
and ocfs2_journal_dirty().  The ocfs2 function will trigger a
BUG_ON(), which means there will be no change in behavior.  The ext4
function will call ext4_error_inode() which will print the useful
debugging information and then handle the situation using ext4's error
handling mechanisms (i.e., which might mean halting the kernel or
remounting the file system read-only).

Also, since both file systems already call WARN_ON(), drop the WARN_ON
from jbd2_journal_dirty_metadata() to avoid two stack traces from
being displayed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
2013-12-08 21:12:59 -05:00
Jan Kara
30fac0f75d ext4: Do not reserve clusters when fs doesn't support extents
When the filesystem doesn't support extents (like in ext2/3
compatibility modes), there is no need to reserve any clusters. Space
estimates for writing are exact, hole punching doesn't need new
metadata, and there are no unwritten extents to convert.

This fixes a problem when filesystem still having some free space when
accessed with a native ext2/3 driver suddently reports ENOSPC when
accessed with ext4 driver.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 21:11:59 -05:00
Al Viro
9105bb149b ext4: fix del_timer() misuse for ->s_err_report
That thing should be del_timer_sync(); consider what happens
if ext4_put_super() call of del_timer() happens to come just as it's
getting run on another CPU.  Since that timer reschedules itself
to run next day, you are pretty much guaranteed that you'll end up
with kfree'd scheduled timer, with usual fun consequences.  AFAICS,
that's -stable fodder all way back to 2010... [the second del_timer_sync()
is almost certainly not needed, but it doesn't hurt either]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 20:52:31 -05:00
Tejun Heo
a8b1474442 sysfs: give different locking key to regular and bin files
027a485d12 ("sysfs: use a separate locking class for open files
depending on mmap") assigned different lockdep key to
sysfs_open_file->mutex depending on whether the file implements mmap
or not in an attempt to avoid spurious lockdep warning caused by
merging of regular and bin file paths.

While this restored some of the original behavior of using different
locks (at least lockdep is concerned) for the different clases of
files.  The restoration wasn't full because now the lockdep key
assignment depends on whether the file has mmap or not instead of
whether it's a regular file or not.

This means that bin files which don't implement mmap will get assigned
the same lockdep class as regular files.  This is problematic because
file_operations for bin files still implements the mmap file operation
and checking whether the sysfs file actually implements mmap happens
in the file operation after grabbing @sysfs_open_file->mutex.  We
still end up adding locking dependency from mmap locking to
sysfs_open_file->mutex to the regular file mutex which triggers
spurious circular locking warning.

Fix it by restoring the original behavior fully by differentiating
lockdep key by whether the file is regular or bin, instead of the
existence of mmap.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/g/20131203184324.GA11320@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-07 21:22:00 -08:00
Christoph Hellwig
c91c46c127 xfs: add xfs_setattr_time
Split out a xfs_setattr_time helper to share code between truncate and
regular setattr similar to xfs_setattr_mode.  I might also have another
caller growing for this in the near future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-06 17:26:19 -06:00
Christoph Hellwig
0c3d88dfce xfs: tiny xfs_setattr_mode cleanup
Remove the pointless tp argument, and properly align the local variable
declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-06 17:18:41 -06:00
Jie Liu
37eb9706eb xfs: fix false assertion at xfs_qm_vop_create_dqattach
After the previous fix, there still has another ASSERT failure if turning
off any type of quota while fsstress is running at the same time.

Backtrace in this case:

[   50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118
[   50.867924] ------------[ cut here ]------------
... <snip>
[   50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable]
[   50.867999] invalid opcode: 0000 [#1] SMP
[   50.869407] Call Trace:
[   50.869446]  [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs]
[   50.869512]  [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs]
[   50.869564]  [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs]
[   50.869615]  [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs]
[   50.869655]  [<ffffffff811becd5>] vfs_mkdir+0x95/0x130
[   50.869689]  [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0
[   50.869723]  [<ffffffff811bf689>] SyS_mkdir+0x19/0x20
[   50.869757]  [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[   50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip>
[   50.870003] RIP  [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs]
[   50.870050]  RSP <ffff88002941fd60>
[   50.879251] ---[ end trace c93a2b342341c65b ]---

We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(),
however the assertion itself is not right IMHO.  While performing quota off, we
firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking
any special locks, see xfs_qm_scall_quotaoff().  Hence there is no guarantee
that the desired quota is still active.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-06 16:10:21 -06:00
Jie Liu
afbd123db4 xfs: integrate xfs_quota_priv header file to xfs_qm
The xfs_quota_priv header file is only included by xfs_qm header and
there is no much users for its contents, hence we can move those stuff
to xfs_qm header file and kill it.

This patch also remove an unused macro DQFLAGTO_TYPESTR.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-06 14:16:33 -06:00
Jie Liu
c61a9e39f6 xfs: make quota metadata truncation behavior consistent to user space
In xfs_qm_scall_trunc_qfiles(), we ignore the error if failed to remove
the users quota metadata and proceed to remove groups and projects if
they are being there.  However, in user space, the remove operation will
break and return if failed to remove any kind of quota.
Also for v5 super block, we can enabled both group and project quota at
the same time, in this case the current error handling will cover the
group error with projects but they might failed due to different reasons.

It seems we'd better the error handling consistent to the user space and
don't trying to remove another kind of quota metadata if the previous
operation is failed.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-06 14:06:15 -06:00
Linus Torvalds
c537aba00e Merge git://git.kvack.org/~bcrl/aio-next
Pull aio fix from Benjamin LaHaise:
 "AIO fix from Gu Zheng that fixes a GPF that Dave Jones uncovered with
  trinity"

* git://git.kvack.org/~bcrl/aio-next:
  aio: clean up aio ring in the fail path
2013-12-06 08:32:59 -08:00
Gu Zheng
d1b9432712 aio: clean up aio ring in the fail path
Clean up the aio ring file in the fail path of aio_setup_ring
and ioctx_alloc. And maybe it can fix the GPF issue reported by
Dave Jones:
https://lkml.org/lkml/2013/11/25/898

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-12-06 10:22:55 -05:00
Linus Torvalds
002acf1fc1 Power management fixes for 3.13-rc3
- cpufreq regression fix from Bjørn Mork restoring the pre-3.12
    behavior of the framework during system suspend/hibernation to
    avoid garbage sysfs files from being left behind in case of a
    suspend error.
 
  - PNP regression fix to restore the correct states of devices after
    resume from hibernation broken in 3.12.  From Dmitry Torokhov.
 
  - cpuidle fix to prevent cpuidle device unregistration from crashing
    due to a NULL pointer dereference if cpuidle has been disabled
    from the kernel command line.  From Konrad Rzeszutek Wilk.
 
  - intel_idle fix for the C6 state definition on Intel Avoton/Rangeley
    processors from Arne Bockholdt.
 
  - Power capping framework fix to make the energy_uj sysfs attribute
    work in accordance with the documentation.  From Srinivas Pandruvada.
 
  - epoll fix to make it ignore the EPOLLWAKEUP flag if the kernel has
    been compiled with CONFIG_PM_SLEEP unset (in which case that flag
    should not have any effect).  From Amit Pundir.
 
  - cpufreq fix to prevent governor sysfs files from being lost over
    system suspend/resume in some (arguably unusual) situations.  From
    Viresh Kumar.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABCAAGBQJSoTE5AAoJEILEb/54YlRxh/sP/jFZhLTc8g4MC3XzhguuROzT
 u0Pu9VJVfACqz8LyiCOtfOvvb2EPV7VSq7qYqszL5y9Dn5gwIHvQMMBK2PjZ4cc2
 MtSiw02Bk/DEESXYOjt++n5ja/0lc05CtJTlb3uoJXBOCqp3cMrvW7+QqnLbEfbG
 S+TcPFBr+4Owt/J7r2Z2JBYGZ6NbVol/x1hAFjiM+rBan6UGw7uNcg2LgQrVHcs4
 S1Cm6lsJTwRcSiswvJv9/C+ML9Z/1gYYUyu7ijQnGdbNUolyzHY6AxLWZdnSkAQO
 s8JVDRKy9+V44LtnWSENnJNftjlOoXWcZRJxvDePyM3dVpxESBa8Z/AxYWwCcmcB
 e4rsgm/WOF86DMhRu+gfeTF+1OkU7KhuPhbXskbw+JDcZKCui2FP/xti6IAaTsU/
 9M30/VeOpD1UBqckLnDTGcsFif7hVZ9LOHH5wK8OctjyaTMfUYtPd7WxfTQCpcSc
 1M0NQapwfXHASmPmMW4SszAaeduecUdgXU1epOPx0EpOMQhvuLeENJBgVC5uu1cA
 KAQ7suOx9ReS3slso8lpTlTEw2rsDPRuiHcF3hv7YpklNXV2jvtxsl4upHT5VN2t
 3L59Unq8vY2lt3SdzWMypaAquphc3Te5woYoFwgsSPfw40Kr+jg+oUtO6IHqM/ga
 OATUkTffzp+Rp3pg036Y
 =gHBV
 -----END PGP SIGNATURE-----

Merge tag 'pm-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:

 - cpufreq regression fix from Bjørn Mork restoring the pre-3.12
   behavior of the framework during system suspend/hibernation to avoid
   garbage sysfs files from being left behind in case of a suspend error

 - PNP regression fix to restore the correct states of devices after
   resume from hibernation broken in 3.12.  From Dmitry Torokhov.

 - cpuidle fix to prevent cpuidle device unregistration from crashing
   due to a NULL pointer dereference if cpuidle has been disabled from
   the kernel command line.  From Konrad Rzeszutek Wilk.

 - intel_idle fix for the C6 state definition on Intel Avoton/Rangeley
   processors from Arne Bockholdt.

 - Power capping framework fix to make the energy_uj sysfs attribute
   work in accordance with the documentation.  From Srinivas Pandruvada.

 - epoll fix to make it ignore the EPOLLWAKEUP flag if the kernel has
   been compiled with CONFIG_PM_SLEEP unset (in which case that flag
   should not have any effect).  From Amit Pundir.

 - cpufreq fix to prevent governor sysfs files from being lost over
   system suspend/resume in some (arguably unusual) situations.  From
   Viresh Kumar.

* tag 'pm-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PowerCap: Fix mode for energy counter
  PNP: fix restoring devices after hibernation
  cpuidle: Check for dev before deregistering it.
  epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
  cpufreq: fix garbage kobjects on errors during suspend/resume
  cpufreq: suspend governors on system suspend/hibernate
  intel_idle: Fixed C6 state on Avoton/Rangeley processors
2013-12-05 18:26:40 -08:00
Linus Torvalds
5ee540613d Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A small collection of fixes for the current series. It contains:

   - A fix for a use-after-free of a request in blk-mq.  From Ming Lei

   - A fix for a blk-mq bug that could attempt to dereference a NULL rq
     if allocation failed

   - Two xen-blkfront small fixes

   - Cleanup of submit_bio_wait() type uses in the kernel, unifying
     that.  From Kent

   - A fix for 32-bit blkg_rwstat reading.  I apologize for this one
     looking mangled in the shortlog, it's entirely my fault for missing
     an empty line between the description and body of the text"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: fix use-after-free of request
  blk-mq: fix dereference of rq->mq_ctx if allocation fails
  block: xen-blkfront: Fix possible NULL ptr dereference
  xen-blkfront: Silence pfn maybe-uninitialized warning
  block: submit_bio_wait() conversions
  Update of blkg_stat and blkg_rwstat may happen in bh context
2013-12-05 15:33:27 -08:00
Mark Tinguely
ef701600fd xfs: fix memory leak in xfs_dir2_node_removename
Fix the leak of kernel memory in xfs_dir2_node_removename()
when xfs_dir2_leafn_remove() returns an error code.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-05 16:51:19 -06:00
Mark Tinguely
2a84108fe2 xfs: free the list of recovery items on error
Recovery builds a list of items on the transaction's
r_itemq head. Normally these items are committed and freed.
But in the event of a recovery error, these allocations
are leaked.

If the error occurs during item reordering, then reconstruct
the r_itemq list before deleting the list to avoid leaking
the entries that were on one of the temporary lists.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-05 16:50:47 -06:00
Christoph Hellwig
dff6efc326 fs: fix iversion handling
Currently notify_change directly updates i_version for size updates,
which not only is counter to how all other fields are updated through
struct iattr, but also breaks XFS, which need inode updates to happen
under its own lock, and synchronized to the structure that gets written
to the log.

Remove the update in the common code, and it to btrfs and ext4,
XFS already does a proper updaste internally and currently gets a
double update with the existing code.

IMHO this is 3.13 and -stable material and should go in through the XFS
tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-05 16:36:21 -06:00
Dave Chinner
b7d961b35b xfs: growfs overruns AGFL buffer on V4 filesystems
This loop in xfs_growfs_data_private() is incorrect for V4
superblocks filesystems:

		for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
			agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);

For V4 filesystems, we don't have a agfl header structure, and so
XFS_AGFL_SIZE() returns an entire sector's worth of entries, which
we then index from an offset into the sector. Hence: buffer overrun.

This problem was introduced in 3.10 by commit 77c95bba ("xfs: add
CRC checks to the AGFL") which changed the AGFL structure but failed
to update the growfs code to handle the different structures.

Fix it by using the correct offset into the buffer for both V4 and
V5 filesystems.

Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-05 16:19:51 -06:00
Linus Torvalds
29be6345bb NFS client bugfixes
- Stable fix for a NFSv4.1 delegation and state recovery deadlock
 - Stable fix for a loop on irrecoverable errors when returning delegations
 - Fix a 3-way deadlock between layoutreturn, open, and state recovery
 - Update the MAINTAINERS file with contact information for Trond Myklebust
 - Close needs to handle NFS4ERR_ADMIN_REVOKED
 - Enabling v4.2 should not recompile nfsd and lockd
 - Fix a couple of compile warnings
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJSoLTpAAoJEGcL54qWCgDy2dgQAIKkKAXccg3OG2b1SxJmiaja
 PcrovNmgg3HvYQ7clUMqtrMByiXEpSybl6tAeXYUWE3sS1DISSBVEwO3MoOiASiM
 951Ssx+CoyhsHYo5aH83sUIiWFl/YsRhpKmSr2cdQd13DQTFbPq896k64Inf6L2/
 9fngoqOD7FunQHn8AiVPoDOQzObB0OuKhYCwuwLt47oPiwgmm12JQNCDxU1i4sxb
 lkGUBLkPMs6D5IyI8XHaMyX3+8MvmPiIsjIKaNJRdhkuX/k7ollucTJXyvyEQKK0
 PhBIWyUULmKcAXYwCfHf9UoyGZFvmj47YggyKcBd26OZUEFekcWrULfym46F1xak
 EcO6D4mlTy5i5W0RBqYCj1oGud57rixZBmhLTbeq6sSJaiqBfGEs225Q17H7rsEB
 YIghHiEFNnBmVWELhHxbJHQoY6HOugmZOuc0dxopaikN/7to8gnYoVyTIVlMfe/t
 UNXZoer6GOOohJGtZ7s7v4Al7EzvwnVnBCBklEAKFJ7Ca2LEmq+b58oQW3nJ1mPn
 y4TnihxYXsSEbqy+Lds9rumRhJLG1oVTpwficAm7N3HdK3abzCIPEt6iOHoCmXQz
 J1B4gmwOKsDqVlCSpBsnc3ZiBlSJGOn6MmVQUCNFpzv/DetWn/BxEUPE8cNm8DaI
 WioD0grC0/9bR8oD1m+w
 =UZ51
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 - Stable fix for a NFSv4.1 delegation and state recovery deadlock
 - Stable fix for a loop on irrecoverable errors when returning
   delegations
 - Fix a 3-way deadlock between layoutreturn, open, and state recovery
 - Update the MAINTAINERS file with contact information for Trond
   Myklebust
 - Close needs to handle NFS4ERR_ADMIN_REVOKED
 - Enabling v4.2 should not recompile nfsd and lockd
 - Fix a couple of compile warnings

* tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: fix do_div() warning by instead using sector_div()
  MAINTAINERS: Update contact information for Trond Myklebust
  NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery
  SUNRPC: do not fail gss proc NULL calls with EACCES
  NFSv4: close needs to handle NFS4ERR_ADMIN_REVOKED
  NFSv4: Update list of irrecoverable errors on DELEGRETURN
  NFSv4 wait on recovery for async session errors
  NFS: Fix a warning in nfs_setsecurity
  NFS: Enabling v4.2 should not recompile nfsd and lockd
2013-12-05 13:05:48 -08:00
Jie Liu
f9fd013561 xfs: don't perform discard if the given range length is less than block size
For discard operation, we should return EINVAL if the given range length
is less than a block size, otherwise it will go through the file system
to discard data blocks as the end range might be evaluated to -1, e.g,
# fstrim -v -o 0 -l 100 /xfs7
/xfs7: 9811378176 bytes were trimmed

This issue can be triggered via xfstests/generic/288.

Also, it seems to get the request queue pointer via bdev_get_queue()
instead of the hard code pointer dereference is not a bad thing.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-04 15:42:52 -06:00
Christoph Hellwig
10f73d27c8 xfs: fix the comment explaining xfs_trans_dqlockedjoin
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-04 14:26:57 -06:00
Dan Carpenter
071c529eb6 xfs: underflow bug in xfs_attrlist_by_handle()
If we allocate less than sizeof(struct attrlist) then we end up
corrupting memory or doing a ZERO_PTR_SIZE dereference.

This can only be triggered with CAP_SYS_ADMIN.

Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-04 14:23:46 -06:00
Christoph Hellwig
f230077845 xfs: remove unused FI_ flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com.>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-04 14:11:05 -06:00
Eric Sandeen
3fefdeee92 xfs: simplify xfs_setsize_buftarg callchain; remove unused arg
The "verbose" argument to xfs_setsize_buftarg_flags() has been
unused since:

ffe37436 xfs: stop using the page cache to back the buffer cache

Remove it, and fold the function into xfs_setsize_buftarg()
now that there's no need for different types of callers.

Fix inconsistent comment spacing while we're at it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-04 13:53:34 -06:00
Helge Deller
3873d064b8 nfs: fix do_div() warning by instead using sector_div()
When compiling a 32bit kernel with CONFIG_LBDAF=n the compiler complains like
shown below.  Fix this warning by instead using sector_div() which is provided
by the kernel.h header file.

fs/nfs/blocklayout/extents.c: In function ‘normalize’:
include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
fs/nfs/blocklayout/extents.c:47:13: note: in expansion of macro ‘do_div’
nfs/blocklayout/extents.c:47:2: warning: right shift count >= width of type [enabled by default]
fs/nfs/blocklayout/extents.c:47:2: warning: passing argument 1 of ‘__div64_32’ from incompatible pointer type [enabled by default]
include/asm-generic/div64.h:35:17: note: expected ‘uint64_t *’ but argument is of type ‘sector_t *’
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-12-04 12:57:37 -05:00
Trond Myklebust
f22e5edd22 NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery
Andy Adamson reports:

The state manager is recovering expired state and recovery OPENs are being
processed. If kswapd is pruning inodes at the same time, a deadlock can occur
when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
resultant layoutreturn gets an error that the state mangager is to handle,
causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.

At the same time an open is waiting for the inode deletion to complete in
__wait_on_freeing_inode.

If the open is either the open called by the state manager, or an open from
the same open owner that is holding the NFSv4 sequence id which causes the
OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
then the state is deadlocked with kswapd.

The fix is simply to have layoutreturn ignore all errors except NFS4ERR_DELAY.
We already know that layouts are dropped on all server reboots, and that
it has to be coded to deal with the "forgetful client model" that doesn't
send layoutreturns.

Reported-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1385402270-14284-1-git-send-email-andros@netapp.com
Signed-off-by: Trond Myklebust <Trond.Myklebust@primarydata.com>
2013-12-04 12:32:19 -05:00
Linus Torvalds
278717909d Just a single bug fix to the new "directly decompress into the
page cache" code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSnpD0AAoJEJAch/D1fbHUYFwP/iGUTnyqJYMLOyvDz4CjPCjr
 w0Pfncr78oXiKkOe/ExBVH5JHlkmSa2Lfmt28tiltPgfNQbqLQB9eSvjOzOg+c73
 KNqCAXXUp5a+Jhjlapo1iuyZEnOEPaJHUgiRmneFHzesE9eW6u86JntHR3wLbD8v
 Hw0Z5zkrPGB+M/AsX96dt7hBiz9HJI7IQrurtx2ijrpCBeXq30YNtBMjxzqoYx1C
 qcyMw/qGREObD3qlWKeBgmsKguwMmaygLaA2lUk73RcswKcISwkrUQN+BEqfP+Sc
 cXvXf1C2lG7SOLQH29Osa5Y8EGFnRQlFyTaIxLFWopl3Oks7emDJOrH4rW2SuPLt
 mEUOIVhZ0v6VhJ5uRmauyH4bQxBl0Uonc74bxRm602ThnUEIg2E6i+RBiy0tr/3Z
 UOQB2lACEIfz/jspeCcTTOlnR1r6fKp89XLF+OLKq7qJN3XQ2EtNKrfGHyEUVBxq
 GqLqJuTYbqxidJtx+PDZPW3AT71f4GSJGnL4Tdg13oHoJmLQagIRJCh6Gf5pMp1W
 za0g7jDHimHJbQW/jujDhVu9bLEoWhfq+jo/JW6x+0bblrm8GM4h1FFEqcVjh0SV
 UyG3QjMA4uMLbofB4MNNkgRhGC3o0L2Z3CuYVCNIz6fbQKP1tp6NqyVP89uJ1Br9
 VM3Hhw2TWhXxPYDsDYeg
 =OvYc
 -----END PGP SIGNATURE-----

Merge tag 'squashfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next

Pull squashfs bugfix from Phillip Lougher:
 "Just a single bug fix to the new "directly decompress into the page
  cache" code"

* tag 'squashfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
  Squashfs: fix failure to unlock pages on decompress error
2013-12-04 08:54:00 -08:00
Tejun Heo
2322392b02 kernfs: implement "trusted.*" xattr support
kernfs inherited "security.*" xattr support from sysfs.  This patch
extends xattr support to "trusted.*" using simple_xattr_*().  As
trusted xattrs are restricted to CAP_SYS_ADMIN, simple_xattr_*() which
uses kernel memory for storage shouldn't be problematic.

Note that the existing "security.*" support doesn't implement
get/remove/list and the this patch only implements those ops for
"trusted.*".  We probably want to extend those ops to include support
for "security.*".

This patch will allow using kernfs from cgroup which requires
"trusted.*" xattr support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 07:34:45 -08:00
Tejun Heo
9a8049affd kernfs: update sysfs_init_inode_attrs()
sysfs_init_inode_attrs() is a bit clumsy to use requiring the caller
to check whether @sd->s_iattr is already set or not.  Rename it to
sysfs_inode_attrs(), update it to check whether @sd->s_iattr is
already initialized before trying to initialize it and return
@sd->s_iattr.  This simplifies the callers.

While at it,

* Rename struct sysfs_inode_attrs pointer variables to "attrs".  As
  kernfs no longer deals with "struct attribute", this isn't confusing
  and makes it easier to distinguish from struct iattr pointers.

* A new field will be added to sysfs_inode_attrs.  Reindent in
  preparation.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 07:34:45 -08:00
Jan Kara
301d4c9a28 jbd: Revise KERN_EMERG error messages
Some of KERN_EMERG printk messages do not really deserve this log level
and the one in log_wait_commit() is even rather useless (the journal has
been previously aborted and *that* is where we should have been
complaining). So make some messages just KERN_ERR and remove the useless
message.

Signed-off-by: Jan Kara <jack@suse.cz>
2013-12-04 12:27:46 +01:00
Jan Kara
df4e7ac0bb ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
ext2_quota_write() doesn't properly setup bh it passes to
ext2_get_block() and thus we hit assertion BUG_ON(maxblocks == 0) in
ext2_get_blocks() (or we could actually ask for mapping arbitrary number
of blocks depending on whatever value was on stack).

Fix ext2_quota_write() to properly fill in number of blocks to map.

CC: stable@vger.kernel.org # >= 2.6.12
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-12-04 12:26:51 +01:00
Eryu Guan
5946d08937 ext4: check for overlapping extents in ext4_valid_extent_entries()
A corrupted ext4 may have out of order leaf extents, i.e.

extent: lblk 0--1023, len 1024, pblk 9217, flags: LEAF UNINIT
extent: lblk 1000--2047, len 1024, pblk 10241, flags: LEAF UNINIT
             ^^^^ overlap with previous extent

Reading such extent could hit BUG_ON() in ext4_es_cache_extent().

	BUG_ON(end < lblk);

The problem is that __read_extent_tree_block() tries to cache holes as
well but assumes 'lblk' is greater than 'prev' and passes underflowed
length to ext4_es_cache_extent(). Fix it by checking for overlapping
extents in ext4_valid_extent_entries().

I hit this when fuzz testing ext4, and am able to reproduce it by
modifying the on-disk extent by hand.

Also add the check for (ee_block + len - 1) in ext4_valid_extent() to
make sure the value is not overflow.

Ran xfstests on patched ext4 and no regression.

Cc: Lukáš Czerner <lczerner@redhat.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-03 21:22:21 -05:00
Junho Ryu
4e8d213980 ext4: fix use-after-free in ext4_mb_new_blocks
ext4_mb_put_pa should hold pa->pa_lock before accessing pa->pa_count.
While ext4_mb_use_preallocated checks pa->pa_deleted first and then
increments pa->count later, ext4_mb_put_pa decrements pa->pa_count
before holding pa->pa_lock and then sets pa->pa_deleted.

* Free sequence
ext4_mb_put_pa (1):		atomic_dec_and_test pa->pa_count
ext4_mb_put_pa (2):		lock pa->pa_lock
ext4_mb_put_pa (3):			check pa->pa_deleted
ext4_mb_put_pa (4):			set pa->pa_deleted=1
ext4_mb_put_pa (5):		unlock pa->pa_lock
ext4_mb_put_pa (6):		remove pa from a list
ext4_mb_pa_callback:		free pa

* Use sequence
ext4_mb_use_preallocated (1):	iterate over preallocation
ext4_mb_use_preallocated (2):	lock pa->pa_lock
ext4_mb_use_preallocated (3):		check pa->pa_deleted
ext4_mb_use_preallocated (4):		increase pa->pa_count
ext4_mb_use_preallocated (5):	unlock pa->pa_lock
ext4_mb_release_context:	access pa

* Use-after-free sequence
[initial status]		<pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (1):	iterate over preallocation
ext4_mb_use_preallocated (2):	lock pa->pa_lock
ext4_mb_use_preallocated (3):		check pa->pa_deleted
ext4_mb_put_pa (1):		atomic_dec_and_test pa->pa_count
[pa_count decremented]		<pa->pa_deleted = 0, pa_count = 0>
ext4_mb_use_preallocated (4):		increase pa->pa_count
[pa_count incremented]		<pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (5):	unlock pa->pa_lock
ext4_mb_put_pa (2):		lock pa->pa_lock
ext4_mb_put_pa (3):			check pa->pa_deleted
ext4_mb_put_pa (4):			set pa->pa_deleted=1
[race condition!]		<pa->pa_deleted = 1, pa_count = 1>
ext4_mb_put_pa (5):		unlock pa->pa_lock
ext4_mb_put_pa (6):		remove pa from a list
ext4_mb_pa_callback:		free pa
ext4_mb_release_context:	access pa

AddressSanitizer has detected use-after-free in ext4_mb_new_blocks
Bug report: http://goo.gl/rG1On3

Signed-off-by: Junho Ryu <jayr@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-03 18:10:28 -05:00
Amit Pundir
95f19f658c epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
Drop EPOLLWAKEUP from epoll events mask if CONFIG_PM_SLEEP is disabled.

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-12-03 15:35:52 +01:00