linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-11-30 13:16:40 +07:00

Author	SHA1	Message	Date
Chris Mason	0b4dcea579	Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir This happens during subvol creation. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-11 11:13:35 -04:00
Chris Mason	067c28adc5	Btrfs: fix -o nodatasum printk spelling It was printing nodatacsum, which was not the correct option name. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-11 09:30:13 -04:00
Yan Zheng	85d4198e40	Btrfs: check duplicate backrefs for both data and metadata lookup_inline_extent_backref only checks for duplicate backref for data extents. It assumes backrefs for tree block never conflict. This patch makes lookup_inline_extent_backref check for duplicate backrefs for both data and tree block, so that we can detect potential bug earlier. This is a safety check, strictly speaking it is not required. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-11 08:51:34 -04:00
Linus Torvalds	8623661180	Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits) Revert "x86, bts: reenable ptrace branch trace support" tracing: do not translate event helper macros in print format ftrace/documentation: fix typo in function grapher name tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK tracing: add protection around module events unload tracing: add trace_seq_vprint interface tracing: fix the block trace points print size tracing/events: convert block trace points to TRACE_EVENT() ring-buffer: fix ret in rb_add_time_stamp ring-buffer: pass in lockdep class key for reader_lock tracing: add annotation to what type of stack trace is recorded tracing: fix multiple use of __print_flags and __print_symbolic tracing/events: fix output format of user stack tracing/events: fix output format of kernel stack tracing/trace_stack: fix the number of entries in the header ring-buffer: discard timestamps that are at the start of the buffer ring-buffer: try to discard unneeded timestamps ring-buffer: fix bug in ring_buffer_discard_commit ftrace: do not profile functions when disabled tracing: make trace pipe recognize latency format flag ...	2009-06-10 19:53:40 -07:00
James Morris	73fbad283c	Merge branch 'next' into for-linus	2009-06-11 11:03:14 +10:00
Shin Hong	fd0fb038d5	Btrfs: init worker struct fields before kthread-run This patch fixes a bug which may result race condition between btrfs_start_workers() and worker_loop(). btrfs_start_workers() executed in a parent thread writes on workers->worker and worker_loop() in a child thread reads workers->worker. However, there is no synchronization enforcing the order of two operations. This patch makes btrfs_start_workers() fill workers->worker before it starts a child thread with worker_loop() Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 20:11:29 -04:00
Linus Torvalds	99e97b860e	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: fix typo in sched-rt-group.txt file ftrace: fix typo about map of kernel priority in ftrace.txt file. sched: properly define the sched_group::cpumask and sched_domain::span fields sched, timers: cleanup avenrun users sched, timers: move calc_load() to scheduler sched: Don't export sched_mc_power_savings on multi-socket single core system sched: emit thread info flags with stack trace sched: rt: document the risk of small values in the bandwidth settings sched: Replace first_cpu() with cpumask_first() in ILB nomination code sched: remove extra call overhead for schedule() sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus()) wait: don't use __wake_up_common() sched: Nominate a power-efficient ilb in select_nohz_balancer() sched: Nominate idle load balancer from a semi-idle package. sched: remove redundant hierarchy walk in check_preempt_wakeup	2009-06-10 15:32:59 -07:00
Michal Simek	0e0c62123b	fs/bio.c: add missing __user annotation As reported by sparse: fs/bio.c:720:13: warning: incorrect type in assignment (different address spaces) fs/bio.c:720:13: expected char iov_addr fs/bio.c:720:13: got void [noderef] <asn:1> fs/bio.c:724:36: warning: incorrect type in argument 2 (different address spaces) fs/bio.c:724:36: expected void const [noderef] <asn:1>from fs/bio.c:724:36: got char iov_addr Signed-off-by: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-10 23:07:15 +02:00
Hisashi Hifumi	4eedeb75e7	Btrfs: pin buffers during write_dev_supers write_dev_supers is called in sequence. First is it called with wait == 0, which starts IO on all of the super blocks for a given device. Then it is called with wait == 1 to make sure they all reach the disk. It doesn't currently pin the buffers between the two calls, and it also assumes the buffers won't go away between the two calls, leading to an oops if the VM manages to free the buffers in the middle of the sync. This fixes that assumption and updates the code to return an error if things are not up to date when the wait == 1 run is done. Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 16:49:25 -04:00
Chris Mason	e5e9a5206a	Btrfs: avoid races between super writeout and device list updates On multi-device filesystems, btrfs writes supers to all of the devices before considering a sync complete. There wasn't any additional locking between super writeout and the device list management code because device management was done inside a transaction and super writeout only happened with no transation writers running. With the btrfs fsync log and other async transaction updates, this has been racey for some time. This adds a mutex to protect the device list. The existing volume mutex could not be reused due to transaction lock ordering requirements. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 15:17:02 -04:00
Jeff Layton	61b6bc525a	cifs: remove never-used in6_addr option This option was never used to my knowledge. Remove it before someone does... Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-06-10 18:34:35 +00:00
Aneesh Kumar K.V	a41f207169	ext4: Avoid corrupting the uninitialized bit in the extent during truncate The unitialized bit was not properly getting preserved in in an extent which is partially truncated because the it was geting set to the value of the first extent to be removed or truncated as part of the truncate operation, and if there are multiple extents are getting removed or modified as part of the truncate operation, it is only the last extent which will might be partially truncated, and its uninitalized bit is not necessarily the same as the first extent to be truncated. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-06-10 14:22:55 -04:00
Jeff Layton	58f7f68f22	cifs: add addr= mount option alias for ip= When you look in /proc/mounts, the address of the server gets displayed as "addr=". That's really a better option to use anyway since it's more generic. What if we eventually want to support non-IP transports? It also makes CIFS option consistent with the NFS option of the same name. Begin the migration to that option name by adding an alias for ip= called addr=. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-06-10 15:39:14 +00:00
Al Viro	7df336ec12	Fix btrfs when ACLs are configured out ... otherwise generic_permission() will allow anything for all files you don't own and that have some group permissions. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:36:43 -04:00
Hisashi Hifumi	524724ed1f	Btrfs: fdatasync should skip metadata writeout In btrfs, fdatasync and fsync are identical, but fdatasync should skip committing transaction when inode->i_state is set just I_DIRTY_SYNC and this indicates only atime or/and mtime updates. Following patch improves fdatasync throughput. --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr --file-fsync-mode=fdatasync run Results: -2.6.30-rc8 Test execution summary: total time: 1980.6540s total number of events: 10001 total time taken by event execution: 1192.9804 per-request statistics: min: 0.0000s avg: 0.1193s max: 15.3720s approx. 95 percentile: 0.7257s Threads fairness: events (avg/stddev): 625.0625/151.32 execution time (avg/stddev): 74.5613/9.46 -2.6.30-rc8-patched Test execution summary: total time: 1695.9118s total number of events: 10000 total time taken by event execution: 871.3214 per-request statistics: min: 0.0000s avg: 0.0871s max: 10.4644s approx. 95 percentile: 0.4787s Threads fairness: events (avg/stddev): 625.0000/131.86 execution time (avg/stddev): 54.4576/8.98 Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:53 -04:00
David Woodhouse	163e783e6a	Btrfs: remove crc32c.h and use libcrc32c directly. There's no need to preserve this abstraction; it used to let us use hardware crc32c support directly, but libcrc32c is already doing that for us through the crypto API -- so we're already using the Intel crc32c acceleration where appropriate. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:53 -04:00
Christoph Hellwig	6cbff00f46	Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION Add support for the standard attributes set via chattr and read via lsattr. Currently we store the attributes in the flags value in the btrfs inode, but I wonder whether we should split it into two so that we don't have to keep converting between the two formats. Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros as they were confusing the existing code and got in the way of the new additions. Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's trivial. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:52 -04:00
Chris Mason	c289811cc0	Btrfs: autodetect SSD devices During mount, btrfs will check the queue nonrot flag for all the devices found in the FS. If they are all non-rotating, SSD mode is enabled by default. If the FS was mounted with -o nossd, the non-rotating flag is ignored. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:52 -04:00
Chris Mason	451d7585a8	Btrfs: add mount -o ssd_spread to spread allocations out Some SSDs perform best when reusing block numbers often, while others perform much better when clustering strictly allocates big chunks of unused space. The default mount -o ssd will find rough groupings of blocks where there are a bunch of free blocks that might have some allocated blocks mixed in. mount -o ssd_spread will make sure there are no allocated blocks mixed in. It should perform better on lower end SSDs. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:52 -04:00
Chris Mason	c604480171	Btrfs: avoid allocation clusters that are too spread out In SSD mode for data, and all the time for metadata the allocator will try to find a cluster of nearby blocks for allocations. This commit adds extra checks to make sure that each free block in the cluster is close to the last one. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:51 -04:00
Chris Mason	3b30c22f64	Btrfs: Add mount -o nossd This allows you to turn off the ssd mode via remount. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:50 -04:00
Chris Mason	d644d8a1e3	Btrfs: avoid IO stalls behind congested devices in a multi-device FS The btrfs IO submission threads try to service a bunch of devices with a small number of threads. They do a congestion check to try and avoid waiting on requests for a busy device. The checks make sure we've sent a few requests down to a given device just so that we aren't bouncing between busy devices without actually sending down any IO. The counter used to decide if we can switch to the next device is somewhat overloaded. It is also being used to decide if we've done a good batch of requests between the WRITE_SYNC or regular priority lists. It may get reset to zero often, leaving us hammering on a busy device instead of moving on to another disk. This commit adds a new counter for the number of bios sent while servicing a device. It doesn't get reset or fiddled with. On multi-device filesystems, this fixes IO stalls in streaming write workloads. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:49 -04:00
Chris Mason	d84275c938	Btrfs: don't allow WRITE_SYNC bios to starve out regular writes Btrfs uses dedicated threads to submit bios when checksumming is on, which allows us to make sure the threads dedicated to checksumming don't get stuck waiting for requests. For each btrfs device, there are two lists of bios. One list is for WRITE_SYNC bios and the other is for regular priority bios. The IO submission threads used to process all of the WRITE_SYNC bios first and then switch to the regular bios. This commit makes sure we don't completely starve the regular bios by rotating between the two lists. WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries to run in batches to avoid seeking. Benchmarking shows this eliminates stalls during streaming buffered writes on both multi-device and single device filesystems. If the regular bios starve, the system can end up with a large amount of ram pinned down in writeback pages. If we are a little more fair between the two classes, we're able to keep throughput up and make progress on the bulk of our dirty ram. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:49 -04:00
Chris Mason	585ad2c379	Btrfs: fix metadata dirty throttling limits Once a metadata block has been written, it must be recowed, so the btrfs dirty balancing call has a check to make sure a fair amount of metadata was actually dirty before it started writing it back to disk. A previous commit had changed the dirty tracking for metadata without updating the btrfs dirty balancing checks. This commit switches it to use the correct counter. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:48 -04:00
Chris Mason	2c943de6ad	Btrfs: reduce mount -o ssd CPU usage The block allocator in SSD mode will try to find groups of free blocks that are close together. This commit makes it loop less on a given group size before bumping it. The end result is that we are less likely to fill small holes in the available free space, but we don't waste as much CPU building the large cluster used by ssd mode. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:48 -04:00
Chris Mason	cfbb930846	Btrfs: balance btree more often With the new back reference code, the cost of a balance has gone down in terms of the number of back reference updates done. This commit makes us more aggressively balance leaves and nodes as they become less full. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:47 -04:00
Chris Mason	b361242102	Btrfs: stop avoiding balancing at the end of the transaction. When the delayed reference code was added, some checks were added to avoid extra balancing while the delayed references were being flushed. This made for less efficient btrees, but it reduced the chances of loops where no forward progress was made because the balances made more delayed ref updates. With the new dead root removal code and the mixed back references, the extent allocation tree is no longer using precise back refs, and the delayed reference updates don't carry the risk of looping forever anymore. So, the balance avoidance is no longer required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:47 -04:00
Yan Zheng	5d4f98a28c	Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:46 -04:00
Yan Zheng	5c939df56c	btrfs: Fix set/clear_extent_bit for 'end == (u64)-1' There are some 'start = state->end + 1;' like code in set_extent_bit and clear_extent_bit. They overflow when end == (u64)-1. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-10 11:29:46 -04:00
Christoph Hellwig	ef14f0c157	xfs: use generic Posix ACL code This patch rips out the XFS ACL handling code and uses the generic fs/posix_acl.c code instead. The ondisk format is of course left unchanged. This also introduces the same ACL caching all other Linux filesystems do by adding pointers to the acl and default acl in struct xfs_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-10 17:07:47 +02:00
Ryusuke Konishi	c3a7abf06c	nilfs2: support contiguous lookup of blocks Although get_block() callback function can return extent of contiguous blocks with bh->b_size, nilfs_get_block() function did not support this feature. This adds contiguous lookup feature to the block mapping codes of nilfs, and allows the nilfs_get_blocks() function to return the extent information by applying the feature. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:12 +09:00
Ryusuke Konishi	fa032744ad	nilfs2: add sync_page method to page caches of meta data This applies block_sync_page() function to the sync_page method of page caches for meta data files, gc page caches, and btree node buffers. This is a companion patch of ("nilfs2: enable sync_page mothod") which applied the function for data pages. This allows lock_page() for those meta data to unplug pending bio requests. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:12 +09:00
Ryusuke Konishi	a53b4751ae	nilfs2: use device's backing_dev_info for btree node caches Previously, default_backing_dev_info was used for the mapping of btree node caches. This uses device dependent backing_dev_info to allow detailed control of the device for the btree node pages. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:12 +09:00
Ryusuke Konishi	30c25be71f	nilfs2: return EBUSY against delete request on snapshot This helps userland programs like the rmcp command to distinguish error codes returned against a checkpoint removal request. Previously -EPERM was returned, and not discriminable from real permission errors. This also allows removal of the latest checkpoint because the deletion leads to create a new checkpoint, and thus it's harmless for the filesystem. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:12 +09:00
Ryusuke Konishi	e85dc1d529	nilfs2: enable sync_page method This adds a missing sync_page method which unplugs bio requests when waiting for page locks. This will improve read performance of nilfs. Here is a measurement result using dd command. Without this patch: # mount -t nilfs2 /dev/sde1 /test # dd if=/test/aaa of=/dev/null bs=512k 1024+0 records in 1024+0 records out 536870912 bytes (537 MB) copied, 6.00688 seconds, 89.4 MB/s With this patch: # mount -t nilfs2 /dev/sde1 /test # dd if=/test/aaa of=/dev/null bs=512k 1024+0 records in 1024+0 records out 536870912 bytes (537 MB) copied, 3.54998 seconds, 151 MB/s Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:11 +09:00
Ryusuke Konishi	30bda0b8ae	nilfs2: set bio unplug flag for the last bio in segment This sets BIO_RW_UNPLUG flag on the last bio of each segment during write. The last bio should be unplugged immediately because the caller waits for the completion after the submission. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:11 +09:00
Ryusuke Konishi	003ff182fd	nilfs2: allow future expansion of metadata read out via get info ioctl Nilfs has some ioctl commands to read out metadata from meta data files: - NILFS_IOCTL_GET_CPINFO for checkpoint file, - NILFS_IOCTL_GET_SUINFO for segment usage file, and - NILFS_IOCTL_GET_VINFO for Disk Address Transalation (DAT) file, respectively. Every routine on these metadata files is implemented so that it allows future expansion of on-disk format. But, the above ioctl commands do not support expansion even though nilfs_argv structure can handle arbitrary size for data exchanged via ioctl. This allows future expansion of the following structures which give basic format of the "get information" ioctls: - struct nilfs_cpinfo - struct nilfs_suinfo - struct nilfs_vinfo So, this introduces forward compatility of such ioctl commands. In this patch, a sanity check in nilfs_ioctl_get_info() function is changed to accept larger data structure [1], and metadata read routines are rewritten so that they become compatible for larger structures; the routines will just ignore the remaining fields which the current version of nilfs doesn't know. [1] The ioctl function already has another upper limit (PAGE_SIZE against a structure, which appears in nilfs_ioctl_wrap_copy function), and this will not cause security problem. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:11 +09:00
Hisashi Hifumi	258ef67e24	NILFS2: Pagecache usage optimization on NILFS2 Hi, I introduced "is_partially_uptodate" aops for NILFS2. A page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate on pagesize != blocksize environment. This aops checks that all buffers which correspond to a part of a file that we want to read are uptodate. If so, we do not have to issue actual read IO to HDD even if a page is not uptodate because the portion we want to read are uptodate. "block_is_partially_uptodate" function is already used by ext2/3/4. With the following patch random read/write mixed workloads or random read after random write workloads can be optimized and we can get performance improvement. I did a performance test using the sysbench. 1 --file-block-size=8K --file-total-size=2G --file-test-mode=rndrw --file-fsync-freq=0 --fil e-rw-ratio=1 run -2.6.30-rc5 Test execution summary: total time: 151.2907s total number of events: 200000 total time taken by event execution: 2409.8387 per-request statistics: min: 0.0000s avg: 0.0120s max: 0.9306s approx. 95 percentile: 0.0439s Threads fairness: events (avg/stddev): 12500.0000/238.52 execution time (avg/stddev): 150.6149/0.01 -2.6.30-rc5-patched Test execution summary: total time: 140.8828s total number of events: 200000 total time taken by event execution: 2240.8577 per-request statistics: min: 0.0000s avg: 0.0112s max: 0.8750s approx. 95 percentile: 0.0418s Threads fairness: events (avg/stddev): 12500.0000/218.43 execution time (avg/stddev): 140.0536/0.01 arch: ia64 pagesize: 16k Thanks. Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:11 +09:00
Ryusuke Konishi	7cde31d7d6	nilfs2: remove nilfs_btree_operations from btree mapping will remove indirect function calls using nilfs_btree_operations table. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:11 +09:00
Ryusuke Konishi	355c6b6103	nilfs2: remove nilfs_direct_operations from direct mapping will remove indirect function calls using nilfs_direct_operations table. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:11 +09:00
Ryusuke Konishi	d4b961576d	nilfs2: remove bmap pointer operations Previously, the bmap codes of nilfs used three types of function tables. The abuse of indirect function calls decreased source readability and suffered many indirect jumps which would confuse branch prediction of processors. This eliminates one type of the function tables, nilfs_bmap_ptr_operations, which was used to dispatch low level pointer operations of the nilfs bmap. This adds a new integer variable "b_ptr_type" to nilfs_bmap struct, and uses the value to select the pointer operations. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:10 +09:00
Ryusuke Konishi	3033342a0b	nilfs2: remove useless b_low and b_high fields from nilfs_bmap struct This will cut off 16 bytes from the nilfs_bmap struct which is embedded in the on-memory inode of nilfs. The b_high field was never used, and the b_low field stores a constant value which can be determined by whether the inode uses btree for block mapping or not. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:10 +09:00
Ryusuke Konishi	e473c1f265	nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr function This indirect function is set to NULL only for gc cache inodes, but the gc cache inodes never call this function. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:10 +09:00
Ryusuke Konishi	f198dbb9cf	nilfs2: move get block functions in bmap.c into btree codes Two get block function for btree nodes, nilfs_bmap_get_block() and nilfs_bmap_get_new_block(), are called only from the btree codes. This relocation will increase opportunities of compiler optimization. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:10 +09:00
Ryusuke Konishi	9f098900ad	nilfs2: remove nilfs_bmap_delete_block nilfs_bmap_delete_block() is a wrapper function calling nilfs_btnode_delete(). This removes it for simplicity. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:10 +09:00
Ryusuke Konishi	087d01b425	nilfs2: remove nilfs_bmap_put_block nilfs_bmap_put_block() is a wrapper function calling brelse(). This eliminates the wrapper for simplicity. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:10 +09:00
Ryusuke Konishi	654137dd46	nilfs2: remove header file for segment list operations This will eliminate obsolete list operations of nilfs_segment_entry structure which has been used to handle mutiple segment numbers. The patch ("nilfs2: remove list of freeing segments") removed use of the structure from the segment constructor code, and this patch simplifies the remaining code by integrating it into recovery.c. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:09 +09:00
Ryusuke Konishi	071cb4b819	nilfs2: eliminate removal list of segments This will clean up the removal list of segments and the related functions from segment.c and ioctl.c, which have hurt code readability. This elimination is applied by using nilfs_sufile_updatev() previously introduced in the patch ("nilfs2: add sufile function that can modify multiple segment usages"). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:09 +09:00
Ryusuke Konishi	dda54f4b87	nilfs2: add sufile function that can modify multiple segment usages This is a preparation for the later cleanup patch ("nilfs2: remove list of freeing segments"). This adds nilfs_sufile_updatev() to sufile, which can modify multiple segment usages at a time. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:09 +09:00
Ryusuke Konishi	d97a51a7e3	nilfs2: unify bmap operations starting use of indirect block address This simplifies some low level functions of bmap. Three bmap pointer operations, nilfs_bmap_start_v(), nilfs_bmap_commit_v(), and nilfs_bmap_abort_v(), are unified into one nilfs_bmap_start_v() function. And the related indirect function calls are replaced with it. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:09 +09:00
Ryusuke Konishi	6582207064	nilfs2: remove nilfs_dat_prepare_free function This function is unused. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-06-10 23:41:09 +09:00
Boaz Harrosh	fc2fac5b5f	[SCSI] libosd: Define an osd_dev wrapper to retrieve the request_queue libosd users that need to work with bios, must sometime use the request_queue associated with the osd_dev. Make a wrapper for that, and convert all in-tree users. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2009-06-10 09:00:13 -05:00
Boaz Harrosh	62f469b596	[SCSI] libosd: osd_req_{read,write} takes a length parameter For supporting of chained-bios we can not inspect the first bio only, as before. Caller shall pass the total length of the request, ie. sum_bytes(bio-chain). Also since the bio might be a chain we don't set it's direction on behalf of it's callers. The bio direction should be properly set prior to this call. So fix a couple of write users that now need to set the bio direction properly [In this patch I change both library code and user sites at exofs, to make it easy on integration. It should be submitted via James's scsi-misc tree.] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2009-06-10 08:59:52 -05:00
Boaz Harrosh	0e35afbc8b	[SCSI] libosd: osd_req_{read,write}_kern new API By popular demand, define usefull wrappers for osd_req_read/write that recieve kernel pointers. All users had their own. Also remove these from exofs Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2009-06-10 08:57:07 -05:00
Steven Whitehouse	003dec8913	GFS2: Merge gfs2_get_sb into gfs2_get_sb_meta These don't need to be separate functions. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-06-10 10:31:45 +01:00
Steven Whitehouse	40bc9a27e0	GFS2: Fix cache coherency between truncate and O_DIRECT read If a page was partially zeroed as the result of a truncate, then it was not being correctly marked dirty. This resulted in the deleted data reappearing if the file was read back via direct I/O. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-06-10 09:09:40 +01:00
Jan Kara	a61d90d75d	jbd: fix race in buffer processing in commit code In commit code, we scan buffers attached to a transaction. During this scan, we sometimes have to drop j_list_lock and then we recheck whether the journal buffer head didn't get freed by journal_try_to_free_buffers(). But checking for buffer_jbd(bh) isn't enough because a new journal head could get attached to our buffer head. So add a check whether the journal head remained the same and whether it's still at the same transaction and list. This is a nasty bug and can cause problems like memory corruption (use after free) or trigger various assertions in JBD code (observed). Signed-off-by: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-09 16:59:03 -07:00
Ian Kent	463aea1a1c	autofs4: remove hashed check in validate_wait() The recent ->lookup() deadlock correction required the directory inode mutex to be dropped while waiting for expire completion. We were concerned about side effects from this change and one has been identified. I saw several error messages. They cause autofs to become quite confused and don't really point to the actual problem. Things like: handle_packet_missing_direct:1376: can't find map entry for (43,1827932) which is usually totally fatal (although in this case it wouldn't be except that I treat is as such because it normally is). do_mount_direct: direct trigger not valid or already mounted /test/nested/g3c/s1/ss1 which is recoverable, however if this problem is at play it can cause autofs to become quite confused as to the dependencies in the mount tree because mount triggers end up mounted multiple times. It's hard to accurately check for this over mounting case and automount shouldn't need to if the kernel module is doing its job. There was one other message, similar in consequence of this last one but I can't locate a log example just now. When checking if a mount has already completed prior to adding a new mount request to the wait queue we check if the dentry is hashed and, if so, if it is a mount point. But, if a mount successfully completed while we slept on the wait queue mutex the dentry must exist for the mount to have completed so the test is not really needed. Mounts can also be done on top of a global root dentry, so for the above case, where a mount request completes and the wait queue entry has already been removed, the hashed test returning false can cause an incorrect callback to the daemon. Also, d_mountpoint() is not sufficient to check if a mount has completed for the multi-mount case when we don't have a real mount at the base of the tree. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-09 16:59:03 -07:00
Hisashi Hifumi	e04cc15f52	ocfs2: fdatasync should skip unimportant metadata writeout In ocfs2, fdatasync and fsync are identical. I think fdatasync should skip committing transaction when inode->i_state is set just I_DIRTY_SYNC and this indicates only atime or/and mtime updates. Following patch improves fdatasync throughput. #sysbench --num-threads=16 --max-requests=300000 --test=fileio --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr --file-fsync-mode=fdatasync run Results: -2.6.30-rc8 Test execution summary: total time: 107.1445s total number of events: 119559 total time taken by event execution: 116.1050 per-request statistics: min: 0.0000s avg: 0.0010s max: 0.1220s approx. 95 percentile: 0.0016s Threads fairness: events (avg/stddev): 7472.4375/303.60 execution time (avg/stddev): 7.2566/0.64 -2.6.30-rc8-patched Test execution summary: total time: 86.8529s total number of events: 300016 total time taken by event execution: 24.3077 per-request statistics: min: 0.0000s avg: 0.0001s max: 0.0336s approx. 95 percentile: 0.0001s Threads fairness: events (avg/stddev): 18751.0000/718.75 execution time (avg/stddev): 1.5192/0.05 Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-09 10:45:47 -07:00
Li Zefan	55782138e4	tracing/events: convert block trace points to TRACE_EVENT() TRACE_EVENT is a more generic way to define tracepoints. Doing so adds these new capabilities to this tracepoint: - zero-copy and per-cpu splice() tracing - binary tracing without printf overhead - structured logging records exposed under /debug/tracing/events - trace events embedded in function tracer output and other plugins - user-defined, per tracepoint filter expressions ... Cons: - no dev_t info for the output of plug, unplug_timer and unplug_io events. no dev_t info for getrq and sleeprq events if bio == NULL. no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL. This is mainly because we can't get the deivce from a request queue. But this may change in the future. - A packet command is converted to a string in TP_assign, not TP_print. While blktrace do the convertion just before output. Since pc requests should be rather rare, this is not a big issue. - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT has a unique format, which means we have some unused data in a trace entry. The overhead is minimized by using __dynamic_array() instead of __array(). I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing: dd dd + ioctl blktrace dd + TRACE_EVENT (splice) 1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s 2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s 3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s So the overhead of tracing is very small, and no regression when using those trace events vs blktrace. And the binary output of TRACE_EVENT is much smaller than blktrace: # ls -l -h -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out Following are some comparisons between TRACE_EVENT and blktrace: plug: kjournald-480 [000] 303.084981: block_plug: [kjournald] kjournald-480 [000] 303.084981: 8,0 P N [kjournald] unplug_io: kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1 kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1 remap: kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384 kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384 bio_backmerge: kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald] kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald] getrq: kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald] kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald] bash-2066 [001] 1072.953770: 8,0 G N [bash] bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash] rq_complete: konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0] konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0] ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0] ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0] rq_insert: kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald] kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald] Changelog from v2 -> v3: - use the newly introduced __dynamic_array(). Changelog from v1 -> v2: - use __string() instead of __array() to minimize the memory required to store hex dump of rq->cmd(). - support large pc requests. - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT. - some cleanups. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-09 12:34:23 -04:00
Theodore Ts'o	0eab928221	ext4: Don't treat a truncation of a zero-length file as replace-via-truncate If a non-existent file is opened via O_WRONLY\|O_CREAT\|O_TRUNC, there's no need to treat this as a true file truncation, so we shouldn't activate the replace-via-truncate hueristic. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-06-09 09:54:40 -04:00
Tejun Heo	151060ac13	CUSE: implement CUSE - Character device in Userspace CUSE enables implementing character devices in userspace. With recent additions of ioctl and poll support, FUSE already has most of what's necessary to implement character devices. All CUSE has to do is bonding all those components - FUSE, chardev and the driver model - nicely. When client opens /dev/cuse, kernel starts conversation with CUSE_INIT. The client tells CUSE which device it wants to create. As the previous patch made fuse_file usable without associated fuse_inode, CUSE doesn't create super block or inodes. It attaches fuse_file to cdev file->private_data during open and set ff->fi to NULL. The rest of the operation is almost identical to FUSE direct IO case. Each CUSE device has a corresponding directory /sys/class/cuse/DEVNAME (which is symlink to /sys/devices/virtual/class/DEVNAME if SYSFS_DEPRECATED is turned off) which hosts "waiting" and "abort" among other things. Those two files have the same meaning as the FUSE control files. The only notable lacking feature compared to in-kernel implementation is mmap support. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2009-06-09 11:24:11 +02:00
James Morris	0b4ec6e4e0	Merge branch 'master' into next	2009-06-09 09:27:53 +10:00
Toshiyuki Okajima	9aee228607	ext4: fix dx_map_entry to support 256k directory blocks The dx_map_entry structure doesn't support over 64KB block size by current usage of its member("offs"). Because "offs" treats an offset of copies of the ext4_dir_entry_2 structure as is. This member size is 16 bits. But real offset for over 64KB(256KB) block size needs 18 bits. However, real offset keeps 4 byte boundary, so lower 2 bits is not used. Therefore, we do the following to fix this limitation: For "store": we divide the real offset by 4 and then store this result to "offs" member. For "use": we multiply "offs" member by 4 and then use this result as real offset. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-06-08 12:41:35 -04:00
Christoph Hellwig	8b5403a6d7	xfs: remove SYNC_BDFLUSH SYNC_BDFLUSH is a leftover from IRIX and rather misnamed for todays code. Make xfs_sync_fsdata and xfs_dq_sync use the SYNC_TRYLOCK flag for not blocking on logs just as the inode sync code already does. For xfs_sync_fsdata it's a trivial 1:1 replacement, but for xfs_qm_sync I use the opportunity to decouple the non-blocking lock case from the different flushing modes, similar to the inode sync code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:37:16 +02:00
Christoph Hellwig	b0710ccc6d	xfs: remove SYNC_IOWAIT We want to wait for all I/O to finish when we do data integrity syncs. So there is no reason to keep SYNC_WAIT separate from SYNC_IOWAIT. This causes a little change in behaviour for the ENOSPC flushing code which now does a second submission and wait of buffered I/O, but that should finish ASAP as we already did an asynchronous writeout earlier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:37:11 +02:00
Christoph Hellwig	075fe10286	xfs: split xfs_sync_inodes xfs_sync_inodes is used to write back either file data or inode metadata. In general we always do these separately, except for one fishy case in xfs_fs_put_super that does both. So separate xfs_sync_inodes into separate xfs_sync_data and xfs_sync_attr functions. In xfs_fs_put_super we first call the data sync and then the attr sync as that was the previous order. The moved log force in that path doesn't make a difference because we will force the log again as part of the real unmount process. The filesystem readonly checks are not performed by the new function but instead moved into the callers, given that most callers alredy have it further up in the stack. Also add debug checks that we do not pass in incorrect flags in the new xfs_sync_data and xfs_sync_attr function and fix the one place that did pass in a wrong flag. Also remove a comment mentioning xfs_sync_inodes that has been incorrect for a while because we always take either the iolock or ilock in the sync path these days. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:48 +02:00
Christoph Hellwig	fe588ed328	xfs: use generic inode iterator in xfs_qm_dqrele_all_inodes Use xfs_inode_ag_iterator instead of opencoding the inode walk in the quota code. Mark xfs_inode_ag_iterator and xfs_sync_inode_valid non-static to allow using them from the quota code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:27 +02:00
Dave Chinner	75f3cb1393	xfs: introduce a per-ag inode iterator Given that we walk across the per-ag inode lists so often, it makes sense to introduce an iterator for this. Convert the sync and reclaim code to use this new iterator, quota code will follow in the next patch. Also change xfs_reclaim_inode to return -EGAIN instead of 1 for an inode already under reclaim. This simplifies the AG iterator and doesn't matter for the only other caller. [hch: merged the lookup and execute callbacks back into one to get the pag_ici_lock locking correct and simplify the code flow] Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:14 +02:00
Dave Chinner	abc1064742	xfs: remove unused parameter from xfs_reclaim_inodes The noblock parameter of xfs_reclaim_inodes is only ever set to zero. Remove it and all the conditional code that is never executed. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:12 +02:00
Dave Chinner	1da8eecab5	xfs: factor out inode validation for sync Separate the validation of inodes found by the radix tree walk from the radix tree lookup. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:07 +02:00
Christoph Hellwig	845b6d0cbb	xfs: split inode flushing from xfs_sync_inodes_ag In many cases we only want to sync inode metadata. Split out the inode flushing into a separate helper to prepare factoring the inode sync code. Based on a patch from Dave Chinner, but redone to keep the current behaviour exactly and leave changes to the flushing logic to another patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:05 +02:00
Dave Chinner	5a34d5cd09	xfs: split inode data writeback from xfs_sync_inodes_ag In many cases we only want to sync inode data. Start spliting the inode sync into data sync and inode sync by factoring out the inode data flush. [hch: minor cleanups] Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:35:03 +02:00
Christoph Hellwig	7d095257e3	xfs: kill xfs_qmops Kill the quota ops function vector and replace it with direct calls or stubs in the CONFIG_XFS_QUOTA=n case. Make sure we check XFS_IS_QUOTA_RUNNING in the right spots. We can remove the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set otherwise. This brings us back closer to the way this code worked in IRIX and earlier Linux versions, but we keep a lot of the more useful factoring of common code. Eventually we should also kill xfs_qm_bhv.c, but that's left for a later patch. Reduces the size of the source code by about 250 lines and the size of XFS module by about 1.5 kilobytes with quotas enabled: text data bss dec hex filename 615957 2960 3848 622765 980ad fs/xfs/xfs.o 617231 3152 3848 624231 98667 fs/xfs/xfs.o.old Fallout: - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects the inode locked and xfs_qm_dqattach which does the locking around it, thus removing XFS_QMOPT_ILOCKED. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:33:32 +02:00
Christoph Hellwig	0c5e1ce89f	xfs: validate quota log items during log recovery Arkadiusz has seen really strange crashes in xfs_qm_dqcheck that I can only explain by a log item being too smal to actually fit the xfs_dqblk_t we're dereferencing all over xfs_qm_dqcheck. So add graceful checks for NULL or too small quota items to the log recovery code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:33:21 +02:00
Christoph Hellwig	e1696834e8	xfs: update max log size Commit a6634fba3dec4a92f0a2c4e30c80b634c0576ad5 in xfsprogs increased the maximum log size supported by mkfs. Merged back the changes to xfs_fs.h so the growfs enforced the same limit and the headers are in sync. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>	2009-06-08 15:32:59 +02:00
Christoph Hellwig	21bea49594	fat: split fat_generic_ioctl Split up fat_generic_ioctl and add separate functions for the two implemented ioctls. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>	2009-06-08 21:39:50 +09:00
Artem Bityutskiy	f2c5dbd7b7	UBIFS: start using hrtimers UBIFS uses timers for write-buffer write-back. It is not crucial for us to write-back exactly on time. We are fine to write-back a little earlier or later. And this means we may optimize UBIFS timer so that it could be groped with a close timer event, so that the CPU would not be waken up just to do the write back. This is optimization to lessen power consumption, which is important in embedded devices UBIFS is used for. hrtimers have a nice feature: they are effectively range timers, and we may defind the soft and hard limits for it. Standard timers do not have these feature. They may only be made deferrable, but this means there is effectively no hard limit. So, we will better use hrtimers. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-06-08 11:14:58 +03:00
Artem Bityutskiy	3f36406f26	UBIFS: do not forget to register BDI device Reviewed-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-06-08 11:14:21 +03:00
Bartlomiej Zolnierkiewicz	db429e9ec0	partitions: add ->set_capacity block device method * Add ->set_capacity block device method and use it in rescan_partitions() to attempt enabling native capacity of the device upon detecting the partition which exceeds device capacity. * Add GENHD_FL_NATIVE_CAPACITY flag to try limit attempts of enabling native capacity during partition scan. Together with the consecutive patch implementing ->set_capacity method in ide-gd device driver this allows automatic disabling of Host Protected Area (HPA) if any partitions overlapping HPA are detected. Cc: Robert Hancock <hancockrwd@gmail.com> Cc: Frans Pop <elendil@planet.nl> Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Emphatically-Acked-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2009-06-07 13:52:52 +02:00
Bartlomiej Zolnierkiewicz	02c33b123e	partitions: warn about the partition exceeding device capacity The current warning message says only about the kernel's action taken without mentioning the underlying reason behind it. Noticed-by: Robert Hancock <hancockrwd@gmail.com> Cc: Frans Pop <elendil@planet.nl> Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl> Cc: Al Viro <viro@zeniv.linux.org.uk> Emphatically-Acked-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2009-06-07 13:52:51 +02:00
Hugh Dickins	f07502dae2	integrity: fix IMA inode leak CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects until the system runs out of memory. Nowhere is calling ima_inode_free() a.k.a. ima_iint_delete(). Fix that by calling it from destroy_inode(). Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-06 14:33:41 -07:00
Steve French	f0472d0ec8	[CIFS] Add mention of new mount parm (forceuid) to cifs readme Also update fs/cifs/CHANGES Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-06-06 21:09:39 +00:00
Jeff Layton	4ae1507f6d	cifs: make overriding of ownership conditional on new mount options We have a bit of a problem with the uid= option. The basic issue is that it means too many things and has too many side-effects. It's possible to allow an unprivileged user to mount a filesystem if the user owns the mountpoint, /bin/mount is setuid root, and the mount is set up in /etc/fstab with the "user" option. When doing this though, /bin/mount automatically adds the "uid=" and "gid=" options to the share. This is fortunate since the correct uid= option is needed in order to tell the upcall what user's credcache to use when generating the SPNEGO blob. On a mount without unix extensions this is fine -- you generally will want the files to be owned by the "owner" of the mount. The problem comes in on a mount with unix extensions. With those enabled, the uid/gid options cause the ownership of files to be overriden even though the server is sending along the ownership info. This means that it's not possible to have a mount by an unprivileged user that shows the server's file ownership info. The result is also inode permissions that have no reflection at all on the server. You simply cannot separate ownership from the mode in this fashion. This behavior also makes MultiuserMount option less usable. Once you pass in the uid= option for a mount, then you can't use unix ownership info and allow someone to share the mount. While I'm not thrilled with it, the only solution I can see is to stop making uid=/gid= force the overriding of ownership on mounts, and to add new mount options that turn this behavior on. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-06-06 21:03:27 +00:00
Ingo Molnar	75b5032212	Merge branch 'linus' into perfcounters/core Merge reason: Pick up the latest fixes before the -v8 perfcounters release. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-06 20:21:28 +02:00
Al Viro	72a43d63cb	ext3/4 with synchronous writes gets wedged by Postfix OK, that's probably the easiest way to do that, as much as I don't like it... Since iget() et.al. will not accept I_FREEING (will wait to go away and restart), and since we'd better have serialization between new/free on fs data structures anyway, we can afford simply skipping I_FREEING et.al. in insert_inode_locked(). We do that from new_inode, so it won't race with free_inode in any interesting ways and it won't race with iget (of any origin; nfsd or in case of fs corruption a lookup) since both still will wait for I_LOCK. Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Tested-by: David Watson <dbwatson@ukfsn.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-06-06 06:17:26 -04:00
Theodore Ts'o	460bcf57b1	Fix nobh_truncate_page() to not pass stack garbage to get_block() The nobh_truncate_page() function is used by ext2, exofs, and jfs. Of these three, only ext2 and jfs's get_block() function pays attention to bh->b_size --- which is normally always the filesystem blocksize except when the get_block() function is called by either mpage_readpage(), mpage_readpages(), or the direct I/O routines in fs/direct_io.c. Unfortunately, nobh_truncate_page() does not initialize map_bh before calling the filesystem-supplied get_block() function. So ext2 and jfs will try to calculate the number of blocks to map by taking stack garbage and shifting it left by inode->i_blkbits. This should be mostly harmless (except the filesystem will do some unnneeded work) unless the stack garbage is less than filesystem's blocksize, in which case maxblocks will be zero, and the attempt to find out whether or not the filesystem has a hole at a given logical block will fail, and the page cache entry might not get zero'ed out. Also if the stack garbage in in map_bh->state happens to have the BH_Mapped bit set, there could be an attempt to call readpage() on a non-existent page, which could cause nobh_truncate_page() to return an error when it should not. Fix this by initializing map_bh->state and map_bh->size. Fortunately, it's probably fairly unlikely that ext2 and jfs users mount with nobh these days. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-06-06 06:17:25 -04:00
Linus Torvalds	064e38aade	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: Fix oops and use after free during space balancing Btrfs: set device->total_disk_bytes when adding new device	2009-06-05 11:54:28 -07:00
Steven Whitehouse	f6d03139d7	GFS2: Fix locking issue mounting gfs2meta fs This patch uses sget() to get a reference to the existing gfs2 sb when mouting the gfs2meta filesystem (in fact thats just another mount of the gfs2 filesystem with a different root and this interface is for backward compatibility). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reported-by: Benjamin Marzinski <bmarzins@redhat.com> Tested-by: Benjamin Marzinski <bmarzins@redhat.com> Cc: Christoph Hellwig <hch@infradead.org>	2009-06-05 08:35:15 +01:00
Aneesh Kumar K.V	f8514083cd	ext4: truncate the file properly if we fail to copy data from userspace In generic_perform_write if we fail to copy the user data we don't update the inode->i_size. We should truncate the file in the above case so that we don't have blocks allocated outside inode->i_size. Add the inode to orphan list in the same transaction as block allocation This ensures that if we crash in between the recovery would do the truncate. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> CC: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-06-05 00:56:49 -04:00
Aneesh Kumar K.V	1938a150c2	ext4: Avoid leaking blocks after a block allocation failure We should add inode to the orphan list in the same transaction as block allocation. This ensures that if we crash after a failed block allocation and before we do a vmtruncate we don't leak block (ie block marked as used in bitmap but not claimed by the inode). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> CC: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-06-05 01:00:26 -04:00
Eric Sandeen	b31e15527a	ext4: Change all super.c messages to print the device This patch changes ext4 super.c to include the device name with all warning/error messages, by using a new utility function ext4_msg. It's a rather large patch, but very mechanic. I left debug printks alone. This is a straightforward port of a patch which Andi Kleen did for ext3. Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-06-04 17:36:36 -04:00
Jan Kara	03f5d8bcf0	ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle() Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This seems to be a relict from some old days and setting disksize in this function does not make much sense. Currently it was set only by ext4_getblk(). Since the parameter has some effect only if create == 1, it is easy to check by grepping through the sources that the three callers which end up calling ext4_getblk() with create == 1 (ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set disksize themselves. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-06-09 00:17:05 -04:00
Jens Axboe	172124e220	Revert "block: implement blkdev_readpages" This reverts commit `db2dbb12dc`. It apparently causes problems with partition table read-ahead on archs with large page sizes. Until that problem is diagnosed further, just drop the readpages support on block devices. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-04 22:34:44 +02:00
Chris Mason	44fb551163	Btrfs: Fix oops and use after free during space balancing The btrfs allocator uses list_for_each to walk the available block groups when searching for free blocks. It starts off with a hint to help find the best block group for a given allocation. The hint is resolved into a block group, but we don't properly check to make sure the block group we find isn't in the middle of being freed due to filesystem shrinking or balancing. If it is being freed, the list pointers in it are bogus and can't be trusted. But, the code happily goes along and uses them in the list_for_each loop, leading to all kinds of fun. The fix used here is to check to make sure the block group we find really is on the list before we use it. list_del_init is used when removing it from the list, so we can do a proper check. The allocation clustering code has a similar bug where it will trust the block group in the current free space cluster. If our allocation flags have changed (going from single spindle dup to raid1 for example) because the drives in the FS have changed, we're not allowed to use the old block group any more. The fix used here is to check the current cluster against the current allocation flags. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-04 15:41:27 -04:00
Yan Zheng	2cc3c559fb	Btrfs: set device->total_disk_bytes when adding new device It was not being properly initialized, and so the size saved to disk was not correct. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-06-04 09:23:57 -04:00
Tao Ma	06c59bb896	ocfs2: Remove redundant gotos in ocfs2_mount_volume() Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:20:15 -07:00
Joel Becker	73be192b17	ocfs2: Add statistics for the checksum and ecc operations. It would be nice to know how often we get checksum failures. Even better, how many of them we can fix with the single bit ecc. So, we add a statistics structure. The structure can be installed into debugfs wherever the user wants. For ocfs2, we'll put it in the superblock-specific debugfs directory and pass it down from our higher-level functions. The stats are only registered with debugfs when the filesystem supports metadata ecc. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:15:36 -07:00
Srinivas Eeda	15633a220f	ocfs2 patch to track delayed orphan scan timer statistics Patch to track delayed orphan scan timer statistics. Modifies ocfs2_osb_dump to print the following: Orphan Scan=> Local: 10 Global: 21 Last Scan: 67 seconds ago Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:14:31 -07:00
Srinivas Eeda	83273932fb	ocfs2: timer to queue scan of all orphan slots When a dentry is unlinked, the unlinking node takes an EX on the dentry lock before moving the dentry to the orphan directory. Other nodes that have this dentry in cache have a PR on the same dentry lock. When the EX is requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED during downconvert. The inode is finally deleted when the last node to iput the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set. A problem arises if a node is forced to free dentry locks because of memory pressure. If this happens, the node will no longer get downconvert notifications for the dentries that have been unlinked on another node. If it also happens that node is actively using the corresponding inode and happens to be the one performing the last iput on that inode, it will fail to delete the inode as it will not have the MAYBE_ORPHANED flag set. This patch fixes this shortcoming by introducing a periodic scan of the orphan directories to delete such inodes. Care has been taken to distribute the workload across the cluster so that no one node has to perform the task all the time. Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:14:31 -07:00
Jan Kara	edd45c0849	ocfs2: Correct ordering of ip_alloc_sem and localloc locks for directories We use ordering ip_alloc_sem -> local alloc locks in ocfs2_write_begin(). So change lock ordering in ocfs2_extend_dir() and ocfs2_expand_inline_dir() to also use this lock ordering. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:14:30 -07:00
Jan Kara	80d73f15d1	ocfs2: Fix possible deadlock in quota recovery In ocfs2_finish_quota_recovery() we acquired global quota file lock and started recovering local quota file. During this process we need to get quota structures, which calls ocfs2_dquot_acquire() which gets global quota file lock again. This second lock can block in case some other node has requested the quota file lock in the mean time. Fix the problem by moving quota file locking down into the function where it is really needed. Then dqget() or dqput() won't be called with the lock held. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:14:30 -07:00
Jan Kara	65bac575e3	ocfs2: Fix possible deadlock with quotas in ocfs2_setattr() We called vfs_dq_transfer() with global quota file lock held. This can lead to deadlocks as if vfs_dq_transfer() has to allocate new quota structure, it calls ocfs2_dquot_acquire() which tries to get quota file lock again and this can block if another node requested the lock in the mean time. Since we have to call vfs_dq_transfer() with transaction already started and quota file lock ranks above the transaction start, we cannot just rely on ocfs2_dquot_acquire() or ocfs2_dquot_release() on getting the lock if they need it. We fix the problem by acquiring pointers to all quota structures needed by vfs_dq_transfer() already before calling the function. By this we are sure that all quota structures are properly allocated and they can be freed only after we drop references to them. Thus we don't need quota file lock anywhere inside vfs_dq_transfer(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:14:29 -07:00
Jan Kara	b4c30de39a	ocfs2: Fix lock inversion in ocfs2_local_read_info() This function is called with dqio_mutex held but it has to acquire lock from global quota file which ranks above this lock. This is not deadlockable lock inversion since this code path is take only during mount when noone else can race with us but let's clean this up to silence lockdep. We just drop the dqio_mutex in the beginning of the function and reacquire it in the end since we don't need it - noone can race with us at this moment. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:14:29 -07:00
Jan Kara	4e8a301929	ocfs2: Fix possible deadlock in ocfs2_global_read_dquot() It is not possible to get a read lock and then try to get the same write lock in one thread as that can block on downconvert being requested by other node leading to deadlock. So first drop the quota lock for reading and only after that get it for writing. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-06-03 19:14:28 -07:00
Andreas Dilger	0b8e58a140	ext4: super.c whitespace cleanup Cleanup of whitespace and formatting. Initially driven by confusing indents for the ext4_{block,inode}_bitmap() et. al. helper routines, but figured I'd cleanup some other 80-column wrapping and other indenting problems at the same time. Signed-off-by: Andreas Dilger <adilger@sun.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-06-03 17:59:28 -04:00
Alberto Bertogli	bfcd3555af	jbd2: Fix minor typos in comments in fs/jbd2/journal.c Signed-off-by: Alberto Bertogli <albertito@blitiri.com.ar> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2009-06-09 00:06:20 -04:00
Denis Karpov	85c7859190	FAT: add 'errors' mount option On severe errors FAT remounts itself in read-only mode. Allow to specify FAT fs desired behavior through 'errors' mount option: panic, continue or remount read-only. `mount -t [fat\|vfat] -o errors=[panic,remount-ro,continue] \ <bdev> <mount point>` This is analog to ext2 fs 'errors' mount option. Signed-off-by: Denis Karpov <ext-denis.2.karpov@nokia.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>	2009-06-04 02:34:51 +09:00
Steven Whitehouse	e09f9446b9	GFS2: Remove unused variable Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-06-03 10:07:44 +01:00
Linus Torvalds	4157fd85fc	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: prevent deadlock in xfs_qm_shake() xfs: fix overflow in xfs_growfs_data_private xfs: fix double unlock in xfs_swap_extents()	2009-06-02 09:47:21 -07:00
Jeff Layton	50b64e3b77	cifs: fix IPv6 address length check For IPv6 the userspace mount helper sends an address in the "ip=" option. This check fails if the length is > 35 characters. I have no idea where the magic 35 character limit came from, but it's clearly not enough for IPv6. Fix it by making it use the INET6_ADDRSTRLEN #define. While we're at it, use the same #define for the address length in SPNEGO upcalls. Reported-by: Charles R. Anderson <cra@wpi.edu> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-06-02 15:45:40 +00:00
Artem Bityutskiy	8379ea31e9	UBIFS: allow sync option in rootflags When passing UBIFS parameters via kernel command line, the sync option will be passed to UBIFS as a string, not as an MS_SYNCHRONOUS flag. Teach UBIFS interpreting this flag. Reported-by: Aurélien GÉRÔME <ag@debian.org> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-06-02 11:08:07 +03:00
Abhijith Das	a12af1ebe6	GFS2: smbd proccess hangs with flock() call. GFS2 currently does not support mandatory flocks. An flock() call with LOCK_MAND triggers unexpected behavior because gfs2 is not checking for this lock type. This patch corrects that. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-06-02 08:01:12 +01:00
Felix Blyakher	1b17d76646	xfs: prevent deadlock in xfs_qm_shake() It's possible to recurse into filesystem from the memory allocation, which deadlocks in xfs_qm_shake(). Add check for __GFP_FS, and bail out if it is not set. Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Hedi Berriche <hedi@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-01 22:59:45 -05:00
Eric Sandeen	e6da7c9fed	xfs: fix overflow in xfs_growfs_data_private In the case where growing a filesystem would leave the last AG too small, the fixup code has an overflow in the calculation of the new size with one fewer ag, because "nagcount" is a 32 bit number. If the new filesystem has > 2^32 blocks in it this causes a problem resulting in an EINVAL return from growfs: # xfs_io -f -c "truncate 19998630180864" fsfile # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile # mount -o loop fsfile /mnt # xfs_growfs /mnt meta-data=/dev/loop0 isize=256 agcount=52, agsize=76288719 blks = sectsz=512 attr=2 data = bsize=4096 blocks=3905982455, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal bsize=4096 blocks=32768, version=2 = sectsz=512 sunit=0 blks, lazy-count=0 realtime =none extsz=4096 blocks=0, rtextents=0 xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument Reported-by: richard.ems@cape-horn-eng.com Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-01 22:59:38 -05:00
Felix Blyakher	1f23920dbf	xfs: fix double unlock in xfs_swap_extents() Regreesion from commit `ef8f7fc`, which rearranged the code in xfs_swap_extents() leading to double unlock of xfs inode ilock. That resulted in xfs_fsr deadlocking itself on platforms, which don't handle double unlock of rw_semaphore nicely. It caused the count go negative, which represents the write holder, without really having one. ia64 is one of the platforms where deadlock was easily reproduced and the fix was tested. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-01 22:59:29 -05:00
Felix Blyakher	4156e735d3	xfs: prevent deadlock in xfs_qm_shake() It's possible to recurse into filesystem from the memory allocation, which deadlocks in xfs_qm_shake(). Add check for __GFP_FS, and bail out if it is not set. Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Hedi Berriche <hedi@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-06-01 13:13:24 -05:00
Ingo Molnar	23db9f430b	Merge branch 'linus' into perfcounters/core Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6 based - to pick up the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 10:01:39 +02:00
Linus Torvalds	b4566ac524	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix bh leak in nilfs_cpfile_delete_checkpoints function	2009-05-30 08:04:15 -07:00
Ryusuke Konishi	62013ab5d5	nilfs2: fix bh leak in nilfs_cpfile_delete_checkpoints function The nilfs_cpfile_delete_checkpoints() wrongly skips brelse() for the header block of checkpoint file in case of errors. This fixes the leak bug. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-05-30 22:07:50 +09:00
Linus Torvalds	3218911f83	Merge git://git.infradead.org/~dwmw2/mtd-2.6.30 * git://git.infradead.org/~dwmw2/mtd-2.6.30: jffs2: Fix corruption when flash erase/write failure mtd: MXC NAND driver fixes (v5)	2009-05-29 08:52:13 -07:00
Linus Torvalds	deeb103412	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: Driver Core: do not oops when driver_unregister() is called for unregistered drivers sysfs: file.c: use create_singlethread_workqueue()	2009-05-29 08:49:52 -07:00
Linus Torvalds	c8bce3d3bd	Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux * 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: svcrdma: dma unmap the correct length for the RPCRDMA header page. nfsd: Revert "svcrpc: take advantage of tcp autotuning" nfsd: fix hung up of nfs client while sync write data to nfs server	2009-05-29 08:49:09 -07:00
Oskar Schirmer	c3dc5bec05	flat: fix data sections alignment The flat loader uses an architecture's flat_stack_align() to align the stack but assumes word-alignment is enough for the data sections. However, on the Xtensa S6000 we have registers up to 128bit width which can be used from userspace and therefor need userspace stack and data-section alignment of at least this size. This patch drops flat_stack_align() and uses the same alignment that is required for slab caches, ARCH_SLAB_MINALIGN, or wordsize if it's not defined by the architecture. It also fixes m32r which was obviously kaput, aligning an uninitialized stack entry instead of the stack pointer. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Oskar Schirmer <os@emlix.com> Cc: David Howells <dhowells@redhat.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Bryan Wu <cooloney@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Johannes Weiner <jw@emlix.com> Acked-by: Mike Frysinger <vapier.adi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-29 08:40:02 -07:00
KOSAKI Motohiro	bd6daba909	procfs: make errno values consistent when open pident vs exit(2) race occurs proc_pident_instantiate() has following call flow. proc_pident_lookup() proc_pident_instantiate() proc_pid_make_inode() And, proc_pident_lookup() has following error handling. const struct pid_entry p, last; error = ERR_PTR(-ENOENT); if (!task) goto out_no_task; Then, proc_pident_instantiate should return ENOENT too when racing against exit(2) occur. EINAL has two bad reason. - it implies caller is wrong. bad the race isn't caller's mistake. - man 2 open don't explain EINVAL. user often don't handle it. Note: Other proc_pid_make_inode() caller already use ENOENT properly. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-29 08:40:02 -07:00
Artem Bityutskiy	428ff9d2e3	UBIFS: remove dead code UBIFS assumes that @c->min_io_size is 8 in case of NOR flash. This is because UBIFS alignes all nodes to 8-byte boundary, and maintaining @c->min_io_size introduced unnecessary complications. This patch removes senseless constructs like: if (c->min_io_size == 1) NOR-specific code Also, few commentaries amendments. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-05-29 14:38:37 +03:00
Joakim Tjernlund	81e2962801	jffs2: Fix corruption when flash erase/write failure Erase errors such as: "Newly-erased block contained word 0xa4ef223e at offset 0x0296a014" and failure to write the clean marker, moves the offending erase block to erasing list before calling jffs2_erase_failed(). This is bad as jffs2_erase_failed() will also move the block to the bad_list, but is now moving the wrong block, causing FS corruption. Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-05-29 10:44:46 +01:00
Andrew Morton	086a377edc	sysfs: file.c: use create_singlethread_workqueue() We don't need a kernel thread per CPU for this application. Acked-by: Alex Chiang <achiang@hp.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-05-28 14:24:07 -07:00
Christoph Hellwig	b96d31a62f	cifs: clean up set_cifs_acl interfaces Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-28 18:41:56 +00:00
Christoph Hellwig	1bf4072da6	cifs: reorganize get_cifs_acl Thus spake Christoph: "But this whole set_cifs_acl function is a real mess anyway and needs some splitting up." With this change too, it's possible to call acl_to_uid_mode() with a NULL inode pointer. That (or something close to it) will eventually be necessary when cifs_get_inode_info is reorganized. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-28 17:08:02 +00:00
Steve French	c5077ec423	[CIFS] Update readme to indicate change to default mount (serverino) Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-28 15:09:04 +00:00
Jeff Layton	a0c9217f64	cifs: make serverino the default when mounting Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-28 15:04:17 +00:00
Jeff Layton	bd433d4cf4	cifs: rename cifs_iget to cifs_root_iget The current cifs_iget isn't suitable for anything but the root inode. Rename it with a more appropriate name. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-28 14:57:29 +00:00
Jeff Layton	c4a2c08db7	cifs: make cnvrtDosUnixTm take a little-endian args and an offset The callers primarily end up converting the args from le anyway. Also, most of the callers end up needing to add an offset to the result. The exception to these rules is cnvrtDosCifsTm, but there are no callers of that function, so we might as well remove it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-28 14:57:20 +00:00
Jeff Layton	07119a4df8	cifs: have cifs_NTtimeToUnix take a little-endian arg ...and just have the function call le64_to_cpu. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-28 14:32:31 +00:00
Mimi Zohar	14dba5331b	integrity: nfsd imbalance bug fix An nfsd exported file is opened/closed by the kernel causing the integrity imbalance message. Before a file is opened, there normally is permission checking, which is done in inode_permission(). However, as integrity checking requires a dentry and mount point, which is not available in inode_permission(), the integrity (permission) checking must be called separately. In order to detect any missing integrity checking calls, we keep track of file open/closes. ima_path_check() increments these counts and does the integrity (permission) checking. As a result, the number of calls to ima_path_check()/ima_file_free() should be balanced. An extra call to fput(), indicates the file could have been accessed without first calling ima_path_check(). In nfsv3 permission checking is done once, followed by multiple reads, which do an open/close for each read. The integrity (permission) checking call should be in nfsd_permission() after the inode_permission() call, but as there is no correlation between the number of permission checking and open calls, the integrity checking call should not increment the counters, but defer it to when the file is actually opened. This patch adds: - integrity (permission) checking for nfsd exported files in nfsd_permission(). - a call to increment counts for files opened by nfsd. This patch has been updated to return the nfs error types. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-05-28 09:32:43 +10:00
Wei Yongjun	a0d24b295a	nfsd: fix hung up of nfs client while sync write data to nfs server Commit 'Short write in nfsd becomes a full write to the client' (`31dec2538e`) broken the sync write. With the following commands to reproduce: $ mount -t nfs -o sync 192.168.0.21:/nfsroot /mnt $ cd /mnt $ echo aaaa > temp.txt Then nfs client is hung up. In SYNC mode the server alaways return the write count 0 to the client. This is because the value of host_err in nfsd_vfs_write() will be overwrite in SYNC mode by 'host_err=nfsd_sync(file);', and then we return host_err(which is now 0) as write count. This patch fixed the problem. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-05-27 17:40:06 -04:00
David Howells	911e690e70	CacheFiles: Fixup renamed filenames in comments in internal.h Fix up renamed filenames in comments in fs/cachefiles/internal.h. Originally, the files were all called cf-xxx.c, but they got renamed to just xxx.c. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-27 10:20:13 -07:00
David Howells	348ca1029e	FS-Cache: Fixup renamed filenames in comments in internal.h Fix up renamed filenames in comments in fs/fscache/internal.h. Originally, the files were all called fsc-xxx.c, but they got renamed to just xxx.c. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-27 10:20:13 -07:00
Eric Sandeen	096324873f	xfs: fix overflow in xfs_growfs_data_private In the case where growing a filesystem would leave the last AG too small, the fixup code has an overflow in the calculation of the new size with one fewer ag, because "nagcount" is a 32 bit number. If the new filesystem has > 2^32 blocks in it this causes a problem resulting in an EINVAL return from growfs: # xfs_io -f -c "truncate 19998630180864" fsfile # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile # mount -o loop fsfile /mnt # xfs_growfs /mnt meta-data=/dev/loop0 isize=256 agcount=52, agsize=76288719 blks = sectsz=512 attr=2 data = bsize=4096 blocks=3905982455, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal bsize=4096 blocks=32768, version=2 = sectsz=512 sunit=0 blks, lazy-count=0 realtime =none extsz=4096 blocks=0, rtextents=0 xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument Reported-by: richard.ems@cape-horn-eng.com Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-05-26 17:46:37 -05:00
Jeff Layton	f55ed1a83d	cifs: tighten up default file_mode/dir_mode The current default file mode is 02767 and dir mode is 0777. This is extremely "loose". Given that CIFS is a single-user protocol, these permissions allow anyone to use the mount -- in effect, giving anyone on the machine access to the credentials used to mount the share. Change this by making the default permissions restrict write access to the default owner of the mount. Give read and execute permissions to everyone else. These are the same permissions that VFAT mounts get by default so there is some precedent here. Note that this patch also removes the mandatory locking flags from the default file_mode. After having looked at how these flags are used by the kernel, I don't think that keeping them as the default offers any real benefit. That flag combination makes it so that the kernel enforces mandatory locking. Since the server is going to do that for us anyway, I don't think we want the client to enforce this by default on applications that just want advisory locks. Anyone that does want this behavior can always enable it by setting the file_mode appropriately. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-26 21:10:55 +00:00
Jeff Layton	46a7574caf	cifs: fix artificial limit on reading symlinks There's no reason to limit the size of a symlink that we can read to 4000 bytes. That may be nowhere near PATH_MAX if the server is sending UCS2 strings. CIFS should be able to read in a symlink up to the size of the buffer. The size of the header has already been accounted for when creating the slabcache, so CIFSMaxBufSize should be the correct size to pass in. Fixes samba bug #6384. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-26 21:09:14 +00:00
Trond Myklebust	95baa25c73	NFSv4: Fix the case where NFSv4 renewal fails If the asynchronous lease renewal fails (usually due to a soft timeout), then we _must_ schedule state recovery in order to ensure that we don't lose the lease unnecessarily or, if the lease is already lost, that we recover the locking state promptly... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-05-26 14:51:00 -04:00
Sam Ravnborg	d0367a508a	nfs: fix build error in nfsroot with initconst fix build error with latest kbuild adjustments to initconst. The commit `a447c09324` ("vfs: Use const for kernel parser table") changed: static match_table_t __initdata tokens = { to static match_table_t __initconst tokens = { But the missing const causes popwerpc to fail with latest updates to __initconst like this: fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict The bug is only present with kbuild-next. Following patch has been build tested. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-05-26 14:51:00 -04:00
Steven Whitehouse	f6eb53498e	GFS2: Remove args subdir from gfs2 sysfs files Since we can cat /proc/mounts there is no need to have this subdirectory in the gfs2 sysfs files. In fact this does not reflect the full range of possible mount argumenmts, where as /proc/mounts does. There was only one userland user of this set of sysfs files and it will function perfectly well without these files being present (in fact that subcommand of gfs2_tool is obsolete anyway). The tune/* subdirectory is also considered mostly obsolete, but there are a few uses of this until mount arguments can be added for the last few functions for which there are no equivalents currently. However the tune/* directory is still in my sights and new code should avoid using it. Only the gfs2_quota and gfs2_tool programs are know to use tune/* at the moment. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-26 15:50:25 +01:00
Steven Whitehouse	e1b28aab58	GFS2: Remove lockstruct subdir from gfs2 sysfs files The lockstruct sub directory contained two entries, both of which are duplicated elsewhere in the gfs2 sysfs files as well as being available via /proc/mounts. There is no userland program using either of them, so this patch removes them. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-26 15:41:27 +01:00
Artem Bityutskiy	7c83f5cb55	UBIFS: use anonymous device UBIFS has erroneuosly set 'sb->s_dev' to the UBI volume character device major/minor. This may lead to clashes if there is another FS mounted to a block device with the same major/minor numbers. User-space programs which use 'stat->st_dev' may get confused because of this. This problem was found by Al Viro. He also pointed the way to fix the problem - use 'set_anon_super()' and 'kill_anon_super()' VFS helpers. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-05-26 12:37:11 +03:00
Theodore Ts'o	88b6edd17c	ext4: Clean up calls to ext4_get_group_desc() If the caller isn't planning on modifying the block group descriptors, there's no need to pass in a pointer to a struct buffer_head. Nuking this saves a tiny amount of CPU time and stack space usage. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-25 11:50:39 -04:00
Theodore Ts'o	759d427aa5	ext4: remove unused function __ext4_write_dirty_metadata The __ext4_write_dirty_metadata() function was introduced by commit `0390131b`, "ext4: Allow ext4 to run without a journal", but nothing ever used the function, either then or since. So let's remove it and save a bit of space. Cc: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-25 11:51:00 -04:00
Corentin Chary	8eec2f36fb	UBIFS: return proper error code if the compr is not present If the compressor is not present, mount_ubifs need to return an error code. This way ubifs_fill_super will stop and handle the error. Signed-off-by: Corentin Chary <corentincj@iksaif.net> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-05-25 12:28:27 +03:00
Dave Kleikamp	79f52b77b8	jfs: Add missing mutex_unlock call to error path Jan Kucera found an missing call to mutex_unlock() with his static code checker. It's an unlikely error path to hit in the real world, but it should be fixed. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Reported-by: Jan Kucera <kucera.jan.cz@gmail.com>	2009-05-23 20:28:41 -05:00
Linus Torvalds	3eb9c8be0c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] Avoid open on possible directories since Samba now rejects them	2009-05-23 13:42:53 -07:00
Steve French	8db14ca125	[CIFS] Avoid open on possible directories since Samba now rejects them Small change (mostly formatting) to limit lookup based open calls to file create only. After discussion yesteday on samba-technical about the posix lookup regression, and looking at a problem with cifs posix open to one particular Samba version, Jeff and JRA realized that Samba server's behavior changed in this area (posix open behavior on files vs. directories). To make this behavior consistent, JRA just made a fix to Samba server to alter how it handles open of directories (now returning the equivalent of EISDIR instead of success). Since we don't know at lookup time whether the inode is a directory or file (and thus whether posix open will succeed with most current Samba server), this change avoids the posix open code on lookup open (just issues posix open on creates). This gets the semantic benefits we want (atomicity, posix byte range locks, improved write semantics on newly created files) and file create still is fast, and we avoid the problem that Jeff noticed yesterday with "openat" (and some open directory calls) of non-cached directories to one version of Samba server, and will work with future Samba versions (which include the fix jra just pushed into Samba server). I confirmed this approach with jra yesterday and with Shirish today. Posix open is only called (at lookup time) for file create now. For opens (rather than creates), because we do not know if it is a file or directory yet, and current Samba no longer allows us to do posix open on dirs, we could end up wasting an open call on what turns out to be a dir. For file opens, we wait to call posix open till cifs_open. It could be added here (lookup) in the future but the performance tradeoff of the extra network request when EISDIR or EACCES is returned would have to be weighed against the 50% reduction in network traffic in the other paths. Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> Tested-by: Jeff Layton <jlayton@redhat.com> CC: Jeremy Allison <jra@samba.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-23 18:57:25 +00:00
Martin K. Petersen	c72758f337	block: Export I/O topology for block devices and partitions To support devices with physical block sizes bigger than 512 bytes we need to ensure proper alignment. This patch adds support for exposing I/O topology characteristics as devices are stacked. logical_block_size is the smallest unit the device can address. physical_block_size indicates the smallest I/O the device can write without incurring a read-modify-write penalty. The io_min parameter is the smallest preferred I/O size reported by the device. In many cases this is the same as the physical block size. However, the io_min parameter can be scaled up when stacking (RAID5 chunk size > physical block size). The io_opt characteristic indicates the optimal I/O size reported by the device. This is usually the stripe width for arrays. The alignment_offset parameter indicates the number of bytes the start of the device/partition is offset from the device's natural alignment. Partition tools and MD/DM utilities can use this to pad their offsets so filesystems start on proper boundaries. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:55 +02:00
Martin K. Petersen	ae03bf639a	block: Use accessor functions for queue limits Convert all external users of queue limits to using wrapper functions instead of poking the request queue variables directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:54 +02:00
Martin K. Petersen	e1defc4ff0	block: Do away with the notion of hardsect_size Until now we have had a 1:1 mapping between storage device physical block size and the logical block sized used when addressing the device. With SATA 4KB drives coming out that will no longer be the case. The sector size will be 4KB but the logical block size will remain 512-bytes. Hence we need to distinguish between the physical block size and the logical ditto. This patch renames hardsect_size to logical_block_size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:54 +02:00
Jens Axboe	9bd7de51ee	Merge branch 'master' into for-2.6.31 Conflicts: drivers/ide/ide-io.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 20:28:35 +02:00
Jens Axboe	e4b636366c	Merge branch 'master' into for-2.6.31 Conflicts: drivers/block/hd.c drivers/block/mg_disk.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 20:25:34 +02:00
Linus Torvalds	6a44587ee7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix memory leak in nilfs_ioctl_clean_segments	2009-05-22 08:41:13 -07:00
Ryusuke Konishi	d504685363	nilfs2: fix memory leak in nilfs_ioctl_clean_segments This fixes a new memory leak problem in garbage collection. The problem was brought by the bugfix patch ("nilfs2: fix lock order reversal in nilfs_clean_segments ioctl"). Thanks to Kentaro Suzuki for finding this problem. Reported-by: Kentaro Suzuki <k_suzuki@ms.sylc.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-05-22 20:49:04 +09:00
Steven Whitehouse	87ec217411	GFS2: Move gfs2_unlink_ok into ops_inode.c Another function which is only called from one ops_inode.c so we move it and make it static. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-22 10:54:50 +01:00
Steven Whitehouse	536baf02f6	GFS2: Move gfs2_readlinki into ops_inode.c Move gfs2_readlinki into ops_inode.c and make it static Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-22 10:48:59 +01:00
Steven Whitehouse	2286dbfad1	GFS2: Move gfs2_rmdiri into ops_inode.c Move gfs2_rmdiri() into ops_inode.c and make it static. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-22 10:45:09 +01:00
Steven Whitehouse	9e6e0a128b	GFS2: Merge mount.c and ops_super.c into super.c mount.c only contained a single function, so is not really worth retaining on its own. All of the super related code is now either in super.c or ops_fstype.c Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-22 10:36:01 +01:00
Steven Whitehouse	b1e71b0622	GFS2: Clean up some file names This patch renames the ops_*.c files which have no counterpart without the ops_ prefix in order to shorten the name and make it more readable. In addition, ops_address.h (which was very small) is moved into inode.h and inode.h is cleaned up by adding extern where required. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-22 10:01:55 +01:00
James Morris	2c9e703c61	Merge branch 'master' into next Conflicts: fs/exec.c Removed IMA changes (the IMA checks are now performed via may_open()). Signed-off-by: James Morris <jmorris@namei.org>	2009-05-22 18:40:59 +10:00
Mimi Zohar	c9d9ac525a	integrity: move ima_counts_get Based on discussion on lkml (Andrew Morton and Eric Paris), move ima_counts_get down a layer into shmem/hugetlb__file_setup(). Resolves drm shmem_file_setup() usage case as well. HD comment: I still think you're doing this at the wrong level, but recognize that you probably won't be persuaded until a few more users of alloc_file() emerge, all wanting your ima_counts_get(). Resolving GEM's shmem_file_setup() is an improvement, so I'll say Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-05-22 09:45:33 +10:00
Mimi Zohar	b9fc745db8	integrity: path_check update - Add support in ima_path_check() for integrity checking without incrementing the counts. (Required for nfsd.) - rename and export opencount_get to ima_counts_get - replace ima_shm_check calls with ima_counts_get - export ima_path_check Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-05-22 09:43:41 +10:00
Steve French	703a3b8e5c	[CIFS] fix posix open regression Posix open code was not properly adding the file to the list of open files. Fix allocating cifsFileInfo more than once, and adding twice to flist and tlist. Also fix mode setting to be done in one place in these paths. Signed-off-by: Steve French <sfrench@us.ibm.com> Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> Tested-by: Jeff Layton <jlayton@redhat.com> Tested-by: Luca Tettamanti <kronos.it@gmail.com>	2009-05-21 22:38:08 +00:00
Steven Whitehouse	1ce97e564b	GFS2: Be more aggressive in reclaiming unlinked inodes This patch increases the frequency with which gfs2 looks for unlinked, but still allocated inodes. Its the equivalent operation to ext3's orphan list, but done with bitmaps in the resource groups. This also fixes a bug where a field in the rgrp was too small. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-21 15:18:19 +01:00
Steven Whitehouse	60a0b8f936	GFS2: Add a rgrp bitmap full flag During block allocation, it is useful to know if sections of disk are full on a finer grained basis than a single resource group. This can make a performance difference when resource groups have larger numbers of bitmap blocks, since we no longer have to search them all block by block in each individual bitmap. The full flag is set on a per-bitmap basis when it has been searched and found to have no free space. It is then skipped in subsequent searches until the flag is reset. The resetting occurs if we have to drop the glock on the resource group for any reason, or if we deallocate some blocks within that resource group and thus free up some space. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-21 12:23:12 +01:00
Linus Torvalds	929a8651f4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix pointer initialization and checks in cifs_follow_symlink (try #4)	2009-05-20 08:36:53 -07:00
Steven Whitehouse	0901097834	GFS2: Improve resource group error handling This patch improves the error handling in the case where we discover that the summary information in the resource group doesn't match the bitmap information while in the process of allocating blocks. Originally this resulted in a kernel bug, but this patch changes that so that we return -EIO and print some messages explaining what went wrong, and how to fix it. We also remember locally not to try and allocate from the same rgrp again, so that a subsequent allocation in a different rgrp should succeed. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-20 10:48:47 +01:00
Jeff Layton	8b6427a2a8	cifs: fix pointer initialization and checks in cifs_follow_symlink (try #4 ) This is the third respin of the patch posted yesterday to fix the error handling in cifs_follow_symlink. It also includes a fix for a bogus NULL pointer check in CIFSSMBQueryUnixSymLink that Jeff Moyer spotted. It's possible for CIFSSMBQueryUnixSymLink to return without setting target_path to a valid pointer. If that happens then the current value to which we're initializing this pointer could cause an oops when it's kfree'd. This patch is a little more comprehensive than the last patches. It reorganizes cifs_follow_link a bit for (hopefully) better readability. It should also eliminate the uneeded allocation of full_path on servers without unix extensions (assuming they can get to this point anyway, of which I'm not convinced). On a side note, I'm not sure I agree with the logic of enabling this query even when unix extensions are disabled on the client. It seems like that should disable this as well. But, changing that is outside the scope of this fix, so I've left it alone for now. Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@inraded.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-19 15:31:20 +00:00
Steven Whitehouse	ef9e8b14a5	GFS2: Don't warn when delete inode fails on ro filesystem If the filesystem is read-only, then we expect that delete inode will fail, so there is no need to warn about it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-19 14:25:16 +01:00
Miklos Szeredi	b2858d7d16	splice: fix kmaps in default_file_splice_write() Unfortunately multiple kmap() within a single thread are deadlockable, so writing out multiple buffers with writev() isn't possible. Change the implementation so that it does a separate write() for each buffer. This actually simplifies the code a lot since the splice_from_pipe() helper can be used. This limitation is caused by HIGHMEM pages, and so only affects a subset of architectures and configurations. In the future it may be worth to implement default_file_splice_write() in a more efficient way on configs that allow it. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-19 11:37:46 +02:00
Tejun Heo	4fc981ef9e	bio: always copy back data for copied kernel requests When a read bio_copy_kern() request fails, the content of the bounce buffer is not copied back. However, as request failure doesn't necessarily mean complete failure, the buffer state can be useful. This behavior is also inconsistent with the user map counterpart and causes the subtle difference between bounced and unbounced IO causes confusion. This patch makes bio_copy_kern_endio() ignore @err and always copy back data on request completion. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-19 11:36:08 +02:00
Steven Whitehouse	fe64d517df	GFS2: Umount recovery race fix This patch fixes a race condition where we can receive recovery requests part way through processing a umount. This was causing problems since the recovery thread had already gone away. Looking in more detail at the recovery code, it was really trying to implement a slight variation on a work queue, and that happens to align nicely with the recently introduced slow-work subsystem. As a result I've updated the code to use slow-work, rather than its own home grown variety of work queue. When using the wait_on_bit() function, I noticed that the wait function that was supplied as an argument was appearing in the WCHAN field, so I've updated the function names in order to produce more meaningful output. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-19 10:01:18 +01:00
Hunter Adrian	8b3884a841	UBIFS: return error if link and unlink race Consider a scenario when 'vfs_link(dirA/fileA)' and 'vfs_unlink(dirA/fileA, dirB/fileB)' race. 'vfs_link()' does not lock 'dirA->i_mutex', so this is possible. Both of the functions lock 'fileA->i_mutex' though. Suppose 'vfs_unlink()' wins, and takes 'fileA->i_mutex' mutex first. Suppose 'fileA->i_nlink' is 1. In this case 'ubifs_unlink()' will drop the last reference, and put 'inodeA' to the list of orphans. After this, 'vfs_link()' will link 'dirB/fileB' to 'inodeA'. Thir is a problem because, for example, the subsequent 'vfs_unlink(dirB/fileB)' will add the same inode to the list of orphans. This problem was reported by J. R. Okajima <hooanon05@yahoo.co.jp> [Artem: add more comments, amended commit message] Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-05-19 11:01:31 +03:00
Frank Filz	7ee2cb7f32	nfs: Fix NFS v4 client handling of MAY_EXEC in nfs_permission. The problem is that permission checking is skipped if atomic open is possible, but when exec opens a file, it just opens it O_READONLY which means EXEC permission will not be checked at that time. This problem is observed by the following sequence (executed as root): mount -t nfs4 server:/ /mnt4 echo "ls" >/mnt4/foo chmod 744 /mnt4/foo su guest -c "mnt4/foo" Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org Tested-by: Eugene Teo <eugeneteo@kernel.sg> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-18 20:11:12 -07:00
Ingo Molnar	dc3f81b129	Merge commit 'v2.6.30-rc6' into perfcounters/core Merge reason: this branch was on an -rc4 base, merge it up to -rc6 to get the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 07:37:49 +02:00
Manish Katiyar	0f7ee7c172	ext2: Fix memory leak in ext2_fill_super() in case of a failed mount Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-17 23:52:51 -04:00
Manish Katiyar	de5ce03730	ext3: Fix memory leak in ext3_fill_super() in case of a failed mount Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-17 23:52:47 -04:00
Manish Katiyar	f68301656b	ext4: Fix memory leak in ext4_fill_super() in case of a failed mount Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-17 23:52:44 -04:00
Theodore Ts'o	0568c51893	ext4: down i_data_sem only for read when walking tree for fiemap Not sure why I put this in as down_write originally; all we are doing is walking the tree, nothing will change under us and concurrent reads should be no problem. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-17 23:31:23 -04:00
Theodore Ts'o	6fd058f779	ext4: Add a comprehensive block validity check to ext4_get_blocks() To catch filesystem bugs or corruption which could lead to the filesystem getting severly damaged, this patch adds a facility for tracking all of the filesystem metadata blocks by contiguous regions in a red-black tree. This allows quick searching of the tree to locate extents which might overlap with filesystem metadata blocks. This facility is also used by the multi-block allocator to assure that it is not allocating blocks out of the system zone, as well as by the routines used when reading indirect blocks and extents information from disk to make sure their contents are valid. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-17 15:38:01 -04:00
Jeff Mahoney	b83674c0da	reiserfs: fixup perms when xattrs are disabled This adds CONFIG_REISERFS_FS_XATTR protection from reiserfs_permission. This is needed to avoid warnings during file deletions and chowns with xattrs disabled. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-17 11:45:45 -07:00
Jeff Mahoney	ceb5edc457	reiserfs: deal with NULL xattr root w/ xattrs disabled This avoids an Oops in open_xa_root that can occur when deleting a file with xattrs disabled. It assumes that the xattr root will be there, and that is not guaranteed. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-17 11:45:45 -07:00
Jeff Mahoney	12abb35a03	reiserfs: clean up ifdefs With xattr cleanup even with xattrs disabled, much of the initial setup is still performed. Some #ifdefs are just not needed since the options they protect wouldn't be available anyway. This cleans those up. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-17 11:45:45 -07:00
David Teigland	748285ccf7	dlm: use more NOFS allocation Change some GFP_KERNEL allocations to use either GFP_NOFS or ls_allocation (when available) which the fs sets to GFP_NOFS. The point is to prevent allocations from going back into the cluster fs in places where that might lead to deadlock. Signed-off-by: David Teigland <teigland@redhat.com>	2009-05-15 11:24:59 -05:00
Linus Torvalds	5d41343ac8	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Fix race in ext4_inode_info.i_cached_extent ext4: Clear the unwritten buffer_head flag after the extent is initialized ext4: Use a fake block number for delayed new buffer_head ext4: Fix sub-block zeroing for writes into preallocated extents	2009-05-15 08:07:25 -07:00
Sukadev Bhattiprolu	1f71ebedb3	devpts: correctly set default options devpts_get_sb() calls memset(0) to clear mount options and calls parse_mount_options() if user specified any mount options. The memset(0) is bogus since the 'mode' and 'ptmxmode' options are non-zero by default. parse_mount_options() restores options to default anyway and can properly deal with NULL mount options. So in devpts_get_sb() remove memset(0) and call parse_mount_options() even for NULL mount options. Bug reported by Eric Paris: http://lkml.org/lkml/2009/5/7/448. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Tested-by: Marc Dionne <marc.c.dionne@gmail.com> Reported-by: Eric Paris <eparis@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Reviewed-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-15 08:03:23 -07:00
Christine Caulfield	391fbdc5d5	dlm: connect to nodes earlier Make network connections to other nodes earlier, in the context of dlm_recoverd. This avoids connecting to nodes from dlm_send where we try to avoid allocations which could possibly deadlock if memory reclaim goes into the cluster fs which may try to do a dlm operation. Signed-off-by: Christine Caulfield <ccaulfie@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-05-15 09:34:12 -05:00
Thomas Gleixner	2d02494f5a	sched, timers: cleanup avenrun users avenrun is an rough estimate so we don't have to worry about consistency of the three avenrun values. Remove the xtime lock dependency and provide a function to scale the values. Cleanup the users. [ Impact: cleanup ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org>	2009-05-15 15:32:45 +02:00
Theodore Ts'o	2ec0ae3ace	ext4: Fix race in ext4_inode_info.i_cached_extent If two CPU's simultaneously call ext4_ext_get_blocks() at the same time, there is nothing protecting the i_cached_extent structure from being used and updated at the same time. This could potentially cause the wrong location on disk to be read or written to, including potentially causing the corruption of the block group descriptors and/or inode table. This bug has been in the ext4 code since almost the very beginning of ext4's development. Fortunately once the data is stored in the page cache cache, ext4_get_blocks() doesn't need to be called, so trying to replicate this problem to the point where we could identify its root cause was extremely difficult. Many thanks to Kevin Shanahan for working over several months to be able to reproduce this easily so we could finally nail down the cause of the corruption. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>	2009-05-15 09:07:28 -04:00
Linus Torvalds	bd67ce0f66	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix error handling in parse_DFS_referrals	2009-05-14 19:20:04 -07:00
Linus Torvalds	5732c46849	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: Spelling fix in btrfs_lookup_first_block_group comments Btrfs: make show_options result match actual option names Btrfs: remove outdated comment in btrfs_ioctl_resize() Btrfs: remove some WARN_ONs in the IO failure path Btrfs: Don't loop forever on metadata IO failures Btrfs: init inode ordered_data_close flag properly	2009-05-14 19:18:44 -07:00
Aneesh Kumar K.V	2a8964d63d	ext4: Clear the unwritten buffer_head flag after the extent is initialized The BH_Unwritten flag indicates that the buffer is allocated on disk but has not been written; that is, the disk was part of a persistent preallocation area. That flag should only be set when a get_blocks() function is looking up a inode's logical to physical block mapping. When ext4_get_blocks_wrap() is called with create=1, the uninitialized extent is converted into an initialized one, so the BH_Unwritten flag is no longer appropriate. Hence, we need to make sure the BH_Unwritten is not left set, since the combination of BH_Mapped and BH_Unwritten is not allowed; among other things, it will result ext4's get_block() to be called over and over again during the write_begin phase of write(2). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-14 17:05:39 -04:00
Sankar P	9f55684c2d	Btrfs: Spelling fix in btrfs_lookup_first_block_group comments Signed-off-by: Sankar P <sankar.curiosity@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-05-14 14:00:34 -04:00
Sage Weil	6b65c5c61b	Btrfs: make show_options result match actual option names The notreelog and flushoncommit mount options were being printed slightly differently. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-05-14 14:00:34 -04:00
Li Hong	5d847a8ed9	Btrfs: remove outdated comment in btrfs_ioctl_resize() In Li Zefan's commit `dae7b665cf`, a combination call of kmalloc() and copy_from_user() is replaced by memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more. Signed-off-by: Li Hong <lihong.hi@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-05-14 14:00:33 -04:00
Chris Mason	cc7b0c9b70	Btrfs: remove some WARN_ONs in the IO failure path These debugging WARN_ONs make too much console noise during regular IO failures. An IO failure will still generate a number of messages as we verify checksums etc, but these two are not needed. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-05-14 14:00:33 -04:00
Chris Mason	76a05b35a3	Btrfs: Don't loop forever on metadata IO failures When a btrfs metadata read fails, the first thing we try to do is find a good copy on another mirror of the block. If this fails, read_tree_block() ends up returning a buffer that isn't up to date. The btrfs btree reading code was reworked to drop locks and repeat the search when IO was done, but the changes didn't add a check for failed reads. The end result was looping forever on buffers that were never going to become up to date. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-05-14 14:00:32 -04:00
Chris Mason	2757495c90	Btrfs: init inode ordered_data_close flag properly This flag is used to decide when we need to send a given file through the ordered code to make sure it is fully written before a transaction commits. It was not being properly set to zero when the inode was being setup. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-05-14 14:00:31 -04:00
Theodore Ts'o	2ac3b6e00a	ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state The ext4_get_blocks() function was depending on the value of bh_result->b_state as an input parameter to decide whether or not update the delalloc accounting statistics by calling ext4_da_update_reserve_space(). We now use a separate flag, EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE, to requests this update, so that all callers of ext4_get_blocks() can clear map_bh.b_state before calling ext4_get_blocks() without worrying about any consistency issues. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-14 13:57:08 -04:00
Jeff Layton	d8e2f53ac9	cifs: fix error handling in parse_DFS_referrals cifs_strndup_from_ucs returns NULL on error, not an ERR_PTR Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-05-14 13:55:32 +00:00
Theodore Ts'o	2fa3cdfb31	ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks() The static function ext4_da_get_block_write() was only used by mpage_da_map_blocks(). So to simplify the code, merge that function into mpage_da_map_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-14 09:29:45 -04:00
Andrew Morton	77f6bf57ba	splice: fix error return code fs/splice.c: In function 'default_file_splice_read': fs/splice.c:566: warning: 'error' may be used uninitialized in this function which is sort-of true. The code will in fact return -ENOMEM instead of the kernel_readv() return value. Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-14 09:49:44 +02:00
Linus Torvalds	210af919c9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: Squashfs: cody tidying, remove commented out line in Makefile Squashfs: check page size is not larger than the filesystem block size Squashfs: fix breakage when page size > metadata block size	2009-05-13 16:33:25 -07:00
Linus Torvalds	a6aeeebf51	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: destroy bdi on error	2009-05-13 16:32:57 -07:00
Linus Torvalds	014049a1c7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: check size of array structured data exchanged via ioctls nilfs2: fix lock order reversal in nilfs_clean_segments ioctl nilfs2: fix possible circular locking for get information ioctls nilfs2: ensure to clear dirty state when deleting metadata file block nilfs2: fix circular locking dependency of writer mutex nilfs2: fix possible recovery failure due to block creation without writer	2009-05-13 16:32:16 -07:00
Mel Gorman	f2deae9d4e	Remove implementation of readpage from the hugetlbfs_aops The core VM assumes the page size used by the address_space in inode->i_mapping is PAGE_SIZE but hugetlbfs breaks this assumption by inserting pages into the page cache at offsets the core VM considers unexpected. This would not be a problem except that hugetlbfs also provide a ->readpage implementation. As it exists, the core VM can assume the base page size is being used, allocate pages on behalf of the filesystem, insert them into the page cache and call ->readpage to populate them. These pages are the wrong size and at the wrong offset for hugetlbfs causing confusion. This patch deletes the ->readpage implementation for hugetlbfs on the grounds the core VM should not be allocating and populating pages on behalf of hugetlbfs. There should be no existing users of the ->readpage implementation so it should not cause a regression. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-13 08:04:45 -07:00
Steven Whitehouse	9582d41135	GFS2: Remove a couple of unused sysfs entries These two tunables are pointless and would never need to be changed anyway. There is also a race between them and umount as the deamons which they refer to might have gone away. The easiest way to fix the race is to remove the interface. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-13 15:06:25 +01:00
Steven Whitehouse	48c2b61361	GFS2: Add commit= mount option It has always been possible to adjust the gfs2 log commit interval, but only from the sysfs interface. This adds a mount option, commit=<nn>, which will be familar to ext3 users. The sysfs interface continues to be available as well, although this might be removed in the future. Also this patch cleans up some duplicated structures in the GFS2 sysfs code. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-13 14:49:48 +01:00
Steven Whitehouse	a1c0643ff9	GFS2: Move journal live test at transaction start There seems little point grabbing the transaction glock only to have to release it again if the journal isn't live. This moves the test earlier to avoid grabbing the lock when we don't need it in the first place. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-13 10:56:52 +01:00
Jens Axboe	4f23122858	splice: fix repeated kmap()'s in default_file_splice_read() We cannot reliably map more than one page at the time, or we risk deadlocking. Just allocate the pages from low mem instead. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-13 08:35:35 +02:00
Phillip Lougher	e5d287539d	Squashfs: cody tidying, remove commented out line in Makefile Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-05-13 03:25:20 +01:00
Phillip Lougher	fffb47b80e	Squashfs: check page size is not larger than the filesystem block size Normally the block size (by default 128K) will be larger than the page size, unless a non-standard block size has been specified in Mksquashfs, and the page size is larger than 4K. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-05-13 02:59:26 +01:00
Doug Chapman	a37b06d589	Squashfs: fix breakage when page size > metadata block size Squashfs is broken on any system where the page size is larger than the metadata size (8192). This is easily fixed by ensuring cache->pages is always > 0. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Doug Chapman <doug.chapman@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-05-13 02:56:39 +01:00
Linus Torvalds	2ea3f86848	Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux * 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: nfsd: silence lockdep warning lockd: fix list corruption on lockd restart nfsd4: check for negative dentry before use in nfsv4 readdir nfsd41: slots are freed with session svcrdma: clean up error paths. svcrdma: Fix dma map direction for rdma read targets	2009-05-12 17:11:56 -07:00
Davide Libenzi	bfe3891a5f	epoll: fix size check in epoll_create() Fix a size check WRT the manual pages. This was inadvertently broken by commit `9fe5ad9c8c` ("flag parameters add-on: remove epoll_create size param"). Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: <Hiroyuki.Mach@gmail.com> Cc: rohit verma <rohit.170309@gmail.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-12 14:11:35 -07:00
Aneesh Kumar K.V	33b9817e2a	ext4: Use a fake block number for delayed new buffer_head Use a very large unsigned number (~0xffff) as as the fake block number for the delayed new buffer. The VFS should never try to write out this number, but if it does, this will make it obvious. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-12 14:40:37 -04:00
Aneesh Kumar K.V	9c1ee184a3	ext4: Fix sub-block zeroing for writes into preallocated extents We need to mark the buffer_head mapping preallocated space as new during write_begin. Otherwise we don't zero out the page cache content properly for a partial write. This will cause file corruption with preallocation. Now that we mark the buffer_head new we also need to have a valid buffer_head blocknr so that unmap_underlying_metadata() unmaps the correct block. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-13 18:36:58 -04:00
Theodore Ts'o	a2dc52b5d1	ext4: Add BUG_ON debugging checks to noalloc_get_block_write() Enforce that noalloc_get_block_write() is only called to map one block at a time, and that it always is successful in finding a mapping for given an inode's logical block block number if it is called with create == 1. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-12 13:51:29 -04:00
Theodore Ts'o	b920c75502	ext4: Add documentation to the ext4_get_block functions This adds more documentation to various internal functions in fs/ext4/inode.c, most notably ext4_ind_get_blocks(), ext4_da_get_block_write(), ext4_da_get_block_prep(), ext4_normal_get_block_write(). In addition, the static function ext4_normal_get_block_write() has been renamed noalloc_get_block_write(), since it is used in many places far beyond ext4_normal_writepage(). Plenty of warnings have been added to the noalloc_get_block_write() function, since the way it is used is amazingly fragile. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-14 00:54:29 -04:00
Theodore Ts'o	c217705733	ext4: Define a new set of flags for ext4_get_blocks() The functions ext4_get_blocks(), ext4_ext_get_blocks(), and ext4_ind_get_blocks() used an ad-hoc set of integer variables used as boolean flags passed in as arguments. Use a single flags parameter and a setandard set of bitfield flags instead. This saves space on the call stack, and it also makes the code a bit more understandable. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-14 00:58:52 -04:00
Theodore Ts'o	12b7ac1768	ext4: Rename ext4_get_blocks_wrap() to be ext4_get_blocks() Another function rename for clarity's sake. The _wrap prefix simply confuses people, and didn't add much people trying to follow the code paths. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-14 00:57:44 -04:00
Abhijith Das	7537d81aa7	GFS2: Fix timestamps on write This patch copies the timestamps from the vfs inode into gfs2 and syncs it to the disk inode during writes. Signed-off-by: Abhijith Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-12 16:14:05 +01:00
Theodore Ts'o	e4d996ca80	ext4: Rename ext4_get_blocks_handle() to be ext4_ind_get_blocks() The static function ext4_get_blocks_handle() is badly named. Of course it takes a handle. Since its counterpart for extent-based file is ext4_ext_get_blocks(), rename it to be ext4_ind_get_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-12 00:25:28 -04:00
Theodore Ts'o	f888e652d7	ext4: Simplify function signature for ext4_da_get_block_write() The function ext4_da_get_block_write() is called in exactly one write, and the last argument, create, is always 1. Remove it to simplify the code slightly. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-12 00:21:29 -04:00
Vincent Minet	bc8e67409c	ext4: Fix spinlock assertions on UP systems On UP systems without DEBUG_SPINLOCK, ext4_is_group_locked always fails which triggers a BUG_ON() call. This patch fixes it by using assert_spin_locked instead. Signed-off-by: Vincent Minet <vincent@vincent-minet.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-05-15 08:33:18 -04:00
J. Bruce Fields	8daed1e549	nfsd: silence lockdep warning Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-05-11 17:23:14 -04:00
Jeff Mahoney	2b79bc4f7e	dup2: Fix return value with oldfd == newfd and invalid fd The return value of dup2 when oldfd == newfd and the fd isn't valid is not getting properly sign extended. We end up with 4294967287 instead of -EBADF. I've reproduced this on SLE11 (2.6.27.21), openSUSE Factory (2.6.29-rc5), and Ubuntu 9.04 (2.6.28). This patch uses a signed int for the error value so it is properly extended. Commit `6c5d0512a0` introduced this regression. Reported-by: Jiri Dluhos <jdluhos@novell.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-11 12:18:06 -07:00
Ryusuke Konishi	83aca8f480	nilfs2: check size of array structured data exchanged via ioctls Although some ioctls of nilfs2 exchange data in the form of indirectly referenced array, some of them lack size check on the array elements. This inserts the missing checks and rejects requests if data of ioctl does not have a valid format. We usually don't have to check size of structures that we associated with ioctl commands because the size is tested implicitly for identifying ioctl command; the checks this patch adds are for the cases where the implicit check is not applied. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-05-12 01:48:54 +09:00
Miklos Szeredi	0b0a47f5c4	splice: implement default splice_write method If f_op->splice_write() is not implemented, fall back to a plain write. Use vfs_writev() to write from the pipe buffers. This will allow splice on all filesystems and file types. This includes "direct_io" files in fuse which bypass the page cache. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 14:13:10 +02:00
Miklos Szeredi	6818173bd6	splice: implement default splice_read method If f_op->splice_read() is not implemented, fall back to a plain read. Use vfs_readv() to read into previously allocated pages. This will allow splice and functions using splice, such as the loop device, to work on all filesystems. This includes "direct_io" files in fuse which bypass the page cache. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 14:13:10 +02:00
Miklos Szeredi	7c77f0b3f9	splice: implement pipe to pipe splicing Allow splice(2) to work when both the input and the output is a pipe. Based on the impementation of the tee(2) syscall, but instead of duplicating the buffer references move the buffers from the input pipe to the output pipe. Moving the whole buffer only succeeds if the full length of the buffer is spliced. Otherwise duplicate the buffer, just like tee(2), set the length of the output buffer and advance the offset on the input buffer. Since splice is operating on two pipes, special care needs to be taken with locking to prevent AN ABBA deadlock. Again this is done similarly to the tee(2) syscall, first preparing the input and output pipes so there's data to consume and space for that data, and then doing the move operation while holding both locks. If other processes are doing I/O on the same pipes parallel to the splice, then by the time both inodes are locked there might be no buffers left to move, or no space to move them to. In this case retry the whole operation, including the preparation phase. This could lead to starvation, but I'm not sure if that's serious enough to worry about. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 14:13:09 +02:00
Steven Whitehouse	48bf2b1711	GFS2: Something nonlinear this way comes! For some reason GFS2 has been missing support for non-linear mappings. This patch fixes that, and also avoids taking any locks for mmap in the O_NOATIME case. In fact we don't actually need to take the lock here at all - just doing file_accessed() would be enough, but we have to take the lock eventually and this helps it hit disk (and thus be seen by other nodes) faster. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-11 12:36:44 +01:00
Steven Whitehouse	4a0f9a321a	GFS2: Optimise writepage for metadata This adds a GFS2 specific writepage for metadata, rather than continuing to use the VFS function. As a result we now tag all our metadata I/O with the correct flag so that blktraces will now be less confusing. Also, the generic function was checking for a number of corner cases which cannot happen on the metadata address spaces so that this should be faster too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-11 12:36:43 +01:00
Steven Whitehouse	c969f58ca4	GFS2: Update the rw flags After Jens recent updates: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a1f242524c3c1f5d40f1c9c343427e34d1aadd6e et al. this is a patch to bring gfs2 uptodate with the core code. Also I've managed to squash another call to ll_rw_block() along the way. There is still one part of the GFS2 I/O paths which are not correctly annotated and that is due to the sharing of the writeback code between the data and metadata address spaces. I would like to change that too, but this patch is still worth doing on its own, I think. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-05-11 12:36:41 +01:00
Tejun Heo	c3a4d78c58	block: add rq->resid_len rq->data_len served two purposes - the length of data buffer on issue and the residual count on completion. This duality creates some headaches. First of all, block layer and low level drivers can't really determine what rq->data_len contains while a request is executing. It could be the total request length or it coulde be anything else one of the lower layers is using to keep track of residual count. This complicates things because blk_rq_bytes() and thus [__]blk_end_request_all() relies on rq->data_len for PC commands. Drivers which want to report residual count should first cache the total request length, update rq->data_len and then complete the request with the cached data length. Secondly, it makes requests default to reporting full residual count, ie. reporting that no data transfer occurred. The residual count is an exception not the norm; however, the driver should clear rq->data_len to zero to signify the normal cases while leaving it alone means no data transfer occurred at all. This reverse default behavior complicates code unnecessarily and renders block PC on some drivers (ide-tape/floppy) unuseable. This patch adds rq->resid_len which is used only for residual count. While at it, remove now unnecessasry blk_rq_bytes() caching in ide_pc_intr() as rq->data_len is not changed anymore. Boaz : spotted missing conversion in osd Sergei : spotted too early conversion to blk_rq_bytes() in ide-tape [ Impact: cleanup residual count handling, report 0 resid by default ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Borislav Petkov <petkovbb@googlemail.com> Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Mike Miller <mike.miller@hp.com> Cc: Eric Moore <Eric.Moore@lsi.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Mike Miller <mike.miller@hp.com> Cc: Eric Moore <Eric.Moore@lsi.com> Cc: Darrick J. Wong <djwong@us.ibm.com> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 09:50:53 +02:00
Ryusuke Konishi	4f6b828837	nilfs2: fix lock order reversal in nilfs_clean_segments ioctl This is a companion patch to ("nilfs2: fix possible circular locking for get information ioctls"). This corrects lock order reversal between mm->mmap_sem and nilfs->ns_segctor_sem in nilfs_clean_segments() which was detected by lockdep check: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.30-rc3-nilfs-00003-g360bdc1 #7 ------------------------------------------------------- mmap/5294 is trying to acquire lock: (&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2] but task is already holding lock: (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_sem){++++++}: [<c01470a5>] __lock_acquire+0x1066/0x13b0 [<c01474a9>] lock_acquire+0xba/0xdd [<c01836bc>] might_fault+0x68/0x88 [<c023c61d>] copy_from_user+0x2a/0x111 [<d0d120d0>] nilfs_ioctl_prepare_clean_segments+0x1d/0xf1 [nilfs2] [<d0d0e2aa>] nilfs_clean_segments+0x6d/0x1b9 [nilfs2] [<d0d11f68>] nilfs_ioctl+0x2ad/0x318 [nilfs2] [<c01a3be7>] vfs_ioctl+0x22/0x69 [<c01a408e>] do_vfs_ioctl+0x460/0x499 [<c01a4107>] sys_ioctl+0x40/0x5a [<c01031a4>] sysenter_do_call+0x12/0x38 [<ffffffff>] 0xffffffff -> #0 (&nilfs->ns_segctor_sem){++++.+}: [<c0146e0b>] __lock_acquire+0xdcc/0x13b0 [<c01474a9>] lock_acquire+0xba/0xdd [<c0433f1d>] down_read+0x2a/0x3e [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2] [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2] [<c0183b0b>] __do_fault+0x165/0x376 [<c01855cd>] handle_mm_fault+0x287/0x5d1 [<c043712d>] do_page_fault+0x2fb/0x30a [<c0435462>] error_code+0x72/0x78 [<ffffffff>] 0xffffffff where nilfs_clean_segments() holds: nilfs->ns_segctor_sem -> copy_from_user() --> page fault -> mm->mmap_sem And, page fault path may hold: page fault -> mm->mmap_sem --> nilfs_page_mkwrite() -> nilfs->ns_segctor_sem Even though nilfs_clean_segments() does not perform write access on given user pages, it may cause deadlock because nilfs->ns_segctor_sem is shared per device and mm->mmap_sem can be shared with other tasks. To avoid this problem, this patch moves all calls of copy_from_user() outside the nilfs->ns_segctor_sem lock in the ioctl. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-05-11 14:54:41 +09:00
Ryusuke Konishi	47eb6b9c8f	nilfs2: fix possible circular locking for get information ioctls This is one of two patches which are to correct possible circular locking between mm->mmap_sem and nilfs->ns_segctor_sem. The problem was detected by lockdep check as follows: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.30-rc3-nilfs-00002-g3552613 #6 ------------------------------------------------------- mmap/5418 is trying to acquire lock: (&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2] but task is already holding lock: (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_sem){++++++}: [<c01470a5>] __lock_acquire+0x1066/0x13b0 [<c01474a9>] lock_acquire+0xba/0xdd [<c01836bc>] might_fault+0x68/0x88 [<c023c730>] copy_to_user+0x2c/0xfc [<d0d11b4f>] nilfs_ioctl_wrap_copy+0x103/0x160 [nilfs2] [<d0d11fa9>] nilfs_ioctl+0x30a/0x3b0 [nilfs2] [<c01a3be7>] vfs_ioctl+0x22/0x69 [<c01a408e>] do_vfs_ioctl+0x460/0x499 [<c01a4107>] sys_ioctl+0x40/0x5a [<c01031a4>] sysenter_do_call+0x12/0x38 [<ffffffff>] 0xffffffff -> #0 (&nilfs->ns_segctor_sem){++++.+}: [<c0146e0b>] __lock_acquire+0xdcc/0x13b0 [<c01474a9>] lock_acquire+0xba/0xdd [<c0433f1d>] down_read+0x2a/0x3e [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2] [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2] [<c0183b0b>] __do_fault+0x165/0x376 [<c01855cd>] handle_mm_fault+0x287/0x5d1 [<c043712d>] do_page_fault+0x2fb/0x30a [<c0435462>] error_code+0x72/0x78 [<ffffffff>] 0xffffffff other info that might help us debug this: 1 lock held by mmap/5418: #0: (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a stack backtrace: Pid: 5418, comm: mmap Not tainted 2.6.30-rc3-nilfs-00002-g3552613 #6 Call Trace: [<c0432145>] ? printk+0xf/0x12 [<c0145c48>] print_circular_bug_tail+0xaa/0xb5 [<c0146e0b>] __lock_acquire+0xdcc/0x13b0 [<d0d10149>] ? nilfs_sufile_get_stat+0x1e/0x105 [nilfs2] [<c013b59a>] ? up_read+0x16/0x2c [<d0d10225>] ? nilfs_sufile_get_stat+0xfa/0x105 [nilfs2] [<c01474a9>] lock_acquire+0xba/0xdd [<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2] [<c0433f1d>] down_read+0x2a/0x3e [<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2] [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2] [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2] [<c0183b0b>] __do_fault+0x165/0x376 [<c01855cd>] handle_mm_fault+0x287/0x5d1 [<c043700a>] ? do_page_fault+0x1d8/0x30a [<c013b54f>] ? down_read_trylock+0x39/0x43 [<c043712d>] do_page_fault+0x2fb/0x30a [<c0436e32>] ? do_page_fault+0x0/0x30a [<c0435462>] error_code+0x72/0x78 [<c0436e32>] ? do_page_fault+0x0/0x30a This makes the lock granularity of nilfs->ns_segctor_sem finer than that of the mmap semaphore for ioctl commands except nilfs_clean_segments(). The successive patch ("nilfs2: fix lock order reversal in nilfs_clean_segments ioctl") is required to fully resolve the problem. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-05-11 12:57:46 +09:00
David Howells	107db7c7dd	CRED: Guard the setprocattr security hook against ptrace Guard the setprocattr security hook against ptrace by taking the target task's cred_guard_mutex around it. The problem is that setprocattr() may otherwise note the lack of a debugger, and then perform an action on that basis whilst letting a debugger attach between the two points. Holding cred_guard_mutex across the test and the action prevents ptrace_attach() from doing that. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-05-11 08:15:39 +10:00
David Howells	5e751e992f	CRED: Rename cred_exec_mutex to reflect that it's a guard against ptrace Rename cred_exec_mutex to reflect that it's a guard against foreign intervention on a process's credential state, such as is made by ptrace(). The attachment of a debugger to a process affects execve()'s calculation of the new credential state - _and_ also setprocattr()'s calculation of that state. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-05-11 08:15:36 +10:00
Linus Torvalds	93b49d45eb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (22 commits) Fix the race between capifs remount and node creation Fix races around the access to ->s_options switch ufs directories to ufs_sync_file() Switch open_exec() and sys_uselib() to do_open_filp() Make open_exec() and sys_uselib() use may_open(), instead of duplicating its parts Reduce path_lookup() abuses Make checkpatch.pl shut up on fs/inode.c NULL noise in fs/super.c:kill_bdev_super() romfs: cleanup romfs_fs.h ROMFS: romfs_dev_read() error ignored fs: dcache fix LRU ordering ocfs2: Use nd_set_link(). Fix deadlock in ipathfs ->get_sb() Fix a leak in failure exit in 9p ->get_sb() Convert obvious places to deactivate_locked_super() New helper: deactivate_locked_super() reiserfs: remove privroot hiding in lookup reiserfs: dont associate security.* with xattr files reiserfs: fixup xattr_root caching Always lookup priv_root on reiserfs mount and keep it ...	2009-05-10 10:49:08 -07:00
Ryusuke Konishi	843382370e	nilfs2: ensure to clear dirty state when deleting metadata file block This would fix the following failure during GC: nilfs_cpfile_delete_checkpoints: cannot delete block NILFS: GC failed during preparation: cannot delete checkpoints: err=-2 The problem was caused by a break in state consistency between page cache and btree; the above block was removed from the btree but the page buffering the block was remaining in the page cache in dirty state. This resolves the inconsistency by ensuring to clear dirty state of the page buffering the deleted block. Reported-by: David Arendt <admin@prnet.org> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-05-10 17:04:42 +09:00
Al Viro	2a32cebd6c	Fix races around the access to ->s_options Put generic_show_options read access to s_options under rcu_read_lock, split save_mount_options() into "we are setting it the first time" (uses in foo_fill_super()) and "we are relacing and freeing the old one", synchronize_rcu() before kfree() in the latter. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-05-09 10:51:34 -04:00
Al Viro	f9dbd05bc9	switch ufs directories to ufs_sync_file() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-05-09 10:49:42 -04:00
Al Viro	6e8341a11e	Switch open_exec() and sys_uselib() to do_open_filp() ... and make path_lookup_open() static Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-05-09 10:49:42 -04:00

... 3 4 5 6 7 ...

14343 Commits