linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-17 09:57:46 +07:00

Author	SHA1	Message	Date
Christoph Hellwig	f1f724e4b5	xfs: fix locking for inode cache radix tree tag updates The radix-tree code requires its users to serialize tag updates against other updates to the tree. While XFS protects tag updates against each other it does not serialize them against updates of the tree contents, which can lead to tag corruption. Fix the inode cache to always take pag_ici_lock in exclusive mode when updating radix tree tags. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Patrick Schreurs <patrick@news-service.com> Tested-by: Patrick Schreurs <patrick@news-service.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 19:14:36 -06:00
Tao Ma	cc483f102c	ext4: Fix fencepost error in choosing group vs file preallocation. The ext4 multiblock allocator decides whether to use group or file preallocation based on the file size. When the file size reaches s_mb_stream_request (default is 16 blocks), it changes to use a file-specific preallocation. This is cool, but it has a tiny problem. See a simple script: mkfs.ext4 -b 1024 /dev/sda8 1000000 mount -t ext4 -o nodelalloc /dev/sda8 /mnt/ext4 for((i=0;i<5;i++)) do cat /mnt/4096>>/mnt/ext4/a #4096 is a file with 4096 characters. cat /mnt/4096>>/mnt/ext4/b done debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1 And you get BLOCKS: (0-14):8705-8719, (15):2356, (16-19):8465-8468 So there are 3 extents, a bit strange for the lonely 15th logical block. As we write to the 16 blocks, we choose file preallocation in ext4_mb_group_or_file, but in ext4_mb_normalize_request, we meet with the 16*1024 range, so no preallocation will be carried. file b then reserves the space after '2356', so when we write 16, we start from another part. This patch just changes the check in ext4_mb_group_or_file, so that for the lonely 15 we will still use group preallocation. After the patch, we will get: debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1 BLOCKS: (0-15):8705-8720, (16-19):8465-8468 Looks more sane. Thanks. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-03-01 19:06:35 -05:00
Christoph Hellwig	a14a5ab58f	xfs: remove xfs_ipin/xfs_iunpin Inodes are only pinned/unpinned via the inode item methods, and lots of code relies on that fact. So remove the separate xfs_ipin/xfs_iunpin helpers and merge them into their only callers. This also fixes up various duplicate and/or incorrect comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:56 -06:00
Christoph Hellwig	60ec678371	xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait Remove the inode item pointer and ili_last_lsn checks in __xfs_iunpin_wait as any pinned inode is guaranteed to have them valid. After this the xfs_iunpin_nowait case is nothing more than a xfs_log_force_lsn, as we know that the caller has already checked the pincount. Make xfs_iunpin_nowait the new low-level routine just doing the log force and rewrite xfs_iunpin_wait around it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:50 -06:00
Christoph Hellwig	d7658d487f	xfs: kill xfs_lrw.h Move the two declarations to better fitting headers now that xfs_lrw.c is gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:44 -06:00
Christoph Hellwig	d7e84f4137	xfs: factor common xfs_trans_bjoin code Most of xfs_trans_bjoin is duplicated in xfs_trans_get_buf, xfs_trans_getsb and xfs_trans_read_buf. Add a new _xfs_trans_bjoin which can be called by all four functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:37 -06:00
Christoph Hellwig	35a8a72f06	xfs: stop passing opaque handles to xfs_log.c routines Currently we pass opaque xfs_log_ticket_t handles instead of struct xlog_ticket pointers, and void pointers instead of struct xlog_in_core pointers to various log manager functions. Instead pass properly typed pointers after adding forward declarations for them to xfs_log.h, and adjust the touched function prototypes to the standard XFS style while at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:32 -06:00
Christoph Hellwig	c467c049e7	xfs: split xfs_bmap_btalloc Split out the nullfb case into a separate function to reduce the stack footprint and make the code more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:25 -06:00
Christoph Hellwig	f7008d0aeb	xfs: fix xfs_fsblock_t tracing Using a static buffer in xfs_fmtfsblock means we can corrupt traces if multiple CPUs hit this code path at the same time. Just remove xfs_fmtfsblock for now and print the block number purely numerically. If we want the NULLFSBLOCK and NULLSTARTBLOCK formatting back the best way would be a decoding plugin in the trace-cmd userspace command. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:17 -06:00
Christoph Hellwig	024910cbac	xfs: fix inode pincount check in fsync We need to hold the ilock to check the inode pincount safely. While we're at it also remove the check for ip->i_itemp->ili_last_lsn, a pinned inode always has it set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:10 -06:00
Dave Chinner	77d7a0c2ee	xfs: Non-blocking inode locking in IO completion The introduction of barriers to loop devices has created a new IO order completion dependency that XFS does not handle. The loop device implements barriers using fsync and so turns a log IO in the XFS filesystem on the loop device into a data IO in the backing filesystem. That is, the completion of log IOs in the loop filesystem is now dependent on completion of data IO in the backing filesystem. This can cause deadlocks when a flush daemon issues a log force with an inode locked because the IO completion of IO on the inode is blocked by the inode lock. This in turn prevents further data IO completion from occurring on all XFS filesystems on that CPU (due to the shared nature of the completion queues). This then prevents the log IO from completing because the log is waiting for data IO completion as well. The fix for this new completion order dependency issue is to make the IO completion inode locking non-blocking. If the inode lock can't be grabbed, simply requeue the IO completion back to the work queue so that it can be processed later. This prevents the completion queue from being blocked and allows data IO completion on other inodes to proceed, hence avoiding completion order dependent deadlocks. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:52 -06:00
Christoph Hellwig	66d834ea60	xfs: implement optimized fdatasync Allow us to track the difference between timestamp and size updates by using mark_inode_dirty from the I/O completion code, and checking the VFS inode flags in xfs_file_fsync. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:45 -06:00
Christoph Hellwig	fd3200bef7	xfs: remove wrapper for the fsync file operation Currently the fsync file operation is divided into a low-level routine doing all the work and one that implements the Linux file operation and does minimal argument wrapping. This is a leftover from the days of the vnode operations layer and can be removed to simplify the code a bit, as well as preparing for the implementation of an optimized fdatasync which needs to look at the Linux inode state. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:38 -06:00
Christoph Hellwig	00258e36b2	xfs: remove wrappers for read/write file operations Currently the aio_read, aio_write, splice_read and splice_write file operations are divided into a low-level routine doing all the work and one that implements the Linux file operations and does minimal argument wrapping. This is a leftover from the days of the vnode operations layer and can be removed to simplify the code a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:29 -06:00
Christoph Hellwig	dda35b8f84	xfs: merge xfs_lrw.c into xfs_file.c Currently the code to implement the file operations is split over two small files. Merge the content of xfs_lrw.c into xfs_file.c to have it in one place. Note that I haven't done various cleanups that are possible after this yet, they will follow in the next patch. Also the function xfs_dev_is_read_only which was in xfs_lrw.c before really doesn't fit in here at all and was moved to xfs_mount.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:18 -06:00
Christoph Hellwig	b262e5dfd9	xfs: fix dquota trace format The be32_to_cpu in the TP_printk output breaks automatic parsing of the trace format by the trace-cmd tools, so we have to move it into the TP_assign block. While we're at it also fix the format for the quota limits to be more regular and easier to parse. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:11 -06:00
Eric Sandeen	a9cc799eca	xfs: increase readdir buffer size While doing some testing of readdir perf a while back, I noticed that the buffer size we're using internally is smaller than what glibc gives us by default. Upping this size helped a bit, and seems safe. glibc's __alloc_dir() does: const size_t default_allocation = (4 * BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : 4 * BUFSIZ); const size_t small_allocation = (BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : BUFSIZ); size_t allocation = default_allocation; #ifdef _STATBUF_ST_BLKSIZE if (statp != NULL && default_allocation < statp->st_blksize) allocation = statp->st_blksize; #endif and #define _G_BUFSIZ 8192 #define _IO_BUFSIZ _G_BUFSIZ # define BUFSIZ _IO_BUFSIZ so the default buffer is 4 * 8192 = 32768 (except in the unlikely case of blocks > 32k....) Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:33:41 -06:00
Linus Torvalds	b1bf936840	Merge branch 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits) block: don't access jiffies when initialising io_context cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds block: fix for "Consolidate phys_segment and hw_segment limits" cfq-iosched: quantum check tweak blktrace: perform cleanup after setup error blkdev: fix merge_bvec_fn return value checks cfq-iosched: requests "in flight" vs "in driver" clarification cciss: Fix problem with scatter gather elements in the scsi half of the driver cciss: eliminate unnecessary pointer use in cciss scsi code cciss: do not use void pointer for scsi hba data cciss: factor out scatter gather chain block mapping code cciss: fix scatter gather chain block dma direction kludge cciss: simplify scatter gather code cciss: factor out scatter gather chain block allocation and freeing cciss: detect bad alignment of scsi commands at build time cciss: clarify command list padding calculation cfq-iosched: rethink seeky detection for SSDs cfq-iosched: rework seeky detection block: remove padding from io_context on 64bit builds block: Consolidate phys_segment and hw_segment limits ...	2010-03-01 09:00:29 -08:00
Bob Peterson	4818972efb	GFS2: print glock numbers in hex This patch changes glock numbers from printing in decimal to hex. Since DLM prints corresponding resource IDs in hex, it makes debugging easier. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-03-01 14:09:04 +00:00
Dave Chinner	e5884636da	GFS2: ordered writes are backwards When we queue data buffers for ordered write, the buffers are added to the head of the ordered write list. When the log needs to push these buffers to disk, it also walks the list from the head. The result is that the ordered buffers are submitted to disk in reverse order. For large writes, this means that whenever the log flushes, large streams of reverse sequential order buffers are pushed down into the block layers. The elevators don't handle this particularly well, so IO rates tend to be significantly lower than if the IO was issued in ascending block order. Queue new ordered buffers to the tail of the ordered buffer list to ensure that IO is dispatched in the order it was submitted. This should significantly improve large sequential write speeds. On a disk capable of 85MB/s, speeds increase from 50MB/s to 65MB/s for noop and from 38MB/s to 50MB/s for cfq. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-03-01 14:08:26 +00:00
Steven Whitehouse	c1184f8ab7	GFS2: Remove loopy umount code As a consequence of the previous patch, we can now remove the loop which used to be required due to the circular dependency between the inodes and glocks. Instead we can just invalidate the inodes, and then clear up any glocks which are left. Also we no longer need the rwsem since there is no longer any danger of the inode invalidation calling back into the glock code (and from there back into the inode code). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-03-01 14:07:53 +00:00
Steven Whitehouse	009d851837	GFS2: Metadata address space clean up Since the start of GFS2, an "extra" inode has been used to store the metadata belonging to each inode. The only reason for using this inode was to have an extra address space, the other fields were unused. This means that the memory usage was rather inefficient. The reason for keeping each inode's metadata in a separate address space is that when glocks are requested on remote nodes, we need to be able to efficiently locate the data and metadata which relate to that glock (inode) in order to sync or sync and invalidate it (depending on the remotely requested lock mode). This patch adds a new type of glock which, in addition to its normal fields, has an address space. This applies to all inode and rgrp glocks (but to no other glock types which remain as before). As a result, we no longer need to have the second inode. This results in three major improvements: 1. A saving of approx 25% of memory used in caching inodes 2. A removal of the circular dependency between inodes and glocks 3. No confusion between "normal" and "metadata" inodes in super.c Although the first of these is the more immediately apparent, the second is just as important as it now enables a number of clean ups at umount time. Those will be the subject of future patches. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-03-01 14:07:37 +00:00
David S. Miller	47871889c6	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/firmware/iscsi_ibft.c	2010-02-28 19:23:06 -08:00
James Morris	b4ccebdd37	Merge branch 'next' into for-linus	2010-03-01 09:36:31 +11:00
Dmitry Monakhov	9f7cdbc33f	blkdev: fix merge_bvec_fn return value checks merge_bvec_fn() returns bvec->bv_len on success. So we have to check against this value. But in the case of an fs_optimization merge we compare with the wrong value. This patch should have been included in b428cd6da7e6559aca69aa2e3a526037d3f20403, but I accidentally forgot to add it to the initial patch. To make things straight let's replace all such checks. In fact this makes the code easier to understand. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-28 19:47:18 +01:00
Linus Torvalds	642c4c75a7	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits) rcu: Fix accelerated GPs for last non-dynticked CPU rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot rcu: Fix accelerated grace periods for last non-dynticked CPU rcu: Export rcu_scheduler_active rcu: Make rcu_read_lock_sched_held() take boot time into account rcu: Make lockdep_rcu_dereference() message less alarmist sched, cgroups: Fix module export rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information rcu: Fix rcutorture mod_timer argument to delay one jiffy rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection rcu: Convert to raw_spinlocks rcu: Stop overflowing signed integers rcu: Use canonical URL for Mathieu's dissertation rcu: Accelerate grace period if last non-dynticked CPU rcu: Fix citation of Mathieu's dissertation rcu: Documentation update for CONFIG_PROVE_RCU security: Apply lockdep-based checking to rcu_dereference() uses idr: Apply lockdep-based diagnostics to rcu_dereference() uses radix-tree: Disable RCU lockdep checking in radix tree vfs: Abstract rcu_dereference_check for files-fdtable use ...	2010-02-28 10:13:16 -08:00
Boaz Harrosh	50a76fd3c3	exofs: groups support * _calc_stripe_info() changes to accommodate for grouping calculations. Returns additional information * old _prepare_pages() becomes _prepare_one_group() which stores pages belonging to one device group. * New _prepare_for_striping iterates on all groups calling _prepare_one_group(). * Enable mounting of groups data_maps (group_width != 0) [QUESTION] what is faster A or B; A. x += stride; x = x % width + first_x; B x += stride if (x < last_x) x = first_x; Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:55:53 -08:00
Boaz Harrosh	b367e78bd1	exofs: Prepare for groups * Rename _offset_dev_unit_off() to _calc_stripe_info() and receive a struct for the output params * In _prepare_for_striping we only need to call _calc_stripe_info() once. The other components are easy to calculate from that. This code was inspired by what's done in truncate. * Some code shifts that make sense now but will make more sense when group support is added. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:44:44 -08:00
Boaz Harrosh	96391e2bae	exofs: Error recovery if object is missing from storage If an object is referenced by a directory but does not exist on a target, it is a very serious corruption that means: 1. Either a power failure with very slim chance of it happening. Because the directory update is always submitted much after object creation, but if a directory is written to one device and the object creation to another it might theoretically happen. 2. It only ever happened to me while developing with BUGs causing file corruption. Crashes could also cause it but they are more like case 1. In any way the object does not exist, so data is surely lost. If there is a mix-up in the obj-id or data-map, then lost objects can be salvaged by off-line fsck. The only recoverable information is the directory name. By letting it appear as a regular empty file, with date==0 (1970 Jan 1st) ownership to root, we enable recovery of the only useful information. And also enable deletion or over-write. I can see how this can hurt. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:44:43 -08:00
Boaz Harrosh	86093aaff5	exofs: convert io_state to use pages array instead of bio at input * inode.c operations are full-pages based, and not actually true scatter-gather * Lets us use more pages at once, up to 512 (from 249) in 64 bit * Brings us much much closer to being able to use exofs's io_state engine from objlayout driver. (Once I decide where to put the common code) After RAID0 patch the outer (input) bio was never used as a bio, but was simply a page carrier into the raid engine. Even in the simple mirror/single-dev arrangement pages info was copied into a second bio. It is now easier to just pass a pages array into the io_state and prepare bio(s) once. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:44:42 -08:00
Boaz Harrosh	5d952b8391	exofs: RAID0 support We now support striping over mirror devices. Including variable sized stripe_unit. Some limits: * stripe_unit must be a multiple of PAGE_SIZE * stripe_unit * stripe_count is at most 32-bit (4Gb) Tested RAID0 over mirrors, RAID0 only, mirrors only. All check. Design notes: * I'm not using a vectored raid-engine mechanism yet. Following the pnfs-objects-layout data-map structure, "Mirror" is just a private case of "group_width" == 1, and RAID0 is a private case of "Mirrors" == 1. The performance loss of the general case over the particular special case optimization is totally negligible, also considering the extra code size. * In general I added a prepare_stripes() stage that divides the to-be-io pages to the participating devices, the previous exofs_ios_write/read, now becomes _write/read_mirrors and a new write/read upper layer loops on all devices calling _write/read_mirrors. Effectively the prepare_stripes stage is the all secret. Also truncate needs fixing to accommodate for striping. * In a RAID0 arrangement, in a regular usage scenario, if all inode layouts will start at the same device, the small files fill up the first device and the later devices stay empty, the farther the device the emptier it is. To fix that, each inode will start at a different stripe_unit, according to its obj_id modulus number-of-stripe-units. And will then span all stripe-units in the same incrementing order wrapping back to the beginning of the device table. We call it a stripe-units moving window. Special consideration was taken to keep all devices in a mirror arrangement identical. So a broken osd-device could just be cloned from one of the mirrors and no FS scrubbing is needed. (We do that by rotating stripe-unit at a time and not a single device at a time.) TODO: We no longer verify object_length == inode->i_size in exofs_iget. (since i_size is striped on multiple objects now).
I should introduce a multiple-device attribute reading, and use it in exofs_iget. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:43:08 -08:00
Boaz Harrosh	d9c740d225	exofs: Define on-disk per-inode optional layout attribute * Layouts describe the way a file is spread on multiple devices. The layout information is stored in the objects attribute introduced in this patch. * There can be multiple generating functions for the layout. Currently defined: - No attribute present - use below moving-window on global device table, all devices. (This is the only one currently used in exofs) - an obj_id generated moving window - the obj_id is a randomizing factor in the otherwise global map layout. - An explicit layout stored, including a data_map and a device index list. - More might be defined in future ... * There are two attributes defined of the same structure: A-data-files-layout - This layout is used by data-files. If present at a directory, all files of that directory will be created with this layout. A-meta-data-layout - This layout is used by a directory and other meta-data information. Also inherited at creation of subdirectories. * At creation time inodes are created with the layout specified above. A usermode utility may change the creation layout on a given directory or file. Which in the case of directories, will also apply to newly created files/subdirectories, children of that directory. In the simple unaltered case of a newly created exofs, no layout attributes are present, and all layouts adhere to the layout specified at the device-table. * In case of a future file system loaded in an old exofs-driver. At iget(), the generating_function is inspected and if not supported will return an IO error to the application and the inode will not be loaded. So not to damage any data. Note: After this patch we do not yet support any type of layout only the RAID0 patch that enables striping at the super-block level will add support for RAID0 layouts above. This way we are past and future compatible and fully bisectable.
* Access to the device table is done by an accessor since it will change according to above information. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:35:28 -08:00
Boaz Harrosh	46f4d973f6	exofs: unindent exofs_sbi_read The original idea was that a mirror read can be sub-divided to multiple devices. But this has very little gain and only at very large IOes so it's not going to be implemented soon. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:35:27 -08:00
Boaz Harrosh	45d3abcb1a	exofs: Move layout related members to a layout structure * Abstract away those members in exofs_sb_info that are related/needed by a layout into a new exofs_layout structure. Embed it in exofs_sb_info. * At exofs_io_state receive/keep a pointer to an exofs_layout. No need for an exofs_sb_info pointer, all we need is at exofs_layout. * Change any usage of above exofs_sb_info members to their new name. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:35:27 -08:00
Boaz Harrosh	22ddc55638	exofs: Recover in the case of read-past-end-of-file In check_io, implement the case of reading past the end of the file, by clearing the pages and recovering with no error. In a raid arrangement this can become a legitimate situation in case of holes in the file. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:35:26 -08:00
Boaz Harrosh	518f167a37	exofs: Micro-optimize exofs_i_info optimize the exofs_i_info struct usage by moving the embedded vfs_inode to be first. A compiler might optimize away an "add" operation with constant zero. (Which it cannot with other constants) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:35:25 -08:00
Boaz Harrosh	34ce4e7c23	exofs: debug print even less * Last debug trimming left in some stupid print, remove them. Fixup some other prints * Shift printing from inode.c to ios.c * Add couple of prints when memory allocation fails. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2010-02-28 03:35:25 -08:00
Wengang Wang	5051f76883	ocfs2: send SIGXFSZ if new filesize exceeds limit -v2 This patch makes ocfs2 send SIGXFSZ if new file size exceeds the rlimit. Processes may get SIGXFSZ on one node (in the cluster) while others will not on another if file size limits are different on the two nodes. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-27 20:08:51 -08:00
Sunil Mushran	6fcef3f04a	ocfs2/userdlm: Add tracing in userdlm Make use of the newly added BASTS masklog to trace ASTs and BASTs in userdlm. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-27 19:57:07 -08:00
Sunil Mushran	9b915181af	ocfs2: Use a separate masklog for AST and BASTs This patch adds a new masklog and uses it to allow tracing ASTs and BASTs in the dlmglue layer. This has been found to be very useful in debugging cluster locking issues. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-27 19:57:06 -08:00
Christian Kujau	4912002fff	Remove EXPERIMENTAL from NFS_FSCACHE There's currently an open Ubuntu bug[0], with the intent to compile NFS_FSCACHE (and possibly AFS_FSCACHE, 9P_FSCACHE) into the standard Ubuntu kernel. However, since *_FSCACHE still depends on EXPERIMENTAL, this won't happen. As Arjan van de Ven pointed out[1], the EXPERIMENTAL flag doesn't mean that much any more, I propose the following patch to fs/nfs/Kconfig. I'd do the same for fs/9p/Kconfig and fs/afs/Kconfig, but as I did not test 9p or AFS, I feel it would not be appropriate for me to remove the flag. [0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/440522/comments/5 [1] http://lkml.org/lkml/2010/1/23/145 Signed-off-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-26 17:22:35 -08:00
Linus Torvalds	4cbd55188f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: use bastmode in debugfs output dlm: Send lockspace name with uevents dlm: send reply before bast dlm: fix ordering of bast and cast	2010-02-26 17:19:30 -08:00
Linus Torvalds	b305956abc	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (52 commits) fs/xfs: Correct NULL test xfs: optimize log flushing in xfs_fsync xfs: only clear the suid bit once in xfs_write xfs: kill xfs_bawrite xfs: log changed inodes instead of writing them synchronously xfs: remove invalid barrier optimization from xfs_fsync xfs: kill the unused XFS_QMOPT_* flush flags V2 xfs: Use delay write promotion for dquot flushing xfs: Sort delayed write buffers before dispatch xfs: Don't issue buffer IO direct from AIL push V2 xfs: Use delayed write for inodes rather than async V2 xfs: Make inode reclaim states explicit xfs: more reserved blocks fixups xfs: turn off sign warnings xfs: don't hold onto reserved blocks on remount,ro xfs: quota limit statvfs available blocks xfs: replace KM_LARGE with explicit vmalloc use xfs: cleanup up xfs_log_force calling conventions xfs: kill XLOG_VEC_SET_TYPE xfs: remove duplicate buffer flags ...	2010-02-26 17:18:52 -08:00
Linus Torvalds	f24407d2bd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt: xfs: fix xfs to work with Virtually Indexed architectures sh: add mm API for DMA to vmalloc/vmap areas arm: add mm API for DMA to vmalloc/vmap areas parisc: add mm API for DMA to vmalloc/vmap areas mm: add coherence API for DMA to vmalloc/vmap areas	2010-02-26 17:05:10 -08:00
Srinivas Eeda	bc9838c4d4	dlm: allow dlm do recovery during shutdown If a node down event happens while dlm shutdown is in progress, dlm recovery should be done before dlm is shut down. We can't migrate unrecovered locks, obviously. But dlm_reco_thread only does recovery if the dlm_state is in DLM_CTXT_JOINED. dlm_reco_thread should do recovery if dlm_state is in DLM_CTXT_JOINED or DLM_CTXT_IN_SHUTDOWN. Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:19 -08:00
Tao Ma	cbaee472f2	ocfs2: Only bug out in direct io write for reflinked extent. In ocfs2_direct_IO_get_blocks, we only need to bug out in case we are going to write a refcounted extent rec. What a silly bug introduced by me! Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: stable@kernel.org	2010-02-26 15:41:19 -08:00
Coly Li	66b116c9d8	ocfs2: fix warning in ocfs2_file_aio_write() This patch fixes a compiling warning in ocfs2_file_aio_write(). Signed-off-by: Coly Li <coly.li@suse.de> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:18 -08:00
Joel Becker	cbe0e331fd	ocfs2_dlmfs: Enable the use of user cluster stacks. Unlike ocfs2, dlmfs has no permanent storage. It can't store off a cluster stack it is supposed to be using. So it can't specify the stack name in ocfs2_cluster_connect(). Instead, we create ocfs2_cluster_connect_agnostic(), which simply uses the stack that is currently enabled. This is fine for dlmfs, which will rely on the stack initialization. We add the "stackglue" capability to dlmfs's capability list. This lets userspace know dlmfs can be used with all cluster stacks. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:18 -08:00
Joel Becker	0016eedc41	ocfs2_dlmfs: Use the stackglue. Rather than directly using o2dlm, dlmfs can now use the stackglue. This allows it to use userspace cluster stacks and fs/dlm. This commit forces o2cb for now. A later commit will bump the protocol version and allow non-o2cb stacks. This is one big sed, really. LKM_xxMODE becomes DLM_LOCK_xx. LKM_flag becomes DLM_LKF_flag. We also learn to check that the LVB is valid before reading it. Any DLM can lose the contents of the LVB during a complicated recovery. userdlm should be checking this. Now it does. dlmfs will return 0 from read(2) if the LVB was invalid. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:18 -08:00
Joel Becker	e8fce482f3	ocfs2_dlmfs: Don't honor truncate. The size of a dlmfs file is LVB_LEN. We want folks using dlmfs to be able to use the LVB in places other than just write(2)/read(2). By ignoring truncate requests, we allow 'echo "contents" > /dlm/space/lockname' to work. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:18 -08:00
Joel Becker	553b5eb91a	ocfs2: Pass the locking protocol into ocfs2_cluster_connect(). Inside the stackglue, the locking protocol structure is hanging off of the ocfs2_cluster_connection. This takes it one further; the locking protocol is passed into ocfs2_cluster_connect(). Now different cluster connections can have different locking protocols with distinct asts. Note that all locking protocols have to keep their maximum protocol version in lock-step. With the protocol structure set in ocfs2_cluster_connect(), there is no need for the stackglue to have a static pointer to a specific protocol structure. We can change initialization to only pass in the maximum protocol version. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:17 -08:00
Joel Becker	e603cfb074	ocfs2: Remove the ast pointers from ocfs2_stack_plugins With the full ocfs2_locking_protocol hanging off of the ocfs2_cluster_connection, ast wrappers can get the ast/bast pointers there. They don't need to get them from their plugin structure. The user plugin still needs the maximum locking protocol version, though. This changes the plugin structure so that it only holds the max version, not the entire ocfs2_locking_protocol pointer. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:16 -08:00
Joel Becker	110946c8fb	ocfs2: Hang the locking proto on the cluster conn and use it in asts. With the ocfs2_cluster_connection hanging off of the ocfs2_dlm_lksb, we have access to it in the ast and bast wrapper functions. Attach the ocfs2_locking_protocol to the conn. Now, instead of referring to a static variable for ast/bast pointers, the wrappers can look at the connection. This means different connections can have different ast/bast pointers, and it reduces the need for the static pointer. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:16 -08:00
Joel Becker	c0e4133851	ocfs2: Attach the connection to the lksb We're going to want it in the ast functions, so we convert union ocfs2_dlm_lksb to struct ocfs2_dlm_lksb and let it carry the connection. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:14 -08:00
Joel Becker	a796d2862a	ocfs2: Pass lksbs back from stackglue ast/bast functions. The stackglue ast and bast functions tried to maintain the fiction that their arguments were void pointers. In reality, stack_user.c had to know that the argument was an ocfs2_lock_res in order to get the status off of the lksb. That's ugly. This changes stackglue to always pass the lksb as the argument to ast and bast functions. The caller can always use container_of() to get the ocfs2_lock_res or user_dlm_lock_res. The net effect to the caller is zero. They still get back the lockres in their ast. stackglue gets cleaner, and now can use the lksb itself. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:14 -08:00
Joel Becker	34a9dd7e29	ocfs2_dlmfs: Move to its own directory We're going to remove the tie between ocfs2_dlmfs and o2dlm. ocfs2_dlmfs doesn't belong in the fs/ocfs2/dlm directory anymore. Here we move it to fs/ocfs2/dlmfs. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:14 -08:00
Joel Becker	65b6f34034	ocfs2_dlmfs: Use poll() to signify BASTs. o2dlm's userspace filesystem is an easy way to use the DLM from userspace. It is intentionally simple. For example, it does not allow for asynchronous behavior or lock conversion. This is intentional to keep the interface simple. Because there is no asynchronous notification, there is no way for a process holding a lock to know another node needs the lock. This is the number one complaint of ocfs2_dlmfs users. Turns out, we can solve this very easily. We add poll() support to ocfs2_dlmfs. When a BAST is received, the lock's file descriptor will receive POLLIN. This is trivial to implement. Userdlm already has an appropriate waitqueue, and the lock knows when it is blocked. We add the "bast" capability to tell userspace this is available. Signed-off-by: Joel Becker <joel.becker@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:14 -08:00
Joel Becker	14a437c2b6	ocfs2_dlmfs: Add capabilities parameter. Over time, dlmfs has added some features that were not part of the initial ABI. Unfortunately, some of these features are not detectable via standard usage. For example, Linux's default poll always returns POLLIN, so there is no way for a caller of poll(2) to know when dlmfs added poll support. Instead, we provide this list of new capabilities. Capabilities is a read-only attribute. We do it as a module parameter so we can discover it whether dlmfs is built in, loaded, or even not loaded (via modinfo). The ABI features are local to this machine's dlmfs mount. This is distinct from the locking protocol, which is concerned with inter-node interaction. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:13 -08:00
Joel Becker	399ff3a748	ocfs2: Handle errors while setting external xattr values. ocfs2 can store extended attribute values as large as a single file. It does this using a standard ocfs2 btree for the large value. However, the previous code did not handle all error cases cleanly. There are multiple problems to have. 1) We have trouble allocating space for a new xattr. This leaves us with an empty xattr. 2) We overwrote an existing local xattr with a value root, and now we have an error allocating the storage. This leaves us an empty xattr where there used to be a value. The value is lost. 3) We have trouble truncating a reused value. This leaves us with the original entry pointing to the truncated original value. The value is lost. 4) We have trouble extending the storage on a reused value. This leaves us with the original value safely in place, but with more storage allocated than needed. This doesn't consider storing local xattrs (values that don't require a btree). Those only fail when the journal fails. Case (1) is easy. We just remove the xattr we added. We leak the storage because we can't safely remove it, but otherwise everything is happy. We'll print a warning about the leak. Case (4) is easy. We still have the original value in place. We can just leave the extra storage attached to this xattr. We return the error, but the old value is untouched. We print a warning about the storage. Case (2) and (3) are hard because we've lost the original values. In the old code, we ended up with values that could be partially read. That's not good. Instead, we just wipe the xattr entry and leak the storage. It stinks that the original value is lost, but now there isn't a partial value to be read. We'll print a big fat warning. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:13 -08:00
Joel Becker	139ffacebf	ocfs2: Set inline xattr entries with ocfs2_xa_set() ocfs2_xattr_ibody_set() is the only remaining user of ocfs2_xattr_set_entry(). ocfs2_xattr_set_entry() actually does two things: it calls ocfs2_xa_set(), and it initializes the inline xattrs. Initializing the inline space really belongs in its own call. We lift the initialization to ocfs2_xattr_ibody_init(), called from ocfs2_xattr_ibody_set() only when necessary. Now ocfs2_xattr_ibody_set() can call ocfs2_xa_set() directly. ocfs2_xattr_set_entry() goes away. Another nice fact is that ocfs2_init_dinode_xa_loc() can trust i_xattr_inline_size. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:13 -08:00
Joel Becker	d3981544d7	ocfs2: Set xattr block entries with ocfs2_xa_set() ocfs2_xattr_block_set() calls into ocfs2_xattr_set_entry() with just the HAS_XATTR flag. Most of the machinery of ocfs2_xattr_set_entry() is skipped. All that really happens other than the call to ocfs2_xa_set() is making sure the HAS_XATTR flag is set on the inode. But HAS_XATTR should be set when we also set di->i_xattr_loc. And that's done in ocfs2_create_xattr_block(). So let's move it there, and then ocfs2_xattr_block_set() can just call ocfs2_xa_set(). While we're there, ocfs2_create_xattr_block() can take the set_ctxt for a smaller argument list. It also learns to set HAS_XATTR_FL, because it knows for sure. ocfs2_create_empty_xatttr_block() in the reflink path fakes a set_ctxt to call ocfs2_create_xattr_block(). Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:13 -08:00
Joel Becker	c5d95df5f7	ocfs2: Let ocfs2_xa_prepare_entry() do space checks. ocfs2_xattr_set_in_bucket() doesn't need to do its own hacky space checking. Let's let ocfs2_xa_prepare_entry() (via ocfs2_xa_set()) do the more accurate work. Whenever it doesn't have space, ocfs2_xattr_set_in_bucket() can try to get more space. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:12 -08:00
Joel Becker	bca5e9bd1e	ocfs2: Gell into ocfs2_xa_set() ocfs2_xa_set() wraps the ocfs2_xa_prepare_entry()/ocfs2_xa_store_value() logic. Both callers can now use the same routine. ocfs2_xa_remove() moves directly into ocfs2_xa_set(). Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:11 -08:00
Joel Becker	73857ee0b5	ocfs2: Allocation in ocfs2_xa_prepare_entry(), values in ocfs2_xa_store_value() ocfs2_xa_prepare_entry() gets all the logic to add, remove, or modify external value trees. Now, when it exits, the entry is ready to receive a value of any size. ocfs2_xa_remove() is added to handle the complete removal of an entry. It truncates the external value tree before calling ocfs2_xa_remove_entry(). ocfs2_xa_store_inline_value() becomes ocfs2_xa_store_value(). It can store any value. ocfs2_xattr_set_entry() loses all the allocation logic and just uses these functions. ocfs2_xattr_set_value_outside() disappears. ocfs2_xattr_set_in_bucket() uses these functions and makes ocfs2_xattr_set_entry_in_bucket() obsolete. That goes away, as does ocfs2_xattr_bucket_set_value_outside() and ocfs2_xattr_bucket_value_truncate(). Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:11 -08:00
Joel Becker	cf2bc80940	ocfs2: Teach ocfs2_xa_loc how to do its own journal work We're going to want to make sure our buffers get accessed and dirtied correctly. So have the xa_loc do the work. This includes storing the inode on ocfs2_xa_loc. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:11 -08:00
Joel Becker	3fc12afa0c	ocfs2: Provide ocfs2_xa_fill_value_buf() for external value processing We use the ocfs2_xattr_value_buf structure to manage external values. It lets the value tree code do its work regardless of the containing storage. ocfs2_xa_fill_value_buf() initializes a value buf from an ocfs2_xa_loc entry. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:11 -08:00
Joel Becker	9dc474005d	ocfs2: Handle value tree roots in ocfs2_xa_set_inline_value() Previously the xattr code would send in a fake value, containing a tree root, to the function that installed name+value pairs. Instead, we pass the real value to ocfs2_xa_set_inline_value(), and it notices that the value cannot fit. Thus, it installs a tree root. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:10 -08:00
Joel Becker	69a3e539d0	ocfs2: Set the xattr name+value pair in one place We create two new functions on ocfs2_xa_loc, ocfs2_xa_prepare_entry() and ocfs2_xa_store_inline_value(). ocfs2_xa_prepare_entry() makes sure that the xl_entry field of ocfs2_xa_loc is ready to receive an xattr. The entry will point to an appropriately sized name+value region in storage. If an existing entry can be reused, it will be. If no entry already exists, it will be allocated. If there isn't space to allocate it, -ENOSPC will be returned. ocfs2_xa_store_inline_value() stores the data that goes into the 'value' part of the name+value pair. For values that don't fit directly, this stores the value tree root. A number of operations are added to ocfs2_xa_loc_operations to support these functions. This reflects the disparate behaviors of xattr blocks and buckets. With these functions, the overlapping ocfs2_xattr_set_entry_local() and ocfs2_xattr_set_entry_normal() can be replaced with a single call scheme. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:10 -08:00
Joel Becker	199799a360	ocfs2: Wrap calculation of name+value pair size. An ocfs2 xattr entry stores the text name and value as a pair in the storage area. Obviously names and values can be variable-sized. If a value is too large for the entry storage, a tree root is stored instead. The name+value pair is also padded. Because of this, there are a million places in the code that do: if (needs_external_tree(value_size)) namevalue_size = pad(name_size) + tree_root_size; else namevalue_size = pad(name_size) + pad(value_size); Let's create some convenience functions to make the code more readable. There are three forms. The first takes the raw sizes. The second takes an ocfs2_xattr_info structure. The third takes an existing ocfs2_xattr_entry. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:10 -08:00
Joel Becker	18853b95d1	ocfs2: Add a name_len field to ocfs2_xattr_info. Rather than calculating strlen all over the place, let's store the name length directly on ocfs2_xattr_info. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:09 -08:00
Joel Becker	6b240ff63c	ocfs2: Prefix the member fields of struct ocfs2_xattr_info. struct ocfs2_xattr_info is a useful structure describing an xattr you'd like to set. Let's put prefixes on the member fields so it's easier to read and use. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:09 -08:00
Joel Becker	bde1e5400a	ocfs2: Remove xattrs via ocfs2_xa_loc Add ocfs2_xa_remove_entry(), which will remove an xattr entry from its storage via the ocfs2_xa_loc descriptor. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:09 -08:00
Joel Becker	11179f2c92	ocfs2: Introduce ocfs2_xa_loc The ocfs2 extended attribute (xattr) code is very flexible. It can store xattrs in the inode itself, in an external block, or in a tree of data structures. This allows the number of xattrs to be bounded by the filesystem size. However, the code that manages each possible storage location is different. Maintaining the ocfs2 xattr code requires changing each hunk separately. This patch is the start of a series introducing the ocfs2_xa_loc structure. This structure wraps the on-disk details of an xattr entry. The goal is that the generic xattr routines can use ocfs2_xa_loc without knowing the underlying storage location. This first pass merely implements the basic structure, initializing it, and wiping the name+value pair of the entry. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:08 -08:00
Sunil Mushran	8545e03d82	ocfs2: Add current->comm in trace output Add current->comm to the standard mlog() output to help with debugging. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:08 -08:00
Wengang Wang	96a1cc731a	ocfs2: Clean up the checks for CoW and direct I/O. When ocfs2 has to do CoW for refcounted extents, we disable direct I/O and go through the buffered I/O path. This makes the combined check easier to read. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:07 -08:00
Tiger Yang	b89c54282d	ocfs2: add extent block stealing for ocfs2 v5 This patch adds an extent block (metadata) stealing mechanism for extent allocation. This mechanism is the same as inode stealing: if there is no room in the slot-specific extent_alloc, we will try to allocate an extent block from the next slot. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Acked-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-26 15:41:07 -08:00
Alex Elder	398007f863	Merge branch 'linux-2.6.33'	2010-02-26 14:34:02 -06:00
David Teigland	b6fa8796b2	dlm: use bastmode in debugfs output The bast mode that appears in the debugfs output should be useful on both master and process nodes. lkb_highbast is currently printed, and is only useful on the master node. lkb_bastmode is only useful on the process node. This patch sets lkb_bastmode on the master node as well, and uses that value in the debugfs print. Signed-off-by: David Teigland <teigland@redhat.com>	2010-02-26 12:15:54 -06:00
Steven Whitehouse	b4a5d4bc37	dlm: Send lockspace name with uevents Although it is possible to get this information from the path, it's much easier to provide the lockspace as a separate env variable. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2010-02-26 12:14:25 -06:00
David Teigland	cf6620acc0	dlm: send reply before bast When the lock master processes a successful operation (request, convert, cancel, or unlock), it will process the effects of the change before sending the reply for the operation. The "effects" of the operation are: - blocking callbacks (basts) for any newly granted locks - waiting or converting locks that can now be granted The cast is queued on the local node when the reply from the lock master is received. This means that a lock holder can receive a bast for a lock mode that it doesn't yet know has been granted. Signed-off-by: David Teigland <teigland@redhat.com>	2010-02-26 11:57:37 -06:00
Martin K. Petersen	8a78362c4e	block: Consolidate phys_segment and hw_segment limits Except for SCSI no device drivers distinguish between physical and hardware segment limits. Consolidate the two into a single segment limit. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:08 +01:00
Linus Torvalds	6ebdc661b6	Merge branch 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6 * 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (41 commits) of: remove undefined request_OF_resource & release_OF_resource of/sparc: Remove sparc-local declaration of allnodes and devtree_lock of: move definition of of_chosen into common code. of: remove unused extern reference to devtree_lock of: put default string compare and #a/s-cell values into common header of/flattree: Don't assume HAVE_LMB of: protect linux/of.h with CONFIG_OF proc_devtree: fix THIS_MODULE without module.h of: Remove old and misplaced function declarations of/flattree: Make the kernel accept ePAPR style phandle information of/flattree: endian-convert members of boot_param_header of: assume big-endian properties, adding conversions where necessary of: use __be32 for cell value accessors of/flattree: use OF_ROOT_NODE_{SIZE,ADDR}_CELLS DEFAULT for fdt parsing of/flattree: use callback to setup initrd from /chosen proc_devtree: include linux/of.h of: make set_node_proc_entry private to proc_devtree.c of: include linux/proc_fs.h of/flattree: merge early_init_dt_scan_memory() common code of: add 'of_' prefix to machine_is_compatible() ...	2010-02-25 15:38:37 -08:00
Paul E. McKenney	7dc5215798	vfs: Apply lockdep-based checking to rcu_dereference() uses Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable() and fcheck_files(). Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Alexander Viro <viro@zeniv.linux.org.uk> LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:48 +01:00
Jens Axboe	7f03292ee1	Merge branch 'master' into for-2.6.34 Conflicts: include/linux/blkdev.h Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-25 08:48:05 +01:00
Steve French	d7b619cf56	[CIFS] pSesInfo->sesSem is used as mutex. Rename it to session_mutex and convert it to a real mutex. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-25 05:36:46 +00:00
Chuck Lever	58255a4e3c	NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN The server's callback client should stop trying to connect to the client's callback server as soon as it gets ECONNREFUSED. The NFS server's callback client does not call rpc_ping(), but appears to have its own "ping" procedure, so it wasn't covered by commit `caabea8a`. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-02-24 17:50:28 -08:00
Steve French	122ca0076e	[CIFS] Use unsigned ea length for clarity Jeff correctly noted that using unsigned ea length is more intuitive. CC: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-24 21:56:48 +00:00
David Teigland	7fe2b3190b	dlm: fix ordering of bast and cast When both blocking and completion callbacks are queued for a lock, the dlm would always deliver the completion callback (cast) first. In some cases the blocking callback (bast) is queued before the cast, though, and should be delivered first. This patch keeps track of the order in which they were queued and delivers them in that order. This patch also keeps track of the granted mode in the last cast and eliminates the following bast if the bast mode is compatible with the preceding cast mode. This happens when a remotely mastered lock is demoted, e.g. EX->NL, in which case the local node queues a cast immediately after sending the demote message. In this way a cast can be queued for a mode, e.g. NL, that makes an in-transit bast extraneous. Signed-off-by: David Teigland <teigland@redhat.com>	2010-02-24 11:46:53 -06:00
dingdinghua	23e2af3518	jbd2: clean up an assertion in jbd2_journal_commit_transaction() commit_transaction has the same value as journal->j_running_transaction, so we can simplify the assert statement. Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-02-24 12:11:20 -05:00
Dmitry Monakhov	56c50f11f4	ext4: trivial quota cleanup The patch is aimed to reorganize and simplify quota code a bit. Quota code is itself complex enough, but we can make it more readable in some places: - Move quota option parsing to separate functions. - Simplify old-quota and journaled-quota mix check. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-03-01 23:28:41 -05:00
Dmitry Monakhov	482a74258f	ext4: mount flags manipulation cleanup Replace intermediate EXT4_MOUNT_XXX flags manipulation with the corresponding macros. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-02-24 11:35:32 -05:00
Jiaying Zhang	c8d46e41bc	ext4: Add flag to files with blocks intentionally past EOF fallocate() may potentially instantiate blocks past EOF, depending on the flags used when it is called. e2fsck currently has a test for blocks past i_size, and it sometimes trips up - noticeably on xfstests 013 which runs fsstress. This patch from Jiayang does fix it up - it (along with e2fsprogs updates and other patches recently from Aneesh) has survived many fsstress runs in a row. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-02-24 09:52:53 -05:00
Jeff Layton	835a36ca4a	cifs: set server_eof in cifs_fattr_to_inode Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-23 23:28:43 +00:00
Steve French	96c03bccc7	[CIFS] Minor cleanup to EA patch CC: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-23 20:51:43 +00:00
Jeff Layton	31c0519f7a	cifs: merge CIFSSMBQueryEA with CIFSSMBQAllEAs Add an "ea_name" parameter to CIFSSMBQAllEAs. When it's set make it behave like CIFSSMBQueryEA does now. The current callers of CIFSSMBQueryEA are converted to use CIFSSMBQAllEAs, and the old CIFSSMBQueryEA function is removed. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-23 20:47:32 +00:00
Jeff Layton	0cd126b504	cifs: verify lengths of QueryAllEAs reply Make sure the lengths in a QUERY_ALL_EAS reply don't make the parser walk off the end of the SMB. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-23 20:47:11 +00:00
Jeff Layton	e529614ad0	cifs: increase maximum buffer size in CIFSSMBQAllEAs It's 4000 now, but there's no reason to limit it to that. We should be able to handle a response up to CIFSMaxBufSize. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-23 20:46:49 +00:00
Jeff Layton	6e462b9f2c	cifs: rename name_len to list_len in CIFSSMBQAllEAs ...for clarity and so we can reuse the name for the real name_len. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-23 20:46:27 +00:00
Jeff Layton	f0d3868b78	cifs: clean up indentation in CIFSSMBQAllEAs Add a label that we can goto on error, and reduce some of the if/then/else indentation in this function. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-23 20:45:52 +00:00
Jeff Layton	370b41911c	cifs: add parens around smb_var in BCC macros ...to remove ambiguity about how these values are interpreted when passing in more complex values as arguments. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-23 20:45:21 +00:00
Michael Neuling	a17e18790a	fs/exec.c: fix initial stack reservation `803bf5ec25` ("fs/exec.c: restrict initial stack space expansion to rlimit") attempts to limit the initial stack to 20*PAGE_SIZE. Unfortunately, in attempting to ensure the stack is not reduced in size, we ended up not changing the stack at all. This size reduction check is not necessary as the expand_stack call does this already. This caused a regression in UML resulting in most guest processes being killed. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jouni Malinen <j@w1.fi> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-22 19:50:34 -08:00
stephen hemminger	1cc523271e	seq_file: add RCU versions of new hlist/list iterators (v3) Many usages of seq_file use RCU protected lists, so non RCU iterators will not work safely. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-22 15:45:54 -08:00
Jens Axboe	f11cbd74c5	Merge branch 'master' into for-2.6.34	2010-02-22 13:48:51 +01:00
Ben Myers	978ebd97d1	xfs_export_operations.commit_metadata This is the commit_metadata export operation for XFS. - Takes one inode to be committed. - Forces the log up to the lsn of the inode. - Doesn't force the log if the inode doesn't have a pincount. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> [bfields@citi.umich.edu: trivial whitespace fix] Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-02-20 13:14:50 -08:00
Ben Myers	f501912a35	commit_metadata export operation replacing nfsd_sync_dir - Add commit_metadata export_operation to allow the underlying filesystem to decide how to commit an inode most efficiently. - Usage of nfsd_sync_dir and write_inode_now has been replaced with the commit_metadata function that takes a svc_fh. - The commit_metadata function calls the commit_metadata export_op if it's there, or else falls back to sync_inode instead of fsync and write_inode_now because only metadata need be synced here. - nfsd4_sync_rec_dir now uses vfs_fsync so that commit_metadata can be static Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-02-20 13:13:44 -08:00
David Howells	8f9941aecc	CacheFiles: Fix a race in cachefiles_delete_object() vs rename cachefiles_delete_object() can race with rename. It gets the parent directory of the object it's asked to delete, then locks it - but rename may have changed the object's parent between the get and the completion of the lock. However, if such a circumstance is detected, we abandon our attempt to delete the object - since it's no longer in the index key path, it won't be seen again by lookups of that key. The assumption is that cachefilesd may have culled it by renaming it to the graveyard for later destruction. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-20 10:06:35 -05:00
Jiro SEKIBA	0d561f12b4	nilfs2: add reader's lock for cno in nilfs_ioctl_sync This adds reader's lock for the_nilfs->cno in nilfs_ioctl_sync, for the_nilfs->cno should be protected by segctor_sem when reading. Signed-off-by: Jiro SEKIBA <jir@unicus.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-02-20 21:18:19 +09:00
Chuck Ebbert	aeaa5ccd64	vfs: don't call ima_file_check() unconditionally in nfsd_open() commit `1e41568d73` ("Take ima_path_check() in nfsd past dentry_open() in nfsd_open()") moved this code back to its original location but missed the "else". Signed-off-by: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-20 00:47:31 -05:00
Al Viro	7fee4868be	Switch proc/self to nd_set_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-19 10:25:41 -05:00
Al Viro	ac278a9c50	fix LOOKUP_FOLLOW on automount "symlinks" Make sure that automount "symlinks" are followed regardless of LOOKUP_FOLLOW; it should have no effect on them. Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-19 03:56:42 -05:00
Al Viro	c44dcc56d2	switch inotify_user to anon_inode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-19 03:35:12 -05:00
Jiro SEKIBA	03f29365e8	nilfs2: delete unnecessary condition in load_segment_summary This is a trivial patch to remove unnecessary condition. load_segment_summary() checks crc of segment_summary OR crc of whole log data blocks based on boolean argument full_check. However, callers of the function pass only 1 as full_check, which means only whole log data blocks checking code is running all the time. This patch deletes the condition and full_check argument and also deletes enum 'NILFS_SEG_FAIL_CHECKSUM_SEGSUM' and corresponding case clause, for it is no longer used. Signed-off-by: Jiro SEKIBA <jir@unicus.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-02-18 20:09:03 +09:00
David S. Miller	2bb4646fce	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2010-02-16 22:09:29 -08:00
Tejun Heo	003cb608a2	percpu: add __percpu sparse annotations to fs Add __percpu sparse annotations to fs. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk>	2010-02-17 11:17:38 +09:00
Eric W. Biederman	7c0ff870d1	sysfs: sysfs_sd_setattr set iattrs unconditionally There is currently a bug in sysfs_sd_setattr inherited from sysfs_setattr in 2.6.32 where the first time we set the attributes on a sysfs file we allocate backing store but do not set the backing store attributes. Resulting in overly restrictive permissions on sysfs files. The fix is to simply modify the code so that it always executes when we update the sysfs attributes, as we did in 2.6.31 and earlier. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Tested-by: Jean Delvare <khali@linux-fr.org> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-02-16 15:42:42 -08:00
Curt Wohlgemuth	73b50c1c92	ext4: Fix BUG_ON at fs/buffer.c:652 in no journal mode Calls to ext4_handle_dirty_metadata should only pass in an inode pointer for inode-specific metadata, and not for shared metadata blocks such as inode table blocks, block group descriptors, the superblock, etc. The BUG_ON can get tripped when updating a special device (such as a block device) that is opened (so that i_mapping is set in fs/block_dev.c) and the file system is mounted in no journal mode. Addresses-Google-Bug: #2404870 Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-02-16 15:06:29 -05:00
Linus Torvalds	0813e22d4e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: btrfs_mark_extent_written uses the wrong slot	2010-02-15 19:56:21 -08:00
Chuck Lever	65d269538a	NFS: Too many GETATTR and ACCESS calls after direct I/O The cached read and write paths initialize fattr->time_start in their setup procedures. The value of fattr->time_start is propagated to read_cache_jiffies by nfs_update_inode(). Subsequent calls to nfs_attribute_timeout() will then use a good time stamp when computing the attribute cache timeout, and squelch unneeded GETATTR calls. Since the direct I/O paths erroneously leave the inode's fattr->time_start field set to zero, read_cache_jiffies for that inode is set to zero after any direct read or write operation. This triggers an otw GETATTR or ACCESS call to update the file's attribute and access caches properly, even when the NFS READ or WRITE replies have usable post-op attributes. Make sure the direct read and write setup code performs the same fattr initialization as the cached I/O paths to prevent unnecessary GETATTR calls. This was likely introduced by commit `0e574af1` in 2.6.15, which appears to add new nfs_fattr_init() call sites in the cached read and write paths, but not in the equivalent places in fs/nfs/direct.c. A subsequent commit in the same series, `33801147`, introduces the fattr->time_start field. Interestingly, the direct write reschedule path already has a call to nfs_fattr_init() in the right place. Reported-by: Quentin Barnes <qbarnes@yahoo-inc.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-15 19:53:43 -08:00
Linus Torvalds	0aa2ca9ae1	Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: reiserfs: Fix softlockup while waiting on an inode	2010-02-15 19:51:45 -08:00
dingdinghua	ba869023ea	jbd2: delay discarding buffers in journal_unmap_buffer Delay discarding buffers in journal_unmap_buffer until we know that the "add to orphan" operation has definitely been committed; otherwise the log space of the committing transaction may be freed and reused before the truncate gets committed, and updates may get lost if a crash happens. Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-02-15 16:35:42 -05:00
Leonard Michlmayr	aca92ff6f5	ext4: correctly calculate number of blocks for fiemap ext4_fiemap() rounds the length of the requested range down to blocksize, which is not the true number of blocks that cover the requested region. This problem is especially impressive if the user requests only the first byte of a file: not a single extent will be reported. We fix this by calculating the last block of the region and then subtract to find the number of blocks in the extents. Signed-off-by: Leonard Michlmayr <leonard.michlmayr@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-03-04 17:07:28 -05:00
Roel Kluin	9aaab0589b	ext4: add missing error checking to ext4_expand_extra_isize_ea() Signed-off-by: Roel Kluin <roel.kluin@gmail.com>	2010-02-15 14:26:16 -05:00
Eric Sandeen	12062dddda	ext4: move __func__ into a macro for ext4_warning, ext4_error Just a pet peeve of mine; we had a mishmash of calls with either __func__ or "function_name" and the latter tends to get out of sync. I think it's easier to just hide the __func__ in a macro, and it'll be consistent from then on. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-02-15 14:19:27 -05:00
Frederic Weisbecker	175359f89d	reiserfs: Fix softlockup while waiting on an inode When we wait for an inode through reiserfs_iget(), we hold the reiserfs lock. And waiting for an inode may imply waiting for its writeback. But the inode writeback path may also require the reiserfs lock, which leads to a deadlock. We just need to release the reiserfs lock from reiserfs_iget() to fix this. Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Christian Kujau <lists@nerdbynature.de> Cc: Chris Mason <chris.mason@oracle.com>	2010-02-14 19:07:56 +01:00
Jeremy Kerr	7c540d9e3d	proc_devtree: fix THIS_MODULE without module.h Commit `e22f628395` introduced a build breakage for ARM devtree work: the THIS_MODULE macro was added, but we don't have module.h This change adds the necessary #include to get THIS_MODULE defined. While we could just replace it with NULL (PROC_FS is a bool, not a tristate), using THIS_MODULE will prevent unexpected breakage if we ever do compile this as a module. Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Michal Simek <monstr@monstr.eu>	2010-02-14 07:13:41 -07:00
Julia Lawall	d67b1b0325	fs/xfs: Correct NULL test Test the value that was just allocated rather than the previously tested one. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r@ expression x; expression e; identifier l; @@ if (x == NULL \|\| ...) { ... when forall return ...; } ... when != goto l; when != x = e when != &x x == NULL // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-13 13:22:53 -06:00
Ryusuke Konishi	d1c6b72a72	nilfs2: move iterator to write log into segment buffer This moves iterator to submit write requests for a series of logs into segbuf.c, and hides nilfs_segbuf_write() and nilfs_segbuf_wait() in the file. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-02-13 12:26:03 +09:00
Ryusuke Konishi	e605f0a724	nilfs2: get rid of s_dirt flag use This replaces s_dirt flag use in nilfs with a new flag added on the nilfs object. The s_dirt flag was used to indicate if sop->write_super() should be called, however the current version of nilfs does not use the callback. Thus, it can be replaced with the own flag. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Jiro SEKIBA <jir@unicus.jp>	2010-02-13 12:26:03 +09:00
Ryusuke Konishi	dcd7618695	nilfs2: get rid of nilfs_segctor_req struct This will clean up nilfs_segctor_req struct and the obscure request argument passed among private methods of segment constructor. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-02-13 12:26:03 +09:00
Jiro SEKIBA	086d1764b2	nilfs2: delete unnecessary condition in nilfs_dat_translate This is a trivial patch to delete unnecessary condition in nilfs_dat_translate. nilfs_dat_translate() will assign translated address to *blocknrp if blocknrp is not NULL. However the condition is unneeded, because all callers of nilfs_dat_translate() pass blocknrp properly. Signed-off-by: Jiro SEKIBA <jir@unicus.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-02-13 12:26:03 +09:00
Ryusuke Konishi	fe5f171bb2	nilfs2: fix potential hang in nilfs_error on errors=remount-ro nilfs_error() calls nilfs_detach_segment_constructor() if errors=remount-ro option is specified, and this may lead to a hang due to recursive locking of, for instance, nilfs->ns_segctor_sem and others. In this case, detaching segment constructor is not necessary because read-only flag is set to the filesystem and further writes are blocked. This fixes the potential hang issue by removing the nilfs_detach_segment_constructor() call from nilfs_error. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-02-13 12:26:03 +09:00
Ryusuke Konishi	7512487e6d	nilfs2: use mnt_want_write in ioctls where write access is needed A few nilfs2 ioctls need to ask for and then later release write access to the mount in order to avoid potential write to read-only mounts. This adds the missing mnt_want_write and mnt_drop_write in nilfs_ioctl_change_cpmode, nilfs_ioctl_delete_checkpoint, and nilfs_ioctl_clean_segments. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-02-13 12:26:02 +09:00
Jiro SEKIBA	e902ec9906	nilfs2: issue discard request after cleaning segments This adds a function to send discard requests for given array of segment numbers, and calls the function when garbage collection succeeded. Signed-off-by: Jiro SEKIBA <jir@unicus.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-02-13 12:26:02 +09:00
Shaohua Li	3f6fae9559	Btrfs: btrfs_mark_extent_written uses the wrong slot My test do: fallocate a big file and do write. The file is 512M, but after file write is done btrfs-debug-tree shows: item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53 extent data disk byte 1103101952 nr 536870912 extent data offset 0 nr 399634432 ram 536870912 extent compression 0 Looks like a regression introduced by `6c7d54ac87`, where we set wrong slot. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-02-12 16:47:19 -05:00
Christoph Hellwig	180040b89e	xfs: optimize log flushing in xfs_fsync If we have a pinned inode it must have a log item attached to it. Usually that log item will have ili_last_lsn already set, in which case we only need to flush the log up to that LSN instead of doing a full log force. This gives speedups of about 5% in some fsync heavy workloads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-12 13:45:14 -06:00
Christoph Hellwig	87185517de	xfs: only clear the suid bit once in xfs_write file_remove_suid already calls into ->setattr to clear the suid and sgid bits if needed, no need to start a second transaction to do it ourselves. Note that xfs_write_clear_setuid issues a sync transaction while the path through ->setattr doesn't, but that is consistent with the other filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-12 13:43:57 -06:00
Steven Whitehouse	07ccb7bf2c	GFS2: Fix bmap allocation corner-case bug This patch solves a corner case during allocation which occurs if both metadata (indirect) and data blocks are required but there is an obstacle in the filesystem (e.g. a resource group header or another allocated block) such that when the allocation is requested only enough blocks for the metadata are returned. By changing the exit condition of this loop, we ensure that a minimum of one data block will always be returned. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-02-12 10:16:14 +00:00
Abhijith Das	0e5a9fb042	GFS2: Fix error code We need this one-liner to signal the mount helper of the 'insufficient journals' condition. Signed-off-by: Abhijith Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-02-12 10:15:51 +00:00
Linus Torvalds	efa82bab8e	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Fix the mapping of the NFSERR_SERVERFAULT error NFS: Remove a redundant check for PageFsCache in nfs_migrate_page() NFS: Fix a bug in nfs_fscache_release_page()	2010-02-11 14:06:28 -08:00
Linus Torvalds	fd48d6c888	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6: [SCSI] qla2xxx: Obtain proper host structure during response-queue processing. [SCSI] compat_ioct: fix bsg SG_IO [SCSI] qla2xxx: make msix interrupt handler safe for irq [SCSI] zfcp: Report FC BSG errors in correct field [SCSI] mptfusion : mptscsih_abort return value should be SUCCESS instead of value 0.	2010-02-11 14:05:55 -08:00
Michael Neuling	803bf5ec25	fs/exec.c: restrict initial stack space expansion to rlimit When reserving stack space for a new process, make sure we're not attempting to expand the stack by more than rlimit allows. This fixes a bug caused by `b6a2fea393` ("mm: variable length argument support") and unmasked by `fc63cf2370` ("exec: setup_arg_pages() fails to return errors"). This bug means that when limiting the stack to less than 20*PAGE_SIZE (eg. 80K on 4K pages or 'ulimit -s 79') all processes will be killed before they start. This is particularly bad with 64K pages, where a ulimit below 1280K will kill every process. To test, do: 'ulimit -s 15; ls' before and after the patch is applied. Before it's applied, 'ls' should be killed. After the patch is applied, 'ls' should no longer be killed. A stack limit of 15KB since it's small enough to trigger 20*PAGE_SIZE. Also 15KB not a multiple of PAGE_SIZE, which is a trickier case to handle correctly with this code. 4K pages should be fine to test with. [kosaki.motohiro@jp.fujitsu.com: cleanup] [akpm@linux-foundation.org: cleanup cleanup] Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-11 13:59:43 -08:00
Andreas Schwab	4cfbafd33f	compat_ioctl: add compat handler for TIOCGSID ioctl This is used by tcgetsid(3). Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-11 13:59:42 -08:00
Li Zefan	66655de6d1	seq_file: Add helpers for iteration over a hlist Some places in kernel need to iterate over a hlist in seq_file, so provide some common helpers. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-10 11:12:06 -08:00
Arnd Bergmann	f79f118528	compat_ioctl: ignore RAID_VERSION ioctl md ioctls are now handled by the md driver itself, but mdadm may call RAID_VERSION on other devices as well. Mark the command as IGNORE_IOCTL so this fails silently rather than printing an annoying message. Reported-by: "Michael S. Tsirkin" <m.s.tsirkin@gmail.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-10 07:36:16 -08:00
Linus Torvalds	5551638acb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix dentry hash calculation for case-insensitive mounts [CIFS] Don't cache timestamps on utimes due to coarse granularity [CIFS] Maximum username length check in session setup does not match cifs: fix length calculation for converted unicode readdir names [CIFS] Add support for TCP_NODELAY	2010-02-10 07:16:44 -08:00
Chuck Lever	f895c53f8a	NFS: Make close(2) asynchronous when closing NFS O_DIRECT files For NFSv2 and v3: O_DIRECT writes are always synchronous, and aren't cached, so nothing should be flushed when closing an NFS O_DIRECT file descriptor. Thus there are no write errors to report on close(2). In addition, there's no cached data to verify on the next open(2), so we don't need clean GETATTR results at close time to compare with. Thus, there's no need for the nfs_revalidate_inode() call when closing an NFS O_DIRECT file. This reduces the number of synchronous on-the-wire requests for a simple open-write-close of an NFS O_DIRECT file by roughly 20%. For NFSv4: Call nfs4_do_close() with wait set to zero when closing an NFS O_DIRECT file. The CLOSE will go on the wire, but the application won't wait for it to complete. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:31:05 -05:00
Chuck Lever	7e381172cf	NFS: Improve NFS iostat byte count accuracy for writes The bytes counted by the performance counters for NFS writes should reflect write and sync errors. If the write(2) system call reports an error, the bytes should not be counted. And, if the write is short, the actual number of bytes that was written should be counted, not the number of bytes that was requested. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:31:04 -05:00
Chuck Lever	aa2f1ef10e	NFS: Account for NFS bytes read via the splice API Bytes read via the splice API should be accounted for in the NFS performance statistics. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:31:03 -05:00
Chuck Lever	4184dcf2db	NFS: Fix byte accounting for generic NFS reads Currently, the NFS I/O counters count the number of bytes requested by applications, rather than the number of bytes actually read by the system calls. The number of bytes requested for reads is actually not that useful, because the value is usually a buffer size for reads. That is, that requested number is usually a maximum, and frequently doesn't reflect the actual number of bytes read. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:31:03 -05:00
Chuck Lever	c2459dc462	NFS: Proper accounting for NFS VFS calls Nit: The VFSOPEN and VFSFLUSH counters are function call counters. Count every call to these routines. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:31:02 -05:00
Andy Adamson	9733f0d928	nfs41: cleanup callback code to use __be32 type Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:31:01 -05:00
Andy Adamson	41f54a5548	nfs41: clear NFS4CLNT_RECALL_SLOT bit on session reset Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:31:00 -05:00
Andy Adamson	bae0ac0ee1	nfs41: fix nfs4_callback_recallslot Return NFS4_OK if target high slotid equals enforced high slotid. Fix nfs_client reference leak. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:31:00 -05:00
Andy Adamson	104aeba484	nfs41: resize slot table in reset When session is reset, client can renegotiate slot table size. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:59 -05:00
Andy Adamson	b9efa1b27e	nfs41: implement cb_recall_slot Drain the fore channel and reset the max_slots to the new value. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:59 -05:00
Andy Adamson	4911096f1a	nfs41: back channel drc minimal implementation For now the back channel ca_maxresponsesize_cached is 0 and there is no backchannel DRC. Return NFS4ERR_REP_TOO_BIG_TO_CACHE when a cb_sequence cachethis is true. When it is false, return NFS4ERR_RETRY_UNCACHED_REP as the next operation error. Remember the replay error across compound operation processing. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:58 -05:00
Andy Adamson	b2f28bd783	nfs41: prepare for back channel drc Make all cb_sequence arguments available to verify_seqid which will make replay decisions. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:58 -05:00
Andy Adamson	e95e60daee	nfs41: remove uneeded checks in callback processing All callback operations have arguments to decode and require processing. The preprocess_nfs4X_op functions catch unsupported or illegal ops so decode_args and process_op pointers are always non NULL. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:57 -05:00
Andy Adamson	b92b301900	nfs41: directly encode back channel error Skip all other processing when error is encountered. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:56 -05:00
Andy Adamson	31d2b4356b	nfs41: fix wrong error on callback header xdr overflow Set NFS4ERR_RESOURCE as CB_COMPOUND status and do not return an op on decode_op_hdr or encode_op_hdr buffer overflow. NFS4ERR_RESOURCE is correct for v4.0. Will fix the return for v4.1 along with all the other NFS4ERR_RESOURCE errors in a later patch. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:56 -05:00
Mike Sager	72ce2b3c06	nfs41: Process callback's referring call list If a CB_SEQUENCE referring call triple matches a slot table entry, the client is still waiting for a response to the original request. In this case, return NFS4ERR_DELAY as the response to the callback. Signed-off-by: Mike Sager <sager@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:55 -05:00
Mike Sager	a7989c3e47	nfs41: Check slot table for referring calls Traverse a list of referring calls and look for a session/slot/seq number match. Signed-off-by: Mike Sager <sager@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:55 -05:00
Mike Sager	8e0d46e138	nfs41: Adjust max cache response size value For the CREATE_SESSION attribute ca_maxresponsesize_cached, calculate the value based on the rpc reply header size plus the maximum nfs compound reply size. Signed-off-by: Mike Sager <sager@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:54 -05:00
Jeff Layton	97cefcc6d0	nfs: handle NFSv2 -EKEYEXPIRED returns from RPC layer appropriately Add a wrapper around rpc_call_sync that handles -EKEYEXPIRED errors from the RPC layer as it would an -EJUKEBOX error if NFSv2 had such a thing. Also, add a handler for that error for async calls that makes it resubmit the RPC on -EKEYEXPIRED. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:52 -05:00
Jeff Layton	b68d69b8c6	nfs: handle NFSv3 -EKEYEXPIRED errors as we would -EJUKEBOX We're using -EKEYEXPIRED to indicate that a krb5 credcache contains an expired ticket and that we should have the NFS layer retry the RPC call instead of returning an error back to the caller. Handle this as we would an -EJUKEBOX error return. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:51 -05:00
Jeff Layton	2c6434888c	nfs4: handle -EKEYEXPIRED errors from RPC layer If a KRB5 TGT ticket expires, we don't want to return an error immediately. If someone has a long running job and just forgets to run "kinit" in time then this will make it fail. Instead, we want to treat this situation as we would NFS4ERR_DELAY and retry the upcall after delaying a bit with an exponential backoff. This patch just makes any place that would handle NFS4ERR_DELAY also handle -EKEYEXPIRED the same way. In the future, we may want to be more sophisticated however and handle hard vs. soft mounts differently, or specify some upper limit on how long we'll wait for a new TGT to be acquired. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-10 08:30:50 -05:00
Trond Myklebust	fdcb45777a	NFS: Fix the mapping of the NFSERR_SERVERFAULT error It was recently pointed out that the NFSERR_SERVERFAULT error, which is designed to inform the user of a serious internal error on the server, was being mapped to an error value that is internal to the kernel. This patch maps it to the error EREMOTEIO, which is exported to userland through errno.h. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-02-09 14:29:29 -05:00
Trond Myklebust	7549ad5f9b	NFS: Remove a redundant check for PageFsCache in nfs_migrate_page() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: David Howells <dhowells@redhat.com>	2010-02-09 14:29:21 -05:00
Trond Myklebust	2c1740098c	NFS: Fix a bug in nfs_fscache_release_page() Not having an fscache cookie is perfectly valid if the user didn't mount with the fscache option. This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=15234 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: David Howells <dhowells@redhat.com> Cc: stable@kernel.org	2010-02-09 14:29:10 -05:00
Linus Torvalds	3af9cf11b6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: 9p: fix p9_client_destroy unconditional calling v9fs_put_trans 9p: fix memory leak in v9fs_parse_options() 9p: Fix the kernel crash on a failed mount 9p: fix option parsing 9p: Include fsync support for 9p client net/9p: fix statsize inside twstat net/9p: fail when user specifies a transport which we can't find net/9p: fix virtio transport to correctly update status on connect	2010-02-09 11:19:06 -08:00
Jeremy Kerr	50ab2fe147	proc_devtree: include linux/of.h Currently, proc_devtree.c depends on asm/prom.h to include linux/of.h, to provide some device-tree definitions (eg, struct property). Instead, include linux/of.h directly. We still need asm/prom.h for HAVE_ARCH_DEVTREE_FIXUPS. Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>	2010-02-09 08:34:10 -07:00
Jeremy Kerr	8cfb3343f7	of: make set_node_proc_entry private to proc_devtree.c We only need set_node_proc_entry in proc_devtree.c, so move it there. This fixes the !HAVE_ARCH_DEVTREE_FIXUPS build, as we can't make the definition in linux/of.h conditional on this #define (definitions in asm/prom.h can't be exposed to linux/of.h, due to the enforced #include ordering). Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>	2010-02-09 08:34:10 -07:00
Linus Torvalds	deb0c98c7f	Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linux * 'for-2.6.33' of git://linux-nfs.org/~bfields/linux: Revert "nfsd4: fix error return when pseudoroot missing"	2010-02-08 17:08:01 -08:00
Linus Torvalds	a5f28ae4df	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2/cluster: Make o2net connect messages KERN_NOTICE ocfs2/dlm: Fix printing of lockname ocfs2: Fix contiguousness check in ocfs2_try_to_merge_extent_map() ocfs2/dlm: Remove BUG_ON in dlm recovery when freeing locks of a dead node ocfs2: Plugs race between the dc thread and an unlock ast message ocfs2: Remove overzealous BUG_ON during blocked lock processing ocfs2: Do not downconvert if the lock level is already compatible ocfs2: Prevent a livelock in dlmglue ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast ocfs2: Use compat_ptr in reflink_arguments. ocfs2/dlm: Handle EAGAIN for compatibility - v2 ocfs2: Add parenthesis to wrap the check for O_DIRECT. ocfs2: Only bug out when page size is larger than cluster size. ocfs2: Fix memory overflow in cow_by_page. ocfs2/dlm: Print more messages during lock migration ocfs2/dlm: Ignore LVBs of locks in the Blocked list ocfs2/trivial: Remove trailing whitespaces ocfs2: fix a misleading variable name ocfs2: Sync max_inline_data_with_xattr from tools. ocfs2: Fix refcnt leak on ocfs2_fast_follow_link() error path	2010-02-08 16:05:50 -08:00
Eric Van Hensbergen	bf2d29c64d	9p: fix memory leak in v9fs_parse_options() If match_strdup() fails this function exits without freeing the options string. Signed-off-by: Venkateswararao Jujjuri <jvrao@us.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-02-08 17:59:34 -06:00
Eric Van Hensbergen	d8c8a9e365	9p: fix option parsing Options pointer is being moved before calling kfree() which seems to cause problems. This uses a separate pointer to track and free original allocation. Signed-off-by: Venkateswararao Jujjuri <jvrao@us.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>w	2010-02-08 16:23:23 -06:00
M. Mohan Kumar	7a4439c406	9p: Include fsync support for 9p client Implement the fsync in the client side by marking stat field values to 'don't touch' so that server may interpret it as a request to guarantee that the contents of the associated file are committed to stable storage before the Rwstat message is returned. Without this patch, calling fsync on a 9p file results in "Invalid argument" error. Please check the attached C program. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com> Acked-by: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-02-08 15:36:48 -06:00
Jeff Layton	7e469af97e	lockd: don't clear sm_monitored on nsm_reboot_lookup When lockd gets a notify downcall from statd, it'll search its hosts cache and then clear the sm_monitored bit on the host it finds. The idea is apparently to make lockd redo a SM_MON on the next lock request. This is unnecessary and causes the kernel's NSM cache to go out of sync with statd. statd doesn't stop monitoring a host when it gets a SM_NOTIFY and there's no guarantee that another lock will occur after the reclaim and before the unmount. In that event, no SM_UNMON will occur. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-02-08 16:20:35 -05:00
Jeff Layton	cdd30fa166	lockd: release reference to nsm_handle in nlm_host_rebooted nsm_reboot_lookup takes a reference to the nsm_handle that it returns, but nlm_host_rebooted never releases that reference. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-02-08 16:20:35 -05:00
Sunil Mushran	6efd806634	ocfs2/cluster: Make o2net connect messages KERN_NOTICE Connect and disconnect messages are more than informational as they are required during root cause analysis for failures. This patch changes them from KERN_INFO to KERN_NOTICE. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-08 13:02:28 -08:00
Sunil Mushran	86a06abab0	ocfs2/dlm: Fix printing of lockname The debug call printing the name of the lock resource was chopping off the last character. This patch fixes the problem. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-08 13:01:31 -08:00
J. Bruce Fields	260c64d235	Revert "nfsd4: fix error return when pseudoroot missing" Commit `f39bde24b2` fixed the error return from PUTROOTFH in the case where there is no pseudofilesystem. This is really a case we shouldn't hit on a correctly configured server: in the absence of a root filehandle, there's no point accepting version 4 NFS rpc calls at all. But the shared responsibility between kernel and userspace here means the kernel on its own can't eliminate the possibility of this happening. And we have indeed gotten this wrong in distros, so new client-side mount code that attempts to negotiate v4 by default first has to work around this case. Therefore when commit `f39bde24b2` arrived at roughly the same time as the new v4-default mount code, which explicitly checked only for the previous error, the result was previously fine mounts suddenly failing. We'll fix both sides for now: revert the error change, and make the client-side mount workaround more robust. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-02-08 15:25:23 -05:00
FUJITA Tomonori	84eb8fb42c	[SCSI] compat_ioctl: fix bsg SG_IO bsg's SG_IO doesn't work on 32-bit userspace and 64-bit kernelspace. The problem is that both sg and bsg drivers use SG_IO ioctl. sg_ioctl_trans() does 32/64-bit conversion even against bsg header. It messes up bsg header. bsg driver gets garbage. This patch fixes sg_ioctl_trans to handle only sg header (struct sg_io_hdr). Reported-by: Giridhar Malavali <giridhar.malavali@qlogic.com> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-02-08 13:43:18 -06:00
Al Viro	cccc6bba3f	Lose the first argument of audit_inode_child() it's always equal to ->d_name.name of the second argument Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-08 14:38:36 -05:00
Al Viro	123df2944c	Lose the new_name argument of fsnotify_move() it's always new_dentry->d_name.name Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-08 14:38:36 -05:00
Jeff Layton	05507fa2ac	cifs: fix dentry hash calculation for case-insensitive mounts case-insensitive mounts shouldn't use full_name_hash(). Make sure we use the parent dentry's d_hash routine when one is set. Reported-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-08 17:52:34 +00:00
Steve French	ccd4bb1beb	[CIFS] Don't cache timestamps on utimes due to coarse granularity force revalidate of the file when any of the timestamps are set since some filesystem types do not have finer granularity timestamps and we can not always detect which file systems round timestamps down to determine whether we can cache the mtime on setattr samba bugzilla 3775 Acked-by: Shirish Pargaonkar <sharishp@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-08 17:39:58 +00:00
Linus Torvalds	6339204ecc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Take ima_file_free() to proper place. ima: rename PATH_CHECK to FILE_CHECK ima: rename ima_path_check to ima_file_check ima: initialize ima before inodes can be allocated fix ima breakage Take ima_path_check() in nfsd past dentry_open() in nfsd_open() freeze_bdev: don't deactivate successfully frozen MS_RDONLY sb befs: fix leak	2010-02-07 11:18:28 -08:00
Linus Torvalds	80e1e82398	Fix race in tty_fasync() properly This reverts commit `7036251180` ("tty: fix race in tty_fasync") and commit `b04da8bfdf` ("fnctl: f_modown should call write_lock_irqsave/ restore") that tried to fix up some of the fallout but was incomplete. It turns out that we really cannot hold 'tty->ctrl_lock' over calling __f_setown, because not only did that cause problems with interrupt disables (which the second commit fixed), it also causes a potential ABBA deadlock due to lock ordering. Thanks to Tetsuo Handa for following up on the issue, and running lockdep to show the problem. It goes roughly like this: - f_getown gets filp->f_owner.lock for reading without interrupts disabled, so an interrupt that happens while that lock is held can cause a lockdep chain from f_owner.lock -> sighand->siglock. - at the same time, the tty->ctrl_lock -> f_owner.lock chain that commit `7036251180` introduced, together with the pre-existing sighand->siglock -> tty->ctrl_lock chain means that we have a lock dependency the other way too. So instead of extending tty->ctrl_lock over the whole __f_setown() call, we now just take a reference to the 'pid' structure while holding the lock, and then release it after having done the __f_setown. That still guarantees that 'struct pid' won't go away from under us, which is all we really ever needed. Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Américo Wang <xiyou.wangcong@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-07 10:26:01 -08:00
Al Viro	89068c576b	Take ima_file_free() to proper place. Hooks: Just Say No. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-07 03:07:29 -05:00
Mimi Zohar	9bbb6cad01	ima: rename ima_path_check to ima_file_check ima_path_check actually deals with files! call it ima_file_check instead. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-07 03:06:22 -05:00
Mimi Zohar	8eb988c70e	fix ima breakage The "Untangling ima mess, part 2 with counters" patch messed up the counters. Based on conversations with Al Viro, this patch streamlines ima_path_check() by removing the counter maintenance. The counters are now updated independently, from measuring the file, in __dentry_open() and alloc_file() by calling ima_counts_get(). ima_path_check() is called from nfsd and do_filp_open(). It also did not measure all files that should have been measured. Reason: ima_path_check() got bogus value passed as mask. [AV: mea culpa] [AV: add missing nfsd bits] Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-07 03:06:22 -05:00
Al Viro	1e41568d73	Take ima_path_check() in nfsd past dentry_open() in nfsd_open() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-07 03:06:22 -05:00
Jun'ichi Nomura	4b06e5b9ad	freeze_bdev: don't deactivate successfully frozen MS_RDONLY sb Thanks Thomas and Christoph for testing and review. I removed 'smp_wmb()' before up_write from the previous patch, since up_write() should have necessary ordering constraints. (I.e. the change of s_frozen is visible to others after up_write) I'm quite sure the change is harmless but if you are uncomfortable with Tested-by/Reviewed-by on the modified patch, please remove them. If MS_RDONLY, freeze_bdev should just up_write(s_umount) instead of deactivate_locked_super(). Also, keep sb->s_frozen consistent so that remount can check the frozen state. Otherwise a crash reported here can happen: http://lkml.org/lkml/2010/1/16/37 http://lkml.org/lkml/2010/1/28/53 This patch should be applied for 2.6.32 stable series, too. Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Thomas Backlund <tmb@mandriva.org> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-07 03:06:21 -05:00
Al Viro	8dd5ca532c	befs: fix leak Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-07 03:06:21 -05:00
Steve French	301a6a3177	[CIFS] Maximum username length check in session setup does not match Fix length check reported by D. Binderman (see below) d binderman <dcb314@hotmail.com> wrote: > > I just ran the sourceforge tool cppcheck over the source code of the > new Linux kernel 2.6.33-rc6 > > It said > > [./cifs/sess.c:250]: (error) Buffer access out-of-bounds May turn out to be harmless, but best to be safe. Note max username length is defined to 32 due to Linux (Windows maximum is 20). Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-06 07:08:53 +00:00
Jeff Layton	f12f98dba6	cifs: fix length calculation for converted unicode readdir names cifs_from_ucs2 returns the length of the converted name, including the length of the NULL terminator. We don't want to include the NULL terminator in the dentry name length however since that'll throw off the hash calculation for the dentry cache. I believe that this is the root cause of several problems that have cropped up recently that seem to be papered over with the "noserverino" mount option. More confirmation of that would be good, but this is clearly a bug and it fixes at least one reproducible problem that was reported. This patch fixes at least this reproducer in this kernel.org bug: http://bugzilla.kernel.org/show_bug.cgi?id=15088#c12 Reported-by: Bjorn Tore Sund <bjorn.sund@it.uib.no> Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: stable@kernel.org Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-02-06 06:25:16 +00:00
Roel Kluin	bd6b0bf87d	ocfs2: Fix contiguousness check in ocfs2_try_to_merge_extent_map() The wrong member was compared in the contiguousness check. Acked-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-05 15:06:21 -08:00
James Bottomley	73c77e2ccc	xfs: fix xfs to work with Virtually Indexed architectures xfs_buf.c includes what is essentially a hand rolled version of blk_rq_map_kern(). In order to work properly with the vmalloc buffers that xfs uses, this hand rolled routine must also implement the flushing API for vmap/vmalloc areas. [style updates from hch@lst.de] Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-02-05 12:32:35 -06:00
Linus Torvalds	adbfbcd12a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: apply updated fallocate i_size fix Btrfs: do not try and lookup the file extent when finishing ordered io Btrfs: Fix oopsen when dropping empty tree. Btrfs: remove BUG_ON() due to mounting bad filesystem Btrfs: make error return negative in btrfs_sync_file() Btrfs: fix race between allocate and release extent buffer.	2010-02-05 07:23:03 -08:00
Fang Wenqi	b2d82ee3c8	fuse: fix large stack use gcc 4.4 warns about: fs/fuse/dev.c: In function ‘fuse_notify_inval_entry’: fs/fuse/dev.c:925: warning: the frame size of 1060 bytes is larger than 1024 bytes The problem is we declare two structures and a large array on the stack; I move the array away from the stack and allocate memory for it dynamically. Signed-off-by: Fang Wenqi <antonf@turbolinux.com.cn> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-02-05 12:08:31 +01:00
Miklos Szeredi	b21dda438b	fuse: cleanup in fuse_notify_inval_...() Small cleanup in fuse_notify_inval_inode() and fuse_notify_inval_entry(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-02-05 12:08:31 +01:00
Linus Torvalds	a9861b5037	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Don't clobber the attribute type in nfs_update_inode() NFS: Fix a umount race NFS: Fix an Oops when truncating a file NFS: Ensure that we handle NFS4ERR_STALE_STATEID correctly NFSv4.1: Don't call nfs4_schedule_state_recovery() unnecessarily NFSv4: Don't allow posix locking against servers that don't support it NFSv4: Ensure that the NFSv4 locking can recover from stateid errors NFS: Avoid warnings when CONFIG_NFS_V4=n NFS: Make nfs_commitdata_release static NFS: Try to commit unstable writes in nfs_release_page() NFS: Fix a reference leak in nfs_wb_cancel_page()	2010-02-04 16:08:15 -08:00
Linus Torvalds	a3a71ca9a7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Extend umount wait coverage to full glock lifetime GFS2: Wait for unlock completion on umount	2010-02-04 16:06:48 -08:00
Aneesh Kumar K.V	23b5c50945	Btrfs: apply updated fallocate i_size fix This version of the i_size fix for fallocate makes sure we only update the i_size when the current fallocate is really operating outside of i_size. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-02-04 11:33:03 -05:00
Josef Bacik	efd049fb26	Btrfs: do not try and lookup the file extent when finishing ordered io When running the following fio job [torrent] filename=torrent-test rw=randwrite size=4g filesize=4g bs=4k ioengine=sync you would see long stalls where no work was being done. That is because we were doing all this extra work to read in the file extent outside of the transaction, however in the random io case this ends up hurting us because the file extents are not there to begin with. So axe this logic, since we end up reading in the file extent when we go to update it anyway. This took the fio job from 11 mb/s with several ~10 second stalls to 24 mb/s with a couple of 1-2 second stalls. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-02-04 11:31:45 -05:00
Yan, Zheng	7a7965f83e	Btrfs: Fix oopsen when dropping empty tree. When dropping an empty tree, walk_down_tree() skips checking extent information for the tree root. This will trigger a BUG_ON in walk_up_proc(). Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-02-04 11:31:45 -05:00
Miao Xie	d7ce5843bb	Btrfs: remove BUG_ON() due to mounting bad filesystem Mounting a bad filesystem caused a BUG_ON(). The following is steps to reproduce it. # mkfs.btrfs /dev/sda2 # mount /dev/sda2 /mnt # mkfs.btrfs /dev/sda1 /dev/sda2 (the program says that /dev/sda2 was mounted, and then exits. ) # umount /mnt # mount /dev/sda1 /mnt At the third step, mkfs.btrfs exited in the way of make filesystem. So the initialization of the filesystem didn't finish. So the filesystem was bad, and it caused BUG_ON() when mounting it. But BUG_ON() should be called by the wrong code, not user's operation, so I think it is a bug of btrfs. This patch fixes it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-02-04 11:31:44 -05:00
Roel Kluin	014e4ac4f7	Btrfs: make error return negative in btrfs_sync_file() It appears the error return should be negative Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-02-04 11:31:44 -05:00
Yan, Zheng	f044ba7835	Btrfs: fix race between allocate and release extent buffer. Increase extent buffer's reference count while holding the lock. Otherwise it can race with try_release_extent_buffer. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-02-04 11:31:44 -05:00
Kees Cook	d78ca3cd73	syslog: use defined constants instead of raw numbers Right now the syslog "type" actions are just raw numbers, which makes the source difficult to follow. This patch replaces the raw numbers with defined constants for some level of sanity. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-02-04 14:20:41 +11:00
Kees Cook	002345925e	syslog: distinguish between /proc/kmsg and syscalls This allows the LSM to distinguish between syslog functions originating from /proc/kmsg access and direct syscalls. By default, the commoncaps will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg file descriptor. For example the kernel syslog reader can now drop privileges after opening /proc/kmsg, instead of staying privileged with CAP_SYS_ADMIN. MAC systems that implement security_syslog have unchanged behavior. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-02-04 14:20:12 +11:00
Sunil Mushran	cda70ba8c0	ocfs2/dlm: Remove BUG_ON in dlm recovery when freeing locks of a dead node During recovery, the dlm frees the locks for the dead node. If it finds a lock in a resource for the dead node, it expects that node to also have a ref in that lock resource. If not, it BUGs. ossbz#1175 was filed with the above BUG. Now, while it is correct that we should be expecting the ref, I see no reason why we have to BUG. After all, we are freeing up the lock and clearing the ref. This patch replaces the BUG_ON with a printk(). Hopefully, that will give us more clues next time this happens. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1175 Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-03 17:51:41 -08:00
Sunil Mushran	079b805782	ocfs2: Plugs race between the dc thread and an unlock ast message This patch plugs a race between the downconvert thread and an unlock ast message. Specifically, after the downconvert worker has done its task, the dc thread needs to check whether an unlock ast made the downconvert moot. Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-03 17:26:03 -08:00
Dave Chinner	5322892d86	xfs: kill xfs_bawrite There are no more users of this function left in the XFS code now that we've switched everything to delayed write flushing. Remove it. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-04 10:09:14 +11:00
Christoph Hellwig	07fec73625	xfs: log changed inodes instead of writing them synchronously When an inode has already been flushed delayed write, xfs_inode_clean() returns true and hence xfs_fs_write_inode() can return on a synchronous inode write without having written the inode. Currently these synchronous writes only come from sync(1), unmount, a synchronous NFS export and cachefiles so should be relatively rare and out of common performance paths. Realistically, a synchronous inode write is not necessary here; we can avoid writing the inode by logging any non-transactional changes that are pending. This needs to be done with synchronous transactions, but it avoids seeking between the log and inode clusters as we do now. We don't force the log if the inode is pinned, though, so this differs from the fsync case. For normal sys_sync and unmount behaviour this is fine because we do a synchronous log force in xfs_sync_data which is called from the ->sync_fs code. It does however break the NFS synchronous export guarantees for now, but work is under way to fix this at a higher level or for the higher level to provide an additional flag in the writeback control to tell us that a log force is needed. Portions of this patch are based on work from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-02-09 11:43:49 +11:00
Linus Torvalds	c1c0cbb878	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix potential leak of dirty data on umount	2010-02-03 08:47:15 -08:00
Trond Myklebust	9b4b351346	NFS: Don't clobber the attribute type in nfs_update_inode() If the NFS_ATTR_FATTR_TYPE field isn't set in fattr->valid, then we should not set the S_IFMT part of inode->i_mode. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-02-03 08:27:35 -05:00
Trond Myklebust	387c149b54	NFS: Fix a umount race Ensure that we unregister the bdi before kill_anon_super() calls ida_remove() on our device name. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-02-03 08:27:35 -05:00
Trond Myklebust	9f557cd807	NFS: Fix an Oops when truncating a file The VM/VFS does not allow mapping->a_ops->invalidatepage() to fail. Unfortunately, nfs_wb_page_cancel() may fail if a fatal signal occurs. Since the NFS code assumes that the page stays mapped for as long as the writeback is active, we can end up Oopsing (among other things). The only safe fix here is to convert nfs_wait_on_request(), so as to make it uninterruptible (as is already the case with wait_on_page_writeback()). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-02-03 08:27:22 -05:00
Steven Whitehouse	8f05228ee7	GFS2: Extend umount wait coverage to full glock lifetime Although all glocks are, by the time of the umount glock wait, scheduled for demotion, some of them haven't made it far enough through the process for the original set of waiting code to wait for them. This extends the ref count to the whole glock lifetime in order to ensure that the waiting does catch all glocks. It does make it a bit more invasive, but it seems the only sensible solution at the moment. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-02-03 09:56:21 +00:00
Steven Whitehouse	e402746a94	GFS2: Wait for unlock completion on umount This patch adds a wait on umount between the point at which we dispose of all glocks and the point at which we unmount the lock protocol. This ensures that we've received all the replies to our unlock requests before we stop the locking. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com>	2010-02-03 09:47:04 +00:00
Sunil Mushran	db0f6ce697	ocfs2: Remove overzealous BUG_ON during blocked lock processing During blocked lock processing, we should consider the possibility that the lock is no longer blocking. Joel Becker <joel.becker@oracle.com> assisted in fixing this issue. Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 23:51:16 -08:00
Sunil Mushran	0d74125a6a	ocfs2: Do not downconvert if the lock level is already compatible During upconvert, if the master were to send a BAST, dlmglue will detect the upconversion in process and send a cancel convert to the master. Upon receiving the AST for the cancel convert, it will re-process the lock resource to determine whether it needs downconverting. Say, the up was from PR to EX and the BAST was for EX. After the cancel convert, it will need to downconvert to NL. However, if the node was originally upconverting from NL to EX, then there would be no reason to downconvert (assuming the same message sequence). This patch makes dlmglue consider the possibility that the current lock level is already compatible and that downconverting is not required. Joel Becker <joel.becker@oracle.com> assisted in fixing this issue. Fixes ossbz#1178 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1178 Reported-by: Coly Li <coly.li@suse.de> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 23:51:14 -08:00
Sunil Mushran	a191282601	ocfs2: Prevent a livelock in dlmglue There is a possibility of a livelock in __ocfs2_cluster_lock(). If a node were to get an ast for an upconvert request, followed immediately by a bast, there is a small window where the fs may downconvert the lock before the process requesting the upconvert is able to take the lock. This patch adds a new flag to indicate that the upconvert is still in progress and that the dc thread should not downconvert it right now. Wengang Wang <wen.gang.wang@oracle.com> and Joel Becker <joel.becker@oracle.com> contributed heavily to this patch. Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 23:51:13 -08:00
Wengang Wang	0b94a909eb	ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast During bast, set the OCFS2_LOCK_BLOCKED flag only if the lock needs to be downconverted. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Acked-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 23:50:55 -08:00
Tao Ma	34e6c59af0	ocfs2: Use compat_ptr in reflink_arguments. Although we use u64 to pass userspace pointers to the kernel to avoid compat_ioctl, it doesn't work on some ppc platforms. So wrap them with compat_ptr and add compat_ioctl. The detailed discussion about compat_ptr can be found in thread http://lkml.org/lkml/2009/10/27/423. We indeed met with a bug when testing on ppc (-EFAULT is returned when using old_path). This patch tries to fix this. I have tested on ppc64 (with 32 bit reflink) and x86_64 (with i686 reflink); both work. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 18:56:37 -08:00
Sunil Mushran	cd34edd8cf	ocfs2/dlm: Handle EAGAIN for compatibility - v2 Mainline commit `aad1b15310` made the dlm_begin_reco_handler() return -EAGAIN instead of EAGAIN. As this error is transmitted over the wire, we want the receiver, dlm_send_begin_reco_message(), to understand both the older EAGAIN and the newer -EAGAIN, to allow rolling upgrade of the cluster nodes. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 18:56:34 -08:00
Tao Ma	60c486744c	ocfs2: Add parenthesis to wrap the check for O_DIRECT. Add parenthesis to wrap the check for O_DIRECT. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 18:15:37 -08:00
Tao Ma	0a1ea437d8	ocfs2: Only bug out when page size is larger than cluster size. In CoW, we have to make sure that the page is already written out to the disk. So we have a BUG_ON(PageDirty(page)). On the ppc platform we have pagesize=64K, so if cs=4K and the file has fragmented clusters, we will map the page many times. See this file as an example. Tree Depth: 0 Count: 19 Next Free Rec: 14 ## Offset Clusters Block# Flags 0 0 4 2164864 0x2 Refcounted 1 4 2 9302792 0x2 Refcounted ... We have to replace the extent recs one by one, so the page with index 0 will be mapped and dirtied twice. I'd like to leave the BUG_ON there while adding a check so that in case we meet with an error in other platforms, we can find it easily. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 18:15:35 -08:00
Tao Ma	d622b89a2f	ocfs2: Fix memory overflow in cow_by_page. In ocfs2_duplicate_clusters_by_page, we calculate map_end by shifting page_index. But in case we meet with a large offset (say on an i686 box, where pgoff_t is only 32 bits and page_index=2056240), we will overflow. So change the type of page_index to loff_t. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-02-02 18:14:20 -08:00
anfei zhou	931e80e4b3	mm: flush dcache before writing into page to avoid alias The cache alias problem will happen if the changes of user shared mapping is not flushed before copying, then user and kernel mapping may be mapped into two different cache line, it is impossible to guarantee the coherence after iov_iter_copy_from_user_atomic. So the right steps should be: flush_dcache_page(page); kmap_atomic(page); write to page; kunmap_atomic(page); flush_dcache_page(page); More precisely, we might create two new APIs flush_dcache_user_page and flush_dcache_kern_page to replace the two flush_dcache_page accordingly. Here is a snippet tested on omap2430 with VIPT cache, and I think it is not ARM-specific: int val = 0x11111111; fd = open("abc", O_RDWR); addr = mmap(NULL, 4096, PROT_READ\|PROT_WRITE, MAP_SHARED, fd, 0); *(addr+0) = 0x44444444; tmp = *(addr+0); *(addr+1) = 0x77777777; write(fd, &val, sizeof(int)); close(fd); The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777. Signed-off-by: Anfei <anfei.zhou@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: <linux-arch@vger.kernel.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-02 18:11:21 -08:00
Linus Torvalds	1a45dcfe25	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cfq-iosched: Do not idle on async queues blk-cgroup: Fix potential deadlock in blk-cgroup block: fix bugs in bio-integrity mempool usage block: fix bio_add_page for non trivial merge_bvec_fn case drbd: null dereference bug drbd: fix max_segment_size initialization	2010-02-02 12:54:37 -08:00
Linus Torvalds	4dab75ec3e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Use GFP_NOFS for alloc structure GFS2: Fix previous patch GFS2: Don't withdraw on partial rindex entries GFS2: Fix refcnt leak on gfs2_follow_link() error path	2010-02-02 12:48:26 -08:00
Linus Torvalds	7ab02af428	Fix 'flush_old_exec()/setup_new_exec()' split Commit `221af7f87b` ("Split 'flush_old_exec' into two functions") split the function at the point of no return - ie right where there were no more error cases to check. That made sense from a technical standpoint, but when we then also combined it with the actual personality setting going in between flush_old_exec() and setup_new_exec(), it needs to be a bit more careful. In particular, we need to make sure that we really flush the old personality bits in the 'flush' stage, rather than later in the 'setup' stage, since otherwise we might be flushing the _new_ personality state that we're just setting up. So this moves the flags and personality flushing (and 'flush_thread()', which is the arch-specific function that generally resets lazy FP state etc) of the old process into flush_old_exec(), so that it doesn't affect any state that execve() is setting up for the new process environment. This was reported by Michal Simek as breaking his Microblaze qemu environment. Reported-and-tested-by: Michal Simek <michal.simek@petalogix.com> Cc: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-02 12:37:44 -08:00
Christoph Hellwig	e8b217e753	xfs: remove invalid barrier optimization from xfs_fsync We always need to flush the disk write cache and can't skip it just because no inode attributes have changed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-02-02 10:16:26 +11:00
Dave Chinner	20026d9201	xfs: kill the unused XFS_QMOPT_* flush flags V2 dquots are never flushed asynchronously. Remove the flag and the async write support from the flush function. Make the default flush a delwri flush to match the inode flush code, which leaves XFS_QMOPT_SYNC as the only flag remaining. Convert that to use SYNC_WAIT instead, just like the inode flush code. V2: - just pass flush flags straight through Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-04 09:48:58 +11:00
Linus Torvalds	13af75740f	Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: reiserfs: Fix vmalloc call under reiserfs lock	2010-02-01 10:46:18 -08:00
Steven Whitehouse	ea8d62dadd	GFS2: Use GFP_NOFS for alloc structure This is called under a glock, so it's a good plan to use GFP_NOFS. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-02-01 10:01:34 +00:00
Steven Whitehouse	7fe3ec6fe5	GFS2: Fix previous patch The do_div() call needs to remain. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-02-01 10:00:23 +00:00
Benjamin Marzinski	55f0b4c546	GFS2: Don't withdraw on partial rindex entries Since gfs2 writes the rindex file a block at a time, and releases the exclusive lock after each block, it is possible that another process will grab the lock in the middle of the write. Since rindex entries are not an even divisor of blocks, that other process may see partial entries. On grows, this is fine. The process can simply ignore the partial entries. Previously, the code withdrew when it saw partial entries. Now it simply ignores them. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-02-01 09:59:54 +00:00
Ryusuke Konishi	3256a05531	nilfs2: fix potential leak of dirty data on umount This fixes incorrect usage of nilfs_segctor_confirm() test function in nilfs_segctor_destroy(); nilfs_segctor_confirm() returns zero if the filesystem is not clean, so its use in nilfs_segctor_destroy() needs inversion. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-01-31 14:57:31 +09:00
Chuck Ebbert	9e9432c267	block: fix bugs in bio-integrity mempool usage Fix two bugs in the bio integrity code: use_bip_pool() always returns 0 because it checks against the wrong limit, causing the mempool to be used only when regular allocation fails. When the mempool is used as a fallback we don't free the data properly. Signed-Off-By: Chuck Ebbert <cebbert@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-30 20:28:19 +01:00
Trond Myklebust	aa696a6f34	nfsd: Use vfs_fsync_range() in nfsd_commit The NFS COMMIT operation allows the client to specify the exact byte range that it wishes to sync to disk in order to optimise server performance. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-01-29 18:53:11 -05:00
Linus Torvalds	67f15b06c1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: check total number of devices when removing missing Btrfs: check return value of open_bdev_exclusive properly Btrfs: do not mark the chunk as readonly if in degraded mode Btrfs: run orphan cleanup on default fs root Btrfs: fix a memory leak in btrfs_init_acl Btrfs: Use correct values when updating inode i_size on fallocate Btrfs: remove tree_search() in extent_map.c Btrfs: Add mount -o compress-force	2010-01-29 10:27:37 -08:00
Linus Torvalds	221af7f87b	Split 'flush_old_exec' into two functions 'flush_old_exec()' is the point of no return when doing an execve(), and it is pretty badly misnamed. It doesn't just flush the old executable environment, it also starts up the new one. Which is very inconvenient for things like setting up the new personality, because we want the new personality to affect the starting of the new environment, but at the same time we do _not_ want the new personality to take effect if flushing the old one fails. As a result, the x86-64 '32-bit' personality is actually done using this insane "I'm going to change the ABI, but I haven't done it yet" bit (TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the personality, but just the "pending" bit, so that "flush_thread()" can do the actual personality magic. This patch in no way changes any of that insanity, but it does split the 'flush_old_exec()' function up into a preparatory part that can fail (still called flush_old_exec()), and a new part that will actually set up the new exec environment (setup_new_exec()). All callers are changed to trivially comply with the new world order. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-29 08:22:01 -08:00
Josef Bacik	035fe03a7a	Btrfs: check total number of devices when removing missing If you have a disk failure in RAID1 and then add a new disk to the array, and then try to remove the missing volume, it will fail. The reason is the sanity check only looks at the total number of rw devices, which is just 2 because we have 2 good disks and 1 bad one. Instead check the total number of devices in the array to make sure we can actually remove the device. Tested this with a failed disk setup and with this test we can now run btrfs-vol -r missing /mount/point and it works fine. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-01-28 16:20:39 -05:00
Josef Bacik	7f59203abe	Btrfs: check return value of open_bdev_exclusive properly Hit this problem while testing RAID1 failure stuff. open_bdev_exclusive returns ERR_PTR(), not NULL. So check the return value properly. This is important because if you accidentally specify a device that doesn't exist when trying to add a new device to an array, you will panic the box dereferencing bdev. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-01-28 16:20:39 -05:00
Josef Bacik	f48b90756b	Btrfs: do not mark the chunk as readonly if in degraded mode If a RAID setup has chunks that span multiple disks, and one of those disks has failed, btrfs_chunk_readonly will return 1 since one of the disks in that chunk's stripes is dead and therefore not writeable. So instead if we are in degraded mode, return 0 so we can go ahead and allocate stuff. Without this patch all of the block groups in a RAID1 setup will end up read-only, which will mean we can't add new disks to the array since we won't be able to make allocations. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-01-28 16:20:39 -05:00
Josef Bacik	e3acc2a685	Btrfs: run orphan cleanup on default fs root This patch reverts commit `6c090a11e1`, since it introduces a problem where we can run orphan cleanup on a volume that can have orphan entries re-added. Instead of my original fix, Yan Zheng pointed out that we can just revert my original fix and then run the orphan cleanup in open_ctree after we look up the fs_root. I have tested this with all the tests that gave me problems and this patch fixes both problems. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-01-28 16:20:39 -05:00


17047 Commits