need list_for_each_entry_safe() here. Original didn't even
have restart logics, so if you race with umount() it blew up.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
At the same time we can kill s_need_restart and local mutex in there.
__put_super() made public for a while; will be gone later.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We used to remove from s_list and s_instances at the same
time. So let's *not* do the former and skip superblocks
that have empty s_instances in the loops over s_list.
The next step, of course, will be to get rid of rescan logics
in those loops.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure that s_umount is acquired *before* we drop the final
active reference; we still have the fast path (atomic_dec_unless)
and we have gotten rid of the window between the moment when
s_active hits zero and s_umount is acquired. Which simplifies
the living hell out of grab_super() and inotify pin_to_kill()
stuff.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
First of all, get_sb_nodev() grabs anon dev minor and we
never free it in ecryptfs ->kill_sb(). Moreover, on one
of the failure exits in ecryptfs_get_sb() we leak things -
it happens before we set ->s_root and ->put_super() won't
be called in that case. Solution: kill ->put_super(), do
all that stuff in ->kill_sb(). And use kill_anon_sb() instead
of generic_shutdown_super() to deal with anon dev leak.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We set the "it's dead, don't mount on it" flag _and_ do not remove it if
we turn the damn thing negative and leave it around. And if it goes
positive afterwards, well...
Fortunately, there's only one place where that needs to be caught:
only d_delete() can turn the sucker negative without immediately freeing
it; all other places that can lead to ->d_iput() call are followed by
unconditionally freeing struct dentry in question. So the fix is obvious:
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16014
Reported-by: Adam Tkac <vonsch@gmail.com>
Tested-by: Adam Tkac <vonsch@gmail.com>
Cc: <stable@kernel.org> [2.6.34.x]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We need at least two to guarantee proper POSIX behaviour, so
never allow a smaller limit than that.
Also expose a /proc/sys/fs/pipe-max-pages sysctl file that allows
root to define a sane upper limit. Make it default to 16 times the
default size, which is 16 pages.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When CONFIG_BLOCK isn't enabled:
mm/page-writeback.c: In function 'laptop_mode_timer_fn':
mm/page-writeback.c:708: error: dereferencing pointer to incomplete type
mm/page-writeback.c:709: error: dereferencing pointer to incomplete type
Fix this by essentially eliminating the laptop sync handlers when
CONFIG_BLOCK isn't set, as most are only used from the block layer code.
The exception is laptop_sync_completion() which is used from sys_sync(),
make that an empty declaration in that case.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, native capacity unlocking is initiated only when a
recognized partition extends beyond the end of the disk. However,
there are several other unhandled cases where truncated capacity can
lead to misdetection of partitions.
* Partition table is fully beyond EOD.
* Partition table is partially beyond EOD (daisy chained ones).
* Recognized partition starts beyond EOD.
This patch updates generic partition check code such that all the
above three cases are handled too. For the first two, @state tracks
whether low level partition check code tried to read beyond EOD during
partition scan and triggers native capacity unlocking accordingly.
The third is now handled similarly to the original unlocking case.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Make the following changes to partition check code.
* Add ->bdev to struct parsed_partitions.
* Introduce read_part_sector() which is a simple wrapper around
read_dev_sector() which takes struct parsed_partitions *state
instead of @bdev.
* For functions which used to take @state and @bdev, drop @bdev. For
functions which used to take @bdev, replace it with @state.
* While updating, drop superflous checks on NULL state/bdev in ldm.c.
This cleans up the API a bit and enables better handling of IO errors
during partition check as the generic partition check code now has
much better visibility into what went wrong in the low level code
paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
bdops->set_capacity() is unnecessarily generic. All that's required
is a simple one way notification to lower level driver telling it to
try to unlock native capacity. There's no reason to pass in target
capacity or return the new capacity. The former is always the
inherent native capacity and the latter can be handled via the usual
device resize / revalidation path. In fact, the current API is always
used that way.
Replace ->set_capacity() with ->unlock_native_capacity() which take
only @disk and doesn't return anything. IDE which is the only current
user of the API is converted accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Device resize via ->set_capacity() can reveal new partitions (e.g. in
chained partition table formats such as dos extended parts). Restart
partition scan from the beginning after resizing a device. This
change also makes libata always revalidate the disk after resize which
makes lower layer native capacity unlocking implementation simpler and
more robust as resize can be handled in the usual path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
invalidate_bdev() should release all page cache pages which are clean
and not being used; however, if some pages are still in the percpu LRU
add caches on other cpus, those pages are considered in used and don't
get released. Fix it by calling lru_add_drain_all() before trying to
invalidate pages.
This problem was discovered while testing block automatic native
capacity unlocking. Null pages which were read before automatic
unlocking didn't get released by invalidate_bdev() and ended up
interfering with partition scan after unlocking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Calling schedule without setting the task state to non-running will
return immediately, so ensure that we set it properly and check our
sleep conditions after doing so.
This is a fixup for commit 69b62d01.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Even if the writeout itself isn't a data integrity operation, we need
to ensure that the caller doesn't drop the sb umount sem before we
have actually done the writeback.
This is a fixup for commit e913fc82.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (31 commits)
dquot: Detect partial write error to quota file in write_blk() and add printk_ratelimit for quota error messages
ocfs2: Fix lock inversion in quotas during umount
ocfs2: Use __dquot_transfer to avoid lock inversion
ocfs2: Fix NULL pointer deref when writing local dquot
ocfs2: Fix estimate of credits needed for quota allocation
ocfs2: Fix quota locking
ocfs2: Avoid unnecessary block mapping when refreshing quota info
ocfs2: Do not map blocks from local quota file on each write
quota: Refactor dquot_transfer code so that OCFS2 can pass in its references
quota: unify quota init condition in setattr
quota: remove sb_has_quota_active in get/set_info
quota: unify ->set_dqblk
quota: unify ->get_dqblk
ext3: make barrier options consistent with ext4
quota: Make quota stat accounting lockless.
suppress warning: "quotatypes" defined but not used
ext3: Fix waiting on transaction during fsync
jbd: Provide function to check whether transaction will issue data barrier
ufs: add ufs speciffic ->setattr call
BKL: Remove BKL from ext2 filesystem
...
This patch changes quota_tree.c:write_blk() to detect error caused by partial
write to quota file and add a macro to limit control printed quota error
messages so we won't fill up dmesg with a corrupted quota file.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We cannot cancel delayed work from ocfs2_local_free_info because that is called
with dqonoff_mutex held and the work it cancels requires dqonoff_mutex to
finish. Cancel the work before acquiring dqonoff_mutex.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
dquot_transfer() acquires own references to dquots via dqget(). Thus it waits
for dq_lock which creates a lock inversion because dq_lock ranks above
transaction start but transaction is already started in ocfs2_setattr(). Fix
the problem by passing own references directly to __dquot_transfer.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
commit_dqblk() can write quota info to global file. That is actually a bad
thing to do because if we are just modifying local quota file, we are not
prepared (do not hold proper locks, do not have transaction credits) to do
a modification of the global quota file. So do not use commit_dqblk() and
instead call our writing function directly.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We were missing reservation of a journal credit for modification of quota
file inode when creating new dquot structure in the global quota file.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
OCFS2 had three issues with quota locking:
a) When reading dquot from global quota file, we started a transaction while
holding dqio_mutex which is prone to deadlocks because other paths do it
the other way around
b) During ocfs2_sync_dquot we were not protected against concurrent writers
on the same node. Because we first copy data to local buffer, a race
could happen resulting in old data being written to global quota file and
thus causing quota inconsistency after a crash.
c) ip_alloc_sem of quota files was acquired while a transaction is started
in ocfs2_quota_write which can deadlock because we first get ip_alloc_sem
and then start a transaction when extending quota files.
We fix the problem a) by pulling all necessary code to ocfs2_acquire_dquot
and ocfs2_release_dquot. Thus we no longer depend on generic dquot_acquire
to do the locking and can force proper lock ordering.
Problems b) and c) are fixed by locking i_mutex and ip_alloc_sem of
global quota file in ocfs2_lock_global_qf and removing ip_alloc_sem from
ocfs2_quota_read and ocfs2_quota_write.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The position of global quota file info does not change. So we do not have
to do logical -> physical block translation every time we reread it from
disk. Thus we can also avoid taking ip_alloc_sem.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There is no need to map offset of local dquot structure to on disk block
in each quota write. It is enough to map it just once and store the physical
block number in quota structure in memory. Moreover this simplifies locking
as we do not have to take ip_alloc_sem from quota write path.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, __dquot_transfer() acquires its own references of dquot structures
that will be put into inode. But for OCFS2, this creates a lock inversion
between dq_lock (waited on in dqget) and transaction start (started in
ocfs2_setattr). Currently, deadlock is impossible because dq_lock is acquired
only during dquot_acquire and dquot_release and we already hold a reference to
dquot structures in ocfs2_setattr so neither of these functions can be called
while we call dquot_transfer. But this is rather subtle and it is hard to teach
lockdep about it. So provide __dquot_transfer function that can be passed dquot
references directly. OCFS2 can then pass acquired dquot references directly to
__dquot_transfer with proper locking.
Signed-off-by: Jan Kara <jack@suse.cz>
Quota must being initialized if size or uid/git changes requested.
But initialization performed in two different places:
in case of i_size file system is responsible for dquot init
, but in case of uid/gid init will be called internally in
dquot_transfer().
This ambiguity makes code harder to understand.
Let's move this logic to one common helper function.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
The methods already do these checks, so remove them in the quotactl
implementation to allow non-VFS quota implementations to also support
these calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Pass the larger struct fs_disk_quota to the ->set_dqblk operation so
that the Q_SETQUOTA and Q_XSETQUOTA operations can be implemented
with a single filesystem operation and we can retire the ->set_xquota
operation. The additional information (RT-subvolume accounting and
warn counts) are left zero for the VFS quota implementation.
Add new fieldmask values for setting the numer of blocks and inodes
values which is required for the VFS quota, but wasn't for XFS.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Pass the larger struct fs_disk_quota to the ->get_dqblk operation so
that the Q_GETQUOTA and Q_XGETQUOTA operations can be implemented
with a single filesystem operation and we can retire the ->get_xquota
operation. The additional information (RT-subvolume accounting and
warn counts) are left zero for the VFS quota implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
ext4 was updated to accept barrier/nobarrier mount options
in addition to the older barrier=0/1. The barrier story
is complex enough, we should help people by making the options
the same at least, even if the defaults are different.
This patch allows the barrier/nobarrier mount options for ext3,
while keeping nobarrier the default.
It also unconditionally displays barrier status in show_options,
and prints a message at mount time if barriers are not enabled,
just as ext4 does.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Quota stats is mostly writable data structure. Let's alloc percpu
bucket for each value.
NOTE: dqstats_read() function is racy against dqstats_{inc,dec}
and may return inconsistent value. But this is ok since absolute
accuracy is not required.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Suppress compilation warning: "quotatypes" defined but not used.
quotatypes is used only when CONFIG_QUOTA_DEBUG or CONFIG_PRINT_QUOTA_WARNING
is/are defined.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
log_start_commit() returns 1 only when it started a transaction
commit. Thus in case transaction commit is already running, we
fail to wait for the commit to finish. Fix the issue by always
waiting for the commit regardless of the log_start_commit return
value.
Signed-off-by: Jan Kara <jack@suse.cz>
Provide a function which returns whether a transaction with given tid
will send a barrier to the filesystem device. The function will be used
by ext3 to detect whether fsync needs to send a separate barrier or not.
Signed-off-by: Jan Kara <jack@suse.cz>
generic setattr not longer responsible for quota transfer.
use ufs_setattr for all ufs's inodes.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
The BKL is still used in ext2_put_super(), ext2_fill_super(), ext2_sync_fs()
ext2_remount() and ext2_write_inode(). From these calls ext2_put_super(),
ext2_fill_super() and ext2_remount() are protected against each other by
the struct super_block s_umount rw semaphore. The call in ext2_write_inode()
could only protect the modification of the ext2_sb_info through
ext2_update_dynamic_rev() against concurrent ext2_sync_fs() or ext2_remount().
ext2_fill_super() and ext2_put_super() can be left out because you need a
valid filesystem reference in all three cases, which you do not have when
you are one of these functions.
If the BKL is only protecting the modification of the ext2_sb_info it can
safely be removed since this is protected by the struct ext2_sb_info s_lock.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Add a spinlock that protects against concurrent modifications of
s_mount_state, s_blocks_last, s_overhead_last and the content of the
superblock's buffer pointed to by sbi->s_es. The spinlock is now used in
ext2_xattr_update_super_block() which was setting the
EXT2_FEATURE_COMPAT_EXT_ATTR flag on the superblock without protection
before. Likewise the spinlock is used in ext2_show_options() to have a
consistent view of the mount options.
This is a preparation patch for removing the BKL from ext2 in the next
patch.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
Move ext2_write_super() out of ext2_setup_super() as a preparation for the
next patch that adds a new lock for superblock fields.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Both function originally did similar things except that ext2_sync_super()
is returning after the call to sync_dirty_buffer(sbh). Therefore this
patch adds a wait flag to tell ext2_sync_super() if it has to call
sync_dirty_buffer() to wait for in-progress I/O to finish.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Depending in the state (valid or unchecked) of the filesystem either
ext2_sync_super() or ext2_commit_super() is called. If the filesystem is
currently valid (it is checked), we first mark it unchecked and afterwards
duplicate the work that ext2_sync_super() is doing later. Therefore this
patch removes the duplicate code and calls ext2_sync_super() directly after
marking the filesystem unchecked.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
This is probably a typo since the write time should actually be updated by
ext2_sync_fs() instead of the mount time.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
ext2_sync_fs() used to duplicate the code from ext2_clear_super_error().
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently block/inode/dir counters are initialized before journal was
recovered. In fact after journal recovery this info will probably
change which results in incorrect numbers returned from statfs(2).
BUG:#15768
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
This patch removes a useless call to brelse(bitmap_bh) since at that
point bitmap_bh is NULL and slightly cleans up bitmap_bh handling.
Signed-off-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
- Skip locking if quota is dirty already.
- Return old quota state to help fs-specciffic implementation to optimize
case where quota was dirty already.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
There is no point in loading bitmap for groups which are completely full.
This causes noticeable performance problems (and memory pressure) on small
systems with large full filesystem
(http://marc.info/?l=linux-ext4&m=126843108314310&w=2).
Port of the same ext3 patch.
Signed-off-by: Jan Kara <jack@suse.cz>
There is no point in loading bitmap for groups which are completely full.
This causes noticeable performance problems (and memory pressure) on small
systems with large full filesystem
(http://marc.info/?l=linux-ext4&m=126843108314310&w=2).
Jan Kara: Added a comment and changed check to use cpu-endian value.
Signed-off-by: "Frans van de Wiel" <fvdw@fvdw.eu>
Signed-off-by: Jan Kara <jack@suse.cz>
This allows bin_attr->read,write,mmap callbacks to check file specific data
(such as inode owner) as part of any privilege validation.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In Al's latest vfs tree the code is reworked and S_BIAS has been removed.
It turns out that checking to see if a super block is in the
middle of an unmount in sysfs_exit_ns is unnecessary because we
remove the super_block from the s_supers/s_instances list before
struct sysfs_super_info pointed to by sb->s_fs_info is freed.
For now just delete the unnecessary check to see if a superblock is in the
middle of an unmount, it isn't necessary with or without Al's changes
and it just causes a needless conflict.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add some in-line comments to explain the new infrastructure, which
was introduced to support sysfs directory tagging with namespaces.
I think an overall description someplace might be good too, but it
didn't really seem to fit into Documentation/filesystems/sysfs.txt,
which appears more geared toward users, rather than maintainers, of
sysfs.
(Tejun, please let me know if I can make anything clearer or failed
altogether to comment something that should be commented.)
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When removing a symlink sysfs_remove_link does not provide
enough information to figure out which tagged directory the symlink
falls in. So I need sysfs_delete_link which is passed the target
of the symlink to delete.
sysfs_rename_link is updated to call sysfs_delete_link instead
of sysfs_remove_link as we have all of the information necessary
and the callers are interesting.
Both of these functions now have enough information to find a symlink
in a tagged directory. The only restriction is that they must be called
before the target kobject is renamed or deleted. If they are called
later I loose track of which tag the target kobject was marked with
and can no longer find the old symlink to remove it.
This patch was split from an earlier patch.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I had hopped to avoid this but the bonding driver adds a file
to /sys/class/net/ and the easiest way to handle that file is
to make it untagged and to register it only once.
So relax the rules on tagged directories, and make bonding work.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.
What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.
I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.
For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.
To achieve a reace free design all tagged directories are created
and managed by sysfs itself.
Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and it's operations
- sysfs_exit_ns when an individual tag is no longer valid
- Implement mount_ns() which returns the ns of the calling process
so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a syfs kobject.
Everything else is left up to sysfs and the driver layer.
For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.
Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.
The work needed in sysfs is more extensive. At each directory
or symlink creating I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent. Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.
Currently only directories which hold kobjects, and
symlinks are supported. There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add all of the necessary bioler plate to support
multiple superblocks in sysfs.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Make devtmpfs available on (embedded) configurations without SHMEM/TMPFS,
using ramfs instead.
Saves ~15KB.
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (23 commits)
nilfs2: disallow remount of snapshot from/to a regular mount
nilfs2: use huge_encode_dev/huge_decode_dev
nilfs2: update comment on deactivate_super at nilfs_get_sb
nilfs2: replace MS_VERBOSE with MS_SILENT
nilfs2: add missing initialization of s_mode
nilfs2: fix misuse of open_bdev_exclusive/close_bdev_exclusive
nilfs2: enlarge s_volume_name member in nilfs_super_block
nilfs2: use checkpoint number instead of timestamp to select super block
nilfs2: add missing endian conversion on super block magic number
nilfs2: make nilfs_sc_*_ops static
nilfs2: add kernel doc comments to persistent object allocator functions
nilfs2: change sc_timer from a pointer to an embedded one in struct nilfs_sc_info
nilfs2: remove nilfs_segctor_init() in segment.c
nilfs2: insert checkpoint number in segment summary header
nilfs2: add a print message after loading nilfs2
nilfs2: cleanup multi kmem_cache_{create,destroy} code
nilfs2: move out checksum routines to segment buffer code
nilfs2: move pointer to super root block into logs
nilfs2: change default of 'errors' mount option to 'remount-ro' mode
nilfs2: Combine nilfs_btree_release_path() and nilfs_btree_free_path()
...
* git://git.infradead.org/mtd-2.6: (154 commits)
mtd: cfi_cmdset_0002: use AMD standard command-set with Winbond flash chips
mtd: cfi_cmdset_0002: Fix MODULE_ALIAS and linkage for new 0701 commandset ID
mtd: mxc_nand: Remove duplicate NAND_CMD_RESET case value
mtd: update gfp/slab.h includes
jffs2: Stop triggering block erases from jffs2_write_super()
jffs2: Rename jffs2_erase_pending_trigger() to jffs2_dirty_trigger()
jffs2: Use jffs2_garbage_collect_trigger() to trigger pending erases
jffs2: Require jffs2_garbage_collect_trigger() to be called with lock held
jffs2: Wake GC thread when there are blocks to be erased
jffs2: Erase pending blocks in GC pass, avoid invalid -EIO return
jffs2: Add 'work_done' return value from jffs2_erase_pending_blocks()
mtd: mtdchar: Do not corrupt backing device of device node inode
mtd/maps/pcmciamtd: Fix printk format for ssize_t in debug messages
drivers/mtd: Use kmemdup
mtd: cfi_cmdset_0002: Fix argument order in bootloc warning
mtd: nand: add Toshiba TC58NVG0 device ID
pcmciamtd: add another ID
pcmciamtd: coding style cleanups
pcmciamtd: fixing obvious errors
mtd: chips: add SST39WF160x NOR-flashes
...
Trivial conflicts due to dev_node removal in drivers/mtd/maps/pcmciamtd.c
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (54 commits)
xfs: mark xfs_iomap_write_ helpers static
xfs: clean up end index calculation in xfs_page_state_convert
xfs: clean up mapping size calculation in __xfs_get_blocks
xfs: clean up xfs_iomap_valid
xfs: move I/O type flags into xfs_aops.c
xfs: kill struct xfs_iomap
xfs: report iomap_bn in block base
xfs: report iomap_offset and iomap_bsize in block base
xfs: remove iomap_delta
xfs: remove iomap_target
xfs: limit xfs_imap_to_bmap to a single mapping
xfs: simplify buffer to transaction matching
xfs: Make fiemap work in query mode.
xfs: kill off l_sectbb_mask
xfs: record log sector size rather than log2(that)
xfs: remove dead XFS_LOUD_RECOVERY code
xfs: removed unused XFS_QMOPT_ flags
xfs: remove a few macro indirections in the quota code
xfs: access quotainfo structure directly
xfs: wait for direct I/O to complete in fsync and write_inode
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits)
ocfs2: Silence a gcc warning.
ocfs2: Don't retry xattr set in case value extension fails.
ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
fs/ocfs2/dlm: Use kstrdup
fs/ocfs2/dlm: Drop memory allocation cast
Ocfs2: Optimize punching-hole code.
Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
ocfs2: Wrap signal blocking in void functions.
ocfs2/dlm: Increase o2dlm lockres hash size
ocfs2: Make ocfs2_extend_trans() really extend.
ocfs2/trivial: Code cleanup for allocation reservation.
ocfs2: make ocfs2_adjust_resv_from_alloc simple.
ocfs2: Make nointr a default mount option
ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
o2net: log socket state changes
ocfs2: print node # when tcp fails
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1674 commits)
qlcnic: adding co maintainer
ixgbe: add support for active DA cables
ixgbe: dcb, do not tag tc_prio_control frames
ixgbe: fix ixgbe_tx_is_paused logic
ixgbe: always enable vlan strip/insert when DCB is enabled
ixgbe: remove some redundant code in setting FCoE FIP filter
ixgbe: fix wrong offset to fc_frame_header in ixgbe_fcoe_ddp
ixgbe: fix header len when unsplit packet overflows to data buffer
ipv6: Never schedule DAD timer on dead address
ipv6: Use POSTDAD state
ipv6: Use state_lock to protect ifa state
ipv6: Replace inet6_ifaddr->dead with state
cxgb4: notify upper drivers if the device is already up when they load
cxgb4: keep interrupts available when the ports are brought down
cxgb4: fix initial addition of MAC address
cnic: Return SPQ credit to bnx2x after ring setup and shutdown.
cnic: Convert cnic_local_flags to atomic ops.
can: Fix SJA1000 command register writes on SMP systems
bridge: fix build for CONFIG_SYSFS disabled
ARCNET: Limit com20020 PCI ID matches for SOHARD cards
...
Fix up various conflicts with pcmcia tree drivers/net/
{pcmcia/3c589_cs.c, wireless/orinoco/orinoco_cs.c and
wireless/orinoco/spectrum_cs.c} and feature removal
(Documentation/feature-removal-schedule.txt).
Also fix a non-content conflict due to pm_qos_requirement getting
renamed in the PM tree (now pm_qos_request) in net/mac80211/scan.c
This patch modifies the fs/timerfd.c to use the newly created
wait_event_interruptible_locked_irq() macro. This replaces an open
code implementation with a single macro call.
Signed-off-by: Michal Nazarewicz <m.nazarewicz@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (61 commits)
KEYS: Return more accurate error codes
LSM: Add __init to fixup function.
TOMOYO: Add pathname grouping support.
ima: remove ACPI dependency
TPM: ACPI/PNP dependency removal
security/selinux/ss: Use kstrdup
TOMOYO: Use stack memory for pending entry.
Revert "ima: remove ACPI dependency"
Revert "TPM: ACPI/PNP dependency removal"
KEYS: Do preallocation for __key_link()
TOMOYO: Use mutex_lock_interruptible.
KEYS: Better handling of errors from construct_alloc_key()
KEYS: keyring_serialise_link_sem is only needed for keyring->keyring links
TOMOYO: Use GFP_NOFS rather than GFP_KERNEL.
ima: remove ACPI dependency
TPM: ACPI/PNP dependency removal
selinux: generalize disabling of execmem for plt-in-heap archs
LSM Audit: rename LSM_AUDIT_NO_AUDIT to LSM_AUDIT_DATA_NONE
CRED: Holding a spinlock does not imply the holding of RCU read lock
SMACK: Don't #include Ext2 headers
...
* 'for-2.6.35' of git://linux-nfs.org/~bfields/linux: (45 commits)
Revert "nfsd4: distinguish expired from stale stateids"
nfsd: safer initialization order in find_file()
nfs4: minor callback code simplification, comment
NFSD: don't report compiled-out versions as present
nfsd4: implement reclaim_complete
nfsd4: nfsd4_destroy_session must set callback client under the state lock
nfsd4: keep a reference count on client while in use
nfsd4: mark_client_expired
nfsd4: introduce nfs4_client.cl_refcount
nfsd4: refactor expire_client
nfsd4: extend the client_lock to cover cl_lru
nfsd4: use list_move in move_to_confirmed
nfsd4: fold release_session into expire_client
nfsd4: rename sessionid_lock to client_lock
nfsd4: fix bare destroy_session null dereference
nfsd4: use local variable in nfs4svc_encode_compoundres
nfsd: further comment typos
sunrpc: centralise most calls to svc_xprt_received
nfsd4: fix unlikely race in session replay case
nfsd4: fix filehandle comment
...
* 'nfs-for-2.6.35' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (78 commits)
SUNRPC: Don't spam gssd with upcall requests when the kerberos key expired
SUNRPC: Reorder the struct rpc_task fields
SUNRPC: Remove the 'tk_magic' debugging field
SUNRPC: Move the task->tk_bytes_sent and tk_rtt to struct rpc_rqst
NFS: Don't call iput() in nfs_access_cache_shrinker
NFS: Clean up nfs_access_zap_cache()
NFS: Don't run nfs_access_cache_shrinker() when the mask is GFP_NOFS
SUNRPC: Ensure rpcauth_prune_expired() respects the nr_to_scan parameter
SUNRPC: Ensure memory shrinker doesn't waste time in rpcauth_prune_expired()
SUNRPC: Dont run rpcauth_cache_shrinker() when gfp_mask is GFP_NOFS
NFS: Read requests can use GFP_KERNEL.
NFS: Clean up nfs_create_request()
NFS: Don't use GFP_KERNEL in rpcsec_gss downcalls
NFSv4: Don't use GFP_KERNEL allocations in state recovery
SUNRPC: Fix xs_setup_bc_tcp()
SUNRPC: Replace jiffies-based metrics with ktime-based metrics
ktime: introduce ktime_to_ms()
SUNRPC: RPC metrics and RTT estimator should use same RTT value
NFS: Calldata for nfs4_renew_done()
NFS: Squelch compiler warning in nfs_add_server_stats()
...
* 'bkl/procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
sunrpc: Include missing smp_lock.h
procfs: Kill the bkl in ioctl
procfs: Push down the bkl from ioctl
procfs: Use generic_file_llseek in /proc/vmcore
procfs: Use generic_file_llseek in /proc/kmsg
procfs: Use generic_file_llseek in /proc/kcore
procfs: Kill BKL in llseek on proc base
This is the culmination of this sequence of patches. By moving the block
erasing from jffs2_write_super() into the GC code, we avoid huge
latencies on unmount where it waits for _all_ pending blocks to be
erased, and we allow better control for time-critical tasks by stopping
the GC thread.
Signed-off-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Now that we do erases from GC and trigger the GC thread to do them
instead of using kupdated, this function is misnamed. It's only used
for triggering wbuf flush on NAND flash now. Rename it accordingly.
Signed-off-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We're about to call this from a bunch of places which already hold
c->erase_completion_lock, so add an assertion and change its existing
callers to do the same.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Now that we trigger block erases from jffs2_garbage_collect_pass(),
adjust jffs2_thread_should_wake() to return 1 when there are blocks to
erase.
Signed-off-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
jffs2_garbage_collect_pass() would previously return -EAGAIN if it
couldn't find anything to garbage collect from, and there were blocks on
the erase_pending_list. If the blocks were actually in the process of
being erased, though, then they wouldn't be on that list. Check for
nr_erasing_blocks being non-zero instead.
Fix jffs2_reserve_space() to wait for the in-progress erases to
complete, when jffs2_garbage_collect_pass() returns -EAGAIN.
And fix jffs2_erase_succeeded() to actually wake up the erase_wait wq
that jffs2_reserve_space() is now using.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We're about to start calling this from the jffs2_garbage_collect_pass(), and
we'll want to know whether it actually did anything or not.
Signed-off-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Rename all iomap_valid identifiers to imap_valid to fit the new
world order, and clean up xfs_iomap_valid to convert the passed in
offset to blocks instead of the imap values to bytes. Use the
simpler inode->i_blkbits instead of the XFS macros for this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The IOMAP_ flags are now only used inside xfs_aops.c for extent
probing and I/O completion tracking, so more them here, and rename
them to IO_* as there's no mapping involved at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that struct xfs_iomap contains exactly the same units as struct
xfs_bmbt_irec we can just use the latter directly in the aops code.
Replace the missing IOMAP_NEW flag with a new boolean output
parameter to xfs_iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Report the iomap_bn field of struct xfs_iomap in terms of filesystem
blocks instead of in terms of bytes. Shift the byte conversions
into the caller, and replace the IOMAP_DELAY and IOMAP_HOLE flag
checks with checks for HOLESTARTBLOCK and DELAYSTARTBLOCK.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>