linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-15 13:56:45 +07:00

Author	SHA1	Message	Date
Al Viro	84695ffee7	Merge getxattr prototype change into work.lookups The rest of work.xattr stuff isn't needed for this branch	2016-05-02 19:45:47 -04:00
Mimi Zohar	05d1a717ec	ima: add support for creating files using the mknodat syscall Commit `3034a14` "ima: pass 'opened' flag to identify newly created files" stopped identifying empty files as new files. However new empty files can be created using the mknodat syscall. On systems with IMA-appraisal enabled, these empty files are not labeled with security.ima extended attributes properly, preventing them from subsequently being opened in order to write the file data contents. This patch defines a new hook named ima_post_path_mknod() to mark these empty files, created using mknodat, as new in order to allow the file data contents to be written. In addition, files with security.ima xattrs containing a file signature are considered "immutable" and can not be modified. The file contents need to be written, before signing the file. This patch relaxes this requirement for new files, allowing the file signature to be written before the file contents. Changelog: - defer identifying files with signatures stored as security.ima (based on Dmitry Rozhkov's comments) - removing tests (eg. dentry, dentry->d_inode, inode->i_size == 0) (based on Al's review) Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Al Viro <<viro@zeniv.linux.org.uk> Tested-by: Dmitry Rozhkov <dmitry.rozhkov@linux.intel.com>	2016-05-01 09:23:52 -04:00
Al Viro	10c64cea04	atomic_open(): fix the handling of create_error * if we have a hashed negative dentry and either CREAT\|EXCL on r/o filesystem, or CREAT\|TRUNC on r/o filesystem, or CREAT\|EXCL with failing may_o_create(), we should fail with EROFS or the error may_o_create() has returned, but not ENOENT. Which is what the current code ends up returning. * if we have CREAT\|TRUNC hitting a regular file on a read-only filesystem, we can't fail with EROFS here. At the very least, not until we'd done follow_managed() - we might have a writable file (or a device, for that matter) bound on top of that one. Moreover, the code downstream will see that O_TRUNC and attempt to grab the write access (after following possible mount), so if we really should fail with EROFS, it will happen. No need to do that inside atomic_open(). The real logics is much simpler than what the current code is trying to do - if we decided to go for simple lookup, ended up with a negative dentry and had create_error set, fail with create_error. No matter whether we'd got that negative dentry from lookup_real() or had found it in dcache. Cc: stable@vger.kernel.org # v3.6+ Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-04-30 16:40:52 -04:00
Al Viro	fc64005c93	don't bother with ->d_inode->i_sb - it's always equal to ->d_sb ... and neither can ever be NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-04-10 17:11:51 -04:00
Christoph Hellwig	bfe8804d90	xfs: use ->readlink to implement the readlink_by_handle ioctl Also drop the now unused readlink_copy export. [dchinner: use d_inode(dentry) rather than dentry->d_inode] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-04-06 07:50:54 +10:00
Andreas Gruenbacher	b8a7a3a667	posix_acl: Inode acl caching fixes When get_acl() is called for an inode whose ACL is not cached yet, the get_acl inode operation is called to fetch the ACL from the filesystem. The inode operation is responsible for updating the cached acl with set_cached_acl(). This is done without locking at the VFS level, so another task can call set_cached_acl() or forget_cached_acl() before the get_acl inode operation gets to calling set_cached_acl(), and then get_acl's call to set_cached_acl() results in caching an outdate ACL. Prevent this from happening by setting the cached ACL pointer to a task-specific sentinel value before calling the get_acl inode operation. Move the responsibility for updating the cached ACL from the get_acl inode operations to get_acl(). There, only set the cached ACL if the sentinel value hasn't changed. The sentinel values are chosen to have odd values. Likewise, the value of ACL_NOT_CACHED is odd. In contrast, ACL object pointers always have an even value (ACLs are aligned in memory). This allows to distinguish uncached ACLs values from ACL objects. In addition, switch from guarding inode->i_acl and inode->i_default_acl upates by the inode->i_lock spinlock to using xchg() and cmpxchg(). Filesystems that do not want ACLs returned from their get_acl inode operations to be cached must call forget_cached_acl() to prevent the VFS from doing so. (Patch written by Al Viro and Andreas Gruenbacher.) Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-31 00:30:15 -04:00
Al Viro	7500c38ac3	fix the braino in "namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()" We should try to trigger automount before bailing out on negative dentry. Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Reported-by: Arend van Spriel <arend@broadcom.com> Tested-by: Arend van Spriel <arend@broadcom.com> Tested-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-31 00:23:05 -04:00
Al Viro	d360775217	constify security_path_{mkdir,mknod,symlink} ... as well as unix_mknod() and may_o_create() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-28 00:47:27 -04:00
Al Viro	9d95afd597	kill dentry_unhash() the last user is gone Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-14 00:16:33 -04:00
Al Viro	949a852e46	namei: teach lookup_slow() to skip revalidate ... and make mountpoint_last() use it. That makes all candidates for lookup with parent locked shared go through lookup_slow(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-14 00:15:46 -04:00
Al Viro	e3c1392808	namei: massage lookup_slow() to be usable by lookup_one_len_unlocked() Return dentry and don't pass nameidata or path; lift crossing mountpoints into the caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-14 00:15:40 -04:00
Al Viro	d6d95ded91	lookup_one_len_unlocked(): use lookup_dcache() No need to lock parent just because of ->d_revalidate() on child; contrary to the stale comment, lookup_dcache() can be used without locking the parent. Result can be moved as soon as we return, of course, but the same is true for lookup_one_len_unlocked() itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-14 00:15:36 -04:00
Al Viro	74ff0ffc7f	namei: simplify invalidation logics in lookup_dcache() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-14 00:15:31 -04:00
Al Viro	e9742b5332	namei: change calling conventions for lookup_{fast,slow} and follow_managed() Have lookup_fast() return 1 on success and 0 on "need to fall back"; lookup_slow() and follow_managed() return positive (1) on success. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-14 00:14:35 -04:00
Al Viro	5d0f49c136	namei: untanlge lookup_fast() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-14 00:14:25 -04:00
Al Viro	6c51e513a3	lookup_dcache(): lift d_alloc() into callers ... and kill need_lookup thing Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-05 20:09:32 -05:00
Al Viro	6583fe22d1	do_last(): reorder and simplify a bit bugger off on negatives a bit earlier, simplify the tests Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-05 18:14:03 -05:00
Al Viro	5129fa482b	do_last(): ELOOP failure exit should be done after leaving RCU mode ... or we risk seeing a bogus value of d_is_symlink() there. Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-02-27 19:37:37 -05:00
Al Viro	a7f775428b	should_follow_link(): validate ->d_seq after having decided to follow ... otherwise d_is_symlink() above might have nothing to do with the inode value we've got. Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-02-27 19:31:01 -05:00
Al Viro	d4565649b6	namei: ->d_inode of a pinned dentry is stable only for positives both do_last() and walk_component() risk picking a NULL inode out of dentry about to become positive, then checking its flags and seeing that it's not negative anymore and using (already stale by then) value they'd fetched earlier. Usually ends up oopsing soon after that... Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-02-27 19:23:16 -05:00
Al Viro	c80567c82a	do_last(): don't let a bogus return value from ->open() et.al. to confuse us ... into returning a positive to path_openat(), which would interpret that as "symlink had been encountered" and proceed to corrupt memory, etc. It can only happen due to a bug in some ->open() instance or in some LSM hook, etc., so we report any such event and make sure it doesn't trick us into further unpleasantness. Cc: stable@vger.kernel.org # v3.6+, at least Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-02-27 19:17:33 -05:00
Al Viro	5955102c99	wrappers for ->i_mutex access parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-22 18:04:28 -05:00
Linus Torvalds	33caf82acf	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "All kinds of stuff. That probably should've been 5 or 6 separate branches, but by the time I'd realized how large and mixed that bag had become it had been too close to -final to play with rebasing. Some fs/namei.c cleanups there, memdup_user_nul() introduction and switching open-coded instances, burying long-dead code, whack-a-mole of various kinds, several new helpers for ->llseek(), assorted cleanups and fixes from various people, etc. One piece probably deserves special mention - Neil's lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets called without ->i_mutex and tries to avoid ever taking it. That, of course, means that it's not useful for any directory modifications, but things like getting inode attributes in nfds readdirplus are fine with that. I really should've asked for moratorium on lookup-related changes this cycle, but since I hadn't done that early enough... I am asking for that for the coming cycle, though - I'm going to try and get conversion of i_mutex to rwsem with ->lookup() done under lock taken shared. There will be a patch closer to the end of the window, along the lines of the one Linus had posted last May - mechanical conversion of ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/ inode_is_locked()/inode_lock_nested(). To quote Linus back then: ----- \| This is an automated patch using \| \| sed 's/mutex_lock(&$.$->i_mutex)/inode_lock(\1)/' \| sed 's/mutex_unlock(&$.$->i_mutex)/inode_unlock(\1)/' \| sed 's/mutex_lock_nested(&$.$->i_mutex,[ ]I_MUTEX_$[A-Z0-9_]$)/inode_lock_nested(\1, I_MUTEX_\2)/' \| sed 's/mutex_is_locked(&$.$->i_mutex)/inode_is_locked(\1)/' \| sed 's/mutex_trylock(&$.$->i_mutex)/inode_trylock(\1)/' \| \| with a very few manual fixups ----- I'm going to send that once the ->i_mutex-affecting stuff in -next gets mostly merged (or when Linus says he's about to stop taking merges)" 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) nfsd: don't hold i_mutex over userspace upcalls fs:affs:Replace time_t with time64_t fs/9p: use fscache mutex rather than spinlock proc: add a reschedule point in proc_readfd_common() logfs: constify logfs_block_ops structures fcntl: allow to set O_DIRECT flag on pipe fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE fs: xattr: Use kvfree() [s390] page_to_phys() always returns a multiple of PAGE_SIZE nbd: use ->compat_ioctl() fs: use block_device name vsprintf helper lib/vsprintf: add %*pg format specifier fs: use gendisk->disk_name where possible poll: plug an unused argument to do_poll amdkfd: don't open-code memdup_user() cdrom: don't open-code memdup_user() rsxx: don't open-code memdup_user() mtip32xx: don't open-code memdup_user() [um] mconsole: don't open-code memdup_user_nul() [um] hostaudio: don't open-code memdup_user() ...	2016-01-12 17:11:47 -08:00
NeilBrown	bbddca8e8f	nfsd: don't hold i_mutex over userspace upcalls We need information about exports when crossing mountpoints during lookup or NFSv4 readdir. If we don't already have that information cached, we may have to ask (and wait for) rpc.mountd. In both cases we currently hold the i_mutex on the parent of the directory we're asking rpc.mountd about. We've seen situations where rpc.mountd performs some operation on that directory that tries to take the i_mutex again, resulting in deadlock. With some care, we may be able to avoid that in rpc.mountd. But it seems better just to avoid holding a mutex while waiting on userspace. It appears that lookup_one_len is pretty much the only operation that needs the i_mutex. So we could just drop the i_mutex elsewhere and do something like mutex_lock() lookup_one_len() mutex_unlock() In many cases though the lookup would have been cached and not required the i_mutex, so it's more efficient to create a lookup_one_len() variant that only takes the i_mutex when necessary. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-09 03:07:52 -05:00
Al Viro	62fb4a155f	don't carry MAY_OPEN in op->acc_mode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:28:40 -05:00
Al Viro	fceef393a5	switch ->get_link() to delayed_call, kill ->put_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-30 13:01:03 -05:00
Al Viro	d3883d4f93	teach page_get_link() to work in RCU mode more or less along the lines of Neil's patchset, sans the insanity around kmap(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-08 22:41:54 -05:00
Al Viro	6b2553918d	replace ->follow_link() with new method that could stay in RCU mode new method: ->get_link(); replacement of ->follow_link(). The differences are: * inode and dentry are passed separately * might be called both in RCU and non-RCU mode; the former is indicated by passing it a NULL dentry. * when called that way it isn't allowed to block and should return ERR_PTR(-ECHILD) if it needs to be called in non-RCU mode. It's a flagday change - the old method is gone, all in-tree instances converted. Conversion isn't hard; said that, so far very few instances do not immediately bail out when called in RCU mode. That'll change in the next commits. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-08 22:41:54 -05:00
Al Viro	21fc61c73c	don't put symlink bodies in pagecache into highmem kmap() in page_follow_link_light() needed to go - allowing to hold an arbitrary number of kmaps for long is a great way to deadlocking the system. new helper (inode_nohighmem(inode)) needs to be used for pagecache symlinks inodes; done for all in-tree cases. page_follow_link_light() instrumented to yell about anything missed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-08 22:41:36 -05:00
Al Viro	e1a63bbc40	restore_nameidata(): no need to clear now->stack microoptimization: in all callers *now is in the frame we are about to leave. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-06 21:18:27 -05:00
Al Viro	248fb5b955	namei.c: take "jump to root" into a new helper ... and use it both in path_init() (for absolute pathnames) and get_link() (for absolute symlinks). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-06 21:18:21 -05:00
Al Viro	ef55d91700	path_init(): set nd->inode earlier in cwd-relative case that allows to kill the recheck of nd->seq on the way out in this case, and this check on the way out is left only for absolute pathnames. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-06 21:18:16 -05:00
Al Viro	9e6697e26f	namei.c: fold set_root_rcu() into set_root() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-06 21:18:10 -05:00
Mike Marshall	57e3715cfa	typo in fs/namei.c comment Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-06 21:17:18 -05:00
Al Viro	aa80deab33	namei: page_getlink() and page_follow_link_light() are the same thing Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-06 20:43:27 -05:00
Al Viro	2788cc47f4	Don't reset ->total_link_count on nested calls of vfs_path_lookup() we already zero it on outermost set_nameidata(), so initialization in path_init() is pointless and wrong. The same DoS exists on pre-4.2 kernels, but there a slightly different fix will be needed. Cc: stable@vger.kernel.org # v4.2 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-06 12:33:02 -05:00
Linus Torvalds	ad804a0b2a	Merge branch 'akpm' (patches from Andrew) Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...	2015-11-07 14:32:45 -08:00
Linus Torvalds	75021d2859	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial updates from Jiri Kosina: "Trivial stuff from trivial tree that can be trivially summed up as: - treewide drop of spurious unlikely() before IS_ERR() from Viresh Kumar - cosmetic fixes (that don't really affect basic functionality of the driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek - various comment / printk fixes and updates all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: bcache: Really show state of work pending bit hwmon: applesmc: fix comment typos Kconfig: remove comment about scsi_wait_scan module class_find_device: fix reference to argument "match" debugfs: document that debugfs_remove*() accepts NULL and error values net: Drop unlikely before IS_ERR(_OR_NULL) mm: Drop unlikely before IS_ERR(_OR_NULL) fs: Drop unlikely before IS_ERR(_OR_NULL) drivers: net: Drop unlikely before IS_ERR(_OR_NULL) drivers: misc: Drop unlikely before IS_ERR(_OR_NULL) UBI: Update comments to reflect UBI_METAONLY flag pktcdvd: drop null test before destroy functions	2015-11-07 13:05:44 -08:00
Michal Hocko	c62d25556b	mm, fs: introduce mapping_gfp_constraint() There are many places which use mapping_gfp_mask to restrict a more generic gfp mask which would be used for allocations which are not directly related to the page cache but they are performed in the same context. Let's introduce a helper function which makes the restriction explicit and easier to track. This patch doesn't introduce any functional changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Linus Torvalds	6de29ccb50	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns hardlink capability check fix from Eric Biederman: "This round just contains a single patch. There has been a lot of other work this period but it is not quite ready yet, so I am pushing it until 4.5. The remaining change by Dirk Steinmetz wich fixes both Gentoo and Ubuntu containers allows hardlinks if we have the appropriate capabilities in the user namespace. Security wise it is really a gimme as the user namespace root can already call setuid become that user and create the hardlink" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: namei: permit linking with CAP_FOWNER in userns	2015-11-05 15:20:56 -08:00
Dirk Steinmetz	f2ca379642	namei: permit linking with CAP_FOWNER in userns Attempting to hardlink to an unsafe file (e.g. a setuid binary) from within an unprivileged user namespace fails, even if CAP_FOWNER is held within the namespace. This may cause various failures, such as a gentoo installation within a lxc container failing to build and install specific packages. This change permits hardlinking of files owned by mapped uids, if CAP_FOWNER is held for that namespace. Furthermore, it improves consistency by using the existing inode_owner_or_capable(), which is aware of namespaced capabilities as of `23adbe12ef` ("fs,userns: Change inode_capable to capable_wrt_inode_uidgid"). Signed-off-by: Dirk Steinmetz <public@rsjtdrjgfuzkfg.com> This is hitting us in Ubuntu during some dpkg upgrades in containers. When upgrading a file dpkg creates a hard link to the old file to back it up before overwriting it. When packages upgrade suid files owned by a non-root user the link isn't permitted, and the package upgrade fails. This patch fixes our problem. Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2015-10-27 16:12:35 -05:00
Trond Myklebust	daf3761c9f	namei: results of d_is_negative() should be checked after dentry revalidation Leandro Awa writes: "After switching to version 4.1.6, our parallelized and distributed workflows now fail consistently with errors of the form: T34: ./regex.c:39:22: error: config.h: No such file or directory From our 'git bisect' testing, the following commit appears to be the possible cause of the behavior we've been seeing: commit 766c4cbfacd8" Al Viro says: "What happens is that `766c4cbfac` got the things subtly wrong. We used to treat d_is_negative() after lookup_fast() as "fall with ENOENT". That was wrong - checking ->d_flags outside of ->d_seq protection is unreliable and failing with hard error on what should've fallen back to non-RCU pathname resolution is a bug. Unfortunately, we'd pulled the test too far up and ran afoul of another kind of staleness. The dentry might have been absolutely stable from the RCU point of view (and we might be on UP, etc), but stale from the remote fs point of view. If ->d_revalidate() returns "it's actually stale", dentry gets thrown away and the original code wouldn't even have looked at its ->d_flags. What we need is to check ->d_flags where `766c4cbfac` does (prior to ->d_seq validation) but only use the result in cases where we do not discard this dentry outright" Reported-by: Leandro Awa <lawa@nvidia.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911 Fixes: `766c4cbfac` ("namei: d_is_negative() should be checked...") Tested-by: Leandro Awa <lawa@nvidia.com> Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-10-10 10:17:27 -07:00
Viresh Kumar	a1c83681d5	fs: Drop unlikely before IS_ERR(_OR_NULL) IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there is no need to do that again from its callers. Drop it. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Jeff Layton <jlayton@poochiereds.net> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve French <smfrench@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2015-09-29 15:13:58 +02:00
Masanari Iida	2a78b857d3	namei: fix warning while make xmldocs caused by namei.c Fix the following warnings: Warning(.//fs/namei.c:2422): No description found for parameter 'nd' Warning(.//fs/namei.c:2422): Excess function parameter 'nameidata' description in 'path_mountpoint' Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-10 13:29:01 -07:00
Eric W. Biederman	397d425dc2	vfs: Test for and handle paths that are unreachable from their mnt_root In rare cases a directory can be renamed out from under a bind mount. In those cases without special handling it becomes possible to walk up the directory tree to the root dentry of the filesystem and down from the root dentry to every other file or directory on the filesystem. Like division by zero .. from an unconnected path can not be given a useful semantic as there is no predicting at which path component the code will realize it is unconnected. We certainly can not match the current behavior as the current behavior is a security hole. Therefore when encounting .. when following an unconnected path return -ENOENT. - Add a function path_connected to verify path->dentry is reachable from path->mnt.mnt_root. AKA to validate that rename did not do something nasty to the bind mount. To avoid races path_connected must be called after following a path component to it's next path component. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-08-21 03:20:10 -04:00
Al Viro	aa65fa35ba	may_follow_link() should use nd->inode Now that we can get there in RCU mode, we shouldn't play with nd->path.dentry->d_inode - it's not guaranteed to be stable. Use nd->inode instead. Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-08-04 23:23:50 -04:00
Al Viro	97242f99a0	link_path_walk(): be careful when failing with ENOTDIR In RCU mode we might end up with dentry evicted just we check that it's a directory. In such case we should return ECHILD rather than ENOTDIR, so that pathwalk would be retries in non-RCU mode. Breakage had been introduced in commit `b18825a` - prior to that we were looking at nd->inode, which had been fetched before verifying that ->d_seq was still valid. That form of check would only be satisfied if at some point the pathname prefix would indeed have resolved to a non-directory. The fix consists of checking ->d_seq after we'd run into a non-directory dentry, and failing with ECHILD in case of mismatch. Note that all branches since 3.12 have that problem... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-08-01 20:18:38 -04:00
Al Viro	06d7137e5c	namei: make set_root_rcu() return void The only caller that cares about its return value can just as easily pick it from nd->root_seq itself. We used to just calculate it and return to caller, but these days we are storing it in nd->root_seq in all cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-29 12:07:04 -04:00
Al Viro	b853a16176	turn user_{path_at,path,lpath,path_dir}() into static inlines Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:45 -04:00
Al Viro	9883d1855e	namei: move saved_nd pointer into struct nameidata these guys are always declared next to each other; might as well put the former (pointer to previous instance) into the latter and simplify the calling conventions for {set,restore}_nameidata() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:45 -04:00
Al Viro	520ae68747	inline user_path_create() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:44 -04:00
Al Viro	a2ec4a2d5c	inline user_path_parent() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:44 -04:00
Al Viro	76ae2a5ab1	namei: trim do_last() arguments now that struct filename is stashed in nameidata we have no need to pass it in Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:43 -04:00
Al Viro	c8a53ee5ee	namei: stash dfd and name into nameidata fewer arguments to pass around... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:43 -04:00
Al Viro	102b8af266	namei: fold path_cleanup() into terminate_walk() they are always called next to each other; moreover, terminate_walk() is more symmetrical that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:42 -04:00
Al Viro	5c31b6cedb	namei: saner calling conventions for filename_parentat() a) make it reject ERR_PTR() for name b) make it putname(name) on all other failure exits c) make it return name on success again, simplifies the callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:42 -04:00
Al Viro	181c37b6e4	namei: saner calling conventions for filename_create() a) make it reject ERR_PTR() for name b) make it putname(name) upon return in all other cases. seriously simplifies the callers... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:42 -04:00
Al Viro	391172c46e	namei: shift nameidata down into filename_parentat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:41 -04:00
Al Viro	abc9f5beb1	namei: make filename_lookup() reject ERR_PTR() passed as name makes for much easier life in callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:41 -04:00
Al Viro	9ad1aaa615	namei: shift nameidata inside filename_lookup() pass root instead; non-NULL => copy to nd.root and set LOOKUP_ROOT in flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:40 -04:00
Al Viro	e4bd1c1a95	namei: move putname() call into filename_lookup() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:40 -04:00
Al Viro	625b6d1054	namei: pass the struct path to store the result down into path_lookupat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:39 -04:00
Al Viro	18d8c86011	namei: uninline set_root{,_rcu}() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:39 -04:00
Al Viro	aed434ada6	namei: be careful with mountpoint crossings in follow_dotdot_rcu() Otherwise we are risking a hard error where nonlazy restart would be the right thing to do; it's a very narrow race with mount --move and most of the time it ends up being completely harmless, but it's possible to construct a case when we'll get a bogus hard error instead of falling back to non-lazy walk... For one thing, when crossing _into_ overmount of parent we need to check for mount_lock bumps when we get NULL from __lookup_mnt() as well. For another, and less exotically, we need to make sure that the data fetched in follow_up_rcu() had been consistent. ->mnt_mountpoint is pinned for as long as it is a mountpoint, but we need to check mount_lock after fetching to verify that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:38 -04:00
Al Viro	5a8d87e8ed	namei: unlazy_walk() doesn't need to mess with current->fs anymore now that we have ->root_seq, legitimize_path(&nd->root, nd->root_seq) will do just fine... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:36 -04:00
Al Viro	8f47a0167c	namei: handle absolute symlinks without dropping out of RCU mode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:10:22 -04:00
Al Viro	8c1b456689	enable passing fast relative symlinks without dropping out of RCU mode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:06:28 -04:00
NeilBrown	8fa9dd2466	VFS/namei: make the use of touch_atime() in get_link() RCU-safe. touch_atime is not RCU-safe, and so cannot be called on an RCU walk. However, in situations where RCU-walk makes a difference, the symlink will likely to accessed much more often than it is useful to update the atime. So split out the test of "Does the atime actually need to be updated" into atime_needs_update(), and have get_link() unlazy if it finds that it will need to do that update. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:06:27 -04:00
Al Viro	bc40aee053	namei: don't unlazy until get_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:06:27 -04:00
Al Viro	7973387a2f	namei: make unlazy_walk and terminate_walk handle nd->stack, add unlazy_link We are almost done - primitives for leaving RCU mode are aware of nd->stack now, a new primitive for going to non-RCU mode when we have a symlink on hands added. The thing we are heavily relying upon is that any unlazy failure will be shortly followed by terminate_walk(), with no access to nameidata in between. So it's enough to leave the things in a state terminate_walk() would cope with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-15 01:06:01 -04:00
Al Viro	0450b2d120	namei: store seq numbers in nd->stack[] we'll need them for unlazy_walk() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:14 -04:00
Al Viro	31956502dd	namei: make may_follow_link() safe in RCU mode We can't call that audit garbage in RCU mode - it's doing a weird mix of allocations (GFP_NOFS, immediately followed by GFP_KERNEL) and I'm not touching that... thing again. So if this security sclero^Whardening feature gets triggered when we are in RCU mode, tough - we'll fail with -ECHILD and have everything restarted in non-RCU mode. Only to hit the same test and fail, this time with EACCES and with (oh, rapture) an audit spew produced. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:13 -04:00
Al Viro	6548fae2ec	namei: make put_link() RCU-safe very simple - just make path_put() conditional on !RCU. Note that right now it doesn't get called in RCU mode - we leave it before getting anything into stack. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:13 -04:00
Al Viro	5f2c4179e1	switch ->put_link() from dentry to inode only one instance looks at that argument at all; that sole exception wants inode rather than dentry. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:12 -04:00
NeilBrown	bda0be7ad9	security: make inode_follow_link RCU-walk aware inode_follow_link now takes an inode and rcu flag as well as the dentry. inode is used in preference to d_backing_inode(dentry), particularly in RCU-walk mode. selinux_inode_follow_link() gets dentry_has_perm() and inode_has_perm() open-coded into it so that it can call avc_has_perm_flags() in way that is safe if LOOKUP_RCU is set. Calling avc_has_perm_flags() with rcu_read_lock() held means that when avc_has_perm_noaudit calls avc_compute_av(), the attempt to rcu_read_unlock() before calling security_compute_av() will not actually drop the RCU read-lock. However as security_compute_av() is completely in a read_lock()ed region, it should be safe with the RCU read-lock held. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:11 -04:00
Al Viro	181548c051	namei: pick_link() callers already have inode no need to refetch (and once we move unlazy out of there, recheck ->d_seq). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:10 -04:00
David Howells	63afdfc781	VFS: Handle lower layer dentry/inode in pathwalk Make use of d_backing_inode() in pathwalk to gain access to an inode or dentry that's on a lower layer. Signed-off-by: David Howells <dhowells@redhat.com>	2015-05-11 08:13:10 -04:00
Al Viro	237d8b327a	namei: store inode in nd->stack[] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:09 -04:00
Al Viro	254cf58212	namei: don't mangle nd->seq in lookup_fast() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:09 -04:00
Al Viro	6e9918b7b3	namei: explicitly pass seq number to unlazy_walk() when dentry != NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:09 -04:00
Al Viro	3595e2346c	link_path_walk: use explicit returns for failure exits Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:08 -04:00
Al Viro	deb106c632	namei: lift terminate_walk() all the way up Lift it from link_path_walk(), trailing_symlink(), lookup_last(), mountpoint_last(), complete_walk() and do_last(). A _lot_ of those suckers merge. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:13:08 -04:00
Al Viro	3bdba28b72	namei: lift link_path_walk() call out of trailing_symlink() Make trailing_symlink() return the pathname to traverse or ERR_PTR(-E...). A subtle point is that for "magic" symlinks it returns "" now - that leads to link_path_walk("", nd), which is immediately returning 0 and we are back to the treatment of the last component, at whereever the damn thing has left us. Reduces the stack footprint - link_path_walk() called on more shallow stack now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:12:57 -04:00
Al Viro	368ee9ba56	namei: path_init() calling conventions change * lift link_path_walk() into callers; moving it down into path_init() had been a mistake. Stack footprint, among other things... * do _not_ call path_cleanup() after path_init() failure; on all failure exits out of it we have nothing for path_cleanup() to do * have path_init() return pathname or ERR_PTR(-E...) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:10:41 -04:00
Al Viro	34a26b99b7	namei: get rid of nameidata->base we can do fdput() under rcu_read_lock() just fine; all we need to take care of is fetching nd->inode value first. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-11 08:05:05 -04:00
Al Viro	8bcb77fabd	namei: split off filename_lookupat() with LOOKUP_PARENT new functions: filename_parentat() and path_parentat() resp. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:20 -04:00
Al Viro	b5cd339762	namei: may_follow_link() - lift terminate_walk() on failures into caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:20 -04:00
Al Viro	ab10492345	namei: take increment of nd->depth into pick_link() Makes the situation much more regular - we avoid a strange state when the element just after the top of stack is used to store struct path of symlink, but isn't counted in nd->depth. This is much more regular, so the normal failure exits, etc., work fine. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:19 -04:00
Al Viro	1cf2665b5b	namei: kill nd->link Just store it in nd->stack[nd->depth].link right in pick_link(). Now that we make sure of stack expansion in pick_link(), we can do so... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:19 -04:00
Al Viro	fec2fa24e8	may_follow_link(): trim arguments Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:18 -04:00
Al Viro	cd179f4468	namei: move bumping the refcount of link->mnt into pick_link() update the failure cleanup in may_follow_link() to match that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:18 -04:00
Al Viro	e8bb73dfb0	namei: fold put_link() into the failure case of complete_walk() ... and don't open-code unlazy_walk() in there - the only reason for that is to avoid verfication of cached nd->root, which is trivially avoided by discarding said cached nd->root first. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:17 -04:00
Al Viro	fab51e8ab2	namei: take the treatment of absolute symlinks to get_link() rather than letting the callers handle the jump-to-root part of semantics, do it right in get_link() and return the rest of the body for the caller to deal with - at that point it's treated the same way as relative symlinks would be. And return NULL when there's no "rest of the body" - those are treated the same as pure jump symlink would be. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:17 -04:00
Al Viro	4f697a5e17	namei: simpler treatment of symlinks with nothing other that / in the body Instead of saving name and branching to OK:, where we'll immediately restore it, and call walk_component() with WALK_PUT\|WALK_GET and nd->last_type being LAST_BIND, which is equivalent to put_link(nd), err = 0, we can just treat that the same way we'd treat procfs-style "jump" symlinks - do put_link(nd) and move on. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:16 -04:00
Al Viro	6920a4405e	namei: simplify failure exits in get_link() when cookie is NULL, put_link() is equivalent to path_put(), so as soon as we'd set last->cookie to NULL, we can bump nd->depth and let the normal logics in terminate_walk() to take care of cleanups. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:16 -04:00
Al Viro	6e77137b36	don't pass nameidata to ->follow_link() its only use is getting passed to nd_jump_link(), which can obtain it from current->nameidata Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:15 -04:00
Al Viro	8402752ecf	namei: simplify the callers of follow_managed() now that it gets nameidata, no reason to have setting LOOKUP_JUMPED on mountpoint crossing and calling path_put_conditional() on failures done in every caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:15 -04:00
NeilBrown	756daf263e	VFS: replace {, total_}link_count in task_struct with pointer to nameidata task_struct currently contains two ad-hoc members for use by the VFS: link_count and total_link_count. These are only interesting to fs/namei.c, so exposing them explicitly is poor layering. Incidentally, link_count isn't used anymore, so it can just die. This patches replaces those with a single pointer to 'struct nameidata'. This structure represents the current filename lookup of which there can only be one per process, and is a natural place to store total_link_count. This will allow the current "nameidata" argument to all follow_link operations to be removed as current->nameidata can be used instead in the _very_ few instances that care about it at all. As there are occasional circumstances where pathname lookup can recurse, such as through kern_path_locked, we always save and old current->nameidata (if there is one) when setting a new value, and make sure any active link_counts are preserved. follow_mount and follow_automount now get a 'struct nameidata *' rather than 'int flags' so that they can directly access total_link_count, rather than going through 'current'. Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:14 -04:00
Al Viro	626de99676	namei: move link count check and stack allocation into pick_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:13 -04:00
Al Viro	d63ff28f0f	namei: make should_follow_link() store the link in nd->link ... if it decides to follow, that is. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:13 -04:00
Al Viro	4693a547cd	namei: new calling conventions for walk_component() instead of a single flag (!= 0 => we want to follow symlinks) pass two bits - WALK_GET (want to follow symlinks) and WALK_PUT (put_link() once we are done looking at the name). The latter matters only for success exits - on failure the caller will discard everything anyway. Suggestions for better variant are welcome; what this thing aims for is making sure that pending put_link() is done before walk_component() decides to pick a symlink up, rather than between picking it up and acting upon it. See the next commit for payoff. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:12 -04:00
Al Viro	8620c238ed	link_path_walk: move the OK: inside the loop fewer labels that way; in particular, resuming after the end of nested symlink is straight-line. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:12 -04:00
Al Viro	1543972678	namei: have terminate_walk() do put_link() on everything left All callers of terminate_walk() are followed by more or less open-coded eqiuvalent of "do put_link() on everything left in nd->stack". Better done in terminate_walk() itself, and when we go for RCU symlink traversal we'll have to do it there anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:11 -04:00
Al Viro	191d7f73e2	namei: take put_link() into {lookup,mountpoint,do}_last() rationale: we'll need to have terminate_walk() do put_link() on everything, which will mean that in some cases ..._last() will do put_link() anyway. Easier to have them do it in all cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:11 -04:00
Al Viro	1bc4b813e8	namei: lift (open-coded) terminate_walk() into callers of get_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:10 -04:00
Al Viro	f0a9ba7021	lift terminate_walk() into callers of walk_component() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:10 -04:00
Al Viro	70291aecc6	namei: lift (open-coded) terminate_walk() in follow_dotdot_rcu() into callers follow_dotdot_rcu() does an equivalent of terminate_walk() on failure; shifting it into callers makes for simpler rules and those callers already have terminate_walk() on other failure exits. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:09 -04:00
Al Viro	e269f2a73f	namei: we never need more than MAXSYMLINKS entries in nd->stack The only reason why we needed one more was that purely nested MAXSYMLINKS symlinks could lead to path_init() using that many entries in addition to nd->stack[0] which it left unused. That can't happen now - path_init() starts with entry 0 (and trailing_symlink() is called only when we'd already encountered one symlink, so no more than MAXSYMLINKS-1 are left). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:08 -04:00
Al Viro	8eff733a45	link_path_walk: end of nd->depth massage get rid of orig_depth - we only use it on error exit to tell whether to stop doing put_link() when depth reaches 0 (call from path_init()) or when it reaches 1 (call from trailing_symlink()). However, in the latter case the caller would immediately follow with one more put_link(). Just keep doing it until the depth reaches zero (and simplify trailing_symlink() as the result). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:08 -04:00
Al Viro	939724df56	link_path_walk: nd->depth massage, part 10 Get rid of orig_depth checks in OK: logics. If nd->depth is zero, we had been called from path_init() and we are done. If it is greater than 1, we are not done, whether we'd been called from path_init() or trailing_symlink(). And in case when it's 1, we might have been called from path_init() and reached the end of nested symlink (in which case nd->stack[0].name will point to the rest of pathname and we are not done) or from trailing_symlink(), in which case we are done. Just have trailing_symlink() leave NULL in nd->stack[0].name and use that to discriminate between those cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:06 -04:00
Al Viro	dc7af8dc05	link_path_walk: nd->depth massage, part 9 Make link_path_walk() work with any value of nd->depth on entry - memorize it and use it in tests instead of comparing with 1. Don't bother with increment/decrement in path_init(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:06 -04:00
Al Viro	21c3003d36	put_link: nd->depth massage, part 8 all calls are preceded by decrement of nd->depth; move it into put_link() itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:05 -04:00
Al Viro	9ea57b72bf	trailing_symlink: nd->depth massage, part 7 move decrement of nd->depth on successful returns into the callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:05 -04:00
Al Viro	0fd889d59e	get_link: nd->depth massage, part 6 make get_link() increment nd->depth on successful exit Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:04 -04:00
Al Viro	f7df08ee05	trailing_symlink: nd->depth massage, part 5 move increment of ->depth to the point where we'd discovered that get_link() has not returned an error, adjust exits accordingly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:04 -04:00
Al Viro	ef1a3e7b96	link_path_walk: nd->depth massage, part 4 lift increment/decrement into link_path_walk() callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:03 -04:00
Al Viro	da4e0be04d	link_path_walk: nd->depth massage, part 3 remove decrement/increment surrounding nd_alloc_stack(), adjust the test in it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:03 -04:00
Al Viro	fd4620bbdf	link_path_walk: nd->depth massage, part 2 collapse adjacent increment/decrement pairs. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:02 -04:00
Al Viro	071bf50137	link_path_walk: nd->depth massage, part 1 nd->stack[0] is unused until the handling of trailing symlinks and we want to get rid of that. Having fucked that transformation up several times, I went for bloody pedantic series of provably equivalent transformations. Sorry. Step 1: keep nd->depth higher by one in link_path_walk() - increment upon entry, decrement on exits, adjust the arithmetics inside and surround the calls of functions that care about nd->depth value (nd_alloc_stack(), get_link(), put_link()) with decrement/increment pairs. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:02 -04:00
Al Viro	894bc8c466	namei: remove restrictions on nesting depth The only restriction is that on the total amount of symlinks crossed; how they are nested does not matter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:01 -04:00
Al Viro	3b2e7f7539	namei: trim the arguments of get_link() same story as the previous commit Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:01 -04:00
Al Viro	b9ff44293c	namei: trim redundant arguments of fs/namei.c:put_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:00 -04:00
Al Viro	1d8e03d359	namei: trim redundant arguments of trailing_symlink() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:20:00 -04:00
Al Viro	697fc6ca66	namei: move link/cookie pairs into nameidata Array of MAX_NESTED_LINKS + 1 elements put into nameidata; what used to be a local array in link_path_walk() occupies entries 1 .. MAX_NESTED_LINKS in it, link and cookie from the trailing symlink handling loops - entry 0. This is _not_ the final arrangement; just an easily verified incremental step. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:59 -04:00
Al Viro	9e18f10a30	link_path_walk: cleanup - turn goto start; into continue; Deal with skipping leading slashes before what used to be the recursive call. That way we can get rid of that goto completely. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:59 -04:00
Al Viro	07681481b8	link_path_walk: split "return from recursive call" path Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:58 -04:00
Al Viro	32cd74685c	link_path_walk: kill the recursion absolutely straightforward now - the only variables we need to preserve across the recursive call are name, link and cookie, and recursion depth is limited (and can is equal to nd->depth). So arrange an array of triples to hold instances of those and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:58 -04:00
Al Viro	bdf6cbf179	link_path_walk: final preparations to killing recursion reduce the number of returns in there - turn all places where it returns zero into goto OK and places where it returns non-zero into goto Err. The only non-trivial detail is that all breaks in the loop are guaranteed to be with non-zero err. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:57 -04:00
Al Viro	bb8603f8e1	link_path_walk: get rid of duplication What we do after the second walk_component() + put_link() + depth decrement in there is exactly equivalent to what's done right after the first walk_component(). Easy to verify and not at all surprising, seeing that there we have just walked the last component of nested symlink. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:57 -04:00
Al Viro	48c8b0c571	link_path_walk: massage a bit more Pull the block after the if-else in the end of what used to be do-while body into all branches there. We are almost done with the massage... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:56 -04:00
Al Viro	d40bcc09ab	link_path_walk: turn inner loop into explicit goto Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:56 -04:00
Al Viro	12b0957800	link_path_walk: don't bother with walk_component() after jumping link ... it does nothing if nd->last_type is LAST_BIND. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:55 -04:00
Al Viro	b0c24c3bdf	link_path_walk: handle get_link() returning ERR_PTR() immediately If we get ERR_PTR() from get_link(), we are guaranteed to get err != 0 when we break out of do-while, so we are going to hit if (err) return err; shortly after it. Pull that into the if (IS_ERR(s)) body. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:55 -04:00
Al Viro	95fa25d9f2	namei: rename follow_link to trailing_symlink, move it down Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:54 -04:00
Al Viro	21fef2176e	namei: move the calls of may_follow_link() into follow_link() All remaining callers of the former are preceded by the latter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:53 -04:00
Al Viro	172a39a059	namei: expand the call of follow_link() in link_path_walk() ... and strip __always_inline from follow_link() - remaining callers don't need that. Now link_path_walk() recursion is a direct one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:53 -04:00
Al Viro	5a460275ef	namei: expand nested_symlink() in its only caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:52 -04:00
Al Viro	896475d5bd	do_last: move path there from caller's stack frame We used to need it to feed to follow_link(). No more... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:52 -04:00
Al Viro	caa8563443	namei: introduce nameidata->link shares space with nameidata->next, walk_component() et.al. store the struct path of symlink instead of returning it into a variable passed by caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:51 -04:00
Al Viro	d4dee48bad	namei: don't bother with ->follow_link() if ->i_link is set with new calling conventions it's trivial Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Conflicts: fs/namei.c	2015-05-10 22:19:51 -04:00
Al Viro	0a959df54b	namei.c: separate the parts of follow_link() that find the link body Split a piece of fs/namei.c:follow_link() that does obtaining the link body into a separate function. follow_link() itself is converted to calling get_link() and then doing the body traversal (if any). The next step will expand follow_link() call in link_path_walk() and this helps to keep the size down... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:50 -04:00
Al Viro	680baacbca	new ->follow_link() and ->put_link() calling conventions a) instead of storing the symlink body (via nd_set_link()) and returning an opaque pointer later passed to ->put_link(), ->follow_link() _stores_ that opaque pointer (into void * passed by address by caller) and returns the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic symlinks) and pointer to symlink body for normal symlinks. Stored pointer is ignored in all cases except the last one. Storing NULL for opaque pointer (or not storing it at all) means no call of ->put_link(). b) the body used to be passed to ->put_link() implicitly (via nameidata). Now only the opaque pointer is. In the cases when we used the symlink body to free stuff, ->follow_link() now should store it as opaque pointer in addition to returning it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:19:45 -04:00
Al Viro	46afd6f61c	namei: lift nameidata into filename_mountpoint() when we go for on-demand allocation of saved state in link_path_walk(), we'll want nameidata to stay around for all 3 calls of path_mountpoint(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:18:33 -04:00
Al Viro	f5beed755b	name: shift nameidata down into user_path_walk() that avoids having nameidata on stack during the calls of ->rmdir()/->unlink() and two of those during the calls of ->rename(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:18:32 -04:00
Al Viro	6a9f40d610	namei: get rid of lookup_hash() it's a convenient helper, but we'll want to shift nameidata down the call chain, so it won't be available there... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:18:32 -04:00
Al Viro	a5cfe2d5e1	do_last: regularize the logics around following symlinks With LOOKUP_FOLLOW we unlazy and return 1; without it we either fail with ELOOP or, for O_PATH opens, succeed. No need to mix those cases... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:18:31 -04:00
Al Viro	fd2805be23	do_last: kill symlink_ok When O_PATH is present, O_CREAT isn't, so symlink_ok is always equal to (open_flags & O_PATH) && !(nd->flags & LOOKUP_FOLLOW). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:18:30 -04:00
Al Viro	f488443d1d	namei: take O_NOFOLLOW treatment into do_last() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:18:30 -04:00
Al Viro	34b128f31c	uninline walk_component() seriously improves the stack and I-cache footprint... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:18:29 -04:00
NeilBrown	37882db054	SECURITY: remove nameidata arg from inode_follow_link. No ->inode_follow_link() methods use the nameidata arg, and it is about to become private to namei.c. So remove from all inode_follow_link() functions. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-10 22:18:29 -04:00
Al Viro	f15133df08	path_openat(): fix double fput() path_openat() jumps to the wrong place after do_tmpfile() - it has already done path_cleanup() (as part of path_lookupat() called by do_tmpfile()), so doing that again can lead to double fput(). Cc: stable@vger.kernel.org # v3.11+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-09 00:12:48 -04:00
Al Viro	766c4cbfac	namei: d_is_negative() should be checked before ->d_seq validation Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to be true does not mean that inode we'd fetched had been NULL - that holds only while ->d_seq is still unchanged. Shift d_is_negative() checks into lookup_fast() prior to ->d_seq verification. Reported-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-09 00:12:35 -04:00
Al Viro	3cab989afd	RCU pathwalk breakage when running into a symlink overmounting something Calling unlazy_walk() in walk_component() and do_last() when we find a symlink that needs to be followed doesn't acquire a reference to vfsmount. That's fine when the symlink is on the same vfsmount as the parent directory (which is almost always the case), but it's not always true - one _can_ manage to bind a symlink on top of something. And in such cases we end up with excessive mntput(). Cc: stable@vger.kernel.org # since 2.6.39 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-04-24 15:52:14 -04:00
David Howells	4bbcbd3b11	VFS: Make pathwalk use d_is_reg() rather than S_ISREG() Make pathwalk use d_is_reg() rather than S_ISREG() to determine whether to honour O_TRUNC. Since this occurs after complete_walk(), the dentry type field cannot change and the inode pointer cannot change as we hold a ref on the dentry, so this should be safe. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-04-15 15:05:30 -04:00
David Howells	698934df8b	VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk Where we have: if (!dentry->d_inode \|\| d_is_negative(dentry)) { type constructions in pathwalk we should be able to eliminate the check of d_inode and rely solely on the result of d_is_negative() or d_is_positive(). What we do have to take care to do is to read d_inode after calling a d_is_xxx() typecheck function to get the barriering right. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-04-15 15:05:29 -04:00
Al Viro	9e7543e939	remove incorrect comment in lookup_one_len() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-04-11 22:24:30 -04:00
Al Viro	74eb8cc5a5	namei.c: fold do_path_lookup() into both callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-04-11 22:24:30 -04:00
Al Viro	fd2f7cb5bc	kill struct filename.separate just make const char iname[] the last member and compare name->name with name->iname instead of checking name->separate We need to make sure that out-of-line name doesn't end up allocated adjacent to struct filename refering to it; fortunately, it's easy to achieve - just allocate that struct filename with one byte in ->iname[], so that ->iname[0] will be inside the same object and thus have an address different from that of out-of-line name [spotted by Boqun Feng <boqun.feng@gmail.com>] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-04-11 22:21:24 -04:00
Al Viro	6e8a1f8741	switch path_init() to struct filename Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-03-24 17:19:16 -04:00
Al Viro	668696dcbb	switch path_mountpoint() to struct filename Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-03-24 17:19:15 -04:00
Al Viro	5eb6b495c6	switch path_lookupat() to struct filename all callers were passing it ->name of some struct filename Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-03-24 17:19:15 -04:00
Al Viro	94b5d2621a	getname_flags(): clean up a bit Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-03-24 17:19:14 -04:00
David Howells	e36cb0b89c	VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_(dentry) Convert the following where appropriate: (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry). (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry). (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry). This is actually more complicated than it appears as some calls should be converted to d_can_lookup() instead. The difference is whether the directory in question is a real dir with a ->lookup op or whether it's a fake dir with a ->d_automount op. In some circumstances, we can subsume checks for dentry->d_inode not being NULL into this, provided we the code isn't in a filesystem that expects d_inode to be NULL if the dirent really is* negative (ie. if we're going to use d_inode() rather than d_backing_inode() to get the inode pointer). Note that the dentry type field may be set to something other than DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS manages the fall-through from a negative dentry to a lower layer. In such a case, the dentry type of the negative union dentry is set to the same as the type of the lower dentry. However, if you know d_inode is not NULL at the call site, then you can use the d_is_xxx() functions even in a filesystem. There is one further complication: a 0,0 chardev dentry may be labelled DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was intended for special directory entry types that don't have attached inodes. The following perl+coccinelle script was used: use strict; my @callers; open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' \|') \|\| die "Can't grep for S_ISDIR and co. callers"; @callers = <$fd>; close($fd); unless (@callers) { print "No matches\n"; exit(0); } my @cocci = ( '@@', 'expression E;', '@@', '', '- S_ISLNK(E->d_inode->i_mode)', '+ d_is_symlink(E)', '', '@@', 'expression E;', '@@', '', '- S_ISDIR(E->d_inode->i_mode)', '+ d_is_dir(E)', '', '@@', 'expression E;', '@@', '', '- S_ISREG(E->d_inode->i_mode)', '+ d_is_reg(E)' ); my $coccifile = "tmp.sp.cocci"; open($fd, ">$coccifile") \|\| die $coccifile; print($fd "$_\n") \|\| die $coccifile foreach (@cocci); close($fd); foreach my $file (@callers) { chomp $file; print "Processing ", $file, "\n"; system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 \|\| die "spatch failed"; } [AV: overlayfs parts skipped] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-22 11:38:41 -05:00
Paul Moore	55422d0bd2	audit: replace getname()/putname() hacks with reference counters In order to ensure that filenames are not released before the audit subsystem is done with the strings there are a number of hacks built into the fs and audit subsystems around getname() and putname(). To say these hacks are "ugly" would be kind. This patch removes the filename hackery in favor of a more conventional reference count based approach. The diffstat below tells most of the story; lots of audit/fs specific code is replaced with a traditional reference count based approach that is easily understood, even by those not familiar with the audit and/or fs subsystems. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:23:58 -05:00
Paul Moore	fd3522fdc8	audit: enable filename recording via getname_kernel() Enable recording of filenames in getname_kernel() and remove the kludgy workaround in __audit_inode() now that we have proper filename logging for kernel users. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:23:52 -05:00
Al Viro	cbaab2db91	simpler calling conventions for filename_mountpoint() a) make it accept ERR_PTR() as filename (and return its PTR_ERR() in that case) b) make it putname() the sucker in the end otherwise simplifies life for callers... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:22:21 -05:00
Paul Moore	5168910413	fs: create proper filename objects using getname_kernel() There are several areas in the kernel that create temporary filename objects using the following pattern: int func(const char name) { struct filename file = { .name = name }; ... return 0; } ... which for the most part works okay, but it causes havoc within the audit subsystem as the filename object does not persist beyond the lifetime of the function. This patch converts all of these temporary filename objects into proper filename objects using getname_kernel() and putname() which ensure that the filename object persists until the audit subsystem is finished with it. Also, a special thanks to Al Viro, Guenter Roeck, and Sabrina Dubroca for helping resolve a difficult kernel panic on boot related to a use-after-free problem in kern_path_create(); the thread can be seen at the link below: * https://lkml.org/lkml/2015/1/20/710 This patch includes code that was either based on, or directly written by Al in the above thread. CC: viro@zeniv.linux.org.uk CC: linux@roeck-us.net CC: sd@queasysnail.net CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:22:20 -05:00
Paul Moore	0851854972	fs: rework getname_kernel to handle up to PATH_MAX sized filenames In preparation for expanded use in the kernel, make getname_kernel() more useful by allowing it to handle any legal filename length. Thanks to Guenter Roeck for his suggestion to substitute memcpy() for strlcpy(). CC: linux@roeck-us.net CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:22:20 -05:00
Al Viro	fa14a0b8d2	cut down the number of do_path_lookup() callers ... and don't bother with new struct filename when we already have one Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:22:19 -05:00
Linus Torvalds	603ba7e41b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile #2 from Al Viro: "Next pile (and there'll be one or two more). The large piece in this one is getting rid of /proc//ns/ weirdness; among other things, it allows to (finally) make nameidata completely opaque outside of fs/namei.c, making for easier further cleanups in there" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: coda_venus_readdir(): use file_inode() fs/namei.c: fold link_path_walk() call into path_init() path_init(): don't bother with LOOKUP_PARENT in argument fs/namei.c: new helper (path_cleanup()) path_init(): store the "base" pointer to file in nameidata itself make default ->i_fop have ->open() fail with ENXIO make nameidata completely opaque outside of fs/namei.c kill proc_ns completely take the targets of /proc//ns/ symlinks to separate fs bury struct proc_ns in fs/proc copy address of proc_ns_ops into ns_common new helpers: ns_alloc_inum/ns_free_inum make proc_ns_operations work with struct ns_common * instead of void * switch the rest of proc_ns_operations to working with &...->ns netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns common object embedded into various struct ....ns	2014-12-16 15:53:03 -08:00
David Drysdale	51f39a1f0c	syscalls: implement execveat() system call This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/<fd>" (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-12-13 12:42:51 -08:00
Al Viro	d465887f9d	fs/namei.c: fold link_path_walk() call into path_init() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-11 16:27:57 -05:00
Al Viro	980f3ea2f6	path_init(): don't bother with LOOKUP_PARENT in argument Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-11 16:27:57 -05:00
Al Viro	893b7775a7	fs/namei.c: new helper (path_cleanup()) All callers of path_init() proceed to do the identical cleanup when they are done with nameidata. Don't open-code it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-11 16:27:57 -05:00
Al Viro	5e53084d77	path_init(): store the "base" pointer to file in nameidata itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-11 16:27:57 -05:00
Al Viro	1f55a6ec94	make nameidata completely opaque outside of fs/namei.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-10 21:32:13 -05:00
Linus Torvalds	7e05b807b9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS fixes from Al Viro: "A bunch of assorted fixes, most of them followups to overlayfs merge" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ovl: initialize ->is_cursor Return short read or 0 at end of a raw device, not EIO isofs: don't bother with ->d_op for normal case isofs_cmp(): we'll never see a dentry for . or .. overlayfs: fix lockdep misannotation ovl: fix check for cursor overlayfs: barriers for opening upper-layer directory rcu: Provide counterpart to rcu_dereference() for non-RCU situations staging: android: logger: Fix log corruption regression	2014-11-02 10:28:43 -08:00
Eric Rannaud	69a91c237a	fs: allow open(dir, O_TMPFILE\|..., 0) with mode 0 The man page for open(2) indicates that when O_CREAT is specified, the 'mode' argument applies only to future accesses to the file: Note that this mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor. The man page for open(2) implies that 'mode' is treated identically by O_CREAT and O_TMPFILE. O_TMPFILE, however, behaves differently: int fd = open("/tmp", O_TMPFILE \| O_RDWR, 0); assert(fd == -1); assert(errno == EACCES); int fd = open("/tmp", O_TMPFILE \| O_RDWR, 0600); assert(fd > 0); For O_CREAT, do_last() sets acc_mode to MAY_OPEN only: if (opened & FILE_CREATED) { / Don't check for write permission, don't truncate */ open_flag &= ~O_TRUNC; will_truncate = false; acc_mode = MAY_OPEN; path_to_nameidata(path, nd); goto finish_open_created; } But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to may_open(). This patch lines up the behavior of O_TMPFILE with O_CREAT. After the inode is created, may_open() is called with acc_mode = MAY_OPEN, in do_tmpfile(). A different, but related glibc bug revealed the discrepancy: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 The glibc lazily loads the 'mode' argument of open() and openat() using va_arg() only if O_CREAT is present in 'flags' (to support both the 2 argument and the 3 argument forms of open; same idea for openat()). However, the glibc ignores the 'mode' argument if O_TMPFILE is in 'flags'. On x86_64, for open(), it magically works anyway, as 'mode' is in RDX when entering open(), and is still in RDX on SYSCALL, which is where the kernel looks for the 3rd argument of a syscall. But openat() is not quite so lucky: 'mode' is in RCX when entering the glibc wrapper for openat(), while the kernel looks for the 4th argument of a syscall in R10. Indeed, the syscall calling convention differs from the regular calling convention in this respect on x86_64. So the kernel sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and fails with EACCES. Signed-off-by: Eric Rannaud <e@nanocritical.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-30 15:50:13 -07:00
Miklos Szeredi	d1b72cc6d8	overlayfs: fix lockdep misannotation In an overlay directory that shadows an empty lower directory, say /mnt/a/empty102, do: touch /mnt/a/empty102/x unlink /mnt/a/empty102/x rmdir /mnt/a/empty102 It's actually harmless, but needs another level of nesting between I_MUTEX_CHILD and I_MUTEX_NORMAL. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-28 18:32:47 -04:00
Miklos Szeredi	0d7a855526	vfs: add RENAME_WHITEOUT This adds a new RENAME_WHITEOUT flag. This flag makes rename() create a whiteout of source. The whiteout creation is atomic relative to the rename. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:37 +02:00
Miklos Szeredi	787fb6bc96	vfs: add whiteout support Whiteout isn't actually a new file type, but is represented as a char device (Linus's idea) with 0/0 device number. This has several advantages compared to introducing a new whiteout file type: - no userspace API changes (e.g. trivial to make backups of upper layer filesystem, without losing whiteouts) - no fs image format changes (you can boot an old kernel/fsck without whiteout support and things won't break) - implementation is trivial Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:36 +02:00
Miklos Szeredi	cbdf35bcb8	vfs: export check_sticky() It's already duplicated in btrfs and about to be used in overlayfs too. Move the sticky bit check to an inline helper and call the out-of-line helper only in the unlikly case of the sticky bit being set. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:36 +02:00
Miklos Szeredi	bd5d08569c	vfs: export __inode_permission() to modules We need to be able to check inode permissions (but not filesystem implied permissions) for stackable filesystems. Expose this interface for overlayfs. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:35 +02:00
Miklos Szeredi	4aa7c6346b	vfs: add i_op->dentry_open() Add a new inode operation i_op->dentry_open(). This is for stacked filesystems that want to return a struct file from a different filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:35 +02:00
Linus Torvalds	77c688ac87	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The big thing in this pile is Eric's unmount-on-rmdir series; we finally have everything we need for that. The final piece of prereqs is delayed mntput() - now filesystem shutdown always happens on shallow stack. Other than that, we have several new primitives for iov_iter (Matt Wilcox, culled from his XIP-related series) pushing the conversion to ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c cleanups and fixes (including the external name refcounting, which gives consistent behaviour of d_move() wrt procfs symlinks for long and short names alike) and assorted cleanups and fixes all over the place. This is just the first pile; there's a lot of stuff from various people that ought to go in this window. Starting with unionmount/overlayfs mess... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits) fs/file_table.c: Update alloc_file() comment vfs: Deduplicate code shared by xattr system calls operating on paths reiserfs: remove pointless forward declaration of struct nameidata don't need that forward declaration of struct nameidata in dcache.h anymore take dname_external() into fs/dcache.c let path_init() failures treated the same way as subsequent link_path_walk() fix misuses of f_count() in ppp and netlink ncpfs: use list_for_each_entry() for d_subdirs walk vfs: move getname() from callers to do_mount() gfs2_atomic_open(): skip lookups on hashed dentry [infiniband] remove pointless assignments gadgetfs: saner API for gadgetfs_create_file() f_fs: saner API for ffs_sb_create_file() jfs: don't hash direct inode [s390] remove pointless assignment of ->f_op in vmlogrdr ->open() ecryptfs: ->f_op is never NULL android: ->f_op is never NULL nouveau: __iomem misannotations missing annotation in fs/file.c fs: namespace: suppress 'may be used uninitialized' warnings ...	2014-10-13 11:28:42 +02:00
Al Viro	115cbfdc60	let path_init() failures treated the same way as subsequent link_path_walk() As it is, path_lookupat() and path_mounpoint() might end up leaking struct file reference in some cases. Spotted-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-12 17:09:04 -04:00
Linus Torvalds	5e40d331bd	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris. Mostly ima, selinux, smack and key handling updates. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits) integrity: do zero padding of the key id KEYS: output last portion of fingerprint in /proc/keys KEYS: strip 'id:' from ca_keyid KEYS: use swapped SKID for performing partial matching KEYS: Restore partial ID matching functionality for asymmetric keys X.509: If available, use the raw subjKeyId to form the key description KEYS: handle error code encoded in pointer selinux: normalize audit log formatting selinux: cleanup error reporting in selinux_nlmsg_perm() KEYS: Check hex2bin()'s return when generating an asymmetric key ID ima: detect violations for mmaped files ima: fix race condition on ima_rdwr_violation_check and process_measurement ima: added ima_policy_flag variable ima: return an error code from ima_add_boot_aggregate() ima: provide 'ima_appraise=log' kernel option ima: move keyring initialization to ima_init() PKCS#7: Handle PKCS#7 messages that contain no X.509 certs PKCS#7: Better handling of unsupported crypto KEYS: Overhaul key identification when searching for asymmetric keys KEYS: Implement binary asymmetric key ID handling ...	2014-10-12 10:13:55 -04:00
Eric W. Biederman	5542aa2fa7	vfs: Make d_invalidate return void Now that d_invalidate can no longer fail, stop returning a useless return code. For the few callers that checked the return code update remove the handling of d_invalidate failure. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-09 02:38:57 -04:00
Eric W. Biederman	8ed936b567	vfs: Lazily remove mounts on unlinked files and directories. With the introduction of mount namespaces and bind mounts it became possible to access files and directories that on some paths are mount points but are not mount points on other paths. It is very confusing when rm -rf somedir returns -EBUSY simply because somedir is mounted somewhere else. With the addition of user namespaces allowing unprivileged mounts this condition has gone from annoying to allowing a DOS attack on other users in the system. The possibility for mischief is removed by updating the vfs to support rename, unlink and rmdir on a dentry that is a mountpoint and by lazily unmounting mountpoints on deleted dentries. In particular this change allows rename, unlink and rmdir system calls on a dentry without a mountpoint in the current mount namespace to succeed, and it allows rename, unlink, and rmdir performed on a distributed filesystem to update the vfs cache even if when there is a mount in some namespace on the original dentry. There are two common patterns of maintaining mounts: Mounts on trusted paths with the parent directory of the mount point and all ancestory directories up to / owned by root and modifiable only by root (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu, cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories maintained by fusermount. In the case of mounts in trusted directories owned by root and modifiable only by root the current parent directory permissions are sufficient to ensure a mount point on a trusted path is not removed or renamed by anyone other than root, even if there is a context where the there are no mount points to prevent this. In the case of mounts in directories owned by less privileged users races with users modifying the path of a mount point are already a danger. fusermount already uses a combination of chdir, /proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The removable of global rename, unlink, and rmdir protection really adds nothing new to consider only a widening of the attack window, and fusermount is already safe against unprivileged users modifying the directory simultaneously. In principle for perfect userspace programs returning -EBUSY for unlink, rmdir, and rename of dentires that have mounts in the local namespace is actually unnecessary. Unfortunately not all userspace programs are perfect so retaining -EBUSY for unlink, rmdir and rename of dentries that have mounts in the current mount namespace plays an important role of maintaining consistency with historical behavior and making imperfect userspace applications hard to exploit. v2: Remove spurious old_dentry. v3: Optimized shrink_submounts_and_drop Removed unsued afs label v4: Simplified the changes to check_submounts_and_drop Do not rename check_submounts_and_drop shrink_submounts_and_drop Document what why we need atomicity in check_submounts_and_drop Rely on the parent inode mutex to make d_revalidate and d_invalidate an atomic unit. v5: Refcount the mountpoint to detach in case of simultaneous renames. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-09 02:38:56 -04:00
Eric W. Biederman	7af1364ffa	vfs: Don't allow overwriting mounts in the current mount namespace In preparation for allowing mountpoints to be renamed and unlinked in remote filesystems and in other mount namespaces test if on a dentry there is a mount in the local mount namespace before allowing it to be renamed or unlinked. The primary motivation here are old versions of fusermount unmount which is not safe if the a path can be renamed or unlinked while it is verifying the mount is safe to unmount. More recent versions are simpler and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount in a directory owned by an arbitrary user. Miklos Szeredi <miklos@szeredi.hu> reports this is approach is good enough to remove concerns about new kernels mixed with old versions of fusermount. A secondary motivation for restrictions here is that it removing empty directories that have non-empty mount points on them appears to violate the rule that rmdir can not remove empty directories. As Linus Torvalds pointed out this is useful for programs (like git) that test if a directory is empty with rmdir. Therefore this patch arranges to enforce the existing mount point semantics for local mount namespace. v2: Rewrote the test to be a drop in replacement for d_mountpoint v3: Use bool instead of int as the return type of is_local_mountpoint Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-09 02:38:54 -04:00
James Morris	6c8ff877cd	Merge commit 'v3.16' into next	2014-10-01 00:44:04 +10:00
James Hogan	a060dc5010	vfs: workaround gcc <4.6 build error in link_path_walk() Commit `d6bb3e9075` ("vfs: simplify and shrink stack frame of link_path_walk()") introduced build problems with GCC versions older than 4.6 due to the initialisation of a member of an anonymous union in struct qstr without enclosing braces. This hits GCC bug 10676 [1] (which was fixed in GCC 4.6 by [2]), and causes the following build error: fs/namei.c: In function 'link_path_walk': fs/namei.c:1778: error: unknown field 'hash_len' specified in initializer This is worked around by adding explicit braces. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676 [2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=159206 Fixes: `d6bb3e9075` (vfs: simplify and shrink stack frame of link_path_walk()) Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-metag@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-09-16 07:44:54 -07:00
Linus Torvalds	d6bb3e9075	vfs: simplify and shrink stack frame of link_path_walk() Commit `9226b5b440` ("vfs: avoid non-forwarding large load after small store in path lookup") made link_path_walk() always access the "hash_len" field as a single 64-bit entity, in order to avoid mixed size accesses to the members. However, what I didn't notice was that that effectively means that the whole "struct qstr this" is now basically redundant. We already explicitly track the "const char *name", and if we just use "u64 hash_len" instead of "long len", there is nothing else left of the "struct qstr". We do end up wanting the "struct qstr" if we have a filesystem with a "d_hash()" function, but that's a rare case, and we might as well then just squirrell away the name and hash_len at that point. End result: fewer live variables in the loop, a smaller stack frame, and better code generation. And we don't need to pass in pointers variables to helper functions any more, because the return value contains all the relevant information. So this removes more lines than it adds, and the source code is clearer too. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-09-15 10:51:07 -07:00
Linus Torvalds	83373f7028	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "double iput() on failure exit in lustre, racy removal of spliced dentries from ->s_anon in __d_materialise_dentry() plus a bunch of assorted RCU pathwalk fixes" The RCU pathwalk fixes end up fixing a couple of cases where we incorrectly dropped out of RCU walking, due to incorrect initialization and testing of the sequence locks in some corner cases. Since dropping out of RCU walk mode forces the slow locked accesses, those corner cases slowed down quite dramatically. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: be careful with nd->inode in path_init() and follow_dotdot_rcu() don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() fix bogus read_seqretry() checks introduced in `b37199e` move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) [fix] lustre: d_make_root() does iput() on dentry allocation failure	2014-09-14 17:37:36 -07:00
Linus Torvalds	9226b5b440	vfs: avoid non-forwarding large load after small store in path lookup The performance regression that Josef Bacik reported in the pathname lookup (see commit `99d263d4c5` "vfs: fix bad hashing of dentries") made me look at performance stability of the dcache code, just to verify that the problem was actually fixed. That turned up a few other problems in this area. There are a few cases where we exit RCU lookup mode and go to the slow serializing case when we shouldn't, Al has fixed those and they'll come in with the next VFS pull. But my performance verification also shows that link_path_walk() turns out to have a very unfortunate 32-bit store of the length and hash of the name we look up, followed by a 64-bit read of the combined hash_len field. That screws up the processor store to load forwarding, causing an unnecessary hickup in this critical routine. It's caused by the ugly calling convention for the "hash_name()" function, and easily fixed by just making hash_name() fill in the whole 'struct qstr' rather than passing it a pointer to just the hash value. With that, the profile for this function looks much smoother. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-09-14 17:28:32 -07:00
Al Viro	4023bfc9f3	be careful with nd->inode in path_init() and follow_dotdot_rcu() in the former we simply check if dentry is still valid after picking its ->d_inode; in the latter we fetch ->d_inode in the same places where we fetch dentry and its ->d_seq, under the same checks. Cc: stable@vger.kernel.org # 2.6.38+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-09-14 14:24:47 -04:00
Al Viro	7bd88377d4	don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() return the value instead, and have path_init() do the assignment. Broken by "vfs: Fix absolute RCU path walk failures due to uninitialized seq number", which was Cc-stable with 2.6.38+ as destination. This one should go where it went. To avoid dummy value returned in case when root is already set (it would do no harm, actually, since the only caller that doesn't ignore the return value is guaranteed to have nd->root not set, but it's more obvious that way), lift the check into callers. And do the same to set_root(), to keep them in sync. Cc: stable@vger.kernel.org # 2.6.38+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-09-14 14:19:44 -04:00
Al Viro	f5be3e2912	fix bogus read_seqretry() checks introduced in `b37199e` read_seqretry() returns true on mismatch, not on match... Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-09-13 22:14:16 -04:00
Linus Torvalds	99d263d4c5	vfs: fix bad hashing of dentries Josef Bacik found a performance regression between 3.2 and 3.10 and narrowed it down to commit `bfcfaa77bd` ("vfs: use 'unsigned long' accesses for dcache name comparison and hashing"). He reports: "The test case is essentially for (i = 0; i < 1000000; i++) mkdir("a$i"); On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k dir/sec with 3.10. This is because we spend waaaaay more time in __d_lookup on 3.10 than in 3.2. The new hashing function for strings is suboptimal for < sizeof(unsigned long) string names (and hell even > sizeof(unsigned long) string names that I've tested). I broke out the old hashing function and the new one into a userspace helper to get real numbers and this is what I'm getting: Old hash table had 1000000 entries, 0 dupes, 0 max dupes New hash table had 12628 entries, 987372 dupes, 900 max dupes We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash My test does the hash, and then does the d_hash into a integer pointer array the same size as the dentry hash table on my system, and then just increments the value at the address we got to see how many entries we overlap with. As you can see the old hash function ended up with all 1 million entries in their own bucket, whereas the new one they are only distributed among ~12.5k buckets, which is why we're using so much more CPU in __d_lookup". The reason for this hash regression is two-fold: - On 64-bit architectures the down-mixing of the original 64-bit word-at-a-time hash into the final 32-bit hash value is very simplistic and suboptimal, and just adds the two 32-bit parts together. In particular, because there is no bit shuffling and the mixing boundary is also a byte boundary, similar character patterns in the low and high word easily end up just canceling each other out. - the old byte-at-a-time hash mixed each byte into the final hash as it hashed the path component name, resulting in the low bits of the hash generally being a good source of hash data. That is not true for the word-at-a-time case, and the hash data is distributed among all the bits. The fix is the same in both cases: do a better job of mixing the bits up and using as much of the hash data as possible. We already have the "hash_32\|64()" functions to do that. Reported-by: Josef Bacik <jbacik@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-09-13 11:30:10 -07:00
Dmitry Kasatkin	3034a14682	ima: pass 'opened' flag to identify newly created files Empty files and missing xattrs do not guarantee that a file was just created. This patch passes FILE_CREATED flag to IMA to reliably identify new files. Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> 3.14+	2014-09-09 10:28:43 -04:00

... 2 3 4 5 6 ...

1015 Commits